Data Efficacy for Language Model Training
[AUTHORS]
Yalun Dai, Yangyu Huang, Xin Zhang, Wenshan Wu, Chong Li, Wenhui Lu, Shijie Cao, Li Dong, Scarlett Li
[ABSTRACT]
Data is fundamental to the training of language models (LMs). Recent research
has been dedicated to data efficiency, which aims to maximize performance by
selecting a minimal or optimal subset of training data. Techniques such as data
filtering, sampling, and selection play a crucial role in this area. To
complement it, we define Data Efficacy, which focuses on maximizing performance
by optimizing the organization of training data and remains relatively
underexplored. This work introduces a general paradigm, DELT, for considering
data efficacy in LM training, which highlights the significance of training
data organization. DELT comprises three components: Data Scoring, Data
Selection, and Data Ordering. Among these components, we design
Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which
considers both the learnability and quality of each data sample from the
gradient consistency perspective. We also devise Folding Ordering (FO), as a
novel instance of Data Ordering, which addresses issues such as model
forgetting and data distribution bias. Comprehensive experiments validate the
data efficacy in LM training, which demonstrates the following: Firstly,
various instances of the proposed DELT enhance LM performance to varying
degrees without increasing the data scale and model size. Secondly, among these
instances, the combination of our proposed LQS for data scoring and Folding for
data ordering achieves the most significant improvement. Lastly, data efficacy
can be achieved together with data efficiency by applying data selection.
Therefore, we believe that data efficacy is a promising foundational area in LM
training.
[LINK]
http://arxiv.org/abs/2506.21545v1
[DATE]
2025-06-27 01:59:07+08:00
[CATEGORIES]
cs.CL
“What’s Up, Doc?”: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets
[AUTHORS]
Akshay Paruchuri, Maryam Aziz, Rohit Vartak, Ayman Ali, Best Uchehara, Xin Liu, Ishan Chatterjee, Monica Agrawal
[ABSTRACT]
People are increasingly seeking healthcare information from large language
models (LLMs) via interactive chatbots, yet the nature and inherent risks of
these conversations remain largely unexplored. In this paper, we filter
large-scale conversational AI datasets to construct HealthChat-11K, a curated
dataset of 11K real-world conversations composed of 25K user messages. We use
HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs
when seeking healthcare information in order to systematically study user
interactions across 21 distinct health specialties. Our analysis reveals
insights into the nature of how and why users seek health information, such as
common interactions, instances of incomplete context, affective behaviors, and
interactions (e.g., leading questions) that can induce sycophancy, underscoring
the need for improvements in the healthcare support capabilities of LLMs
deployed as conversational AI. Code and artifacts to retrieve our analyses and
combine them into a curated dataset can be found here:
https://github.com/yahskapar/HealthChat
[COMMENTS]
25 pages, 6 figures, 4 tables, corresponds to initial HealthChat-11K
dataset release
[LINK]
http://arxiv.org/abs/2506.21532v1
[DATE]
2025-06-27 01:52:18+08:00
[CATEGORIES]
cs.CL
OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
[AUTHORS]
Chester Palen-Michel, Maxwell Pickering, Maya Kruse, Jonne Sälevä, Constantine Lignos
[ABSTRACT]
We present OpenNER 1.0, a standardized collection of openly-available named
entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52
languages, human-annotated in varying named entity ontologies. We correct
annotation format issues, standardize the original datasets into a uniform
representation with consistent entity type names across corpora, and provide
the collection in a structure that enables research in multilingual and
multi-ontology NER. We provide baseline results using three pretrained
multilingual language models and two large language models to compare the
performance of recent models and facilitate future research in NER. We find
that no single model is best in all languages and that significant work remains
to obtain high performance from LLMs on the NER task.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2412.09587v2
[DATE]
2025-06-27 01:51:40+08:00
[CATEGORIES]
cs.CL
skLEP: A Slovak General Language Understanding Benchmark
[AUTHORS]
Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko
[COMMENTS]
ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2506.21508v1
[DATE]
2025-06-27 01:35:04+08:00
[CATEGORIES]
cs.CL
cs.LG
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
[AUTHORS]
Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
[ABSTRACT]
Agentic search such as Deep Research systems, where large language models
autonomously browse the web, synthesize information, and return comprehensive
citation-backed answers, represents a major shift in how users interact with
web-scale information. While promising greater efficiency and cognitive
offloading, the growing complexity and open-endedness of agentic search have
outpaced existing evaluation benchmarks and methodologies, which largely assume
short search horizons and static answers. In this paper, we introduce Mind2Web
2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that
require real-time web browsing and extensive information synthesis, constructed
with over 1,000 hours of human labor. To address the challenge of evaluating
time-varying and complex answers, we propose a novel Agent-as-a-Judge
framework. Our method constructs task-specific judge agents based on a
tree-structured rubric design to automatically assess both answer correctness
and source attribution. We conduct a comprehensive evaluation of nine frontier
agentic search systems and human performance, along with a detailed error
analysis to draw insights for future development. The best-performing system,
OpenAI Deep Research, can already achieve 50-70% of human performance while
spending half the time, showing great potential. Altogether, Mind2Web 2
provides a rigorous foundation for developing and benchmarking the next
generation of agentic search systems.
[COMMENTS]
Project Homepage: https://osu-nlp-group.github.io/Mind2Web2/
[LINK]
http://arxiv.org/abs/2506.21506v1
[DATE]
2025-06-27 01:32:50+08:00
[CATEGORIES]
cs.CL
Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments
[AUTHORS]
Jiashuo Wang, Kaitao Song, Chunpu Xu, Changhe Song, Yang Xiao, Dongsheng Li, Lili Qiu, Wenjie Li
[ABSTRACT]
Enhancing user engagement through interactions plays an essential role in
socially-driven dialogues. While prior works have optimized models to reason
over relevant knowledge or plan a dialogue act flow, the relationship between
user engagement and knowledge or dialogue acts is subtle and does not guarantee
user engagement in socially-driven dialogues. To this end, we enable
interactive LLMs to learn user engagement by leveraging signals from the future
development of conversations. Specifically, we adopt a more direct and relevant
indicator of user engagement, i.e., the user’s reaction related to dialogue
intention after the interaction, as a reward to align interactive LLMs. To
achieve this, we develop a user simulator to interact with target interactive
LLMs and explore interactions between the user and the interactive LLM system
via i×MCTS (Monte Carlo Tree Search for interaction). In this way, we collect a
dataset containing pairs of higher- and lower-quality experiences using i×MCTS,
and align interactive LLMs for high-level user
engagement by direct preference optimization (DPO) accordingly. Experiments
conducted on two socially-driven dialogue scenarios (emotional support
conversations and persuasion for good) demonstrate that our method effectively
enhances user engagement in interactive LLMs.
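
As a rough illustration of the preference-alignment step mentioned above, the
standard DPO objective for a single (preferred, dispreferred) pair can be
sketched as follows. This is the generic DPO loss, not the authors' training
code; the function name, argument names, and the default beta are assumptions
chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one preference pair (illustrative sketch).

    Each *_logp argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), evaluated in a numerically
    # stable way for both signs of x.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss equals log 2 when the policy and the reference agree, and falls below
that value as the policy raises the preferred response relative to the
dispreferred one.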
[LINK]
http://arxiv.org/abs/2506.21497v1
[DATE]
2025-06-27 01:26:17+08:00
[CATEGORIES]
cs.CL
Bridging Offline and Online Reinforcement Learning for LLMs
[AUTHORS]
Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov
[ABSTRACT]
We investigate the effectiveness of reinforcement learning methods for
finetuning large language models when transitioning from offline to semi-online
to fully online regimes for both verifiable and non-verifiable tasks. Our
experiments cover training on verifiable math as well as non-verifiable
instruction following with a set of benchmark evaluations for both. Across
these settings, we extensively compare online and semi-online Direct Preference
Optimization and Group Relative Policy Optimization objectives, and surprisingly
find similar performance and convergence between these variants, which all
strongly outperform offline methods. We provide a detailed analysis of the
training dynamics and hyperparameter selection strategies to achieve optimal
results. Finally, we show that multi-tasking with verifiable and non-verifiable
rewards jointly yields improved performance across both task types.
[LINK]
http://arxiv.org/abs/2506.21495v1
[DATE]
2025-06-27 01:25:49+08:00
[CATEGORIES]
cs.CL
Prompting with Phonemes: Enhancing LLMs’ Multilinguality for Non-Latin Script Languages
[AUTHORS]
Hoang H Nguyen, Khyati Mahajan, Vikas Yadav, Julian Salazar, Philip S. Yu, Masoud Hashemi, Rishabh Maheshwary
[ABSTRACT]
Although multilingual LLMs have achieved remarkable performance across
benchmarks, we find they continue to underperform on non-Latin script languages
across contemporary LLM families. This discrepancy arises from the fact that
LLMs are pretrained with orthographic scripts, which are dominated by Latin
characters that obscure their shared phonology with non-Latin scripts. We
propose leveraging phonemic transcriptions as complementary signals to induce
script-invariant representations. Our study demonstrates that integrating
phonemic signals improves performance across both non-Latin and Latin script
languages, with a particularly significant impact on closing the performance
gap between the two. Through detailed experiments, we show that phonemic and
orthographic scripts retrieve distinct examples for in-context learning (ICL).
This motivates our proposed Mixed-ICL retrieval strategy, where further
aggregation from both leads to significant performance improvements for
both Latin script languages (up to 12.6%) and non-Latin script languages (up to
15.1%) compared to randomized ICL retrieval.
[COMMENTS]
Accepted to NAACL 2025 (Main Conference). This version contains minor
improvements to the camera-ready
[LINK]
http://arxiv.org/abs/2411.02398v3
[DATE]
2025-06-27 01:22:53+08:00
[CATEGORIES]
cs.CL
cs.LG
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
[AUTHORS]
Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu
[ABSTRACT]
Information retrieval is a cornerstone of modern knowledge acquisition,
enabling billions of queries each day across diverse domains. However,
traditional keyword-based search engines are increasingly inadequate for
handling complex, multi-step information needs. Our position is that Large
Language Models (LLMs), endowed with reasoning and agentic capabilities, are
ushering in a new paradigm termed Agentic Deep Research. These systems
transcend conventional information search techniques by tightly integrating
autonomous reasoning, iterative retrieval, and information synthesis into a
dynamic feedback loop. We trace the evolution from static web search to
interactive, agent-based systems that plan, explore, and learn. We also
introduce a test-time scaling law to formalize the impact of computational
depth on reasoning and search. Supported by benchmark results and the rise of
open-source implementations, we demonstrate that Agentic Deep Research not only
significantly outperforms existing approaches, but is also poised to become the
dominant paradigm for future information seeking. All the related resources,
including industry products, research papers, benchmark datasets, and
open-source implementations, are collected for the community in
https://github.com/DavidZWZ/Awesome-Deep-Research.
[LINK]
http://arxiv.org/abs/2506.18959v2
[DATE]
2025-06-27 01:18:00+08:00
[CATEGORIES]
cs.CL
cs.LG
Logios: An open source Greek Polytonic Optical Character Recognition system
[AUTHORS]
Perifanos Konstantinos, Goutsos Dionisis
[ABSTRACT]
In this paper, we present an Optical Character Recognition (OCR) system
specifically designed for the accurate recognition and digitization of Greek
polytonic texts. By leveraging the combined strengths of convolutional layers
for feature extraction and recurrent layers for sequence learning, our system
addresses the unique challenges posed by Greek polytonic scripts. This approach
aims to overcome the limitations of traditional OCR methods, offering
significant improvements in accuracy and efficiency. We release the underlying
model as an open-source library and make our OCR platform available for
academic use.
[LINK]
http://arxiv.org/abs/2506.21474v1
[DATE]
2025-06-27 01:04:27+08:00
[CATEGORIES]
cs.CL
TopK Language Models
[AUTHORS]
Ryosuke Takahashi, Tatsuro Inaba, Kentaro Inui, Benjamin Heinzerling
[ABSTRACT]
Sparse autoencoders (SAEs) have become an important tool for analyzing and
interpreting the activation space of transformer-based language models (LMs).
However, SAEs suffer several shortcomings that diminish their utility and
internal validity. Since SAEs are trained post-hoc, it is unclear if the
failure to discover a particular concept is a failure on the SAE’s side or due
to the underlying LM not representing this concept. This problem is exacerbated
by training conditions and architecture choices affecting which features an SAE
learns. When tracing how LMs learn concepts during training, the lack of
feature stability also makes it difficult to compare SAE features across
different checkpoints. To address these limitations, we introduce a
modification to the transformer architecture that incorporates a TopK
activation function at chosen layers, making the model’s hidden states
equivalent to the latent features of a TopK SAE. This approach eliminates the
need for post-hoc training while providing interpretability comparable to SAEs.
The resulting TopK LMs offer a favorable trade-off between model size,
computational efficiency, and interpretability. Despite this simple
architectural change, TopK LMs maintain their original capabilities while
providing robust interpretability benefits. Our experiments demonstrate that
the sparse representations learned by TopK LMs enable successful steering
through targeted neuron interventions and facilitate detailed analysis of
neuron formation processes across checkpoints and layers. These features make
TopK LMs stable and reliable tools for understanding how language models learn
and represent concepts, which we believe will significantly advance future
research on model interpretability and controllability.
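
A TopK activation of the kind described above can be sketched in a few lines:
keep the k largest entries of a hidden vector and zero out the rest. This is a
minimal, framework-free illustration, not the authors' implementation; the
function name and the tie-breaking rule (lower index wins) are assumptions.

```python
def topk_activation(hidden, k):
    """Keep the k largest entries of `hidden`; set all others to 0.0.

    Illustrative sketch only. Ties are broken in favour of the lower index.
    """
    if k <= 0:
        return [0.0] * len(hidden)
    if k >= len(hidden):
        return list(hidden)
    # Indices sorted by descending value, then ascending index for ties.
    ranked = sorted(range(len(hidden)), key=lambda i: (-hidden[i], i))
    keep = set(ranked[:k])
    return [value if i in keep else 0.0 for i, value in enumerate(hidden)]
```

For example, topk_activation([0.1, 2.0, -1.0, 0.5], 2) leaves only the entries
2.0 and 0.5 non-zero, so the resulting hidden state is sparse by construction.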
[LINK]
http://arxiv.org/abs/2506.21468v1
[DATE]
2025-06-27 00:56:43+08:00
[CATEGORIES]
cs.CL
Aligning Spoken Dialogue Models from User Interactions
[AUTHORS]
Anne Wu, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez
[ABSTRACT]
We propose a novel preference alignment framework for improving spoken
dialogue models on real-time conversations from user interactions. Current
preference learning methods primarily focus on text-based language models, and
are not directly suited to the complexities of real-time speech interactions,
with richer dynamics (e.g. interruption, interjection) and no explicit
segmentation between speaker turns. We create a large-scale dataset of more than
150,000 preference pairs from raw multi-turn speech conversations, annotated
with AI feedback, to cover preferences over both linguistic content and
temporal context variations. We leverage offline alignment methods to finetune
a full-duplex autoregressive speech-to-speech model. Extensive experiments
demonstrate that feedback on generic conversations can be consistently
effective in improving spoken dialogue models to produce more factual, safer
and more contextually aligned interactions. We deploy the finetuned model and
conduct holistic human evaluations to assess the impact beyond single-turn
conversations. Our findings shed light on the importance of a well-calibrated
balance among various dynamics, crucial for natural real-time speech dialogue
systems.
[COMMENTS]
Accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2506.21463v1
[DATE]
2025-06-27 00:45:20+08:00
[CATEGORIES]
cs.CL
cs.LG
Spatial Mental Modeling from Limited Views
[AUTHORS]
Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
[ABSTRACT]
Can Vision Language Models (VLMs) imagine the full scene from just a few
views, like humans do? Humans form spatial mental models, internal
representations of unseen space, to reason about layout, perspective, and
motion. Our new MindCube benchmark with 21,154 questions across 3,268 images
exposes this critical gap, where existing VLMs exhibit near-random performance.
Using MindCube, we systematically evaluate how well VLMs build robust spatial
mental models through representing positions (cognitive mapping), orientations
(perspective-taking), and dynamics (mental simulation for “what-if” movements).
We then explore three approaches to help VLMs approximate spatial mental
models, including unseen intermediate views, natural language reasoning chains,
and cognitive maps. The significant improvement comes from a synergistic
approach, “map-then-reason”, that jointly trains the model to first generate a
cognitive map and then reason upon it. By training models to reason over these
internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding
reinforcement learning pushed performance even further to 70.7% (+32.9%). Our
key insight is that such scaffolding of spatial mental models, actively
constructing and utilizing internal structured spatial representations with
flexible reasoning processes, significantly improves understanding of
unobservable space.
[COMMENTS]
Preprint version
[LINK]
http://arxiv.org/abs/2506.21458v1
[DATE]
2025-06-27 00:38:19+08:00
[CATEGORIES]
cs.CL
Text2Cypher Across Languages: Evaluating Foundational Models Beyond English
[AUTHORS]
Makbule Gulcin Ozsoy, William Tai
[ABSTRACT]
Recent advances in large language models have enabled natural language
interfaces that translate user questions into database queries, such as
Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database
accessibility, most research today focuses solely on English, with limited
evaluation in other languages. This paper investigates the performance of
foundational LLMs on the Text2Cypher task across multiple languages. We create
and release a multilingual test set by translating English questions into
Spanish and Turkish while preserving the original Cypher queries, enabling fair
cross-lingual comparison. We evaluate multiple foundational models using
standardized prompts and metrics. Our results show a consistent performance
pattern: highest on English, then Spanish, and lowest on Turkish. We attribute
this to differences in training data availability and linguistic
characteristics. Additionally, we explore the impact of translating task
prompts into Spanish and Turkish. Results show little to no change in
evaluation metrics, suggesting prompt translation has minor impact. Our
findings highlight the need for more inclusive evaluation and development in
multilingual query generation. Future work includes schema localization and
fine-tuning across diverse languages.
[LINK]
http://arxiv.org/abs/2506.21445v1
[DATE]
2025-06-27 00:31:10+08:00
[CATEGORIES]
cs.CL
Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection
[AUTHORS]
Ali Şenol, Garima Agrawal, Huan Liu
[ABSTRACT]
Detecting deceptive conversations on dynamic platforms is increasingly
difficult due to evolving language patterns and Concept Drift (CD), i.e.,
semantic or topical shifts that alter the context or intent of interactions
over time. These shifts can obscure malicious intent or mimic normal dialogue,
making accurate classification challenging. While Large Language Models (LLMs)
show strong performance in natural language tasks, they often struggle with
contextual ambiguity and hallucinations in risk-sensitive scenarios. To
address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM
framework that integrates pretrained LLMs with structured, task-specific
insights to perform fraud and concept drift detection. The proposed
architecture consists of three main components: (1) a DK-LLM module to detect
fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine
whether a semantic shift has occurred; and (3) a second DK-LLM module to
classify the drift as either benign or fraudulent. We first validate the value
of domain knowledge using a fake review dataset and then apply our full
framework to SEConvo, a multiturn dialogue dataset that includes various types
of fraud and spam attacks. Results show that our system detects fake
conversations with high accuracy and effectively classifies the nature of
drift. Guided by structured prompts, the LLaMA-based implementation achieves
98% classification accuracy. Comparative studies against zero-shot baselines
demonstrate that incorporating domain knowledge and drift awareness
significantly improves performance, interpretability, and robustness in
high-stakes NLP applications.
[LINK]
http://arxiv.org/abs/2506.21443v1
[DATE]
2025-06-27 00:29:45+08:00
[CATEGORIES]
cs.CL
Rethinking LLM Training through Information Geometry and Quantum Metrics
[AUTHORS]
Riccardo Di Sipio
[ABSTRACT]
Optimization in large language models (LLMs) unfolds over high-dimensional
parameter spaces with non-Euclidean structure. Information geometry frames this
landscape using the Fisher information metric, enabling more principled
learning via natural gradient descent. Though often impractical, this geometric
lens clarifies phenomena such as sharp minima, generalization, and observed
scaling laws. We argue that curvature-aware approaches deepen our understanding
of LLM training. Finally, we speculate on quantum analogies based on the
Fubini-Study metric and Quantum Fisher Information, hinting at efficient
optimization in quantum-enhanced systems.
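
For reference, the textbook form of the Fisher information metric and the
natural gradient update referred to above can be written as below. The symbols
(parameters \theta, model p_\theta, loss L, step size \eta, data distribution
D) are notation assumed here for illustration, not taken from the paper.

```latex
F(\theta) = \mathbb{E}_{x \sim D,\; y \sim p_\theta(\cdot \mid x)}
  \left[ \nabla_\theta \log p_\theta(y \mid x)\,
         \nabla_\theta \log p_\theta(y \mid x)^{\top} \right],
\qquad
\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla_\theta L(\theta_t)
```

Inverting F costs roughly cubic time in the number of parameters, which is one
reason the exact update is impractical at LLM scale.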
[COMMENTS]
9 pages, 1 figure(s)
[LINK]
http://arxiv.org/abs/2506.15830v2
[DATE]
2025-06-27 00:14:42+08:00
[CATEGORIES]
cs.CL
Whole-Body Conditioned Egocentric Video Prediction
[AUTHORS]
Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik
[ABSTRACT]
We train models to Predict Ego-centric Video from human Actions (PEVA), given
the past video and an action represented by the relative 3D body pose. By
conditioning on kinematic pose trajectories, structured by the joint hierarchy
of the body, our model learns to simulate how physical human actions shape the
environment from a first-person point of view. We train an auto-regressive
conditional diffusion transformer on Nymeria, a large-scale dataset of
real-world egocentric video and body pose capture. We further design a
hierarchical evaluation protocol with increasingly challenging tasks, enabling
a comprehensive analysis of the model’s embodied prediction and control
abilities. Our work represents an initial attempt to tackle the challenges of
modeling complex real-world environments and embodied agent behaviors with
video prediction from the perspective of a human.
[COMMENTS]
Project Page: https://dannytran123.github.io/PEVA
[LINK]
http://arxiv.org/abs/2506.21552v1
[DATE]
2025-06-27 01:59:59+08:00
[CATEGORIES]
cs.LG
mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale
[AUTHORS]
Xiaona Zhou, Constantin Brif, Ismini Lourentzou
[ABSTRACT]
Multivariate time series anomaly detection (MTS-AD) is critical in domains
like healthcare, cybersecurity, and industrial monitoring, yet remains
challenging due to complex inter-variable dependencies, temporal dynamics, and
sparse anomaly labels. We introduce mTSBench, the largest benchmark to date for
MTS-AD and unsupervised model selection, spanning 344 labeled time series
across 19 datasets and 12 diverse application domains. mTSBench evaluates 24
anomaly detection methods, including large language model (LLM)-based detectors
for multivariate time series, and systematically benchmarks unsupervised model
selection techniques under standardized conditions. Consistent with prior
findings, our results confirm that no single detector excels across datasets,
underscoring the importance of model selection. However, even state-of-the-art
selection methods remain far from optimal, revealing critical gaps. mTSBench
provides a unified evaluation suite to enable rigorous, reproducible
comparisons and catalyze future advances in adaptive anomaly detection and
robust model selection.
[LINK]
http://arxiv.org/abs/2506.21550v1
[DATE]
2025-06-27 01:59:58+08:00
[CATEGORIES]
cs.LG
Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test
[AUTHORS]
Ziyue Li, Chenrui Fan, Tianyi Zhou
[ABSTRACT]
Grokking, i.e., test performance that keeps improving long after training loss
has converged, has been recently witnessed in neural network training, making the
mechanism of generalization and other emerging capabilities such as reasoning
mysterious. While prior studies usually train small models on a few toy or
highly-specific tasks for thousands of epochs, we conduct the first study of
grokking on checkpoints during one-pass pretraining of a 7B large language
model (LLM), i.e., OLMoE. We compute the training loss and evaluate
generalization on diverse benchmark tasks, including math reasoning, code
generation, and commonsense/domain-specific knowledge retrieval tasks.
Our study, for the first time, verifies that grokking still happens in the
pretraining of large-scale foundation models, though different data may enter
grokking stages asynchronously. We further demystify grokking’s “emergence of
generalization” by investigating LLM internal dynamics. Specifically, we find
that training samples’ pathways (i.e., expert choices across layers) evolve
from random, instance-specific to more structured and shareable between samples
during grokking. Also, the complexity of a sample’s pathway reduces despite the
converged loss. These indicate a memorization-to-generalization conversion,
providing a mechanistic explanation of delayed generalization. In the study, we
develop two novel metrics to quantify pathway distance and the complexity of a
single pathway. We show their ability to predict the generalization improvement
on diverse downstream tasks. They are efficient, simple to compute and solely
dependent on training data. Hence, they have practical value for pretraining,
enabling us to monitor the generalization performance without finetuning or
testing. Theoretically, we show that more structured pathways reduce model
complexity and improve the generalization bound.
[LINK]
http://arxiv.org/abs/2506.21551v1
[DATE]
2025-06-27 01:59:58+08:00
[CATEGORIES]
cs.LG
Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval
[AUTHORS]
Hani Alomari, Anushka Sivakumar, Andrew Zhang, Chris Thomas
[ABSTRACT]
Cross-modal image-text retrieval is challenging because of the diverse
possible associations between content from different modalities. Traditional
methods learn a single-vector embedding to represent semantics of each sample,
but struggle to capture nuanced and diverse relationships that can exist across
modalities. Set-based approaches, which represent each sample with multiple
embeddings, offer a promising alternative, as they can capture richer and more
diverse relationships. In this paper, we show that, despite their promise,
these set-based representations continue to face issues including sparse
supervision and set collapse, which limit their effectiveness. To address
these challenges, we propose Maximal Pair Assignment Similarity to optimize
one-to-one matching between embedding sets which preserve semantic diversity
within the set. We also introduce two loss functions to further enhance the
representations: Global Discriminative Loss to enhance distinction among
embeddings, and Intra-Set Divergence Loss to prevent collapse within each set.
Our method achieves state-of-the-art performance on MS-COCO and Flickr30k
without relying on external data.
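
To make the idea of a one-to-one matching between two embedding sets concrete,
the sketch below scores two equally sized sets by the best permutation under
cosine similarity. It is a generic maximum-weight assignment by brute force,
suitable only for small sets, and is not the paper's exact similarity
definition; all names, and the choice of cosine similarity, are assumptions.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def max_matching_similarity(set_a, set_b):
    """Mean similarity under the best one-to-one assignment (brute force).

    Returns (score, assignment), where assignment[i] is the index in set_b
    matched to set_a[i]. Exponential in set size; illustrative only.
    """
    n = len(set_a)
    if n == 0 or n != len(set_b):
        raise ValueError("sets must be non-empty and of equal size")
    sim = [[cosine(a, b) for b in set_b] for a in set_a]
    best_score, best_perm = -math.inf, None
    for perm in permutations(range(n)):
        score = sum(sim[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score / n, best_perm
```

Because every embedding must be paired with a distinct partner, a set whose
embeddings have all collapsed to the same vector cannot score well against a
diverse set. For larger sets, a polynomial-time assignment solver would
replace the permutation loop.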
[COMMENTS]
Accepted at the 63rd Annual Meeting of the Association for
Computational Linguistics (ACL 2025 Main)
[LINK]
http://arxiv.org/abs/2506.21538v1
[DATE]
2025-06-27 01:55:34+08:00
[CATEGORIES]
cs.LG
Exploring the Design Space of 3D MLLMs for CT Report Generation
[AUTHORS]
Mohammed Baharoon, Jun Ma, Congyu Fang, Augustin Toma, Bo Wang
[ABSTRACT]
Multimodal Large Language Models (MLLMs) have emerged as a promising way to
automate Radiology Report Generation (RRG). In this work, we systematically
investigate the design space of 3D MLLMs, including visual input
representation, projectors, Large Language Models (LLMs), and fine-tuning
techniques for 3D CT report generation. We also introduce two knowledge-based
report augmentation methods that improve performance on the GREEN score by up
to 10%, achieving 2nd place in the MICCAI 2024 AMOS-MM challenge. Our
results on the 1,687 cases from the AMOS-MM dataset show that RRG is largely
independent of the size of LLM under the same training protocol. We also show
that larger volume size does not always improve performance if the original ViT
was pre-trained on a smaller volume size. Lastly, we show that using a
segmentation mask along with the CT volume improves performance. The code is
publicly available at https://github.com/bowang-lab/AMOS-MM-Solution
[LINK]
http://arxiv.org/abs/2506.21535v1
[DATE]
2025-06-27 01:54:20+08:00
[CATEGORIES]
cs.LG
Chain-of-Sketch: Enabling Global Visual Reasoning
[AUTHORS]
Aryo Lotfi, Enrico Fini, Samy Bengio, Moin Nabi, Emmanuel Abbe
[ABSTRACT]
Modern vision models have achieved remarkable success in benchmarks where
local features provide critical information about the target. There is now a
growing interest in tackling tasks requiring more global reasoning, where local
features do not provide significant information. Minsky and Papert put forward
such tasks in 1969 with their connectivity study, exposing the limitations of
the perceptron model. In this paper, we introduce an expanded set of global
visual datasets involving graphs, strings, mazes, and image grids. We show that
large vision models still struggle to learn these tasks efficiently. Similarly,
state-of-the-art multi-modal LLMs perform poorly on these datasets. We explain
this learning inefficiency by means of the ‘globality degree’ measure. To
mitigate this, we propose a method called chain-of-sketch (CoS). Similar to the
chain-of-thought and scratchpad techniques used in language models, CoS breaks
the original task into intermediate visual steps to help learn a complex task.
In addition, we show that not all CoS strategies perform equally well. Our key
insight is to impose a Markovian structure on the CoS frames. This leads to the
introduction of ‘inductive CoS’ which achieves better out-of-distribution
generalization and performs well even with smaller models compared to
non-inductive variants.
[COMMENTS]
additional experiments added, title changed from “Visual Scratchpads:
Enabling Global Reasoning in Vision” to “Chain-of-Sketch: Enabling Global
Visual Reasoning”
[LINK]
http://arxiv.org/abs/2410.08165v2
[DATE]
2025-06-27 01:48:33+08:00
[CATEGORIES]
cs.LG
Mesh-Informed Neural Operator: A Transformer Generative Approach
[AUTHORS]
Yaozhong Shi, Zachary E. Ross, Domniki Asimaki, Kamyar Azizzadenesheli
[LINK]
http://arxiv.org/abs/2506.16656v2
[DATE]
2025-06-27 01:45:03+08:00
[CATEGORIES]
cs.LG
Gaussian Invariant Markov Chain Monte Carlo
[AUTHORS]
Michalis K. Titsias, Angelos Alexopoulos, Siran Liu, Petros Dellaportas
[ABSTRACT]
We develop sampling methods, which consist of Gaussian invariant versions of
random walk Metropolis (RWM), Metropolis adjusted Langevin algorithm (MALA) and
second order Hessian or Manifold MALA. Unlike standard RWM and MALA we show
that Gaussian invariant sampling can lead to ergodic estimators with improved
statistical efficiency. This is due to a remarkable property of Gaussian
invariance that allows us to obtain exact analytical solutions to the Poisson
equation for Gaussian targets. These solutions can be used to construct
efficient and easy to use control variates for variance reduction of estimators
under any intractable target. We demonstrate the new samplers and estimators in
several examples, including high dimensional targets in latent Gaussian models
where we compare against several advanced methods and obtain state-of-the-art
results. We also provide theoretical results regarding geometric ergodicity,
and an optimal scaling analysis that shows the dependence of the optimal
acceptance rate on the Gaussianity of the target.
[COMMENTS]
29, 2 figures
[LINK]
http://arxiv.org/abs/2506.21511v1
[DATE]
2025-06-27 01:36:10+08:00
[CATEGORIES]
cs.LG
Process mining-driven modeling and simulation to enhance fault diagnosis in cyber-physical systems
[AUTHORS]
Francesco Vitale, Nicola Dall’Ora, Sebastiano Gaiardelli, Enrico Fraccaroli, Nicola Mazzocca, Franco Fummi
[ABSTRACT]
Fault diagnosis in Cyber-Physical Systems (CPSs) is essential for ensuring
system dependability and operational efficiency by accurately detecting
anomalies and identifying their root causes. However, the manual modeling of
faulty behaviors often demands extensive domain expertise and produces models
that are complex, error-prone, and difficult to interpret. To address this
challenge, we present a novel unsupervised fault diagnosis methodology that
integrates collective anomaly detection in multivariate time series, process
mining, and stochastic simulation. Initially, collective anomalies are detected
from low-level sensor data using multivariate time-series analysis. These
anomalies are then transformed into structured event logs, enabling the
discovery of interpretable process models through process mining. By
incorporating timing distributions into the extracted Petri nets, the approach
supports stochastic simulation of faulty behaviors, thereby enhancing root
cause analysis and behavioral understanding. The methodology is validated using
the Robotic Arm Dataset (RoAD), a widely recognized benchmark in smart
manufacturing. Experimental results demonstrate its effectiveness in modeling,
simulating, and classifying faulty behaviors in CPSs. This enables the creation
of comprehensive fault dictionaries that support predictive maintenance and the
development of digital twins for industrial environments.
[LINK]
http://arxiv.org/abs/2506.21502v1
[DATE]
2025-06-27 01:29:37+08:00
[CATEGORIES]
cs.LG
Devising a solution to the problems of Cancer awareness in Telangana
[AUTHORS]
Priyanka Avhad, Vedanti Kshirsagar, Urvi Ranjan, Mahek Nakhua
[ABSTRACT]
According to the data, the percent of women who underwent screening for
cervical cancer, breast and oral cancer in Telangana in the year 2020 was 3.3
percent, 0.3 percent and 2.3 percent respectively. Although early detection is
the only way to reduce morbidity and mortality, people have very low awareness
about cervical and breast cancer signs and symptoms and screening practices. We
developed an ML classification model to predict if a person is susceptible to
breast or cervical cancer based on demographic factors. We devised a system to
provide suggestions for the nearest hospital or Cancer treatment centres based
on the user's location or address. In addition to this, we can integrate the
health card to maintain medical records of all individuals and conduct
awareness drives and campaigns. For ML classification models, we used decision
tree classification and support vector classification algorithms for cervical
cancer susceptibility and breast cancer susceptibility respectively. Thus, by
devising this solution we come one step closer to our goal which is spreading
cancer awareness, thereby, decreasing the cancer mortality and increasing
cancer literacy among the people of Telangana.
[LINK]
http://arxiv.org/abs/2506.21500v1
[DATE]
2025-06-27 01:29:00+08:00
[CATEGORIES]
cs.LG
Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment
[AUTHORS]
Yuhui Sun, Xiyao Wang, Zixi Li, Jinman Zhao
[ABSTRACT]
While large-scale unsupervised language models (LMs) capture broad world
knowledge and reasoning capabilities, steering their behavior toward desired
objectives remains challenging due to the lack of explicit supervision.
Existing alignment techniques, such as reinforcement learning from human
feedback (RLHF), rely on training a reward model and performing reinforcement
learning to align with human preferences. However, RLHF is often
computationally intensive, unstable, and sensitive to hyperparameters.
To address these limitations, Direct Preference Optimization (DPO) was
introduced as a lightweight and stable alternative, enabling direct alignment
of language models with pairwise preference data via classification loss.
However, DPO and its extensions generally assume a single static preference
distribution, limiting flexibility in multi-objective or dynamic alignment
settings.
In this paper, we propose a novel framework: Multi-Preference Lambda-weighted
Listwise DPO, which extends DPO to incorporate multiple human preference
dimensions (e.g., helpfulness, harmlessness, informativeness) and enables
dynamic interpolation through a controllable simplex-weighted formulation. Our
method supports both listwise preference feedback and flexible alignment across
varying user intents without re-training. Empirical and theoretical analysis
demonstrates that our method is as effective as traditional DPO on static
objectives while offering greater generality and adaptability for real-world
deployment.
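A minimal sketch of what a simplex-weighted listwise objective could look like, for readers who want a concrete picture. The abstract does not give the loss, so everything below is an assumption: the function name `lambda_listwise_loss`, the use of a softmax over `beta`-scaled log-probability ratios as the policy's implied list distribution, and the cross-entropy against a lambda-weighted mixture of per-dimension preference distributions. It is not the authors' implementation.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def lambda_listwise_loss(policy_logp, ref_logp, pref_dists, lam, beta=0.1):
    """Hypothetical lambda-weighted listwise loss for one prompt.

    policy_logp, ref_logp: log-probabilities of N candidate responses.
    pref_dists: array of shape (M, N); row m is a preference distribution
                over the N responses for dimension m (e.g. helpfulness).
    lam: length-M weights on the probability simplex.
    """
    lam = np.asarray(lam, dtype=float)
    if np.any(lam < 0) or not np.isclose(lam.sum(), 1.0):
        raise ValueError("lam must lie on the probability simplex")
    # Interpolate the per-dimension targets with the simplex weights.
    target = lam @ np.asarray(pref_dists, dtype=float)
    # Assumed DPO-style implied distribution over the list of responses.
    implied = softmax(beta * (np.asarray(policy_logp) - np.asarray(ref_logp)))
    # Cross-entropy between the mixed target and the implied distribution.
    return float(-(target * np.log(implied)).sum())
```

Changing `lam` at evaluation time changes only the target mixture, which is one way to read the "dynamic interpolation" described above.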
[COMMENTS]
10 pages, 4 figures, appendix included. To appear in Proceedings of
AAAI 2026. Code:
https://github.com/yuhui15/Multi-Preference-Lambda-weighted-DPO
[LINK]
http://arxiv.org/abs/2506.19780v2
[DATE]
2025-06-27 01:28:25+08:00
[CATEGORIES]
cs.LG
One Model to Forecast Them All and in Entity Distributions Bind Them
[AUTHORS]
Kutay Bölat, Simon Tindemans
[ABSTRACT]
Probabilistic forecasting in power systems often involves multi-entity
datasets like households, feeders, and wind turbines, where generating reliable
entity-specific forecasts presents significant challenges. Traditional
approaches require training individual models for each entity, making them
inefficient and hard to scale. This study addresses this problem using
GUIDE-VAE, a conditional variational autoencoder that allows entity-specific
probabilistic forecasting using a single model. GUIDE-VAE provides flexible
outputs, ranging from interpretable point estimates to full probability
distributions, thanks to its advanced covariance composition structure. These
distributions capture uncertainty and temporal dependencies, offering richer
insights than traditional methods. To evaluate our GUIDE-VAE-based forecaster,
we use household electricity consumption data as a case study due to its
multi-entity and highly stochastic nature. Experimental results demonstrate
that GUIDE-VAE outperforms conventional quantile regression techniques across
key metrics while ensuring scalability and versatility. These features make
GUIDE-VAE a powerful and generalizable tool for probabilistic forecasting
tasks, with potential applications beyond household electricity consumption.
[LINK]
http://arxiv.org/abs/2501.15499v2
[DATE]
2025-06-27 01:28:09+08:00
[CATEGORIES]
cs.LG
Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection
[AUTHORS]
Tobias J. Riedlinger, Kira Maag, Hanno Gottschalk
[ABSTRACT]
Deep neural networks have set the state-of-the-art in computer vision tasks
such as bounding box detection and semantic segmentation. Object detectors and
segmentation models assign confidence scores to predictions, reflecting the
model’s uncertainty in object detection or pixel-wise classification. However,
these confidence estimates are often miscalibrated, as their architectures and
loss functions are tailored to task performance rather than probabilistic
foundation. Even with well calibrated predictions, object detectors fail to
quantify uncertainty outside detected bounding boxes, i.e., the model does not
make a probability assessment of whether an area without detected objects is
truly free of obstacles. This poses a safety risk in applications such as
automated driving, where uncertainty in empty areas remains unexplored. In this
work, we propose an object detection model grounded in spatial statistics.
Bounding box data matches realizations of a marked point process, commonly used
to describe the probabilistic occurrence of spatial point events identified as
bounding box centers, where marks are used to describe the spatial extension of
bounding boxes and classes. Our statistical framework enables a
likelihood-based training and provides well-defined confidence estimates for
whether a region is drivable, i.e., free of objects. We demonstrate the
effectiveness of our method through calibration assessments and evaluation of
performance.
[COMMENTS]
15 pages, 4 figures, 3 tables
[LINK]
http://arxiv.org/abs/2506.21486v1
[DATE]
2025-06-27 01:14:37+08:00
[CATEGORIES]
cs.LG
Evaluation of Traffic Signals for Daily Traffic Pattern
[AUTHORS]
Mohammad Shokrolah Shirazi, Hung-Fu Chang
[ABSTRACT]
The turning movement count data is crucial for traffic signal design,
intersection geometry planning, traffic flow, and congestion analysis. This
work proposes three methods called dynamic, static, and hybrid configuration
for TMC-based traffic signals. A vision-based tracking system is developed to
estimate the TMC of six intersections in Las Vegas using traffic cameras. The
intersection design, route (e.g. vehicle movement directions), and signal
configuration files with compatible formats are synthesized and imported into
Simulation of Urban MObility for signal evaluation with realistic data. The
initial experimental results based on estimated waiting times indicate that the
cycle time of 90 and 120 seconds works best for all intersections. In addition,
four intersections show better performance for dynamic signal timing
configuration, and the other two with lower performance have a lower ratio of
total vehicle count to total lanes of the intersection leg. Since daily traffic
flow often exhibits a bimodal pattern, we propose a hybrid signal method that
switches between dynamic and static methods, adapting to peak and off-peak
traffic conditions for improved flow management. So, a built-in traffic
generator module creates vehicle routes for 4 hours, including peak hours, and
a signal design module produces signal schedule cycles according to static,
dynamic, and hybrid methods. Vehicle count distributions are weighted
differently for each zone (i.e., West, North, East, South) to generate diverse
traffic patterns. The extended experimental results for 6 intersections with 4
hours of simulation time imply that zone-based traffic pattern distributions
affect signal design selection. Although the static method works well when
traffic is distributed evenly across zones, the hybrid method works well for highly
weighted traffic at intersection pairs of the West-East and North-South zones.
[LINK]
http://arxiv.org/abs/2506.21469v1
[DATE]
2025-06-27 00:56:59+08:00
[CATEGORIES]
cs.LG
In-Context Learning Strategies Emerge Rationally
[AUTHORS]
Daniel Wurgaft, Ekdeep Singh Lubana, Core Francisco Park, Hidenori Tanaka, Gautam Reddy, Noah D. Goodman
[ABSTRACT]
Recent work analyzing in-context learning (ICL) has identified a broad set of
strategies that describe model behavior in different experimental conditions.
We aim to unify these findings by asking why a model learns these disparate
strategies in the first place. Specifically, we start with the observation that
when trained to learn a mixture of tasks, as is popular in the literature, the
strategies learned by a model for performing ICL can be captured by a family of
Bayesian predictors: a memorizing predictor, which assumes a discrete prior on
the set of seen tasks, and a generalizing predictor, where the prior matches
the underlying task distribution. Adopting the normative lens of rational
analysis, where a learner’s behavior is explained as an optimal adaptation to
data given computational constraints, we develop a hierarchical Bayesian
framework that almost perfectly predicts Transformer next-token predictions
throughout training – without assuming access to its weights. Under this
framework, pretraining is viewed as a process of updating the posterior
probability of different strategies, and inference-time behavior as a
posterior-weighted average over these strategies’ predictions. Our framework
draws on common assumptions about neural network learning dynamics, which make
explicit a tradeoff between loss and complexity among candidate strategies:
beyond how well it explains the data, a model’s preference towards implementing
a strategy is dictated by its complexity. This helps explain well-known ICL
phenomena, while offering novel predictions: e.g., we show a superlinear trend
in the timescale for transitioning from generalization to memorization as task
diversity increases. Overall, our work advances an explanatory and predictive
account of ICL grounded in tradeoffs between strategy loss and complexity.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2506.17859v2
[DATE]
2025-06-27 00:54:57+08:00
[CATEGORIES]
cs.LG
Optimising 4th-Order Runge-Kutta Methods: A Dynamic Heuristic Approach for Efficiency and Low Storage
[AUTHORS]
Gavin Lee Goodship, Luis Miralles-Pechuan, Stephen O’Sullivan
[ABSTRACT]
Extended Stability Runge-Kutta (ESRK) methods are crucial for solving
large-scale computational problems in science and engineering, including
weather forecasting, aerodynamic analysis, and complex biological modelling.
However, balancing accuracy, stability, and computational efficiency remains
challenging, particularly for high-order, low-storage schemes. This study
introduces a hybrid Genetic Algorithm (GA) and Reinforcement Learning (RL)
approach for automated heuristic discovery, optimising low-storage ESRK
methods. Unlike traditional approaches that rely on manually designed
heuristics or exhaustive numerical searches, our method leverages GA-driven
mutations for search-space exploration and an RL-inspired state transition
mechanism to refine heuristic selection dynamically. This enables systematic
parameter reduction, preserving fourth-order accuracy while significantly
improving computational efficiency. The proposed GA-RL heuristic optimisation
framework is validated through rigorous testing on benchmark problems,
including the 1D and 2D Brusselator systems and the steady-state Navier-Stokes
equations. The best-performing heuristic achieves a 25% reduction in IPOPT
runtime compared to traditional ESRK optimisation processes while maintaining
numerical stability and accuracy. These findings demonstrate the potential of
adaptive heuristic discovery to improve resource efficiency in high-fidelity
simulations and broaden the applicability of low-storage Runge-Kutta methods in
real-world computational fluid dynamics, physics simulations, and other
demanding fields. This work establishes a new paradigm in heuristic
optimisation for numerical methods, opening pathways for further exploration
using Deep RL and AutoML-based heuristic search.
[LINK]
http://arxiv.org/abs/2506.21465v1
[DATE]
2025-06-27 00:51:22+08:00
[CATEGORIES]
cs.LG
Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs
[AUTHORS]
Alexander Ryabchenko, Idan Attias, Daniel M. Roy
[ABSTRACT]
We study online learning with oblivious losses and delays under a novel
“capacity constraint” that limits how many past rounds can be tracked
simultaneously for delayed feedback. Under “clairvoyance” (i.e., delay
durations are revealed upfront each round) and/or “preemptibility” (i.e., we
can stop tracking previously chosen round feedback), we establish matching
upper and lower bounds (up to logarithmic terms) on achievable regret,
characterizing the “optimal capacity” needed to match the minimax rates of
classical delayed online learning, which implicitly assume unlimited capacity.
Our algorithms achieve minimax-optimal regret across all capacity levels, with
performance gracefully degrading under suboptimal capacity. For $K$ actions and
total delay $D$ over $T$ rounds, under clairvoyance and assuming capacity $C =
\Omega(\log(T))$, we achieve regret $\widetilde{\Theta}(\sqrt{TK + DK/C +
D\log(K)})$ for bandits and $\widetilde{\Theta}(\sqrt{(D+T)\log(K)})$ for
full-information feedback. When replacing clairvoyance with preemptibility, we
require a known maximum delay bound $d_{\max}$, adding
${\widetilde{O}(d_{\max})}$ to the regret. For fixed delays $d$ (i.e., $D=Td$),
the minimax regret is $\Theta(\sqrt{TK(1+d/C)+Td\log(K)})$ and the optimal
capacity is $\Theta(\min\{K/\log(K),d\})$ in the bandit setting, while in the
full-information feedback setting, the minimax regret is
$\Theta(\sqrt{T(d+1)\log(K)})$ and the optimal capacity is $\Theta(1)$. For
round-dependent and fixed delays, our upper bounds are achieved using novel
preemptive and non-preemptive scheduling policies, based on Pareto-distributed
proxy delays, and batching techniques, respectively. Crucially, our work
unifies delayed bandits, label-efficient learning, and online scheduling
frameworks, demonstrating that robust online learning under delayed feedback is
possible with surprisingly modest tracking capacity.
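To make the fixed-delay rates above easier to compare, the following sketch evaluates the stated expressions with all constants dropped. The function names are hypothetical, and the returned values are only the quantities inside the Theta notation, not actual regret.

```python
import math


def fixed_delay_bandit_rate(T, K, d, C):
    """sqrt(T*K*(1 + d/C) + T*d*log(K)): bandit rate, constants dropped."""
    return math.sqrt(T * K * (1 + d / C) + T * d * math.log(K))


def fixed_delay_full_info_rate(T, K, d):
    """sqrt(T*(d + 1)*log(K)): full-information rate, constants dropped."""
    return math.sqrt(T * (d + 1) * math.log(K))


def optimal_bandit_capacity(K, d):
    """min{K/log(K), d}: order of the optimal capacity in the bandit setting."""
    return min(K / math.log(K), d)


if __name__ == "__main__":
    T, K, d = 10_000, 16, 8
    for C in (1, 2, 4, 8):
        print(C, round(fixed_delay_bandit_rate(T, K, d, C), 1))
    print("capacity order:", optimal_bandit_capacity(K, d))
```

Once `C` reaches the order of `min{K/log(K), d}`, the `d/C` term no longer dominates the expression, which is the sense in which modest capacity suffices.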
[LINK]
http://arxiv.org/abs/2503.19856v2
[DATE]
2025-06-27 00:47:52+08:00
[CATEGORIES]
cs.LG
Wild refitting for black box prediction
[AUTHORS]
Martin J. Wainwright
[ABSTRACT]
We describe and analyze a computationally efficient refitting procedure for
computing high-probability upper bounds on the instance-wise mean-squared
prediction error of penalized nonparametric estimates based on least-squares
minimization. Requiring only a single dataset and black box access to the
prediction method, it consists of three steps: computing suitable residuals,
symmetrizing and scaling them with a pre-factor $\rho$, and using them to
define and solve a modified prediction problem recentered at the current
estimate. We refer to it as wild refitting, since it uses Rademacher residual
symmetrization as in a wild bootstrap variant. Under relatively mild conditions
allowing for noise heterogeneity, we establish a high probability guarantee on
its performance, showing that the wild refit with a suitably chosen wild noise
scale $\rho$ gives an upper bound on prediction error. This theoretical
analysis provides guidance into the design of such procedures, including how
the residuals should be formed, the amount of noise rescaling in the wild
sub-problem needed for upper bounds, and the local stability properties of the
black-box procedure. We illustrate the applicability of this procedure to
various problems, including non-rigid structure-from-motion recovery with
structured matrix penalties; plug-and-play image restoration with deep neural
network priors; and randomized sketching with kernel methods.
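The three steps named in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the black-box predictor is stood in for by a closed-form ridge regression (`ridge_fit_predict`, a hypothetical helper), residuals are taken as plain `y - fitted`, and the sketch stops at the refit without computing the error bound, whose exact form the abstract does not give.

```python
import numpy as np


def ridge_fit_predict(X, y, lam=1.0):
    """Stand-in black box: ridge regression, returns fitted values on X."""
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return X @ w


def wild_refit(X, y, fit_predict, rho=1.0, seed=0):
    """Residuals -> Rademacher symmetrization and scaling -> recentered refit."""
    rng = np.random.default_rng(seed)
    fitted = fit_predict(X, y)                      # current estimate
    residuals = y - fitted                          # step 1: residuals
    signs = rng.choice([-1.0, 1.0], size=y.shape)   # Rademacher signs
    wild_noise = rho * signs * residuals            # step 2: symmetrize, scale
    y_wild = fitted + wild_noise                    # step 3: recenter at estimate
    refitted = fit_predict(X, y_wild)               # solve the modified problem
    return fitted, refitted, wild_noise


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    y = X @ np.arange(1.0, 6.0) + rng.normal(size=200)
    fitted, refitted, noise = wild_refit(X, y, ridge_fit_predict, rho=1.5)
    print(float(np.mean((refitted - fitted) ** 2)))
```

Only `fit_predict` touches the model, so any predictor exposing the same call can be swapped in, which matches the black-box access described above.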
[LINK]
http://arxiv.org/abs/2506.21460v1
[DATE]
2025-06-27 00:41:55+08:00
[CATEGORIES]
cs.LG
Fake it till You Make it: Reward Modeling as Discriminative Prediction
[AUTHORS]
Runtao Liu, Jiahao Zhan, Yingqing He, Chen Wei, Alan Yuille, Qifeng Chen
[ABSTRACT]
An effective reward model plays a pivotal role in reinforcement learning for
post-training enhancement of visual generative models. However, current
approaches of reward modeling suffer from implementation complexity due to
their reliance on extensive human-annotated preference data or meticulously
engineered quality dimensions that are often incomplete and
engineering-intensive. Inspired by adversarial training in generative
adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward
modeling framework that eliminates manual preference annotation and explicit
quality dimension engineering. Our method trains the reward model through
discrimination between a small set of representative, unpaired target
samples (denoted as Preference Proxy Data) and model-generated ordinary outputs,
requiring only a few hundred target samples. Comprehensive experiments
demonstrate our GAN-RM’s effectiveness across multiple key applications
including test-time scaling implemented as Best-of-N sample filtering,
post-training approaches like Supervised Fine-Tuning (SFT) and Direct
Preference Optimization (DPO). Code and data will be released at
https://github.com/Visualignment/GAN-RM.
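Best-of-N sample filtering, mentioned above as the test-time scaling application, amounts to scoring N candidates with a reward model and keeping the highest-scoring one. A minimal sketch follows; `best_of_n` and the toy reward are hypothetical placeholders, and a trained reward model would take the place of `reward_fn`.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], reward_fn: Callable[[T], float]) -> T:
    """Return the candidate with the highest reward (first one on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


if __name__ == "__main__":
    # Toy stand-in for a reward model: prefer values close to 5.
    toy_reward = lambda x: -abs(x - 5)
    print(best_of_n([1, 5, 3], toy_reward))
```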
[LINK]
http://arxiv.org/abs/2506.13846v2
[DATE]
2025-06-27 00:39:32+08:00
[CATEGORIES]
cs.LG
Measurement to Meaning: A Validity-Centered Framework for AI Evaluation
[AUTHORS]
Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo
[ABSTRACT]
While the capabilities and utility of AI systems have advanced, rigorous
norms for evaluating these systems have lagged. Grand claims, such as models
achieving general reasoning capabilities, are supported with model performance
on narrow benchmarks, like performance on graduate-level exam questions, which
provide a limited and potentially misleading assessment. We provide a
structured approach for reasoning about the types of evaluative claims that can
be made given the available evidence. For instance, our framework helps
determine whether performance on a mathematical benchmark is an indication of
the ability to solve problems on math tests or instead indicates a broader
ability to reason. Our framework is well-suited for the contemporary paradigm
in machine learning, where various stakeholders provide measurements and
evaluations that downstream users use to validate their claims and decisions.
At the same time, our framework also informs the construction of evaluations
designed to speak to the validity of the relevant claims. By leveraging
psychometrics’ breakdown of validity, evaluations can prioritize the most
critical facets for a given claim, improving empirical utility and
decision-making efficacy. We illustrate our framework through detailed case
studies of vision and language model evaluations, highlighting how explicitly
considering validity strengthens the connection between evaluation evidence and
the claims being made.
[LINK]
http://arxiv.org/abs/2505.10573v4
[DATE]
2025-06-27 00:38:11+08:00
[CATEGORIES]
cs.LG
PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries
[AUTHORS]
Steven Kolawole, Keshav Santhanam, Virginia Smith, Pratiksha Thaker
[ABSTRACT]
LLM serving systems typically treat user prompts as monolithic inputs,
optimizing inference through decoding tricks or inter-query batching. However,
many real-world prompts contain latent semantic parallelism: decomposable
structures where subtasks can be executed independently to reduce latency while
preserving meaning. We introduce PARALLELPROMPT, the first benchmark for
measuring intra-query parallelism in natural user prompts. Our dataset
comprises over 37,000 real-world prompts from public LLM chat logs, each
annotated with a structured schema capturing task templates, shared context,
and iteration inputs. These schemas are extracted using LLM-assisted prompting
with rule-based multilingual validation. To evaluate the benefits of
decomposition, we provide an execution suite that benchmarks serial vs.
parallel strategies, measuring latency, structural adherence, and semantic
fidelity. Our results show that intra-query parallelism can be successfully
parsed in over 75% of curated datasets, unlocking up to 5x speedups on tasks
like translation, comprehension, and comparative analysis, with minimal quality
degradation. By releasing this benchmark, curation pipeline, and evaluation
suite, we provide the first standardized testbed for studying structure-aware
execution in LLM serving pipelines.
[COMMENTS]
In Adaptive Foundation Models: Evolving AI for Personalized and
Efficient Learning
[LINK]
http://arxiv.org/abs/2506.18728v2
[DATE]
2025-06-27 00:35:54+08:00
[CATEGORIES]
cs.LG
Towards an Optimal Control Perspective of ResNet Training
[AUTHORS]
Jens Püttschneider, Simon Heilig, Asja Fischer, Timm Faulwasser
[ABSTRACT]
We propose a training formulation for ResNets reflecting an optimal control
problem that is applicable for standard architectures and general loss
functions. We suggest bridging both worlds via penalizing intermediate outputs
of hidden states corresponding to stage cost terms in optimal control. For
standard ResNets, we obtain intermediate outputs by propagating the state
through the subsequent skip connections and the output layer. We demonstrate
that our training dynamic biases the weights of the unnecessary deeper residual
layers to vanish. This indicates the potential for a theory-grounded layer
pruning strategy.
[COMMENTS]
Accepted for presentation at the High-dimensional Learning Dynamics
(HiLD) workshop at ICML 2025
[LINK]
http://arxiv.org/abs/2506.21453v1
[DATE]
2025-06-27 00:34:47+08:00
[CATEGORIES]
cs.LG
Learnable Adaptive Time-Frequency Representation via Differentiable Short-Time Fourier Transform
[AUTHORS]
Maxime Leiber, Yosra Marnissi, Axel Barrau, Sylvain Meignen, Laurent Massoulié
[ABSTRACT]
The short-time Fourier transform (STFT) is widely used for analyzing
non-stationary signals. However, its performance is highly sensitive to its
parameters, and manual or heuristic tuning often yields suboptimal results. To
overcome this limitation, we propose a unified differentiable formulation of
the STFT that enables gradient-based optimization of its parameters. This
approach addresses the limitations of traditional STFT parameter tuning
methods, which often rely on computationally intensive discrete searches. It
enables fine-tuning of the time-frequency representation (TFR) based on any
desired criterion. Moreover, our approach integrates seamlessly with neural
networks, allowing joint optimization of the STFT parameters and network
weights. The efficacy of the proposed differentiable STFT in enhancing TFRs and
improving performance in downstream tasks is demonstrated through experiments
on both simulated and real-world data.
[COMMENTS]
DSTFT, STFT, spectrogram, time-frequency, IEEE Transactions on Signal
Processing, 10 pages
[LINK]
http://arxiv.org/abs/2506.21440v1
[DATE]
2025-06-27 00:24:27+08:00
[CATEGORIES]
cs.LG
New Bounds for Sparse Variational Gaussian Processes
[AUTHORS]
Michalis K. Titsias
[ABSTRACT]
Sparse variational Gaussian processes (GPs) construct tractable posterior
approximations to GP models. At the core of these methods is the assumption
that the true posterior distribution over training function values ${\bf f}$
and inducing variables ${\bf u}$ is approximated by a variational distribution
that incorporates the conditional GP prior $p({\bf f} | {\bf u})$ in its
factorization. While this assumption is considered as fundamental, we show that
for model training we can relax it through the use of a more general
variational distribution $q({\bf f} | {\bf u})$ that depends on $N$ extra
parameters, where $N$ is the number of training examples. In GP regression, we
can analytically optimize the evidence lower bound over the extra parameters
and express a tractable collapsed bound that is tighter than the previous
bound. The new bound is also amenable to stochastic optimization and its
implementation requires minor modifications to existing sparse GP code.
Further, we also describe extensions to non-Gaussian likelihoods. On several
datasets we demonstrate that our method can reduce bias when learning the
hyperparameters and can lead to better predictive performance.
[COMMENTS]
18 pages, 5 figures
[LINK]
http://arxiv.org/abs/2502.08730v2
[DATE]
2025-06-27 00:24:25+08:00
[CATEGORIES]
cs.LG
Graph Neural Network for Neutrino Physics Event Reconstruction
[AUTHORS]
V Hewes, Adam Aurisano, Giuseppe Cerati, Jim Kowalkowski, Claire Lee, Wei-keng Liao, Daniel Grzenda, Kaushal Gumpula, Xiaohe Zhang
[ABSTRACT]
Liquid Argon Time Projection Chamber (LArTPC) detector technology offers a
wealth of high-resolution information on particle interactions, and leveraging
that information to its full potential requires sophisticated automated
reconstruction techniques. This article describes NuGraph2, a Graph Neural
Network (GNN) for low-level reconstruction of simulated neutrino interactions
in a LArTPC detector. Simulated neutrino interactions in the MicroBooNE
detector geometry are described as heterogeneous graphs, with energy
depositions on each detector plane forming nodes on planar subgraphs. The
network utilizes a multi-head attention message-passing mechanism to perform
background filtering and semantic labelling on these graph nodes, identifying
those associated with the primary physics interaction with 98.0% efficiency
and labelling them according to particle type with 94.9% efficiency. The
network operates directly on detector observables across multiple 2D
representations, but utilizes a 3D-context-aware mechanism to encourage
consistency between these representations. Model inference takes 0.12 s/event
on a CPU, and 0.005 s/event batched on a GPU. This architecture is designed to
be a general-purpose solution for particle reconstruction in neutrino physics,
with the potential for deployment across a broad range of detector
technologies, and offers a core convolution engine that can be leveraged for a
variety of tasks beyond the two described in this article.
[COMMENTS]
18 pages, 14 figures, published in Physical Review D
[LINK]
http://arxiv.org/abs/2403.11872v2
[DATE]
2025-06-27 00:15:31+08:00
[CATEGORIES]
cs.LG
The Sample Complexity of Learning Lipschitz Operators with respect to Gaussian Measures
[AUTHORS]
Ben Adcock, Michael Griebel, Gregor Maier
[ABSTRACT]
Operator learning, the approximation of mappings between infinite-dimensional
function spaces using machine learning, has gained increasing research
attention in recent years. Approximate operators, learned from data, can serve
as efficient surrogate models for problems in computational science and
engineering, complementing traditional methods. However, despite their
empirical success, our understanding of the underlying mathematical theory is
in large part still incomplete. In this paper, we study the approximation of
Lipschitz operators with respect to Gaussian measures. We prove higher Gaussian
Sobolev regularity of Lipschitz operators and establish lower and upper bounds
on the Hermite polynomial approximation error. We then study general
reconstruction strategies of Lipschitz operators from $m$ arbitrary
(potentially adaptive) linear samples. As a key finding, we tightly
characterize the corresponding sample complexity, that is, the smallest
achievable worst-case error among all possible choices of (adaptive) sampling
and reconstruction strategies in terms of $m$. As a consequence, we identify an
inherent curse of sample complexity: No method to approximate Lipschitz
operators based on $m$ linear samples can achieve algebraic convergence rates
in $m$. On the positive side, we prove that a sufficiently fast spectral decay
of the covariance operator of the underlying Gaussian measure guarantees
convergence rates which are arbitrarily close to any algebraic rate. Overall,
by tightly characterizing the sample complexity, our work confirms the
intrinsic difficulty of learning Lipschitz operators, regardless of the data or
learning technique.
[COMMENTS]
Section 6 about pointwise sampling in v2 of this paper has been cut
and will appear elsewhere
[LINK]
http://arxiv.org/abs/2410.23440v3
[DATE]
2025-06-27 00:15:09+08:00
[CATEGORIES]
cs.LG
Deception Detection in Dyadic Exchanges Using Multimodal Machine Learning: A Study on a Swedish Cohort
[AUTHORS]
Franco Rugolon, Thomas Jack Samuels, Stephan Hau, Lennart Högman
[ABSTRACT]
This study investigates the efficacy of using multimodal machine learning
techniques to detect deception in dyadic interactions, focusing on the
integration of data from both the deceiver and the deceived. We compare early
and late fusion approaches, utilizing audio and video data - specifically,
Action Units and gaze information - across all possible combinations of
modalities and participants. Our dataset, newly collected from Swedish native
speakers engaged in truth or lie scenarios on emotionally relevant topics,
serves as the basis for our analysis. The results demonstrate that
incorporating both speech and facial information yields superior performance
compared to single-modality approaches. Moreover, including data from both
participants significantly enhances deception detection accuracy, with the best
performance (71%) achieved using a late fusion strategy applied to both
modalities and participants. These findings align with psychological theories
suggesting differential control of facial and vocal expressions during initial
interactions. As the first study of its kind on a Scandinavian cohort, this
research lays the groundwork for future investigations into dyadic
interactions, particularly within psychotherapy settings.
[COMMENTS]
40 pages, 2 figures, 2 tables. To be submitted to Behavior Research
Methods
[LINK]
http://arxiv.org/abs/2506.21429v1
[DATE]
2025-06-27 00:11:42+08:00
[CATEGORIES]
cs.LG
Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning
[AUTHORS]
Prajwal Koirala, Cody Fleming
[ABSTRACT]
Generative models such as diffusion and flow-matching offer expressive
policies for offline reinforcement learning (RL) by capturing rich, multimodal
action distributions, but their iterative sampling introduces high inference
costs and training instability due to gradient propagation across sampling
steps. We propose the Single-Step Completion Policy (SSCP), a
generative policy trained with an augmented flow-matching objective to predict
direct completion vectors from intermediate flow samples, enabling accurate,
one-shot action generation. In an off-policy actor-critic framework, SSCP
combines the expressiveness of generative models with the training and
inference efficiency of unimodal policies, without requiring long
backpropagation chains. Our method scales effectively to offline,
offline-to-online, and online RL settings, offering substantial gains in speed
and adaptability over diffusion-based baselines. We further extend SSCP to
goal-conditioned RL, enabling flat policies to exploit subgoal structures
without explicit hierarchical inference. SSCP achieves strong results across
standard offline RL and behavior cloning benchmarks, positioning it as a
versatile, expressive, and efficient framework for deep RL and sequential
decision-making.
[LINK]
http://arxiv.org/abs/2506.21427v1
[DATE]
2025-06-27 00:09:53+08:00
[CATEGORIES]
cs.LG
TracLLM: A Generic Framework for Attributing Long Context LLMs
[AUTHORS]
Yanting Wang, Wei Zou, Runpeng Geng, Jinyuan Jia
[ABSTRACT]
Long context large language models (LLMs) are deployed in many real-world
applications such as RAG, agents, and broad LLM-integrated applications. Given
an instruction and a long context (e.g., documents, PDF files, webpages), a
long context LLM can generate an output grounded in the provided context,
aiming to provide more accurate, up-to-date, and verifiable outputs while
reducing hallucinations and unsupported claims. This raises a research
question: how to pinpoint the texts (e.g., sentences, passages, or paragraphs)
in the context that contribute most to or are responsible for the generated
output by an LLM? This process, which we call context traceback, has various
real-world applications, such as 1) debugging LLM-based systems, 2) conducting
post-attack forensic analysis for attacks (e.g., prompt injection attacks,
knowledge corruption attacks) on an LLM, and 3) highlighting knowledge sources
to enhance the trust of users towards outputs generated by LLMs. When applied
to context traceback for long context LLMs, existing feature attribution
methods such as Shapley have sub-optimal performance and/or incur a large
computational cost. In this work, we develop TracLLM, the first generic context
traceback framework tailored to long context LLMs. Our framework can improve
the effectiveness and efficiency of existing feature attribution methods. To
improve the efficiency, we develop an informed search based algorithm in
TracLLM. We also develop contribution score ensemble/denoising techniques to
improve the accuracy of TracLLM. Our evaluation results show TracLLM can
effectively identify texts in a long context that lead to the output of an LLM.
Our code and data are at: https://github.com/Wang-Yanting/TracLLM.
[COMMENTS]
To appear in USENIX Security Symposium 2025. The code and data are
at: https://github.com/Wang-Yanting/TracLLM
[LINK]
http://arxiv.org/abs/2506.04202v3
[DATE]
2025-06-27 00:09:36+08:00
[CATEGORIES]
cs.LG
Improving Stochastic Cubic Newton with Momentum
[AUTHORS]
El Mahdi Chayti, Nikita Doikov, Martin Jaggi
[ABSTRACT]
We study stochastic second-order methods for solving general non-convex
optimization problems. We propose using a special version of momentum to
stabilize the stochastic gradient and Hessian estimates in Newton’s method. We
show that momentum provably improves the variance of stochastic estimates and
allows the method to converge for any noise level. Using the cubic
regularization technique, we prove a global convergence rate for our method on
general non-convex problems to a second-order stationary point, even when using
only a single stochastic data sample per iteration. This starkly contrasts with
all existing stochastic second-order methods for non-convex problems, which
typically require large batches. Therefore, we are the first to demonstrate
global convergence for batches of arbitrary size in the non-convex case for the
Stochastic Cubic Newton. Additionally, we show improved speed on convex
stochastic problems for our regularized Newton methods with momentum.
[LINK]
http://arxiv.org/abs/2410.19644v2
[DATE]
2025-06-27 00:07:20+08:00
[CATEGORIES]
cs.LG
Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference
[AUTHORS]
Colin Samplawski, Adam D. Cobb, Manoj Acharya, Ramneet Kaur, Susmit Jha
[ABSTRACT]
Despite their widespread use, large language models (LLMs) are known to
hallucinate incorrect information and be poorly calibrated. This makes the
uncertainty quantification of these models of critical importance, especially
in high-stakes domains, such as autonomy and healthcare. Prior work has made
Bayesian deep learning-based approaches to this problem more tractable by
performing inference over the low-rank adaptation (LoRA) parameters of a
fine-tuned model. While effective, these approaches struggle to scale to larger
LLMs because they require additional parameters beyond those of LoRA. In this
work we present Scalable Bayesian Low-Rank
Adaptation via Stochastic Variational Subspace Inference (ScalaBL). We perform
Bayesian inference in an $r$-dimensional subspace, for LoRA rank $r$. By
repurposing the LoRA parameters as projection matrices, we are able to map
samples from this subspace into the full weight space of the LLM. This allows
us to learn all the parameters of our approach using stochastic variational
inference. Despite the low dimensionality of our subspace, we are able to
achieve competitive performance with state-of-the-art approaches while only
requiring ${\sim}1000$ additional parameters. Furthermore, it allows us to
scale up to the largest Bayesian LLM to date, with four times as many base
parameters as prior work.
[COMMENTS]
Accepted at UAI 2025
[LINK]
http://arxiv.org/abs/2506.21408v1
[DATE]
2025-06-26 23:54:45+08:00
[CATEGORIES]
cs.LG
cs.CL
DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
[AUTHORS]
Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang
[ABSTRACT]
Diffusion large language models (dLLMs) are compelling alternatives to
autoregressive (AR) models because their denoising models operate over the
entire sequence. The global planning and iterative refinement features of dLLMs
are particularly useful for code generation. However, current training and
inference mechanisms for dLLMs in coding are still under-explored. To demystify
the decoding behavior of dLLMs and unlock their potential for coding, we
systematically investigate their denoising processes and reinforcement learning
(RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code.
Using this model as a testbed, we analyze its decoding behavior, revealing how
it differs from that of AR models: (1) dLLMs can decide how causal their
generation should be without relying on semi-AR decoding, and (2) increasing
the sampling temperature diversifies not only token choices but also their
generation order. This diversity creates a rich search space for RL rollouts.
For RL training, to reduce the variance of token log-likelihood estimates and
maintain training efficiency, we propose coupled-GRPO, a novel
sampling scheme that constructs complementary mask noise for completions used
in training. In our experiments, coupled-GRPO significantly improves
DiffuCoder’s performance on code generation benchmarks (+4.4% on EvalPlus) and
reduces reliance on AR bias during decoding. Our work provides deeper insight
into the machinery of dLLM generation and offers an effective, diffusion-native
RL training framework. https://github.com/apple/ml-diffucoder.
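
One plausible reading of "complementary mask noise" can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `complementary_masks`, the `mask_prob` parameter, and the pairing of one random mask with its exact complement are all hypothetical choices made here. The idea shown is only that, across the pair, every completion token is masked exactly once.

```python
import random

def complementary_masks(length, mask_prob, rng):
    """Return two boolean masks over `length` completion tokens.

    Assumption (not from the paper): the second mask is the exact
    complement of the first, so each token is masked in exactly one.
    """
    first = [rng.random() < mask_prob for _ in range(length)]
    second = [not masked for masked in first]
    return first, second

rng = random.Random(0)  # fixed seed so the sketch is reproducible
first, second = complementary_masks(8, 0.5, rng)

# Each position is True in exactly one of the two masks.
covered = [a != b for a, b in zip(first, second)]
```

Under this reading, a per-token quantity estimated on masked positions is evaluated once per token across the pair, which is one way such a scheme could reduce estimator variance.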
[COMMENTS]
minor update
[LINK]
http://arxiv.org/abs/2506.20639v2
[DATE]
2025-06-26 23:46:40+08:00
[CATEGORIES]
cs.CL
Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings
[AUTHORS]
Ghazal Al-Shwayyat, Omer Nezih Gerek
[ABSTRACT]
Arabic dialect recognition presents a significant challenge in speech
technology due to the linguistic diversity of Arabic and the scarcity of large
annotated datasets, particularly for underrepresented dialects. This research
investigates hybrid modeling strategies that integrate classical signal
processing techniques with deep learning architectures to address this problem
in low-resource scenarios. Two hybrid models were developed and evaluated: (1)
Mel-Frequency Cepstral Coefficients (MFCC) combined with a Convolutional Neural
Network (CNN), and (2) Discrete Wavelet Transform (DWT) features combined with
a Recurrent Neural Network (RNN). The models were trained on a dialect-filtered
subset of the Common Voice Arabic dataset, with dialect labels assigned based
on speaker metadata. Experimental results demonstrate that the MFCC + CNN
architecture achieved superior performance, with an accuracy of 91.2% and
strong precision, recall, and F1-scores, significantly outperforming the
Wavelet + RNN configuration, which achieved an accuracy of 66.5%. These
findings highlight the effectiveness of leveraging spectral features with
convolutional models for Arabic dialect recognition, especially when working
with limited labeled data. The study also identifies limitations related to
dataset size, potential regional overlaps in labeling, and model optimization,
providing a roadmap for future research. Recommendations for further
improvement include the adoption of larger annotated corpora, integration of
self-supervised learning techniques, and exploration of advanced neural
architectures such as Transformers. Overall, this research establishes a strong
baseline for future developments in Arabic dialect recognition within
resource-constrained environments.
[LINK]
http://arxiv.org/abs/2506.21386v1
[DATE]
2025-06-26 23:36:25+08:00
[CATEGORIES]
cs.CL
Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation
[AUTHORS]
Guanting Dong, Xiaoxi Li, Yuyao Zhang, Mengjie Deng
[ABSTRACT]
Real-world live retrieval-augmented generation (RAG) systems face significant
challenges when processing user queries that are often noisy, ambiguous, and
contain multiple intents. While RAG enhances large language models (LLMs) with
external knowledge, current systems typically struggle with such complex
inputs, as they are often trained or evaluated on cleaner data. This paper
introduces Omni-RAG, a novel framework designed to improve the robustness and
effectiveness of RAG systems in live, open-domain settings. Omni-RAG employs
LLM-assisted query understanding to preprocess user inputs through three key
modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs
with tailored prompts to denoise queries (e.g., correcting spelling errors) and
decompose multi-intent queries into structured sub-queries; (2) Intent-Aware
Knowledge Retrieval, which performs retrieval for each sub-query from a corpus
(i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking
and Generation, where a reranker (i.e., BGE) refines document selection before
a final response is generated by an LLM (i.e., Falcon-10B) using a
chain-of-thought prompt. Omni-RAG aims to bridge the gap between current RAG
capabilities and the demands of real-world applications, such as those
highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex
and noisy queries.
[COMMENTS]
Accepted at SIGIR 2025 LiveRAG Workshop (Oral Presentation)
[LINK]
http://arxiv.org/abs/2506.21384v1
[DATE]
2025-06-26 23:35:12+08:00
[CATEGORIES]
cs.CL
Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models
[AUTHORS]
Fangzhou Dong, Yifan Zeng, Yingpeng Sang, Hong Shen
[ABSTRACT]
Large Language Models (LLMs) excel in understanding and generating text but
struggle with providing professional literary criticism for works with profound
thoughts and complex narratives. This paper proposes GLASS (Greimas Literary
Analysis via Semiotic Square), a structured analytical framework based on
Greimas Semiotic Square (GSS), to enhance LLMs’ ability to conduct in-depth
literary analysis. GLASS facilitates the rapid dissection of narrative
structures and deep meanings in narrative works. We propose the first dataset
for GSS-based literary criticism, featuring detailed analyses of 48 works. Then
we propose quantitative metrics for GSS-based literary criticism using the
LLM-as-a-judge paradigm. Our framework’s results, compared with expert
criticism across multiple works and LLMs, show high performance. Finally, we
applied GLASS to 39 classic works, producing original and high-quality analyses
that address existing research gaps. This research provides an AI-based tool
for literary research and education, offering insights into the cognitive
mechanisms underlying literary engagement.
[COMMENTS]
Accepted in CogSci 2025
[LINK]
http://arxiv.org/abs/2506.21360v1
[DATE]
2025-06-26 23:10:24+08:00
[CATEGORIES]
cs.CL
Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts
[AUTHORS]
Jiajie Yang
[ABSTRACT]
Mixture-of-Experts (MoE) architectures have emerged as a key strategy for
scaling large language models (LLMs) efficiently. However, current MoE systems
suffer from severe load imbalance, where only a small subset of experts is
consistently activated during training and inference, leading to significant
underutilization of model capacity and computational resources. In this work,
we revisit expert routing through a clustering perspective and propose Latent
Prototype Routing (LPR), a novel routing framework that generalizes existing
approaches while promoting balanced expert utilization without compromising
downstream performance. Extensive experiments across multiple open-source MoE
models – including DeepSeek-V3, Qwen3-MoE, and Mixtral – demonstrate that LPR
reduces the Gini coefficient of expert load from 0.70 to 0.035 on average and
improves the min-max expert load ratio from 1e-6 to 0.70, achieving
near-perfect load balancing.
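
As a rough illustration of the two balance metrics quoted in this abstract, the Python sketch below computes a Gini coefficient and a min-max ratio from per-expert token counts. The function names, the example loads, and the definition of the min-max ratio as minimum load divided by maximum load are assumptions made here for illustration; the paper may define or normalize them differently.

```python
def gini(loads):
    """Gini coefficient of non-negative loads.

    0 means perfectly even load; values near 1 mean the load is
    concentrated on very few experts.
    """
    xs = sorted(loads)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on sorted values:
    # G = 2 * sum_i(i * x_i) / (n * sum_i x_i) - (n + 1) / n, i = 1..n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

def min_max_ratio(loads):
    """Assumed definition: smallest expert load divided by the largest."""
    largest = max(loads)
    return min(loads) / largest if largest > 0 else 0.0

balanced = [10, 10, 10, 10]   # hypothetical token counts per expert
collapsed = [0, 0, 0, 40]     # one expert receives every token
```

For four experts, the balanced case gives a Gini of 0 and a ratio of 1, while the collapsed case gives a Gini of 0.75 (the maximum for n = 4 under this formula) and a ratio of 0.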
[COMMENTS]
15 pages, 4 figures
[LINK]
http://arxiv.org/abs/2506.21328v1
[DATE]
2025-06-26 22:41:18+08:00
[CATEGORIES]
cs.LG
cs.CL
Exploring Adapter Design Tradeoffs for Low Resource Music Generation
[AUTHORS]
Atharva Mehta, Shivam Chauhan, Monojit Choudhury
[ABSTRACT]
Fine-tuning large-scale music generation models, such as MusicGen and
Mustango, is a computationally expensive process, often requiring updates to
billions of parameters and, therefore, significant hardware resources.
Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based
methods, have emerged as a promising alternative, enabling adaptation with
minimal trainable parameters while preserving model performance. However, the
design choices for adapters, including their architecture, placement, and size,
are numerous, and it is unclear which of these combinations would produce
optimal adapters and why, for a given low-resource music genre. In this
paper, we attempt to answer this question by studying various adapter
configurations for two AI music models, MusicGen and Mustango, on two genres:
Hindustani Classical and Turkish Makam music.
Our findings reveal distinct trade-offs: convolution-based adapters excel in
capturing fine-grained local musical details such as ornamentations and short
melodic phrases, while transformer-based adapters better preserve long-range
dependencies crucial for structured improvisation. Additionally, we analyze
computational resource requirements across different adapter scales,
demonstrating how mid-sized adapters (40M parameters) achieve an optimal
balance between expressivity and quality. Furthermore, we find that Mustango, a
diffusion-based model, generates more diverse outputs with better adherence to
the description in the input prompt but lacks stability in
notes, rhythm alignment, and aesthetics. It is also computationally intensive
and requires significantly more time to train. In contrast, autoregressive
models like MusicGen offer faster training and are more efficient, and can
produce better quality output in comparison, but have slightly higher
redundancy in their generations.
[COMMENTS]
9 pages, 5 figures
[LINK]
http://arxiv.org/abs/2506.21298v1
[DATE]
2025-06-26 22:18:39+08:00
[CATEGORIES]
cs.CL
cs.LG
Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models
[AUTHORS]
Bram Willemsen, Gabriel Skantze
[ABSTRACT]
In this paper, we explore the use of a text-only, autoregressive language
modeling approach for the extraction of referring expressions from visually
grounded dialogue. More specifically, the aim is to investigate the extent to
which the linguistic context alone can inform the detection of mentions that
have a (visually perceivable) referent in the visual context of the
conversation. To this end, we adapt a pretrained large language model (LLM) to
perform a relatively coarse-grained annotation of mention spans in unfolding
conversations by demarcating mention span boundaries in text via next-token
prediction. Our findings indicate that even when using a moderately sized LLM,
relatively small datasets, and parameter-efficient fine-tuning, a text-only
approach can be effective, highlighting the relative importance of the
linguistic context for this task. Nevertheless, we argue that the task
represents an inherently multimodal problem and discuss limitations fundamental
to unimodal approaches.
[COMMENTS]
Accepted for publication at XLLM @ ACL 2025
[LINK]
http://arxiv.org/abs/2506.21294v1
[DATE]
2025-06-26 22:14:20+08:00
[CATEGORIES]
cs.CL
Small Encoders Can Rival Large Decoders in Detecting Groundedness
[AUTHORS]
Istabrak Abbes, Gabriele Prato, Quentin Fournier, Fernando Rodriguez, Alaa Boukhary, Adam Elwood, Sarath Chandar
[ABSTRACT]
Augmenting large language models (LLMs) with external context significantly
improves their performance in natural language processing (NLP) tasks. However,
LLMs struggle to answer queries reliably when the provided context lacks
information, often resorting to ungrounded speculation or internal knowledge.
Groundedness - generating responses strictly supported by the context - is
essential for ensuring factual consistency and trustworthiness. This study
focuses on detecting whether a given query is grounded in a document provided
in context before the costly answer generation by LLMs. Such a detection
mechanism can significantly reduce both inference time and resource
consumption. We show that lightweight, task-specific encoder models such as
RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy
comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in
groundedness detection while reducing inference latency by orders of magnitude.
The code is available at: https://github.com/chandarlab/Hallucinate-less
[LINK]
http://arxiv.org/abs/2506.21288v1
[DATE]
2025-06-26 22:09:41+08:00
[CATEGORIES]
cs.CL
cs.LG
Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning
[AUTHORS]
Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
[ABSTRACT]
While slow-thinking large language models (LLMs) exhibit reflection-like
reasoning, commonly referred to as the “aha moment”, their ability to generate
informative critiques and refine prior solutions remains limited. In this
paper, we introduce Double-Checker, a principled framework designed to enhance
the reasoning capabilities of slow-thinking LLMs by fostering explicit
self-critique and iterative refinement of their previous solutions. By
fine-tuning on our curated 1,730 self-critical instances, Double-Checker
empowers long-CoT LLMs to iteratively critique and refine their outputs during
inference until they evaluate their solutions as correct under self-generated
critiques. We validate the efficacy of Double-Checker across a comprehensive
suite of reasoning benchmarks, demonstrating that iterative self-critique
significantly enhances the reasoning capabilities of long-CoT LLMs. Notably,
our Double-Checker increases the pass@1 performance on challenging AIME
benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These
results highlight a promising direction for developing more trustworthy and
effective LLMs capable of structured self-critique.
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2506.21285v1
[DATE]
2025-06-26 22:05:45+08:00
[CATEGORIES]
cs.CL
Cat and Mouse – Can Fake Text Generation Outpace Detector Systems?
[AUTHORS]
Andrea McGlinchey, Peter J Barclay
[ABSTRACT]
Large language models can produce convincing “fake text” in domains such as
academic writing, product reviews, and political news. Many approaches have
been investigated for the detection of artificially generated text. While this
may seem to presage an endless “arms race”, we note that newer LLMs use ever
more parameters, training data, and energy, while relatively simple classifiers
demonstrate a good level of detection accuracy with modest resources. To
approach the question of whether the models’ ability to beat the detectors may
therefore reach a plateau, we examine the ability of statistical classifiers to
identify “fake text” in the style of classical detective fiction. Over a 0.5
version increase, we found that Gemini showed an increased ability to generate
deceptive text, while GPT did not. This suggests that reliable detection of
fake text may remain feasible even for ever-larger models, though new model
architectures may improve their deceptiveness.
[COMMENTS]
(Submitted for publication)
[LINK]
http://arxiv.org/abs/2506.21274v1
[DATE]
2025-06-26 21:58:43+08:00
[CATEGORIES]
cs.CL
A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns
[AUTHORS]
Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
[ABSTRACT]
With the development of large language models, they are widely used as agents
in various fields. A key component of agents is memory, which stores vital
information but is susceptible to jailbreak attacks. Existing research mainly
focuses on single-agent attacks and shared memory attacks. However, real-world
scenarios often involve independent memory. In this paper, we propose the
Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale,
multi-agent, multi-topology text-based attack evaluation framework. TMCHT
involves one attacker agent attempting to mislead an entire society of agents.
We identify two major challenges in multi-agent attacks: (1) Non-complete graph
structure, (2) Large-scale systems. We attribute these challenges to a
phenomenon we term toxicity disappearing. To address these issues, we propose
an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes
the retrieval suffix to make poisoned samples more easily retrieved and
optimizes the replication suffix to make poisoned samples have contagious
ability. We demonstrate the superiority of our approach in TMCHT, with 23.51%,
18.95%, and 52.93% improvements in line topology, star topology, and 100-agent
settings. We encourage community attention to the security of multi-agent systems.
[COMMENTS]
ACL 2025 Main
[LINK]
http://arxiv.org/abs/2410.16155v2
[DATE]
2025-06-26 21:45:10+08:00
[CATEGORIES]
cs.CL
Simulating Hard Attention Using Soft Attention
[AUTHORS]
Andy Yang, Lena Strobl, David Chiang, Dana Angluin
[ABSTRACT]
We study conditions under which transformers using soft attention can
simulate hard attention, that is, effectively focus all attention on a subset
of positions. First, we examine several subclasses of languages recognized by
hard-attention transformers, which can be defined in variants of linear
temporal logic. We demonstrate how soft-attention transformers can compute
formulas of these logics using unbounded positional embeddings or temperature
scaling. Second, we demonstrate how temperature scaling allows softmax
transformers to simulate general hard-attention transformers, using a
temperature that depends on the minimum gap between the maximum attention
scores and other attention scores.
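
The role of the gap in temperature scaling can be illustrated with a small Python sketch. It is not the paper's construction: the helper `temperature_for`, the tolerance `eps`, and the example scores are assumptions introduced here, and the sketch assumes a unique maximum score. It uses the elementary bound that, with gap γ between the top score and every other score, softmax at temperature τ places at most (n − 1)·exp(−γ/τ) of its mass off the maximum.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax of scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

def temperature_for(gap, n, eps):
    """Temperature at which (n - 1) * exp(-gap / tau) equals eps.

    Hypothetical helper: requires n >= 2 and 0 < eps < n - 1.
    """
    return gap / math.log((n - 1) / eps)

scores = [2.0, 1.5, 0.0]          # unique maximum at position 0
gap = 0.5                         # top score minus runner-up
soft = softmax(scores)            # temperature 1: mass is spread out
tau = temperature_for(gap, len(scores), eps=1e-3)
sharp = softmax(scores, tau)      # nearly all mass on position 0
```

A smaller gap forces a smaller temperature for the same tolerance, which matches the abstract's statement that the required temperature depends on the minimum gap.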
[COMMENTS]
19 pages
[LINK]
http://arxiv.org/abs/2412.09925v2
[DATE]
2025-06-26 21:41:24+08:00
[CATEGORIES]
cs.LG
cs.CL
Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents
[AUTHORS]
Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
[ABSTRACT]
As Multimodal Large Language Models (MLLMs) advance, multimodal agents show
promise in real-world tasks like web navigation and embodied intelligence.
However, due to a lack of external feedback, these agents
struggle with self-correction and generalization. A promising approach is to
use reward models as external feedback, but it is unclear how to select
reward models for agents. Thus, there is an urgent need to build a reward bench
targeted at agents. To address these challenges, we propose Agent-RewardBench,
a benchmark designed to evaluate reward modeling ability in MLLMs. The
benchmark is characterized by three key features: (1) Multiple dimensions and
real-world agent scenarios evaluation. It covers perception, planning, and
safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the
assessment of agent capabilities at the individual steps of a task, providing a
more granular view of performance during the planning process; and (3)
Appropriate difficulty and high quality. We carefully sample from 10 diverse
models, apply difficulty control to maintain task challenge, and use manual
verification to ensure the integrity of the data. Experiments demonstrate that even
state-of-the-art multimodal models show limited performance, highlighting the
need for specialized training in agent reward modeling. Code is available at
github.
[COMMENTS]
ACL 2025 Main
[LINK]
http://arxiv.org/abs/2506.21252v1
[DATE]
2025-06-26 21:36:12+08:00
[CATEGORIES]
cs.CL
Capturing Style in Author and Document Representation
[AUTHORS]
Enzo Terreau, Antoine Gourru, Julien Velcin
[ABSTRACT]
A wide range of Deep Natural Language Processing (NLP) models integrates
continuous and low dimensional representations of words and documents.
Surprisingly, very few models study representation learning for authors. These
representations can be used for many NLP tasks, such as author identification
and classification, or in recommendation systems. A strong limitation of
existing works is that they do not explicitly capture writing style, making
them hardly applicable to literary data. We therefore propose a new
architecture based on Variational Information Bottleneck (VIB) that learns
embeddings for both authors and documents with a stylistic constraint. Our
model fine-tunes a pre-trained document encoder. We stimulate the detection of
writing style by adding predefined stylistic features making the representation
axis interpretable with respect to writing style indicators. We evaluate our
method on three datasets: a literary corpus extracted from the Gutenberg
Project, the Blog Authorship Corpus and IMDb62, for which we show that it
matches or outperforms strong/recent baselines in authorship attribution while
capturing much more accurately the authors' stylistic aspects.
[LINK]
http://arxiv.org/abs/2407.13358v2
[DATE]
2025-06-26 21:21:53+08:00
[CATEGORIES]
cs.CL
cs.LG
Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval
[AUTHORS]
Yongchan Chun, Minhyuk Kim, Dongjun Kim, Chanjun Park, Heuiseok Lim
[ABSTRACT]
Automatic Term Extraction (ATE) identifies domain-specific expressions that
are crucial for downstream tasks such as machine translation and information
retrieval. Although large language models (LLMs) have significantly advanced
various NLP tasks, their potential for ATE has scarcely been examined. We
propose a retrieval-based prompting strategy that, in the few-shot setting,
selects demonstrations according to syntactic rather than semantic
similarity. This syntactic retrieval method is domain-agnostic and provides
more reliable guidance for capturing term boundaries. We evaluate the approach
in both in-domain and cross-domain settings, analyzing how lexical overlap
between the query sentence and its retrieved examples affects performance.
Experiments on three specialized ATE benchmarks show that syntactic retrieval
improves F1-score. These findings highlight the importance of syntactic cues
when adapting LLMs to terminology-extraction tasks.
[LINK]
http://arxiv.org/abs/2506.21222v1
[DATE]
2025-06-26 21:14:52+08:00
[CATEGORIES]
cs.CL
Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?
[AUTHORS]
Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, Bo Han
[ABSTRACT]
Causal reasoning capability is critical in advancing large language models
(LLMs) toward strong artificial intelligence. While versatile LLMs appear to
have demonstrated capabilities in understanding contextual causality and
providing responses that obey the laws of causality, it remains unclear whether
they perform genuine causal reasoning akin to humans. However, current evidence
indicates the contrary. Specifically, LLMs are only capable of performing
shallow (level-1) causal reasoning, primarily attributed to the causal
knowledge embedded in their parameters, but they lack the capacity for genuine
human-like (level-2) causal reasoning. To support this hypothesis,
methodologically, we delve into the autoregression mechanism of
transformer-based LLMs, revealing that it is not inherently causal.
Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024,
whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs
exhibit a significant performance drop on CausalProbe-2024 compared to earlier
benchmarks, indicating that they primarily engage in level-1 causal
reasoning. To bridge the gap towards level-2 causal reasoning, we draw
inspiration from the fact that human reasoning is usually facilitated by
general knowledge and intended goals. We propose G^2-Reasoner, a method that
incorporates general knowledge and goal-oriented prompts into LLMs’ causal
reasoning processes. Experiments demonstrate that G^2-Reasoner significantly
enhances LLMs’ causal reasoning capability, particularly in fresh and
counterfactual contexts. This work sheds light on a new path for LLMs to
advance towards genuine causal reasoning, going beyond level-1 and making
strides towards level-2.
[COMMENTS]
24 pages, accepted at NeurIPS 2024
[LINK]
http://arxiv.org/abs/2506.21215v1
[DATE]
2025-06-26 21:11:01+08:00
[CATEGORIES]
cs.CL
cs.LG
TAPS: Tool-Augmented Personalisation via Structured Tagging
[AUTHORS]
Ekaterina Taktasheva, Jeff Dalton
[ABSTRACT]
Recent advancements in tool-augmented large language models have enabled them
to interact with external tools, enhancing their ability to perform complex
user tasks. However, existing approaches overlook the role of personalisation
in guiding tool use. This work investigates how user preferences can be
effectively integrated into goal-oriented dialogue agents. Through extensive
analysis, we identify key weaknesses in the ability of LLMs to personalise tool
use. To this end, we introduce TAPS, a novel solution that enhances
personalised tool use by leveraging a structured tagging tool and an
uncertainty-based tool detector. TAPS significantly improves the ability of
LLMs to incorporate user preferences, achieving the new state-of-the-art for
open source models on the NLSI task.
[LINK]
http://arxiv.org/abs/2506.20409v2
[DATE]
2025-06-26 21:09:40+08:00
[CATEGORIES]
cs.CL
LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey
[AUTHORS]
Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Dongyuan Li, Renhe Jiang, Xue Liu, Philip S. Yu
[ABSTRACT]
Recent advances in large language models (LLMs) have sparked growing interest
in building fully autonomous agents. However, fully autonomous LLM-based agents
still face significant challenges, including limited reliability due to
hallucinations, difficulty in handling complex tasks, and substantial safety
and ethical risks, all of which limit their feasibility and trustworthiness in
real-world applications. To overcome these limitations, LLM-based human-agent
systems (LLM-HAS) incorporate human-provided information, feedback, or control
into the agent system to enhance system performance, reliability and safety.
These human-agent collaboration systems enable humans and LLM-based agents to
collaborate effectively by leveraging their complementary strengths. This paper
provides the first comprehensive and structured survey of LLM-HAS. It clarifies
fundamental concepts, systematically presents core components shaping these
systems, including environment & profiling, human feedback, interaction types,
orchestration and communication, explores emerging applications, and discusses
unique challenges and opportunities arising from human-AI collaboration. By
consolidating current knowledge and offering a structured overview, we aim to
foster further research and innovation in this rapidly evolving
interdisciplinary field. Paper lists and resources are available at
https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems.
[COMMENTS]
Paper lists and resources are available at
https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems
[LINK]
http://arxiv.org/abs/2505.00753v4
[DATE]
2025-06-26 20:53:30+08:00
[CATEGORIES]
cs.CL
cs.LG
Prompt-Guided Turn-Taking Prediction
[AUTHORS]
Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Divesh Lala, Keiko Ochi, Tatsuya Kawahara
[ABSTRACT]
Turn-taking prediction models are essential components in spoken dialogue
systems and conversational robots. Recent approaches leverage transformer-based
architectures to predict speech activity continuously and in real-time. In this
study, we propose a novel model that enables turn-taking prediction to be
dynamically controlled via textual prompts. This approach allows intuitive and
explicit control through instructions such as “faster” or “calmer,” adapting
dynamically to conversational partners and contexts. The proposed model builds
upon a transformer-based voice activity projection (VAP) model, incorporating
textual prompt embeddings into both channel-wise transformers and a
cross-channel transformer. We evaluated the feasibility of our approach using
over 950 hours of human-human spoken dialogue data. Since textual prompt data
for the proposed approach was not available in existing datasets, we utilized a
large language model (LLM) to generate synthetic prompt sentences. Experimental
results demonstrated that the proposed model improved prediction accuracy and
effectively varied turn-taking timing behaviors according to the textual
prompts.
[COMMENTS]
This paper has been accepted for presentation at SIGdial Meeting on
Discourse and Dialogue 2025 (SIGDIAL 2025) and represents the author’s
version of the work
[LINK]
http://arxiv.org/abs/2506.21191v1
[DATE]
2025-06-26 20:49:07+08:00
[CATEGORIES]
cs.CL
Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks
[AUTHORS]
Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, Kenneth Enevoldsen
[ABSTRACT]
The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation
platform for text embedding models. While previous work has established the
core benchmark methodology, this paper focuses on the engineering aspects that
ensure MTEB’s continued reproducibility and extensibility. We present our
approach to maintaining robust continuous integration pipelines that validate
dataset integrity, automate test execution, and assess benchmark results’
generalizability. We detail the design choices that collectively enhance
reproducibility and usability. Furthermore, we discuss our strategies for
handling community contributions and extending the benchmark with new tasks and
datasets. These engineering practices have been instrumental in scaling MTEB to
become more comprehensive while maintaining quality and, ultimately, relevance
to the field. Our experiences offer valuable insights for benchmark maintainers
facing similar challenges in ensuring reproducibility and usability in machine
learning evaluation frameworks. The MTEB repository is available at:
https://github.com/embeddings-benchmark/mteb
[LINK]
http://arxiv.org/abs/2506.21182v1
[DATE]
2025-06-26 20:40:48+08:00
[CATEGORIES]
cs.CL
Compressed and Smooth Latent Space for Text Diffusion Modeling
[AUTHORS]
Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, Dmitry Vetrov
[ABSTRACT]
Autoregressive language models dominate modern text generation, yet their
sequential nature introduces fundamental limitations: decoding is slow, and
maintaining global coherence remains challenging. Diffusion models offer a
promising alternative by enabling parallel generation and flexible control;
however, their application to text generation is hindered by the high
dimensionality of token-level representations. We introduce Cosmos, a novel
approach to text generation that operates entirely in a compressed, smooth
latent space tailored specifically for diffusion. This space is learned using
an autoencoder trained simultaneously for token-level reconstruction and
alignment with frozen activations from a pretrained language encoder, providing
robust semantic grounding and enabling effective perturbation-based
augmentations. Empirically, we demonstrate that text representations can be
compressed by $8\times$ while maintaining generation quality comparable to
token-level diffusion models. Furthermore, increasing the latent sequence
length allows Cosmos to surpass both diffusion-based and autoregressive
baselines. We evaluate Cosmos on four diverse generative tasks including story
generation, question generation, summarization, and detoxification and compare
it with various generative paradigms. Cosmos achieves comparable or superior
generation quality while offering more than $2\times$ faster inference.
[LINK]
http://arxiv.org/abs/2506.21170v1
[DATE]
2025-06-26 20:05:13+08:00
[CATEGORIES]
cs.CL
CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models
[AUTHORS]
Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng
[ABSTRACT]
Ensuring that Large Language Models (LLMs) align with mainstream human values
and ethical norms is crucial for the safe and sustainable development of AI.
Current value evaluation and alignment are constrained by Western cultural bias
and incomplete domestic frameworks reliant on non-native rules; furthermore,
the lack of scalable, rule-driven scenario generation methods makes evaluations
costly and inadequate across diverse cultural contexts. To address these
challenges, we propose a hierarchical value framework grounded in core Chinese
values, encompassing three main dimensions, 12 core values, and 50 derived
values. Based on this framework, we construct a large-scale Chinese Values
Corpus (CVC) containing over 250,000 value rules enhanced and expanded through
human annotation. Experimental results show that CVC-guided scenarios
outperform direct generation ones in value boundaries and content diversity. In
the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven
mainstream LLMs preferred CVC-generated options in over 70.5% of cases, while
five Chinese human annotators showed an 87.5% alignment with CVC, confirming
its universality, cultural relevance, and strong alignment with Chinese values.
Additionally, we construct 400,000 rule-based moral dilemma scenarios that
objectively capture nuanced distinctions in conflicting value prioritization
across 17 LLMs. Our work establishes a culturally-adaptive benchmarking
framework for comprehensive value evaluation and alignment, representing
Chinese characteristics. All data are available at
https://huggingface.co/datasets/Beijing-AISI/CVC, and the code is available at
https://github.com/Beijing-AISI/CVC.
[LINK]
http://arxiv.org/abs/2506.01495v4
[DATE]
2025-06-26 19:34:33+08:00
[CATEGORIES]
cs.CL
Do Large Language Models Advocate for Inferentialism?
[AUTHORS]
Yuzuki Arai, Sho Tsugawa
[ABSTRACT]
The emergence of large language models (LLMs) such as ChatGPT and Claude
presents new challenges for philosophy of language, particularly regarding the
nature of linguistic meaning and representation. While LLMs have traditionally
been understood through distributional semantics, this paper explores Robert
Brandom’s inferential semantics as an alternative foundational framework for
understanding these systems. We examine how key features of inferential
semantics – including its anti-representationalist stance, logical
expressivism, and quasi-compositional approach – align with the architectural
and functional characteristics of Transformer-based LLMs. Through analysis of
the ISA (Inference, Substitution, Anaphora) approach, we demonstrate that LLMs
exhibit fundamentally anti-representationalist properties in their processing
of language. We further develop a consensus theory of truth appropriate for
LLMs, grounded in their interactive and normative dimensions through mechanisms
like RLHF. While acknowledging significant tensions between inferentialism’s
philosophical commitments and LLMs’ sub-symbolic processing, this paper argues
that inferential semantics provides valuable insights into how LLMs generate
meaning without reference to external world representations. Our analysis
suggests that LLMs may challenge traditional assumptions in philosophy of
language, including strict compositionality and semantic externalism, though
further empirical investigation is needed to fully substantiate these
theoretical claims.
[LINK]
http://arxiv.org/abs/2412.14501v2
[DATE]
2025-06-26 19:03:13+08:00
[CATEGORIES]
cs.CL
Learning Evaluation Models from Large Language Models for Sequence Generation
[AUTHORS]
Chenglong Wang, Hang Zhou, Kaiyan Chang, Tongran Liu, Chunliang Zhang, Quan Du, Tong Xiao, Yue Zhang, Jingbo Zhu
[ABSTRACT]
Automatic evaluation of sequence generation, traditionally reliant on metrics
like BLEU and ROUGE, often fails to capture the semantic accuracy of generated
text sequences due to their emphasis on n-gram overlap. A promising solution to
this problem is to develop model-based metrics, such as BLEURT and COMET.
However, these approaches are typically hindered by the scarcity of labeled
evaluation data, which is necessary to train the evaluation models. In this
work, we address this challenge by proposing the Customized Sequence
Evaluation Metric (CSEM), a three-stage evaluation model training method that
utilizes large language models to generate labeled data for model-based metric
development, thereby eliminating the need for human-labeled data. Additionally,
we expand the scope of CSEM to support various evaluation types, including
single-aspect, multi-aspect, reference-free, and reference-based evaluations,
enabling the customization of metrics to suit diverse real-world scenarios.
Experimental results on the SummEval benchmark demonstrate that CSEM can
effectively train an evaluation model without human-labeled data. Further
experiments in reinforcement learning and reranking show that metrics developed
through CSEM outperform traditional evaluation metrics, leading to substantial
improvements in sequence quality as evaluated by both commonly used metrics and
ChatGPT.
[COMMENTS]
Accepted by TASLP 2025
[LINK]
http://arxiv.org/abs/2308.04386v3
[DATE]
2025-06-26 18:00:23+08:00
[CATEGORIES]
cs.CL
Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models
[AUTHORS]
Xiaoshuang Ji, Zhendong Zhao, Xiaojun Chen, Xin Zhao, Zeyao Liu
[ABSTRACT]
Fine-tuning is a promising technique for leveraging Transformer-based
language models in downstream tasks. As model sizes continue to grow, updating
all model parameters becomes increasingly costly. Parameter-efficient
fine-tuning methods effectively address this issue by selectively updating a
small subset of parameters. However, fine-tuning and most existing
parameter-efficient fine-tuning methods require updating the same number of
parameters as the initial size, ignoring the unequal contribution across
Transformer blocks and leading to extremely inefficient allocation of computing
resources. In this paper, we propose Progtuning, a novel fine-tuning
framework combined with progressive learning for Transformer-based language
models. Specifically, Progtuning progressively reduces the number of updated
transformer blocks based on the contribution. Remarkably, Progtuning optimizes
resource allocation and reduces the number of updated parameters by
approximately 25%, while still maintaining competitive performance. It
also exhibits high adaptability with parameter-efficient fine-tuning methods,
demonstrating excellent performance across various adaptation scenarios.
[COMMENTS]
Accepted by ICONIP 2024
[LINK]
http://arxiv.org/abs/2506.21119v1
[DATE]
2025-06-26 17:37:15+08:00
[CATEGORIES]
cs.CL
Learning to Skip the Middle Layers of Transformers
[AUTHORS]
Tim Lawson, Laurence Aitchison
[ABSTRACT]
Conditional computation is a popular strategy to make Transformers more
efficient. Existing methods often target individual modules (e.g.,
mixture-of-experts layers) or skip layers independently of one another.
However, interpretability research has demonstrated that the middle layers of
Transformers exhibit greater redundancy, and that early layers aggregate
information into token positions. Guided by these insights, we propose a novel
architecture that dynamically skips a variable number of layers from the middle
outward. In particular, a learned gating mechanism determines whether to bypass
a symmetric span of central blocks based on the input, and a gated attention
mechanism prevents subsequent tokens from attending to skipped token positions.
Residual norms are controlled with a ‘sandwich’ or ‘perilayernorm’ scheme and
gate sparsity with an adaptive regularization loss. We had aimed to reduce
compute requirements for ‘simpler’ tokens and potentially foster an emergent
multi-level representational hierarchy but, at the scales investigated, our
approach does not achieve improvements in the trade-off between validation
cross-entropy and estimated FLOPs compared to dense baselines with fewer
layers. We release our code at https://github.com/tim-lawson/skip-middle.
[COMMENTS]
11 pages, 2 figures
[LINK]
http://arxiv.org/abs/2506.21103v1
[DATE]
2025-06-26 17:01:19+08:00
[CATEGORIES]
cs.LG
cs.CL
ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry
[AUTHORS]
Qinwen Chen, Wenbiao Tao, Zhiwei Zhu, Mingfan Xi, Liangzhong Guo, Yuan Wang, Wei Wang, Yunshi Lan
[ABSTRACT]
Community Question Answering (CQA) platforms can be regarded as important
community knowledge bases, but effectively leveraging historical
interactions and domain knowledge in real-time remains a challenge. Existing
methods often underutilize external knowledge, fail to incorporate dynamic
historical QA context, or lack memory mechanisms suited for industrial
deployment. We propose ComRAG, a retrieval-augmented generation framework for
real-time industrial CQA that integrates static knowledge with dynamic
historical QA pairs via a centroid-based memory mechanism designed for
retrieval, generation, and efficient storage. Evaluated on three industrial CQA
datasets, ComRAG consistently outperforms all baselines, achieving up to 25.9%
improvement in vector similarity, reducing latency by 8.7% to 23.3%, and
lowering chunk growth from 20.23% to 2.06% over iterations.
[COMMENTS]
7 pages, 4 figures. Accepted at ACL 2025 Industry Track
[LINK]
http://arxiv.org/abs/2506.21098v1
[DATE]
2025-06-26 16:48:16+08:00
[CATEGORIES]
cs.CL
DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning
[AUTHORS]
Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji
[ABSTRACT]
Previous multimodal sentence representation learning methods have achieved
impressive performance. However, most approaches focus on aligning images and
text at a coarse level, facing two critical challenges: cross-modal misalignment
bias and intra-modal semantic divergence, which significantly degrade sentence
representation quality. To address these challenges, we propose DALR
(Dual-level Alignment Learning for Multimodal Sentence Representation). For
cross-modal alignment, we propose a consistency learning module that softens
negative samples and utilizes semantic similarity from an auxiliary task to
achieve fine-grained cross-modal alignment. Additionally, we contend that
sentence relationships go beyond binary positive-negative labels, exhibiting a
more intricate ranking structure. To better capture these relationships and
enhance representation quality, we integrate ranking distillation with global
intra-modal alignment learning. Comprehensive experiments on semantic textual
similarity (STS) and transfer (TR) tasks validate the effectiveness of our
approach, consistently demonstrating its superiority over state-of-the-art
baselines.
[COMMENTS]
Accepted by ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2506.21096v1
[DATE]
2025-06-26 16:45:14+08:00
[CATEGORIES]
cs.CL
Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph
[AUTHORS]
Jingwei Wang, Zai Zhang, Hao Qian, Chunjing Gan, Binbin Hu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Bin Shi, Bo Dong
[ABSTRACT]
Teaching large language models (LLMs) to use tools is crucial for improving
their problem-solving abilities and expanding their applications. However,
effectively using tools is challenging because it requires a deep understanding
of tool functionalities and user intentions. Previous methods relied mainly on
LLMs to generate instruction data, but the quality of these data was often
insufficient. In this paper, we propose a new method that uses knowledge graphs
to generate high-quality instruction data for LLMs. Knowledge graphs are
manually curated datasets rich in semantic information. We begin by extracting
various query pathways from a given knowledge graph, which are transformed into
a broad spectrum of user queries. We then translate the relationships between
entities into actionable tools and parse the pathways of each query into
detailed solution steps, thereby creating high-quality instruction data. Our
experiments show that fine-tuning on just a small sample of this synthetic data
can significantly improve the tool utilization and overall capabilities of
LLMs.
[COMMENTS]
20 pages, 12 figures
[LINK]
http://arxiv.org/abs/2506.21071v1
[DATE]
2025-06-26 15:45:15+08:00
[CATEGORIES]
cs.LG
cs.CL
Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs
[AUTHORS]
Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, Xiang Wang
[ABSTRACT]
Large language models have demonstrated impressive reasoning capabilities but
are inherently limited by their knowledge reservoir. Retrieval-augmented
reasoning mitigates this limitation by allowing LLMs to query external
resources, but existing methods often retrieve irrelevant or noisy information,
hindering accurate reasoning. In this paper, we propose AutoRefine, a
reinforcement learning post-training framework that adopts a new
“search-and-refine-during-think” paradigm. AutoRefine introduces explicit
knowledge refinement steps between successive search calls, enabling the model
to iteratively filter, distill, and organize evidence before generating an
answer. Furthermore, we incorporate tailored retrieval-specific rewards
alongside answer correctness rewards using group relative policy optimization.
Experiments on single-hop and multi-hop QA benchmarks demonstrate that
AutoRefine significantly outperforms existing approaches, particularly in
complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine
issues frequent, higher-quality searches and synthesizes evidence effectively.
[LINK]
http://arxiv.org/abs/2505.11277v3
[DATE]
2025-06-26 14:52:37+08:00
[CATEGORIES]
cs.CL
A Semi-supervised Scalable Unified Framework for E-commerce Query Classification
[AUTHORS]
Chunyuan Yuan, Chong Zhang, Zheng Fang, Ming Pang, Xue Jiang, Changping Peng, Zhangang Lin, Ching Law
[ABSTRACT]
Query classification, including multiple subtasks such as intent and category
prediction, is vital to e-commerce applications. E-commerce queries are usually
short and lack context, and the information between labels cannot be used,
resulting in insufficient prior information for modeling. Most existing
industrial query classification methods rely on users’ posterior click behavior
to construct training samples, resulting in a Matthew vicious cycle.
Furthermore, the subtasks of query classification lack a unified framework,
leading to low efficiency for algorithm optimization.
In this paper, we propose a novel Semi-supervised Scalable Unified Framework
(SSUF), containing multiple enhanced modules to unify the query classification
tasks. The knowledge-enhanced module uses world knowledge to enhance query
representations and solve the problem of insufficient query information. The
label-enhanced module uses label semantics and semi-supervised signals to
reduce the dependence on posterior labels. The structure-enhanced module
enhances the label representation based on the complex label relations. Each
module is highly pluggable, and input features can be added or removed as
needed according to each subtask. We conduct extensive offline and online A/B
experiments, and the results show that SSUF significantly outperforms the
state-of-the-art models.
[COMMENTS]
Accepted by ACL 2025
[LINK]
http://arxiv.org/abs/2506.21049v1
[DATE]
2025-06-26 14:52:33+08:00
[CATEGORIES]
cs.CL
MockLLM: A Multi-Agent Behavior Collaboration Framework for Online Job Seeking and Recruiting
[AUTHORS]
Hongda Sun, Hongzhan Lin, Haiyu Yan, Yang Song, Xin Gao, Rui Yan
[ABSTRACT]
Online recruitment platforms have reshaped job-seeking and recruiting
processes, driving increased demand for applications that enhance person-job
matching. Traditional methods generally rely on analyzing textual data from
resumes and job descriptions, limiting the dynamic, interactive aspects crucial
to effective recruitment. Recent advances in Large Language Models (LLMs) have
revealed remarkable potential in simulating adaptive, role-based dialogues,
making them well-suited for recruitment scenarios. In this paper, we propose
MockLLM, a novel framework to generate and evaluate mock interview
interactions. The system consists of two key components: mock interview
generation and two-sided evaluation in handshake protocol. By simulating both
interviewer and candidate roles, MockLLM enables consistent and collaborative
interactions for real-time and two-sided matching. To further improve the
matching quality, MockLLM incorporates reflection memory generation and
dynamic strategy modification, refining behaviors based on previous experience.
We evaluate MockLLM on real-world data from Boss Zhipin, a major Chinese recruitment
platform. The experimental results indicate that MockLLM outperforms existing
methods in matching accuracy, scalability, and adaptability across job domains,
highlighting its potential to advance candidate assessment and online
recruitment.
[COMMENTS]
Accepted by KDD 2025 Research Track
[LINK]
http://arxiv.org/abs/2405.18113v2
[DATE]
2025-06-26 14:33:55+08:00
[CATEGORIES]
cs.CL
SceneGenAgent: Precise Industrial Scene Generation with Coding Agent
[AUTHORS]
Xiao Xia, Dan Zhang, Zibo Liao, Zhenyu Hou, Tianrui Sun, Jing Li, Ling Fu, Yuxiao Dong
[ABSTRACT]
The modeling of industrial scenes is essential for simulations in industrial
manufacturing. While large language models (LLMs) have shown significant
progress in generating general 3D scenes from textual descriptions, generating
industrial scenes with LLMs poses a unique challenge due to their demand for
precise measurements and positioning, requiring complex planning over spatial
arrangement. To address this challenge, we introduce SceneGenAgent, an
LLM-based agent for generating industrial scenes through C# code. SceneGenAgent
ensures precise layout planning through a structured and calculable format,
layout verification, and iterative refinement to meet the quantitative
requirements of industrial scenarios. Experiment results demonstrate that LLMs
powered by SceneGenAgent exceed their original performance, reaching up to
81.0% success rate in real-world industrial scene generation tasks and
effectively meeting most scene generation requirements. To further enhance
accessibility, we construct SceneInstruct, a dataset designed for fine-tuning
open-source LLMs to integrate into SceneGenAgent. Experiments show that
fine-tuning open-source LLMs on SceneInstruct yields significant performance
improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our
code and data are available at https://github.com/THUDM/SceneGenAgent.
[COMMENTS]
Accepted to ACL 2025
[LINK]
http://arxiv.org/abs/2410.21909v3
[DATE]
2025-06-26 14:24:08+08:00
[CATEGORIES]
cs.CL
cs.LG
Large Language Models Acing Chartered Accountancy
[AUTHORS]
Jatin Gupta, Akhil Sharma, Saransh Singhania, Mohammad Adnan, Sakshi Deo, Ali Imam Abidi, Keshav Gupta
[ABSTRACT]
Advanced intelligent systems, particularly Large Language Models (LLMs), are
significantly reshaping financial practices through advancements in Natural
Language Processing (NLP). However, the extent to which these models
effectively capture and apply domain-specific financial knowledge remains
uncertain. Addressing a critical gap in the expansive Indian financial context,
this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically
designed to evaluate the financial, legal, and quantitative reasoning
capabilities of LLMs. CA-Ben comprises structured question-answer datasets
derived from the rigorous examinations conducted by the Institute of Chartered
Accountants of India (ICAI), spanning foundational, intermediate, and advanced
CA curriculum stages. Six prominent LLMs, namely GPT-4o, LLAMA 3.3 70B, LLAMA 3.1
405B, MISTRAL Large, Claude 3.5 Sonnet, and Microsoft Phi 4, were evaluated
using standardized protocols. Results indicate variations in performance, with
Claude 3.5 Sonnet and GPT-4o outperforming others, especially in conceptual and
legal reasoning. Notable challenges emerged in numerical computations and legal
interpretations. The findings emphasize the strengths and limitations of
current LLMs, suggesting future improvements through hybrid reasoning and
retrieval-augmented generation methods, particularly for quantitative analysis
and accurate legal interpretation.
[COMMENTS]
Accepted for publication at MoStart 2025: International Conference on
Digital Transformation in Education and Applications of Artificial
Intelligence, Bosnia and Herzegovina, 2025
[LINK]
http://arxiv.org/abs/2506.21031v1
[DATE]
2025-06-26 14:10:37+08:00
[CATEGORIES]
cs.CL
SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control
[AUTHORS]
Adithya Chittem, Aishna Shrivastava, Sai Tarun Pendela, Jagat Sesh Challa, Dhruv Kumar
[ABSTRACT]
Large language models (LLMs) have gained significant traction across a wide
range of fields in recent years. There is also a growing expectation for them
to display human-like personalities during interactions. To meet this
expectation, numerous studies have proposed methods for modelling LLM
personalities through psychometric evaluations. However, most existing models
face two major limitations: they rely on the Big Five (OCEAN) framework, which
only provides coarse personality dimensions, and they lack mechanisms for
controlling trait intensity. In this paper, we address this gap by extending
the Machine Personality Inventory (MPI), which originally used the Big Five
model, to incorporate the 16 Personality Factor (16PF) model, allowing
expressive control over sixteen distinct traits. We also developed a structured
framework known as Specific Attribute Control (SAC) for evaluating and
dynamically inducing trait intensity in LLMs. Our method introduces
adjective-based semantic anchoring to guide trait intensity expression and
leverages behavioural questions across five intensity factors:
Frequency, Depth, Threshold, Effort, and
Willingness. Through experimentation, we find that modelling intensity
as a continuous spectrum yields substantially more consistent and controllable
personality expression compared to binary trait toggling. Moreover, we observe
that changes in target trait intensity systematically influence closely related
traits in psychologically coherent directions, suggesting that LLMs internalize
multi-dimensional personality structures rather than treating traits in
isolation. Our work opens new pathways for controlled and nuanced human-machine
interactions in domains such as healthcare, education, and interviewing
processes, bringing us one step closer to truly human-like social machines.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2506.20993v1
[DATE]
2025-06-26 12:12:15+08:00
[CATEGORIES]
cs.CL
SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes
[AUTHORS]
Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
[ABSTRACT]
Fine-tuning vision language models (VLMs) has achieved remarkable performance
across various downstream tasks; yet, it requires access to model gradients
through backpropagation (BP), making it unsuitable for memory-constrained,
inference-only edge devices. To address this limitation, previous work has
explored various BP-free fine-tuning methods. However, these approaches often
rely on high-variance evolutionary strategies (ES) or zeroth-order (ZO)
optimization, and often fail to achieve satisfactory performance. In this
paper, we propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO)
approach, specifically designed to enhance the performance of ZO VLM
fine-tuning via a sharpness-aware warm-up training. SharpZO features a
two-stage optimization process: a sharpness-aware ES stage that globally
explores and smooths the loss landscape to construct a strong initialization,
followed by a fine-grained local search via sparse ZO optimization. The entire
optimization relies solely on forward passes. Detailed theoretical analysis and
extensive experiments on CLIP models demonstrate that SharpZO significantly
improves accuracy and convergence speed, achieving up to 7% average gain over
state-of-the-art forward-only methods.
[LINK]
http://arxiv.org/abs/2506.20990v1
[DATE]
2025-06-26 12:07:14+08:00
[CATEGORIES]
cs.LG
cs.CL
SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization
[AUTHORS]
Dhruv Gupta, Gayathri Ganesh Lakshmy, Yiqing Xie
[ABSTRACT]
Retrieval-Augmented Code Generation (RACG) is a critical technique for
enhancing code generation by retrieving relevant information. In this work, we
conduct an in-depth analysis of code retrieval by systematically masking
specific features while preserving code functionality. Our discoveries include:
(1) although trained on code, current retrievers heavily rely on surface-level
textual features (e.g., docstrings, identifier names), and (2) they exhibit a
strong bias towards well-documented code, even if the documentation is
irrelevant. Based on our discoveries, we propose SACL, a framework that
enriches textual information and reduces bias by augmenting code or structural
knowledge with semantic information. Extensive experiments show that SACL
substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on
HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation
performance (e.g., by 4.88% Pass@1 on HumanEval).
[LINK]
http://arxiv.org/abs/2506.20081v2
[DATE]
2025-06-26 12:06:50+08:00
[CATEGORIES]
cs.CL
Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models
[AUTHORS]
Alireza Salemi, Hamed Zamani
[ABSTRACT]
Despite the substantial impact of personalization on various search, recommendation, and
question answering tasks, privacy-preserving methods for personalizing large
language models (LLMs) have received relatively limited exploration. The
primary approach in this area is retrieval-augmented generation (RAG),
which generates personalized outputs by enriching the input prompt with
information retrieved from the user’s personal data. This paper studies an
orthogonal approach to RAG that involves learning user-dependent LLM parameters
through parameter-efficient fine-tuning (PEFT). This paper presents the first
systematic study of PEFT for LLM personalization and provides
an extensive comparison between RAG- and PEFT-based solutions across a broad
set of seven diverse datasets from the LaMP benchmark. Our results demonstrate
that, on average, RAG- and PEFT-based personalization methods yield 14.92%
and 1.07% improvements over non-personalized LLMs, respectively. When combining
RAG with PEFT, we observe a further improvement of 15.98%, highlighting the
effectiveness of their integration in enhancing personalized text generation.
Additionally, we identify a positive correlation between the amount of user
data available and the effectiveness of PEFT. This finding suggests that RAG is
particularly beneficial for cold-start users – users with limited personal
data – while PEFT performs better when more user-specific data is available.
[LINK]
http://arxiv.org/abs/2409.09510v2
[DATE]
2025-06-26 11:19:56+08:00
[CATEGORIES]
cs.CL
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
[AUTHORS]
Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong
[ABSTRACT]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework
aimed at improving the efficiency of inference in large language models (LLMs).
RSD synergistically combines a lightweight draft model with a more powerful
target model, incorporating a controlled bias to prioritize high-reward
outputs, in contrast to existing speculative decoding methods that enforce
strict unbiasedness. RSD employs a process reward model to evaluate
intermediate decoding steps and dynamically decide whether to invoke the target
model, optimizing the trade-off between computational cost and output quality.
We theoretically demonstrate that a threshold-based mixture strategy achieves
an optimal balance between resource utilization and performance. Extensive
evaluations on challenging reasoning benchmarks, including Olympiad-level
tasks, show that RSD delivers significant efficiency gains against decoding
with the target model only (up to 4.4x fewer FLOPs), while achieving
significantly better accuracy than parallel decoding methods on average (up to
+3.5). These results highlight RSD as a robust and cost-effective approach for
deploying LLMs in resource-intensive scenarios. The code is available at
https://github.com/BaohaoLiao/RSD.
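
The threshold-based decision described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callables `draft_step`, `target_step`, and `reward`, the `threshold` value, and the `<eos>` stop marker are all hypothetical stand-ins for a draft model, a target model, and a process reward model.

```python
from typing import Callable, List, Tuple

Step = str  # one intermediate decoding step, e.g. a reasoning sentence


def reward_guided_decode(
    draft_step: Callable[[List[Step]], Step],     # hypothetical cheap draft model
    target_step: Callable[[List[Step]], Step],    # hypothetical expensive target model
    reward: Callable[[List[Step], Step], float],  # hypothetical process reward model
    threshold: float,
    max_steps: int,
    stop_token: Step = "<eos>",
) -> Tuple[List[Step], int]:
    """Return the accepted steps and the number of target-model calls."""
    steps: List[Step] = []
    target_calls = 0
    for _ in range(max_steps):
        # Always try the cheap draft model first.
        candidate = draft_step(steps)
        # Fall back to the target model only when the draft step scores
        # below the reward threshold.
        if reward(steps, candidate) < threshold:
            candidate = target_step(steps)
            target_calls += 1
        steps.append(candidate)
        if candidate == stop_token:
            break
    return steps, target_calls
```

Under these assumptions, raising `threshold` sends more steps to the target model (higher cost, typically higher quality), while lowering it keeps more draft steps.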
[COMMENTS]
17 pages
[LINK]
http://arxiv.org/abs/2501.19324v3
[DATE]
2025-06-26 11:14:46+08:00
[CATEGORIES]
cs.CL
Learning to Rank for Multiple Retrieval-Augmented Models through Iterative Utility Maximization
[AUTHORS]
Alireza Salemi, Hamed Zamani
[ABSTRACT]
This paper investigates the design of a unified search engine to serve
multiple retrieval-augmented generation (RAG) agents, each with a distinct
task, backbone large language model (LLM), and RAG strategy. We introduce an
iterative approach where the search engine generates retrieval results for the
RAG agents and gathers feedback on the quality of the retrieved documents
during an offline phase. This feedback is then used to iteratively optimize the
search engine using an expectation-maximization algorithm, with the goal of
maximizing each agent’s utility function. Additionally, we adapt this to an
online setting, allowing the search engine to refine its behavior based on
real-time feedback from individual agents to better serve results to each of
them. Experiments on datasets from the Knowledge-Intensive Language Tasks
(KILT) benchmark demonstrate that our approach, on average, significantly
outperforms baselines across 18 RAG models. We demonstrate that our method
effectively “personalizes” the retrieval for each RAG agent based on the
collected feedback. Finally, we provide a comprehensive ablation study to
explore various aspects of our method.
[LINK]
http://arxiv.org/abs/2410.09942v2
[DATE]
2025-06-26 11:06:17+08:00
[CATEGORIES]
cs.CL
PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
[AUTHORS]
Feng Ni, Kui Huang, Yao Lu, Wenyu Lv, Guanzhong Wang, Zeyu Chen, Yi Liu
[ABSTRACT]
With the rapid advancement of digitalization, various document images are
being applied more extensively in production and daily life, and there is an
increasingly urgent need for fast and accurate parsing of the content in
document images. Therefore, this report presents PP-DocBee, a novel multimodal
large language model designed for end-to-end document image understanding.
First, we develop a data synthesis strategy tailored to document scenarios in
which we build a diverse dataset to improve the model generalization. Then, we
apply a few training techniques, including dynamic proportional sampling, data
preprocessing, and OCR postprocessing strategies. Extensive evaluations
demonstrate the superior performance of PP-DocBee, achieving state-of-the-art
results on English document understanding benchmarks and even outperforming
existing open source and commercial models in Chinese document understanding.
The source code and pre-trained models are publicly available at
https://github.com/PaddlePaddle/PaddleMIX.
[LINK]
http://arxiv.org/abs/2503.04065v3
[DATE]
2025-06-26 09:11:25+08:00
[CATEGORIES]
cs.CL
KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
[AUTHORS]
Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang
[ABSTRACT]
In this paper, we propose KaLM-Embedding-V2, a versatile and compact
embedding model, which achieves impressive performance in general-purpose text
embedding tasks by leveraging superior training techniques and data. Our key
innovations include: (1) To better align the architecture with representation
learning, we remove the causal attention mask and adopt a fully bidirectional
transformer with simple yet effective mean-pooling to produce fixed-length
embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on
large-scale weakly supervised open-source corpora; (ii) fine-tuning on
high-quality retrieval and non-retrieval datasets; and (iii) model-soup
parameter averaging for robust generalization. Besides, we introduce a
focal-style reweighting mechanism that concentrates learning on difficult
samples and an online hard-negative mixing strategy to continuously enrich hard
negatives without expensive offline mining; (3) We collect over 20 categories
of data for pre-training and 100 categories of data for fine-tuning, to boost
both the performance and generalization of the embedding model. Extensive
evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English
show that our model significantly outperforms others of comparable size, and
competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new
standard for a versatile and compact embedding model with less than 1B
parameters.
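
Mean-pooling into a fixed-length embedding, as mentioned in item (1), can be illustrated with a small sketch. It assumes per-token vectors and a padding mask as inputs; the function name `mean_pool` and the masking convention are illustrative choices, not details taken from the paper.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    # The output length equals the hidden size, independent of sequence length.
    return [component / count for component in total]
```

The output has one value per hidden dimension regardless of how many tokens the input contains, which is what makes the embedding fixed-length.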
[COMMENTS]
Technical Report; 26 pages 12 tables 1 figure. arXiv admin note:
substantial text overlap with arXiv:2501.01028
[LINK]
http://arxiv.org/abs/2506.20923v1
[DATE]
2025-06-26 09:09:44+08:00
[CATEGORIES]
cs.CL
Optimising Language Models for Downstream Tasks: A Post-Training Perspective
[AUTHORS]
Zhengyan Shi
[ABSTRACT]
Language models (LMs) have demonstrated remarkable capabilities in NLP, yet
adapting them efficiently and robustly to specific tasks remains challenging.
As their scale and complexity grow, fine-tuning LMs on labelled data often
underutilizes available unlabelled data, leads to overfitting on small
task-specific sets, and imposes significant computational costs. These
limitations hamper their application to the open-ended landscape of real-world
language tasks.
This thesis proposes a series of methods to better adapt LMs to downstream
applications. First, we explore strategies for extracting task-relevant
knowledge from unlabelled data, introducing a novel continued pre-training
technique that outperforms state-of-the-art semi-supervised approaches. Next,
we present a parameter-efficient fine-tuning method that substantially reduces
memory and compute costs while maintaining competitive performance. We also
introduce improved supervised fine-tuning methods that enable LMs to better
follow instructions, especially when labelled data is scarce, enhancing their
performance across a range of NLP tasks, including open-ended generation.
Finally, we develop new evaluation methods and benchmarks, such as multi-hop
spatial reasoning tasks, to assess LM capabilities and adaptation more
comprehensively.
Through extensive empirical studies across diverse NLP tasks, our results
demonstrate that these approaches substantially improve LM robustness,
efficiency, and generalization, making them more adaptable to a broad range of
applications. These advances mark a significant step towards more robust and
efficient LMs, bringing us closer to the goal of artificial general
intelligence.
[COMMENTS]
PhD Thesis
[LINK]
http://arxiv.org/abs/2506.20917v1
[DATE]
2025-06-26 08:49:35+08:00
[CATEGORIES]
cs.CL
A3: an Analytical Low-Rank Approximation Framework for Attention
[AUTHORS]
Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, George A. Constantinides, Wayne Luk, Yiren Zhao
[ABSTRACT]
Large language models have demonstrated remarkable performance; however,
their massive parameter counts make deployment highly expensive. Low-rank
approximation offers a promising compression solution, yet existing approaches
have two main limitations: (1) They focus on minimizing the output error of
individual linear layers, without considering the architectural characteristics
of Transformers, and (2) they decompose a large weight matrix into two small
low-rank matrices. Consequently, these methods often fall short compared to
other compression techniques like pruning and quantization, and introduce
runtime overhead such as the extra GEMM kernel launches for decomposed small
matrices. To address these limitations, we propose A3, a
post-training low-rank approximation framework. A3 splits a
Transformer layer into three functional components, namely QK, OV,
and MLP. For each component, A3 provides an analytical
solution that reduces the hidden dimension size inside each component while
minimizing the component’s functional loss (i.e., error in attention
scores, attention outputs, and MLP outputs). This approach directly reduces
model sizes, KV cache sizes, and FLOPs without introducing any runtime
overheads. In addition, it provides a new narrative in advancing the
optimization problem from singular linear layer loss optimization toward
improved end-to-end performance. Through extensive experiments, we show that
A3 maintains superior performance compared to SoTAs. For example,
under the same reduction budget in computation and memory, our low-rank
approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2,
outperforming the previous SoTA’s 7.87 by 3.18. We also demonstrate the
versatility of A3, including KV cache compression, quantization, and
mixed-rank assignments for enhanced performance.
[LINK]
http://arxiv.org/abs/2505.12942v3
[DATE]
2025-06-26 07:03:54+08:00
[CATEGORIES]
cs.CL
cs.LG
Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training
[AUTHORS]
Jaydeep Borkar, Matthew Jagielski, Katherine Lee, Niloofar Mireshghallah, David A. Smith, Christopher A. Choquette-Choo
[ABSTRACT]
Due to the sensitive nature of personally identifiable information (PII), its
owners may have the authority to control its inclusion or request its removal
from large-language model (LLM) training. Beyond this, PII may be added or
removed from training datasets due to evolving dataset curation techniques,
because it was newly scraped for retraining, or because it was included
in a new downstream fine-tuning stage. We find that the amount and ease of PII
memorization is a dynamic property of a model that evolves throughout training
pipelines and depends on commonly altered design choices. We characterize three
such novel phenomena: (1) similar-appearing PII seen later in training can
elicit memorization of earlier-seen sequences in what we call assisted
memorization, and this is a significant factor (in our settings, up to 1/3);
(2) adding PII can increase memorization of other PII significantly (in our
settings, as much as ≈7.5×); and (3) removing PII can lead to
other PII being memorized. Model creators should consider these first- and
second-order privacy risks when training models to avoid the risk of new PII
regurgitation.
[COMMENTS]
Accepted at the Findings of the Association for Computational
Linguistics (2025)
[LINK]
http://arxiv.org/abs/2502.15680v2
[DATE]
2025-06-26 05:37:19+08:00
[CATEGORIES]
cs.CL
Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes
[AUTHORS]
Quintin Myers, Yanjun Gao
[ABSTRACT]
Large language models (LLMs) are increasingly proposed for detecting and
responding to violent content online, yet their ability to reason about morally
ambiguous, real-world scenarios remains underexamined. We present the first
study to evaluate LLMs using a validated social science instrument designed to
measure human response to everyday conflict, namely the Violent Behavior
Vignette Questionnaire (VBVQ). To assess potential bias, we introduce
persona-based prompting that varies race, age, and geographic identity within
the United States. Six LLMs developed across different geopolitical and
organizational contexts are evaluated under a unified zero-shot setting. Our
study reveals two key findings: (1) LLMs’ surface-level text generation often
diverges from their internal preference for violent responses; (2) their
violent tendencies vary across demographics, frequently contradicting
established findings in criminology, social science, and psychology.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2506.20822v1
[DATE]
2025-06-26 04:43:04+08:00
[CATEGORIES]
cs.CL
MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering
[AUTHORS]
Chinmay Gondhalekar, Urjitkumar Patel, Fang-Chun Yeh
[ABSTRACT]
Financial documents, such as 10-Ks, 10-Qs, and investor presentations, span
hundreds of pages and combine diverse modalities, including dense narrative
text, structured tables, and complex figures. Answering questions over such
content often requires joint reasoning across modalities, which strains
traditional large language models (LLMs) and retrieval-augmented generation
(RAG) pipelines due to token limitations, layout loss, and fragmented
cross-modal context. We introduce MultiFinRAG, a retrieval-augmented generation
framework purpose-built for financial QA. MultiFinRAG first performs multimodal
extraction by grouping table and figure images into batches and sending them to
a lightweight, quantized open-source multimodal LLM, which produces both
structured JSON outputs and concise textual summaries. These outputs, along
with narrative text, are embedded and indexed with modality-aware similarity
thresholds for precise retrieval. A tiered fallback strategy then dynamically
escalates from text-only to text+table+image contexts when necessary, enabling
cross-modal reasoning while reducing irrelevant context. Despite running on
commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy
than ChatGPT-4o (free-tier) on complex financial QA tasks involving text,
tables, images, and combined multimodal reasoning.
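
One way to picture the tiered fallback with modality-aware thresholds is the sketch below. The escalation criterion (too few text hits above the text threshold), the `Hit` record, and the parameter names are assumptions made for illustration; the abstract does not specify how "when necessary" is decided.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Hit:
    modality: str   # "text", "table", or "image"
    score: float    # retrieval similarity score
    content: str


def tiered_context(
    hits: List[Hit],
    thresholds: Dict[str, float],  # hypothetical per-modality similarity cutoffs
    min_text_hits: int = 2,        # hypothetical escalation criterion
) -> Tuple[str, List[Hit]]:
    """Return the tier used and the hits that form the context."""

    def passes(hit: Hit) -> bool:
        return hit.score >= thresholds[hit.modality]

    # Tier 1: stay with text-only context when enough text chunks qualify.
    text_hits = [h for h in hits if h.modality == "text" and passes(h)]
    if len(text_hits) >= min_text_hits:
        return "text-only", text_hits
    # Tier 2: escalate to every modality, each filtered by its own threshold.
    return "text+table+image", [h for h in hits if passes(h)]
```

Filtering each modality by its own cutoff keeps low-similarity tables and figures out of the prompt even after escalation.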
[COMMENTS]
Preprint Copy
[LINK]
http://arxiv.org/abs/2506.20821v1
[DATE]
2025-06-26 04:37:20+08:00
[CATEGORIES]
cs.CL
CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
[AUTHORS]
Leitian Tao, Xiang Chen, Tong Yu, Tung Mai, Ryan Rossi, Yixuan Li, Saayan Mitra
[ABSTRACT]
Large Language Models (LLMs) have revolutionized code generation but require
significant resources and often over-generalize, limiting their task-specific
efficiency. Fine-tuning smaller, open-source LLMs provides a cost-effective
alternative. However, standard supervised approaches rely only on correct
examples, missing valuable insights from failures. We introduce CodeLutra, a
framework that leverages both correct and incorrect code attempts. Instead of
using only correct solutions, CodeLutra applies iterative preference-based
refinement, comparing successful and failed outputs to better approximate
desired results. This approach narrows the performance gap with
state-of-the-art larger models without requiring massive datasets or auxiliary
models. For instance, on a challenging data science coding task, using only 500
samples improved Llama-3-8B’s accuracy from 28.2% to 48.6%, approaching GPT-4’s
level. By learning from both successes and mistakes, CodeLutra provides a
scalable and efficient path to high-quality code generation, making smaller
open-source models more competitive with leading closed-source alternatives.
[COMMENTS]
TMLR 2025
[LINK]
http://arxiv.org/abs/2411.05199v3
[DATE]
2025-06-26 02:20:39+08:00
[CATEGORIES]
cs.CL
Towards Probabilistic Question Answering Over Tabular Data
[AUTHORS]
Chen Shen, Sajjadur Rahman, Estevam Hruschka
[ABSTRACT]
Current approaches for question answering (QA) over tabular data, such as
NL2SQL systems, perform well for factual questions where answers are directly
retrieved from tables. However, they fall short on probabilistic questions
requiring reasoning under uncertainty. In this paper, we introduce a new
benchmark LUCARIO and a framework for probabilistic QA over large tabular data.
Our method induces Bayesian Networks from tables, translates natural language
queries into probabilistic queries, and uses large language models (LLMs) to
generate final answers. Empirical results demonstrate significant improvements
over baselines, highlighting the benefits of hybrid symbolic-neural reasoning.
[LINK]
http://arxiv.org/abs/2506.20747v1
[DATE]
2025-06-26 02:15:33+08:00
[CATEGORIES]
cs.CL
MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation
[AUTHORS]
Gurusha Juneja, Alon Albalak, Wenyue Hua, William Yang Wang
[ABSTRACT]
The proliferation of LLM-based agents has led to increasing deployment of
inter-agent collaboration for tasks such as scheduling, negotiation, and resource
allocation. In such systems, privacy is critical, as agents often access
proprietary tools and domain-specific databases requiring strict
confidentiality. This paper examines whether LLM-based agents demonstrate an
understanding of contextual privacy and whether, when instructed, these systems
preserve inference-time user privacy in non-adversarial multi-turn
conversations. Existing benchmarks to evaluate contextual privacy in LLM-agents
primarily assess single-turn, low-complexity tasks where private information
can be easily excluded. We first present a benchmark, MAGPIE, comprising 158
real-life high-stakes scenarios across 15 domains. These scenarios are designed
such that complete exclusion of private data impedes task completion yet
unrestricted information sharing could lead to substantial losses. We then
evaluate the current state-of-the-art LLMs on (a) their understanding of
contextually private data and (b) their ability to collaborate without
violating user privacy. Empirical experiments demonstrate that current models,
including GPT-4o and Claude-2.7-Sonnet, lack robust understanding of contextual
privacy, misclassifying private data as shareable 25.2% and 43.6% of the
time. In multi-turn conversations, these models disclose private information in
59.9% and 50.5% of cases even under explicit privacy instructions.
Furthermore, multi-agent systems fail to complete tasks in 71% of scenarios.
These results underscore that current models are not aligned towards both
contextual privacy preservation and collaborative task-solving.
[LINK]
http://arxiv.org/abs/2506.20737v1
[DATE]
2025-06-26 02:04:25+08:00
[CATEGORIES]
cs.CL
MMSearch-R1: Incentivizing LMMs to Search
[AUTHORS]
Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, Ziwei Liu
[ABSTRACT]
Robust deployment of large multimodal models (LMMs) in real-world scenarios
requires access to external knowledge sources, given the complexity and dynamic
nature of real-world information. Existing approaches such as
retrieval-augmented generation (RAG) and prompt-engineered search agents rely
on rigid pipelines, often leading to inefficient or excessive search behaviors.
We present MMSearch-R1, the first end-to-end reinforcement learning framework
that enables LMMs to perform on-demand, multi-turn search in real-world
Internet environments. Our framework integrates both image and text search
tools, allowing the model to reason about when and how to invoke them guided by
an outcome-based reward with a search penalty. To support training, we collect
a multimodal search VQA dataset through a semi-automated pipeline that covers
diverse visual and textual knowledge needs and curate a search-balanced subset
with both search-required and search-free samples, which proves essential for
shaping efficient and on-demand search behavior. Extensive experiments on
knowledge-intensive and info-seeking VQA tasks show that our model not only
outperforms RAG-based baselines of the same model size, but also matches the
performance of a larger RAG-based model while reducing search calls by over
30%. We further analyze key empirical findings to offer actionable insights for
advancing research in multimodal search.
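
An outcome-based reward with a search penalty could take a form like the sketch below. The exact functional form, the name `outcome_reward`, and the default penalty of 0.1 are assumptions for illustration only; the abstract states just that the reward depends on the outcome and penalizes search.

```python
def outcome_reward(is_correct: bool, used_search: bool, search_penalty: float = 0.1) -> float:
    """Score a rollout by its final answer, discounted if any search was invoked."""
    if not 0.0 <= search_penalty <= 1.0:
        raise ValueError("search_penalty must lie in [0, 1]")
    if not is_correct:
        # A wrong answer earns nothing, whether or not the model searched.
        return 0.0
    # A correct answer earns full reward without search and a reduced
    # reward when a search tool was called.
    return 1.0 - search_penalty if used_search else 1.0
```

With this shape, searching is worthwhile only when it turns a wrong answer into a correct one, which matches the on-demand behavior the abstract describes.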
[COMMENTS]
Code: https://github.com/EvolvingLMMs-Lab/multimodal-search-r1
[LINK]
http://arxiv.org/abs/2506.20670v1
[DATE]
2025-06-26 01:59:42+08:00
[CATEGORIES]
cs.CL
Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
[AUTHORS]
Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman
[ABSTRACT]
Navigating everyday social situations often requires juggling conflicting
goals, such as conveying a harsh truth and maintaining trust, all while still
being mindful of another person’s feelings. These value trade-offs are an
integral part of human decision-making and language use; however, current tools
for interpreting such dynamic and multi-faceted notions of values in LLMs are
limited. In cognitive science, so-called “cognitive models” provide formal
accounts of these trade-offs in humans, by modeling the weighting of a
speaker’s competing utility functions in choosing an action or utterance. In
this work, we use a leading cognitive model of polite speech to interpret the
extent to which LLMs represent human-like trade-offs. We apply this lens to
systematically evaluate value trade-offs in two encompassing model settings:
degrees of reasoning “effort” in frontier black-box models, and RL
post-training dynamics of open-source models. Our results highlight patterns of
higher informational utility than social utility in reasoning models, and in
open-source models shown to be stronger in mathematical reasoning. Our findings
from LLMs’ training dynamics suggest large shifts in utility values early on in
training with persistent effects of the choice of base model and pretraining
data, compared to feedback dataset or alignment method. We show that our method
is responsive to diverse aspects of the rapidly evolving LLM landscape, with
insights for forming hypotheses about other high-level behaviors, shaping
training regimes for reasoning models, and better controlling trade-offs
between values during model training.
[COMMENTS]
11 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.20666v1
[DATE]
2025-06-26 01:58:12+08:00
[CATEGORIES]
cs.CL
OmniGen2: Exploration to Advanced Multimodal Generation
[AUTHORS]
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
[ABSTRACT]
In this work, we introduce OmniGen2, a versatile and open-source generative
model designed to provide a unified solution for diverse generation tasks,
including text-to-image, image editing, and in-context generation. Unlike
OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image
modalities, utilizing unshared parameters and a decoupled image tokenizer. This
design enables OmniGen2 to build upon existing multimodal understanding models
without the need to re-adapt VAE inputs, thereby preserving the original text
generation capabilities. To facilitate the training of OmniGen2, we developed
comprehensive data construction pipelines, encompassing image editing and
in-context generation data. Additionally, we introduce a reflection mechanism
tailored for image generation tasks and curate a dedicated reflection dataset
based on OmniGen2. Despite its relatively modest parameter size, OmniGen2
achieves competitive results on multiple task benchmarks, including
text-to-image and image editing. To further evaluate in-context generation,
also referred to as subject-driven tasks, we introduce a new benchmark named
OmniContext. OmniGen2 achieves state-of-the-art performance among open-source
models in terms of consistency. We will release our models, training code,
datasets, and data construction pipeline to support future research in this
field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link:
https://github.com/VectorSpaceLab/OmniGen2
[LINK]
http://arxiv.org/abs/2506.18871v2
[DATE]
2025-06-26 01:54:25+08:00
[CATEGORIES]
cs.CL
PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models
[AUTHORS]
Soufiane Hayou, Nikhil Ghosh, Bin Yu
[ABSTRACT]
Low-Rank Adaptation (LoRA) is a widely used finetuning method for large
models. Its small memory footprint allows practitioners to adapt large models
to specific tasks at a fraction of the cost of full finetuning. Different
modifications have been proposed to enhance its efficiency by, for example,
setting the learning rate, the rank, and the initialization. Another
improvement axis is adapter placement strategy: when using LoRA, practitioners
usually pick module types to adapt with LoRA, such as Query and Key modules.
Few works have studied the problem of adapter placement, with inconclusive
results: the original LoRA paper suggested placing adapters in attention modules,
while other works suggested placing them in the MLP modules. Through an
intuitive theoretical analysis, we introduce PLoP (Precise LoRA Placement), a
lightweight method that allows automatic identification of module types where
LoRA adapters should be placed, given a pretrained model and a finetuning task.
We demonstrate that PLoP consistently outperforms, and in the worst case
is competitive with, commonly used placement strategies through comprehensive
experiments on supervised finetuning and reinforcement learning for reasoning.
[COMMENTS]
TL;DR: A lightweight module type selection method for LoRA
finetuning. PLoP gives precise placements for LoRA adapters for improved
performance
[LINK]
http://arxiv.org/abs/2506.20629v1
[DATE]
2025-06-26 01:25:02+08:00
[CATEGORIES]
cs.LG
cs.CL
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
[AUTHORS]
Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li
[ABSTRACT]
Scaling laws predict that the performance of large language models improves
with increasing model size and data size. In practice, pre-training has been
relying on massive web crawls, using almost all data sources publicly available
on the internet so far. However, this pool of natural data does not grow at the
same rate as the compute supply. Furthermore, the availability of high-quality
texts is even more limited: data filtering pipelines often remove up to 99% of
the initial web scrapes to achieve state-of-the-art performance. To address the “data wall”
of pre-training scaling, our work explores ways to transform and recycle data
discarded in existing filtering processes. We propose REWIRE, REcycling the Web
with guIded REwrite, a method to enrich low-quality documents so that they
could become useful for training. This in turn allows us to increase the
representation of synthetic data in the final pre-training set. Experiments at
1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw
texts and our rewritten texts leads to improvements of 1.0, 1.3 and 2.5 percentage
points, respectively, across 22 diverse tasks, compared to training on only
filtered web data. Training on the raw-synthetic data mix is also more
effective than having access to 2x web data. Through further analysis, we
demonstrate that about 82% of the mixed-in texts come from transforming
lower-quality documents that would otherwise be discarded. REWIRE also
outperforms related approaches of generating synthetic data, including
Wikipedia-style paraphrasing, question-answer synthesizing and knowledge
extraction. These results suggest that recycling web texts holds the potential
for being a simple and effective approach for scaling pre-training data.
[LINK]
http://arxiv.org/abs/2506.04689v2
[DATE]
2025-06-26 01:12:12+08:00
[CATEGORIES]
cs.CL
cs.LG
Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models
[AUTHORS]
Sherzod Hakimov, Lara Pfennigschmidt, David Schlangen
[COMMENTS]
Accepted at GemBench workshop co-located with ACL 2025
[LINK]
http://arxiv.org/abs/2502.11707v2
[DATE]
2025-06-26 00:48:16+08:00
[CATEGORIES]
cs.CL
On the Role of Context in Reading Time Prediction
[AUTHORS]
Andreas Opedal, Eleanor Chodroff, Ryan Cotterell, Ethan Gotlieb Wilcox
[ABSTRACT]
We present a new perspective on how readers integrate context during
real-time language comprehension. Our proposals build on surprisal theory,
which posits that the processing effort of a linguistic unit (e.g., a word) is
an affine function of its in-context information content. We first observe that
surprisal is only one out of many potential ways that a contextual predictor
can be derived from a language model. Another one is the pointwise mutual
information (PMI) between a unit and its context, which turns out to yield the
same predictive power as surprisal when controlling for unigram frequency.
Moreover, both PMI and surprisal are correlated with frequency. This means that
neither PMI nor surprisal contains information about context alone. In response
to this, we propose a technique where we project surprisal onto the orthogonal
complement of frequency, yielding a new contextual predictor that is
uncorrelated with frequency. Our experiments show that the proportion of
variance in reading times explained by context is a lot smaller when context is
represented by the orthogonalized predictor. From an interpretability
standpoint, this indicates that previous studies may have overstated the role
that context has in predicting reading times.
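
The orthogonalization step described above can be illustrated with a minimal
sketch. It assumes that "projecting onto the orthogonal complement of
frequency" amounts to taking the residuals of a least-squares regression of
surprisal on (log) frequency with an intercept; the paper's exact procedure may
differ. The function name and the toy numbers are hypothetical.

```python
from statistics import fmean


def orthogonalize(surprisal, frequency):
    """Return the part of surprisal that is uncorrelated with frequency.

    Assumption: simple OLS with an intercept; the residuals are the
    orthogonalized contextual predictor.
    """
    if len(surprisal) != len(frequency) or not surprisal:
        raise ValueError("inputs must be non-empty and of equal length")
    s_mean = fmean(surprisal)
    f_mean = fmean(frequency)
    s_centered = [s - s_mean for s in surprisal]
    f_centered = [f - f_mean for f in frequency]
    denom = sum(f * f for f in f_centered)
    if denom == 0:
        raise ValueError("frequency has zero variance")
    # Slope of the regression of surprisal on frequency.
    beta = sum(s * f for s, f in zip(s_centered, f_centered)) / denom
    # Residuals: centered surprisal minus its projection onto frequency.
    return [s - beta * f for s, f in zip(s_centered, f_centered)]


# Toy data (hypothetical): log unigram frequency and surprisal per word.
log_freq = [-2.0, -3.5, -5.0, -6.5, -8.0]
surprisal = [2.5, 3.0, 6.0, 5.5, 9.0]
residual = orthogonalize(surprisal, log_freq)
```

By construction the residuals have zero sample covariance with frequency, which
is the property the abstract attributes to the new predictor.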
[COMMENTS]
EMNLP 2024; preprocessing was corrected to exclude variance due to
word skipping and the conclusions remain unchanged
[LINK]
http://arxiv.org/abs/2409.08160v4
[DATE]
2025-06-26 00:32:48+08:00
[CATEGORIES]
cs.CL
cs.LG
Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling
[AUTHORS]
Jelena Bratulić, Sudhanshu Mittal, David T. Hoffmann, Samuel Böhm, Robin Tibor Schirrmeister, Tonio Ball, Christian Rupprecht, Thomas Brox
[ABSTRACT]
Large Language Models (LLMs) exhibit In-Context Learning (ICL), which enables
the model to perform new tasks conditioning only on the examples provided in
the context without updating the model’s weights. While ICL offers fast
adaptation across natural language tasks and domains, its emergence is less
straightforward for modalities beyond text. In this work, we systematically
uncover properties present in LLMs that support the emergence of ICL for
autoregressive models and various modalities by promoting the learning of the
needed mechanisms for ICL. We identify exact token repetitions in the training
data sequences as an important factor for ICL. Such repetitions further improve
stability and reduce transiency in ICL performance. Moreover, we emphasise the
significance of training task difficulty for the emergence of ICL. Finally, by
applying our novel insights on ICL emergence, we unlock ICL capabilities for
various visual datasets and a more challenging EEG classification task in a
few-shot learning regime.
[LINK]
http://arxiv.org/abs/2501.06256v2
[DATE]
2025-06-26 00:21:31+08:00
[CATEGORIES]
cs.CL
cs.LG
Action-Minimization Meets Generative Modeling: Efficient Transition Path Sampling with the Onsager-Machlup Functional
[AUTHORS]
Sanjeev Raja, Martin Šípka, Michael Psenka, Tobias Kreiman, Michal Pavelka, Aditi S. Krishnapriyan
[ABSTRACT]
Transition path sampling (TPS), which involves finding probable paths
connecting two points on an energy landscape, remains a challenge due to the
complexity of real-world atomistic systems. Current machine learning approaches
use expensive, task-specific, and data-free training procedures, limiting their
ability to benefit from high-quality datasets and large-scale pre-trained
models. In this work, we address TPS by interpreting candidate paths as
trajectories sampled from stochastic dynamics induced by the learned score
function of pre-trained generative models, specifically denoising diffusion and
flow matching. Under these dynamics, finding high-likelihood transition paths
becomes equivalent to minimizing the Onsager-Machlup (OM) action functional.
This enables us to repurpose pre-trained generative models for TPS in a
zero-shot manner, in contrast with bespoke, task-specific approaches in
previous work. We demonstrate our approach on varied molecular systems,
obtaining diverse, physically realistic transition pathways and generalizing
beyond the pre-trained model’s original training dataset. Our method can be
easily incorporated into new generative models, making it practically relevant
as models continue to scale and improve with increased data availability. Code
is available at github.com/ASK-Berkeley/OM-TPS.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2504.18506v3
[DATE]
2025-06-26 23:59:16+08:00
[CATEGORIES]
cs.LG
Distributed Cross-Channel Hierarchical Aggregation for Foundation Models
[AUTHORS]
Aristeidis Tsaris, Isaac Lyngaas, John Lagregren, Mohamed Wahib, Larry York, Prasanna Balaprakash, Dan Lu, Feiyi Wang, Xiao Wang
[ABSTRACT]
Vision-based scientific foundation models hold significant promise for
advancing scientific discovery and innovation. This potential stems from their
ability to aggregate images from diverse sources such as varying physical
groundings or data acquisition systems and to learn spatio-temporal
correlations using transformer architectures. However, tokenizing and
aggregating images can be compute-intensive, a challenge not fully addressed by
current distributed methods. In this work, we introduce the Distributed
Cross-Channel Hierarchical Aggregation (D-CHAG) approach designed for datasets
with a large number of channels across image modalities. Our method is
compatible with any model-parallel strategy and any type of vision transformer
architecture, significantly improving computational efficiency. We evaluated
D-CHAG on hyperspectral imaging and weather forecasting tasks. When integrated
with tensor parallelism and model sharding, our approach achieved up to a 75%
reduction in memory usage and more than doubled sustained throughput on up to
1,024 AMD GPUs on the Frontier Supercomputer.
[LINK]
http://arxiv.org/abs/2506.21411v1
[DATE]
2025-06-26 23:58:14+08:00
[CATEGORIES]
cs.LG
Early Stopping Tabular In-Context Learning
[AUTHORS]
Jaris Küken, Lennart Purucker, Frank Hutter
[ABSTRACT]
Tabular foundation models have shown strong performance across various
tabular learning tasks via in-context learning, offering robust generalization
without any downstream finetuning. However, their inference-time costs remain
high, particularly for larger datasets. To address this, we propose
early-stopping the in-context learning process. We achieve this by dynamically
evaluating whether to stop in-context learning after each Transformer encoder
layer. Once stopped, we decode the embedding using a pre-trained layer-wise
decoder. Experiments across 34 small classification tasks show that early
stopping in-context learning accelerates inference by up to x1.3 with
negligible degradation in predictive performance. To assess scalability, we
further evaluate our method on five larger classification tasks, achieving
speedups of up to x2.2. Our results demonstrate the potential of early exiting
as an effective and practical strategy for improving the efficiency of tabular
in-context learning.
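
A minimal sketch of layer-wise early exiting, as described above. The abstract
does not state the stopping criterion, so the confidence threshold used here is
an assumption, and the toy layers and decoders are hypothetical stand-ins for
Transformer encoder layers and the pre-trained layer-wise decoders.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x, layers, decoders, threshold=0.9):
    """Run layers one by one; stop once the decoded prediction is confident.

    Assumption: "confident" means the maximum class probability reaches
    `threshold`. Returns (class probabilities, number of layers used).
    """
    if not layers or len(layers) != len(decoders):
        raise ValueError("need one decoder per layer and at least one layer")
    h = x
    for depth, (layer, decoder) in enumerate(zip(layers, decoders), start=1):
        h = layer(h)
        probs = softmax(decoder(h))
        if max(probs) >= threshold:
            break
    return probs, depth


# Toy model (hypothetical): each layer doubles the embedding,
# and each decoder returns the embedding itself as class logits.
toy_layers = [lambda h: [2 * v for v in h]] * 4
toy_decoders = [lambda h: h] * 4
probs, used = early_exit_predict([0.5, 0.0], toy_layers, toy_decoders)
```

In this toy setup the remaining layers are skipped once the threshold is
reached, which is where the inference-time saving would come from.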
[COMMENTS]
ICML Workshop Paper
[LINK]
http://arxiv.org/abs/2506.21387v1
[DATE]
2025-06-26 23:36:37+08:00
[CATEGORIES]
cs.LG
Representation Learning of Lab Values via Masked AutoEncoders
[AUTHORS]
David Restrepo, Chenwei Wu, Yueran Jia, Jaden K. Sun, Jack Gallifant, Catherine G. Bielick, Yugang Jia, Leo A. Celi
[ABSTRACT]
Accurate imputation of missing laboratory values in electronic health records
(EHRs) is critical to enable robust clinical predictions and reduce biases in
AI systems in healthcare. Existing methods, such as XGBoost, softimpute, GAIN,
Expectation Maximization (EM), and MICE, struggle to model the complex temporal
and contextual dependencies in EHR data, particularly in underrepresented
groups. In this work, we propose Lab-MAE, a novel transformer-based masked
autoencoder framework that leverages self-supervised learning for the
imputation of continuous sequential lab values. Lab-MAE introduces a structured
encoding scheme that jointly models laboratory test values and their
corresponding timestamps, enabling explicit capture of temporal dependencies.
Empirical evaluation on the MIMIC-IV dataset demonstrates that Lab-MAE
significantly outperforms state-of-the-art baselines such as XGBoost,
softimpute, GAIN, EM, and MICE across multiple metrics, including root mean
square error (RMSE), R-squared (R2), and Wasserstein distance (WD). Notably,
Lab-MAE achieves equitable performance across demographic groups of patients,
advancing fairness in clinical predictions. We further investigate the role of
follow-up laboratory values as potential shortcut features, revealing Lab-MAE’s
robustness in scenarios where such data is unavailable. The findings suggest
that our transformer-based architecture, adapted to the characteristics of EHR
data, offers a foundation model for more accurate and fair clinical imputation.
In addition, we measure and compare the carbon footprint of Lab-MAE with that
of an XGBoost model, highlighting its environmental requirements.
[COMMENTS]
14 pages of main text, 11 appendix
[LINK]
http://arxiv.org/abs/2501.02648v3
[DATE]
2025-06-26 23:34:13+08:00
[CATEGORIES]
cs.LG
Temporal-Aware Graph Attention Network for Cryptocurrency Transaction Fraud Detection
[AUTHORS]
Zhi Zheng, Bochuan Zhou, Yuping Song
[ABSTRACT]
Cryptocurrency transaction fraud detection faces the dual challenges of
increasingly complex transaction patterns and severe class imbalance.
Traditional methods rely on manual feature engineering and struggle to capture
temporal and structural dependencies in transaction networks. This paper
proposes an Augmented Temporal-aware Graph Attention Network (ATGAT) that
enhances detection performance through three modules: (1) designing an advanced
temporal embedding module that fuses multi-scale time difference features with
periodic position encoding; (2) constructing a temporal-aware triple attention
mechanism that jointly optimizes structural, temporal, and global context
attention; (3) employing weighted BCE loss to address class imbalance.
Experiments on the Elliptic++ cryptocurrency dataset demonstrate that ATGAT
achieves an AUC of 0.9130, representing a 9.2% improvement over the best
traditional method XGBoost, 12.0% over GCN, and 10.0% over standard GAT. This
method not only validates the enhancement effect of temporal awareness and
triple attention mechanisms on graph neural networks, but also provides
financial institutions with more reliable fraud detection tools, with its
design principles generalizable to other temporal graph anomaly detection
tasks.
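
A minimal sketch of a weighted binary cross-entropy loss of the kind mentioned
in module (3). The abstract does not give the weighting scheme, so the single
positive-class weight `pos_weight` is an assumption, and the function name is
hypothetical.

```python
import math


def weighted_bce(probs, labels, pos_weight=1.0):
    """Mean binary cross-entropy with extra weight on the positive class.

    probs: predicted probabilities of the positive (fraud) class.
    labels: 0/1 ground-truth labels.
    pos_weight: multiplier on the positive-class term (assumed scheme);
                values above 1 penalize missed positives more heavily.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("inputs must be non-empty and of equal length")
    eps = 1e-12  # keep log() away from zero
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# Toy batch (hypothetical): one rare positive among negatives.
loss = weighted_bce([0.1, 0.2, 0.05, 0.3], [1, 0, 0, 0], pos_weight=5.0)
```

A common heuristic is to set the weight near the ratio of negative to positive
samples; whether ATGAT does so is not stated in the abstract.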
[LINK]
http://arxiv.org/abs/2506.21382v1
[DATE]
2025-06-26 23:34:06+08:00
[CATEGORIES]
cs.LG
HARPT: A Corpus for Analyzing Consumers’ Trust and Privacy Concerns in Mobile Health Apps
[AUTHORS]
Timoteo Kelly, Abdulkadir Korkmaz, Samuel Mallet, Connor Souders, Sadra Aliakbarpour, Praveen Rao
[ABSTRACT]
We present HARPT, a large-scale annotated corpus of mobile health app store
reviews aimed at advancing research in user privacy and trust. The dataset
comprises over 480,000 user reviews labeled into seven categories that capture
critical aspects of trust in applications, trust in providers and privacy
concerns. Creating HARPT required addressing multiple complexities, such as
defining a nuanced label schema, isolating relevant content from large volumes
of noisy data, and designing an annotation strategy that balanced scalability
with accuracy. This strategy integrated rule-based filtering, iterative manual
labeling with review, targeted data augmentation, and weak supervision using
transformer-based classifiers to accelerate coverage. In parallel, a carefully
curated subset of 7,000 reviews was manually annotated to support model
development and evaluation. We benchmark a broad range of classification
models, demonstrating that strong performance is achievable and providing a
baseline for future research. HARPT is released as a public resource to support
work in health informatics, cybersecurity, and natural language processing.
[LINK]
http://arxiv.org/abs/2506.19268v2
[DATE]
2025-06-26 23:23:54+08:00
[CATEGORIES]
cs.LG
Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application
[AUTHORS]
Xiucheng Wang, Honggang Jia, Nan Cheng, Dusit Niyato
[ABSTRACT]
In this paper, a novel semantic communication framework empowered by
generative artificial intelligence (GAI) is proposed, to enhance the robustness
against both channel noise and transmission data distribution shifts. A
theoretical foundation is established using stochastic differential equations
(SDEs), from which a closed-form mapping between any signal-to-noise ratio
(SNR) and the optimal denoising timestep is derived. Moreover, to address
distribution mismatch, a mathematical scaling method is introduced to align
received semantic features with the training distribution of the GAI. Built on
this theoretical foundation, a latent diffusion model (LDM)-based semantic
communication framework is proposed that combines a variational autoencoder for
semantic feature extraction with a pretrained diffusion model for
denoising. The proposed system is a training-free framework that supports
zero-shot generalization, and achieves superior performance under low-SNR and
out-of-distribution conditions, offering a scalable and robust solution for
future 6G semantic communication systems. Experimental results demonstrate that
the proposed semantic communication framework achieves state-of-the-art
performance in both pixel-level accuracy and semantic perceptual quality,
consistently outperforming baselines across a wide range of SNRs and data
distributions without any fine-tuning or post-training.
[LINK]
http://arxiv.org/abs/2506.05710v2
[DATE]
2025-06-26 23:21:59+08:00
[CATEGORIES]
cs.LG
MAx-DNN: Multi-Level Arithmetic Approximation for Energy-Efficient DNN Hardware Accelerators
[AUTHORS]
Vasileios Leon, Georgios Makris, Sotirios Xydis, Kiamal Pekmestzi, Dimitrios Soudris
[ABSTRACT]
Nowadays, the rapid growth of Deep Neural Network (DNN) architectures has
established them as the de facto approach for providing advanced Machine
Learning tasks with excellent accuracy. Targeting low-power DNN computing, this
paper examines the interplay of fine-grained error resilience of DNN workloads
in collaboration with hardware approximation techniques, to achieve higher
levels of energy efficiency. Utilizing the state-of-the-art ROUP approximate
multipliers, we systematically explore their fine-grained distribution across
the network according to our layer-, filter-, and kernel-level approaches, and
examine their impact on accuracy and energy. We use the ResNet-8 model on the
CIFAR-10 dataset to evaluate our approximations. The proposed solution delivers
up to 54% energy gains in exchange for up to 4% accuracy loss, compared to the
baseline quantized model, while it provides 2x energy gains with better
accuracy versus the state-of-the-art DNN approximations.
[COMMENTS]
Presented at the 13th IEEE LASCAS Conference
[LINK]
http://arxiv.org/abs/2506.21371v1
[DATE]
2025-06-26 23:21:12+08:00
[CATEGORIES]
cs.LG
rQdia: Regularizing Q-Value Distributions With Image Augmentation
[AUTHORS]
Sam Lerman, Jing Bi
[ABSTRACT]
rQdia regularizes Q-value distributions with augmented images in pixel-based
deep reinforcement learning. With a simple auxiliary loss that equalizes these
distributions via MSE, rQdia boosts DrQ and SAC on 9/12 and 10/12 tasks
respectively in the MuJoCo Continuous Control Suite from pixels, and
Data-Efficient Rainbow on 18/26 Atari Arcade environments. Gains are measured
in both sample efficiency and longer-term training. Moreover, the addition of
rQdia finally propels model-free continuous control from pixels over the state
encoding baseline.
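
A minimal sketch of an auxiliary loss of the kind described above: the mean
squared error between Q-values computed from an original observation and from
its augmented copy, added to the base reinforcement learning loss. How rQdia
forms the Q-value distributions is not specified in the abstract, so the plain
lists of Q-values and the `aux_weight` parameter are assumptions.

```python
def mse(xs, ys):
    """Mean squared error between two equally long lists of numbers."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def total_loss(rl_loss, q_original, q_augmented, aux_weight=1.0):
    """Base RL loss plus an MSE penalty that pulls the two Q-value sets together.

    q_original / q_augmented: Q-values for the same actions, computed from the
    original and the augmented image (assumed representation).
    """
    return rl_loss + aux_weight * mse(q_original, q_augmented)


# Toy values (hypothetical): Q-values over three actions.
loss = total_loss(0.5, [1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

The penalty vanishes when augmentation leaves the Q-values unchanged, so it
only acts on the part of the critic that is sensitive to the augmentation.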
[LINK]
http://arxiv.org/abs/2506.21367v1
[DATE]
2025-06-26 23:16:35+08:00
[CATEGORIES]
cs.LG
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning
[AUTHORS]
Melanie Rieff, Maya Varma, Ossian Rabow, Subathra Adithan, Julie Kim, Ken Chang, Hannah Lee, Nidhi Rohatgi, Christian Bluethgen, Mohamed S. Muneer, Jean-Benoit Delbrouck, Michael Moor
[ABSTRACT]
Multimodal in-context learning (ICL) remains underexplored despite
significant potential for domains such as medicine. Clinicians routinely
encounter diverse, specialized tasks requiring adaptation from limited
examples, such as drawing insights from a few relevant prior cases or
considering a constrained set of differential diagnoses. While multimodal large
language models (MLLMs) have shown advances in medical visual question
answering (VQA), their ability to learn multimodal tasks from context is
largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL
benchmark for medical tasks. Eleven medical experts curated problems, each
including a multimodal query and multimodal in-context examples as task
demonstrations. SMMILE encompasses 111 problems (517 question-image-answer
triplets) covering 6 medical specialties and 13 imaging modalities. We further
introduce SMMILE++, an augmented variant with 1038 permuted problems. A
comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit
moderate to poor multimodal ICL ability in medical tasks. In open-ended
evaluations, ICL contributes only 8% average improvement over zero-shot on
SMMILE and 9.4% on SMMILE++. We observe a susceptibility to irrelevant
in-context examples: even a single noisy or irrelevant example can degrade
performance by up to 9.5%. Moreover, example ordering exhibits a recency bias,
i.e., placing the most relevant example last can lead to substantial
performance improvements by up to 71%. Our findings highlight critical
limitations and biases in current MLLMs when learning multimodal medical tasks
from context.
[LINK]
http://arxiv.org/abs/2506.21355v1
[DATE]
2025-06-26 23:08:18+08:00
[CATEGORIES]
cs.LG
Lipschitz Bounds for Persistent Laplacian Eigenvalues under One-Simplex Insertions
[AUTHORS]
Le Vu Anh, Mehmet Dik, Nguyen Viet Anh
[ABSTRACT]
Persistent Laplacians are matrix operators that track how the shape and
structure of data transform across scales and are popularly adopted in biology,
physics, and machine learning. Their eigenvalues are concise descriptors of
geometric and topological features in a filtration. Although earlier work
established global algebraic stability for these operators, the precise change
in a single eigenvalue when one simplex, such as a vertex, edge, or triangle,
is added has remained unknown. This is important because downstream tools,
including heat-kernel signatures and spectral neural networks, depend directly
on these eigenvalues. We close this gap by proving a uniform Lipschitz bound:
after inserting one simplex, every up-persistent Laplacian eigenvalue can vary
by at most twice the Euclidean norm of that simplex’s boundary, independent of
filtration scale and complex size. This result delivers the first
eigenvalue-level robustness guarantee for spectral topological data analysis.
It guarantees that spectral features remain stable under local updates and
enables reliable error control in dynamic data settings.
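
In symbols, the stated bound can be written as follows. The notation is an
assumption made here for illustration: $\Delta$ and $\Delta'$ denote the
up-persistent Laplacian before and after inserting the simplex $\sigma$,
$\lambda_k$ its $k$-th eigenvalue, and $\partial\sigma$ the boundary of
$\sigma$ as a coefficient vector. The abstract does not say how eigenvalues
are matched if the operator's dimension changes.

```latex
\[
  \bigl|\lambda_k(\Delta') - \lambda_k(\Delta)\bigr|
  \;\le\; 2\,\lVert \partial\sigma \rVert_2
  \qquad \text{for every } k,
\]
```

The right-hand side depends only on the inserted simplex, which is what makes
the bound uniform in filtration scale and complex size.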
[COMMENTS]
16 pages, 4 figures
[LINK]
http://arxiv.org/abs/2506.21352v1
[DATE]
2025-06-26 23:03:54+08:00
[CATEGORIES]
cs.LG
On the Ability of Deep Networks to Learn Symmetries from Data: A Neural Kernel Theory
[AUTHORS]
Andrea Perin, Stephane Deny
[ABSTRACT]
Symmetries (transformations by group actions) are present in many datasets,
and leveraging them holds considerable promise for improving predictions in
machine learning. In this work, we aim to understand when and how deep networks
– with standard architectures trained in a standard, supervised way – learn
symmetries from data. Inspired by real-world scenarios, we study a
classification paradigm where data symmetries are only partially observed
during training: some classes include all transformations of a cyclic group,
while others – only a subset. In the infinite-width limit, where kernel
analogies apply, we derive a neural kernel theory of symmetry learning. The
group-cyclic nature of the dataset allows us to analyze the Gram matrix of
neural kernels in the Fourier domain; here we find a simple characterization of
the generalization error as a function of class separation (signal) and
class-orbit density (noise). This characterization reveals that generalization
can only be successful when the local structure of the data prevails over its
non-local, symmetry-induced structure, in the kernel space defined by the
architecture. We extend our theoretical treatment to any finite group,
including non-abelian groups. Our framework also applies to equivariant
architectures (e.g., CNNs), and recovers their success in the special case
where the architecture matches the inherent symmetry of the data. Empirically,
our theory reproduces the generalization failure of finite-width networks (MLP,
CNN, ViT) trained on partially observed versions of rotated-MNIST. We conclude
that conventional deep networks lack a mechanism to learn symmetries that have
not been explicitly embedded in their architecture a priori. Our framework
could be extended to guide the design of architectures and training procedures
able to learn symmetries from data.
[COMMENTS]
JMLR accepted version, including an extension of the theory to
general finite groups (including non-abelian groups)
[LINK]
http://arxiv.org/abs/2412.11521v2
[DATE]
2025-06-26 23:02:44+08:00
[CATEGORIES]
cs.LG
Learning Value of Information towards Joint Communication and Control in 6G V2X
[AUTHORS]
Lei Lei, Kan Zheng, Xuemin Shen
[ABSTRACT]
As Cellular Vehicle-to-Everything (C-V2X) evolves towards future
sixth-generation (6G) networks, Connected Autonomous Vehicles (CAVs) are
emerging to become a key application. Leveraging data-driven Machine Learning
(ML), especially Deep Reinforcement Learning (DRL), is expected to
significantly enhance CAV decision-making in both vehicle control and V2X
communication under uncertainty. These two decision-making processes are
closely intertwined, with the value of information (VoI) acting as a crucial
bridge between them. In this paper, we introduce Sequential Stochastic Decision
Process (SSDP) models to define and assess VoI, demonstrating their application
in optimizing communication systems for CAVs. Specifically, we formally define
the SSDP model and demonstrate that the MDP model is a special case of it. The
SSDP model offers a key advantage by explicitly representing the set of
information that can enhance decision-making when available. Furthermore, as
current research on VoI remains fragmented, we propose a systematic VoI
modeling framework grounded in the MDP, Reinforcement Learning (RL) and Optimal
Control theories. We define different categories of VoI and discuss their
corresponding estimation methods. Finally, we present a structured approach to
leverage the various VoI metrics for optimizing the “When”, “What”, and
“How” to communicate problems. For this purpose, SSDP models are formulated
with VoI-associated reward functions derived from VoI-based optimization
objectives. While we use a simple vehicle-following control problem to
illustrate the proposed methodology, it holds significant potential to
facilitate the joint optimization of stochastic, sequential control and
communication decisions in a wide range of networked control systems.
[LINK]
http://arxiv.org/abs/2505.06978v2
[DATE]
2025-06-26 23:01:20+08:00
[CATEGORIES]
cs.LG
PuriDefense: Randomized Local Implicit Adversarial Purification for Defending Black-box Query-based Attacks
[AUTHORS]
Ping Guo, Xiang Li, Zhiyuan Yang, Xi Lin, Qingchuan Zhao, Qingfu Zhang
[ABSTRACT]
Black-box query-based attacks constitute significant threats to Machine
Learning as a Service (MLaaS) systems since they can generate adversarial
examples without accessing the target model’s architecture and parameters.
Traditional defense mechanisms, such as adversarial training, gradient masking,
and input transformations, either impose substantial computational costs or
compromise the test accuracy of non-adversarial inputs. To address these
challenges, we propose an efficient defense mechanism, PuriDefense, that
employs random patch-wise purifications with an ensemble of lightweight
purification models at a low level of inference cost. These models leverage the
local implicit function and rebuild the natural image manifold. Our theoretical
analysis suggests that this approach slows down the convergence of query-based
attacks by incorporating randomness into purifications. Extensive experiments
on CIFAR-10 and ImageNet validate the effectiveness of our proposed
purifier-based defense mechanism, demonstrating significant improvements in
robustness against query-based attacks.
[LINK]
http://arxiv.org/abs/2401.10586v2
[DATE]
2025-06-26 23:00:42+08:00
[CATEGORIES]
cs.LG
Regret Bounds for Robust Online Decision Making
[AUTHORS]
Alexander Appel, Vanessa Kosoy
[ABSTRACT]
We propose a framework which generalizes “decision making with structured
observations” by allowing robust (i.e. multivalued) models. In this framework,
each model associates each decision with a convex set of probability
distributions over outcomes. Nature can choose distributions out of this set in
an arbitrary (adversarial) manner, that can be nonoblivious and depend on past
history. The resulting framework offers much greater generality than classical
bandits and reinforcement learning, since the realizability assumption becomes
much weaker and more realistic. We then derive a theory of regret bounds for
this framework. Although our lower and upper bounds are not tight, they are
sufficient to fully characterize power-law learnability. We demonstrate this
theory in two special cases: robust linear bandits and tabular robust online
reinforcement learning. In both cases, we derive regret bounds that improve on
the state of the art (except that we do not address computational efficiency).
[LINK]
http://arxiv.org/abs/2504.06820v2
[DATE]
2025-06-26 22:54:55+08:00
[CATEGORIES]
cs.LG
DynamicBench: Evaluating Real-Time Report Generation in Large Language Models
[AUTHORS]
Jingyao Li, Hao Sun, Zile Qiao, Yong Jiang, Pengjun Xie, Fei Huang, Hong Xu, Jiaya Jia
[ABSTRACT]
Traditional benchmarks for large language models (LLMs) typically rely on
static evaluations through storytelling or opinion expression, which fail to
capture the dynamic requirements of real-time information processing in
contemporary applications. To address this limitation, we present DynamicBench,
a benchmark designed to evaluate the proficiency of LLMs in storing and
processing up-to-the-minute data. DynamicBench utilizes a dual-path retrieval
pipeline, integrating web searches with local report databases. It necessitates
domain-specific knowledge, ensuring accurate report generation within
specialized fields. By evaluating models in scenarios that either provide or
withhold external documents, DynamicBench effectively measures their capability
to independently process recent information or leverage contextual
enhancements. Additionally, we introduce an advanced report generation system
adept at managing dynamic information synthesis. Our experimental results
confirm the efficacy of our approach, with our method achieving
state-of-the-art performance, surpassing GPT4o in document-free and
document-assisted scenarios by 7.0% and 5.8%, respectively. The code and data
will be made publicly available.
[LINK]
http://arxiv.org/abs/2506.21343v1
[DATE]
2025-06-26 22:53:44+08:00
[CATEGORIES]
cs.LG
A Scalable Quantum Neural Network for Approximate SRBB-Based Unitary Synthesis
[AUTHORS]
Giacomo Belli, Marco Mordacci, Michele Amoretti
[ABSTRACT]
In this work, a scalable quantum neural network is introduced as a means to
approximate any unitary evolution through the Standard Recursive Block Basis
(SRBB) and, subsequently, redesigned with a number of CNOTs asymptotically
reduced by an exponential contribution. This algebraic approach to the problem
of unitary synthesis exploits Lie algebras and their topological features to
obtain scalable parameterizations of unitary operators. First, the original
SRBB-based scalability scheme, already known in the literature only from a
theoretical point of view, is reformulated for efficient algorithm
implementation and complexity management. Remarkably, 2-qubit operators emerge
as a special case outside the original scaling scheme. Furthermore, an
algorithm is proposed to reduce the number of CNOTs, thus deriving a new
implementable scaling scheme that requires only one layer of approximation. The
scalable CNOT-reduced quantum neural network is implemented and its performance
is assessed with a variety of different unitary matrices, both sparse and
dense, up to 6 qubits via the PennyLane library. The effectiveness of the
approximation is measured with different metrics in relation to two optimizers:
a gradient-based method and the Nelder-Mead method. The approximate
CNOT-reduced SRBB-based synthesis algorithm is also tested on real hardware and
compared with other valid approximation and decomposition methods available in
the literature.
[LINK]
http://arxiv.org/abs/2412.03083v2
[DATE]
2025-06-26 22:43:45+08:00
[CATEGORIES]
cs.LG
ScaleGNN: Towards Scalable Graph Neural Networks via Adaptive High-order Neighboring Feature Fusion
[AUTHORS]
Xiang Li, Jianpeng Qi, Haobing Liu, Yuan Cao, Guoqing Chao, Zhongying Zhao, Junyu Dong, Yanwei Yu
[ABSTRACT]
Graph Neural Networks (GNNs) have demonstrated impressive performance across
diverse graph-based tasks by leveraging message passing to capture complex node
relationships. However, when applied to large-scale real-world graphs, GNNs
face two major challenges: First, it becomes increasingly difficult to ensure
both scalability and efficiency, as the repeated aggregation of large
neighborhoods leads to significant computational overhead; Second, the
over-smoothing problem arises, where excessive or deep propagation makes node
representations indistinguishable, severely hindering model expressiveness. To
tackle these issues, we propose ScaleGNN, a novel framework that adaptively
fuses multi-hop node features for both scalable and effective graph learning.
First, we construct per-hop pure neighbor matrices that capture only the
exclusive structural information at each hop, avoiding the redundancy of
conventional aggregation. Then, an enhanced feature fusion strategy
significantly balances low-order and high-order information, preserving both
local detail and global correlations without incurring excessive complexity. To
further reduce redundancy and over-smoothing, we introduce a Local Contribution
Score (LCS)-based masking mechanism to filter out less relevant high-order
neighbors, ensuring that only the most meaningful information is aggregated. In
addition, learnable sparse constraints selectively integrate multi-hop valuable
features, emphasizing the most informative high-order neighbors. Extensive
experiments on real-world datasets demonstrate that ScaleGNN consistently
outperforms state-of-the-art GNNs in both predictive accuracy and computational
efficiency, highlighting its practical value for large-scale graph learning.
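
One possible reading of the per-hop "pure" neighbor matrices described above is that the k-th matrix marks only node pairs whose shortest-path distance is exactly k, so each hop carries no information already present at lower hops. The NumPy sketch below illustrates that reading only. The function name `pure_hop_matrices` and the dense-matrix representation are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def pure_hop_matrices(adj, max_hop):
    """Return [P_1, ..., P_max_hop] with P_k[i, j] = 1 iff dist(i, j) == k.

    Hypothetical helper: `adj` is a dense (n, n) adjacency matrix.
    """
    n = adj.shape[0]
    a = (adj > 0).astype(np.int64)          # astype returns a copy
    np.fill_diagonal(a, 0)                  # ignore self-loops
    reached = np.eye(n, dtype=bool)         # pairs at distance < k
    frontier = np.eye(n, dtype=np.int64)    # pairs at distance k - 1
    result = []
    for _ in range(max_hop):
        walk = (frontier @ a) > 0           # one more step from the frontier
        new = walk & ~reached               # keep only first-time arrivals
        result.append(new.astype(np.int64))
        reached |= new
        frontier = new.astype(np.int64)
    return result

# Path graph 0 - 1 - 2 - 3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
hops = pure_hop_matrices(adj, 3)
```

On the path graph, `hops[0]` equals the adjacency matrix, `hops[1]` contains only the distance-2 pairs, and `hops[2]` contains only the pair (0, 3) and its mirror.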
[LINK]
http://arxiv.org/abs/2504.15920v4
[DATE]
2025-06-26 22:41:32+08:00
[CATEGORIES]
cs.LG
Stochastic Quantum Spiking Neural Networks with Quantum Memory and Local Learning
[AUTHORS]
Jiechen Chen, Bipin Rajendran, Osvaldo Simeone
[ABSTRACT]
Neuromorphic and quantum computing have recently emerged as promising
paradigms for advancing artificial intelligence, each offering complementary
strengths. Neuromorphic systems built on spiking neurons excel at processing
time-series data efficiently through sparse, event-driven computation,
consuming energy only upon input events. Quantum computing, on the other hand,
leverages superposition and entanglement to explore feature spaces that are
exponentially large in the number of qubits. Hybrid approaches combining these
paradigms have begun to show potential, but existing quantum spiking models
have important limitations. Notably, prior quantum spiking neuron
implementations rely on classical memory mechanisms on single qubits, requiring
repeated measurements to estimate firing probabilities, and they use
conventional backpropagation on classical simulators for training. Here we
propose a stochastic quantum spiking (SQS) neuron model that addresses these
challenges. The SQS neuron uses multi-qubit quantum circuits to realize a
spiking unit with internal quantum memory, enabling event-driven probabilistic
spike generation in a single shot. Furthermore, we outline how networks of SQS
neurons – dubbed SQS neural networks (SQSNNs) – can be trained via a
hardware-friendly local learning rule, eliminating the need for global
classical backpropagation. The proposed SQSNN model fuses the time-series
efficiency of neuromorphic computing with the exponentially large inner state
space of quantum computing, paving the way for quantum spiking neural networks
that are modular, scalable, and trainable on quantum hardware.
[LINK]
http://arxiv.org/abs/2506.21324v1
[DATE]
2025-06-26 22:39:14+08:00
[CATEGORIES]
cs.LG
On Uniform Weighted Deep Polynomial approximation
[AUTHORS]
Kingsley Yeon, Steven B. Damelin
[ABSTRACT]
It is a classical result in rational approximation theory that certain
non-smooth or singular functions, such as $|x|$ and $x^{1/p}$, can be
efficiently approximated using rational functions with root-exponential
convergence in terms of degrees of freedom \cite{Sta, GN}. In contrast,
polynomial approximations admit only algebraic convergence by Jackson’s theorem
\cite{Lub2}. Recent work shows that composite polynomial architectures can
recover exponential approximation rates even without smoothness \cite{KY}. In
this work, we introduce and analyze a class of weighted deep polynomial
approximants tailored for functions with asymmetric behavior: growing unbounded
on one side and decaying on the other. By multiplying a learnable deep
polynomial with a one-sided weight, we capture both local non-smoothness and
global growth. We show numerically that this framework outperforms Taylor,
Chebyshev, and standard deep polynomial approximants, even when all use the
same number of parameters. To optimize these approximants in practice, we
propose a stable graph-based parameterization strategy building on \cite{Jar}.
[LINK]
http://arxiv.org/abs/2506.21306v1
[DATE]
2025-06-26 22:25:32+08:00
[CATEGORIES]
cs.LG
Context-Aware Doubly-Robust Semi-Supervised Learning
[AUTHORS]
Clement Ruah, Houssem Sifaou, Osvaldo Simeone, Bashir Al-Hashimi
[ABSTRACT]
The widespread adoption of artificial intelligence (AI) in next-generation
communication systems is challenged by the heterogeneity of traffic and network
conditions, which call for the use of highly contextual, site-specific data. A
promising solution is to rely not only on real-world data, but also on
synthetic pseudo-data generated by a network digital twin (NDT). However, the
effectiveness of this approach hinges on the accuracy of the NDT, which can
vary widely across different contexts. To address this problem, this paper
introduces context-aware doubly-robust (CDR) learning, a novel semi-supervised
scheme that adapts its reliance on the pseudo-data to the different levels of
fidelity of the NDT across contexts. CDR is evaluated on the task of downlink
beamforming where it outperforms previous state-of-the-art approaches,
providing a 24% loss decrease when compared to doubly-robust (DR)
semi-supervised learning in regimes with low labeled data availability.
[COMMENTS]
This work has been accepted for publication in IEEE Signal Processing
Letters
[LINK]
http://arxiv.org/abs/2502.15577v2
[DATE]
2025-06-26 22:22:27+08:00
[CATEGORIES]
cs.LG
Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance
[AUTHORS]
Xuesong Li, Dianye Huang, Yameng Zhang, Nassir Navab, Zhongliang Jiang
[ABSTRACT]
Understanding medical ultrasound imaging remains a long-standing challenge
due to significant visual variability caused by differences in imaging and
acquisition parameters. Recent advancements in large language models (LLMs)
have been used to automatically generate terminology-rich summaries orientated
to clinicians with sufficient physiological knowledge. Nevertheless, the
increasing demand for improved ultrasound interpretability and basic scanning
guidance among non-expert users, e.g., in point-of-care settings, has not yet
been explored. In this study, we first introduce the scene graph (SG) for
ultrasound images to explain image content to ordinary users and provide guidance for
ultrasound scanning. The ultrasound SG is first computed using a
transformer-based one-stage method, eliminating the need for explicit object
detection. To generate a graspable image explanation for ordinary users, the user
query is then used to further refine the abstract SG representation through
LLMs. Additionally, the predicted SG is explored for its potential in guiding
ultrasound scanning toward missing anatomies within the current imaging view,
assisting ordinary users in achieving more standardized and complete anatomical
exploration. The effectiveness of this SG-based image explanation and scanning
guidance has been validated on images from the left and right neck regions,
including the carotid and thyroid, across five volunteers. The results
demonstrate the potential of the method to maximally democratize ultrasound by
enhancing its interpretability and usability for ordinary users.
[LINK]
http://arxiv.org/abs/2506.19683v2
[DATE]
2025-06-26 22:20:13+08:00
[CATEGORIES]
cs.LG
Devil’s Hand: Data Poisoning Attacks to Locally Private Graph Learning Protocols
[AUTHORS]
Longzhu He, Chaozhuo Li, Peng Tang, Li Sun, Sen Su, Philip S. Yu
[ABSTRACT]
Graph neural networks (GNNs) have achieved significant success in graph
representation learning and have been applied to various domains. However, many
real-world graphs contain sensitive personal information, such as user profiles
in social networks, raising serious privacy concerns when graph learning is
performed using GNNs. To address this issue, locally private graph learning
protocols have gained considerable attention. These protocols leverage the
privacy advantages of local differential privacy (LDP) and the effectiveness of
GNN’s message-passing in calibrating noisy data, offering strict privacy
guarantees for users’ local data while maintaining high utility (e.g., node
classification accuracy) for graph learning. Despite these advantages, such
protocols may be vulnerable to data poisoning attacks, a threat that has not
been considered in previous research. Identifying and addressing these threats
is crucial for ensuring the robustness and security of privacy-preserving graph
learning frameworks. This work introduces the first data poisoning attack
targeting locally private graph learning protocols. The attacker injects fake
users into the protocol, manipulates these fake users to establish links with
genuine users, and sends carefully crafted data to the server, ultimately
compromising the utility of private graph learning. The effectiveness of the
attack is demonstrated both theoretically and empirically. In addition, several
defense strategies have also been explored, but their limited effectiveness
highlights the need for more robust defenses.
[LINK]
http://arxiv.org/abs/2506.09803v2
[DATE]
2025-06-26 22:18:21+08:00
[CATEGORIES]
cs.LG
Improved seeding strategies for k-means and k-GMM
[AUTHORS]
Guillaume Carrière, Frédéric Cazals
[ABSTRACT]
We revisit the randomized seeding techniques for k-means clustering and k-GMM
(Gaussian Mixture model fitting with Expectation-Maximization), formalizing
their three key ingredients: the metric used for seed sampling, the number of
candidate seeds, and the metric used for seed selection. This analysis yields
novel families of initialization methods exploiting a lookahead
principle (conditioning the seed selection on an enhanced coherence with the
final metric used to assess the algorithm) and a multipass strategy to tame
the effect of randomization.
Experiments show a consistent constant factor improvement over classical
contenders in terms of the final metric (SSE for k-means, log-likelihood for
k-GMM), at a modest overhead. In particular, for k-means, our methods improve
on the recently designed multi-swap strategy, which was the first one to
outperform the greedy k-means++ seeding.
Our experimental analysis also sheds light on subtle properties of k-means
often overlooked, including the (lack of) correlations between the SSE upon
seeding and the final SSE, the variance reduction phenomena observed in
iterative seeding methods, and the sensitivity of the final SSE to the pool
size for greedy methods.
Practically, our most effective seeding methods are strong candidates to
become one of the standard techniques, if not the standard one. From a theoretical
perspective, our formalization of seeding opens the door to a new line of
analytical approaches.
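
The three ingredients named above (sampling metric, number of candidate seeds, selection metric) can be seen in the classical greedy k-means++ seeding that the abstract uses as a reference point. The sketch below is a minimal NumPy version of that classical baseline, not the new methods proposed in the paper. The function name `greedy_kmeans_pp` and its parameters are hypothetical.

```python
import numpy as np

def greedy_kmeans_pp(points, k, n_candidates, rng):
    """Classical greedy k-means++ seeding (illustrative baseline only).

    Sampling metric: squared distance to the nearest chosen seed (D^2).
    Candidate count: `n_candidates` draws per round.
    Selection metric: SSE after tentatively adding each candidate.
    """
    n = points.shape[0]
    centers = [points[rng.integers(n)]]               # first seed: uniform
    d2 = np.sum((points - centers[0]) ** 2, axis=1)   # D^2 to nearest seed
    for _ in range(1, k):
        total = d2.sum()
        probs = d2 / total if total > 0 else None     # None means uniform
        cand = rng.choice(n, size=n_candidates, p=probs)
        # Squared distances from every candidate to every point: (c, n)
        cand_d2 = np.sum(
            (points[None, :, :] - points[cand][:, None, :]) ** 2, axis=2
        )
        new_d2 = np.minimum(d2[None, :], cand_d2)     # D^2 if candidate added
        best = np.argmin(new_d2.sum(axis=1))          # lowest resulting SSE
        centers.append(points[cand[best]])
        d2 = new_d2[best]
    return np.array(centers), d2.sum()

pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, sse = greedy_kmeans_pp(pts, k=2, n_candidates=3,
                                rng=np.random.default_rng(0))
```

Changing what `probs` is computed from, how many candidates are drawn, or what `best` minimizes corresponds to varying each of the three ingredients independently.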
[COMMENTS]
13 pages
[LINK]
http://arxiv.org/abs/2506.21291v1
[DATE]
2025-06-26 22:10:40+08:00
[CATEGORIES]
cs.LG
Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling
[AUTHORS]
Michal Balcerak, Tamaz Amiranashvili, Antonio Terpin, Suprosanna Shit, Lea Bogensperger, Sebastian Kaltenbach, Petros Koumoutsakos, Bjoern Menze
[ABSTRACT]
The most widely used generative models map noise and data distributions by
matching flows or scores. However, they struggle to incorporate partial
observations and additional priors, something energy-based models (EBMs) handle
elegantly by simply adding corresponding scalar energy terms. We address this
issue by proposing Energy Matching, a framework that endows flow-based
approaches with the flexibility of EBMs. Far from the data manifold, samples
move along curl-free, optimal transport paths from noise to data. As they
approach the data manifold, an entropic energy term guides the system into a
Boltzmann equilibrium distribution, explicitly capturing the underlying
likelihood structure of the data. We parameterize this dynamic with a single
time-independent scalar field, which serves as both a powerful generator and a
flexible prior for effective regularization of inverse problems. Our method
substantially outperforms existing EBMs on CIFAR-10 and ImageNet generation in
terms of fidelity, while retaining simulation-free training of transport-based
approaches away from the data manifold. Furthermore, we leverage the method’s
flexibility to introduce an interaction energy that supports diverse mode
exploration, which we demonstrate in a controlled protein-generation setting.
Our approach focuses on learning a scalar potential energy, without
time-conditioning, auxiliary generators, or additional networks, which marks a
significant departure from recent EBM methods. We believe that this simplified
framework significantly advances EBMs' capabilities and paves the way for their
wider adoption in generative modeling across diverse domains.
[LINK]
http://arxiv.org/abs/2504.10612v4
[DATE]
2025-06-26 22:04:51+08:00
[CATEGORIES]
cs.LG
Hyperspherical Variational Autoencoders Using Efficient Spherical Cauchy Distribution
[AUTHORS]
Lukas Sablica, Kurt Hornik
[ABSTRACT]
We propose a novel variational autoencoder (VAE) architecture that employs a
spherical Cauchy (spCauchy) latent distribution. Unlike traditional Gaussian
latent spaces or the widely used von Mises-Fisher (vMF) distribution, spCauchy
provides a more natural hyperspherical representation of latent variables,
better capturing directional data while maintaining flexibility. Its
heavy-tailed nature prevents over-regularization, ensuring efficient latent
space utilization while offering a more expressive representation.
Additionally, spCauchy circumvents the numerical instabilities inherent to vMF,
which arise from computing normalization constants involving Bessel functions.
Instead, it enables a fully differentiable and efficient reparameterization
trick via Möbius transformations, allowing for stable and scalable training.
The KL divergence can be computed through a rapidly converging power series,
eliminating concerns of underflow or overflow associated with evaluation of
ratios of hypergeometric functions. These properties make spCauchy a compelling
alternative for VAEs, offering both theoretical advantages and practical
efficiency in high-dimensional generative modeling.
[LINK]
http://arxiv.org/abs/2506.21278v1
[DATE]
2025-06-26 22:01:51+08:00
[CATEGORIES]
cs.LG
Lagrangian Index Policy for Restless Bandits with Average Reward
[AUTHORS]
Konstantin Avrachenkov, Vivek S. Borkar, Pratik Shah
[ABSTRACT]
We study the Lagrange Index Policy (LIP) for restless multi-armed bandits
with long-run average reward. In particular, we compare the performance of LIP
with the performance of the Whittle Index Policy (WIP), both heuristic policies
known to be asymptotically optimal under certain natural conditions. Even
though in most cases their performances are very similar, in the cases when WIP
shows bad performance, LIP continues to perform very well. We then propose
reinforcement learning algorithms, both tabular and NN-based, to obtain online
learning schemes for LIP in the model-free setting. The proposed reinforcement
learning schemes for LIP require significantly less memory than the analogous
schemes for WIP. We calculate analytically the Lagrange index for the restart
model, which applies to the optimal web crawling and the minimization of the
weighted age of information. We also give a new proof of asymptotic optimality
in case of homogeneous arms as the number of arms goes to infinity, based on
exchangeability and de Finetti’s theorem.
[LINK]
http://arxiv.org/abs/2412.12641v2
[DATE]
2025-06-26 22:00:55+08:00
[CATEGORIES]
cs.LG
A GREAT Architecture for Edge-Based Graph Problems Like TSP
[AUTHORS]
Attila Lischka, Filip Rydin, Jiaming Wu, Morteza Haghir Chehreghani, Balázs Kulcsár
[ABSTRACT]
In recent years, many learning-based approaches have been proposed to
tackle combinatorial optimization problems such as routing problems. Many of
these approaches are based on graph neural networks (GNNs) or related
transformers, operating on the Euclidean coordinates representing the routing
problems. However, models operating on Euclidean coordinates are ill-suited for
non-Euclidean, asymmetric problem instances that are often found in real-world
settings. To overcome this limitation, we propose a novel GNN-based and
edge-focused neural model called Graph Edge Attention Network (GREAT). Using
GREAT as an encoder to capture the properties of a routing problem instance, we
build a reinforcement learning framework which we apply to Euclidean and
non-Euclidean variants of vehicle routing problems such as Traveling Salesman
Problem, Capacitated Vehicle Routing Problem and Orienteering Problem. Our
framework is among the first to tackle non-Euclidean variants of these problems
and achieves competitive results among learning-based solvers.
[COMMENTS]
15 pages, 7 figures
[LINK]
http://arxiv.org/abs/2408.16717v2
[DATE]
2025-06-26 21:54:56+08:00
[CATEGORIES]
cs.LG
Wavelet Diffusion Neural Operator
[AUTHORS]
Peiyan Hu, Rui Wang, Xiang Zheng, Tao Zhang, Haodong Feng, Ruiqi Feng, Long Wei, Yue Wang, Zhi-Ming Ma, Tailin Wu
[ABSTRACT]
Simulating and controlling physical systems described by partial differential
equations (PDEs) are crucial tasks across science and engineering. Recently,
diffusion generative models have emerged as a competitive class of methods for
these tasks due to their ability to capture long-term dependencies and model
high-dimensional states. However, diffusion models typically struggle with
handling system states with abrupt changes and generalizing to higher
resolutions. In this work, we propose Wavelet Diffusion Neural Operator (WDNO),
a novel PDE simulation and control framework that enhances the handling of
these complexities. WDNO comprises two key innovations. Firstly, WDNO performs
diffusion-based generative modeling in the wavelet domain for the entire
trajectory to handle abrupt changes and long-term dependencies effectively.
Secondly, to address the issue of poor generalization across different
resolutions, which is one of the fundamental tasks in modeling physical
systems, we introduce multi-resolution training. We validate WDNO on five
physical systems, including the 1D advection equation, three challenging physical
systems with abrupt changes (1D Burgers’ equation, 1D compressible
Navier-Stokes equation and 2D incompressible fluid), and a real-world dataset
ERA5, which demonstrates superior performance on both simulation and control
tasks over state-of-the-art methods, with significant improvements in long-term
and detail prediction accuracy. Remarkably, in the challenging context of the
2D high-dimensional and indirect control task aimed at reducing smoke leakage,
WDNO reduces the leakage by 78% compared to the second-best baseline. The code
can be found at https://github.com/AI4Science-WestlakeU/wdno.git.
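
To illustrate why a wavelet representation suits states with abrupt changes, the sketch below applies a single-level Haar transform to a step signal. The Haar wavelet is chosen here purely for simplicity. The abstract does not say which wavelet WDNO uses, and the helper names `haar_forward` and `haar_inverse` are hypothetical.

```python
import numpy as np

def haar_forward(x):
    """Single-level orthonormal Haar transform of an even-length signal."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # local averages (coarse content)
    detail = (even - odd) / np.sqrt(2)   # local differences (jumps)
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_forward exactly."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

step = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0])
approx, detail = haar_forward(step)
restored = haar_inverse(approx, detail)
```

For this step signal, only the one detail coefficient whose pair straddles the jump is nonzero, so the discontinuity is localized in a single coefficient while the transform remains exactly invertible.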
[LINK]
http://arxiv.org/abs/2412.04833v3
[DATE]
2025-06-26 21:39:47+08:00
[CATEGORIES]
cs.LG
Radio Map Estimation via Latent Domain Plug-and-Play Denoising
[AUTHORS]
Le Xu, Lei Cheng, Junting Chen, Wenqiang Pu, Xiao Fu
[ABSTRACT]
Radio map estimation (RME), also known as spectrum cartography, aims to
reconstruct the strength of radio interference across different domains (e.g.,
space and frequency) from sparsely sampled measurements. To tackle this typical
inverse problem, state-of-the-art RME methods rely on handcrafted or
data-driven structural information of radio maps. However, the former often
struggles to model complex radio frequency (RF) environments and the latter
requires excessive training – making it hard to quickly adapt to in situ
sensing tasks. This work presents a spatio-spectral RME approach based on
plug-and-play (PnP) denoising, a technique from computational imaging. The idea
is to leverage the observation that the denoising operations of signals like
natural images and radio maps are similar – despite the nontrivial differences
of the signals themselves. Hence, sophisticated denoisers designed for or
learned from natural images can be directly employed to assist RME, avoiding
using radio map data for training. Unlike conventional PnP methods that operate
directly in the data domain, the proposed method exploits the underlying
physical structure of radio maps and proposes an ADMM algorithm that denoises
in a latent domain. This design significantly improves computational efficiency
and enhances noise robustness. Theoretical aspects, e.g., recoverability of the
complete radio map and convergence of the ADMM algorithm are analyzed.
Synthetic and real data experiments are conducted to demonstrate the
effectiveness of our approach.
[LINK]
http://arxiv.org/abs/2501.13472v2
[DATE]
2025-06-26 21:31:04+08:00
[CATEGORIES]
cs.LG
Balancing Privacy, Robustness, and Efficiency in Machine Learning
[AUTHORS]
Youssef Allouah, Rachid Guerraoui, John Stephan
[ABSTRACT]
This position paper argues that achieving robustness, privacy, and efficiency
simultaneously in machine learning systems is infeasible under prevailing
threat models. The tension between these goals arises not from algorithmic
shortcomings but from structural limitations imposed by worst-case adversarial
assumptions. We advocate for a systematic research agenda aimed at formalizing
the robustness-privacy-efficiency trilemma, exploring how principled
relaxations of threat models can unlock better trade-offs, and designing
benchmarks that expose rather than obscure the compromises made. By shifting
focus from aspirational universal guarantees to context-aware system design,
the machine learning community can build models that are truly appropriate for
real-world deployment.
[LINK]
http://arxiv.org/abs/2312.14712v3
[DATE]
2025-06-26 21:12:25+08:00
[CATEGORIES]
cs.LG
Unsupervised Learning for Optimal Transport plan prediction between unbalanced graphs
[AUTHORS]
Sonia Mazelet, Rémi Flamary, Bertrand Thirion
[ABSTRACT]
Optimal transport between graphs, based on Gromov-Wasserstein and
other extensions, is a powerful tool for comparing and aligning
graph structures. However, solving the associated non-convex
optimization problems is computationally expensive, which limits the
scalability of these methods to large graphs. In this work, we
present Unbalanced Learning of Optimal Transport (ULOT), a deep
learning method that predicts optimal transport plans between two
graphs. Our method is trained by minimizing the fused unbalanced
Gromov-Wasserstein (FUGW) loss. We propose a novel neural
architecture with cross-attention that is conditioned on the FUGW
tradeoff hyperparameters. We evaluate ULOT on synthetic stochastic
block model (SBM) graphs and on real cortical surface data obtained
from fMRI. ULOT predicts transport plans with competitive loss up to
two orders of magnitude faster than classical solvers. Furthermore,
the predicted plan can be used as a warm start for classical solvers
to accelerate their convergence. Finally, the predicted transport
plan is fully differentiable with respect to the graph inputs and
FUGW hyperparameters, enabling the optimization of functionals of
the ULOT plan.
[LINK]
http://arxiv.org/abs/2506.12025v2
[DATE]
2025-06-26 21:01:32+08:00
[CATEGORIES]
cs.LG
Seal Your Backdoor with Variational Defense
[AUTHORS]
Ivan Sabolić, Matej Grcić, Siniša Šegvić
[ABSTRACT]
We propose VIBE, a model-agnostic framework that trains classifiers resilient
to backdoor attacks. The key concept behind our approach is to treat malicious
inputs and corrupted labels from the training dataset as observed random
variables, while the actual clean labels are latent. VIBE then recovers the
corresponding latent clean label posterior through variational inference. The
resulting training procedure follows the expectation-maximization (EM)
algorithm. The E-step infers the clean pseudolabels by solving an
entropy-regularized optimal transport problem, while the M-step updates the
classifier parameters via gradient descent. Being modular, VIBE can seamlessly
integrate with recent advancements in self-supervised representation learning,
which enhance its ability to resist backdoor attacks. We experimentally
validate the method's effectiveness against contemporary backdoor attacks on
standard datasets, a large-scale setup with 1k classes, and a dataset
poisoned with multiple attacks. VIBE consistently outperforms previous defenses
across all tested scenarios.
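
Entropy-regularized optimal transport problems of the kind mentioned for the E-step are commonly solved with Sinkhorn iterations. The sketch below shows a generic Sinkhorn solver used to turn classifier probabilities into soft pseudolabels under a fixed class marginal. The cost definition (negative log-probability), the uniform marginals, and all names are assumptions for illustration, since the abstract does not specify VIBE's exact formulation.

```python
import numpy as np

def sinkhorn(cost, row_marginal, col_marginal, epsilon, n_iters):
    """Generic Sinkhorn iterations for entropy-regularized OT."""
    kernel = np.exp(-cost / epsilon)
    u = np.ones_like(row_marginal)
    for _ in range(n_iters):
        v = col_marginal / (kernel.T @ u)   # rescale to match column sums
        u = row_marginal / (kernel @ v)     # rescale to match row sums
    return u[:, None] * kernel * v[None, :]

# Hypothetical classifier outputs for 4 samples and 2 classes.
probs = np.array([[0.7, 0.3],
                  [0.6, 0.4],
                  [0.2, 0.8],
                  [0.9, 0.1]])
rows = np.full(4, 0.25)        # each sample carries equal mass (assumed)
cols = np.array([0.5, 0.5])    # assumed balanced class marginal
plan = sinkhorn(-np.log(probs), rows, cols, epsilon=1.0, n_iters=200)
pseudolabels = plan / plan.sum(axis=1, keepdims=True)
```

Each row of `pseudolabels` is a distribution over classes. The column constraint pulls the assignment toward the assumed class balance, which is what distinguishes this from simply normalizing `probs`.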
[COMMENTS]
Accepted to ICCV 2025
[LINK]
http://arxiv.org/abs/2503.08829v2
[DATE]
2025-06-26 20:48:11+08:00
[CATEGORIES]
cs.LG
PCF-Grasp: Converting Point Completion to Geometry Feature to Enhance 6-DoF Grasp
[AUTHORS]
Yaofeng Cheng, Fusheng Zha, Wei Guo, Pengfei Wang, Chao Zeng, Lining Sun, Chenguang Yang
[ABSTRACT]
The 6-Degree of Freedom (DoF) grasp method based on point clouds has shown
significant potential in enabling robots to grasp target objects. However, most
existing methods are based on the point clouds (2.5D points) generated from
single-view depth images. These point clouds cover only one surface side of the
object and thus provide incomplete geometry information, which misleads the grasping
algorithm in judging the shape of the target object, resulting in low grasping
accuracy. Humans can accurately grasp objects from a single view by leveraging
their geometry experience to estimate object shapes. Inspired by humans, we
propose a novel 6-DoF grasping framework that converts point completion
results into object shape features to train the 6-DoF grasp network. Here, point
completion can generate approximately complete points from the 2.5D points,
similar to the human geometry experience, and converting them into shape features
is the way to utilize them to improve grasp efficiency. Furthermore, due to the
gap between the network generation and actual execution, we integrate a score
filter into our framework to select more executable grasp proposals for the
real robot. This enables our method to maintain a high grasp quality from any
camera viewpoint. Extensive experiments demonstrate that utilizing complete
point features enables the generation of significantly more accurate grasp
proposals and the inclusion of a score filter greatly enhances the credibility
of real-world robot grasping. Our method achieves a success rate 17.8% higher
than the state-of-the-art method in real-world experiments.
[LINK]
http://arxiv.org/abs/2504.16320v2
[DATE]
2025-06-26 20:42:10+08:00
[CATEGORIES]
cs.LG
Variational Supervised Contrastive Learning
[AUTHORS]
Ziwen Wang, Jiajun Fan, Thao Nguyen, Heng Ji, Ge Liu
[ABSTRACT]
Contrastive learning has proven to be highly efficient and adaptable in
shaping representation spaces across diverse modalities by pulling similar
samples together and pushing dissimilar ones apart. However, two key
limitations persist: (1) Without explicit regulation of the embedding
distribution, semantically related instances can inadvertently be pushed apart
unless complementary signals guide pair selection, and (2) excessive reliance
on large in-batch negatives and tailored augmentations hinders generalization.
To address these limitations, we propose Variational Supervised Contrastive
Learning (VarCon), which reformulates supervised contrastive learning as
variational inference over latent class variables and maximizes a
posterior-weighted evidence lower bound (ELBO) that replaces exhaustive
pair-wise comparisons for efficient class-aware matching and grants
fine-grained control over intra-class dispersion in the embedding space.
Trained exclusively on image data, our experiments on CIFAR-10, CIFAR-100,
ImageNet-100, and ImageNet-1K show that VarCon (1) achieves state-of-the-art
performance for contrastive learning frameworks, reaching 79.36% Top-1 accuracy
on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while
converging in just 200 epochs; (2) yields substantially clearer decision
boundaries and semantic organization in the embedding space, as evidenced by
KNN classification, hierarchical clustering results, and transfer-learning
assessments; and (3) demonstrates superior few-shot learning performance
compared with the supervised baseline and superior robustness across various
augmentation strategies.
[LINK]
http://arxiv.org/abs/2506.07413v2
[DATE]
2025-06-26 20:27:25+08:00
[CATEGORIES]
cs.LG
Moderating the Generalization of Score-based Generative Model
[AUTHORS]
Wan Jiang, He Wang, Xin Zhang, Dan Guo, Zhaoxin Fan, Yunfeng Diao, Richang Hong
[ABSTRACT]
Score-based Generative Models (SGMs) have demonstrated remarkable
generalization abilities, e.g. generating unseen, but natural data. However,
the greater the generalization power, the more likely the unintended
generalization, and the more dangerous the abuse. Research on moderated
generalization in SGMs remains limited. To fill this gap, we first examine the
current ‘gold standard’ in Machine Unlearning (MU), i.e., re-training the model
after removing the undesirable training data, and find it does not work in
SGMs. Further analysis of score functions reveals that the MU ‘gold standard’
does not alter the original score function, which explains its ineffectiveness.
Based on this insight, we propose the first Moderated Score-based Generative
Model (MSGM), which introduces a novel score adjustment strategy that redirects
the score function away from undesirable data during the continuous-time
stochastic differential equation process. Extensive experimental results
demonstrate that MSGM significantly reduces the likelihood of generating
undesirable content while preserving high visual quality for normal image
generation. Albeit designed for SGMs, MSGM is a general and flexible MU
framework that is compatible with diverse diffusion architectures (SGM and
DDPM) and training strategies (re-training and fine-tuning), and enables
zero-shot transfer of the pre-trained models to downstream tasks, e.g. image
inpainting and reconstruction. The code will be shared upon acceptance.
[LINK]
http://arxiv.org/abs/2412.07229v2
[DATE]
2025-06-26 20:06:00+08:00
[CATEGORIES]
cs.LG
Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning
[AUTHORS]
Haibo Qiu, Xiaohan Lan, Fanfan Liu, Xiaohu Sun, Delian Ruan, Peng Shi, Lin Ma
[ABSTRACT]
Recent advancements in large language models (LLMs) have witnessed a surge in
the development of advanced reasoning paradigms, which are now being integrated
into multimodal large language models (MLLMs). However, existing approaches
often fall short: methods solely employing reinforcement learning (RL) can
struggle with sample inefficiency and activating entirely absent reasoning
capabilities, while conventional pipelines that initiate with a cold-start
supervised fine-tuning (SFT) phase before RL may restrict the model’s
exploratory capacity and face suboptimal convergence. In this work, we
introduce Metis-RISE (RL Incentivizes and
SFT Enhances) for multimodal reasoning model learning. Unlike
conventional approaches, Metis-RISE distinctively omits an initial SFT stage,
beginning instead with an RL phase (e.g., using a Group Relative Policy
Optimization variant) to incentivize and activate the model’s latent reasoning
capacity. Subsequently, the targeted SFT stage addresses two key challenges
identified during RL: (1) inefficient trajectory sampling for tasks
where the model possesses but inconsistently applies correct reasoning, which
we tackle using self-distilled reasoning trajectories from the RL model itself;
and (2) fundamental capability absence, which we address by injecting
expert-augmented knowledge for prompts where the model entirely fails. This
strategic application of RL for incentivization followed by SFT for enhancement
forms the core of Metis-RISE, leading to two versions of our MLLMs (7B and 72B
parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard
demonstrate that both models achieve state-of-the-art performance among
similar-sized models, with the 72B version ranking fourth overall. Please refer
to our project page for open-source information.
[COMMENTS]
Project Page: https://github.com/MM-Thinking/Metis-RISE
[LINK]
http://arxiv.org/abs/2506.13056v2
[DATE]
2025-06-26 19:45:11+08:00
[CATEGORIES]
cs.LG
Self-Regulated Neurogenesis for Online Data-Incremental Learning
[AUTHORS]
Murat Onur Yildirim, Elif Ceren Gok Yildirim, Decebal Constantin Mocanu, Joaquin Vanschoren
[ABSTRACT]
Neural networks often struggle with catastrophic forgetting when learning
sequences of tasks or data streams, unlike humans who can continuously learn
and consolidate new concepts even in the absence of explicit cues. Online
data-incremental learning seeks to emulate this capability by processing each
sample only once, without having access to task or stream cues at any point in
time, since this is more realistic compared to offline setups, where all data
from novel class(es) is assumed to be readily available. However, existing
methods typically rely on storing the subsets of data in memory or expanding
the initial model architecture, resulting in significant computational
overhead. Drawing inspiration from ‘self-regulated neurogenesis’, the brain’s
mechanism for creating specialized regions or circuits for distinct
functions, we propose a novel approach, SERENA, which encodes each concept in a
specialized network path called a ‘concept cell’, integrated into a single
over-parameterized network. Once a concept is learned, its corresponding
concept cell is frozen, effectively preventing the forgetting of previously
acquired information. Furthermore, we introduce two new continual learning
scenarios that more closely reflect real-world conditions, characterized by
gradually changing sample sizes. Experimental results show that our method not
only establishes new state-of-the-art results across ten benchmarks but also
remarkably surpasses offline supervised batch learning performance. The code is
available at https://github.com/muratonuryildirim/serena.
[COMMENTS]
Published at Conference on Lifelong Learning Agents (CoLLAs) 2025
[LINK]
http://arxiv.org/abs/2403.14684v2
[DATE]
2025-06-26 19:35:57+08:00
[CATEGORIES]
cs.LG
Diverse Mini-Batch Selection in Reinforcement Learning for Efficient Chemical Exploration in de novo Drug Design
[AUTHORS]
Hampus Gummesson Svensson, Ola Engkvist, Jon Paul Janet, Christian Tyrchan, Morteza Haghir Chehreghani
[ABSTRACT]
In many real-world applications, evaluating the goodness of instances is
often costly and time-consuming, e.g., human feedback and physics simulations,
in contrast to proposing new instances. In particular, this is even more
critical in reinforcement learning, as new interactions with the environment
(i.e., new instances) need to be evaluated to provide a reward signal to learn
from. As sufficient exploration is crucial, learning from a diverse mini-batch
can have a large impact and help mitigate mode collapse. In this paper, we
introduce diverse mini-batch selection for reinforcement learning and propose
to use determinantal point processes for this task. We study this framework in
the context of a real-world problem, namely drug discovery. We experimentally
study how our proposed framework can improve the effectiveness of chemical
exploration in de novo drug design, where finding diverse and high-quality
solutions is essential. We conduct a comprehensive evaluation with three
well-established molecular generation oracles over numerous generative steps.
Our experiments conclude that our diverse mini-batch selection framework can
substantially improve the diversity of the solutions, while still obtaining
solutions of high quality. In drug discovery, such an outcome can potentially lead
to fulfilling unmet medication needs faster.
[LINK]
http://arxiv.org/abs/2506.21158v1
[DATE]
2025-06-26 19:31:30+08:00
[CATEGORIES]
cs.LG
Transformer-Based Spatial-Temporal Counterfactual Outcomes Estimation
[AUTHORS]
He Li, Haoang Chi, Mingyu Liu, Wanrong Huang, Liyang Xu, Wenjing Yang
[ABSTRACT]
The real world naturally has dimensions of time and space. Therefore,
estimating the counterfactual outcomes with spatial-temporal attributes is a
crucial problem. However, previous methods are based on classical statistical
models, which still have limitations in performance and generalization. This
paper proposes a novel framework for estimating counterfactual outcomes with
spatial-temporal attributes using the Transformer, exhibiting stronger
estimation ability. Under mild assumptions, the proposed estimator within this
framework is consistent and asymptotically normal. To validate the
effectiveness of our approach, we conduct simulation experiments and real data
experiments. Simulation experiments show that our estimator has a stronger
estimation capability than baseline methods. Real data experiments provide a
valuable conclusion about the causal effect of conflicts on forest loss in
Colombia. The source code is available at
https://github.com/lihe-maxsize/DeppSTCI_Release_Version-master.
[COMMENTS]
24 pages, accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2506.21154v1
[DATE]
2025-06-26 19:24:46+08:00
[CATEGORIES]
cs.LG
A Novel Federated Learning-Based IDS for Enhancing UAVs Privacy and Security
[AUTHORS]
Ozlem Ceviz, Pinar Sadioglu, Sevil Sen, Vassilios G. Vassilakis
[ABSTRACT]
Unmanned aerial vehicles (UAVs) operating within Flying Ad-hoc Networks
(FANETs) encounter security challenges due to the dynamic and distributed
nature of these networks. Previous studies focused predominantly on centralized
intrusion detection, assuming a central entity responsible for storing and
analyzing data from all devices. However, these approaches face challenges
including computation and storage costs, along with a single point of failure
risk, threatening data privacy and availability. The widespread dispersion of
data across interconnected devices underscores the need for decentralized
approaches. This paper introduces the Federated Learning-based Intrusion
Detection System (FL-IDS), addressing challenges encountered by centralized
systems in FANETs. FL-IDS reduces computation and storage costs for both
clients and the central server, which is crucial for resource-constrained UAVs.
Operating in a decentralized manner, FL-IDS enables UAVs to collaboratively
train a global intrusion detection model without sharing raw data, thus
avoiding delay in decisions based on collected data, as is often the case with
traditional methods. Experimental results demonstrate FL-IDS’s competitive
performance with Central IDS (C-IDS) while mitigating privacy concerns, with
the Bias Towards Specific Clients (BTSC) method further enhancing FL-IDS
performance even at lower attacker ratios. Comparative analysis with
traditional intrusion detection methods, including Local IDS (L-IDS), sheds
light on the strengths of FL-IDS. This study significantly contributes to UAV
security by introducing a privacy-aware, decentralized intrusion detection
approach tailored to UAV networks. Moreover, by introducing a realistic dataset
for FANETs and federated learning, our approach differs from others lacking
high dynamism and 3D node movements or accurate federated data federations.
[COMMENTS]
Published in Internet of Things, Volume 25, 2025, Article 101592
[LINK]
http://arxiv.org/abs/2312.04135v3
[DATE]
2025-06-26 19:21:32+08:00
[CATEGORIES]
cs.LG
Generative Adversarial Evasion and Out-of-Distribution Detection for UAV Cyber-Attacks
[AUTHORS]
Deepak Kumar Panda, Weisi Guo
[ABSTRACT]
The growing integration of UAVs into civilian airspace underscores the need
for resilient and intelligent intrusion detection systems (IDS), as traditional
anomaly detection methods often fail to identify novel threats. A common
approach treats unfamiliar attacks as out-of-distribution (OOD) samples;
however, this leaves systems vulnerable when mitigation is inadequate.
Moreover, conventional OOD detectors struggle to distinguish stealthy
adversarial attacks from genuine OOD events. This paper introduces a
conditional generative adversarial network (cGAN)-based framework for crafting
stealthy adversarial attacks that evade IDS mechanisms. We first design a
robust multi-class IDS classifier trained on benign UAV telemetry and known
cyber-attacks, including Denial of Service (DoS), false data injection (FDI),
man-in-the-middle (MiTM), and replay attacks. Using this classifier, our cGAN
perturbs known attacks to generate adversarial samples that misclassify as
benign while retaining statistical resemblance to OOD distributions. These
adversarial samples are iteratively refined to achieve high stealth and success
rates. To detect such perturbations, we implement a conditional variational
autoencoder (CVAE), leveraging negative log-likelihood to separate adversarial
inputs from authentic OOD samples. Comparative evaluation shows that CVAE-based
regret scores significantly outperform traditional Mahalanobis distance-based
detectors in identifying stealthy adversarial threats. Our findings emphasize
the importance of advanced probabilistic modeling to strengthen IDS
capabilities against adaptive, generative-model-based cyber intrusions.
[LINK]
http://arxiv.org/abs/2506.21142v1
[DATE]
2025-06-26 18:56:34+08:00
[CATEGORIES]
cs.LG
Multi-convex Programming for Discrete Latent Factor Models Prototyping
[AUTHORS]
Hao Zhu, Shengchao Yan, Jasper Hoffmann, Joschka Boedecker
[ABSTRACT]
Discrete latent factor models (DLFMs) are widely used in various domains such
as machine learning, economics, neuroscience, psychology, etc. Currently,
fitting a DLFM to some dataset relies on a customized solver for individual
models, which requires lots of effort to implement and is limited to the
targeted specific instance of DLFMs. In this paper, we propose a generic
framework based on CVXPY, which allows users to specify and solve the fitting
problem of a wide range of DLFMs, including both regression and classification
models, within a very short script. Our framework is flexible and inherently
supports the integration of regularization terms and constraints on the DLFM
parameters and latent factors, such that the users can easily prototype the
DLFM structure according to their dataset and application scenario. We
introduce our open-source Python implementation and illustrate the framework in
several examples.
[LINK]
http://arxiv.org/abs/2504.01431v2
[DATE]
2025-06-26 18:53:38+08:00
[CATEGORIES]
cs.LG
DBConformer: Dual-Branch Convolutional Transformer for EEG Decoding
[AUTHORS]
Ziwei Wang, Hongbin Wang, Tianwang Jia, Xingyi He, Siyang Li, Dongrui Wu
[ABSTRACT]
Electroencephalography (EEG)-based brain-computer interfaces (BCIs) transform
spontaneous/evoked neural activity into control commands for external
communication. While convolutional neural networks (CNNs) remain the mainstream
backbone for EEG decoding, their inherently short receptive field makes it
difficult to capture long-range temporal dependencies and global inter-channel
relationships. Recent CNN-Transformer hybrids (Conformers) partially address
this issue, but most adopt a serial design, resulting in suboptimal integration
of local and global features, and often overlook explicit channel-wise
modeling. To address these limitations, we propose DBConformer, a dual-branch
convolutional Transformer network tailored for EEG decoding. It integrates a
temporal Conformer to model long-range temporal dependencies and a spatial
Conformer to extract inter-channel interactions, capturing both temporal
dynamics and spatial patterns in EEG signals. A lightweight channel attention
module further refines spatial representations by assigning data-driven
importance to EEG channels. Extensive experiments on five motor imagery (MI)
datasets and two seizure detection datasets under three evaluation settings
demonstrate that DBConformer consistently outperforms 10 competitive baseline
models, with over eight times fewer parameters than the high-capacity EEG
Conformer baseline. Further, the visualization results confirm that the
features extracted by DBConformer are physiologically interpretable and aligned
with sensorimotor priors in MI. The superior performance and interpretability
of DBConformer make it reliable for robust and explainable EEG decoding. Code
is publicly available at https://github.com/wzwvv/DBConformer.
[COMMENTS]
12 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.21140v1
[DATE]
2025-06-26 18:53:24+08:00
[CATEGORIES]
cs.LG
Solving Inverse Problem for Multi-armed Bandits via Convex Optimization
[AUTHORS]
Hao Zhu, Joschka Boedecker
[ABSTRACT]
We consider the inverse problem of multi-armed bandits (IMAB) that are widely
used in neuroscience and psychology research for behavior modelling. We first
show that the IMAB problem is not convex in general, but can be relaxed to a
convex problem via variable transformation. Based on this result, we propose a
two-step sequential heuristic for (approximately) solving the IMAB problem. We
discuss a condition where our method provides a global solution to the IMAB
problem with a certificate, as well as approximations to further save computing
time. Numerical experiments indicate that our heuristic method is more robust
than directly solving the IMAB problem via repeated local optimization, and can
achieve the performance of Monte Carlo methods within a significantly decreased
running time. We provide the implementation of our method based on CVXPY, which
allows straightforward application by users not well versed in convex
optimization.
[LINK]
http://arxiv.org/abs/2501.18945v3
[DATE]
2025-06-26 18:49:32+08:00
[CATEGORIES]
cs.LG
NaLaFormer: Norm-Aware Linear Attention for Transformer Models
[AUTHORS]
Weikang Meng, Yadan Luo, Liangyu Huo, Yaowei Wang, Xin Li, Zheng Zhang
[ABSTRACT]
Linear attention has emerged as a viable alternative to softmax attention by
reducing complexity from quadratic to linear in sequence length. To preserve
two fundamental properties of softmax, non-negativity and entropy reduction,
current works employ various linearly separable kernel functions with $L_1$
normalization instead of the softmax operator. However, query norms are
neglected by the normalization operation in linear attention, and this
degradation leads to a substantial entropy gap. Meanwhile, existing works
inhibit negative values of query and key vectors, resulting in missing
inner-product interactions after mapping. To address these dual challenges, we propose a novel Norm-Aware
Linear Attention mechanism serving to restore norm-guided dynamic spikiness and
recover kernel-perturbed norm distributions. Specifically, we first decouple
query and key matrices into two components: norm and direction, to achieve
norm-aware spikiness control and norm consistency, respectively. We
mathematically reveal that the extent of entropy reduction varies with the
query norm in softmax normalization, motivating a query-norm aware kernel
function for dynamic control over entropy reduction. Furthermore, to ensure
norm consistency and enforce non-negativity constraints, we employ a
norm-preserving mapping to project all elements of the angular matrix into
positive values, leveraging cosine similarity to inhibit dimensions with
opposite directions. We conduct extensive experiments demonstrating that the
NaLaFormer improves performance on vision and language tasks, enhancing both
expressiveness and efficiency by up to 4.2%.
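For readers unfamiliar with the baseline this abstract starts from, the following is a minimal sketch of generic kernelized linear attention with L1 normalization. It is not NaLaFormer's norm-aware kernel. The feature map `phi(x) = elu(x) + 1` and the function names are assumptions chosen for illustration, and the attention is non-causal.

```python
import math

def phi(x):
    """Non-negative feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(Q, K, V):
    """Kernelized attention with L1 normalization, linear in sequence length.

    Q, K: lists of n vectors of size d; V: list of n vectors of size d_v.
    """
    d, d_v = len(K[0]), len(V[0])
    fK = [[phi(x) for x in k] for k in K]
    # Summaries shared by every query: S = sum_j phi(k_j) v_j^T, z = sum_j phi(k_j).
    S = [[sum(fk[a] * v[b] for fk, v in zip(fK, V)) for b in range(d_v)]
         for a in range(d)]
    z = [sum(fk[a] for fk in fK) for a in range(d)]
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        denom = sum(fq[a] * z[a] for a in range(d))  # L1 normalizer, always > 0
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out

def quadratic_attention(Q, K, V):
    """Same result computed with explicit n x n weights, for comparison."""
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        w = [sum(a * phi(b) for a, b in zip(fq, k)) for k in K]
        total = sum(w)
        out.append([sum(wi * v[b] for wi, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out
```

Because the normalizer depends only on the direction-weighted sum of mapped keys, rescaling a mapped query leaves its attention weights unchanged, which is one way to read the abstract's remark that query norms are neglected.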
[LINK]
http://arxiv.org/abs/2506.21137v1
[DATE]
2025-06-26 18:47:39+08:00
[CATEGORIES]
cs.LG
Inverse Reinforcement Learning via Convex Optimization
[AUTHORS]
Hao Zhu, Yuan Zhang, Joschka Boedecker
[ABSTRACT]
We consider the inverse reinforcement learning (IRL) problem, where an
unknown reward function of some Markov decision process is estimated based on
observed expert demonstrations. In most existing approaches, IRL is formulated
and solved as a nonconvex optimization problem, posing challenges in scenarios
where robustness and reproducibility are critical. We discuss a convex
formulation of the IRL problem (CIRL) initially proposed by Ng and Russell, and
reformulate the problem such that the domain-specific language CVXPY can be
applied directly to specify and solve the convex problem. We also extend the
CIRL problem to scenarios where the expert policy is not given analytically but
by trajectory as state-action pairs, which can be strongly inconsistent with
optimality, by augmenting some of the constraints. Theoretical analysis and
practical implementation for hyperparameter auto-selection are introduced. This
note helps users easily apply CIRL to their problems, without
background knowledge of convex optimization.
[LINK]
http://arxiv.org/abs/2501.15957v2
[DATE]
2025-06-26 18:46:25+08:00
[CATEGORIES]
cs.LG
Curriculum-Guided Antifragile Reinforcement Learning for Secure UAV Deconfliction under Observation-Space Attacks
[AUTHORS]
Deepak Kumar Panda, Adolfo Perrusquia, Weisi Guo
[ABSTRACT]
Reinforcement learning (RL) policies deployed in safety-critical systems,
such as unmanned aerial vehicle (UAV) navigation in dynamic airspace, are
vulnerable to out-of-distribution (OOD) adversarial attacks in the observation
space. These attacks induce distributional shifts that significantly degrade
value estimation, leading to unsafe or suboptimal decision making and rendering
the existing policy fragile. To address this vulnerability, we propose an
antifragile RL framework designed to adapt against a curriculum of incremental
adversarial perturbations. The framework introduces a simulated attacker that
incrementally increases the strength of observation-space perturbations, which
enables the RL agent to adapt and generalize across a wider range of OOD
observations and anticipate previously unseen attacks. We begin with a
theoretical characterization of fragility, formally defining catastrophic
forgetting as a monotonic divergence in value function distributions with
increasing perturbation strength. Building on this, we define antifragility as
the boundedness of such value shifts and derive adaptation conditions under
which forgetting is stabilized. Our method enforces these bounds through
iterative expert-guided critic alignment using Wasserstein distance
minimization across incrementally perturbed observations. We empirically
evaluate the approach in a UAV deconfliction scenario involving dynamic 3D
obstacles. Results show that the antifragile policy consistently outperforms
standard and robust RL baselines when subjected to both projected gradient
descent (PGD) and GPS spoofing attacks, achieving up to 15% higher cumulative
reward and over 30% fewer conflict events. These findings demonstrate the
practical and theoretical viability of antifragile reinforcement learning for
secure and resilient decision-making in environments with evolving threat
scenarios.
[LINK]
http://arxiv.org/abs/2506.21129v1
[DATE]
2025-06-26 18:10:41+08:00
[CATEGORIES]
cs.LG
Robust Policy Switching for Antifragile Reinforcement Learning for UAV Deconfliction in Adversarial Environments
[AUTHORS]
Deepak Kumar Panda, Weisi Guo
[ABSTRACT]
The increasing automation of navigation for unmanned aerial vehicles (UAVs)
has exposed them to adversarial attacks that exploit vulnerabilities in
reinforcement learning (RL) through sensor manipulation. Although existing
robust RL methods aim to mitigate such threats, their effectiveness has limited
generalization to out-of-distribution shifts from the optimal value
distribution, as they are primarily designed to handle fixed perturbation. To
address this limitation, this paper introduces an antifragile RL framework that
enhances adaptability to broader distributional shifts by incorporating a
switching mechanism based on discounted Thompson sampling (DTS). This mechanism
dynamically selects among multiple robust policies to minimize adversarially
induced state-action-value distribution shifts. The proposed approach first
derives a diverse ensemble of action robust policies by accounting for a range
of perturbations in the policy space. These policies are then modeled as a
multiarmed bandit (MAB) problem, where DTS optimally selects policies in
response to nonstationary Bernoulli rewards, effectively adapting to evolving
adversarial strategies. A theoretical framework is also provided in which
optimizing the DTS to minimize the overall regret due to distributional shift
results in effective adaptation against unseen adversarial attacks, thus
inducing antifragility. Extensive numerical simulations validate the
effectiveness of the proposed framework in complex navigation environments with
multiple dynamic three-dimensional obstacles and with stronger projected
gradient descent (PGD) and spoofing attacks. Compared to conventional robust,
non-adaptive RL methods, the antifragile approach achieves superior
performance, demonstrating shorter navigation path lengths and a higher rate of
conflict-free navigation trajectories.
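The switching mechanism described in this abstract can be illustrated with a minimal sketch of discounted Thompson sampling for Bernoulli rewards. This is a generic textbook-style version, not the authors' implementation. The class name, the discount `gamma`, and the Beta prior are assumptions, and each arm stands in for one of the robust policies.

```python
import random

class DiscountedThompsonSampling:
    """Bernoulli bandit whose evidence decays, so it can track nonstationary rewards."""

    def __init__(self, n_arms, gamma=0.95, prior=1.0, seed=None):
        self.gamma = gamma            # discount in (0, 1]; 1.0 recovers plain Thompson sampling
        self.prior = prior            # Beta(prior, prior) pseudo-counts (assumed symmetric)
        self.successes = [0.0] * n_arms
        self.failures = [0.0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        """Sample a success rate per arm from its Beta posterior and pick the largest."""
        samples = [
            self.rng.betavariate(s + self.prior, f + self.prior)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        """Discount all arms' counts, then add the new 0/1 reward to the pulled arm."""
        self.successes = [self.gamma * s for s in self.successes]
        self.failures = [self.gamma * f for f in self.failures]
        self.successes[arm] += reward
        self.failures[arm] += 1 - reward
```

In use, one would call `select()` to choose a policy for the next episode and `update(arm, reward)` with a binary outcome such as whether the episode was conflict-free; that reward definition is an assumption here, not taken from the abstract.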
[LINK]
http://arxiv.org/abs/2506.21127v1
[DATE]
2025-06-26 18:06:29+08:00
[CATEGORIES]
cs.LG
Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection
[AUTHORS]
Luosheng Xu, Dalin Zhang, Zhaohui Song
[ABSTRACT]
Remote sensing change detection is essential for monitoring urban expansion,
disaster assessment, and resource management, offering timely, accurate, and
large-scale insights into dynamic landscape transformations. While deep
learning has revolutionized change detection, the increasing complexity and
computational demands of modern models have not necessarily translated into
significant accuracy gains. Instead of following this trend, this study
explores a more efficient approach, focusing on lightweight models that
maintain high accuracy while minimizing resource consumption, which is an
essential requirement for on-satellite processing. To this end, we propose
FlickCD, which means quick flick then get great results, pushing the boundaries
of the performance-resource trade-off. FlickCD introduces an Enhanced
Difference Module (EDM) to amplify critical feature differences between
temporal phases while suppressing irrelevant variations such as lighting and
weather changes, thereby reducing computational costs in the subsequent change
decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion
Blocks, leveraging Shifted Window Self-Attention (SWSA) and Enhanced Global
Self-Attention (EGSA) to efficiently capture semantic information at multiple
scales, preserving both coarse- and fine-grained changes. Extensive experiments
on four benchmark datasets demonstrate that FlickCD reduces computational and
storage overheads by more than an order of magnitude while achieving
state-of-the-art (SOTA) performance or incurring only a minor (<1% F1)
accuracy trade-off. The implementation code is publicly available at
https://github.com/xulsh8/FlickCD.
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2506.21109v1
[DATE]
2025-06-26 17:06:52+08:00
[CATEGORIES]
cs.LG
Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges
[AUTHORS]
Changxi Chi, Jun Xia, Yufei Huang, Jingbo Zhou, Siyuan Li, Yunfan Liu, Chang Yu, Stan Z. Li
[ABSTRACT]
Estimating single-cell responses across various perturbations facilitates the
identification of key genes and enhances drug screening, significantly boosting
experimental efficiency. However, single-cell sequencing is a destructive
process, making it impossible to capture the same cell’s phenotype before and
after perturbation. Consequently, data collected under perturbed and
unperturbed conditions are inherently unpaired. Existing methods either attempt
to forcibly pair unpaired data using random sampling, or neglect the inherent
relationship between unperturbed and perturbed cells during the modeling. In
this work, we propose a framework based on Dual Diffusion Implicit Bridges
(DDIB) to learn the mapping between different data distributions, effectively
addressing the challenge of unpaired data. We further interpret this framework
as a form of data augmentation. We integrate gene regulatory network (GRN)
information to propagate perturbation signals in a biologically meaningful way,
and further incorporate a masking mechanism to predict silent genes, improving
the quality of generated profiles. Moreover, gene expression under the same
perturbation often varies significantly across cells, frequently exhibiting a
bimodal distribution that reflects intrinsic heterogeneity. To capture this, we
introduce a more suitable evaluation metric. We propose Unlasting, dual
conditional diffusion models that overcome the problem of unpaired single-cell
perturbation data and strengthen the model’s insight into perturbations under
the guidance of the GRN, with a dedicated mask model designed to improve
generation quality by predicting silent genes. In addition, we introduce a
biologically grounded evaluation metric that better reflects the inherent
heterogeneity in single-cell responses.
[LINK]
http://arxiv.org/abs/2506.21107v1
[DATE]
2025-06-26 17:05:38+08:00
[CATEGORIES]
cs.LG
Chain-of-Thought Enhanced Shallow Transformers for Wireless Symbol Detection
[AUTHORS]
Li Fan, Peng Wang, Jing Yang, Cong Shen
[ABSTRACT]
Transformers have shown potential in solving wireless communication problems,
particularly via in-context learning (ICL), where models adapt to new tasks
through prompts without requiring model updates. However, prior ICL-based
Transformer models rely on deep architectures with many layers to achieve
satisfactory performance, resulting in substantial storage and computational
costs. In this work, we propose CHain Of thOught Symbol dEtection (CHOOSE), a
CoT-enhanced shallow Transformer framework for wireless symbol detection. By
introducing autoregressive latent reasoning steps within the hidden space,
CHOOSE significantly improves the reasoning capacity of shallow models (1-2
layers) without increasing model depth. This design enables lightweight
Transformers to achieve detection performance comparable to much deeper models,
making them well-suited for deployment on resource-constrained mobile devices.
Experimental results demonstrate that our approach outperforms conventional
shallow Transformers and achieves performance comparable to that of deep
Transformers, while maintaining storage and computational efficiency. This
represents a promising direction for implementing Transformer-based algorithms
in wireless receivers with limited computational resources.
[LINK]
http://arxiv.org/abs/2506.21093v1
[DATE]
2025-06-26 16:41:45+08:00
[CATEGORIES]
cs.LG
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions
[AUTHORS]
Yangzhe Peng, Kaiyuan Gao, Liang He, Yuheng Cong, Haiguang Liu, Kun He, Lijun Wu
[ABSTRACT]
Molecular docking plays a crucial role in predicting the binding mode of
ligands to target proteins, and covalent interactions, which involve the
formation of a covalent bond between the ligand and the target, are
particularly valuable due to their strong, enduring binding nature. However,
most existing docking methods and deep learning approaches hardly account for
the formation of covalent bonds and the associated structural changes. To
address this gap, we introduce a comprehensive benchmark for covalent docking,
CovDocker, which is designed to better capture the complexities of covalent
binding. We decompose the covalent docking process into three main tasks:
reactive location prediction, covalent reaction prediction, and covalent
docking. By adapting state-of-the-art models, such as Uni-Mol and Chemformer,
we establish baseline performances and demonstrate the effectiveness of the
benchmark in accurately predicting interaction sites and modeling the molecular
transformations involved in covalent binding. These results confirm the role of
the benchmark as a rigorous framework for advancing research in covalent drug
design. It underscores the potential of data-driven approaches to accelerate
the discovery of selective covalent inhibitors and addresses critical
challenges in therapeutic development.
[COMMENTS]
Accepted to KDD 2025 Research Track
[LINK]
http://arxiv.org/abs/2506.21085v1
[DATE]
2025-06-26 16:28:07+08:00
[CATEGORIES]
cs.LG
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
[AUTHORS]
Sanjoy Chowdhury, Subrata Biswas, Sayan Nag, Tushar Nagarajan, Calvin Murdock, Ishwarya Ananthabhotla, Yijun Qian, Vamsi Krishna Ithapu, Dinesh Manocha, Ruohan Gao
[ABSTRACT]
Modern perception models, particularly those designed for multisensory
egocentric tasks, have achieved remarkable performance but often come with
substantial computational costs. These high demands pose challenges for
real-world deployment, especially in resource-constrained environments. In this
paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal
distillation and policy learning to enable efficient inference across different
egocentric perception tasks, including egocentric action recognition, active
speaker localization, and behavior anticipation. Our proposed policy module is
adaptable to task-specific action spaces, making it broadly applicable.
Experimental results on three challenging egocentric datasets EPIC-Kitchens,
EasyCom, and Aria Everyday Activities demonstrate that our method significantly
enhances efficiency, reducing GMACs by up to 89.09%, parameters up to 82.02%,
and energy up to 9.6x, while remaining on par with, and in many cases
outperforming, the performance of corresponding state-of-the-art models.
[COMMENTS]
Accepted at ICCV 2025
[LINK]
http://arxiv.org/abs/2506.21080v1
[DATE]
2025-06-26 16:09:16+08:00
[CATEGORIES]
cs.LG
Homogenization of Multi-agent Learning Dynamics in Finite-state Markov Games
[AUTHORS]
Yann Kerzreho
[ABSTRACT]
This paper introduces a new approach for approximating the learning dynamics
of multiple reinforcement learning (RL) agents interacting in a finite-state
Markov game. The idea is to rescale the learning process by simultaneously
reducing the learning rate and increasing the update frequency, effectively
treating the agent’s parameters as a slow-evolving variable influenced by the
fast-mixing game state. Under mild assumptions (ergodicity of the state process
and continuity of the updates), we prove the convergence of this rescaled process
to an ordinary differential equation (ODE). This ODE provides a tractable,
deterministic approximation of the agent’s learning dynamics. An implementation
of the framework is available at
https://github.com/yannKerzreho/MarkovGameApproximation
[LINK]
http://arxiv.org/abs/2506.21079v1
[DATE]
2025-06-26 16:08:49+08:00
[CATEGORIES]
cs.LG
SDE Matching: Scalable and Simulation-Free Training of Latent Stochastic Differential Equations
[AUTHORS]
Grigory Bartosh, Dmitry Vetrov, Christian A. Naesseth
[ABSTRACT]
The Latent Stochastic Differential Equation (SDE) is a powerful tool for time
series and sequence modeling. However, training Latent SDEs typically relies on
adjoint sensitivity methods, which depend on simulation and backpropagation
through approximate SDE solutions, limiting scalability. In this work, we
propose SDE Matching, a new simulation-free method for training Latent SDEs.
Inspired by modern Score- and Flow Matching algorithms for learning generative
dynamics, we extend these ideas to the domain of stochastic dynamics for time
series and sequence modeling, eliminating the need for costly numerical
simulations. Our results demonstrate that SDE Matching achieves performance
comparable to adjoint sensitivity methods while drastically reducing
computational complexity.
[LINK]
http://arxiv.org/abs/2502.02472v3
[DATE]
2025-06-26 15:38:35+08:00
[CATEGORIES]
cs.LG
FedDAA: Dynamic Client Clustering for Concept Drift Adaptation in Federated Learning
[AUTHORS]
Fu Peng, Ming Tang
[ABSTRACT]
In federated learning (FL), the data distribution of each client may change
over time, introducing both temporal and spatial data heterogeneity, known as
concept drift. Data heterogeneity arises from three drift sources: real drift
(a shift in the conditional distribution P(y|x)), virtual drift (a shift in the
input distribution P(x)), and label drift (a shift in the label distribution
P(y)). However, most existing FL methods addressing concept drift primarily
focus on real drift. When clients experience virtual or label drift, these
methods often fail to selectively retain useful historical knowledge, leading
to catastrophic forgetting. A key challenge lies in distinguishing different
sources of drift, as they require distinct adaptation strategies: real drift
calls for discarding outdated data, while virtual or label drift benefits from
retaining historical data. Without explicitly identifying the drift sources, a
general adaptation strategy is suboptimal and may harm generalization. To
address this challenge, we propose FedDAA, a dynamic clustered FL framework
designed to adapt to multi-source concept drift while preserving valuable
historical knowledge. Specifically, FedDAA integrates three modules: a cluster
number determination module to find the optimal number of clusters; a real
drift detection module to distinguish real drift from virtual/label drift; and
a concept drift adaptation module to adapt to new data while retaining useful
historical information. We provide theoretical convergence guarantees, and
experiments show that FedDAA achieves 7.84% to 8.52% accuracy improvements over
state-of-the-art methods on Fashion-MNIST, CIFAR-10, and CIFAR-100.
[LINK]
http://arxiv.org/abs/2506.21054v1
[DATE]
2025-06-26 15:09:08+08:00
[CATEGORIES]
cs.LG
Sharp concentration of uniform generalization errors in binary linear classification
[AUTHORS]
Shogo Nakakita
[ABSTRACT]
We examine the concentration of uniform generalization errors around their
expectation in binary linear classification problems via an isoperimetric
argument. In particular, we establish Poincaré and log-Sobolev inequalities
for the joint distribution of the output labels and the label-weighted input
vectors, which we apply to derive concentration bounds. The derived
concentration bounds are sharp up to moderate multiplicative constants by those
under well-balanced labels. In asymptotic analysis, we also show that almost
sure convergence of uniform generalization errors to their expectation occurs
in very broad settings, such as proportionally high-dimensional regimes. Using
this convergence, we establish uniform laws of large numbers under
dimension-free conditions.
[COMMENTS]
26 pages, 1 figure; minor edits to improve readability
[LINK]
http://arxiv.org/abs/2505.16713v2
[DATE]
2025-06-26 14:57:11+08:00
[CATEGORIES]
cs.LG
Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling
[AUTHORS]
Hansam Cho, Seoung Bum Kim
[ABSTRACT]
Text-guided diffusion models have become essential for high-quality image
synthesis, enabling dynamic image editing. In image editing, two crucial
aspects are editability, which determines the extent of modification, and
faithfulness, which reflects how well unaltered elements are preserved.
However, achieving optimal results is challenging because of the inherent
trade-off between editability and faithfulness. To address this, we propose
Faithfulness Guidance and Scheduling (FGS), which enhances faithfulness with
minimal impact on editability. FGS incorporates faithfulness guidance to
strengthen the preservation of input image information and introduces a
scheduling strategy to resolve misalignment between editability and
faithfulness. Experimental results demonstrate that FGS achieves superior
faithfulness while maintaining editability. Moreover, its compatibility with
various editing methods enables precise, high-quality image edits across
diverse tasks.
[COMMENTS]
preprint
[LINK]
http://arxiv.org/abs/2506.21045v1
[DATE]
2025-06-26 14:46:03+08:00
[CATEGORIES]
cs.LG
Efficient Skill Discovery via Regret-Aware Optimization
[AUTHORS]
He Zhang, Ming Zhou, Shaopeng Zhai, Ying Sun, Hui Xiong
[ABSTRACT]
Unsupervised skill discovery aims to learn diverse and distinguishable
behaviors in open-ended reinforcement learning. Existing methods
focus on improving diversity through pure exploration, mutual information
optimization, and temporal representation learning. Although they perform
well on exploration, they remain limited in efficiency, especially in
high-dimensional settings. In this work, we frame skill discovery as a
min-max game of skill generation and policy learning, proposing a regret-aware
method on top of temporal representation learning that expands the discovered
skill space along the direction of upgradable policy strength. The key insight
behind the proposed method is that skill discovery is adversarial to
policy learning, i.e., skills with weak strength should be explored further,
while skills with converged strength need less exploration. As an
implementation, we score the degree of strength convergence with regret, and
guide the skill discovery with a learnable skill generator. To avoid
degeneration, skill generation comes from an upgradable population of skill
generators. We conduct experiments on environments with varying complexities
and dimension sizes. Empirical results show that our method outperforms
baselines in both efficiency and diversity. Moreover, our method achieves a 15%
zero-shot improvement in high-dimensional environments compared to existing
methods.
[LINK]
http://arxiv.org/abs/2506.21044v1
[DATE]
2025-06-26 14:45:59+08:00
[CATEGORIES]
cs.LG
Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning
[AUTHORS]
Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han
[ABSTRACT]
Long-horizon goal-conditioned tasks pose fundamental challenges for
reinforcement learning (RL), particularly when goals are distant and rewards
are sparse. While hierarchical and graph-based methods offer partial solutions,
they often suffer from subgoal infeasibility and inefficient planning. We
introduce Strict Subgoal Execution (SSE), a graph-based hierarchical RL
framework that enforces single-step subgoal reachability by structurally
constraining high-level decision-making. To enhance exploration, SSE employs a
decoupled exploration policy that systematically traverses underexplored
regions of the goal space. Furthermore, a failure-aware path refinement step
refines graph-based planning by dynamically adjusting edge costs according to
observed low-level success rates, thereby improving subgoal reliability.
Experimental results across diverse long-horizon benchmarks demonstrate that
SSE consistently outperforms existing goal-conditioned RL and hierarchical RL
approaches in both efficiency and success rate.
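As a rough illustration of the failure-aware path refinement idea described above, the sketch below inflates edge costs on a subgoal graph as the observed low-level success rate drops, then replans with Dijkstra's algorithm. The cost rule (base cost divided by a Laplace-smoothed success rate), the function names, and the toy graph are assumptions made for illustration and are not taken from the SSE paper.

```python
import heapq
import math
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]
Stats = Dict[Tuple[str, str], Tuple[int, int]]  # (successes, attempts) per edge


def adjusted_cost(base_cost: float, successes: int, attempts: int) -> float:
    """Hypothetical rule: divide the base cost by a smoothed success rate."""
    rate = (successes + 1) / (attempts + 2)  # Laplace smoothing, always in (0, 1)
    return base_cost / rate


def apply_stats(base: Graph, stats: Stats) -> Graph:
    """Return a copy of the graph whose edge costs reflect observed success rates."""
    return {
        u: {v: adjusted_cost(c, *stats.get((u, v), (0, 0))) for v, c in nbrs.items()}
        for u, nbrs in base.items()
    }


def shortest_path(graph: Graph, start: str, goal: str) -> List[str]:
    """Dijkstra over non-negative edge costs; returns [] if the goal is unreachable."""
    dist = {start: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, c in graph.get(u, {}).items():
            nd = d + c
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return []
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy subgoal graph: the route through "a" is shorter but its first hop often fails.
base_graph: Graph = {"s": {"a": 1.0, "b": 1.5}, "a": {"g": 1.0}, "b": {"g": 1.5}}
edge_stats: Stats = {
    ("s", "a"): (1, 10),
    ("a", "g"): (9, 10),
    ("s", "b"): (9, 10),
    ("b", "g"): (9, 10),
}
refined_graph = apply_stats(base_graph, edge_stats)
```

With the base costs alone the planner prefers the route through "a"; once the low success rate on the s-to-a edge is taken into account, it switches to the longer but more reliable route through "b".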
[COMMENTS]
9 technical pages followed by references and appendix
[LINK]
http://arxiv.org/abs/2506.21039v1
[DATE]
2025-06-26 14:35:42+08:00
[CATEGORIES]
cs.LG
RL-Selector: Reinforcement Learning-Guided Data Selection via Redundancy Assessment
[AUTHORS]
Suorong Yang, Peijia Li, Furao Shen, Jian Zhao
[ABSTRACT]
Modern deep architectures often rely on large-scale datasets, but training on
these datasets incurs high computational and storage overhead. Real-world
datasets often contain substantial redundancies, prompting the need for more
data-efficient training paradigms. Data selection has shown promise to mitigate
redundancy by identifying the most representative samples, thereby reducing
training costs without compromising performance. Existing methods typically
rely on static scoring metrics or pretrained models, overlooking the combined
effect of selected samples and their evolving dynamics during training. We
introduce the concept of epsilon-sample cover, which quantifies sample
redundancy based on inter-sample relationships, capturing the intrinsic
structure of the dataset. Based on this, we reformulate data selection as a
reinforcement learning (RL) process and propose RL-Selector, where a
lightweight RL agent optimizes the selection policy by leveraging
epsilon-sample cover derived from evolving dataset distribution as a reward
signal. Extensive experiments across benchmark datasets and diverse
architectures demonstrate that our method consistently outperforms existing
state-of-the-art baselines. Models trained with our selected datasets show
enhanced generalization performance with improved training efficiency.
[COMMENTS]
ICCV 2025
[LINK]
http://arxiv.org/abs/2506.21037v1
[DATE]
2025-06-26 14:28:56+08:00
[CATEGORIES]
cs.LG
An Information-Theoretic Analysis for Federated Learning under Concept Drift
[AUTHORS]
Fu Peng, Meng Zhang, Ming Tang
[ABSTRACT]
Recent studies in federated learning (FL) commonly train models on static
datasets. However, real-world data often arrives as streams with shifting
distributions, causing performance degradation known as concept drift. This
paper analyzes FL performance under concept drift using information theory and
proposes an algorithm to mitigate the performance degradation. We model concept
drift as a Markov chain and introduce the Stationary Generalization
Error} to assess a model’s capability to capture characteristics of future
unseen data. Its upper bound is derived using KL divergence and mutual
information. We study three drift patterns (periodic, gradual, and random) and
their impact on FL performance. Inspired by this, we propose an algorithm that
regularizes the empirical risk minimization approach with KL divergence and
mutual information, thereby enhancing long-term performance. We also explore
the performance-cost tradeoff by identifying a Pareto front. To validate our
approach, we build an FL testbed using Raspberry Pi 4 devices. Experimental
results corroborate the theoretical findings, confirming that drift patterns
significantly affect performance. Our method consistently outperforms existing
approaches for these three patterns, demonstrating its effectiveness in
adapting to concept drift in FL.
[LINK]
http://arxiv.org/abs/2506.21036v1
[DATE]
2025-06-26 14:25:15+08:00
[CATEGORIES]
cs.LG
Little By Little: Continual Learning via Self-Activated Sparse Mixture-of-Rank Adaptive Learning
[AUTHORS]
Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong
[ABSTRACT]
Continual learning (CL) with large pre-trained models is challenged by
catastrophic forgetting and task interference. Existing LoRA-based
Mixture-of-Experts (MoE) approaches mitigate forgetting by assigning and
freezing task-specific adapters, but suffer from interference, redundancy, and
ambiguous routing due to coarse adapter-level selection. However, this design
introduces three key challenges: 1) Interference: Activating full LoRA experts
per input leads to subspace interference and prevents selective reuse of useful
components across tasks. 2) Redundancy: Newly added experts often duplicate or
contradict existing knowledge due to unnecessary activation of unrelated ranks
and insufficient reuse of relevant ones. 3) Ambiguity: Overlapping features
across tasks confuse the router, resulting in unstable expert assignments. As
more experts accumulate, earlier task routing degrades, accelerating
forgetting. We propose MoRA, a Mixture-of-Rank Adaptive learning approach with
self-activated and sparse rank activation for CL. Unlike mixing multiple
low-rank matrices, MoRA decomposes each rank-r update into r rank-1 components,
each treated as an independent expert, enabling fine-grained mixture of rank-1
expert utilization while mitigating interference and redundancy. To avoid
ambiguous routing, we propose that each rank-1 expert can infer its own
relevance via intermediate activations. Coupled with our proposed rank pruning
and activation budgets, MoRA adaptively selects a sparse mixture of ranks per
input. We validate MoRA on continual learning tasks with CLIP and large
language models (LLMs), analyzing both in-domain learning and out-of-domain
forgetting/generalization during fine-tuning. MoRA is significantly
effective at enhancing CL with pre-trained models (PTMs), improving
generalization while mitigating forgetting.
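The rank-1 decomposition mentioned above rests on a standard identity: a rank-r update B A equals the sum of r outer products, one per column of B and matching row of A. The sketch below checks that identity and shows how a per-rank gate can switch individual components on or off. The gate values here are set by hand; how MoRA derives them from intermediate activations is not specified in the abstract, so this is only an illustrative assumption, and all names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(b: Matrix, a: Matrix) -> Matrix:
    """Plain matrix product of b (d_out x r) and a (r x d_in)."""
    return [
        [sum(b[i][k] * a[k][j] for k in range(len(a))) for j in range(len(a[0]))]
        for i in range(len(b))
    ]


def rank1_components(b: Matrix, a: Matrix) -> List[Matrix]:
    """Split B A into r outer products: column k of B times row k of A."""
    return [
        [[b[i][k] * a[k][j] for j in range(len(a[0]))] for i in range(len(b))]
        for k in range(len(a))
    ]


def gated_update(b: Matrix, a: Matrix, gates: List[float]) -> Matrix:
    """Sum the rank-1 components, each scaled by its own gate (0 disables a rank)."""
    comps = rank1_components(b, a)
    return [
        [sum(g * c[i][j] for g, c in zip(gates, comps)) for j in range(len(a[0]))]
        for i in range(len(b))
    ]


B = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]  # 3 x 2, so r = 2
A = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]    # 2 x 3

full_update = gated_update(B, A, [1.0, 1.0])    # all ranks active: equals B A
sparse_update = gated_update(B, A, [1.0, 0.0])  # only the first rank-1 expert active
```

With every gate set to 1 the gated sum reproduces the full low-rank update; zeroing a gate removes exactly one rank-1 component, which is the granularity at which a sparse mixture of ranks would operate.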
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2506.21035v1
[DATE]
2025-06-26 14:19:05+08:00
[CATEGORIES]
cs.LG
PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling
[AUTHORS]
Yuxuan Yue, Zukang Xu, Zhihang Yuan, Dawei Yang, Jianlong Wu, Liqiang Nie
[ABSTRACT]
Large Language Models (LLMs) face significant challenges in edge deployment
due to their massive parameter scale. Vector Quantization (VQ), a
clustering-based quantization method, serves as a prevalent solution to this
issue owing to its extremely low bit-width (even 2-bit) and considerable accuracy.
Since a vector is a quantity in mathematics and physics that has both direction
and magnitude, existing VQ works typically quantize them in a coupled manner.
However, we find that direction exhibits significantly greater sensitivity to
quantization compared to the magnitude. For instance, when separately
clustering the directions and magnitudes of weight vectors in LLaMA-2-7B, the
accuracy drops on zero-shot tasks are 46.5% and 2.3%, respectively. This gap
widens further as the number of clustering centers decreases. Further, Euclidean
distance, a common metric to assess vector similarities in current VQ works,
places greater emphasis on reducing the magnitude error. This property runs
contrary to the above finding, unavoidably leading to larger quantization
errors. To this end, this paper proposes Polar Coordinate Decoupled Vector
Quantization (PCDVQ), an effective and efficient VQ framework consisting of two
key modules: 1) Polar Coordinate Decoupling (PCD), which transforms vectors
into their polar coordinate representations and performs independent
quantization of the direction and magnitude parameters; 2) Distribution Aligned
Codebook Construction (DACC), which optimizes the direction and magnitude
codebooks in accordance with the source distribution. Experimental results show
that PCDVQ outperforms baseline methods at the 2-bit level by at least 1.5%
zero-shot accuracy, establishing a novel paradigm for accurate and highly
compressed LLMs.
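To make the decoupling idea concrete, the sketch below splits a vector into a magnitude and a unit direction and quantizes each against its own codebook: cosine similarity for the direction, absolute difference for the magnitude. This is a simplified stand-in, assuming a unit-vector representation of direction rather than explicit polar angles, and the hand-written codebooks are placeholders rather than the DACC construction; all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def to_polar(v: Sequence[float]) -> Tuple[float, List[float]]:
    """Split a vector into (magnitude, unit direction)."""
    mag = math.sqrt(sum(x * x for x in v))
    if mag == 0.0:
        return 0.0, [0.0] * len(v)
    return mag, [x / mag for x in v]


def nearest_direction(d: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the unit-norm codeword with the highest cosine similarity to d."""
    return max(
        range(len(codebook)),
        key=lambda i: sum(x * y for x, y in zip(d, codebook[i])),
    )


def nearest_magnitude(m: float, levels: Sequence[float]) -> int:
    """Index of the magnitude level closest to m."""
    return min(range(len(levels)), key=lambda i: abs(m - levels[i]))


def quantize(v, dir_codebook, mag_levels) -> Tuple[int, int]:
    """Quantize direction and magnitude independently; return the two indices."""
    mag, direction = to_polar(v)
    return nearest_direction(direction, dir_codebook), nearest_magnitude(mag, mag_levels)


def dequantize(dir_idx: int, mag_idx: int, dir_codebook, mag_levels) -> List[float]:
    """Rebuild the vector as magnitude level times direction codeword."""
    return [mag_levels[mag_idx] * x for x in dir_codebook[dir_idx]]


# Placeholder codebooks: four axis-aligned directions and four magnitude levels.
DIR_CODEBOOK = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
MAG_LEVELS = [0.5, 1.0, 2.0, 4.0]

codes = quantize([0.1, 1.9], DIR_CODEBOOK, MAG_LEVELS)
restored = dequantize(codes[0], codes[1], DIR_CODEBOOK, MAG_LEVELS)
```

Because the two indices are chosen independently, the direction codebook can be given more entries than the magnitude codebook, which is one way to act on the observation that direction is the more quantization-sensitive of the two.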
[LINK]
http://arxiv.org/abs/2506.05432v2
[DATE]
2025-06-26 14:17:49+08:00
[CATEGORIES]
cs.LG
TRIDENT: Tri-Modal Molecular Representation Learning with Taxonomic Annotations and Local Correspondence
[AUTHORS]
Feng Jiang, Mangal Prakash, Hehuan Ma, Jianyuan Deng, Yuzhi Guo, Amina Mollaysa, Tommaso Mansi, Rui Liao, Junzhou Huang
[ABSTRACT]
Molecular property prediction aims to learn representations that map chemical
structures to functional properties. While multimodal learning has emerged as a
powerful paradigm to learn molecular representations, prior works have largely
overlooked textual and taxonomic information of molecules for representation
learning. We introduce TRIDENT, a novel framework that integrates molecular
SMILES, textual descriptions, and taxonomic functional annotations to learn
rich molecular representations. To achieve this, we curate a comprehensive
dataset of molecule-text pairs with structured, multi-level functional
annotations. Instead of relying on conventional contrastive loss, TRIDENT
employs a volume-based alignment objective to jointly align tri-modal features
at the global level, enabling soft, geometry-aware alignment across modalities.
Additionally, TRIDENT introduces a novel local alignment objective that
captures detailed relationships between molecular substructures and their
corresponding sub-textual descriptions. A momentum-based mechanism dynamically
balances global and local alignment, enabling the model to learn both broad
functional semantics and fine-grained structure-function mappings. TRIDENT
achieves state-of-the-art performance on 11 downstream tasks, demonstrating the
value of combining SMILES, textual, and taxonomic functional annotations for
molecular property prediction.
[LINK]
http://arxiv.org/abs/2506.21028v1
[DATE]
2025-06-26 14:09:47+08:00
[CATEGORIES]
cs.LG
HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation
[AUTHORS]
Qingyue Jiao, Kangyu Zheng, Yiyu Shi, Zhiding Liang
[ABSTRACT]
Machine learning-assisted diagnosis is gaining traction in skin disease
detection, but training effective models requires large amounts of high-quality
data. Skin disease datasets often suffer from class imbalance, privacy
concerns, and object bias, making data augmentation essential. While classical
generative models are widely used, they demand extensive computational
resources and lengthy training time. Quantum computing offers a promising
alternative, but existing quantum-based image generation methods can only yield
grayscale low-quality images. Through a novel classical-quantum latent space
fusion technique, our work overcomes this limitation and introduces the first
classical-quantum generative adversarial network (GAN) capable of generating
color medical images. Our model outperforms classical deep convolutional GANs
and existing hybrid classical-quantum GANs in both image generation quality and
classification performance boost when used as data augmentation. Moreover, the
performance boost is comparable with that achieved using state-of-the-art
classical generative models, yet with over 25 times fewer parameters and 10
times fewer training epochs. Such results suggest a promising future for
quantum image generation as quantum hardware advances. Finally, we demonstrate
the robust performance of our model on a real IBM quantum machine with hardware
noise.
[LINK]
http://arxiv.org/abs/2506.21015v1
[DATE]
2025-06-26 13:14:45+08:00
[CATEGORIES]
cs.LG
Efficient Image Generation with Variadic Attention Heads
[AUTHORS]
Steven Walton, Ali Hassani, Xingqian Xu, Zhangyang Wang, Humphrey Shi
[ABSTRACT]
While the integration of transformers in vision models has yielded
significant improvements on vision tasks, they still require significant amounts
of computation for both training and inference. Restricted attention mechanisms
significantly reduce these computational burdens but come at the cost of losing
either global or local coherence. We propose a simple, yet powerful method to
reduce these trade-offs: allow the attention heads of a single transformer to
attend to multiple receptive fields.
We demonstrate our method utilizing Neighborhood Attention (NA) and integrate
it into a StyleGAN based architecture for image generation. With this work,
dubbed StyleNAT, we are able to achieve a FID of 2.05 on FFHQ, a 6% improvement
over StyleGAN-XL, while utilizing 28% fewer parameters and with 4x the
throughput capacity. StyleNAT achieves the Pareto Frontier on FFHQ-256 and
demonstrates powerful and efficient image generation on other datasets. Our
code and model checkpoints are publicly available at:
https://github.com/SHI-Labs/StyleNAT
[COMMENTS]
Published in eLVM @ CVPR
(https://openaccess.thecvf.com/content/CVPR2025W/eLVM/html/Walton_Efficient_Image_Generation_with_Variadic_Attention_Heads_CVPRW_2025_paper)
Formerly named StyleNAT: Giving Each Head a New Perspective
[LINK]
http://arxiv.org/abs/2211.05770v3
[DATE]
2025-06-26 13:07:48+08:00
[CATEGORIES]
cs.LG
Proximal Point Method for Online Saddle Point Problem
[AUTHORS]
Qing-xin Meng, Jian-wei Liu
[ABSTRACT]
This paper focuses on the online saddle point problem, which involves a
sequence of two-player time-varying convex-concave games. Considering the
nonstationarity of the environment, we adopt the duality gap and the dynamic
Nash equilibrium regret as performance metrics for algorithm design. We present
three variants of the proximal point method: the Online Proximal Point Method
(OPPM), the Optimistic OPPM (OptOPPM), and the OptOPPM with multiple
predictors. Each algorithm guarantees upper bounds for both the duality gap and
dynamic Nash equilibrium regret, achieving near-optimality when measured
against the duality gap. Specifically, in certain benign environments, such as
sequences of stationary payoff functions, these algorithms maintain a nearly
constant metric bound. Experimental results further validate the effectiveness
of these algorithms. Lastly, this paper discusses potential reliability
concerns associated with using dynamic Nash equilibrium regret as a performance
metric. The technical appendix and code can be found at
https://github.com/qingxin6174/PPM-for-OSP.
[LINK]
http://arxiv.org/abs/2407.04591v3
[DATE]
2025-06-26 13:01:47+08:00
[CATEGORIES]
cs.LG
Distilling Normalizing Flows
[AUTHORS]
Steven Walton, Valeriy Klyukin, Maksim Artemev, Denis Derkach, Nikita Orlov, Humphrey Shi
[ABSTRACT]
Explicit density learners are becoming an increasingly popular technique for
generative models because of their ability to better model probability
distributions. They have advantages over Generative Adversarial Networks due to
their ability to perform density estimation and having exact latent-variable
inference. This has many advantages, including: being able to simply
interpolate, calculate sample likelihood, and analyze the probability
distribution. The downside of these models is that they are often more
difficult to train and have lower sampling quality.
Normalizing flows are explicit density models that use composable bijective
functions to turn an intractable probability function into a tractable one. In
this work, we present novel knowledge distillation techniques to increase
sampling quality and density estimation of smaller student normalizing flows.
We seek to study the capacity of knowledge distillation in Compositional
Normalizing Flows to understand the benefits and weaknesses provided by these
architectures. Normalizing flows have unique properties that allow for
non-traditional forms of knowledge transfer, where we can transfer that
knowledge within intermediate layers. We find that through this distillation,
we can make students significantly smaller while making substantial performance
gains over a non-distilled student. With smaller models there is a
proportionally increased throughput as this is dependent upon the number of
bijectors, and thus parameters, in the network.
[COMMENTS]
Published in eLVM @ CVPR
(https://openaccess.thecvf.com/content/CVPR2025W/eLVM/html/Walton_Distilling_Normalizing_Flows_CVPRW_2025_paper)
[LINK]
http://arxiv.org/abs/2506.21003v1
[DATE]
2025-06-26 12:34:28+08:00
[CATEGORIES]
cs.LG
Genetic Algorithm with Innovative Chromosome Patterns in the Breeding Process
[AUTHORS]
Qingchuan Lyu
[ABSTRACT]
This paper proposes Genetic Algorithm with Border Trades (GAB), a novel
modification of the standard genetic algorithm that enhances exploration by
incorporating new chromosome patterns in the breeding process. This approach
significantly mitigates premature convergence and improves search diversity.
Empirically, GAB achieves up to 8x higher fitness and 10x faster convergence on
complex job scheduling problems compared to standard Genetic Algorithms,
reaching average fitness scores of 888 versus 106 in under 20 seconds. On the
classic Flip-Flop problem, GAB consistently finds optimal or near-optimal
solutions in fewer generations, even as input sizes scale to thousands of bits.
These results highlight GAB as a highly effective and computationally efficient
alternative for solving large-scale combinatorial optimization problems.
[LINK]
http://arxiv.org/abs/2501.18184v3
[DATE]
2025-06-26 12:26:22+08:00
[CATEGORIES]
cs.LG
Pretrained Reversible Generation as Unsupervised Visual Representation Learning
[AUTHORS]
Rongkun Xue, Jinouwen Zhang, Yazhe Niu, Dazhong Shen, Bingqi Ma, Yu Liu, Jing Yang
[ABSTRACT]
Recent generative models based on score matching and flow matching have
significantly advanced generation tasks, but their potential in discriminative
tasks remains underexplored. Previous approaches, such as generative
classifiers, have not fully leveraged the capabilities of these models for
discriminative tasks due to their intricate designs. We propose Pretrained
Reversible Generation (PRG), which extracts unsupervised representations by
reversing the generative process of a pretrained continuous generation model.
PRG effectively reuses unsupervised generative models, leveraging their high
capacity to serve as robust and generalizable feature extractors for downstream
tasks. This framework enables the flexible selection of feature hierarchies
tailored to specific downstream tasks. Our method consistently outperforms
prior approaches across multiple benchmarks, achieving state-of-the-art
performance among generative model based methods, including 78% top-1 accuracy
on ImageNet at a resolution of 64*64. Extensive ablation studies, including
out-of-distribution evaluations, further validate the effectiveness of our
approach. Code is available at https://github.com/opendilab/PRG.
[COMMENTS]
Accepted by ICCV 2025
[LINK]
http://arxiv.org/abs/2412.01787v3
[DATE]
2025-06-26 12:26:18+08:00
[CATEGORIES]
cs.LG
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance
[AUTHORS]
Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji
[ABSTRACT]
We propose a novel step-by-step video-to-audio generation method that
sequentially produces individual audio tracks, each corresponding to a specific
sound event in the video. Our approach mirrors traditional Foley workflows,
aiming to capture all sound events induced by a given video comprehensively.
Each generation step is formulated as a guided video-to-audio synthesis task,
conditioned on a target text prompt and previously generated audio tracks. This
design is inspired by the idea of concept negation from prior compositional
generation frameworks. To enable this guided generation, we introduce a
training framework that leverages pre-trained video-to-audio models and
eliminates the need for specialized paired datasets, allowing training on more
accessible data. Experimental results demonstrate that our method generates
multiple semantically distinct audio tracks for a single input video, leading
to higher-quality composite audio synthesis than existing baselines.
[LINK]
http://arxiv.org/abs/2506.20995v1
[DATE]
2025-06-26 12:20:08+08:00
[CATEGORIES]
cs.LG
Bridging the Gap Between Approximation and Learning via Optimal Approximation by ReLU MLPs of Maximal Regularity
[AUTHORS]
Ruiyang Hong, Anastasis Kratsios
[ABSTRACT]
The foundations of deep learning are supported by the seemingly opposing
perspectives of approximation and learning theory. The former advocates for
large/expressive models that need not generalize, while the latter considers
classes that generalize but may be too small/constrained to be universal
approximators. Motivated by real-world deep learning implementations that are
both expressive and statistically reliable, we ask: “Is there a class of neural
networks that is both large enough to be universal but structured enough to
generalize?” This paper constructively provides a positive answer to this
question by identifying a highly structured class of ReLU multilayer
perceptrons (MLPs), which are optimal function approximators and are
statistically well-behaved. We show that any $(L,\alpha)$-Hölder function
from $[0,1]^d$ to $[-n,n]$ can be approximated to a uniform $\mathcal{O}(1/n)$
error on $[0,1]^d$ with a sparsely connected ReLU MLP with the same Hölder
exponent $\alpha$ and coefficient $L$, of width $\mathcal{O}(dn^{d/\alpha})$,
depth $\mathcal{O}(\log(d))$, with $\mathcal{O}(dn^{d/\alpha})$ nonzero
parameters, and whose weights and biases take values in $\{0,\pm 1/2\}$ except
in the first and last layers which instead have magnitude at most $n$. Further,
our class of MLPs achieves a near-optimal sample complexity of
$\mathcal{O}(\log(N)/\sqrt{N})$ when given $N$ i.i.d. normalized sub-Gaussian
training samples. We achieve this through a new construction that perfectly
fits together linear pieces using Kuhn triangulations, along with a new proof
technique which shows that our construction preserves the regularity of not
only the Hölder functions, but also any uniformly continuous function. Our
results imply that neural networks can solve the McShane extension problem on
suitable finite sets.
[COMMENTS]
16 pages main body, 40 pages proofs, 10 figures, 1 table
[LINK]
http://arxiv.org/abs/2409.12335v4
[DATE]
2025-06-26 12:08:57+08:00
[CATEGORIES]
cs.LG
Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations
[AUTHORS]
Chongjie Si, Zhiyi Shi, Xuehui Wang, Yichen Xiao, Xiaokang Yang, Wei Shen
[ABSTRACT]
Adapting pre-trained foundation models for diverse downstream tasks is a core
practice in artificial intelligence. However, the wide range of tasks and high
computational costs make full fine-tuning impractical. To overcome this,
parameter-efficient fine-tuning (PEFT) methods like LoRA have emerged and are
becoming a growing research focus. Despite the success of these methods, they
are primarily designed for linear layers, focusing on two-dimensional matrices
while largely ignoring higher-dimensional parameter spaces like convolutional
kernels. Moreover, directly applying these methods to higher-dimensional
parameter spaces often disrupts their structural relationships. Given the rapid
advancements in matrix-based PEFT methods, rather than designing a specialized
strategy, we propose a generalization that extends matrix-based PEFT methods to
higher-dimensional parameter spaces without compromising their structural
properties. Specifically, we treat parameters as elements of a Lie group, with
updates modeled as perturbations in the corresponding Lie algebra. These
perturbations are mapped back to the Lie group through the exponential map,
ensuring smooth, consistent updates that preserve the inherent structure of the
parameter space. Extensive experiments on computer vision and natural language
processing validate the effectiveness and versatility of our approach,
demonstrating clear improvements over existing methods.
[COMMENTS]
2025 ICCV
[LINK]
http://arxiv.org/abs/2504.00851v2
[DATE]
2025-06-26 11:12:59+08:00
[CATEGORIES]
cs.LG
Explainable quantum regression algorithm with encoded data structure
[AUTHORS]
C. -C. Joseph Wang, F. Perkkola, I. Salmenperä, A. Meijer-van de Griend, J. K. Nurminen, R. S. Bennink
[ABSTRACT]
Hybrid variational quantum algorithms (VQAs) are promising for solving
practical problems such as combinatorial optimization, quantum chemistry
simulation, quantum machine learning, and quantum error correction on noisy
quantum computers. However, with a typical random ansatz or quantum alternating
operator ansatz, the derived variational quantum algorithms become black boxes
that cannot be trusted for model interpretation, let alone deployed in
applications that inform critical decisions: the resulting variational
parameters are just rotation angles for the quantum gates and have nothing to
do with interpretable values that a model can provide directly. In this paper,
we construct the first interpretable quantum regression algorithm, in which the
quantum state exactly encodes the classical data table and the variational
parameters correspond directly to the regression coefficients, which are real
numbers by construction, providing a high degree of model interpretability and
minimal cost to optimize due to the right expressiveness. We also take
advantage of the encoded data structure to reduce the time complexity of
computing the regression map. To shorten the circuit depth for nonlinear
regression, our algorithm can be extended by building nonlinear features by
classical preprocessing as the independent encoded column vectors. Although the
authors recently realized a less noisy compressed encoding on superconducting
qubits, we envision potential quantum utilities with multi-qubit gates
implemented in neutral cold atoms and ions.
[LINK]
http://arxiv.org/abs/2307.03334v5
[DATE]
2025-06-26 11:12:31+08:00
[CATEGORIES]
cs.LG
EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora
[AUTHORS]
Fangyuan Zhang, Zhengjun Huang, Yingli Zhou, Qintian Guo, Zhixun Li, Wensheng Luo, Di Jiang, Yixiang Fang, Xiaofang Zhou
[ABSTRACT]
Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large
language models (LLMs) by structuring retrieval over an external corpus.
However, existing approaches typically assume a static corpus, requiring
expensive full-graph reconstruction whenever new documents arrive, limiting
their scalability in dynamic, evolving environments. To address these
limitations, we introduce EraRAG, a novel multi-layered Graph-RAG framework
that supports efficient and scalable dynamic updates. Our method leverages
hyperplane-based Locality-Sensitive Hashing (LSH) to partition and organize the
original corpus into hierarchical graph structures, enabling efficient and
localized insertions of new data without disrupting the existing topology. The
design eliminates the need for retraining or costly recomputation while
preserving high retrieval accuracy and low latency. Experiments on large-scale
benchmarks demonstrate that EraRAG achieves up to an order of magnitude
reduction in update time and token consumption compared to existing Graph-RAG
systems, while providing superior accuracy. This work offers a
practical path forward for RAG systems that must operate over continually
growing corpora, bridging the gap between retrieval efficiency and
adaptability. Our code and data are available at
https://github.com/EverM0re/EraRAG-Official.
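Illustrative note (not from the paper): hyperplane-based LSH assigns each embedding a bit string from the signs of its projections onto random hyperplanes, so a new document can be placed in a bucket without touching the others. The sketch below shows only this flat bucketing step. The function names, the Gaussian hyperplanes, and the dictionary of buckets are assumptions, and EraRAG's hierarchical graph construction is not reproduced here.

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random Gaussian normal vectors, one per hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_planes)]


def lsh_code(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def insert(buckets, doc_id, vector, hyperplanes):
    """Localized insertion: only the bucket matching the code is modified."""
    buckets.setdefault(lsh_code(vector, hyperplanes), []).append(doc_id)
```

Because the code depends only on the direction of the vector, embeddings with high cosine similarity tend to share a bucket, which is what keeps an insertion local.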
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2506.20963v1
[DATE]
2025-06-26 11:01:33+08:00
[CATEGORIES]
cs.LG
Antibody Design and Optimization with Multi-scale Equivariant Graph Diffusion Models for Accurate Complex Antigen Binding
[AUTHORS]
Jiameng Chen, Xiantao Cai, Jia Wu, Wenbin Hu
[ABSTRACT]
Antibody design remains a critical challenge in therapeutic and diagnostic
development, particularly for complex antigens with diverse binding interfaces.
Current computational methods face two main limitations: (1) capturing
geometric features while preserving symmetries, and (2) generalizing to novel
antigen interfaces. Despite recent advancements, these methods often fail to
accurately capture molecular interactions and maintain structural integrity. To
address these challenges, we propose AbMEGD, an end-to-end framework
integrating Multi-scale Equivariant Graph Diffusion for antibody sequence and
structure co-design. Leveraging
advanced geometric deep learning, AbMEGD combines atomic-level geometric
features with residue-level embeddings, capturing local atomic details and
global sequence-structure interactions. Its E(3)-equivariant diffusion method
ensures geometric precision, computational efficiency, and robust
generalizability for complex antigens. Furthermore, experiments using the
SAbDab database demonstrate a 10.13% increase in amino acid recovery, 3.32%
rise in improvement percentage, and a 0.062 Å reduction in root mean square
deviation within the critical CDR-H3 region compared to DiffAb, a leading
antibody design model. These results highlight AbMEGD’s ability to balance
structural integrity with improved functionality, establishing a new benchmark
for sequence-structure co-design and affinity optimization. The code is
available at: https://github.com/Patrick221215/AbMEGD.
[COMMENTS]
9 pages, 4 figures, accepted at IJCAI 2025
[LINK]
http://arxiv.org/abs/2506.20957v1
[DATE]
2025-06-26 10:45:38+08:00
[CATEGORIES]
cs.LG
Forecasting Geopolitical Events with a Sparse Temporal Fusion Transformer and Gaussian Process Hybrid: A Case Study in Middle Eastern and U.S. Conflict Dynamics
[AUTHORS]
Hsin-Hsiung Huang, Hayden Hampton
[ABSTRACT]
Forecasting geopolitical conflict from data sources like the Global Database
of Events, Language, and Tone (GDELT) is a critical challenge for national
security. The inherent sparsity, burstiness, and overdispersion of such data
cause standard deep learning models, including the Temporal Fusion Transformer
(TFT), to produce unreliable long-horizon predictions. We introduce STFT-VNNGP,
a hybrid architecture that won the 2023 Algorithms for Threat Detection (ATD)
competition by overcoming these limitations. Designed to bridge this gap, our
model employs a two-stage process: first, a TFT captures complex temporal
dynamics to generate multi-quantile forecasts. These quantiles then serve as
informed inputs for a Variational Nearest Neighbor Gaussian Process (VNNGP),
which performs principled spatiotemporal smoothing and uncertainty
quantification. In a case study forecasting conflict dynamics in the Middle
East and the U.S., STFT-VNNGP consistently outperforms a standalone TFT,
showing a superior ability to predict the timing and magnitude of bursty event
periods, particularly at long-range horizons. This work offers a robust
framework for generating more reliable and actionable intelligence from
challenging event data, with all code and workflows made publicly available to
ensure reproducibility.
[LINK]
http://arxiv.org/abs/2506.20935v1
[DATE]
2025-06-26 09:53:25+08:00
[CATEGORIES]
cs.LG
Lower Bounds on the Size of Markov Equivalence Classes
[AUTHORS]
Erik Jahn, Frederick Eberhardt, Leonard J. Schulman
[ABSTRACT]
Causal discovery algorithms typically recover causal graphs only up to their
Markov equivalence classes unless additional parametric assumptions are made.
The sizes of these equivalence classes reflect the limits of what can be
learned about the underlying causal graph from purely observational data. Under
the assumptions of acyclicity, causal sufficiency, and a uniform model prior,
Markov equivalence classes are known to be small on average. In this paper, we
show that this is no longer the case when any of these assumptions is relaxed.
Specifically, we prove exponentially large lower bounds for the expected size
of Markov equivalence classes in three settings: sparse random directed acyclic
graphs, uniformly random acyclic directed mixed graphs, and uniformly random
directed cyclic graphs.
[LINK]
http://arxiv.org/abs/2506.20933v1
[DATE]
2025-06-26 09:44:23+08:00
[CATEGORIES]
cs.LG
Quantum Reinforcement Learning Trading Agent for Sector Rotation in the Taiwan Stock Market
[AUTHORS]
Chi-Sheng Chen, Xinyu Zhang, Ya-Chuan Chen
[ABSTRACT]
We propose a hybrid quantum-classical reinforcement learning framework for
sector rotation in the Taiwan stock market. Our system employs Proximal Policy
Optimization (PPO) as the backbone algorithm and integrates both classical
architectures (LSTM, Transformer) and quantum-enhanced models (QNN, QRWKV,
QASA) as policy and value networks. An automated feature engineering pipeline
extracts financial indicators from capital share data to ensure consistent
model input across all configurations. Empirical backtesting reveals a key
finding: although quantum-enhanced models consistently achieve higher training
rewards, they underperform classical models in real-world investment metrics
such as cumulative return and Sharpe ratio. This discrepancy highlights a core
challenge in applying reinforcement learning to financial domains – namely,
the mismatch between proxy reward signals and true investment objectives. Our
analysis suggests that current reward designs may incentivize overfitting to
short-term volatility rather than optimizing risk-adjusted returns. This issue
is compounded by the inherent expressiveness and optimization instability of
quantum circuits under Noisy Intermediate-Scale Quantum (NISQ) constraints. We
discuss the implications of this reward-performance gap and propose directions
for future improvement, including reward shaping, model regularization, and
validation-based early stopping. Our work offers a reproducible benchmark and
critical insights into the practical challenges of deploying quantum
reinforcement learning in real-world finance.
[LINK]
http://arxiv.org/abs/2506.20930v1
[DATE]
2025-06-26 09:29:19+08:00
[CATEGORIES]
cs.LG
Active Learning for Manifold Gaussian Process Regression
[AUTHORS]
Yuanxing Cheng, Lulu Kang, Yiwei Wang, Chun Liu
[ABSTRACT]
This paper introduces an active learning framework for manifold Gaussian
Process (GP) regression, combining manifold learning with strategic data
selection to improve accuracy in high-dimensional spaces. Our method jointly
optimizes a neural network for dimensionality reduction and a Gaussian process
regressor in the latent space, supervised by an active learning criterion that
minimizes global prediction error. Experiments on synthetic data demonstrate
superior performance over random sequential learning. The framework
efficiently handles complex, discontinuous functions while preserving
computational tractability, offering practical value for scientific and
engineering applications. Future work will focus on scalability and
uncertainty-aware manifold learning.
[COMMENTS]
13 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.20928v1
[DATE]
2025-06-26 09:25:39+08:00
[CATEGORIES]
cs.LG
Interpretable Representation Learning for Additive Rule Ensembles
[AUTHORS]
Shahrzad Behzadimanesh, Pierre Le Bodic, Geoffrey I. Webb, Mario Boley
[ABSTRACT]
Small additive ensembles of symbolic rules offer interpretable prediction
models. Traditionally, these ensembles use rule conditions based on
conjunctions of simple threshold propositions $x \geq t$ on a single input
variable $x$ and threshold $t$, resulting geometrically in axis-parallel
polytopes as decision regions. While this form ensures a high degree of
interpretability for individual rules and can be learned efficiently using the
gradient boosting approach, it relies on having access to a curated set of
expressive and ideally independent input features so that a small ensemble of
axis-parallel regions can describe the target variable well. Absent such
features, reaching sufficient accuracy requires increasing the number and
complexity of individual rules, which diminishes the interpretability of the
model. Here, we extend classical rule ensembles by introducing logical
propositions with learnable sparse linear transformations of input variables,
i.e., propositions of the form $\mathbf{x}^\mathrm{T}\mathbf{w} \geq t$, where
$\mathbf{w}$ is a learnable sparse weight vector, enabling decision regions as
general polytopes with oblique faces. We propose a learning method using
sequential greedy optimization based on an iteratively reweighted formulation
of logistic regression. Experimental results demonstrate that the proposed
method efficiently constructs rule ensembles with the same test risk as
state-of-the-art methods while significantly reducing model complexity across
ten benchmark datasets.
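Illustrative note (not from the paper): the propositions $\mathbf{x}^\mathrm{T}\mathbf{w} \geq t$ described above can be evaluated as follows. In this sketch a rule is a conjunction of such propositions with an additive weight, and the ensemble output is the sum of the weights of the rules that fire. The data layout, the `intercept` argument, and the function names are assumptions. The learning procedure (sequential greedy optimization) is not shown.

```python
def proposition(x, w, t):
    """Oblique threshold proposition: x^T w >= t."""
    return sum(xi * wi for xi, wi in zip(x, w)) >= t


def rule_fires(x, conditions):
    """A rule fires when every proposition in its conjunction holds."""
    return all(proposition(x, w, t) for w, t in conditions)


def ensemble_score(x, rules, intercept=0.0):
    """Additive ensemble: intercept plus the weights of all firing rules.

    `rules` is a list of (weight, conditions) pairs, where `conditions`
    is a list of (w, t) pairs.
    """
    return intercept + sum(beta for beta, conditions in rules
                           if rule_fires(x, conditions))
```

An axis-parallel proposition $x_j \geq t$ is the special case in which `w` has a single non-zero entry equal to 1, so the classical rule ensemble fits the same representation.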
[LINK]
http://arxiv.org/abs/2506.20927v1
[DATE]
2025-06-26 09:24:08+08:00
[CATEGORIES]
cs.LG
LLM-guided Chemical Process Optimization with a Multi-Agent Approach
[AUTHORS]
Tong Zeng, Srivathsan Badrinarayanan, Janghoon Ock, Cheng-Kai Lai, Amir Barati Farimani
[ABSTRACT]
Chemical process optimization is crucial to maximize production efficiency
and economic performance. Traditional methods, including gradient-based
solvers, evolutionary algorithms, and parameter grid searches, become
impractical when operating constraints are ill-defined or unavailable,
requiring engineers to rely on subjective heuristics to estimate feasible
parameter ranges. To address this constraint definition bottleneck, we present
a multi-agent framework of large language model (LLM) agents that autonomously
infer operating constraints from minimal process descriptions, then
collaboratively guide optimization using the inferred constraints. Our
AutoGen-based agentic framework employs OpenAI’s o3 model, with specialized
agents for constraint generation, parameter validation, simulation execution,
and optimization guidance. Through two phases - autonomous constraint
generation using embedded domain knowledge, followed by iterative multi-agent
optimization - the framework eliminates the need for predefined operational
bounds. Validated on the hydrodealkylation process across cost, yield, and
yield-to-cost ratio metrics, the framework demonstrated competitive performance
with conventional optimization methods while achieving better computational
efficiency, requiring fewer iterations to converge. Our approach converged in
under 20 minutes, achieving a 31-fold speedup over grid search. Beyond
computational efficiency, the framework’s reasoning-guided search demonstrates
sophisticated process understanding, correctly identifying utility trade-offs
and applying domain-informed heuristics. This approach shows significant
potential for optimization scenarios where operational constraints are poorly
characterized or unavailable, particularly for emerging processes and retrofit
applications.
[COMMENTS]
16 pages (main manuscript without references), 2 figures
[LINK]
http://arxiv.org/abs/2506.20921v1
[DATE]
2025-06-26 09:03:44+08:00
[CATEGORIES]
cs.LG
Explainable AI for Radar Resource Management: Modified LIME in Deep Reinforcement Learning
[AUTHORS]
Ziyang Lu, M. Cenk Gursoy, Chilukuri K. Mohan, Pramod K. Varshney
[ABSTRACT]
Deep reinforcement learning has been extensively studied in decision-making
processes and has demonstrated superior performance over conventional
approaches in various fields, including radar resource management (RRM).
However, a notable limitation of neural networks is their “black box” nature,
and recent research work has increasingly focused on explainable AI (XAI)
techniques to describe the rationale behind neural network decisions. One
promising XAI method is local interpretable model-agnostic explanations (LIME).
However, the sampling process in LIME ignores the correlations between
features. In this paper, we propose a modified LIME approach that integrates
deep learning (DL) into the sampling process, which we refer to as DL-LIME. We
employ DL-LIME within deep reinforcement learning for radar resource
management. Numerical results show that DL-LIME outperforms conventional LIME
in terms of both fidelity and task performance. DL-LIME also provides insights
on which factors are more important in decision making for radar resource
management.
[LINK]
http://arxiv.org/abs/2506.20916v1
[DATE]
2025-06-26 08:49:25+08:00
[CATEGORIES]
cs.LG
Faster Fixed-Point Methods for Multichain MDPs
[AUTHORS]
Matthew Zurek, Yudong Chen
[ABSTRACT]
We study value-iteration (VI) algorithms for solving general (a.k.a.
multichain) Markov decision processes (MDPs) under the average-reward
criterion, a fundamental but theoretically challenging setting. Beyond the
difficulties inherent to all average-reward problems posed by the lack of
contractivity and non-uniqueness of solutions to the Bellman operator, in the
multichain setting an optimal policy must solve the navigation subproblem of
steering towards the best connected component, in addition to optimizing
long-run performance within each component. We develop algorithms which better
solve this navigational subproblem in order to achieve faster convergence for
multichain MDPs, obtaining improved rates of convergence and sharper measures
of complexity relative to prior work. Many key components of our results are of
potential independent interest, including novel connections between
average-reward and discounted problems, optimal fixed-point methods for
discounted VI which extend to general Banach spaces, new sublinear convergence
rates for the discounted value error, and refined suboptimality decompositions
for multichain MDPs. Overall our results yield faster convergence rates for
discounted and average-reward problems and expand the theoretical foundations
of VI approaches.
[LINK]
http://arxiv.org/abs/2506.20910v1
[DATE]
2025-06-26 08:31:21+08:00
[CATEGORIES]
cs.LG
Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL
[AUTHORS]
Matthew Zurek, Guy Zamir, Yudong Chen
[ABSTRACT]
We study offline reinforcement learning in average-reward MDPs, which
presents increased challenges from the perspectives of distribution shift and
non-uniform coverage, and has been relatively underexamined from a theoretical
perspective. While previous work obtains performance guarantees under
single-policy data coverage assumptions, such guarantees utilize additional
complexity measures which are uniform over all policies, such as the uniform
mixing time. We develop sharp guarantees depending only on the target policy,
specifically the bias span and a novel policy hitting radius, yielding the
first fully single-policy sample complexity bound for average-reward offline
RL. We are also the first to handle general weakly communicating MDPs,
in contrast to the restrictive structural assumptions made in prior work. To achieve
this, we introduce an algorithm based on pessimistic discounted value iteration
enhanced by a novel quantile clipping technique, which enables the use of a
sharper empirical-span-based penalty function. Our algorithm also does not
require any prior parameter knowledge for its implementation. Remarkably, we
show via hard examples that learning under our conditions requires coverage
assumptions beyond the stationary distribution of the target policy,
distinguishing single-policy complexity measures from previously examined
cases. We also develop lower bounds nearly matching our main result.
[LINK]
http://arxiv.org/abs/2506.20904v1
[DATE]
2025-06-26 08:22:39+08:00
[CATEGORIES]
cs.LG
Graph-Structured Feedback Multimodel Ensemble Online Conformal Prediction
[AUTHORS]
Erfan Hajihashemi, Yanning Shen
[ABSTRACT]
Online conformal prediction has demonstrated its capability to construct a
prediction set for each incoming data point that covers the true label with a
predetermined probability. To cope with potential distribution shift,
multi-model online conformal prediction has been introduced to select and
leverage different models from a preselected candidate set. Along with the
improved flexibility, the choice of the preselected set also brings challenges.
A candidate set that includes a large number of models may increase the
computational complexity. In addition, the inclusion of irrelevant models with
poor performance may negatively impact the performance and lead to
unnecessarily large prediction sets. To address these challenges, we propose a
novel multi-model online conformal prediction algorithm that identifies a
subset of effective models at each time step by collecting feedback from a
bipartite graph, which is refined upon receiving new data. A model is then
selected from this subset to construct the prediction set, resulting in reduced
computational complexity and smaller prediction sets. Additionally, we
demonstrate that using prediction set size as feedback, alongside model loss,
can significantly improve efficiency by constructing smaller prediction sets
while still satisfying the required coverage guarantee. The proposed algorithms
are proven to ensure valid coverage and achieve sublinear regret. Experiments
on real and synthetic datasets validate that the proposed methods construct
smaller prediction sets and outperform existing multi-model online conformal
prediction approaches.
[LINK]
http://arxiv.org/abs/2506.20898v1
[DATE]
2025-06-26 08:06:11+08:00
[CATEGORIES]
cs.LG
Next-token prediction capacity: general upper bounds and a lower bound for transformers
[AUTHORS]
Liam Madden, Curtis Fox, Christos Thrampoulidis
[ABSTRACT]
Given a sequence of tokens, such as words, the task of next-token prediction
is to predict the next-token conditional probability distribution. Decoder-only
transformers have become effective models for this task, but their properties
are still not fully understood. In particular, the largest number of distinct
context sequences that a decoder-only transformer can interpolate next-token
distributions for has not been established. To fill this gap, we prove upper
and lower bounds on this number, which are equal up to a multiplicative
constant. We prove these bounds in the general setting where next-token
distributions can be arbitrary as well as the empirical setting where they are
calculated from a finite number of document sequences. Our lower bounds are for
one-layer multi-head decoder-only transformers and our proofs highlight an
important injectivity property satisfied by self-attention. Furthermore, we
provide numerical evidence that the minimal number of parameters for
memorization is sufficient for being able to train the model to the entropy
lower bound.
[COMMENTS]
V3: added two examples, a remark, and a second experiment where only
the FNN layers are trained
[LINK]
http://arxiv.org/abs/2405.13718v3
[DATE]
2025-06-26 07:53:42+08:00
[CATEGORIES]
cs.LG
HyperINF: Unleashing the HyperPower of the Schulz’s Method for Data Influence Estimation
[AUTHORS]
Xinyu Zhou, Simin Fan, Martin Jaggi
[ABSTRACT]
Influence functions provide a principled method to assess the contribution of
individual training samples to a specific target. Yet, their high computational
costs limit their applications on large-scale models and datasets. Existing
methods proposed for influence function approximation have significantly
reduced the computational overheads. However, they mostly suffer from
inaccurate estimation due to the lack of strong convergence guarantees from the
algorithm. The family of hyperpower methods is well known for its rigorous
convergence guarantees on matrix inverse approximation, but its matrix
multiplication operations can incur intractable memory and computation costs
on large-scale models. We propose HyperINF, an efficient and accurate influence
function approximation method which leverages the hyperpower method,
specifically Schulz’s iterative algorithm. To deal with the
computation-intensive matrix multiplication, we incorporate the generalized
Fisher information (GFIM) as a low-rank approximation of the Hessian matrix,
which reduces the memory and computation overheads to constant costs
independent of ranks on LoRA-tuned models. We first demonstrate the superior
accuracy and stability of HyperINF compared to other baselines through a
synthetic convergence simulation for matrix inversion. We further validate the
efficacy of HyperINF through extensive real-world data attribution tasks,
including mislabeled data detection and data selection for LLM and VLM
fine-tuning. On LoRA-tuned models, HyperINF achieves superior downstream
performance with minimal memory and computational overhead, while other
baselines suffer from significant degradation. Our codebase is available at
https://github.com/Blackzxy/HyperINF.
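Illustrative note (not from the paper): Schulz's iteration approximates a matrix inverse by repeating $X_{k+1} = X_k (2I - A X_k)$, and converges when the initial residual $I - A X_0$ has spectral radius below 1. The sketch below runs it on a small dense matrix with the common initialization $X_0 = A^\mathrm{T} / (\|A\|_1 \|A\|_\infty)$. The function names and iteration count are assumptions, and the GFIM approximation and LoRA-specific details of HyperINF are not reproduced.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def schulz_inverse(a, iterations=30):
    """Approximate the inverse of a square matrix with Schulz's iteration."""
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    scale = norm_1 * norm_inf
    # X_0 = A^T / (||A||_1 * ||A||_inf) keeps the initial residual contractive.
    x = [[a[j][i] / scale for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        ax = matmul(a, x)
        # correction = 2I - A X_k
        correction = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                      for i in range(n)]
        x = matmul(x, correction)
    return x
```

Each step uses only matrix multiplications, which is why the cost of those products dominates on large models.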
[LINK]
http://arxiv.org/abs/2410.05090v2
[DATE]
2025-06-26 07:23:23+08:00
[CATEGORIES]
cs.LG
Complex Model Transformations by Reinforcement Learning with Uncertain Human Guidance
[AUTHORS]
Kyanna Dagenais, Istvan David
[ABSTRACT]
Model-driven engineering problems often require complex model transformations
(MTs), i.e., MTs that are chained in extensive sequences. Pertinent examples of
such problems include model synchronization, automated model repair, and design
space exploration. Manually developing complex MTs is an error-prone and often
infeasible process. Reinforcement learning (RL) is an apt way to alleviate
these issues. In RL, an autonomous agent explores the state space through trial
and error to identify beneficial sequences of actions, such as MTs. However, RL
methods exhibit performance issues in complex problems. In these situations,
human guidance can be of high utility. In this paper, we present an approach
and technical framework for developing complex MT sequences through RL, guided
by potentially uncertain human advice. Our framework allows user-defined MTs to
be mapped onto RL primitives, and executes them as RL programs to find optimal
MT sequences. Our evaluation shows that human guidance, even if uncertain,
substantially improves RL performance, and results in more efficient
development of complex MTs. Through a trade-off between the certainty and
timeliness of human advice, our method takes a step towards RL-driven
human-in-the-loop engineering methods.
[COMMENTS]
Accepted for ACM/IEEE MODELS’25
[LINK]
http://arxiv.org/abs/2506.20883v1
[DATE]
2025-06-26 07:10:12+08:00
[CATEGORIES]
cs.LG
Always Skip Attention
[AUTHORS]
Yiping Ji, Hemanth Saratchandran, Peyman Moghadam, Simon Lucey
[ABSTRACT]
We highlight a curious empirical result within modern Vision Transformers
(ViTs). Specifically, self-attention catastrophically fails to train unless it
is used in conjunction with a skip connection. This is in contrast to other
elements of a ViT that continue to exhibit good performance (albeit suboptimal)
when skip connections are removed. Further, we show that this critical
dependence on skip connections is a relatively new phenomenon, with previous
deep architectures (e.g., CNNs) exhibiting good performance in their absence. In
this paper, we show theoretically that the self-attention mechanism is
fundamentally ill-conditioned and is, therefore, uniquely dependent on skip
connections for regularization. Additionally, we propose Token Graying – a
simple yet effective complement (to skip connections) that further improves the
condition of input tokens. We validate our approach in both supervised and
self-supervised training methods.
[COMMENTS]
This work has just been accepted by ICCV 2025
[LINK]
http://arxiv.org/abs/2505.01996v2
[DATE]
2025-06-26 07:06:43+08:00
[CATEGORIES]
cs.LG
Empowering Digital Agriculture: A Privacy-Preserving Framework for Data Sharing and Collaborative Research
[AUTHORS]
Osama Zafar, Rosemarie Santa González, Mina Namazi, Alfonso Morales, Erman Ayday
[ABSTRACT]
Data-driven agriculture, which integrates technology and data into
agricultural practices, has the potential to improve crop yield, disease
resilience, and long-term soil health. However, privacy concerns, such as
adverse pricing, discrimination, and resource manipulation, deter farmers from
sharing data, as it can be used against them. To address this barrier, we
propose a privacy-preserving framework that enables secure data sharing and
collaboration for research and development while mitigating privacy risks. The
framework combines dimensionality reduction techniques (like Principal
Component Analysis (PCA)) and differential privacy by introducing Laplacian
noise to protect sensitive information. The proposed framework allows
researchers to identify potential collaborators for a target farmer and train
personalized machine learning models either on the data of identified
collaborators via federated learning or directly on the aggregated
privacy-protected data. It also allows farmers to identify potential
collaborators based on similarities. We have validated this on real-life
datasets, demonstrating robust privacy protection against adversarial attacks
and utility performance comparable to a centralized system. We demonstrate how
this framework can facilitate collaboration among farmers and help researchers
pursue broader research objectives. The adoption of the framework can empower
researchers and policymakers to leverage agricultural data responsibly, paving
the way for transformative advances in data-driven agriculture. By addressing
critical privacy challenges, this work supports secure data integration,
fostering innovation and sustainability in agricultural systems.
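Illustrative note (not from the paper): the noise step described above can be sketched with the standard Laplace mechanism, which adds noise of scale sensitivity / epsilon to each released value. The function name, the `sensitivity` and `epsilon` parameters, and the seed handling are assumptions. The abstract does not state how the authors calibrate the noise or how it is combined with PCA.

```python
import random


def privatize(values, sensitivity, epsilon, seed=None):
    """Add Laplace(0, sensitivity / epsilon) noise to each value.

    The noise is sampled as the difference of two independent exponential
    variables with mean `scale`, which follows a Laplace distribution.
    """
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    rate = 1.0 / scale
    return [v + rng.expovariate(rate) - rng.expovariate(rate)
            for v in values]
```

A smaller `epsilon` gives a larger noise scale and hence stronger privacy at the cost of utility. A fixed seed is useful only for reproducible experiments and should not be used when releasing real data.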
[COMMENTS]
arXiv admin note: text overlap with arXiv:2409.06069
[LINK]
http://arxiv.org/abs/2506.20872v1
[DATE]
2025-06-26 06:46:30+08:00
[CATEGORIES]
cs.LG
High-dimensional Contextual Bandit Problem without Sparsity
[AUTHORS]
Junpei Komiyama, Masaaki Imaizumi
[ABSTRACT]
In this research, we investigate the high-dimensional linear contextual
bandit problem where the number of features $p$ is greater than the budget $T$,
or it may even be infinite. Differing from the majority of previous works in
this field, we do not impose sparsity on the regression coefficients. Instead,
we rely on recent findings on overparameterized models, which enable us to
analyze the performance of the minimum-norm interpolating estimator when data
distributions have small effective ranks. We propose an explore-then-commit
(EtC) algorithm to address this problem and examine its performance. Through
our analysis, we derive the optimal rate of the EtC algorithm in terms of $T$
and show that this rate can be achieved by balancing exploration and
exploitation. Moreover, we introduce an adaptive explore-then-commit (AEtC)
algorithm that adaptively finds the optimal balance. We assess the performance
of the proposed algorithms through a series of simulations.
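Illustrative note (not from the paper): the explore-then-commit idea can be shown on a much simpler problem than the one studied above. The sketch swaps in a plain multi-armed bandit with Gaussian rewards in place of the high-dimensional linear contextual bandit, and uses empirical means in place of the minimum-norm interpolating estimator. The function name, arguments, and reward model are all assumptions.

```python
import random


def explore_then_commit(arm_means, horizon, explore_rounds, seed=0):
    """Pull arms round-robin for `explore_rounds` steps, then commit.

    Returns (committed_arm, total_reward). `committed_arm` is None when the
    horizon ends before the exploration phase does.
    """
    rng = random.Random(seed)
    num_arms = len(arm_means)
    totals = [0.0] * num_arms
    counts = [0] * num_arms
    committed_arm = None
    total_reward = 0.0
    for t in range(horizon):
        if t < explore_rounds:
            arm = t % num_arms  # exploration phase
        else:
            if committed_arm is None:
                # Commit once to the arm with the best empirical mean.
                committed_arm = max(
                    range(num_arms),
                    key=lambda i: (totals[i] / counts[i]
                                   if counts[i] else float("-inf")),
                )
            arm = committed_arm  # exploitation phase
        reward = rng.gauss(arm_means[arm], 1.0)
        totals[arm] += reward
        counts[arm] += 1
        total_reward += reward
    return committed_arm, total_reward
```

The choice of `explore_rounds` is the exploration-exploitation balance the abstract refers to: too few rounds risks committing to a poor arm, and too many wastes the budget $T$ on exploration.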
[LINK]
http://arxiv.org/abs/2306.11017v2
[DATE]
2025-06-26 06:16:22+08:00
[CATEGORIES]
cs.LG
Multi-Objective Reinforcement Learning for Cognitive Radar Resource Management
[AUTHORS]
Ziyang Lu, Subodh Kalia, M. Cenk Gursoy, Chilukuri K. Mohan, Pramod K. Varshney
[ABSTRACT]
The time allocation problem in multi-function cognitive radar systems focuses
on the trade-off between scanning for newly emerging targets and tracking the
previously detected targets. We formulate this as a multi-objective
optimization problem and employ deep reinforcement learning to find
Pareto-optimal solutions, comparing the deep deterministic policy gradient
(DDPG) and soft actor-critic (SAC) algorithms. Our results demonstrate the
effectiveness of both algorithms in adapting to various scenarios, with SAC
showing improved stability and sample efficiency compared to DDPG. We further
employ the NSGA-II algorithm to estimate an upper bound on the Pareto front of
the considered problem. This work contributes to the development of more
efficient and adaptive cognitive radar systems capable of balancing multiple
competing objectives in dynamic environments.
[LINK]
http://arxiv.org/abs/2506.20853v1
[DATE]
2025-06-26 05:56:30+08:00
[CATEGORIES]
cs.LG
InterFormer: Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction
[AUTHORS]
Zhichen Zeng, Xiaolong Liu, Mengyue Hang, Xiaoyi Liu, Qinghai Zhou, Chaofei Yang, Yiqun Liu, Yichen Ruan, Laming Chen, Yuxin Chen, Yujia Hao, Jiaqi Xu, Jade Nie, Xi Liu, Buyun Zhang, Wei Wen, Siyang Yuan, Hang Yin, Xin Zhang, Kai Wang, Wen-Yen Chen, Yiping Han, Huayu Li, Chunzhi Yang, Bo Long, Philip S. Yu, Hanghang Tong, Jiyan Yang
[ABSTRACT]
Click-through rate (CTR) prediction, which predicts the probability of a user
clicking an ad, is a fundamental task in recommender systems. The emergence of
heterogeneous information, such as user profile and behavior sequences, depicts
user interests from different aspects. A mutually beneficial integration of
heterogeneous information is the cornerstone towards the success of CTR
prediction. However, most existing methods suffer from two fundamental
limitations: (1) insufficient inter-mode interaction due to the
unidirectional information flow between modes, and (2) aggressive information
aggregation caused by early summarization, resulting in excessive information
loss. To address the above limitations, we propose a novel module named
InterFormer to learn heterogeneous information interaction in an interleaving
style. To achieve better interaction learning, InterFormer enables
bidirectional information flow for mutually beneficial learning across
different modes. To avoid aggressive information aggregation, we retain
complete information in each data mode and use a separate bridging arch for
effective information selection and summarization. Our proposed InterFormer
achieves state-of-the-art performance on three public datasets and a
large-scale industrial dataset.
[COMMENTS]
11 pages, 6 figures
[LINK]
http://arxiv.org/abs/2411.09852v3
[DATE]
2025-06-26 05:48:04+08:00
[CATEGORIES]
cs.LG
Learning-Based Resource Management in Integrated Sensing and Communication Systems
[AUTHORS]
Ziyang Lu, M. Cenk Gursoy, Chilukuri K. Mohan, Pramod K. Varshney
[ABSTRACT]
In this paper, we tackle the task of adaptive time allocation in integrated
sensing and communication systems equipped with radar and communication units.
The dual-functional radar-communication system’s task involves allocating dwell
times for tracking multiple targets and utilizing the remaining time for data
transmission towards estimated target locations. We introduce a novel
constrained deep reinforcement learning (CDRL) approach, designed to optimize
resource allocation between tracking and communication under time budget
constraints, thereby enhancing target communication quality. Our numerical
results demonstrate the efficiency of our proposed CDRL framework, confirming
its ability to maximize communication quality in highly dynamic environments
while adhering to time constraints.
[LINK]
http://arxiv.org/abs/2506.20849v1
[DATE]
2025-06-26 05:44:07+08:00
[CATEGORIES]
cs.LG
Uncertainty-Aware Machine-Learning Framework for Predicting Dislocation Plasticity and Stress-Strain Response in FCC Alloys
[AUTHORS]
Jing Luo, Yejun Gu, Yanfei Wang, Xiaolong Ma, Jaafar. A El-Awady
[ABSTRACT]
Machine learning has significantly advanced the understanding and application
of structural materials, with an increasing emphasis on integrating existing
data and quantifying uncertainties in predictive modeling. This study presents
a comprehensive methodology utilizing a mixture density network (MDN) model,
trained on extensive experimental data from literature. This approach uniquely
predicts the distribution of dislocation density, inferred as a latent
variable, and the resulting stress distribution at the grain level. The
incorporation of statistical parameters of those predicted distributions into a
dislocation-mediated plasticity model allows for accurate stress-strain
predictions with explicit uncertainty quantification. This strategy not only
improves the accuracy and reliability of mechanical property predictions but
also plays a vital role in optimizing alloy design, thereby facilitating the
development of new materials in a rapidly evolving industry.
[LINK]
http://arxiv.org/abs/2506.20839v1
[DATE]
2025-06-26 05:18:14+08:00
[CATEGORIES]
cs.LG
Discovering Global False Negatives On the Fly for Self-supervised Contrastive Learning
[AUTHORS]
Vicente Balmaseda, Bokun Wang, Ching-Long Lin, Tianbao Yang
[ABSTRACT]
In self-supervised contrastive learning, negative pairs are typically
constructed using an anchor image and a sample drawn from the entire dataset,
excluding the anchor. However, this approach can result in the creation of
negative pairs with similar semantics, referred to as “false negatives”,
leading to their embeddings being falsely pushed apart. To address this issue,
we introduce GloFND, an optimization-based approach that automatically learns
on the fly the threshold for each anchor data to identify its false negatives
during training. In contrast to previous methods for false negative discovery,
our approach globally detects false negatives across the entire dataset rather
than locally within the mini-batch. Moreover, its per-iteration computation
cost remains independent of the dataset size. Experimental results on image and
image-text data demonstrate the effectiveness of the proposed method. Our
implementation is available at https://github.com/vibalcam/GloFND.
[COMMENTS]
Accepted to ICML 2025
[LINK]
http://arxiv.org/abs/2502.20612v2
[DATE]
2025-06-26 05:11:53+08:00
[CATEGORIES]
cs.LG
Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data
[AUTHORS]
Lingkai Kong, Haichuan Wang, Tonghan Wang, Guojun Xiong, Milind Tambe
[ABSTRACT]
Incorporating pre-collected offline data from a source environment can
significantly improve the sample efficiency of reinforcement learning (RL), but
this benefit is often challenged by discrepancies between the transition
dynamics of the source and target environments. Existing methods typically
address this issue by penalizing or filtering out source transitions in high
dynamics-gap regions. However, their estimation of the dynamics gap often
relies on KL divergence or mutual information, which can be ill-defined when
the source and target dynamics have disjoint support. To overcome these
limitations, we propose CompFlow, a method grounded in the theoretical
connection between flow matching and optimal transport. Specifically, we model
the target dynamics as a conditional flow built upon the output distribution of
the source-domain flow, rather than learning it directly from a Gaussian prior.
This composite structure offers two key advantages: (1) improved generalization
for learning target dynamics, and (2) a principled estimation of the dynamics
gap via the Wasserstein distance between source and target transitions.
Leveraging our principled estimation of the dynamics gap, we further introduce
an optimistic active data collection strategy that prioritizes exploration in
regions of high dynamics gap, and theoretically prove that it reduces the
performance disparity with the optimal policy. Empirically, CompFlow
outperforms strong baselines across several RL benchmarks with shifted
dynamics.
[LINK]
http://arxiv.org/abs/2505.23062v2
[DATE]
2025-06-26 05:09:46+08:00
[CATEGORIES]
cs.LG
Harnessing the Universal Geometry of Embeddings
[AUTHORS]
Rishi Jha, Collin Zhang, Vitaly Shmatikov, John X. Morris
[ABSTRACT]
We introduce the first method for translating text embeddings from one vector
space to another without any paired data, encoders, or predefined sets of
matches. Our unsupervised approach translates any embedding to and from a
universal latent representation (i.e., a universal semantic structure
conjectured by the Platonic Representation Hypothesis). Our translations
achieve high cosine similarity across model pairs with different architectures,
parameter counts, and training datasets.
The ability to translate unknown embeddings into a different space while
preserving their geometry has serious implications for the security of vector
databases. An adversary with access only to embedding vectors can extract
sensitive information about the underlying documents, sufficient for
classification and attribute inference.
[LINK]
http://arxiv.org/abs/2505.12540v3
[DATE]
2025-06-26 05:04:02+08:00
[CATEGORIES]
cs.LG
TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation
[AUTHORS]
Amin Karimi Monsefi, Mridul Khurana, Rajiv Ramnath, Anuj Karpatne, Wei-Lun Chao, Cheng Zhang
[ABSTRACT]
We propose TaxaDiffusion, a taxonomy-informed training framework for
diffusion models to generate fine-grained animal images with high morphological
and identity accuracy. Unlike standard approaches that treat each species as an
independent category, TaxaDiffusion incorporates domain knowledge that many
species exhibit strong visual similarities, with distinctions often residing in
subtle variations of shape, pattern, and color. To exploit these relationships,
TaxaDiffusion progressively trains conditioned diffusion models across
different taxonomic levels – starting from broad classifications such as Class
and Order, refining through Family and Genus, and ultimately distinguishing at
the Species level. This hierarchical learning strategy first captures
coarse-grained morphological traits shared by species with common ancestors,
facilitating knowledge transfer before refining fine-grained differences for
species-level distinction. As a result, TaxaDiffusion enables accurate
generation even with limited training samples per species. Extensive
experiments on three fine-grained animal datasets demonstrate that TaxaDiffusion outperforms
existing approaches, achieving superior fidelity in fine-grained animal image
generation. Project page: https://amink8.github.io/TaxaDiffusion/
[COMMENTS]
Accepted to ICCV 2025
[LINK]
http://arxiv.org/abs/2506.01923v2
[DATE]
2025-06-26 05:02:25+08:00
[CATEGORIES]
cs.LG
Efficacy of Temporal Fusion Transformers for Runoff Simulation
[AUTHORS]
Sinan Rasiya Koya, Tirthankar Roy
[ABSTRACT]
Combining attention with recurrence has been shown to be valuable in sequence
modeling, including hydrological predictions. Here, we explore the strength of
Temporal Fusion Transformers (TFTs) over Long Short-Term Memory (LSTM) networks
in rainfall-runoff modeling. We train ten randomly initialized models, TFT and
LSTM, for 531 CAMELS catchments in the US. We repeat the experiment with five
subsets of the Caravan dataset, each representing catchments in the US,
Australia, Brazil, Great Britain, and Chile. Then, the performance of the
models, their variability regarding the catchment attributes, and the
difference according to the datasets are assessed. Our findings show that TFT
slightly outperforms LSTM, especially in simulating the midsection and peak of
hydrographs. Furthermore, we show the ability of TFT to handle longer sequences
and why it can be a better candidate for higher or larger catchments. Being an
explainable AI technique, TFT identifies the key dynamic and static variables,
providing valuable scientific insights. However, both TFT and LSTM exhibit a
considerable drop in performance with the Caravan dataset, indicating possible
data quality issues. Overall, the study highlights the potential of TFT in
improving hydrological modeling and understanding.
[LINK]
http://arxiv.org/abs/2506.20831v1
[DATE]
2025-06-26 04:58:28+08:00
[CATEGORIES]
cs.LG
Universal and Efficient Detection of Adversarial Data through Nonuniform Impact on Network Layers
[AUTHORS]
Furkan Mumcu, Yasin Yilmaz
[ABSTRACT]
Deep Neural Networks (DNNs) are notoriously vulnerable to adversarial input
designs with limited noise budgets. While numerous successful attacks with
subtle modifications to original input have been proposed, defense techniques
against these attacks are relatively understudied. Existing defense approaches
either focus on improving DNN robustness by negating the effects of
perturbations or use a secondary model to detect adversarial data. Although
equally important, the attack detection approach, which is studied in this
work, provides a more practical defense compared to the robustness approach. We
show that the existing detection methods are either ineffective against the
state-of-the-art attack techniques or computationally inefficient for real-time
processing. We propose a novel universal and efficient method to detect
adversarial examples by analyzing the varying degrees of impact of attacks on
different DNN layers. Our method trains a lightweight regression model that
predicts deeper-layer features from early-layer features, and uses the
prediction error to detect adversarial samples. Through theoretical arguments
and extensive experiments, we demonstrate that our detection method is highly
effective, computationally efficient for real-time processing, compatible with
any DNN architecture, and applicable across different domains, such as image,
video, and audio.
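The detection idea described above (regress deeper-layer features on early-layer features, then flag inputs whose prediction error is unusually large) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the features are synthetic, the regressor is plain ridge regression, the threshold is the 99th percentile of clean-data errors, and the names `fit_layer_regressor` and `prediction_error` are hypothetical.

```python
import numpy as np

def _with_bias(features):
    # Append a constant column so the regression can learn an offset.
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_layer_regressor(early, deep, ridge=1e-3):
    """Ridge regression from early-layer to deeper-layer features (illustrative)."""
    X = _with_bias(early)
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ deep)  # shape: (d_early + 1, d_deep)

def prediction_error(W, early, deep):
    """Per-sample L2 error between predicted and observed deeper-layer features."""
    return np.linalg.norm(_with_bias(early) @ W - deep, axis=1)

# Synthetic stand-ins for features extracted from a real network (assumption).
rng = np.random.default_rng(0)
M = rng.normal(size=(8, 4))
early_clean = rng.normal(size=(500, 8))
deep_clean = early_clean @ M + 0.05 * rng.normal(size=(500, 4))

W = fit_layer_regressor(early_clean, deep_clean)
clean_err = prediction_error(W, early_clean, deep_clean)
threshold = np.quantile(clean_err, 0.99)  # assumed calibration rule

# Mimic an attack: a tiny change in early features, a large change in deep features.
early_adv = early_clean[:50] + 0.01 * rng.normal(size=(50, 8))
deep_adv = deep_clean[:50] + 1.0 * rng.normal(size=(50, 4))
flags = prediction_error(W, early_adv, deep_adv) > threshold
```

In practice the two feature sets would come from chosen layers of the protected DNN, and the threshold would be calibrated on held-out clean data.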
[COMMENTS]
arXiv admin note: substantial text overlap with arXiv:2410.17442
[LINK]
http://arxiv.org/abs/2506.20816v1
[DATE]
2025-06-26 04:30:28+08:00
[CATEGORIES]
cs.LG
Divide, Specialize, and Route: A New Approach to Efficient Ensemble Learning
[AUTHORS]
Jakub Piwko, Jędrzej Ruciński, Dawid Płudowski, Antoni Zajko, Patryzja Żak, Mateusz Zacharecki, Anna Kozak, Katarzyna Woźnica
[ABSTRACT]
Ensemble learning has proven effective in boosting predictive performance,
but traditional methods such as bagging, boosting, and dynamic ensemble
selection (DES) suffer from high computational cost and limited adaptability to
heterogeneous data distributions. To address these limitations, we propose
Hellsemble, a novel and interpretable ensemble framework for binary
classification that leverages dataset complexity during both training and
inference. Hellsemble incrementally partitions the dataset into circles of
difficulty by iteratively passing misclassified instances from simpler models
to subsequent ones, forming a committee of specialised base learners. Each
model is trained on increasingly challenging subsets, while a separate router
model learns to assign new instances to the most suitable base model based on
inferred difficulty. Hellsemble achieves strong classification accuracy while
maintaining computational efficiency and interpretability. Experimental results
on OpenML-CC18 and Tabzilla benchmarks demonstrate that Hellsemble often
outperforms classical ensemble methods. Our findings suggest that embracing
instance-level difficulty offers a promising direction for constructing
efficient and robust ensemble systems.
[COMMENTS]
14 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.20814v1
[DATE]
2025-06-26 04:26:04+08:00
[CATEGORIES]
cs.LG
FINN-GL: Generalized Mixed-Precision Extensions for FPGA-Accelerated LSTMs
[AUTHORS]
Shashwat Khandelwal, Jakoba Petri-Koenig, Thomas B. Preußer, Michaela Blott, Shreejith Shanker
[ABSTRACT]
Recurrent neural networks (RNNs), particularly LSTMs, are effective for
time-series tasks like sentiment analysis and short-term stock prediction.
However, their computational complexity poses challenges for real-time
deployment in resource constrained environments. While FPGAs offer a promising
platform for energy-efficient AI acceleration, existing tools mainly target
feed-forward networks, and LSTM acceleration typically requires full custom
implementation. In this paper, we address this gap by leveraging the
open-source and extensible FINN framework to enable the generalized deployment
of LSTMs on FPGAs. Specifically, we leverage the Scan operator from the Open
Neural Network Exchange (ONNX) specification to model the recurrent nature of
LSTM computations, enabling support for mixed quantisation within them and
functional verification of LSTM-based models. Furthermore, we introduce custom
transformations within the FINN compiler to map the quantised ONNX computation
graph to hardware blocks from the HLS kernel library of the FINN compiler and
Vitis HLS. We validate the proposed tool-flow by training a quantised ConvLSTM
model for a mid-price stock prediction task using the widely used dataset and
generating a corresponding hardware IP of the model using our flow, targeting
the XCZU7EV device. We show that the generated quantised ConvLSTM accelerator
through our flow achieves a balance between performance (latency) and resource
consumption, while matching (or bettering) inference accuracy of
state-of-the-art models with reduced precision. We believe that the
generalisable nature of the proposed flow will pave the way for
resource-efficient RNN accelerator designs on FPGAs.
[COMMENTS]
9 pages, 6 figures, 5 tables, Accepted for publication in IEEE
FPL-2025 (https://2025.fpl.org/)
[LINK]
http://arxiv.org/abs/2506.20810v1
[DATE]
2025-06-26 04:07:46+08:00
[CATEGORIES]
cs.LG
GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization
[AUTHORS]
Martin Andrews, Sam Witteveen
[COMMENTS]
4 page paper plus Appendices. Accepted to the ES-FoMo “Efficient
Systems for Foundation Models” workshop at ICML 2025
[LINK]
http://arxiv.org/abs/2506.20807v1
[DATE]
2025-06-26 03:59:34+08:00
[CATEGORIES]
cs.LG
Structural System Identification via Validation and Adaptation
[AUTHORS]
Cristian López, Keegan J. Moore
[ABSTRACT]
Estimating the governing equation parameter values is essential for
integrating experimental data with scientific theory to understand, validate,
and predict the dynamics of complex systems. In this work, we propose a new
method for structural system identification (SI), uncertainty quantification,
and validation directly from data. Inspired by generative modeling frameworks,
a neural network maps random noise to physically meaningful parameters. These
parameters are then used in the known equation of motion to obtain fake
accelerations, which are compared to real training data via a mean square error
loss. To simultaneously validate the learned parameters, we use independent
validation datasets. The generated accelerations from these datasets are
evaluated by a discriminator network, which determines whether the output is
real or fake, and guides the parameter-generator network. Analytical and real
experiments show the parameter estimation accuracy and model validation for
different nonlinear structural systems.
[LINK]
http://arxiv.org/abs/2506.20799v1
[DATE]
2025-06-26 03:43:23+08:00
[CATEGORIES]
cs.LG
Stochastic Parameter Decomposition
[AUTHORS]
Lucius Bushnaq, Dan Braun, Lee Sharkey
[ABSTRACT]
A key step in reverse engineering neural networks is to decompose them into
simpler parts that can be studied in relative isolation. Linear parameter
decomposition – a framework that has been proposed to resolve several issues
with current decomposition methods – decomposes neural network parameters into
a sum of sparsely used vectors in parameter space. However, the current main
method in this framework, Attribution-based Parameter Decomposition (APD), is
impractical on account of its computational cost and sensitivity to
hyperparameters. In this work, we introduce Stochastic Parameter
Decomposition (SPD), a method that is more scalable and robust to
hyperparameters than APD, which we demonstrate by decomposing models that are
slightly larger and more complex than was possible to decompose with APD. We
also show that SPD avoids other issues, such as shrinkage of the learned
parameters, and better identifies ground truth mechanisms in toy models. By
bridging causal mediation analysis and network decomposition methods, this
demonstration opens up new research possibilities in mechanistic
interpretability by removing barriers to scaling linear parameter decomposition
methods to larger models. We release a library for running SPD and reproducing
our experiments at https://github.com/goodfire-ai/spd.
[LINK]
http://arxiv.org/abs/2506.20790v1
[DATE]
2025-06-26 03:26:31+08:00
[CATEGORIES]
cs.LG
Spiking Neural Networks for SAR Interferometric Phase Unwrapping: A Theoretical Framework for Energy-Efficient Processing
[AUTHORS]
Marc Bara
[ABSTRACT]
We present the first theoretical framework for applying spiking neural
networks (SNNs) to synthetic aperture radar (SAR) interferometric phase
unwrapping. Despite extensive research in both domains, our comprehensive
literature review confirms that SNNs have never been applied to phase
unwrapping, representing a significant gap in current methodologies. As Earth
observation data volumes continue to grow exponentially (with missions like
NISAR expected to generate 100PB in two years) energy-efficient processing
becomes critical for sustainable data center operations. SNNs, with their
event-driven computation model, offer potential energy savings of 30-100x
compared to conventional approaches while maintaining comparable accuracy. We
develop spike encoding schemes specifically designed for wrapped phase data,
propose SNN architectures that leverage the spatial propagation nature of phase
unwrapping, and provide theoretical analysis of computational complexity and
convergence properties. Our framework demonstrates how the temporal dynamics
inherent in SNNs can naturally model the spatial continuity constraints
fundamental to phase unwrapping. This work opens a new research direction at
the intersection of neuromorphic computing and SAR interferometry, offering a
complementary approach to existing algorithms that could enable more
sustainable large-scale InSAR processing.
[COMMENTS]
8 pages, 2 figures, patent pending
[LINK]
http://arxiv.org/abs/2506.20782v1
[DATE]
2025-06-26 03:12:16+08:00
[CATEGORIES]
cs.LG
Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon
[AUTHORS]
Tongtong Liang, Dan Qiao, Yu-Xiang Wang, Rahul Parhi
[ABSTRACT]
We study the implicit bias of flatness / low (loss) curvature and its effects
on generalization in two-layer overparameterized ReLU networks with
multivariate inputs – a problem well motivated by the minima stability and
edge-of-stability phenomena in gradient-descent training. Existing work either
requires interpolation or focuses only on univariate inputs. This paper
presents new and somewhat surprising theoretical results for multivariate
inputs. On two natural settings (1) generalization gap for flat solutions, and
(2) mean-squared error (MSE) in nonparametric function estimation by stable
minima, we prove upper and lower bounds, which establish that while flatness
does imply generalization, the resulting rates of convergence necessarily
deteriorate exponentially as the input dimension grows. This gives an
exponential separation between the flat solutions vis-à-vis low-norm
solutions (i.e., weight decay), which are known not to suffer from the curse of
dimensionality. In particular, our minimax lower bound construction, based on a
novel packing argument with boundary-localized ReLU neurons, reveals how flat
solutions can exploit a kind of “neural shattering” where neurons rarely
activate, but with high weight magnitudes. This leads to poor performance in
high dimensions. We corroborate these theoretical findings with extensive
numerical simulations. To the best of our knowledge, our analysis provides the
first systematic explanation for why flat minima may fail to generalize in high
dimensions.
[COMMENTS]
Comments Welcome!
[LINK]
http://arxiv.org/abs/2506.20779v1
[DATE]
2025-06-26 03:10:03+08:00
[CATEGORIES]
cs.LG
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
[AUTHORS]
Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, Sergey Levine
[ABSTRACT]
Robotic control policies learned from human demonstrations have achieved
impressive results in many real-world applications. However, in scenarios where
initial performance is not satisfactory, as is often the case in novel
open-world settings, such behavioral cloning (BC)-learned policies typically
require collecting additional human demonstrations to further improve their
behavior – an expensive and time-consuming process. In contrast, reinforcement
learning (RL) holds the promise of enabling autonomous online policy
improvement, but often falls short of achieving this due to the large number of
samples it typically requires. In this work we take steps towards enabling fast
autonomous adaptation of BC-trained policies via efficient real-world RL.
Focusing in particular on diffusion policies – a state-of-the-art BC
methodology – we propose diffusion steering via reinforcement learning (DSRL):
adapting the BC policy by running RL over its latent-noise space. We show that
DSRL is highly sample efficient, requires only black-box access to the BC
policy, and enables effective real-world autonomous policy improvement.
Furthermore, DSRL avoids many of the challenges associated with finetuning
diffusion policies, obviating the need to modify the weights of the base policy
at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks,
and for adapting pretrained generalist policies, illustrating its sample
efficiency and effective performance at real-world policy improvement.
[LINK]
http://arxiv.org/abs/2506.15799v2
[DATE]
2025-06-26 03:09:52+08:00
[CATEGORIES]
cs.LG
Revealing higher-order neural representations of uncertainty with the Noise Estimation through Reinforcement-based Diffusion (NERD) model
[AUTHORS]
Hojjat Azimi Asrari, Megan A. K. Peters
[ABSTRACT]
Studies often aim to reveal “first-order” representations (FORs), which
encode aspects of an observer's environment, such as contents or structure. A
less-common target is “higher-order” representations (HORs), which are
“about” FORs – e.g., their strength or uncertainty – and which may
contribute to learning. HORs about uncertainty are unlikely to be direct
“read-outs” of FOR characteristics, instead reflecting noisy estimation
processes incorporating prior expectations about uncertainty, but how the brain
represents such expected uncertainty distributions remains largely unexplored.
Here, we study “noise expectation” HORs using neural data from a task which
may require the brain to learn about its own noise: decoded neurofeedback,
wherein human subjects learn to volitionally produce target neural patterns. We
develop and apply a Noise Estimation through Reinforcement-based Diffusion
(NERD) model to characterize how brains may undertake this process, and show
that NERD offers high explanatory power for human behavior.
[COMMENTS]
27 pages, 7 figures, 12 equations
[LINK]
http://arxiv.org/abs/2503.14333v2
[DATE]
2025-06-26 03:04:21+08:00
[CATEGORIES]
cs.LG
Stochastic and Non-local Closure Modeling for Nonlinear Dynamical Systems via Latent Score-based Generative Models
[AUTHORS]
Xinghao Dong, Huchen Yang, Jin-Long Wu
[ABSTRACT]
We propose a latent score-based generative AI framework for learning
stochastic, non-local closure models and constitutive laws in nonlinear
dynamical systems of computational mechanics. This work addresses a key
challenge of modeling complex multiscale dynamical systems without a clear
scale separation, for which numerically resolving all scales is prohibitively
expensive, e.g., for engineering turbulent flows. While classical closure
modeling methods leverage domain knowledge to approximate subgrid-scale
phenomena, their deterministic and local assumptions can be too restrictive in
regimes lacking a clear scale separation. Recent developments of
diffusion-based stochastic models have shown promise in the context of closure
modeling, but their prohibitive inference cost limits their use in many
real-world applications. This work addresses this
limitation by jointly training convolutional autoencoders with conditional
diffusion models in the latent spaces, significantly reducing the
dimensionality of the sampling process while preserving essential physical
characteristics. Numerical results demonstrate that the joint training approach
helps discover a proper latent space that not only guarantees small
reconstruction errors but also ensures good performance of the diffusion model
in the latent space. When integrated into numerical simulations, the proposed
stochastic modeling framework via latent conditional diffusion models achieves
significant computational acceleration while maintaining comparable predictive
accuracy to standard diffusion models in physical spaces.
[LINK]
http://arxiv.org/abs/2506.20771v1
[DATE]
2025-06-26 03:04:02+08:00
[CATEGORIES]
cs.LG
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
[AUTHORS]
Advik Raj Basani, Xiao Zhang
[ABSTRACT]
LLMs have shown impressive capabilities across various natural language
processing tasks, yet remain vulnerable to input prompts, known as jailbreak
attacks, carefully designed to bypass safety guardrails and elicit harmful
responses. Traditional methods rely on manual heuristics but suffer from
limited generalizability. Despite being automatic, optimization-based attacks
often produce unnatural prompts that can be easily detected by safety filters
or require high computational costs due to discrete token optimization. In this
paper, we introduce Generative Adversarial Suffix Prompter (GASP), a novel
automated framework that can efficiently generate human-readable jailbreak
prompts in a fully black-box setting. In particular, GASP leverages latent
Bayesian optimization to craft adversarial suffixes by efficiently exploring
continuous latent embedding spaces, gradually optimizing the suffix prompter to
improve attack efficacy while balancing prompt coherence via a targeted
iterative refinement procedure. Through comprehensive experiments, we show that
GASP can produce natural adversarial prompts, significantly improving jailbreak
success over baselines, reducing training times, and accelerating inference
speed, thus making it an efficient and scalable solution for red-teaming LLMs.
[COMMENTS]
38 pages, 8 tables, 18 figures
[LINK]
http://arxiv.org/abs/2411.14133v2
[DATE]
2025-06-26 03:01:33+08:00
[CATEGORIES]
cs.LG
Control and optimization for Neural Partial Differential Equations in Supervised Learning
[AUTHORS]
Alain Bensoussan, Minh-Binh Tran, Bangjie Wang
[ABSTRACT]
Although there is a substantial body of literature on control and
optimization problems for parabolic and hyperbolic systems, the specific
problem of controlling and optimizing the coefficients of the associated
operators within such systems has not yet been thoroughly explored. In this
work, we aim to initiate a line of research in control theory focused on
optimizing and controlling the coefficients of these operators, a problem that
naturally arises in the context of neural networks and supervised learning.
In supervised learning, the primary objective is to transport initial data
toward target data through the layers of a neural network. We propose a novel
perspective: neural networks can be interpreted as partial differential
equations (PDEs). From this viewpoint, the control problem traditionally
studied in the context of ordinary differential equations (ODEs) is
reformulated as a control problem for PDEs, specifically targeting the
optimization and control of coefficients in parabolic and hyperbolic operators.
To the best of our knowledge, this specific problem has not yet been
systematically addressed in the control theory of PDEs.
To this end, we propose a dual system formulation for the control and
optimization problem associated with parabolic PDEs, laying the groundwork for
the development of efficient numerical schemes in future research. We also
provide a theoretical proof showing that the control and optimization problem
for parabolic PDEs admits minimizers. Finally, we investigate the control
problem associated with hyperbolic PDEs and prove the existence of solutions
for a corresponding approximated control problem.
[LINK]
http://arxiv.org/abs/2506.20764v1
[DATE]
2025-06-26 02:54:48+08:00
[CATEGORIES]
cs.LG
Characterization and Mitigation of Training Instabilities in Microscaling Formats
[AUTHORS]
Huangyuan Su, Mujin Kwun, Stephanie Gil, Sham Kakade, Nikhil Anand
[ABSTRACT]
Training large language models is an expensive, compute-bound process that
must be repeated as models scale, algorithms improve, and new data is
collected. To address this, next-generation hardware accelerators increasingly
support lower-precision arithmetic formats, such as the Microscaling (MX)
formats introduced in NVIDIA’s Blackwell architecture. These formats use a
shared scale within blocks of parameters to extend representable range and
perform forward/backward GEMM operations in reduced precision for efficiency
gains. In this work, we investigate the challenges and viability of
block-scaled precision formats during model training. Across nearly one
thousand language models trained from scratch – spanning compute budgets from
$2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad
range of weight-activation precision combinations – we consistently observe
that training in MX formats exhibits sharp, stochastic instabilities in the
loss, particularly at larger compute scales. To explain this phenomenon, we
conduct controlled experiments and ablations on a smaller proxy model that
exhibits behavior similar to that of the language model, sweeping across architectural
settings, hyperparameters, and precision formats. These experiments motivate a
simple model in which multiplicative gradient bias introduced by the
quantization of layer-norm affine parameters and a small fraction of
activations can trigger runaway divergence. Through \emph{in situ} intervention
experiments on our proxy model, we demonstrate that instabilities can be
averted or delayed by modifying precision schemes mid-training. Guided by these
findings, we evaluate stabilization strategies in the LLM setting and show that
certain hybrid configurations recover performance competitive with
full-precision training. We release our code at
https://github.com/Hither1/systems-scaling.
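The block-scaling idea described above (one shared scale per block of values) can be illustrated with a minimal sketch. This is not the MX specification and not the authors' code: the block size of 32, the power-of-two shared scale, the signed-integer element grid, and the function names `quantize_block`, `dequantize_block`, and `quantize_tensor` are all assumptions chosen for illustration.

```python
import math


def quantize_block(block, elem_bits=8):
    """Quantize one block of floats with a single shared power-of-two scale.

    Illustrative assumption: elements are stored as signed integers with
    `elem_bits` bits; real block-scaled formats use other element encodings.
    Returns (shared_exponent, list_of_integer_codes).
    """
    qmax = 2 ** (elem_bits - 1) - 1  # largest integer magnitude per element
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Smallest exponent such that amax / 2**exp fits inside [-qmax, qmax].
    exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exp
    # Round to the nearest code and clamp as a guard against float error.
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return exp, codes


def dequantize_block(exp, codes):
    """Map integer codes back to floats using the block's shared scale."""
    scale = 2.0 ** exp
    return [c * scale for c in codes]


def quantize_tensor(values, block_size=32, elem_bits=8):
    """Quantize and dequantize a flat list of floats block by block."""
    out = []
    for start in range(0, len(values), block_size):
        exp, codes = quantize_block(values[start:start + block_size], elem_bits)
        out.extend(dequantize_block(exp, codes))
    return out
```

Under these assumptions, a single large value in a block raises the shared scale for every element in that block, so small entries next to an outlier can round to zero. This is one simple way to see how a shared scale trades range for per-element resolution; it does not reproduce the instability mechanism analyzed in the paper.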
[COMMENTS]
14 pages + appendices
[LINK]
http://arxiv.org/abs/2506.20752v1
[DATE]
2025-06-26 02:25:08+08:00
[CATEGORIES]
cs.LG
Multiple Streams of Relation Extraction: Enriching and Recalling in Transformers
[AUTHORS]
Todd Nief, David Reber, Sean Richardson, Ari Holtzman
[ABSTRACT]
When an LLM learns a relation during finetuning (e.g., new movie releases,
corporate mergers, etc.), where does this information go? Is it extracted when
the model processes an entity, recalled just-in-time before a prediction, or
are there multiple separate heuristics? Existing localization approaches (e.g.
activation patching) are ill-suited for this analysis because they tend to
replace parts of the residual stream, potentially deleting information. To fill
this gap, we propose dynamic weight-grafting between fine-tuned and pre-trained
language models to show that fine-tuned language models both (1) extract
relation information learned during finetuning while processing entities and
(2) “recall” this information in later layers while generating predictions. In
some cases, models need both of these pathways to correctly generate finetuned
information while, in other cases, a single “enrichment” or “recall” pathway
alone is sufficient. We examine the necessity and sufficiency of these
information pathways, examining what layers they occur at, how much redundancy
they exhibit, and which model components are involved -- finding that the
“recall” pathway occurs via both task-specific attention mechanisms and a
relation extraction step in the output of the attention and the feedforward
networks at the final layers before next token prediction.
[LINK]
http://arxiv.org/abs/2506.20746v1
[DATE]
2025-06-26 02:13:34+08:00
[CATEGORIES]
cs.LG
A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools
[AUTHORS]
Minh-Hao Van, Prateek Verma, Chen Zhao, Xintao Wu
[ABSTRACT]
Foundation models (FMs) are catalyzing a transformative shift in materials
science (MatSci) by enabling scalable, general-purpose, and multimodal AI
systems for scientific discovery. Unlike traditional machine learning models,
which are typically narrow in scope and require task-specific engineering, FMs
offer cross-domain generalization and exhibit emergent capabilities. Their
versatility is especially well-suited to materials science, where research
challenges span diverse data types and scales. This survey provides a
comprehensive overview of foundation models, agentic systems, datasets, and
computational tools supporting this growing field. We introduce a task-driven
taxonomy encompassing six broad application areas: data extraction,
interpretation and Q\&A; atomistic simulation; property prediction; materials
structure, design and discovery; process planning, discovery, and optimization;
and multiscale modeling. We discuss recent advances in both unimodal and
multimodal FMs, as well as emerging large language model (LLM) agents.
Furthermore, we review standardized datasets, open-source tools, and autonomous
experimental platforms that collectively fuel the development and integration
of FMs into research workflows. We assess the early successes of foundation
models and identify persistent limitations, including challenges in
generalizability, interpretability, data imbalance, safety concerns, and
limited multimodal fusion. Finally, we articulate future research directions
centered on scalable pretraining, continual learning, data governance, and
trustworthiness.
[LINK]
http://arxiv.org/abs/2506.20743v1
[DATE]
2025-06-26 02:10:30+08:00
[CATEGORIES]
cs.LG
Test-time Scaling Techniques in Theoretical Physics – A Comparison of Methods on the TPBench Dataset
[AUTHORS]
Zhiqi Gao, Tianyi Li, Yurii Kvasiuk, Sai Chaitanya Tadepalli, Maja Rudolph, Daniel J. H. Chung, Frederic Sala, Moritz Münchmeyer
[ABSTRACT]
Large language models (LLMs) have shown strong capabilities in complex
reasoning, and test-time scaling techniques can enhance their performance with
comparatively low cost. Many of these methods have been developed and evaluated on
mathematical reasoning benchmarks such as AIME. This paper investigates whether
the lessons learned from these benchmarks generalize to the domain of advanced
theoretical physics. We evaluate a range of common test-time scaling methods on
the TPBench physics dataset and compare their effectiveness with results on
AIME. To better leverage the structure of physics problems, we develop a novel,
symbolic weak-verifier framework to improve parallel scaling results. Our
empirical results demonstrate that this method significantly outperforms
existing test-time scaling approaches on TPBench. We also evaluate our method
on AIME, confirming its effectiveness in solving advanced mathematical
problems. Our findings highlight the power of step-wise symbolic verification
for tackling complex scientific problems.
[COMMENTS]
23 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.20729v1
[DATE]
2025-06-26 02:00:18+08:00
[CATEGORIES]
cs.LG
On Convolutions, Intrinsic Dimension, and Diffusion Models
[AUTHORS]
Kin Kwan Leung, Rasa Hosseinzadeh, Gabriel Loaiza-Ganem
[ABSTRACT]
The manifold hypothesis asserts that data of interest in high-dimensional
ambient spaces, such as image data, lies on unknown low-dimensional
submanifolds. Diffusion models (DMs) – which operate by convolving data with
progressively larger amounts of Gaussian noise and then learning to revert this
process – have risen to prominence as the most performant generative models,
and are known to be able to learn distributions with low-dimensional support.
For a given datum in one of these submanifolds, we should thus intuitively
expect DMs to have implicitly learned its corresponding local intrinsic
dimension (LID), i.e. the dimension of the submanifold it belongs to. Kamkari
et al. (2024b) recently showed that this is indeed the case by linking this LID
to the rate of change of the log marginal densities of the DM with respect to
the amount of added noise, resulting in an LID estimator known as FLIPD. LID
estimators such as FLIPD have a plethora of uses, among others they quantify
the complexity of a given datum, and can be used to detect outliers,
adversarial examples and AI-generated text. FLIPD achieves state-of-the-art
performance at LID estimation, yet its theoretical underpinnings are incomplete
since Kamkari et al. (2024b) only proved its correctness under the highly
unrealistic assumption of affine submanifolds. In this work we bridge this gap
by formally proving the correctness of FLIPD under realistic assumptions.
Additionally, we show that an analogous result holds when Gaussian convolutions
are replaced with uniform ones, and discuss the relevance of this result.
[LINK]
http://arxiv.org/abs/2506.20705v1
[DATE]
2025-06-26 02:00:00+08:00
[CATEGORIES]
cs.LG
Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models
[AUTHORS]
Vineet Jain, Kusha Sareen, Mohammad Pedramfar, Siamak Ravanbakhsh
[ABSTRACT]
Adapting a pretrained diffusion model to new objectives at inference time
remains an open problem in generative modeling. Existing steering methods
suffer from inaccurate value estimation, especially at high noise levels, which
biases guidance. Moreover, information from past runs is not reused to improve
sample quality, resulting in inefficient use of compute. Inspired by the
success of Monte Carlo Tree Search, we address these limitations by casting
inference-time alignment as a search problem that reuses past computations. We
introduce a tree-based approach that samples from the reward-aligned target
density by propagating terminal rewards back through the diffusion chain and
iteratively refining value estimates with each additional generation. Our
proposed method, Diffusion Tree Sampling (DTS), produces asymptotically exact
samples from the target distribution in the limit of infinite rollouts, and its
greedy variant, Diffusion Tree Search (DTS$^\star$), performs a global search
for high reward samples. On MNIST and CIFAR-10 class-conditional generation,
DTS matches the FID of the best-performing baseline with up to $10\times$ less
compute. In text-to-image generation and language completion tasks, DTS$^\star$
effectively searches for high reward samples that match best-of-N with up to
$5\times$ less compute. By reusing information from previous generations, we
get an anytime algorithm that turns additional compute into steadily better
samples, providing a scalable approach for inference-time alignment of
diffusion models.
[LINK]
http://arxiv.org/abs/2506.20701v1
[DATE]
2025-06-26 01:59:10+08:00
[CATEGORIES]
cs.LG
DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy
[AUTHORS]
Sungjae Park, Homanga Bharadhwaj, Shubham Tulsiani
[ABSTRACT]
We propose DemoDiffusion, a simple and scalable method for enabling robots to
perform manipulation tasks in natural environments by imitating a single human
demonstration. Our approach is based on two key insights. First, the hand
motion in a human demonstration provides a useful prior for the robot’s
end-effector trajectory, which we can convert into a rough open-loop robot
motion trajectory via kinematic retargeting. Second, while this retargeted
motion captures the overall structure of the task, it may not align well with
plausible robot actions in-context. To address this, we leverage a pre-trained
generalist diffusion policy to modify the trajectory, ensuring it both follows
the human motion and remains within the distribution of plausible robot
actions. Our approach avoids the need for online reinforcement learning or
paired human-robot data, enabling robust adaptation to new tasks and scenes
with minimal manual effort. Experiments in both simulation and real-world
settings show that DemoDiffusion outperforms both the base policy and the
retargeted trajectory, enabling the robot to succeed even on tasks where the
pre-trained generalist policy fails entirely. Project page:
https://demodiffusion.github.io/
[COMMENTS]
Preprint (17 pages). Under review
[LINK]
http://arxiv.org/abs/2506.20668v1
[DATE]
2025-06-26 01:59:01+08:00
[CATEGORIES]
cs.LG
Data Quality in Crowdsourcing and Spamming Behavior Detection
[AUTHORS]
Yang Ba, Michelle V. Mancenido, Erin K. Chiou, Rong Pan
[ABSTRACT]
As crowdsourcing emerges as an efficient and cost-effective method for
obtaining labels for machine learning datasets, it is important to assess the
quality of crowd-provided data, so as to improve analysis performance and
reduce biases in subsequent machine learning tasks. Given the lack of ground
truth in most cases of crowdsourcing, we refer to data quality as annotators’
consistency and credibility. Unlike the simple scenarios where Kappa
coefficient and intraclass correlation coefficient usually can apply, online
crowdsourcing requires dealing with more complex situations. We introduce a
systematic method for evaluating data quality and detecting spamming threats
via variance decomposition, and we classify spammers into three categories
based on their different behavioral patterns. A spammer index is proposed to
assess entire data consistency, and two metrics are developed to measure crowd
workers’ credibility by utilizing the Markov chain and generalized random
effects models. Furthermore, we showcase the practicality of our techniques and
their advantages by applying them on a face verification task with both
simulation and real-world data collected from two crowdsourcing platforms.
[COMMENTS]
Preprint paper, accepted at Behavior Research Methods. 56 pages, 14 figures
[LINK]
http://arxiv.org/abs/2404.17582v2
[DATE]
2025-06-26 01:56:08+08:00
[CATEGORIES]
cs.LG
Mastering Multiple-Expert Routing: Realizable $H$-Consistency and Strong Guarantees for Learning to Defer
[AUTHORS]
Anqi Mao, Mehryar Mohri, Yutao Zhong
[ABSTRACT]
The problem of learning to defer with multiple experts consists of optimally
assigning input instances to experts, balancing the trade-off between their
accuracy and computational cost. This is a critical challenge in natural
language generation, but also in other fields such as image processing, and
medical diagnostics. Recent studies have proposed surrogate loss functions to
optimize deferral, but challenges remain in ensuring their consistency
properties. This paper introduces novel surrogate loss functions and efficient
algorithms with strong theoretical learning guarantees. We address open
questions regarding realizable $H$-consistency, $H$-consistency bounds, and
Bayes-consistency for both single-stage (jointly learning predictor and
deferral function) and two-stage (learning only the deferral function with a
fixed expert) learning scenarios. For single-stage deferral, we introduce a
family of new realizable $H$-consistent surrogate losses and further prove
$H$-consistency for a selected member. For two-stage deferral, we derive new
surrogate losses that achieve realizable $H$-consistency, $H$-consistency
bounds, and Bayes-consistency for the two-expert scenario and, under natural
assumptions, the multiple-expert scenario. Additionally, we provide enhanced
theoretical guarantees under low-noise assumptions for both scenarios. Finally,
we report the results of experiments using our proposed surrogate losses,
comparing their performance against existing baselines.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2506.20650v1
[DATE]
2025-06-26 01:48:58+08:00
[CATEGORIES]
cs.LG
Efficient Federated Learning with Encrypted Data Sharing for Data-Heterogeneous Edge Devices
[AUTHORS]
Hangyu Li, Hongyue Wu, Guodong Fan, Zhen Zhang, Shizhan Chen, Zhiyong Feng
[ABSTRACT]
As privacy protection gains increasing importance, more models are being
trained on edge devices and subsequently merged into the central server through
Federated Learning (FL). However, current research overlooks the impact of
network topology, physical distance, and data heterogeneity on edge devices,
leading to issues such as increased latency and degraded model performance. To
address these issues, we propose a new federated learning scheme on edge
devices called Federated Learning with Encrypted Data Sharing (FedEDS).
FedEDS uses the client model and the model’s stochastic layer to train the data
encryptor. The data encryptor generates encrypted data and shares it with other
clients. The client uses the corresponding client’s stochastic layer and
encrypted data to train and adjust the local model. FedEDS uses the client’s
local private data and encrypted shared data from other clients to train the
model. This approach accelerates the convergence speed of federated learning
training and mitigates the negative impact of data heterogeneity, making it
suitable for application services deployed on edge devices requiring rapid
convergence. Experimental results show the efficacy of FedEDS in promoting model
performance.
[COMMENTS]
Accepted by ICWS 2025
[LINK]
http://arxiv.org/abs/2506.20644v1
[DATE]
2025-06-26 01:40:54+08:00
[CATEGORIES]
cs.LG
Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data
[AUTHORS]
Corinna Cortes, Anqi Mao, Mehryar Mohri, Yutao Zhong
[ABSTRACT]
Class imbalance remains a major challenge in machine learning, especially in
multi-class problems with long-tailed distributions. Existing methods, such as
data resampling, cost-sensitive techniques, and logistic loss modifications,
though popular and often effective, lack solid theoretical foundations. As an
example, we demonstrate that cost-sensitive methods are not Bayes-consistent.
This paper introduces a novel theoretical framework for analyzing
generalization in imbalanced classification. We then propose a new
class-imbalanced margin loss function for both binary and multi-class settings,
prove its strong $H$-consistency, and derive corresponding learning guarantees
based on empirical loss and a new notion of class-sensitive Rademacher
complexity. Leveraging these theoretical results, we devise novel and general
learning algorithms, IMMAX (Imbalanced Margin Maximization), which incorporate
confidence margins and are applicable to various hypothesis sets. While our
focus is theoretical, we also present extensive empirical results demonstrating
the effectiveness of our algorithms compared to existing baselines.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2502.10381v2
[DATE]
2025-06-26 01:36:30+08:00
[CATEGORIES]
cs.LG
First-order methods for stochastic and finite-sum convex optimization with deterministic constraints
[AUTHORS]
Zhaosong Lu, Yifeng Xiao
[ABSTRACT]
In this paper, we study a class of stochastic and finite-sum convex
optimization problems with deterministic constraints. Existing methods
typically aim to find an $\epsilon$-$expectedly\ feasible\ stochastic\ optimal$
solution, in which the expected constraint violation and expected optimality
gap are both within a prescribed tolerance $\epsilon$. However, in many
practical applications, constraints must be nearly satisfied with certainty,
rendering such solutions potentially unsuitable due to the risk of substantial
violations. To address this issue, we propose stochastic first-order methods
for finding an $\epsilon$-$surely\ feasible\ stochastic\ optimal$
($\epsilon$-SFSO) solution, where the constraint violation is deterministically
bounded by $\epsilon$ and the expected optimality gap is at most $\epsilon$.
Our methods apply an accelerated stochastic gradient (ASG) scheme or a modified
variance-reduced ASG scheme $only\ once$ to a sequence of quadratic penalty
subproblems with appropriately chosen penalty parameters. We establish
first-order oracle complexity bounds for the proposed methods in computing an
$\epsilon$-SFSO solution. As a byproduct, we also derive first-order oracle
complexity results for the sample average approximation method in computing an
$\epsilon$-SFSO solution of the stochastic optimization problem using our
proposed methods to solve the sample average problem.
[COMMENTS]
41 pages
[LINK]
http://arxiv.org/abs/2506.20630v1
[DATE]
2025-06-26 01:26:02+08:00
[CATEGORIES]
cs.LG
On Context-Content Uncertainty Principle
[AUTHORS]
Xin Li
[ABSTRACT]
The Context-Content Uncertainty Principle (CCUP) proposes that inference
under uncertainty is governed by an entropy asymmetry between context and
content: high-entropy contexts must be interpreted through alignment with
low-entropy, structured content. In this paper, we develop a layered
computational framework that derives operational principles from this
foundational asymmetry. At the base level, CCUP formalizes inference as
directional entropy minimization, establishing a variational gradient that
favors content-first structuring. Building upon this, we identify four
hierarchical layers of operational principles: (\textbf{L1}) \emph{Core
Inference Constraints}, including structure-before-specificity, asymmetric
inference flow, cycle-consistent bootstrapping, and conditional compression,
all shown to be mutually reducible; (\textbf{L2}) \emph{Resource Allocation
Principles}, such as precision-weighted attention, asymmetric learning rates,
and attractor-based memory encoding; (\textbf{L3}) \emph{Temporal Bootstrapping
Dynamics}, which organize learning over time via structure-guided curricula;
and (\textbf{L4}) \emph{Spatial Hierarchical Composition}, which integrates
these mechanisms into self-organizing cycles of memory, inference, and
planning. We present formal equivalence theorems, a dependency lattice among
principles, and computational simulations demonstrating the efficiency gains of
CCUP-aligned inference. This work provides a unified theoretical foundation for
understanding how brains and machines minimize uncertainty through recursive
structure-specificity alignment. The brain is not just an inference machine. It
is a cycle-consistent entropy gradient resolver, aligning structure and
specificity via path-dependent, content-seeded simulation.
[LINK]
http://arxiv.org/abs/2506.20699v1
[DATE]
2025-06-26 01:21:19+08:00
[CATEGORIES]
cs.LG
Probing Quantum Spin Systems with Kolmogorov-Arnold Neural Network Quantum States
[AUTHORS]
Mahmud Ashraf Shamim, Eric A F Reinhardt, Talal Ahmed Chowdhury, Sergei Gleyzer, Paulo T Araujo
[ABSTRACT]
Neural Quantum States (NQS) are a class of variational wave functions
parametrized by neural networks (NNs) to study quantum many-body systems. In
this work, we propose \texttt{SineKAN}, a NQS \textit{ansatz} based on
Kolmogorov-Arnold Networks (KANs), to represent quantum mechanical wave
functions as nested univariate functions. We show that the \texttt{SineKAN}
wavefunction with learnable sinusoidal activation functions can capture the
ground state energies, fidelities and various correlation functions of the one
dimensional Transverse-Field Ising model, Anisotropic Heisenberg model, and
Antiferromagnetic $J_{1}-J_{2}$ model with different chain lengths. In our
study of the $J_1-J_2$ model with $L=100$ sites, we find that the
\texttt{SineKAN} model outperforms several previously explored neural quantum
state \textit{ansätze}, including Restricted Boltzmann Machines (RBMs), Long
Short-Term Memory models (LSTMs), and Multi-layer Perceptrons (MLP)
\textit{a.k.a.} Feed Forward Neural Networks, when compared to the results
obtained from the Density Matrix Renormalization Group (DMRG) algorithm. We
find that \texttt{SineKAN} models can be trained to high precisions and
accuracies with minimal computational costs.
[COMMENTS]
16 pages, 13 figures
[LINK]
http://arxiv.org/abs/2506.01891v3
[DATE]
2025-06-26 01:17:27+08:00
[CATEGORIES]
cs.LG
Lost in Retraining: Roaming the Parameter Space of Exponential Families Under Closed-Loop Learning
[AUTHORS]
Fariba Jangjoo, Matteo Marsili, Yasser Roudi
[ABSTRACT]
Closed-loop learning is the process of repeatedly estimating a model from
data generated from the model itself. It is receiving great attention due to
the possibility that large neural network models may, in the future, be
primarily trained with data generated by artificial neural networks themselves.
We study this process for models that belong to exponential families, deriving
equations of motion that govern the dynamics of the parameters. We show that
maximum likelihood estimation of the parameters endows sufficient statistics
with the martingale property and that as a result the process converges to
absorbing states that amplify initial biases present in the data. However, we
show that this outcome may be prevented by polluting the data with an
infinitesimal fraction of data points generated from a fixed model, by relying
on maximum a posteriori estimation or by introducing regularisation.
Furthermore, we show that the asymptotic behavior of the dynamics is not
reparametrisation invariant.
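A small simulation can make the closed-loop dynamics concrete. The sketch below uses a Bernoulli model as one simple member of the exponential family; the sample size, the number of rounds, and the function name `closed_loop_bernoulli` are illustrative assumptions, not the authors' experimental setup. In each round, the parameter is re-estimated by maximum likelihood from samples drawn from the current model.

```python
import random


def closed_loop_bernoulli(p0, n_samples, n_rounds, seed=0):
    """Repeatedly refit a Bernoulli parameter on data sampled from itself.

    Returns the list of parameter values, starting with p0.
    """
    rng = random.Random(seed)
    p = p0
    history = [p]
    for _ in range(n_rounds):
        # Draw n_samples points from the current model.
        ones = sum(1 for _ in range(n_samples) if rng.random() < p)
        # Maximum likelihood estimate: the sample mean.
        p = ones / n_samples
        history.append(p)
    return history


if __name__ == "__main__":
    trajectory = closed_loop_bernoulli(p0=0.5, n_samples=20, n_rounds=500)
    print("final parameter:", trajectory[-1])
```

In this toy case the estimate is unbiased given the current parameter, so the sequence has no drift, yet sampling noise accumulates until the parameter reaches 0 or 1 and stays there. Which endpoint is reached depends on the random draws.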
[COMMENTS]
13 pages, 2 figures
[LINK]
http://arxiv.org/abs/2506.20623v1
[DATE]
2025-06-26 01:12:22+08:00
[CATEGORIES]
cs.LG
Do Concept Bottleneck Models Respect Localities?
[AUTHORS]
Naveen Raman, Mateo Espinosa Zarlenga, Juyeon Heo, Mateja Jamnik
[ABSTRACT]
Concept-based explainability methods use human-understandable intermediaries
to produce explanations for machine learning models. These methods assume
concept predictions can help understand a model’s internal reasoning. In this
work, we assess the degree to which such an assumption is true by analyzing
whether concept predictors leverage “relevant” features to make predictions, a
term we call locality. Concept-based models that fail to respect localities
also fail to be explainable because concept predictions are based on spurious
features, making the interpretation of the concept predictions vacuous. To
assess whether concept-based models respect localities, we construct and use
three metrics to characterize when models respect localities, complementing our
analysis with theoretical results. Each of our metrics captures a different
notion of perturbation and assesses whether perturbing “irrelevant” features
impacts the predictions made by a concept predictor. We find that many
concept-based models used in practice fail to respect localities because
concept predictors cannot always clearly distinguish distinct concepts. Based
on these findings, we propose suggestions for alleviating this issue.
[COMMENTS]
Published at TMLR
[LINK]
http://arxiv.org/abs/2401.01259v5
[DATE]
2025-06-26 01:10:45+08:00
[CATEGORIES]
cs.LG
From $\mathcal{O}(n^{2})$ to $\mathcal{O}(n)$ Parameters: Quantum Self-Attention in Vision Transformers for Biomedical Image Classification
[AUTHORS]
Thomas Boucher, John Whittle, Evangelos B. Mazomenos
[ABSTRACT]
We demonstrate that quantum vision transformers (QViTs), vision transformers
(ViTs) with self-attention (SA) mechanisms replaced by quantum self-attention
(QSA) mechanisms, can match state-of-the-art (SOTA) biomedical image
classifiers while using 99.99% fewer parameters. QSAs are produced by replacing
linear SA layers with parameterised quantum neural networks (QNNs), producing a
QSA mechanism and reducing parameter scaling from $\mathcal{O}(n^2)$ to
$\mathcal{O}(n)$. On RetinaMNIST, our ultra parameter-efficient QViT
outperforms 13/14 SOTA methods including CNNs and ViTs, achieving 56.5%
accuracy, just 0.88% below the top MedMamba model while using 99.99% fewer
parameters (1K vs 14.5M) and 89% fewer GFLOPs. We present the first
investigation of knowledge distillation (KD) from classical to quantum vision
transformers in biomedical image classification, showing that QViTs maintain
comparable performance to classical ViTs across eight diverse datasets spanning
multiple modalities, with improved QSA parameter-efficiency. Our higher-qubit
architecture benefitted more from KD pre-training, suggesting a scaling
relationship between QSA parameters and KD effectiveness. These findings
establish QSA as a practical architectural choice toward parameter-efficient
biomedical image analysis.
[COMMENTS]
Submitted for EMA4MICCAI 2025
[LINK]
http://arxiv.org/abs/2503.07294v2
[DATE]
2025-06-26 01:08:53+08:00
[CATEGORIES]
cs.LG
H-FEX: A Symbolic Learning Method for Hamiltonian Systems
[AUTHORS]
Jasen Lai, Senwei Liang, Chunmei Wang
[ABSTRACT]
Hamiltonian systems describe a broad class of dynamical systems governed by
Hamiltonian functions, which encode the total energy and dictate the evolution
of the system. Data-driven approaches, such as symbolic regression and neural
network-based methods, provide a means to learn the governing equations of
dynamical systems directly from observational data of Hamiltonian systems.
However, these methods often struggle to accurately capture complex Hamiltonian
functions while preserving energy conservation. To overcome this limitation, we
propose the Finite Expression Method for learning Hamiltonian Systems (H-FEX),
a symbolic learning method that introduces novel interaction nodes designed to
capture intricate interaction terms effectively. Our experiments, including
those on highly stiff dynamical systems, demonstrate that H-FEX can recover
Hamiltonian functions of complex systems that accurately capture system
dynamics and preserve energy over long time horizons. These findings highlight
the potential of H-FEX as a powerful framework for discovering closed-form
expressions of complex dynamical systems.
[COMMENTS]
16 pages, 7 figures
[LINK]
http://arxiv.org/abs/2506.20607v1
[DATE]
2025-06-26 00:53:01+08:00
[CATEGORIES]
cs.LG
LT-PINN: Lagrangian Topology-conscious Physics-informed Neural Network for Boundary-focused Engineering Optimization
[AUTHORS]
Yuanye Zhou, Zhaokun Wang, Kai Zhou, Hui Tang, Xiaofan Li
[ABSTRACT]
Physics-informed neural networks (PINNs) have emerged as a powerful meshless
tool for topology optimization, capable of simultaneously determining optimal
topologies and physical solutions. However, conventional PINNs rely on
density-based topology descriptions, which necessitate manual interpolation and
limit their applicability to complex geometries. To address this, we propose
Lagrangian topology-conscious PINNs (LT-PINNs), a novel framework for
boundary-focused engineering optimization. By parameterizing the control
variables of topology boundary curves as learnable parameters, LT-PINNs
eliminate the need for manual interpolation and enable precise boundary
determination. We further introduce a specialized boundary condition loss
function and a topology loss function to ensure sharp and accurate boundary
representations, even for intricate topologies. The accuracy and robustness of
LT-PINNs are validated via two types of partial differential equations (PDEs),
including the elastic equation with Dirichlet boundary conditions and Laplace’s
equation with Neumann boundary conditions. Furthermore, we demonstrate the
effectiveness of LT-PINNs on more complex time-dependent and time-independent
flow problems without relying on measurement data, and showcase their
engineering application potential in flow velocity rearrangement, transforming
a uniform upstream velocity into a sine-shaped downstream profile. The results
demonstrate (1) LT-PINNs achieve substantial reductions in relative L2 errors
compared with the state-of-the-art density topology-oriented PINNs (DT-PINNs), (2)
LT-PINNs can handle arbitrary boundary conditions, making them suitable for a
wide range of PDEs, and (3) LT-PINNs can infer clear topology boundaries
without manual interpolation, especially for complex topologies.
[LINK]
http://arxiv.org/abs/2506.06300v3
[DATE]
2025-06-26 00:48:42+08:00
[CATEGORIES]
cs.LG
The kernel of graph indices for vector search
[AUTHORS]
Mariano Tepper, Ted Willke
[ABSTRACT]
The most popular graph indices for vector search use principles from
computational geometry to build the graph. Hence, their formal graph
navigability guarantees are only valid in Euclidean space. In this work, we
show that machine learning can be used to build graph indices for vector search
in metric and non-metric vector spaces (e.g., for inner product similarity).
From this novel perspective, we introduce the Support Vector Graph (SVG), a new
type of graph index that leverages kernel methods to establish the graph
connectivity and that comes with formal navigability guarantees valid in metric
and non-metric vector spaces. In addition, we interpret the most popular graph
indices, including HNSW and DiskANN, as particular specializations of SVG and
show that new indices can be derived from the principles behind this
specialization. Finally, we propose SVG-L0 that incorporates an $\ell_0$
sparsity constraint into the SVG kernel method to build graphs with a bounded
out-degree. This yields a principled way of implementing this practical
requirement, in contrast to the traditional heuristic of simply truncating the
out edges of each node. Additionally, we show that SVG-L0 has a self-tuning
property that avoids the heuristic of using a set of candidates to find the
out-edges of each node and that keeps its computational complexity in check.
[LINK]
http://arxiv.org/abs/2506.20584v1
[DATE]
2025-06-26 00:24:55+08:00
[CATEGORIES]
cs.LG
Rethinking Early Stopping: Refine, Then Calibrate
[AUTHORS]
Eugène Berta, David Holzmüller, Michael I. Jordan, Francis Bach
[ABSTRACT]
Machine learning classifiers often produce probabilistic predictions that are
critical for accurate and interpretable decision-making in various domains. The
quality of these predictions is generally evaluated with proper losses, such as
cross-entropy, which decompose into two components: calibration error assesses
general under/overconfidence, while refinement error measures the ability to
distinguish different classes. In this paper, we present a novel variational
formulation of the calibration-refinement decomposition that sheds new light on
post-hoc calibration, and enables rapid estimation of the different terms.
Equipped with this new perspective, we provide theoretical and empirical
evidence that calibration and refinement errors are not minimized
simultaneously during training. Selecting the best epoch based on validation
loss thus leads to a compromise point that is suboptimal for both terms. To
address this, we propose minimizing refinement error only during training
(Refine,…), before minimizing calibration error post hoc, using standard
techniques (…then Calibrate). Our method integrates seamlessly with any
classifier and consistently improves performance across diverse classification
tasks.
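A minimal sketch of how a validation loss might be split into a calibration part and a refinement part. It assumes a binary classifier and uses temperature scaling (a standard post-hoc calibrator) fitted by grid search as the recalibration step; the function names, toy logits, and temperature grid are illustrative assumptions, not the authors' estimator.

```python
import math

def log_loss(logits, labels, temperature=1.0):
    """Mean binary cross-entropy of temperature-scaled logits."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        margin = s if y == 1 else -s
        # log(1 + exp(-margin)), computed in a numerically stable way
        total += math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
    return total / len(labels)

def split_loss(logits, labels, temperatures):
    """Return (calibration estimate, refinement estimate).

    Refinement is approximated by the loss remaining after the best
    temperature rescaling; calibration is what that rescaling removes.
    """
    raw = log_loss(logits, labels)
    refinement = min(log_loss(logits, labels, t) for t in temperatures)
    return raw - refinement, refinement

logits = [8.0, 6.0, -7.0, 5.0, -6.0, 4.0]   # overconfident toy logits
labels = [1, 1, 0, 0, 0, 1]                 # the fourth prediction is wrong
grid = [0.25 * k for k in range(1, 81)]     # temperatures 0.25 ... 20.0
calibration, refinement = split_loss(logits, labels, grid)
```

Under this reading, one would pick the training epoch with the lowest refinement estimate and fit the temperature afterwards, rather than picking the epoch with the lowest raw validation loss.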
[LINK]
http://arxiv.org/abs/2501.19195v2
[DATE]
2025-06-26 00:24:12+08:00
[CATEGORIES]
cs.LG
Causal Representation Learning with Observational Grouping for CXR Classification
[AUTHORS]
Rajat Rasal, Avinash Kori, Ben Glocker
[ABSTRACT]
Identifiable causal representation learning seeks to uncover the true causal
relationships underlying a data generation process. In medical imaging, this
presents opportunities to improve the generalisability and robustness of
task-specific latent features. This work introduces the concept of grouping
observations to learn identifiable representations for disease classification
in chest X-rays via an end-to-end framework. Our experiments demonstrate that
these causal representations improve generalisability and robustness across
multiple classification tasks when grouping is used to enforce invariance w.r.t.
race, sex, and imaging views.
[LINK]
http://arxiv.org/abs/2506.20582v1
[DATE]
2025-06-26 00:17:36+08:00
[CATEGORIES]
cs.LG
Exploring Graph-Transformer Out-of-Distribution Generalization Abilities
[AUTHORS]
Itay Niv, Neta Rabin
[ABSTRACT]
Deep learning on graphs has shown remarkable success across numerous
applications, including social networks, bio-physics, traffic networks, and
recommendation systems. Despite these successes, current methods
frequently depend on the assumption that training and testing data share the
same distribution, a condition rarely met in real-world scenarios. While
graph-transformer (GT) backbones have recently outperformed traditional
message-passing neural networks (MPNNs) in multiple in-distribution (ID)
benchmarks, their effectiveness under distribution shifts remains largely
unexplored.
In this work, we address the challenge of out-of-distribution (OOD)
generalization for graph neural networks, with a special focus on the impact of
backbone architecture. We systematically evaluate GT and hybrid backbones in
OOD settings and compare them to MPNNs. To do so, we adapt several leading
domain generalization (DG) algorithms to work with GTs and assess their
performance on a benchmark designed to test a variety of distribution shifts.
Our results reveal that GT and hybrid GT-MPNN backbones consistently
demonstrate stronger generalization ability compared to MPNNs, even without
specialized DG algorithms.
Additionally, we propose a novel post-training analysis approach that
compares the clustering structure of the entire ID and OOD test datasets,
specifically examining domain alignment and class separation. Demonstrating its
model-agnostic design, this approach not only provides meaningful insights into
GT and MPNN backbones but also shows promise for broader applicability to DG
problems beyond graph learning, offering a deeper perspective on generalization
abilities that goes beyond standard accuracy metrics. Together, our findings
highlight the promise of graph-transformers for robust, real-world graph
learning and set a new direction for future research in OOD generalization.
[LINK]
http://arxiv.org/abs/2506.20575v1
[DATE]
2025-06-26 00:09:24+08:00
[CATEGORIES]
cs.LG
Benchmarking Unsupervised Strategies for Anomaly Detection in Multivariate Time Series
[AUTHORS]
Laura Boggia, Rafael Teixeira de Lima, Bogdan Malaescu
[ABSTRACT]
Anomaly detection in multivariate time series is an important problem across
various fields such as healthcare, financial services, manufacturing or physics
detector monitoring. Accurately identifying when unexpected errors or faults
occur is essential, yet challenging, due to the unknown nature of anomalies and
the complex interdependencies between time series dimensions. In this paper, we
investigate transformer-based approaches for time series anomaly detection,
focusing on the recently proposed iTransformer architecture. Our contributions
are fourfold: (i) we explore the application of the iTransformer to time series
anomaly detection, and analyse the influence of key parameters such as window
size, step size, and model dimensions on performance; (ii) we examine methods
for extracting anomaly labels from multidimensional anomaly scores and discuss
appropriate evaluation metrics for such labels; (iii) we study the impact of
anomalous data present during training and assess the effectiveness of
alternative loss functions in mitigating their influence; and (iv) we present a
comprehensive comparison of several transformer-based models across a diverse
set of datasets for time series anomaly detection.
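The window size and step size mentioned in (i), and the label extraction in (ii), can be illustrated with a short sketch. The helper names and the "any dimension above a threshold" rule are assumptions chosen for illustration; the abstract does not state which extraction methods the paper compares.

```python
def sliding_windows(series, window_size, step_size):
    """Cut a multivariate series (list of per-step feature lists) into windows."""
    return [series[start:start + window_size]
            for start in range(0, len(series) - window_size + 1, step_size)]

def labels_from_scores(scores, threshold):
    """Flag a time step as anomalous if any dimension's score exceeds the threshold."""
    return [int(max(row) > threshold) for row in scores]

# Toy series with 10 time steps and 2 dimensions.
series = [[float(t), float(t) * 2] for t in range(10)]
windows = sliding_windows(series, window_size=4, step_size=2)

# Toy per-dimension anomaly scores for 3 time steps.
scores = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.3]]
labels = labels_from_scores(scores, threshold=0.5)
```

A smaller step size yields more, heavily overlapping windows; a larger one reduces compute at the cost of coarser coverage.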
[COMMENTS]
Submitted to VLDB 2026 conference, currently under review
[LINK]
http://arxiv.org/abs/2506.20574v1
[DATE]
2025-06-26 00:08:22+08:00
[CATEGORIES]
cs.LG
When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
[AUTHORS]
Ammar Khairi, Daniel D’souza, Ye Shen, Julia Kreutzer, Sara Hooker
[ABSTRACT]
Recent advancements in large language models (LLMs) have shifted focus toward
scaling inference-time compute, improving performance without retraining the
model. A common approach is to sample multiple outputs in parallel, and select
one of these as the final output. However, work to date has focused on English
and a handful of domains such as math and code. In contrast, we are most
interested in techniques that generalize across open-ended tasks, formally
verifiable tasks, and across languages. In this work, we study how to robustly
scale inference-time compute for open-ended generative tasks in a multilingual,
multi-task setting.
Our findings show that both the sampling strategy based on temperature variation
and the selection strategy must be adapted to account for diverse domains and
varied language settings. We evaluate existing selection methods, revealing
that strategies effective in English often fail to generalize across languages.
We propose novel sampling and selection strategies specifically adapted for
multilingual and multi-task inference scenarios, and show they yield notable
gains across languages and tasks. In particular, our combined sampling and
selection methods lead to an average +6.8 jump in win-rates for our 8B models
on m-ArenaHard-v2.0 prompts, against proprietary models such as Gemini. At
larger scale, Command-A (111B model), equipped with our methods, shows a +9.0
improvement in win-rates on the same benchmark with just five samples against
single-sample decoding, a substantial increase at minimal cost. Our results
underscore the need for language- and task-aware approaches to inference-time
compute, aiming to democratize performance improvements in underrepresented
languages.
[LINK]
http://arxiv.org/abs/2506.20544v1
[DATE]
2025-06-25 23:37:53+08:00
[CATEGORIES]
cs.CL
Attention with Trained Embeddings Provably Selects Important Tokens
[AUTHORS]
Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli
[ABSTRACT]
Token embeddings play a crucial role in language modeling but, despite this
practical relevance, their theoretical understanding remains limited. Our paper
addresses the gap by characterizing the structure of embeddings obtained via
gradient descent. Specifically, we consider a one-layer softmax attention model
with a linear head for binary classification, i.e., $\texttt{Softmax}( p^\top
E_X^\top ) E_X v = \frac{ \sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top
v}{\sum_{j=1}^T \exp(p^\top E_{x_{j}}) }$, where $E_X = [ E_{x_1} , \dots,
E_{x_T} ]^\top$ contains the embeddings of the input sequence, $p$ is the
embedding of the $\mathrm{\langle cls \rangle}$ token and $v$ the output
vector. First, we show that, already after a single step of gradient training
with the logistic loss, the embeddings $E_X$ capture the importance of tokens
in the dataset by aligning with the output vector $v$ proportionally to the
frequency with which the corresponding tokens appear in the dataset. Then,
after training $p$ via gradient flow until convergence, the softmax selects the
important tokens in the sentence (i.e., those that are predictive of the
label), and the resulting $\mathrm{\langle cls \rangle}$ embedding maximizes
the margin for such a selection. Experiments on real-world datasets (IMDB,
Yelp) exhibit a phenomenology close to that unveiled by our theory.
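The model output in the formula above can be evaluated directly. The following is a minimal sketch in plain Python, not the authors' code; the function name `attention_output` and the toy numbers are assumptions made for illustration.

```python
import math

def attention_output(E_X, p, v):
    """Scalar output of Softmax(p^T E_X^T) E_X v for one input sequence.

    E_X: list of T token embeddings (each a list of d floats).
    p:   embedding of the <cls> token (list of d floats).
    v:   output vector (list of d floats).
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = [dot(p, e) for e in E_X]            # p^T E_{x_i}
    m = max(scores)                              # shift for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    # Softmax-weighted average of the per-token values E_{x_i}^T v.
    return sum(w * dot(e, v) for w, e in zip(weights, E_X)) / total

# Toy example (made-up numbers): the second token aligns with both p and v.
E_X = [[0.1, 0.0], [2.0, 1.0], [-0.5, 0.3]]
p = [3.0, 0.0]
v = [1.0, 1.0]
out = attention_output(E_X, p, v)
```

Because the second token has by far the largest score $p^\top E_{x_i}$, the softmax places almost all weight on it and the output is close to that token's value $E_{x_i}^\top v = 3$.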
[COMMENTS]
Fix mistakes in Lemma 4.2 and proof of Lemma 4.5, and some other
minor changes
[LINK]
http://arxiv.org/abs/2505.17282v3
[DATE]
2025-06-25 23:19:05+08:00
[CATEGORIES]
cs.LG
cs.CL
Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers
[AUTHORS]
Clément Dumas, Chris Wendler, Veniamin Veselovsky, Giovanni Monea, Robert West
[ABSTRACT]
A central question in multilingual language modeling is whether large
language models (LLMs) develop a universal concept representation, disentangled
from specific languages. In this paper, we address this question by analyzing
latent representations (latents) during a word-translation task in
transformer-based LLMs. We strategically extract latents from a source
translation prompt and insert them into the forward pass on a target
translation prompt. By doing so, we find that the output language is encoded in
the latent at an earlier layer than the concept to be translated. Building on
this insight, we conduct two key experiments. First, we demonstrate that we can
change the concept without changing the language and vice versa through
activation patching alone. Second, we show that patching with the mean
representation of a concept across different languages does not affect the
models’ ability to translate it, but instead improves it. Finally, we
generalize to multi-token generation and demonstrate that the model can
generate natural language descriptions of those mean representations. Our
results provide evidence for the existence of language-agnostic concept
representations within the investigated models.
[COMMENTS]
20 pages, 14 figures, previous version published under the title “How
Do Llamas Process Multilingual Text? A Latent Exploration through Activation
Patching” at the ICML 2024 mechanistic interpretability workshop at
https://openreview.net/forum?id=0ku2hIm4BS
[LINK]
http://arxiv.org/abs/2411.08745v4
[DATE]
2025-06-25 23:16:54+08:00
[CATEGORIES]
cs.CL
Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
[AUTHORS]
Charles Arnal, Gaëtan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, Remi Munos
[ABSTRACT]
Reinforcement learning (RL) is increasingly used to align large language
models (LLMs). Off-policy methods offer greater implementation simplicity and
data efficiency than on-policy techniques, but often result in suboptimal
performance. In this work, we study the intermediate range of algorithms
between off-policy RL and supervised fine-tuning by analyzing a simple
off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with
$r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$
emphasizes high-reward samples, while raising it penalizes low-reward ones more
heavily. We first provide a theoretical analysis of this off-policy REINFORCE
algorithm, showing that when the baseline $V$ lower-bounds the expected reward,
the algorithm enjoys a policy improvement guarantee. Our analysis reveals that
while on-policy updates can safely leverage both positive and negative signals,
off-policy updates benefit from focusing more on positive rewards than on
negative ones. We validate our findings experimentally in a controlled
stochastic bandit setting and through fine-tuning state-of-the-art LLMs on
reasoning tasks.
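The role of the baseline in $A=r-V$ can be seen in a small bandit sketch. It assumes a softmax policy over three actions and computes the expected update when actions come from a fixed behavior distribution; the function names and all numbers are illustrative assumptions, not the paper's experimental setup.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expected_update(logits, behavior, rewards, V):
    """Expected off-policy REINFORCE gradient w.r.t. the logits of a softmax policy.

    Actions are drawn from `behavior` (not from the current policy), and each
    sampled action a contributes (r(a) - V) * grad log pi(a).
    """
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (mu_a, r_a) in enumerate(zip(behavior, rewards)):
        advantage = r_a - V                      # A = r - V
        for k in range(len(logits)):
            # d log pi(a) / d logit_k = 1[k == a] - pi_k
            grad[k] += mu_a * advantage * ((1.0 if k == a else 0.0) - pi[k])
    return grad

logits = [0.0, 0.0, 0.0]          # uniform current policy
behavior = [0.6, 0.3, 0.1]        # off-policy data distribution (made up)
rewards = [0.0, 0.5, 1.0]         # made-up rewards per action

low_V = expected_update(logits, behavior, rewards, V=-1.0)
high_V = expected_update(logits, behavior, rewards, V=2.0)
```

In this toy setting, a low baseline makes every advantage positive, so the update pulls the policy toward frequently sampled actions, much like supervised fine-tuning on the data. A high baseline makes every advantage negative and pushes the policy away from them.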
[LINK]
http://arxiv.org/abs/2506.20520v1
[DATE]
2025-06-25 23:07:16+08:00
[CATEGORIES]
cs.LG
cs.CL
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
[AUTHORS]
Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu
[ABSTRACT]
Different base language model families, such as Llama and Qwen, exhibit
divergent behaviors during post-training with reinforcement learning (RL),
especially on reasoning-intensive tasks. What makes a base language model
suitable for reinforcement learning? Gaining deeper insight into this question
is essential for developing RL-scalable foundation models of the next
generation. In this work, we investigate how mid-training strategies shape RL
dynamics, focusing on two representative model families: Qwen and Llama. Our
study reveals that (1) high-quality mathematical corpora, such as
MegaMath-Web-Pro, significantly improve both base model and RL performance,
while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further
adding QA-style data, particularly long chain-of-thought (CoT) reasoning
examples, enhances RL outcomes, and instruction data further unlocks this
effect; (3) while long-CoT improves reasoning depth, it can also induce
verbosity of model responses and instability of RL training, underscoring the
importance of data formatting; (4) scaling mid-training consistently leads to
stronger downstream RL performance. Building on these insights, we introduce a
two-stage mid-training strategy, Stable-then-Decay, in which base models are
first trained on 200B tokens with a constant learning rate, followed by 20B
tokens across three CoT-focused branches with learning rate decay. This yields
OctoThinker, a family of models demonstrating strong RL compatibility and
closing the performance gap with more RL-friendly model families, i.e., Qwen.
We hope our work will help shape pre-training strategies for foundation models
in the RL era. To support further research, we release our open-source models
along with a curated math reasoning-intensive corpus of over 70 billion tokens
(i.e., MegaMath-Web-Pro-Max).
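The two-stage schedule can be sketched as a function of tokens seen. The 200B and 20B token budgets come from the abstract; the peak learning rate, the final ratio, and the cosine shape of the decay are assumptions made here for illustration, since the abstract does not specify them.

```python
import math

STABLE_TOKENS = 200e9   # from the abstract: constant-LR stage
DECAY_TOKENS = 20e9     # from the abstract: decay stage (per branch)

def stable_then_decay_lr(tokens_seen, peak_lr=3e-4, final_ratio=0.1):
    """Learning rate after `tokens_seen` training tokens.

    peak_lr, final_ratio, and the cosine shape are illustrative assumptions.
    """
    if tokens_seen <= STABLE_TOKENS:
        return peak_lr                               # stable stage
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    floor = peak_lr * final_ratio
    # Cosine interpolation from peak_lr down to floor over the decay stage.
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))

lr_start = stable_then_decay_lr(0)
lr_mid_decay = stable_then_decay_lr(210e9)
lr_end = stable_then_decay_lr(220e9)
```

Each CoT-focused branch would restart from the same stable-stage checkpoint and run its own copy of the decay stage.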
[COMMENTS]
26 pages; The first three authors contribute to this work equally
[LINK]
http://arxiv.org/abs/2506.20512v1
[DATE]
2025-06-25 22:58:13+08:00
[CATEGORIES]
cs.CL
cs.LG
ReCode: Updating Code API Knowledge with Reinforcement Learning
[AUTHORS]
Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang
[ABSTRACT]
Large Language Models (LLMs) exhibit remarkable code generation capabilities
but falter when adapting to frequent updates in external library APIs. This
critical limitation, stemming from reliance on outdated API knowledge from
their training data, even with access to current documentation, impedes
reliable code generation in dynamic environments. To tackle this issue, we
propose ReCode (rule-based Reinforcement learning for Code Update), a novel
framework that mimics human programmer adaptation to API changes. Specifically,
we construct a dataset of approximately 2,000 data entries to train the LLMs to
perform version migration based on updated information. Then, we introduce a
modified string similarity metric for code evaluation as the reward for
reinforcement learning. Our experiments demonstrate that ReCode substantially
boosts LLMs’ code generation performance in dynamic API scenarios, especially
on the unseen CodeUpdateArena task. Crucially, compared to supervised
fine-tuning, ReCode has less impact on LLMs’ general code generation abilities.
We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and
DAPO), all achieving consistent improvements. Notably, after training,
Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned
model and the reasoning model with the same architecture. Code is available at
https://github.com/zjunlp/ReCode.
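A rule-based similarity reward of this kind might look like the sketch below. The abstract does not define the modified metric, so the standard-library `difflib.SequenceMatcher` ratio is swapped in as a stand-in; the function name and the pandas migration example are illustrative assumptions.

```python
import difflib

def similarity_reward(generated: str, reference: str) -> float:
    """Reward in [0, 1]; 1.0 means an exact match after whitespace normalization."""
    def normalize(code: str) -> str:
        # Drop blank lines and surrounding whitespace so formatting is not penalized.
        return "\n".join(line.strip() for line in code.strip().splitlines() if line.strip())
    return difflib.SequenceMatcher(None, normalize(generated), normalize(reference)).ratio()

# Hypothetical migration target: DataFrame.append was removed in pandas 2.0.
reference = "df = pd.concat([df, new_rows])"
migrated = "df = pd.concat([df, new_rows])  "
outdated = "df = df.append(new_rows)"

reward_migrated = similarity_reward(migrated, reference)
reward_outdated = similarity_reward(outdated, reference)
```

A dense reward like this gives partial credit for nearly correct migrations, which a pass/fail check would not.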
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2506.20495v1
[DATE]
2025-06-25 22:41:13+08:00
[CATEGORIES]
cs.CL
cs.LG
Counterfactual Influence as a Distributional Quantity
[AUTHORS]
Matthieu Meeus, Igor Shilov, Georgios Kaissis, Yves-Alexandre de Montjoye
[ABSTRACT]
Machine learning models are known to memorize samples from their training
data, raising concerns around privacy and generalization. Counterfactual
self-influence is a popular metric to study memorization, quantifying how the
model’s prediction for a sample changes depending on the sample’s inclusion in
the training dataset. However, recent work has shown memorization to be
affected by factors beyond self-influence, with other training samples, in
particular (near-)duplicates, having a large impact. We here study memorization
treating counterfactual influence as a distributional quantity, taking into
account how all training samples influence how a sample is memorized. For a
small language model, we compute the full influence distribution of training
samples on each other and analyze its properties. We find that solely looking
at self-influence can severely underestimate tangible risks associated with
memorization: the presence of (near-)duplicates seriously reduces
self-influence, while we find these samples to be (near-)extractable. We
observe similar patterns for image classification, where simply looking at the
influence distributions reveals the presence of near-duplicates in CIFAR-10.
Our findings highlight that memorization stems from complex interactions across
training data and is better captured by the full influence distribution than by
self-influence alone.
[COMMENTS]
Workshop on The Impact of Memorization on Trustworthy Foundation
Models (MemFM) @ ICML 2025
[LINK]
http://arxiv.org/abs/2506.20481v1
[DATE]
2025-06-25 22:25:11+08:00
[CATEGORIES]
cs.LG
cs.CL
GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching
[AUTHORS]
Guinan Su, Li Shen, Lu Yin, Shiwei Liu, Yanwu Yang, Jonas Geiping
[ABSTRACT]
Large language models (LLMs) have shown remarkable capabilities in language
understanding and generation. However, such impressive capability typically
comes with a substantial model size, which presents significant challenges in
deployment and inference. While structured pruning of model parameters offers a
promising way to reduce computational costs at deployment time, current methods
primarily focus on single model pruning. In this work, we develop a novel
strategy to compress models by strategically combining or merging layers from
finetuned model variants, which preserves the original model’s abilities by
aggregating capabilities accentuated in different finetunes. We pose the
optimal tailoring of these LLMs as a zero-order optimization problem, adopting
a search space that supports three different operations: (1) Layer removal, (2)
Layer selection from different candidate models, and (3) Layer merging. Our
experiments demonstrate that this approach leads to competitive model pruning;
for example, for the Llama2-13B model families, our compressed models maintain
approximately 97.3% of the original performance while removing $\sim25\%$ of
parameters, significantly outperforming previous state-of-the-art methods. The
code is available at https://github.com/Guinan-Su/auto-merge-llm.
[LINK]
http://arxiv.org/abs/2506.20480v1
[DATE]
2025-06-25 22:24:59+08:00
[CATEGORIES]
cs.CL
Graph Linearization Methods for Reasoning on Graphs with Large Language Models
[AUTHORS]
Christos Xypolopoulos, Guokan Shang, Xiao Fei, Giannis Nikolentzos, Hadi Abdine, Iakovos Evdaimon, Michail Chatzianastasis, Giorgos Stamou, Michalis Vazirgiannis
[ABSTRACT]
Large language models have evolved to process multiple modalities beyond
text, such as images and audio, which motivates us to explore how to
effectively leverage them for graph reasoning tasks. The key question,
therefore, is how to transform graphs into linear sequences of tokens, a
process we term “graph linearization”, so that LLMs can handle graphs
naturally. We consider that graphs should be linearized meaningfully to reflect
certain properties of natural language text, such as local dependency and
global alignment, in order to help contemporary LLMs, trained on trillions of
textual tokens, better understand graphs. To achieve this, we developed several
graph linearization methods based on graph centrality and degeneracy. These
methods are further enhanced using node relabeling techniques. The experimental
results demonstrate the effectiveness of our methods compared to the random
linearization baseline. Our work introduces novel graph representations
suitable for LLMs, contributing to the potential integration of graph machine
learning with the trend of multimodal processing using a unified transformer
model.
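One possible centrality-based linearization with node relabeling is sketched below. It ranks nodes by degree (a simple centrality measure), renames them by rank, and emits the edge list as text; the function name, the tie-breaking rule, and the output format are assumptions for illustration, not the methods evaluated in the paper.

```python
from collections import defaultdict

def linearize_by_degree(edges):
    """Turn an undirected edge list into a token sequence ordered by node degree."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Rank nodes by descending degree; ties are broken by original label.
    ranked = sorted(degree, key=lambda n: (-degree[n], n))
    # Relabel: the most central node becomes 0, the next one 1, and so on.
    new_label = {node: i for i, node in enumerate(ranked)}
    relabeled = sorted(tuple(sorted((new_label[u], new_label[v]))) for u, v in edges)
    return " ".join(f"({u},{v})" for u, v in relabeled)

edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]
text = linearize_by_degree(edges)
# text == "(0,1) (0,2) (0,3) (1,2)"
```

After relabeling, edges that touch the most central node appear first and share small indices, which gives the sequence a consistent ordering across graphs.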
[LINK]
http://arxiv.org/abs/2410.19494v3
[DATE]
2025-06-25 22:24:33+08:00
[CATEGORIES]
cs.CL
cs.LG
Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations
[AUTHORS]
Kaixiang Zhang, Justine Zhang, Cristian Danescu-Niculescu-Mizil
[ABSTRACT]
An intrinsic aspect of every conversation is the way talk-time is shared
between multiple speakers. Conversations can be balanced, with each speaker
claiming a similar amount of talk-time, or imbalanced when one talks
disproportionately. Such overall distributions are the consequence of
continuous negotiations between the speakers throughout the conversation: who
should be talking at every point in time, and for how long?
In this work we introduce a computational framework for quantifying both the
conversation-level distribution of talk-time between speakers and the
lower-level dynamics that lead to it. We derive a typology of talk-time sharing
dynamics structured by several intuitive axes of variation. By applying this
framework to a large dataset of video-chats between strangers, we confirm that,
perhaps unsurprisingly, different conversation-level distributions of talk-time
are perceived differently by speakers, with balanced conversations being
preferred over imbalanced ones, especially by those who end up talking less.
Then we reveal that – even when they lead to the same level of overall balance
– different types of talk-time sharing dynamics are perceived differently by
the participants, highlighting the relevance of our newly introduced typology.
Finally, we discuss how our framework offers new tools to designers of
computer-mediated communication platforms, for both human-human and human-AI
communication.
[LINK]
http://arxiv.org/abs/2506.20474v1
[DATE]
2025-06-25 22:23:02+08:00
[CATEGORIES]
cs.CL
CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
[AUTHORS]
Xiaqiang Tang, Jian Li, Keyu Hu, Du Nan, Xiaolong Li, Xi Zhang, Weigao Sun, Sihong Xie
[COMMENTS]
ACL 2025
[LINK]
http://arxiv.org/abs/2505.20767v4
[DATE]
2025-06-25 22:02:19+08:00
[CATEGORIES]
cs.CL
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception
[AUTHORS]
Shiyu Ni, Keping Bi, Jiafeng Guo, Lulu Yu, Baolong Bi, Xueqi Cheng
[ABSTRACT]
Large language models (LLMs) exhibit impressive performance across diverse
tasks but often struggle to accurately gauge their knowledge boundaries,
leading to confident yet incorrect responses. This paper explores leveraging
LLMs’ internal states to enhance their perception of knowledge boundaries from
efficiency and risk perspectives. We investigate whether LLMs can estimate
their confidence using internal states before response generation, potentially
saving computational resources. Our experiments on datasets like Natural
Questions, HotpotQA, and MMLU reveal that LLMs demonstrate significant
pre-generation perception, which is further refined post-generation, with
perception gaps remaining stable across varying conditions. To mitigate risks
in critical domains, we introduce Confidence Consistency-based Calibration
($C^3$), which assesses confidence consistency through question reformulation.
$C^3$ significantly improves LLMs’ ability to recognize their knowledge gaps,
enhancing the unknown perception rate by 5.6% on NQ and 4.9% on HotpotQA. Our
findings suggest that pre-generation confidence estimation can optimize
efficiency, while $C^3$ effectively controls output risks, advancing the
reliability of LLMs in practical applications.
[COMMENTS]
ACL2025 Main
[LINK]
http://arxiv.org/abs/2502.11677v2
[DATE]
2025-06-25 21:46:10+08:00
[CATEGORIES]
cs.CL
SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities
[AUTHORS]
Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen, Fangxiang Feng, Xiaojie Wang
[ABSTRACT]
Mixture of Experts (MoE) architectures have become a key approach for scaling
large language models, with growing interest in extending them to multimodal
tasks. Existing methods to build multimodal MoE models either incur high
training costs or suffer from degraded language capabilities when adapting
pretrained models. To address this, we propose Soft Modality-Aware Routing
(SMAR), a novel regularization technique that uses Kullback-Leibler divergence
to control routing probability distributions across modalities, encouraging
expert specialization without modifying model architecture or heavily relying
on textual data. Experiments on visual instruction tuning show that SMAR
preserves language ability at 86.6% retention with only 2.5% pure text,
outperforming baselines while maintaining strong multimodal performance. Our
approach offers a practical and efficient solution to balance modality
differentiation and language capabilities in multimodal MoE models.
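The quantity such a regularizer would control can be sketched as the KL divergence between the average routing distributions of text tokens and image tokens. How SMAR turns this into a loss term is not stated in the abstract; the helper names and router outputs below are made-up assumptions.

```python
import math

def mean_distribution(rows):
    """Average per-token routing probabilities into one distribution over experts."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same experts."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Made-up router outputs: each row is one token's probabilities over 3 experts.
text_tokens = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
image_tokens = [[0.1, 0.3, 0.6], [0.2, 0.2, 0.6]]

p_text = mean_distribution(text_tokens)
p_image = mean_distribution(image_tokens)
modality_gap = kl_divergence(p_text, p_image)
```

A larger `modality_gap` means the two modalities are routed to more distinct experts; a soft constraint would steer this value rather than hard-assign experts to modalities.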
[LINK]
http://arxiv.org/abs/2506.06406v2
[DATE]
2025-06-25 20:36:55+08:00
[CATEGORIES]
cs.CL
From Codicology to Code: A Comparative Study of Transformer and YOLO-based Detectors for Layout Analysis in Historical Documents
[AUTHORS]
Sergio Torres Aguilar
[ABSTRACT]
Robust Document Layout Analysis (DLA) is critical for the automated
processing and understanding of historical documents with complex page
organizations. This paper benchmarks five state-of-the-art object detection
architectures on three annotated datasets representing a spectrum of
codicological complexity: The e-NDP, a corpus of Parisian medieval registers
(1326-1504); CATMuS, a diverse multiclass dataset derived from various medieval
and modern sources (ca.12th-17th centuries) and HORAE, a corpus of decorated
books of hours (ca.13th-16th centuries). We evaluate two Transformer-based
models (Co-DETR, Grounding DINO) against three YOLO variants (AABB, OBB, and
YOLO-World). Our findings reveal significant performance variations dependent
on model architecture, dataset characteristics, and bounding box
representation. On the e-NDP dataset, Co-DETR achieves state-of-the-art results
(0.752 mAP@.50:.95), closely followed by YOLOv11x-OBB (0.721). Conversely, on
the more complex CATMuS and HORAE datasets, the CNN-based YOLOv11x-OBB
significantly outperforms all other models (0.564 and 0.568, respectively).
This study unequivocally demonstrates that using Oriented Bounding Boxes (OBB)
is not a minor refinement but a fundamental requirement for accurately modeling
the non-Cartesian nature of historical manuscripts. We conclude that a key
trade-off exists between the global context awareness of Transformers, ideal
for structured layouts, and the superior generalization of CNN-OBB models for
visually diverse and complex documents.
[LINK]
http://arxiv.org/abs/2506.20326v1
[DATE]
2025-06-25 19:14:04+08:00
[CATEGORIES]
cs.CL
VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback
[AUTHORS]
Sayeh Gholipour Picha, Dawood Al Chanti, Alice Caplier
[ABSTRACT]
As artificial intelligence (AI) becomes increasingly central to healthcare,
the demand for explainable and trustworthy models is paramount. Current report
generation systems for chest X-rays (CXR) often lack mechanisms for validating
outputs without expert oversight, raising concerns about reliability and
interpretability. To address these challenges, we propose a novel multimodal
framework designed to enhance the semantic alignment and localization accuracy
of AI-generated medical reports. Our framework integrates two key modules: a
Phrase Grounding Model, which identifies and localizes pathologies in CXR
images based on textual prompts, and a Text-to-Image Diffusion Module, which
generates synthetic CXR images from prompts while preserving anatomical
fidelity. By comparing features between the original and generated images, we
introduce a dual-scoring system: one score quantifies localization accuracy,
while the other evaluates semantic consistency. This approach significantly
outperforms existing methods, achieving state-of-the-art results in pathology
localization and text-to-image alignment. The integration of phrase grounding
with diffusion models, coupled with the dual-scoring evaluation system,
provides a robust mechanism for validating report quality, paving the way for
more trustworthy and transparent AI in medical imaging.
[LINK]
http://arxiv.org/abs/2501.17726v2
[DATE]
2025-06-25 19:13:35+08:00
[CATEGORIES]
cs.CL
Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning
[AUTHORS]
Lixin Wu, Na Cai, Qiao Cheng, Jiachen Wang, Yitao Duan
[ABSTRACT]
We introduce Confucius3-Math, an open-source large language model with 14B
parameters that (1) runs efficiently on a single consumer-grade GPU; (2)
achieves SOTA performance on a range of mathematical reasoning tasks,
outperforming many models with significantly larger sizes. In particular, as
part of our mission to enhance education and knowledge dissemination with AI,
Confucius3-Math is specifically committed to mathematics learning for Chinese
K-12 students and educators. Built via post-training with large-scale
reinforcement learning (RL), Confucius3-Math aligns with the national curriculum
and excels at solving mainstream Chinese K-12 mathematical problems with low
cost. In this report we share our development recipe, the challenges we
encounter and the techniques we develop to overcome them. In particular, we
introduce three technical innovations: Targeted Entropy Regularization, Recent
Sample Recovery and Policy-Specific Hardness Weighting. These innovations
encompass a new entropy regularization, a novel data scheduling policy, and an
improved group-relative advantage estimator. Collectively, they significantly
stabilize the RL training, improve data efficiency, and boost performance. Our
work demonstrates the feasibility of building strong reasoning models in a
particular domain at low cost. We open-source our model and code at
https://github.com/netease-youdao/Confucius3-Math.
[LINK]
http://arxiv.org/abs/2506.18330v2
[DATE]
2025-06-25 18:49:23+08:00
[CATEGORIES]
cs.LG
cs.CL
VAQUUM: Are Vague Quantifiers Grounded in Visual Data?
[AUTHORS]
Hugh Mee Wong, Rick Nouwen, Albert Gatt
[COMMENTS]
Proceedings of ACL 2025, 10 pages
[LINK]
http://arxiv.org/abs/2502.11874v3
[DATE]
2025-06-25 18:46:05+08:00
[CATEGORIES]
cs.CL
FundaQ-8: A Clinically-Inspired Scoring Framework for Automated Fundus Image Quality Assessment
[AUTHORS]
Lee Qi Zun, Oscar Wong Jin Hao, Nor Anita Binti Che Omar, Zalifa Zakiah Binti Asnir, Mohamad Sabri bin Sinal Zainal, Goh Man Fye
[ABSTRACT]
Automated fundus image quality assessment (FIQA) remains a challenge due to
variations in image acquisition and subjective expert evaluations. We introduce
FundaQ-8, a novel expert-validated framework for systematically assessing
fundus image quality using eight critical parameters, including field coverage,
anatomical visibility, illumination, and image artifacts. Using FundaQ-8 as a
structured scoring reference, we develop a ResNet18-based regression model to
predict continuous quality scores in the 0 to 1 range. The model is trained on
1800 fundus images from real-world clinical sources and Kaggle datasets, using
transfer learning, mean squared error optimization, and standardized
preprocessing. Validation against the EyeQ dataset and statistical analyses
confirm the framework’s reliability and clinical interpretability.
Incorporating FundaQ-8 into deep learning models for diabetic retinopathy
grading also improves diagnostic robustness, highlighting the value of
quality-aware training in real-world screening applications.
[LINK]
http://arxiv.org/abs/2506.20303v1
[DATE]
2025-06-25 18:28:53+08:00
[CATEGORIES]
cs.CL
LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
[AUTHORS]
Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, Jiajun Zhang
[COMMENTS]
ACL-2025, our code is available at https://github.com/ZNLP/LR2Bench
[LINK]
http://arxiv.org/abs/2502.17848v4
[DATE]
2025-06-25 17:36:23+08:00
[CATEGORIES]
cs.CL
LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs
[AUTHORS]
Jianghao Chen, Junhong Wu, Yangyifan Xu, Jiajun Zhang
[ABSTRACT]
Long-context modeling has drawn more and more attention in the area of Large
Language Models (LLMs). Continual training with long-context data becomes the
de-facto method to equip LLMs with the ability to process long inputs. However,
it still remains an open challenge to measure the quality of long-context
training data. To address this issue, we propose a Long-context data selection
framework with Attention-based Dependency Measurement (LADM), which can
efficiently identify high-quality long-context data from a large-scale,
multi-domain pre-training corpus. LADM leverages the retrieval capabilities of
the attention mechanism to capture contextual dependencies, ensuring a
comprehensive quality measurement of long-context data. Experimental results
show that our LADM framework significantly boosts the performance of LLMs on
multiple long-context tasks with only 1B tokens for continual training.
[COMMENTS]
ACL 2025, our code is available at https://github.com/ZNLP/LADM
[LINK]
http://arxiv.org/abs/2503.02502v2
[DATE]
2025-06-25 17:27:33+08:00
[CATEGORIES]
cs.CL
Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models
[AUTHORS]
Kai-Robin Lange, Tobias Schmidt, Matthias Reccius, Henrik Müller, Michael Roos, Carsten Jentsch
[ABSTRACT]
With rapidly evolving media narratives, it has become increasingly critical
not just to extract narratives from a given corpus but also to investigate how
they develop over time. While popular narrative extraction methods such as
Large Language Models do well in capturing typical narrative elements or even
the complex structure of a narrative, applying them to an entire corpus comes
with obstacles, such as a high financial or computational cost. We propose a
combination of the language understanding capabilities of Large Language Models
with the large scale applicability of topic models to dynamically model
narrative shifts across time using the Narrative Policy Framework. We apply a
topic model and a corresponding change point detection method to find changes
that concern a specific topic of interest. Using this model, we filter our
corpus for documents that are particularly representative of that change and
feed them into a Large Language Model that interprets the change that happened
in an automated fashion and distinguishes between content and narrative shifts.
We employ our pipeline on a corpus of The Wall Street Journal newspaper
articles from 2009 to 2023. Our findings indicate that a Large Language Model
can efficiently extract a narrative shift if one exists at a given point in
time, but does not perform as well when having to decide whether a shift in
content or a narrative shift took place.
[COMMENTS]
14 pages, 1 figure
[LINK]
http://arxiv.org/abs/2506.20269v1
[DATE]
2025-06-25 17:25:15+08:00
[CATEGORIES]
cs.CL
Language Modeling by Language Models
[AUTHORS]
Junyan Cheng, Peter Clark, Kyle Richardson
[ABSTRACT]
Can we leverage LLMs to model the process of discovering novel language model
(LM) architectures? Inspired by real research, we propose a multi-agent LLM
approach that simulates the conventional stages of research, from ideation and
literature search (proposal stage) to design implementation (code generation),
generative pre-training, and downstream evaluation (verification). Using ideas
from scaling laws, our system, Genesys, employs a Ladder of Scales approach;
new designs are proposed, adversarially reviewed, implemented, and selectively
verified at increasingly larger model scales (14M$\sim$350M parameters) with a
narrowing budget (the number of models we can train at each scale). To help
make discovery efficient and factorizable, Genesys uses a novel genetic
programming backbone, which we show has empirical advantages over commonly used
direct prompt generation workflows (e.g., $\sim$86\% percentage point
improvement in successful design generation, a key bottleneck). We report
experiments involving 1,162 newly discovered designs (1,062 fully verified
through pre-training) and find the best designs to be highly competitive with
known architectures (e.g., outperform GPT2, Mamba2, etc., on 6/9 common
benchmarks). We couple these results with comprehensive system-level ablations
and formal results, which give broader insights into the design of effective
autonomous discovery systems.
[LINK]
http://arxiv.org/abs/2506.20249v1
[DATE]
2025-06-25 16:46:10+08:00
[CATEGORIES]
cs.CL
Enhancing Large Language Models through Structured Reasoning
[AUTHORS]
Yubo Dong, Hehe Fan
[ABSTRACT]
Recent Large Language Models (LLMs) have significantly advanced natural
language processing and automated decision-making. However, these models still
encounter difficulties when performing complex reasoning tasks involving
logical deduction and systematic planning, primarily due to their reliance on
implicit statistical relationships without structured knowledge
representation. Inspired by cognitive science and neurosymbolic AI, we introduce
a novel approach to enhance LLMs through explicit structured reasoning. First,
we convert unstructured data into structured formats by explicitly annotating
reasoning steps. We then employ this structured dataset to train LLMs through
Supervised Fine-Tuning (SFT). Additionally, we enhance the structured reasoning
capabilities of LLMs using Group Relative Policy Optimization (GRPO),
incorporating two innovative algorithms, MAX-Flow and Longest Common
Subsequence (LCS), which notably improve reasoning effectiveness and reduce
computational complexity. Experimental results from fine-tuning a
DeepSeek-R1-Distill-Qwen-1.5B model demonstrate concise reasoning, robust
performance across various scenarios, and improved compatibility with
optimization techniques, validating the efficacy of structured reasoning
integration in LLMs.
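As a point of reference for the LCS component named above, here is a minimal sketch of the classic dynamic-programming computation of longest common subsequence length. The abstract does not say how LCS enters the GRPO reward, so applying it to sequences of reasoning-step labels is an assumption made purely for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical step labels for a generated and a reference reasoning chain.
    generated = ["parse", "define", "expand", "solve", "check"]
    reference = ["parse", "define", "solve", "check"]
    overlap = lcs_length(generated, reference)
    print(overlap / len(reference))  # fraction of reference steps matched in order
```

The table is kept to two rows, so memory grows with the shorter dimension only, while time remains proportional to the product of the two lengths.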
[COMMENTS]
Preprint. Under review
[LINK]
http://arxiv.org/abs/2506.20241v1
[DATE]
2025-06-25 16:36:12+08:00
[CATEGORIES]
cs.CL
LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models
[AUTHORS]
Hengyuan Zhao, Ziqin Wang, Qixin Sun, Kaiyou Song, Yilin Li, Xiaolin Hu, Qingpei Guo, Si Liu
[ABSTRACT]
Mixture of Experts (MoE) architectures have recently advanced the scalability
and adaptability of large language models (LLMs) for continual multimodal
learning. However, efficiently extending these models to accommodate sequential
tasks remains challenging. As new tasks arrive, naive model expansion leads to
rapid parameter growth, while modifying shared routing components often causes
catastrophic forgetting, undermining previously learned knowledge. To address
these issues, we propose LLaVA-CMoE, a continual learning framework for LLMs
that requires no replay data of previous tasks and ensures both parameter
efficiency and robust knowledge retention. Our approach introduces a
Probe-Guided Knowledge Extension mechanism, which uses probe experts to
dynamically determine when and where new experts should be added, enabling
adaptive and minimal parameter expansion tailored to task complexity.
Furthermore, we present a Probabilistic Task Locator that assigns each task a
dedicated, lightweight router. To handle the practical issue that task labels
are unknown during inference, we leverage a VAE-based reconstruction strategy
to identify the most suitable router by matching input distributions, allowing
automatic and accurate expert allocation. This design mitigates routing
conflicts and catastrophic forgetting, enabling robust continual learning
without explicit task labels. Extensive experiments on the CoIN benchmark,
covering eight diverse VQA tasks, demonstrate that LLaVA-CMoE delivers strong
continual learning performance with a compact model size, significantly
reducing forgetting and parameter overhead compared to prior methods. These
results showcase the effectiveness and scalability of our approach for
parameter-efficient continual learning in large language models. Our code will
be open-sourced soon.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2503.21227v3
[DATE]
2025-06-25 16:30:20+08:00
[CATEGORIES]
cs.CL
Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems
[AUTHORS]
Benedetta Muscato, Lucia Passaro, Gizem Gezici, Fosca Giannotti
[ABSTRACT]
In the realm of Natural Language Processing (NLP), common approaches for
handling human disagreement consist of aggregating annotators’ viewpoints to
establish a single ground truth. However, prior studies show that disregarding
individual opinions can lead to the side effect of underrepresenting
minority perspectives, especially in subjective tasks, where annotators may
systematically disagree because of their preferences. Recognizing that labels
reflect the diverse backgrounds, life experiences, and values of individuals,
this study proposes a new multi-perspective approach using soft labels to
encourage the development of the next generation of perspective-aware models
that are more inclusive and pluralistic. We conduct an extensive analysis across diverse
subjective text classification tasks, including hate speech, irony, abusive
language, and stance detection, to highlight the importance of capturing human
disagreements, often overlooked by traditional aggregation methods. Results
show that the multi-perspective approach not only better approximates human
label distributions, as measured by Jensen-Shannon Divergence (JSD), but also
achieves superior classification performance (higher F1 scores), outperforming
traditional approaches. However, our approach exhibits lower confidence in
tasks like irony and stance detection, likely due to the inherent subjectivity
present in the texts. Lastly, leveraging Explainable AI (XAI), we explore model
uncertainty and uncover meaningful insights into model predictions.
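To make the evaluation measure concrete, the sketch below builds a soft label from annotator votes and computes the Jensen-Shannon Divergence (base 2, so values lie in [0, 1]) between that soft label and a model's predicted distribution. The helper names and the example votes are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def soft_label(votes, classes):
    """Turn a list of annotator votes into a probability distribution."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


def kl_divergence(p, q):
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions over the same classes."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    classes = ["hate", "not_hate"]
    human = soft_label(["hate", "hate", "not_hate"], classes)
    model = [0.6, 0.4]  # hypothetical model output
    print(jensen_shannon_divergence(human, model))
```

A majority-vote label would collapse the human distribution to [1.0, 0.0], discarding the dissenting annotator; the soft label keeps that disagreement visible to the metric.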
[LINK]
http://arxiv.org/abs/2506.20209v1
[DATE]
2025-06-25 15:53:36+08:00
[CATEGORIES]
cs.CL
Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn’t Help with MT Evaluation
[AUTHORS]
Petra Barančíková, Ondřej Bojar
[ABSTRACT]
In this paper, we compare Czech-specific and multilingual sentence embedding
models through intrinsic and extrinsic evaluation paradigms. For intrinsic
evaluation, we employ Costra, a complex sentence transformation dataset, and
several Semantic Textual Similarity (STS) benchmarks to assess the ability of
the embeddings to capture linguistic phenomena such as semantic similarity,
temporal aspects, and stylistic variations. In the extrinsic evaluation, we
fine-tune each embedding model using COMET-based metrics for machine
translation evaluation.
Our experiments reveal an interesting disconnect: models that excel in
intrinsic semantic similarity tests do not consistently yield superior
performance on downstream translation evaluation tasks. Conversely, models with
seemingly over-smoothed embedding spaces can, through fine-tuning, achieve
excellent results. These findings highlight the complex relationship between
semantic property probes and downstream tasks, emphasizing the need for more
research into ‘operationalizable semantics’ in sentence embeddings, or more
in-depth downstream task datasets (here, translation evaluation).
[LINK]
http://arxiv.org/abs/2506.20203v1
[DATE]
2025-06-25 15:46:17+08:00
[CATEGORIES]
cs.CL
COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees
[AUTHORS]
Zhiyuan Wang, Jinhao Duan, Qingni Wang, Xiaofeng Zhu, Tianlong Chen, Xiaoshuang Shi, Kaidi Xu
[ABSTRACT]
Uncertainty quantification (UQ) for foundation models is essential to
identify and mitigate potential hallucinations in automatically generated text.
However, heuristic UQ approaches lack formal guarantees for key metrics such as
the false discovery rate (FDR) in selective prediction. Previous work adopts
the split conformal prediction (SCP) framework to ensure desired coverage of
admissible answers by constructing prediction sets, but these sets often
contain incorrect candidates, limiting their practical utility. To address
this, we propose COIN, an uncertainty-guarding selection framework that
calibrates statistically valid thresholds to filter a single generated answer
per question under user-specified FDR constraints. COIN estimates the empirical
error rate on a calibration set and applies confidence interval methods such as
Clopper-Pearson to establish a high-probability upper bound on the true error
rate (i.e., FDR). This enables the selection of the largest uncertainty
threshold that ensures FDR control on test data while significantly increasing
sample retention. We demonstrate COIN’s robustness in risk control, strong
test-time power in retaining admissible answers, and predictive efficiency
under limited calibration data across both general and multimodal text
generation tasks. Furthermore, we show that employing alternative upper bound
constructions and UQ strategies can further boost COIN’s power performance,
which underscores its extensibility and adaptability to diverse application
scenarios.
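The calibration step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not COIN's actual implementation: the function names are hypothetical, the one-sided Clopper-Pearson bound is found by bisection on the binomial CDF using only the standard library, and an answer is accepted when its uncertainty score is at or below the threshold.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k, n, alpha=0.05):
    """One-sided (1 - alpha) upper bound on an error rate, given k errors in n trials."""
    if k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: the CDF decreases as p grows
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def select_threshold(uncertainties, correct, target_fdr, alpha=0.05):
    """Largest uncertainty threshold whose error-rate upper bound stays within target_fdr."""
    best = None
    for t in sorted(set(uncertainties)):
        accepted = [c for u, c in zip(uncertainties, correct) if u <= t]
        errors = sum(1 for c in accepted if not c)
        if clopper_pearson_upper(errors, len(accepted), alpha) <= target_fdr:
            best = t
    return best  # None means no threshold meets the constraint


if __name__ == "__main__":
    # Hypothetical calibration set: 20 confident correct answers, 3 uncertain wrong ones.
    scores = [i / 100 for i in range(20)] + [0.5, 0.6, 0.7]
    labels = [True] * 20 + [False] * 3
    print(select_threshold(scores, labels, target_fdr=0.15))
```

Because the bound is conservative for small samples, very low thresholds that admit only a handful of calibration answers are rejected even when none of them is wrong.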
[LINK]
http://arxiv.org/abs/2506.20178v1
[DATE]
2025-06-25 15:04:49+08:00
[CATEGORIES]
cs.CL
cs.LG
Conversational User-AI Intervention: A Study on Prompt Rewriting for Improved LLM Response Generation
[AUTHORS]
Rupak Sarkar, Bahareh Sarrafzadeh, Nirupama Chandrasekaran, Nagu Rangan, Philip Resnik, Longqi Yang, Sujay Kumar Jauhar
[COMMENTS]
8 pages, ACL style
[LINK]
http://arxiv.org/abs/2503.16789v2
[DATE]
2025-06-25 14:44:58+08:00
[CATEGORIES]
cs.CL
SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs
[AUTHORS]
Fengze Li, Yue Wang, Yangle Liu, Ming Huang, Dou Hong, Jieming Ma
[ABSTRACT]
Multivariate time series forecasting requires models to simultaneously
capture variable-wise structural dependencies and generalize across diverse
tasks. While structural encoders are effective in modeling feature
interactions, they lack the capacity to support semantic-level reasoning or
task adaptation. Conversely, large language models (LLMs) possess strong
generalization capabilities but remain incompatible with raw time series
inputs. This gap limits the development of unified, transferable prediction
systems. Therefore, we introduce SEED, a structural encoder for
embedding-driven decoding, which integrates four stages: a token-aware encoder
for patch extraction, a projection module that aligns patches with language
model embeddings, a semantic reprogramming mechanism that maps patches to
task-aware prototypes, and a frozen language model for prediction. This modular
architecture decouples representation learning from inference, enabling
efficient alignment between numerical patterns and semantic reasoning.
Empirical results demonstrate that the proposed method achieves consistent
improvements over strong baselines, and comparative studies on various datasets
confirm SEED’s role in addressing the structural-semantic modeling gap.
[LINK]
http://arxiv.org/abs/2506.20167v1
[DATE]
2025-06-25 14:40:14+08:00
[CATEGORIES]
cs.CL
Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners
[AUTHORS]
Miao Peng, Nuo Chen, Zongrui Suo, Jia Li
[ABSTRACT]
Despite significant advancements in Large Language Models (LLMs), developing
advanced reasoning capabilities in LLMs remains a key challenge. Process Reward
Models (PRMs) have demonstrated exceptional promise in enhancing reasoning by
providing step-wise feedback, particularly in the context of mathematical
reasoning. However, their application to broader reasoning domains remains
understudied, largely due to the high costs associated with manually creating
step-level supervision. In this work, we explore the potential of PRMs in graph
reasoning problems - a domain that demands sophisticated multi-step reasoning
and offers opportunities for automated step-level data generation using
established graph algorithms. We introduce GraphSILO, the largest dataset for
graph reasoning problems with fine-grained step-wise labels, built using
automated Task-oriented Trajectories and Monte Carlo Tree Search (MCTS) to
generate detailed reasoning steps with step-wise labels. Building upon this
dataset, we train GraphPRM, the first PRM designed for graph reasoning
problems, and evaluate its effectiveness in two key settings: inference-time
scaling and reinforcement learning via Direct Preference Optimization (DPO).
Experimental results show that GraphPRM significantly improves LLM performance
across 13 graph reasoning tasks, delivering a 9% gain for Qwen2.5-7B and
demonstrating transferability to new graph reasoning datasets and new reasoning
domains like mathematical problem-solving. Notably, GraphPRM enhances LLM
performance on GSM8K and Math500, underscoring the cross-domain applicability
of graph-based reasoning rewards. Our findings highlight the potential of PRMs
in advancing reasoning across diverse domains, paving the way for more
versatile and effective LLMs.
[COMMENTS]
Accepted to KDD 2025 Research Track
[LINK]
http://arxiv.org/abs/2503.00845v2
[DATE]
2025-06-25 14:00:08+08:00
[CATEGORIES]
cs.CL
cs.LG
Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests
[AUTHORS]
Masaki Uto, Yuma Ito
[ABSTRACT]
Evaluating the abilities of learners is a fundamental objective in the field
of education. In particular, there is an increasing need to assess higher-order
abilities such as expressive skills and logical thinking. Constructed-response
tests such as short-answer and essay-based questions have become widely used as
a method to meet this demand. Although these tests are effective, they require
substantial manual grading, making them both labor-intensive and costly. Item
response theory (IRT) provides a promising solution by enabling the estimation
of ability from incomplete score data, where human raters grade only a subset
of answers provided by learners across multiple test items. However, the
accuracy of ability estimation declines as the proportion of missing scores
increases. Although data augmentation techniques for imputing missing scores
have been explored in order to address this limitation, they often struggle
with inaccuracy for sparse or heterogeneous data. To overcome these challenges,
this study proposes a novel method for imputing missing scores by leveraging
automated scoring technologies for accurate IRT-based ability estimation. The
proposed method achieves high accuracy in ability estimation while markedly
reducing manual grading workload.
[COMMENTS]
Accepted to EvalLAC’25: 2nd Workshop on Automatic Evaluation of
Learning and Assessment Content, held at AIED 2025, Palermo, Italy. This is
the camera-ready version submitted to CEUR Workshop Proceedings
[LINK]
http://arxiv.org/abs/2506.20119v1
[DATE]
2025-06-25 12:17:57+08:00
[CATEGORIES]
cs.CL
cs.LG
A Global Context Mechanism for Sequence Labeling
[AUTHORS]
Conglei Xu, Kun Shen, Hongguang Sun, Yang Xu
[ABSTRACT]
Global sentence information is crucial for sequence labeling tasks, where
each word in a sentence must be assigned a label. While BiLSTM models are
widely used, they often fail to capture sufficient global context for inner
words. Previous work has proposed various RNN variants to integrate global
sentence information into word representations. However, these approaches
suffer from three key limitations: (1) they are slower in both inference and
training compared to the original BiLSTM, (2) they cannot effectively
supplement global information for transformer-based models, and (3) they incur
a high time cost when reimplementing and integrating these customized RNNs
into existing architectures. In this study, we introduce a simple yet effective
mechanism that addresses these limitations. Our approach efficiently
supplements global sentence information for both BiLSTM and transformer-based
models, with minimal degradation in inference and training speed, and is easily
pluggable into current architectures. We demonstrate significant improvements
in F1 scores across seven popular benchmarks, including Named Entity
Recognition (NER) tasks such as Conll2003, Wnut2017, and the Chinese
named-entity recognition task Weibo, as well as End-to-End Aspect-Based
Sentiment Analysis (E2E-ABSA) benchmarks such as Laptop14, Restaurant14,
Restaurant15, and Restaurant16. Without any extra strategy, we achieve the third
highest score on the Weibo NER benchmark. Compared to CRF, one of the most popular
frameworks for sequence labeling, our mechanism achieves competitive F1 scores
while offering superior inference and training speed. Code is available at:
https://github.com/conglei2XU/Global-Context-Mechanism
[LINK]
http://arxiv.org/abs/2305.19928v5
[DATE]
2025-06-25 11:52:41+08:00
[CATEGORIES]
cs.CL
MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations
[AUTHORS]
Vardhan Dongre, Chi Gui, Shubham Garg, Hooshang Nayyeri, Gokhan Tur, Dilek Hakkani-Tür, Vikram S. Adve
[ABSTRACT]
We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning
and decision-making in consultative interaction settings. Designed for the
agriculture domain, MIRAGE captures the full complexity of expert consultations
by combining natural user queries, expert-authored responses, and image-based
context, offering a high-fidelity benchmark for evaluating models on grounded
reasoning, clarification strategies, and long-form generation in a real-world,
knowledge-intensive domain. Grounded in over 35,000 real user-expert
interactions and curated through a carefully designed multi-step pipeline,
MIRAGE spans diverse crop health, pest diagnosis, and crop management
scenarios. The benchmark includes more than 7,000 unique biological entities,
covering plant species, pests, and diseases, making it one of the most
taxonomically diverse benchmarks available for vision-language models, grounded
in the real world. Unlike existing benchmarks that rely on well-specified user
inputs and closed-set taxonomies, MIRAGE features underspecified, context-rich
scenarios with open-world settings, requiring models to infer latent knowledge
gaps, handle rare entities, and either proactively guide the interaction or
respond. Project Page: https://mirage-benchmark.github.io
[COMMENTS]
66 pages, 32 figures, 23 tables
[LINK]
http://arxiv.org/abs/2506.20100v1
[DATE]
2025-06-25 11:07:54+08:00
[CATEGORIES]
cs.LG
cs.CL
PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models
[AUTHORS]
Wang Bill Zhu, Miaosen Chai, Ishika Singh, Robin Jia, Jesse Thomason
[ABSTRACT]
We propose PSALM-V, the first autonomous neuro-symbolic learning system able
to induce symbolic action semantics (i.e., pre- and post-conditions) in visual
environments through interaction. PSALM-V bootstraps reliable symbolic planning
without expert action definitions, using LLMs to generate heuristic plans and
candidate symbolic semantics. Previous work has explored using large language
models to generate action semantics for Planning Domain Definition Language
(PDDL)-based symbolic planners. However, these approaches have primarily
focused on text-based domains or relied on unrealistic assumptions, such as
access to a predefined problem file, full observability, or explicit error
messages. By contrast, PSALM-V dynamically infers PDDL problem files and domain
action semantics by analyzing execution outcomes and synthesizing possible
error explanations. The system iteratively generates and executes plans while
maintaining a tree-structured belief over possible action semantics for each
action, iteratively refining these beliefs until a goal state is reached.
Simulated experiments of task completion in ALFRED demonstrate that PSALM-V
increases the plan success rate from 37% (Claude-3.7) to 74% in partially
observed setups. Results on two 2D game environments, RTFM and Overcooked-AI,
show that PSALM-V improves step efficiency and succeeds in domain induction in
multi-agent settings. PSALM-V correctly induces PDDL pre- and post-conditions
for real-world robot BlocksWorld tasks, despite low-level manipulation failures
from the robot.
[LINK]
http://arxiv.org/abs/2506.20097v1
[DATE]
2025-06-25 10:44:20+08:00
[CATEGORIES]
cs.CL
PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
[AUTHORS]
Kui Huang, Xinrong Chen, Wenyu Lv, Jincheng Liao, Guanzhong Wang, Yi Liu
[ABSTRACT]
This report introduces PP-DocBee2, an advanced version of the PP-DocBee,
designed to enhance multimodal document understanding. Built on a large
multimodal model architecture, PP-DocBee2 addresses the limitations of its
predecessor through key technological improvements, including enhanced
synthetic data quality, improved visual feature fusion strategy, and optimized
inference methodologies. These enhancements yield an $11.4\%$ performance boost
on internal benchmarks for Chinese business documents, and reduce inference
latency by $73.0\%$ compared to the vanilla version. A key innovation of our work is a
data quality optimization strategy for multimodal document tasks. By employing
a large-scale multimodal pre-trained model to evaluate data, we apply a novel
statistical criterion to filter outliers, ensuring high-quality training data.
Inspired by insights into underutilized intermediate features in multimodal
models, we enhance the ViT representational capacity by decomposing it into
layers and applying a novel feature fusion strategy to improve complex
reasoning. The source code and pre-trained model are available at
https://github.com/PaddlePaddle/PaddleMIX.
[LINK]
http://arxiv.org/abs/2506.18023v2
[DATE]
2025-06-25 10:40:39+08:00
[CATEGORIES]
cs.CL
ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset
[AUTHORS]
Yilin Wang, Peixuan Lei, Jie Song, Yuzhe Hao, Tao Chen, Yuxuan Zhang, Lei Jia, Yuanxiang Li, Zhongyu Wei
[ABSTRACT]
Time-series data are critical in diverse applications, such as industrial
monitoring, medical diagnostics, and climate research. However, effectively
integrating these high-dimensional temporal signals with natural language for
dynamic, interactive tasks remains a significant challenge. To address this, we
introduce the Time-Series Question Answering (Time-Series QA) task and release
EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset
designed to capture complex interactions between time-series signals and
natural language. Building on this resource, we propose the Instruct Time
Transformer (ITFormer), a novel framework that bridges time-series encoders
with frozen large language models (LLMs). ITFormer effectively extracts,
aligns, and fuses temporal and textual features, achieving a strong improvement
in QA accuracy over strong baselines with fewer than 1\% additional trainable
parameters. By combining computational efficiency with robust cross-modal
modeling, our work establishes an adaptable paradigm for integrating temporal
data with natural language, paving the way for new research and applications in
multi-modal AI. More details about the project, including datasets and code,
are available at: https://pandalin98.github.io/itformer_site/
[LINK]
http://arxiv.org/abs/2506.20093v1
[DATE]
2025-06-25 10:33:47+08:00
[CATEGORIES]
cs.CL
Understanding World or Predicting Future? A Comprehensive Survey of World Models
[AUTHORS]
Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, Fengli Xu, Yong Li
[ABSTRACT]
The concept of world models has garnered significant attention due to
advancements in multimodal large language models such as GPT-4 and video
generation models such as Sora, which are central to the pursuit of artificial
general intelligence. This survey offers a comprehensive review of the
literature on world models. Generally, world models are regarded as tools for
either understanding the present state of the world or predicting its future
dynamics. This review presents a systematic categorization of world models,
emphasizing two primary functions: (1) constructing internal representations to
understand the mechanisms of the world, and (2) predicting future states to
simulate and guide decision-making. Initially, we examine the current progress
in these two categories. We then explore the application of world models in key
domains, including autonomous driving, robotics, and social simulacra, with a
focus on how each domain utilizes these aspects. Finally, we outline key
challenges and provide insights into potential future research directions. We
summarize the representative papers along with their code repositories in
https://github.com/tsinghua-fib-lab/World-Model.
[COMMENTS]
Accepted by ACM CSUR, 37 pages, 7 figures, 7 tables
[LINK]
http://arxiv.org/abs/2411.14499v2
[DATE]
2025-06-25 10:31:33+08:00
[CATEGORIES]
cs.CL
cs.LG
Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models
[AUTHORS]
Zhisong Zhang, Yan Wang, Xinting Huang, Tianqing Fang, Hongming Zhang, Chenlong Deng, Shuaiyi Li, Dong Yu
[COMMENTS]
ACL 2025
[LINK]
http://arxiv.org/abs/2412.16545v2
[DATE]
2025-06-25 10:28:36+08:00
[CATEGORIES]
cs.CL
Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder
[AUTHORS]
Yingji Zhang, Danilo S. Carvalho, André Freitas
[ABSTRACT]
Integrating compositional and symbolic properties into current distributional
semantic spaces can enhance the interpretability, controllability,
compositionality, and generalisation capabilities of Transformer-based
auto-regressive language models (LMs). In this survey, we offer a novel
perspective on latent space geometry through the lens of compositional
semantics, a direction we refer to as \textit{semantic representation
learning}. This direction enables a bridge between symbolic and distributional
semantics, helping to mitigate the gap between them. We review and compare
three mainstream autoencoder architectures, namely Variational AutoEncoder (VAE),
Vector Quantised VAE (VQVAE), and Sparse AutoEncoder (SAE), and examine the
distinctive latent geometries they induce in relation to semantic structure and
interpretability.
[COMMENTS]
In progress
[LINK]
http://arxiv.org/abs/2506.20083v1
[DATE]
2025-06-25 09:48:18+08:00
[CATEGORIES]
cs.CL
Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective
[AUTHORS]
Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang, Guimin Dong, Chandan K. Reddy
[ABSTRACT]
Large Language Models (LLMs) often generate responses with inherent biases,
undermining their reliability in real-world applications. Existing evaluation
methods often overlook biases in long-form responses and the intrinsic
variability of LLM outputs. To address these challenges, we propose
FiSCo (Fine-grained Semantic Computation), a novel statistical framework to
evaluate group-level fairness in LLMs by detecting subtle semantic differences
in long-form responses across demographic groups. Unlike prior work focusing on
sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis
by operating at the claim level, leveraging entailment checks to assess the
consistency of meaning across responses. We decompose model outputs into
semantically distinct claims and apply statistical hypothesis testing to
compare inter- and intra-group similarities, enabling robust detection of
subtle biases. We formalize a new group counterfactual fairness definition and
validate FiSCo on both synthetic and human-annotated datasets spanning gender,
race, and age. Experiments show that FiSCo more reliably identifies nuanced
biases while reducing the impact of stochastic LLM variability, outperforming
various evaluation metrics.
[COMMENTS]
29 pages, 9 figures, 15 tables
[LINK]
http://arxiv.org/abs/2506.19028v2
[DATE]
2025-06-25 09:21:47+08:00
[CATEGORIES]
cs.CL
mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks
[AUTHORS]
Luel Hagos Beyene, Vivek Verma, Min Ma, Jesujoba O. Alabi, Fabian David Schmidt, Joyce Nakatumba-Nabende, David Ifeoluwa Adelani
[ABSTRACT]
Large language models (LLMs) have demonstrated impressive performance on a
wide range of tasks, including in multimodal settings such as speech. However,
their evaluation is often limited to English and a few high-resource languages.
For low-resource languages, there is no standardized evaluation benchmark. In
this paper, we address this gap by introducing mSTEB, a new benchmark to
evaluate the performance of LLMs on a wide range of tasks covering language
identification, text classification, question answering, and translation tasks
on both speech and text modalities. We evaluated the performance of leading
LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open
models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in
performance between high-resource and low-resource languages, especially for
languages spoken in Africa and the Americas/Oceania. Our findings show that more
investment is needed to address their under-representation in LLM coverage.
[COMMENTS]
working paper
[LINK]
http://arxiv.org/abs/2506.08400v2
[DATE]
2025-06-25 08:58:19+08:00
[CATEGORIES]
cs.CL
cs.LG
A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs
[AUTHORS]
Kethmi Hirushini Hettige, Jiahao Ji, Cheng Long, Shili Xiang, Gao Cong, Jingyuan Wang
[ABSTRACT]
Spatio-temporal data mining plays a pivotal role in informed decision making
across diverse domains. However, existing models are often restricted to narrow
tasks, lacking the capacity for multi-task inference and complex long-form
reasoning that require generation of in-depth, explanatory outputs. These
limitations restrict their applicability to real-world, multi-faceted decision
scenarios. In this work, we introduce STReason, a novel framework that
integrates the reasoning strengths of large language models (LLMs) with the
analytical capabilities of spatio-temporal models for multi-task inference and
execution. Without requiring task-specific finetuning, STReason leverages
in-context learning to decompose complex natural language queries into modular,
interpretable programs, which are then systematically executed to generate both
solutions and detailed rationales. To facilitate rigorous evaluation, we
construct a new benchmark dataset and propose a unified evaluation framework
with metrics specifically designed for long-form spatio-temporal reasoning.
Experimental results show that STReason significantly outperforms advanced LLM
baselines across all metrics, particularly excelling in complex,
reasoning-intensive spatio-temporal scenarios. Human evaluations further
validate STReason’s credibility and practical utility, demonstrating its
potential to reduce expert workload and broaden the applicability to real-world
spatio-temporal tasks. We believe STReason provides a promising direction for
developing more capable and generalizable spatio-temporal reasoning systems.
[LINK]
http://arxiv.org/abs/2506.20073v1
[DATE]
2025-06-25 08:55:34+08:00
[CATEGORIES]
cs.CL
cs.LG
Computation Mechanism Behind LLM Position Generalization
[AUTHORS]
Chi Han, Heng Ji
[COMMENTS]
ACL 2025 Main Long Paper
[LINK]
http://arxiv.org/abs/2503.13305v3
[DATE]
2025-06-25 08:26:59+08:00
[CATEGORIES]
cs.CL
Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models
[AUTHORS]
Zhicheng Zhang, Ziyan Wang, Yali Du, Fei Fang
[ABSTRACT]
Developing effective instruction-following policies in reinforcement learning
remains challenging due to the reliance on extensive human-labeled instruction
datasets and the difficulty of learning from sparse rewards. In this paper, we
propose a novel approach that leverages the capabilities of large language
models (LLMs) to automatically generate open-ended instructions retrospectively
from previously collected agent trajectories. Our core idea is to employ LLMs
to relabel unsuccessful trajectories by identifying meaningful subtasks the
agent has implicitly accomplished, thereby enriching the agent’s training data
and substantially alleviating reliance on human annotations. Through this
open-ended instruction relabeling, we efficiently learn a unified
instruction-following policy capable of handling diverse tasks within a single
policy. We empirically evaluate our proposed method in the challenging Craftax
environment, demonstrating clear improvements in sample efficiency, instruction
coverage, and overall policy performance compared to state-of-the-art
baselines. Our results highlight the effectiveness of utilizing LLM-guided
open-ended instruction relabeling to enhance instruction-following
reinforcement learning.
[COMMENTS]
Under Review
[LINK]
http://arxiv.org/abs/2506.20061v1
[DATE]
2025-06-25 07:49:28+08:00
[CATEGORIES]
cs.LG
cs.CL
Cross-Layer Discrete Concept Discovery for Interpreting Language Models
[AUTHORS]
Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou
[ABSTRACT]
Uncovering emergent concepts across transformer layers remains a significant
challenge because the residual stream linearly mixes and duplicates
information, obscuring how features evolve within large language models.
Current research efforts primarily inspect neural representations at single
layers, thereby overlooking this cross-layer superposition and the redundancy
it introduces. These representations are typically either analyzed directly for
activation patterns or passed to probing classifiers that map them to a limited
set of predefined concepts. To address these limitations, we propose
CLVQ-VAE, a framework that uses vector quantization to map representations
across layers and in the process collapse duplicated residual-stream features
into compact, interpretable concept vectors. Our approach uniquely combines
top-$k$ temperature-based sampling during quantization with EMA codebook
updates, providing controlled exploration of the discrete latent space while
maintaining codebook diversity. We further enhance the framework with
scaled-spherical k-means++ for codebook initialization, which clusters by
directional similarity rather than magnitude, better aligning with semantic
structure in word embedding space.
[LINK]
http://arxiv.org/abs/2506.20040v1
[DATE]
2025-06-25 06:43:36+08:00
[CATEGORIES]
cs.LG
cs.CL
The Noisy Path from Source to Citation: Measuring How Scholars Engage with Past Research
[AUTHORS]
Hong Chen, Misha Teplitskiy, David Jurgens
[COMMENTS]
Accepted by ACL 2025
[LINK]
http://arxiv.org/abs/2502.20581v3
[DATE]
2025-06-25 06:00:02+08:00
[CATEGORIES]
cs.CL
Evaluating Long Range Dependency Handling in Code Generation LLMs
[AUTHORS]
Yannick Assogba, Donghao Ren
[ABSTRACT]
As language models support larger and larger context sizes, evaluating their
ability to make effective use of that context becomes increasingly important.
We analyze the ability of several code generation models to handle long range
dependencies using a suite of multi-step key retrieval tasks in context windows
up to 8k tokens in length. The tasks progressively increase in difficulty and
allow more nuanced evaluation of model capabilities than tests like the popular
needle-in-the-haystack test. We find that performance degrades significantly
for many models (up to 2x) when a function references another function that is
defined later in the prompt. We also observe that models that use sliding
window attention mechanisms have difficulty handling references further than
the size of a single window. We perform simple prompt modifications using call
graph information to improve multi-step retrieval performance up to 3x. Our
analysis highlights ways that long-context performance needs deeper
consideration beyond retrieval of single facts within a document.
[COMMENTS]
36 pages, 18 figures
[LINK]
http://arxiv.org/abs/2407.21049v2
[DATE]
2025-06-25 05:45:07+08:00
[CATEGORIES]
cs.CL
cs.LG
Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs
[AUTHORS]
Kanishka Misra, Kyle Mahowald
[ABSTRACT]
Language models learn rare syntactic phenomena, but the extent to which this
is attributable to generalization vs. memorization is a major open question. To
that end, we iteratively trained transformer language models on systematically
manipulated corpora which were human-scale in size, and then evaluated their
learning of a rare grammatical phenomenon: the English
Article+Adjective+Numeral+Noun (AANN) construction (“a beautiful five days”).
We compared how well this construction was learned on the default corpus
relative to a counterfactual corpus in which AANN sentences were removed. We
found that AANNs were still learned better than systematically perturbed
variants of the construction. Using additional counterfactual corpora, we
suggest that this learning occurs through generalization from related
constructions (e.g., “a few days”). An additional experiment showed that this
learning is enhanced when there is more variability in the input. Taken
together, our results provide an existence proof that LMs can learn rare
grammatical phenomena by generalization from less rare phenomena. Data and
code: https://github.com/kanishkamisra/aannalysis.
[COMMENTS]
Added Corrigendum to correct 4-gram baseline performance and chance
performance
[LINK]
http://arxiv.org/abs/2403.19827v3
[DATE]
2025-06-25 05:39:54+08:00
[CATEGORIES]
cs.CL
Accurate and Energy Efficient: Local Retrieval-Augmented Generation Models Outperform Commercial Large Language Models in Medical Tasks
[AUTHORS]
Konstantinos Vrettos, Michail E. Klontzas
[ABSTRACT]
Background: The increasing adoption of Artificial Intelligence (AI) in
healthcare has sparked growing concerns about its environmental and ethical
implications. Commercial Large Language Models (LLMs), such as ChatGPT and
DeepSeek, require substantial resources, while the utilization of these systems
for medical purposes raises critical issues regarding patient privacy and
safety. Methods: We developed a customizable Retrieval-Augmented Generation
(RAG) framework for medical tasks, which monitors its energy usage and CO2
emissions. This system was then used to create RAGs based on various
open-source LLMs. The tested models included both general-purpose models such
as llama3.1:8b and the medical-domain-specific medgemma-4b-it. The best RAG’s
performance and energy consumption were compared to DeepSeekV3-R1 and OpenAI’s
o4-mini model. A dataset of medical questions was used for the evaluation.
Results: Custom RAG models outperformed commercial models in accuracy and energy
consumption. The RAG model built on llama3.1:8b achieved the highest accuracy
(58.5%) and was significantly better than other models, including o4-mini and
DeepSeekV3-R1. The llama3.1-RAG also exhibited the lowest energy consumption
and CO2 footprint among all models, with a Performance per kWh of 0.52 and a
total CO2 emission of 473g. Compared to o4-mini, the llama3.1-RAG achieved 2.7
times more accuracy points per kWh and 172% less electricity usage while
maintaining higher accuracy. Conclusion: Our study demonstrates that local LLMs
can be leveraged to develop RAGs that outperform commercial, online LLMs in
medical tasks, while having a smaller environmental impact. Our modular
framework promotes sustainable AI development, reducing electricity usage and
aligning with the UN’s Sustainable Development Goals.
[COMMENTS]
18 pages, 3 Figures
[LINK]
http://arxiv.org/abs/2506.20009v1
[DATE]
2025-06-25 04:56:03+08:00
[CATEGORIES]
cs.CL
Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’
[AUTHORS]
Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan
[ABSTRACT]
Recently, a number of repository-level code generation benchmarks, such as
CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to
evaluate the capabilities of large language models (LLMs) beyond standalone
benchmarks like HumanEval and MBPP. Thus, a natural question is whether LLMs
perform as well on real-world coding tasks as they do on these benchmarks.
Unfortunately, one cannot answer this question, since these benchmarks
consist of short completions or synthetic examples, or focus on
limited-scale repositories, failing to represent real-world coding tasks.
To address these challenges, we create REPOCOD, a Python code-generation
benchmark containing complex tasks with realistic dependencies in real-world
large projects and appropriate metrics for evaluating source code. It includes
980 whole-function generation tasks from 11 popular projects, 50.8% of which
require repository-level context. REPOCOD includes 314 developer-written test
cases per instance for better evaluation. We evaluate ten LLMs on REPOCOD and
find that none achieves more than 30% pass@1, indicating the
necessity of building stronger LLMs that can help developers in real-world
software development. In addition, we found that retrieval-augmented generation
achieves better results than using target function dependencies as context.
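
For readers unfamiliar with the pass@1 metric quoted above, the following is a minimal sketch of the widely used unbiased pass@k estimator, which reduces to the fraction of passing samples when k = 1. It is an illustration under assumptions, not REPOCOD's evaluation harness: the abstract does not say how many samples are drawn per task or which estimator is used, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task.

    n: number of generated samples for the task (assumed, not from the source)
    c: number of those samples that pass every test case
    k: sample budget being scored
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With one sample per task, `benchmark_pass_at_k` is simply the share of tasks whose single generation passes all of its tests.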
[LINK]
http://arxiv.org/abs/2410.21647v4
[DATE]
2025-06-25 04:49:51+08:00
[CATEGORIES]
cs.CL
A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior
[AUTHORS]
Francesco Ignazio Re, Andreas Opedal, Glib Manaiev, Mario Giulianelli, Ryan Cotterell
[ABSTRACT]
Reading is a process that unfolds across space and time, alternating between
fixations where a reader focuses on a specific point in space, and saccades
where a reader rapidly shifts their focus to a new point. An ansatz of
psycholinguistics is that modeling a reader’s fixations and saccades yields
insight into their online sentence processing. However, standard approaches to
such modeling rely on aggregated eye-tracking measurements and models that
impose strong assumptions, ignoring much of the spatio-temporal dynamics that
occur during reading. In this paper, we propose a more general probabilistic
model of reading behavior, based on a marked spatio-temporal point process,
that captures not only how long fixations last, but also where they land in
space and when they take place in time. The saccades are modeled using a Hawkes
process, which captures how each fixation excites the probability of a new
fixation occurring near it in time and space. The duration time of fixation
events is modeled as a function of fixation-specific predictors convolved
across time, thus capturing spillover effects. Empirically, our Hawkes process
model exhibits a better fit to human saccades than baselines. With respect to
fixation durations, we observe that incorporating contextual surprisal as a
predictor results in only a marginal improvement in the model’s predictive
accuracy. This finding suggests that surprisal theory struggles to explain
fine-grained eye movements.
[COMMENTS]
ACL 2025
[LINK]
http://arxiv.org/abs/2506.19999v1
[DATE]
2025-06-25 04:39:21+08:00
[CATEGORIES]
cs.LG
cs.CL
WAFFLE: Finetuning Multi-Modal Model for Automated Front-End Development
[AUTHORS]
Shanchao Liang, Nan Jiang, Shangshu Qian, Lin Tan
[ABSTRACT]
Web development involves turning UI designs into functional webpages, which
can be difficult for both beginners and experienced developers due to the
complexity of HTML’s hierarchical structures and styles. While Large Language
Models (LLMs) have shown promise in generating source code, two major
challenges persist in UI-to-HTML code generation: (1) effectively representing
HTML’s hierarchical structure for LLMs, and (2) bridging the gap between the
visual nature of UI designs and the text-based format of HTML code. To tackle
these challenges, we introduce Waffle, a new fine-tuning strategy that uses a
structure-aware attention mechanism to improve LLMs’ understanding of HTML’s
structure and a contrastive fine-tuning approach to align LLMs’ understanding
of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp
(percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP,
and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing
benchmark Design2Code, outperforming current fine-tuning methods.
[LINK]
http://arxiv.org/abs/2410.18362v2
[DATE]
2025-06-25 04:35:02+08:00
[CATEGORIES]
cs.CL
Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation
[AUTHORS]
Xinyi Ni, Haonan Jian, Qiuyang Wang, Vedanshi Chetan Shah, Pengyu Hong
[LINK]
http://arxiv.org/abs/2506.19998v1
[DATE]
2025-06-25 04:30:44+08:00
[CATEGORIES]
cs.CL
Inference Scaled GraphRAG: Improving Multi Hop Question Answering on Knowledge Graphs
[AUTHORS]
Travis Thompson, Seung-Hwan Lim, Paul Liu, Ruoying He, Dongkuan Xu
[ABSTRACT]
Large Language Models (LLMs) have achieved impressive capabilities in
language understanding and generation, yet they continue to underperform on
knowledge-intensive reasoning tasks due to limited access to structured context
and multi-hop information. Retrieval-Augmented Generation (RAG) partially
mitigates this by grounding generation in retrieved context, but conventional
RAG and GraphRAG methods often fail to capture relational structure across
nodes in knowledge graphs. We introduce Inference-Scaled GraphRAG, a novel
framework that enhances LLM-based graph reasoning by applying inference-time
compute scaling. Our method combines sequential scaling with deep
chain-of-thought graph traversal, and parallel scaling with majority voting
over sampled trajectories within an interleaved reasoning-execution loop.
Experiments on the GRBench benchmark demonstrate that our approach
significantly improves multi-hop question answering performance, achieving
substantial gains over both traditional GraphRAG and prior graph traversal
baselines. These findings suggest that inference-time scaling is a practical
and architecture-agnostic solution for structured knowledge reasoning with LLMs.
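
The parallel-scaling step mentioned above, majority voting over sampled trajectories, can be sketched in a few lines. This is a generic illustration, not the authors' implementation: the answer normalization (trimming and lowercasing) and the tie-breaking rule are assumptions, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among sampled trajectories.

    Answers are compared after trimming whitespace and lowercasing (an
    assumed normalization). Empty answers are ignored. Ties go to the
    answer that was seen first. Returns None if nothing usable remains.
    """
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        return None
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(normalized).most_common(1)[0][0]
```

In practice each element of `answers` would be the final answer extracted from one independently sampled reasoning trajectory.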
[LINK]
http://arxiv.org/abs/2506.19967v1
[DATE]
2025-06-25 03:31:03+08:00
[CATEGORIES]
cs.CL
CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation
[AUTHORS]
Deepon Halder, Thanmay Jayakumar, Raj Dabre
[ABSTRACT]
Large language models (LLMs), despite their ability to perform few-shot
machine translation (MT), often lag behind dedicated MT systems trained on
parallel corpora, which are crucial for high-quality MT.
However, parallel corpora are often scarce or non-existent for low-resource
languages. In this paper, we propose CycleDistill, a bootstrapping approach
leveraging LLMs and few-shot translation to obtain high-quality MT systems.
CycleDistill involves iteratively generating synthetic parallel corpora from
monolingual corpora via zero- or few-shot MT, which are then used to fine-tune
the model that generated them. CycleDistill does not
need parallel corpora beyond 1 to 4 few-shot examples, and in our experiments
focusing on three Indian languages, by relying solely on monolingual corpora,
it can achieve high-quality machine translation, improving upon a few-shot
baseline model by over 20-30 chrF points on average in the first iteration. We
also study the effect of leveraging softmax activations during the distillation
process and observe mild improvements in translation quality.
[LINK]
http://arxiv.org/abs/2506.19952v1
[DATE]
2025-06-25 02:56:57+08:00
[CATEGORIES]
cs.CL
Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation
[AUTHORS]
Ruijie Xi, He Ba, Hao Yuan, Rishu Agrawal, Yuxin Tian, Ruoyan Kong, Arul Prakash
[ABSTRACT]
Embedding-Based Retrieval (EBR) is an important technique in modern search
engines, enabling semantic match between search queries and relevant results.
However, search logging data on platforms like Facebook Marketplace lacks the
diversity and details needed for effective EBR model training, limiting the
models’ ability to capture nuanced search patterns. To address this challenge,
we propose Aug2Search, an EBR-based framework leveraging synthetic data
generated by Generative AI (GenAI) models, in a multimodal and multitask
approach to optimize query-product relevance. This paper investigates the
capabilities of GenAI, particularly Large Language Models (LLMs), in generating
high-quality synthetic data, and analyzes its impact on enhancing EBR models.
We conducted experiments using eight Llama models and 100 million data points
from Facebook Marketplace logs. Our synthetic data generation follows three
strategies: (1) generate queries, (2) enhance product listings, and (3)
generate queries from enhanced listings. We train EBR models on three different
datasets: sampled engagement data or original data (e.g., “Click” and “Listing
Interactions”), synthetic data, and a mixture of both engagement and synthetic
data to assess their performance across various training sets. Our findings
underscore the robustness of Llama models in producing synthetic queries and
listings with high coherence, relevance, and diversity, while maintaining low
levels of hallucination. Aug2Search achieves an improvement of up to 4% in
ROC_AUC with 100 million synthetic data samples, demonstrating the
effectiveness of our approach. Moreover, our experiments reveal that with the
same volume of training data, models trained exclusively on synthetic data
often outperform those trained on original data only or a mixture of original
and synthetic data.
[LINK]
http://arxiv.org/abs/2505.16065v3
[DATE]
2025-06-25 02:46:45+08:00
[CATEGORIES]
cs.CL
GlyphPattern: An Abstract Pattern Recognition Benchmark for Vision-Language Models
[AUTHORS]
Zixuan Wu, Yoolim Kim, Carolyn Jane Anderson
[ABSTRACT]
Vision-Language Models (VLMs) building upon the foundation of powerful large
language models have made rapid progress in reasoning across visual and textual
data. While VLMs perform well on vision tasks that they are trained on, our
results highlight key challenges in abstract pattern recognition. We present
GlyphPattern, a 954-item dataset that pairs 318 human-written descriptions of
visual patterns from 40 writing systems with three visual presentation styles.
GlyphPattern evaluates abstract pattern recognition in VLMs, requiring models
to understand and judge natural language descriptions of visual patterns.
GlyphPattern patterns are drawn from a large-scale cognitive science
investigation of human writing systems; as a result, they are rich in spatial
reference and compositionality. Our experiments show that GlyphPattern is
challenging for state-of-the-art VLMs (GPT-4o achieves only 55% accuracy), with
marginal gains from few-shot prompting. Our detailed error analysis reveals
challenges at multiple levels, including visual processing, natural language
understanding, and pattern generalization.
[LINK]
http://arxiv.org/abs/2408.05894v2
[DATE]
2025-06-25 02:23:10+08:00
[CATEGORIES]
cs.CL
MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration
[AUTHORS]
Yucheng Zhou, Lingran Song, Jianbing Shen
[COMMENTS]
ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2506.19835v1
[DATE]
2025-06-25 01:52:43+08:00
[CATEGORIES]
cs.CL
Scaling Speculative Decoding with Lookahead Reasoning
[AUTHORS]
Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang
[LINK]
http://arxiv.org/abs/2506.19830v1
[DATE]
2025-06-25 01:48:10+08:00
[CATEGORIES]
cs.LG
cs.CL
Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study
[AUTHORS]
Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang
[ABSTRACT]
Large Language Models (LLMs) hold promise in automating data analysis tasks,
yet open-source models face significant limitations in these kinds of
reasoning-intensive scenarios. In this work, we investigate strategies to
enhance the data analysis capabilities of open-source LLMs. By curating a seed
dataset of diverse, realistic scenarios, we evaluate models across three
dimensions: data understanding, code generation, and strategic planning. Our
analysis reveals three key findings: (1) Strategic planning quality serves as
the primary determinant of model performance; (2) Interaction design and task
complexity significantly influence reasoning capabilities; (3) Data quality
demonstrates a greater impact than diversity in achieving optimal performance.
We leverage these insights to develop a data synthesis methodology,
demonstrating significant improvements in open-source LLMs’ analytical
reasoning capabilities.
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2506.19794v1
[DATE]
2025-06-25 01:04:23+08:00
[CATEGORIES]
cs.CL
cs.LG
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation
[AUTHORS]
Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai
[ABSTRACT]
We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation
model that synthesizes high-quality audio synchronized with video content. In
Kling-Foley, we introduce multimodal diffusion transformers to model the
interactions between video, audio, and text modalities, and combine them with a
visual semantic representation module and an audio-visual synchronization
module to enhance alignment capabilities. Specifically, these modules align
video conditions with latent audio elements at the frame level, thereby
improving semantic alignment and audio-visual synchronization. Together with
text conditions, this integrated approach enables precise generation of
video-matching sound effects. In addition, we propose a universal latent audio
codec that can achieve high-quality modeling in various scenarios such as sound
effects, speech, singing, and music. We employ a stereo rendering method that
imbues synthesized audio with a spatial presence. In addition, to compensate
for the incomplete types and annotations of the open-source benchmark,
we also open-source an industrial-level benchmark, Kling-Audio-Eval. Our
experiments show that Kling-Foley trained with the flow matching objective
achieves new audio-visual SOTA performance among public models in terms of
distribution matching, semantic alignment, temporal alignment and audio
quality.
[LINK]
http://arxiv.org/abs/2506.19774v1
[DATE]
2025-06-25 00:39:39+08:00
[CATEGORIES]
cs.CL
SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning
[AUTHORS]
Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao
[ABSTRACT]
Large language models (LLMs) have achieved remarkable progress in reasoning
tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and
Reinforcement Learning (RL) remains a fundamental challenge. Through
comprehensive analysis of token distributions, learning dynamics, and
integration mechanisms from entropy-based perspectives, we reveal key
differences between these paradigms: SFT induces coarse-grained global changes
to LLM policy distributions, while RL performs fine-grained selective
optimizations, with entropy serving as a critical indicator of training
effectiveness. Building on these observations, we propose Supervised
Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both
fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach
simultaneously applies SFT and RL to directly optimize the LLM using
demonstrations and self-exploration rollouts rather than through two-stage
sequential methods. Extensive experiments show that SRFT achieves 59.1% average
accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning
benchmarks and 10.9% on three out-of-distribution benchmarks.
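
Since the abstract treats entropy as the key training signal, a minimal sketch of the quantity itself may help: the Shannon entropy of a next-token distribution computed from raw logits. This only illustrates the measurement. The abstract does not specify how SRFT turns entropy into weights, so no weighting scheme is shown, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    # Skip zero-probability entries, whose contribution is defined as 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A uniform distribution over V tokens gives the maximum value log(V), while a distribution concentrated on one token gives a value close to 0.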
[LINK]
http://arxiv.org/abs/2506.19767v1
[DATE]
2025-06-25 00:31:37+08:00
[CATEGORIES]
cs.CL
cs.LG
Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation
[AUTHORS]
Dimosthenis Antypas, Indira Sen, Carla Perez-Almendros, Jose Camacho-Collados, Francesco Barbieri
[ABSTRACT]
The detection of sensitive content in large datasets is crucial for ensuring
that shared and analysed data is free from harmful material. However, current
moderation tools, such as external APIs, suffer from limitations in
customisation, accuracy across diverse sensitive categories, and privacy
concerns. Additionally, existing datasets and open-source models focus
predominantly on toxic language, leaving gaps in detecting other sensitive
categories such as substance abuse or self-harm. In this paper, we put forward
a unified dataset tailored for social media content moderation across six
sensitive categories: conflictual language, profanity, sexually explicit
material, drug-related content, self-harm, and spam. By collecting and
annotating data with consistent retrieval strategies and guidelines, we address
the shortcomings of previous focalised research. Our analysis demonstrates that
fine-tuning large language models (LLMs) on this novel dataset yields
significant improvements in detection performance compared to open
off-the-shelf models such as LLaMA, and even proprietary OpenAI models, which
underperform by 10-15% overall. This limitation is even more pronounced on
popular moderation APIs, which cannot be easily tailored to specific sensitive
content categories, among others.
[COMMENTS]
Accepted at the 9th Workshop on Online Abuse and Harms (WOAH)
[LINK]
http://arxiv.org/abs/2411.19832v3
[DATE]
2025-06-25 00:31:28+08:00
[CATEGORIES]
cs.CL
Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR
[AUTHORS]
Martin Ratajczak, Jean-Philippe Robichaud, Jennifer Drexler Fox
[ABSTRACT]
Long-form speech recognition is an application area of increasing research
focus. ASR models based on multi-head attention (MHA) are ill-suited to
long-form ASR because of their quadratic complexity in sequence length. We
build on recent work that has investigated linear complexity recurrent
attention (RA) layers for ASR. We find that bidirectional RA layers can match
the accuracy of MHA for both short- and long-form applications. We present a
strong limited-context attention (LCA) baseline, and show that RA layers are
just as accurate while being more efficient. We develop a long-form training
paradigm which further improves RA performance, leading to better accuracy than
LCA with 44% higher throughput. We also present Direction Dropout, a novel
regularization method that improves accuracy, provides fine-grained control of
the accuracy/throughput trade-off of bidirectional RA, and enables a new
alternating directions decoding mode with even higher throughput.
[COMMENTS]
Accepted to Interspeech 2025
[LINK]
http://arxiv.org/abs/2506.19761v1
[DATE]
2025-06-25 00:21:56+08:00
[CATEGORIES]
cs.CL
Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis
[AUTHORS]
Omar A. Essameldin, Ali O. Elbeih, Wael H. Gomaa, Wael F. Elsersy
[ABSTRACT]
The Arabic language is among the most popular languages in the world with a
huge variety of dialects spoken in 22 countries. In this study, we address the
problem of classifying 18 Arabic dialects of the QADI dataset of Arabic tweets.
RNN models, Transformer models, and large language models (LLMs) via prompt
engineering are created and tested. Among these, MARBERTv2 performed best with
65% accuracy and 64% F1-score. Through the use of state-of-the-art
preprocessing techniques and the latest NLP models, this paper identifies the
most significant linguistic issues in Arabic dialect identification. The
results support applications like personalized chatbots that respond in
users’ dialects, social media monitoring, and greater accessibility for Arabic
communities.
[LINK]
http://arxiv.org/abs/2506.19753v1
[DATE]
2025-06-25 00:06:58+08:00
[CATEGORIES]
cs.CL
NEAR$^2$: A Nested Embedding Approach to Efficient Product Retrieval and Ranking
[AUTHORS]
Shenbin Qian, Diptesh Kanojia, Samarth Agrawal, Hadeel Saadany, Swapnil Bhosale, Constantin Orasan, Zhe Wu
[ABSTRACT]
E-commerce information retrieval (IR) systems struggle to simultaneously
achieve high accuracy in interpreting complex user queries and maintain
efficient processing of vast product catalogs. The dual challenge lies in
precisely matching user intent with relevant products while managing the
computational demands of real-time search across massive inventories. In this
paper, we propose a Nested Embedding Approach to product Retrieval and Ranking,
called NEAR$^2$, which can achieve up to $12$ times efficiency in embedding
size at inference time while introducing no extra cost in training and
improving performance in accuracy for various encoder-based Transformer models.
We validate our approach using different loss functions for the retrieval and
ranking task, including multiple negative ranking loss and online contrastive
loss, on four different test sets with various IR challenges such as short and
implicit queries. Our approach achieves improved performance with a smaller
embedding dimension than any existing model.
[COMMENTS]
This paper is accepted to the 2025 SIGIR Workshop on eCommerce
[LINK]
http://arxiv.org/abs/2506.19743v1
[DATE]
2025-06-25 00:02:02+08:00
[CATEGORIES]
cs.CL
Reinforcement Learning Increases Wind Farm Power Production by Enabling Closed-Loop Collaborative Control
[AUTHORS]
Andrew Mole, Max Weissenbacher, Georgios Rigas, Sylvain Laizet
[ABSTRACT]
Traditional wind farm control operates each turbine independently to maximize
individual power output. However, coordinated wake steering across the entire
farm can substantially increase the combined wind farm energy production.
Although dynamic closed-loop control has proven effective in flow control
applications, wind farm optimization has relied primarily on static,
low-fidelity simulators that ignore critical turbulent flow dynamics. In this
work, we present the first reinforcement learning (RL) controller integrated
directly with high-fidelity large-eddy simulation (LES), enabling real-time
response to atmospheric turbulence through collaborative, dynamic control
strategies. Our RL controller achieves a 4.30% increase in wind farm power
output compared to baseline operation, nearly doubling the 2.19% gain from
static optimal yaw control obtained through Bayesian optimization. These
results establish dynamic flow-responsive control as a transformative approach
to wind farm optimization, with direct implications for accelerating renewable
energy deployment to net-zero targets.
[LINK]
http://arxiv.org/abs/2506.20554v1
[DATE]
2025-06-25 23:53:12+08:00
[CATEGORIES]
cs.LG
Pay Less Attention to Deceptive Artifacts: Robust Detection of Compressed Deepfakes on Online Social Networks
[AUTHORS]
Manyi Li, Renshuai Tao, Yufan Liu, Chuangchuang Tan, Haotong Qin, Bing Li, Yunchao Wei, Yao Zhao
[ABSTRACT]
With the rapid advancement of deep learning, particularly through generative
adversarial networks (GANs) and diffusion models (DMs), AI-generated images, or
"deepfakes", have become nearly indistinguishable from real ones. These images
are widely shared across Online Social Networks (OSNs), raising concerns about
their misuse. Existing deepfake detection methods overlook the "block effects"
introduced by compression in OSNs, which obscure deepfake artifacts, and
primarily focus on raw images, rarely encountered in real-world scenarios. To
address these challenges, we propose PLADA (Pay Less Attention to Deceptive
Artifacts), a novel framework designed to tackle the lack of paired data and
the ineffective use of compressed images. PLADA consists of two core modules:
Block Effect Eraser (B2E), which uses a dual-stage attention mechanism to
handle block effects, and Open Data Aggregation (ODA), which processes both
paired and unpaired data to improve detection. Extensive experiments across 26
datasets demonstrate that PLADA achieves a remarkable balance in deepfake
detection, outperforming SoTA methods in detecting deepfakes on OSNs, even with
limited paired data and compression. More importantly, this work introduces the
"block effect" as a critical factor in deepfake detection, providing a robust
solution for open-world scenarios. Our code is available at
https://github.com/ManyiLee/PLADA.
[COMMENTS]
20 pages, 10 figures
[LINK]
http://arxiv.org/abs/2506.20548v1
[DATE]
2025-06-25 23:46:41+08:00
[CATEGORIES]
cs.LG
Contextual Optimization under Covariate Shift: A Robust Approach by Intersecting Wasserstein Balls
[AUTHORS]
Tianyu Wang, Ningyuan Chen, Chun Wang
[ABSTRACT]
In contextual optimization, a decision-maker leverages contextual
information, often referred to as covariates, to better resolve uncertainty and
make informed decisions. In this paper, we examine the challenges of contextual
decision-making under covariate shift, a phenomenon where the distribution of
covariates differs between the training and test environments. Such shifts can
lead to inaccurate upstream estimations for test covariates that lie far from
the training data, ultimately resulting in suboptimal downstream decisions. To
tackle these challenges, we propose a novel approach called Intersection
Wasserstein-balls DRO (IW-DRO), which integrates multiple estimation methods
into the distributionally robust optimization (DRO) framework. At the core of
our approach is an innovative ambiguity set defined as the intersection of two
Wasserstein balls, with their centers constructed using appropriate
nonparametric and parametric estimators. On the computational side, we
reformulate the IW-DRO problem as a tractable convex program and develop an
approximate algorithm tailored for large-scale problems to enhance
computational efficiency. From a theoretical perspective, we demonstrate that
IW-DRO achieves superior performance compared to single Wasserstein-ball DRO
models. We further establish performance guarantees by analyzing the coverage
of the intersection ambiguity set and the measure concentration of both
estimators under the Wasserstein distance. Notably, we derive a finite-sample
concentration result for the Nadaraya-Watson kernel estimator under covariate
shift. The proposed IW-DRO framework offers practical value for decision-makers
operating in uncertain environments affected by covariate shifts.
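
For readers unfamiliar with the Nadaraya-Watson kernel estimator mentioned
above, the following is a minimal sketch of the standard estimator, not the
authors' implementation. It assumes scalar covariates and a Gaussian kernel,
neither of which is stated in the abstract; the function names and the
bandwidth parameter are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel (the normalizing constant cancels out)."""
    return math.exp(-0.5 * u * u)


def nadaraya_watson(x_query, xs, ys, bandwidth):
    """Kernel-weighted average of ys, weighted by closeness of xs to x_query."""
    weights = [gaussian_kernel((x_query - x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed: the query lies too far from the data.
        raise ValueError("query point is too far from all training covariates")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Training covariates placed symmetrically around the query point 0.0.
estimate = nadaraya_watson(0.0, xs=[-1.0, 1.0], ys=[0.0, 2.0], bandwidth=1.0)
```

The guard for a zero weight sum reflects the difficulty the abstract points
to: under covariate shift, test covariates far from the training data receive
almost no kernel weight, so the estimate there is unreliable.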
[LINK]
http://arxiv.org/abs/2406.02426v2
[DATE]
2025-06-25 23:43:13+08:00
[CATEGORIES]
cs.LG
Demonstration of effective UCB-based routing in skill-based queues on real-world data
[AUTHORS]
Sanne van Kempen, Jaron Sanders, Fiona Sloothaak, Maarten G. Wolf
[ABSTRACT]
This paper is about optimally controlling skill-based queueing systems such
as data centers, cloud computing networks, and service systems. By means of a
case study using a real-world data set, we investigate the practical
implementation of a recently developed reinforcement learning algorithm for
optimal customer routing. Our experiments show that the algorithm efficiently
learns and adapts to changing environments and outperforms static benchmark
policies, indicating its potential for live implementation. We also augment the
real-world applicability of this algorithm by introducing a new heuristic
routing rule to reduce delays. Moreover, we show that the algorithm can
optimize for multiple objectives: next to payoff maximization, secondary
objectives such as server load fairness and customer waiting time reduction can
be incorporated. Tuning parameters are used for balancing inherent performance
trade-offs. Lastly, we investigate the sensitivity to estimation errors and
parameter tuning, providing valuable insights for implementing adaptive routing
algorithms in complex real-world queueing systems.
[LINK]
http://arxiv.org/abs/2506.20543v1
[DATE]
2025-06-25 23:36:43+08:00
[CATEGORIES]
cs.LG
Adversarial Reasoning at Jailbreaking Time
[AUTHORS]
Mahdi Sabbaghi, Paul Kassianik, George Pappas, Yaron Singer, Amin Karbasi, Hamed Hassani
[ABSTRACT]
As large language models (LLMs) are becoming more capable and widespread, the
study of their failure cases is becoming increasingly important. Recent
advances in standardizing, measuring, and scaling test-time compute suggest new
methodologies for optimizing models to achieve high performance on hard tasks.
In this paper, we apply these advances to the task of model jailbreaking:
eliciting harmful responses from aligned LLMs. We develop an adversarial
reasoning approach to automatic jailbreaking that leverages a loss signal to
guide the test-time compute, achieving SOTA attack success rates against many
aligned LLMs, even those that aim to trade inference-time compute for
adversarial robustness. Our approach introduces a new paradigm in understanding
LLM vulnerabilities, laying the foundation for the development of more robust
and trustworthy AI systems.
[COMMENTS]
Accepted to the 42nd International Conference on Machine Learning
(ICML 2025)
[LINK]
http://arxiv.org/abs/2502.01633v2
[DATE]
2025-06-25 23:31:17+08:00
[CATEGORIES]
cs.LG
Physics-Informed Machine Learning Regulated by Finite Element Analysis for Simulation Acceleration of Laser Powder Bed Fusion
[AUTHORS]
R. Sharma, M. Raissi, Y. B. Guo
[ABSTRACT]
Efficient simulation of Laser Powder Bed Fusion (LPBF) is crucial for process
prediction due to the lasting issue of high computation cost using traditional
numerical methods such as finite element analysis (FEA). This study presents an
efficient modeling framework termed FEA-Regulated Physics-Informed Neural
Network (FEA-PINN) to accelerate the thermal field prediction in an LPBF process
while maintaining the FEA accuracy. A novel dynamic material updating strategy
is developed to capture the dynamic phase change of powder-liquid-solid in the
PINN model. The PINN model incorporates temperature-dependent material
properties and phase change behavior using the apparent heat capacity method.
While the PINN model demonstrates high accuracy with a small amount of training
data and enables generalization to new process parameters via transfer learning, it
faces the challenge of high computation cost in time-dependent problems due to
the residual accumulation. To overcome this issue, the FEA-PINN framework
integrates corrective FEA simulations during inference to enforce physical
consistency and reduce error drift. A comparative analysis shows that FEA-PINN
achieves equivalent accuracy to FEA while significantly reducing computational
cost. The framework has been validated using the benchmark FEA data and
demonstrated through single-track scanning in LPBF.
[LINK]
http://arxiv.org/abs/2506.20537v1
[DATE]
2025-06-25 23:25:01+08:00
[CATEGORIES]
cs.LG
WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads
[AUTHORS]
Hongzhen Huang, Kunming Zhang, Hanlong Liao, Kui Wu, Guoming Tang
[ABSTRACT]
The rapid advancement of AI, particularly large language models (LLMs), has
raised significant concerns about the energy use and carbon emissions
associated with model training and inference. However, existing tools for
measuring and reporting such impacts are often fragmented, lacking systematic
metric integration and offering limited support for correlation analysis among
them. This paper presents WattsOnAI, a comprehensive software toolkit for the
measurement, analysis, and visualization of energy use, power draw, hardware
performance, and carbon emissions across AI workloads. By seamlessly
integrating with existing AI frameworks, WattsOnAI offers standardized reports
and exports fine-grained time-series data to support benchmarking and
reproducibility in a lightweight manner. It further enables in-depth
correlation analysis between hardware metrics and model performance and thus
facilitates bottleneck identification and performance enhancement. By
addressing critical limitations in existing tools, WattsOnAI encourages the
research community to weigh environmental impact alongside raw performance of
AI workloads and advances the shift toward more sustainable “Green AI”
practices. The code is available at https://github.com/SusCom-Lab/WattsOnAI.
[COMMENTS]
11 pages, 7 figures and 5 tables
[LINK]
http://arxiv.org/abs/2506.20535v1
[DATE]
2025-06-25 23:24:45+08:00
[CATEGORIES]
cs.LG
Global Convergence of Iteratively Reweighted Least Squares for Robust Subspace Recovery
[AUTHORS]
Gilad Lerman, Kang Li, Tyler Maunu, Teng Zhang
[ABSTRACT]
Robust subspace estimation is fundamental to many machine learning and data
analysis tasks. Iteratively Reweighted Least Squares (IRLS) is an elegant and
empirically effective approach to this problem, yet its theoretical properties
remain poorly understood. This paper establishes that, under deterministic
conditions, a variant of IRLS with dynamic smoothing regularization converges
linearly to the underlying subspace from any initialization. We extend these
guarantees to affine subspace estimation, a setting that lacks prior recovery
theory. Additionally, we illustrate the practical benefits of IRLS through an
application to low-dimensional neural network training. Our results provide the
first global convergence guarantees for IRLS in robust subspace recovery and,
more broadly, for nonconvex IRLS on a Riemannian manifold.
[LINK]
http://arxiv.org/abs/2506.20533v1
[DATE]
2025-06-25 23:23:32+08:00
[CATEGORIES]
cs.LG
Variational Learning Finds Flatter Solutions at the Edge of Stability
[AUTHORS]
Avrajit Ghosh, Bai Cong, Rio Yokota, Saiprasad Ravishankar, Rongrong Wang, Molei Tao, Mohammad Emtiyaz Khan, Thomas Möllenhoff
[ABSTRACT]
Variational Learning (VL) has recently gained popularity for training deep
neural networks and is competitive with standard learning methods. Part of its
empirical success can be explained by theories such as PAC-Bayes bounds,
minimum description length and marginal likelihood, but there are few tools to
unravel the implicit regularization at play. Here, we analyze the implicit
regularization of VL through the Edge of Stability (EoS) framework. EoS has
previously been used to show that gradient descent can find flat solutions and
we extend this result to VL to show that it can find even flatter solutions.
This is obtained by controlling the posterior covariance and the number of
Monte Carlo samples from the posterior. These results are derived in a similar
fashion as the standard EoS literature for deep learning, by first deriving a
result for a quadratic problem and then extending it to deep neural networks.
We empirically validate these findings on a wide variety of large networks,
such as ResNet and ViT, to find that the theoretical results closely match the
empirical ones. Ours is the first work to analyze the EoS dynamics in VL.
[LINK]
http://arxiv.org/abs/2506.12903v2
[DATE]
2025-06-25 23:17:32+08:00
[CATEGORIES]
cs.LG
Proximal Control of UAVs with Federated Learning for Human-Robot Collaborative Domains
[AUTHORS]
Lucas Nogueira Nobrega, Ewerton de Oliveira, Martin Saska, Tiago Nascimento
[COMMENTS]
version 2
[LINK]
http://arxiv.org/abs/2412.02863v2
[DATE]
2025-06-25 23:15:12+08:00
[CATEGORIES]
cs.LG
Industrial Energy Disaggregation with Digital Twin-generated Dataset and Efficient Data Augmentation
[AUTHORS]
Christian Internò, Andrea Castellani, Sebastian Schmitt, Fabio Stella, Barbara Hammer
[ABSTRACT]
Industrial Non-Intrusive Load Monitoring (NILM) is limited by the scarcity of
high-quality datasets and the complex variability of industrial energy
consumption patterns. To address data scarcity and privacy issues, we introduce
the Synthetic Industrial Dataset for Energy Disaggregation (SIDED), an
open-source dataset generated using Digital Twin simulations. SIDED includes
three types of industrial facilities across three different geographic
locations, capturing diverse appliance behaviors, weather conditions, and load
profiles. We also propose the Appliance-Modulated Data Augmentation (AMDA)
method, a computationally efficient technique that enhances NILM model
generalization by intelligently scaling appliance power contributions based on
their relative impact. We show in experiments that NILM models trained with
AMDA-augmented data significantly improve the disaggregation of energy
consumption of complex industrial appliances like combined heat and power
systems. Specifically, in our out-of-sample scenarios, models trained with AMDA
achieved a Normalized Disaggregation Error of 0.093, outperforming models
trained without data augmentation (0.451) and those trained with random data
augmentation (0.290). Data distribution analyses confirm that AMDA effectively
aligns training and test data distributions, enhancing model generalization.
[LINK]
http://arxiv.org/abs/2506.20525v1
[DATE]
2025-06-25 23:10:43+08:00
[CATEGORIES]
cs.LG
On Advancements of the Forward-Forward Algorithm
[AUTHORS]
Mauricio Ortiz Torres, Markus Lange, Arne P. Raulf
[ABSTRACT]
The Forward-Forward algorithm has evolved in machine learning research,
tackling more complex tasks that mimic real-life applications. In recent
years, it has been improved by several techniques to perform better than its
original version, handling a challenging dataset like CIFAR10 without losing
its flexibility and low memory usage. We have shown in our results that
improvements are achieved through a combination of convolutional channel
grouping, learning rate schedules, and independent block structures during
training that leads to a 20% decrease in test error percentage. Additionally,
to approach further implementations on low-capacity hardware projects, we have
presented a series of lighter models that achieve low test error percentages
within (21$\pm$3)% and have between 164,706 and 754,386 trainable
parameters. This serves as a basis for our future study on complete verification
and validation of these kinds of neural networks.
[COMMENTS]
This work has been submitted to the IEEE for possible publication
[LINK]
http://arxiv.org/abs/2504.21662v2
[DATE]
2025-06-25 23:08:49+08:00
[CATEGORIES]
cs.LG
Fast ground penetrating radar dual-parameter full waveform inversion method accelerated by hybrid compilation of CUDA kernel function and PyTorch
[AUTHORS]
Lei Liu, Chao Song, Liangsheng He, Silin Wang, Xuan Feng, Cai Liu
[ABSTRACT]
This study proposes a high-performance dual-parameter full waveform inversion
(FWI) framework for ground-penetrating radar (GPR), accelerated through the
hybrid compilation of CUDA kernel functions and PyTorch. The method leverages
the computational efficiency of GPU programming while preserving the
flexibility and usability of Python-based deep learning frameworks. By
integrating customized CUDA kernels into PyTorch’s automatic differentiation
mechanism, the framework enables accurate and efficient inversion of both
dielectric permittivity and electrical conductivity. Experimental evaluations
on synthetic data and real wavefield data demonstrate that the proposed method
achieves dual-parameter FWI for GPR data while maintaining high accuracy.
Moreover, the framework is flexible and extensible, supporting optional
regularization strategies such as total variation and multi-scale inversion.
These features make the proposed approach a practical and scalable framework
for rapid GPR-based subsurface imaging in applications including civil
engineering, environmental monitoring, and geophysical exploration.
[LINK]
http://arxiv.org/abs/2506.20513v1
[DATE]
2025-06-25 23:00:33+08:00
[CATEGORIES]
cs.LG
Collaborative Batch Size Optimization for Federated Learning
[AUTHORS]
Arno Geimer, Karthick Panner Selvam, Beltran Fiz Pontiveros
[ABSTRACT]
Federated Learning (FL) is a decentralized collaborative Machine Learning
framework for training models without collecting data in a centralized
location. It has seen application across various disciplines, from helping
medical diagnoses in hospitals to detecting fraud in financial transactions. In
this paper, we focus on improving the local training process through hardware
usage optimization. While participants in a federation might share the hardware
they are training on, since there is no information exchange between them,
their training process can be hindered by an improper training configuration.
Taking advantage of the parallel processing inherent to Federated Learning, we
use a greedy randomized search to optimize local batch sizes for the best
training settings across all participants. Our results show that against
default parameter settings, our method improves convergence speed while staying
nearly on par with the case where local parameters are optimized.
[LINK]
http://arxiv.org/abs/2506.20511v1
[DATE]
2025-06-25 22:57:23+08:00
[CATEGORIES]
cs.LG
Unidentified and Confounded? Understanding Two-Tower Models for Unbiased Learning to Rank
[AUTHORS]
Philipp Hager, Onno Zoeter, Maarten de Rijke
[ABSTRACT]
Additive two-tower models are popular learning-to-rank methods for handling
biased user feedback in industry settings. Recent studies, however, report a
concerning phenomenon: training two-tower models on clicks collected by
well-performing production systems leads to decreased ranking performance. This
paper investigates two recent explanations for this observation: confounding
effects from logging policies and model identifiability issues. We
theoretically analyze the identifiability conditions of two-tower models,
showing that either document swaps across positions or overlapping feature
distributions are required to recover model parameters from clicks. We also
investigate the effect of logging policies on two-tower models, finding that
they introduce no bias when models perfectly capture user behavior. However,
logging policies can amplify biases when models imperfectly capture user
behavior, particularly when prediction errors correlate with document placement
across positions. We propose a sample weighting technique to mitigate these
effects and provide actionable insights for researchers and practitioners using
two-tower models.
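
The identifiability issue discussed above can be illustrated with a minimal
sketch. It assumes the common additive form in which the click probability is
a sigmoid of a relevance logit plus a position-bias logit; the abstract does
not give the exact parameterization, and the documents, positions, and values
below are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def click_probability(relevance_logit, position_bias_logit):
    """Additive two-tower click model: the two tower outputs are summed."""
    return sigmoid(relevance_logit + position_bias_logit)


# Hypothetical tower outputs for two documents and two ranks.
relevance = {"doc_a": 1.2, "doc_b": -0.3}
position_bias = {1: 0.0, 2: -0.8}

# Shifting one tower up and the other down by the same constant leaves every
# click probability unchanged, so clicks alone cannot pin down either tower.
shift = 0.5
shifted_relevance = {doc: r + shift for doc, r in relevance.items()}
shifted_bias = {rank: b - shift for rank, b in position_bias.items()}

max_gap = max(
    abs(
        click_probability(relevance[doc], position_bias[rank])
        - click_probability(shifted_relevance[doc], shifted_bias[rank])
    )
    for doc in relevance
    for rank in position_bias
)
```

This constant shift is only the simplest ambiguity and is harmless for
ranking; the conditions analyzed in the paper (document swaps across
positions or overlapping feature distributions) concern when the parameters
can be recovered from clicks at all.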
[LINK]
http://arxiv.org/abs/2506.20501v1
[DATE]
2025-06-25 22:47:43+08:00
[CATEGORIES]
cs.LG
Training Plug-n-Play Knowledge Modules with Deep Context Distillation
[AUTHORS]
Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni
[ABSTRACT]
Dynamically integrating new or rapidly evolving information after (Large)
Language Model pre-training remains challenging, particularly in low-data
scenarios or when dealing with private and specialized documents. In-context
learning and retrieval-augmented generation (RAG) face limitations, including
their high inference costs and their inability to capture global document
information. In this paper, we propose a way of modularizing knowledge by
training document-level Knowledge Modules (KMs). KMs are lightweight components
implemented as parameter-efficient LoRA modules, which are trained to store
information about new documents and can be easily plugged into models on
demand. We show that next-token prediction performs poorly as the training
objective for KMs. We instead propose Deep Context Distillation: we learn KM
parameters so as to simulate the hidden states and logits of a teacher that takes
the document in context. Our method outperforms standard next-token prediction
and pre-instruction training techniques, across two datasets. Finally, we
highlight synergies between KMs and RAG.
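
As a rough illustration of a distillation objective that matches both the
logits and the hidden states of a teacher, the sketch below combines a KL
term on the output distributions with a mean squared error on hidden states.
The choice of these two terms, their weighting, and all names are assumptions
made for illustration; the abstract does not specify the loss used by the
authors.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits, student_logits,
                      teacher_hidden, student_hidden, hidden_weight=1.0):
    """Hypothetical loss: match teacher logits (KL) and hidden states (MSE)."""
    logit_term = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    hidden_term = sum(
        (t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)
    ) / len(teacher_hidden)
    return logit_term + hidden_weight * hidden_term


# Teacher sees the document in context; the student relies on its module.
loss_same = distillation_loss([2.0, 0.5], [2.0, 0.5], [0.1, 0.2], [0.1, 0.2])
loss_diff = distillation_loss([2.0, 0.5], [0.5, 2.0], [0.1, 0.2], [0.3, 0.0])
```

The loss is zero when the student reproduces the teacher exactly and grows as
either the output distribution or the hidden states diverge.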
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2503.08727v3
[DATE]
2025-06-25 22:45:56+08:00
[CATEGORIES]
cs.LG
Multimodal Representation Learning and Fusion
[AUTHORS]
Qihang Jin, Enze Ge, Yuhang Xie, Hongying Luo, Junhao Song, Ziqian Bi, Chia Xin Liang, Jibin Guan, Joe Yeong, Junfeng Hao
[ABSTRACT]
Multi-modal learning is a fast-growing area in artificial intelligence. It
tries to help machines understand complex things by combining information from
different sources, like images, text, and audio. By using the strengths of each
modality, multi-modal learning allows AI systems to build stronger and richer
internal representations. These help machines interpret, reason, and make
decisions in real-life situations. This field includes core techniques such as
representation learning (to get shared features from different data types),
alignment methods (to match information across modalities), and fusion
strategies (to combine them using deep learning models). Although there has
been good progress, some major problems remain, such as dealing with different
data formats, handling missing or incomplete inputs, and defending against
adversarial attacks. Researchers are now exploring new methods, such as
unsupervised or semi-supervised learning and AutoML tools, to make models more
efficient and easier to scale. More attention is also being paid to designing
better evaluation metrics and building shared benchmarks, which make it easier
to compare model performance across tasks and domains. As the field continues
to grow, multi-modal learning is expected to improve many areas, including
computer vision, natural language processing, speech recognition, and
healthcare. In the future, it may help to build AI systems that understand the
world in a more human-like way: flexible, context-aware, and able to deal with
real-world complexity.
[LINK]
http://arxiv.org/abs/2506.20494v1
[DATE]
2025-06-25 22:40:09+08:00
[CATEGORIES]
cs.LG
Non-equilibrium Annealed Adjoint Sampler
[AUTHORS]
Jaemoo Choi, Yongxin Chen, Molei Tao, Guan-Horng Liu
[ABSTRACT]
Recently, there has been significant progress in learning-based diffusion
samplers, which aim to sample from a given unnormalized density. These methods
typically follow one of two paradigms: (i) formulating sampling as an unbiased
stochastic optimal control (SOC) problem using a canonical reference process,
or (ii) refining annealed path measures through importance-weighted sampling.
Although annealing approaches have advantages in guiding samples toward
high-density regions, reliance on importance sampling leads to high variance
and limited scalability in practice. In this paper, we introduce the
Non-equilibrium Annealed Adjoint Sampler (NAAS), a novel SOC-based
diffusion sampler that leverages annealed reference dynamics without resorting
to importance sampling. NAAS employs a lean adjoint system inspired by adjoint
matching, enabling efficient and scalable training. We demonstrate the
effectiveness of our approach across a range of tasks, including sampling from
classical energy landscapes and molecular Boltzmann distributions.
[COMMENTS]
21 pages, 7 figures
[LINK]
http://arxiv.org/abs/2506.18165v2
[DATE]
2025-06-25 22:39:40+08:00
[CATEGORIES]
cs.LG
Offline Goal-Conditioned Reinforcement Learning with Projective Quasimetric Planning
[AUTHORS]
Anthony Kobanda, Waris Radji, Mathieu Petitbois, Odalric-Ambrym Maillard, Rémy Portelas
[ABSTRACT]
Offline Goal-Conditioned Reinforcement Learning seeks to train agents to
reach specified goals from previously collected trajectories. Scaling that
promise to long-horizon tasks remains challenging, notably due to compounding
value-estimation errors. Principled geometry offers a potential solution to
address these issues. Following this insight, we introduce Projective
Quasimetric Planning (ProQ), a compositional framework that learns an
asymmetric distance and then repurposes it, firstly as a repulsive energy
forcing a sparse set of keypoints to uniformly spread over the learned latent
space, and secondly as a structured directional cost guiding towards proximal
sub-goals. In particular, ProQ couples this geometry with a Lagrangian
out-of-distribution detector to ensure the learned keypoints stay within
reachable areas. By unifying metric learning, keypoint coverage, and
goal-conditioned control, our approach produces meaningful sub-goals and
robustly drives long-horizon goal-reaching on diverse navigation benchmarks.
[LINK]
http://arxiv.org/abs/2506.18847v2
[DATE]
2025-06-25 22:37:00+08:00
[CATEGORIES]
cs.LG
MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing
[AUTHORS]
Asif Rahman, Veljko Cvetkovic, Kathleen Reece, Aidan Walters, Yasir Hassan, Aneesh Tummeti, Bryan Torres, Denise Cooney, Margaret Ellis, Dimitrios S. Nikolopoulos
[ABSTRACT]
Large language models (LLMs) have transformed software development through
code generation capabilities, yet their effectiveness for high-performance
computing (HPC) remains limited. HPC code requires specialized optimizations
for parallelism, memory efficiency, and architecture-specific considerations
that general-purpose LLMs often overlook. We present MARCO (Multi-Agent
Reactive Code Optimizer), a novel framework that enhances LLM-generated code
for HPC through a specialized multi-agent architecture. MARCO employs separate
agents for code generation and performance evaluation, connected by a feedback
loop that progressively refines optimizations. A key innovation is MARCO’s
web-search component that retrieves real-time optimization techniques from
recent conference proceedings and research publications, bridging the knowledge
gap in pre-trained LLMs. Our extensive evaluation on the LeetCode 75 problem
set demonstrates that MARCO achieves a 14.6% average runtime reduction
compared to Claude 3.5 Sonnet alone, while the integration of the web-search
component yields a 30.9% performance improvement over the base MARCO system.
These results highlight the potential of multi-agent systems to address the
specialized requirements of high-performance code generation, offering a
cost-effective alternative to domain-specific model fine-tuning.
[COMMENTS]
9 pages, 4 figures, 2 tables
[LINK]
http://arxiv.org/abs/2505.03906v3
[DATE]
2025-06-25 22:22:04+08:00
[CATEGORIES]
cs.LG
Physics-informed Imitative Reinforcement Learning for Real-world Driving
[AUTHORS]
Hang Zhou, Yihao Qin, Dan Xu, Yiding Ji
[ABSTRACT]
Recent advances in imitative reinforcement learning (IRL) have considerably
enhanced the ability of autonomous agents to assimilate expert demonstrations,
leading to rapid skill acquisition in a range of demanding tasks. However, such
learning-based agents face significant challenges when transferring knowledge
to highly dynamic closed-loop environments. Their performance is significantly
impacted by the conflicting optimization objectives of imitation learning (IL)
and reinforcement learning (RL), sample inefficiency, and the complexity of
uncovering the hidden world model and physics. To address this challenge, we
propose a physics-informed IRL that is entirely data-driven. It leverages both
expert demonstration data and exploratory data with a joint optimization
objective, allowing the underlying physical principles of vehicle dynamics to
emerge naturally from the training process. The performance is evaluated
through empirical experiments, and the results exceed popular IL, RL and IRL
algorithms in closed-loop settings on the Waymax benchmark. Our approach exhibits
a 37.8% reduction in collision rate and a 22.2% reduction in off-road rate
compared to the baseline method.
[LINK]
http://arxiv.org/abs/2407.02508v3
[DATE]
2025-06-25 22:06:21+08:00
[CATEGORIES]
cs.LG
HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
[AUTHORS]
Tobias Vontobel, Seyedmorteza Sadat, Farnood Salehi, Romann M. Weber
[ABSTRACT]
Diffusion models have emerged as the leading approach for image synthesis,
demonstrating exceptional photorealism and diversity. However, training
diffusion models at high resolutions remains computationally prohibitive, and
existing zero-shot generation techniques for synthesizing images beyond
training resolutions often produce artifacts, including object duplication and
spatial incoherence. In this paper, we introduce HiWave, a training-free,
zero-shot approach that substantially enhances visual fidelity and structural
coherence in ultra-high-resolution image synthesis using pretrained diffusion
models. Our method employs a two-stage pipeline: generating a base image from
the pretrained model followed by a patch-wise DDIM inversion step and a novel
wavelet-based detail enhancer module. Specifically, we first utilize inversion
methods to derive initial noise vectors that preserve global coherence from the
base image. Subsequently, during sampling, our wavelet-domain detail enhancer
retains low-frequency components from the base image to ensure structural
consistency, while selectively guiding high-frequency components to enrich fine
details and textures. Extensive evaluations using Stable Diffusion XL
demonstrate that HiWave effectively mitigates common visual artifacts seen in
prior methods, achieving superior perceptual quality. A user study confirmed
HiWave’s performance, where it was preferred over the state-of-the-art
alternative in more than 80% of comparisons, highlighting its effectiveness for
high-quality, ultra-high-resolution image synthesis without requiring
retraining or architectural modifications.
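The band-splitting idea described above (keep the low-frequency content of a base image, take high-frequency detail from elsewhere) can be illustrated on a 1D signal. This is a minimal sketch using a one-level Haar transform, not the HiWave detail enhancer itself: the function names are hypothetical, and the "guidance" of high frequencies is simplified to copying them from a second signal.

```python
from typing import List, Tuple


def haar_forward(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One-level Haar split: pairwise averages (low band), half-differences (high band)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx: List[float], detail: List[float]) -> List[float]:
    """Invert haar_forward: each (average, half-difference) pair yields two samples."""
    signal: List[float] = []
    for a, d in zip(approx, detail):
        signal.extend([a + d, a - d])
    return signal


def merge_bands(base: List[float], detail_source: List[float]) -> List[float]:
    """Keep the low band of `base`; take the high band from `detail_source`."""
    base_approx, _ = haar_forward(base)
    _, new_detail = haar_forward(detail_source)
    if len(base_approx) != len(new_detail):
        raise ValueError("signals must have the same length")
    return haar_inverse(base_approx, new_detail)


base = [2.0, 2.0, 6.0, 6.0]           # smooth signal: carries the structure
detail_source = [5.0, 1.0, 0.0, 4.0]  # supplies the fine detail
merged = merge_bands(base, detail_source)
```

The merged signal has exactly the same low band as `base`, which is the sense in which structure is preserved in this toy setting.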
[LINK]
http://arxiv.org/abs/2506.20452v1
[DATE]
2025-06-25 21:58:37+08:00
[CATEGORIES]
cs.LG
Méthode de quadrature pour les PINNs fondée théoriquement sur la hessienne des résiduels
[AUTHORS]
Antoine Caradot, Rémi Emonet, Amaury Habrard, Abdel-Rahim Mezidi, Marc Sebban
[ABSTRACT]
Physics-informed Neural Networks (PINNs) have emerged as an efficient way to
learn surrogate neural solvers of PDEs by embedding the physical model in the
loss function and minimizing its residuals using automatic differentiation at
so-called collocation points. Originally uniformly sampled, the choice of the
latter has been the subject of recent advances leading to adaptive sampling
refinements. In this paper, we propose a new quadrature method for
approximating definite integrals based on the Hessian of the considered
function, which we leverage to guide the selection of the collocation points
during the training process of PINNs.
[COMMENTS]
10 pages. In French. Comments are welcome
[LINK]
http://arxiv.org/abs/2506.20441v1
[DATE]
2025-06-25 21:49:53+08:00
[CATEGORIES]
cs.LG
Scalable Subset Selection in Linear Mixed Models
[AUTHORS]
Ryan Thompson, Matt P. Wand, Joanna J. J. Wang
[ABSTRACT]
Linear mixed models (LMMs), which incorporate fixed and random effects, are
key tools for analyzing heterogeneous data, such as in personalized medicine or
adaptive marketing. Nowadays, this type of data is increasingly wide, sometimes
containing thousands of candidate predictors, necessitating sparsity for
prediction and interpretation. However, existing sparse learning methods for
LMMs do not scale well beyond tens or hundreds of predictors, leaving a large
gap compared with sparse methods for linear models, which ignore random
effects. This paper closes the gap with a new $\ell_0$ regularized method for
LMM subset selection that can run on datasets containing thousands of
predictors in seconds to minutes. On the computational front, we develop a
coordinate descent algorithm as our main workhorse and provide a guarantee of
its convergence. We also develop a local search algorithm to help traverse the
nonconvex optimization surface. Both algorithms readily extend to subset
selection in generalized LMMs via a penalized quasi-likelihood approximation.
On the statistical front, we provide a finite-sample bound on the
Kullback-Leibler divergence of the new method. We then demonstrate its
excellent performance in synthetic experiments and illustrate its utility on
two datasets from biology and journalism.
[LINK]
http://arxiv.org/abs/2506.20425v1
[DATE]
2025-06-25 21:39:30+08:00
[CATEGORIES]
cs.LG
Off-Policy Evaluation and Learning for the Future under Non-Stationarity
[AUTHORS]
Tatsuhiro Shimizu, Kazuki Kawamura, Takanori Muroi, Yusuke Narita, Kei Tateno, Takuma Udagawa, Yuta Saito
[ABSTRACT]
We study the novel problem of future off-policy evaluation (F-OPE) and
learning (F-OPL) for estimating and optimizing the future value of policies in
non-stationary environments, where distributions vary over time. In e-commerce
recommendations, for instance, our goal is often to estimate and optimize the
policy value for the upcoming month using data collected by an old policy in
the previous month. A critical challenge is that data related to the future
environment is not observed in the historical data. Existing methods assume
stationarity or depend on restrictive reward-modeling assumptions, leading to
significant bias. To address these limitations, we propose a novel estimator
named Off-Policy Estimator for the Future Value (OPFV), designed for
accurately estimating
policy values at any future time point. The key feature of OPFV is its ability
to leverage the useful structure within time-series data. While future data
might not be present in the historical log, we can leverage, for example,
seasonal, weekly, or holiday effects that are consistent in both the historical
and future data. Our estimator is the first to exploit these time-related
structures via a new type of importance weighting, enabling effective F-OPE.
Theoretical analysis identifies the conditions under which OPFV becomes
low-bias. In addition, we extend our estimator to develop a new policy-gradient
method to proactively learn a good future policy using only historical data.
Empirical results show that our methods substantially outperform existing
methods in estimating and optimizing the future policy value under
non-stationarity for various experimental setups.
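For background, the importance weighting that off-policy estimators build on can be sketched as follows. This is the standard inverse propensity scoring (IPS) estimator, which assumes stationarity; it is not OPFV, whose time-structured weighting is not specified in the abstract. The function name and the toy log are hypothetical.

```python
from typing import Sequence


def ips_value(
    rewards: Sequence[float],
    logging_probs: Sequence[float],
    target_probs: Sequence[float],
) -> float:
    """Standard IPS estimate of a target policy's value from logged data.

    logging_probs[i]: probability the logging policy gave to the logged action.
    target_probs[i]:  probability the target policy gives to that same action.
    """
    if not (len(rewards) == len(logging_probs) == len(target_probs)):
        raise ValueError("all inputs must have the same length")
    if len(rewards) == 0:
        raise ValueError("at least one logged interaction is required")
    total = 0.0
    for reward, p_log, p_target in zip(rewards, logging_probs, target_probs):
        if p_log <= 0.0:
            raise ValueError("logging probabilities must be positive")
        total += (p_target / p_log) * reward  # reweight each logged reward
    return total / len(rewards)


# Toy log: four interactions collected by a uniform two-action logging policy.
estimate = ips_value(
    rewards=[1.0, 0.0, 1.0, 1.0],
    logging_probs=[0.5, 0.5, 0.5, 0.5],
    target_probs=[1.0, 0.0, 0.5, 0.25],
)
```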
[LINK]
http://arxiv.org/abs/2506.20417v1
[DATE]
2025-06-25 21:31:46+08:00
[CATEGORIES]
cs.LG
Client Clustering Meets Knowledge Sharing: Enhancing Privacy and Robustness in Personalized Peer-to-Peer Learning
[AUTHORS]
Mohammad Mahdi Maheri, Denys Herasymuk, Hamed Haddadi
[ABSTRACT]
The growing adoption of Artificial Intelligence (AI) in Internet of Things
(IoT) ecosystems has intensified the need for personalized learning methods
that can operate efficiently and privately across heterogeneous,
resource-constrained devices. However, enabling effective personalized learning
in decentralized settings introduces several challenges, including efficient
knowledge transfer between clients, protection of data privacy, and resilience
against poisoning attacks. In this paper, we address these challenges by
developing P4 (Personalized, Private, Peer-to-Peer) – a method designed to
deliver personalized models for resource-constrained IoT devices while ensuring
differential privacy and robustness against poisoning attacks. Our solution
employs a lightweight, fully decentralized algorithm to privately detect client
similarity and form collaborative groups. Within each group, clients leverage
differentially private knowledge distillation to co-train their models,
maintaining high accuracy while ensuring robustness to the presence of
malicious clients. We evaluate P4 on popular benchmark datasets using both
linear and CNN-based architectures across various heterogeneity settings and
attack scenarios. Experimental results show that P4 achieves 5% to 30% higher
accuracy than leading differentially private peer-to-peer approaches and
maintains robustness with up to 30% malicious clients. Additionally, we
demonstrate its practicality by deploying it on resource-constrained devices,
where collaborative training between two clients adds only ~7 seconds of
overhead.
[LINK]
http://arxiv.org/abs/2506.20413v1
[DATE]
2025-06-25 21:27:36+08:00
[CATEGORIES]
cs.LG
POLAR: A Pessimistic Model-based Policy Learning Algorithm for Dynamic Treatment Regimes
[AUTHORS]
Ruijia Zhang, Zhengling Qi, Yue Wu, Xiangyu Zhang, Yanxun Xu
[ABSTRACT]
Dynamic treatment regimes (DTRs) provide a principled framework for
optimizing sequential decision-making in domains where decisions must adapt
over time in response to individual trajectories, such as healthcare,
education, and digital interventions. However, existing statistical methods
often rely on strong positivity assumptions and lack robustness under partial
data coverage, while offline reinforcement learning approaches typically focus
on average training performance, lack statistical guarantees, and require
solving complex optimization problems. To address these challenges, we propose
POLAR, a novel pessimistic model-based policy learning algorithm for offline
DTR optimization. POLAR estimates the transition dynamics from offline data and
quantifies uncertainty for each history-action pair. A pessimistic penalty is
then incorporated into the reward function to discourage actions with high
uncertainty. Unlike many existing methods that focus on average training
performance, POLAR directly targets the suboptimality of the final learned
policy and offers theoretical guarantees, without relying on computationally
intensive minimax or constrained optimization procedures. To the best of our
knowledge, POLAR is the first model-based DTR method to provide both
statistical and computational guarantees, including finite-sample bounds on
policy suboptimality. Empirical results on both synthetic data and the
MIMIC-III dataset demonstrate that POLAR outperforms state-of-the-art methods
and yields near-optimal, history-aware treatment strategies.
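The pessimism principle mentioned above (penalize the reward where uncertainty is high) can be sketched for a single decision step. This is only an illustration under assumed names: the linear penalty form, `penalty_weight`, and the toy estimates are hypothetical, and POLAR itself works over multi-stage history-action pairs with a learned transition model.

```python
from typing import Dict, Tuple


def pessimistic_reward(reward: float, uncertainty: float, penalty_weight: float) -> float:
    """Estimated reward minus a penalty proportional to its uncertainty."""
    return reward - penalty_weight * uncertainty


def pick_action(estimates: Dict[str, Tuple[float, float]], penalty_weight: float) -> str:
    """Choose the action with the highest penalized reward.

    estimates maps each action to (estimated reward, uncertainty).
    """
    return max(
        estimates,
        key=lambda action: pessimistic_reward(*estimates[action], penalty_weight),
    )


# Action "b" looks better on average but is poorly covered by the data.
estimates = {"a": (1.0, 0.1), "b": (1.5, 2.0)}
greedy_choice = pick_action(estimates, penalty_weight=0.0)
pessimistic_choice = pick_action(estimates, penalty_weight=1.0)
```

With no penalty the poorly covered action wins; with the penalty the well-covered action is preferred.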
[LINK]
http://arxiv.org/abs/2506.20406v1
[DATE]
2025-06-25 21:22:57+08:00
[CATEGORIES]
cs.LG
scMamba: A Scalable Foundation Model for Single-Cell Multi-Omics Integration Beyond Highly Variable Feature Selection
[AUTHORS]
Zhen Yuan, Shaoqing Jiao, Yihang Xiao, Jiajie Peng
[ABSTRACT]
The advent of single-cell multi-omics technologies has enabled the
simultaneous profiling of diverse omics layers within individual cells.
Integrating such multimodal data provides unprecedented insights into cellular
identity, regulatory processes, and disease mechanisms. However, it remains
challenging, as current methods often rely on selecting highly variable genes
or peaks during preprocessing, which may inadvertently discard crucial
biological information. Here, we present scMamba, a foundation model designed
to integrate single-cell multi-omics data without the need for prior feature
selection while preserving genomic positional information. scMamba introduces a
patch-based cell tokenization strategy that treats genomic regions as words
(tokens) and cells as sentences. Building upon the concept of state space
duality, scMamba distills rich biological insights from high-dimensional,
sparse single-cell multi-omics data. Additionally, our novel contrastive
learning approach, enhanced with cosine similarity regularization, enables
superior alignment across omics layers compared to traditional methods.
Systematic benchmarking across multiple datasets demonstrates that scMamba
significantly outperforms state-of-the-art methods in preserving biological
variation, aligning omics layers, and enhancing key downstream tasks such as
clustering, cell type annotation, and trajectory inference. Our findings
position scMamba as a powerful tool for large-scale single-cell multi-omics
integration, capable of handling large-scale atlases and advancing biological
discovery.
[LINK]
http://arxiv.org/abs/2506.20697v1
[DATE]
2025-06-25 20:58:01+08:00
[CATEGORIES]
cs.LG
Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking
[AUTHORS]
Ben Kang, Xin Chen, Jie Zhao, Chunjuan Bo, Dong Wang, Huchuan Lu
[ABSTRACT]
Transformer-based visual trackers have demonstrated significant advancements
due to their powerful modeling capabilities. However, their practicality is
limited on resource-constrained devices because of their slow processing
speeds. To address this challenge, we present HiT, a novel family of efficient
tracking models that achieve high performance while maintaining fast operation
across various devices. The core innovation of HiT lies in its Bridge Module,
which connects lightweight transformers to the tracking framework, enhancing
feature representation quality. Additionally, we introduce a dual-image
position encoding approach to effectively encode spatial information. HiT
achieves an impressive speed of 61 frames per second (fps) on the NVIDIA Jetson
AGX platform, alongside a competitive AUC of 64.6% on the LaSOT benchmark,
outperforming all previous efficient trackers. Building on HiT, we propose
DyHiT, an efficient dynamic tracker that flexibly adapts to scene complexity by
selecting routes with varying computational requirements. DyHiT uses search
area features extracted by the backbone network and inputs them into an
efficient dynamic router to classify tracking scenarios. Based on the
classification, DyHiT applies a divide-and-conquer strategy, selecting
appropriate routes to achieve a superior trade-off between accuracy and speed.
The fastest version of DyHiT achieves 111 fps on NVIDIA Jetson AGX while
maintaining an AUC of 62.4% on LaSOT. Furthermore, we introduce a training-free
acceleration method based on the dynamic routing architecture of DyHiT. This
method significantly improves the execution speed of various high-performance
trackers without sacrificing accuracy. For instance, our acceleration method
enables the state-of-the-art tracker SeqTrack-B256 to achieve a 2.68 times
speedup on an NVIDIA GeForce RTX 2080 Ti GPU while maintaining the same AUC of
69.9% on LaSOT.
[COMMENTS]
This paper was accepted by the International Journal of Computer
Vision (IJCV)
[LINK]
http://arxiv.org/abs/2506.20381v1
[DATE]
2025-06-25 20:46:46+08:00
[CATEGORIES]
cs.LG
TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
[AUTHORS]
Zhengpeng Feng, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline Lisaius, Markus Immitzer, James Ball, Clement Atzberger, David A. Coomes, Anil Madhavapeddy, Andrew Blake, Srinivasan Keshav
[ABSTRACT]
Satellite remote sensing (RS) enables a wide array of downstream Earth
observation (EO) applications, including climate modeling, carbon accounting,
and strategies for conservation and sustainable land use. We present TESSERA, a
novel Remote Sensing Foundation Model (RSFM) that uses Self-Supervised Learning
(SSL) to generate global, robust representations at 10m scale from pixel-level
satellite time series data. TESSERA combines information from only optical and
SAR data streams using two parallel Transformer-based encoders: one dedicated
to Sentinel-1 SAR polarizations and another to Sentinel-2 MSI data (10 selected
spectral bands) to create representations that are then fused using a
multilayer perceptron (MLP), resulting in a global representation map covering
the years 2017 to 2024. Our precomputed representations set a new
state-of-the-art performance benchmark and our open-source approach
democratizes access to high-performance, high-resolution representations. We
benchmark the performance of TESSERA in five diverse tasks, comparing our work
with state-of-the-art task-specific models and other foundation models. Our
results show that TESSERA outperforms both traditional RS baselines and the
leading geospatial foundation models in these diverse downstream tasks.
[LINK]
http://arxiv.org/abs/2506.20380v1
[DATE]
2025-06-25 20:46:26+08:00
[CATEGORIES]
cs.LG
WyckoffDiff – A Generative Diffusion Model for Crystal Symmetry
[AUTHORS]
Filip Ekström Kelvinius, Oskar B. Andersson, Abhijith S. Parackal, Dong Qian, Rickard Armiento, Fredrik Lindsten
[ABSTRACT]
Crystalline materials often exhibit a high level of symmetry. However, most
generative models do not account for symmetry, but rather model each atom
without any constraints on its position or element. We propose a generative
model, Wyckoff Diffusion (WyckoffDiff), which generates symmetry-based
descriptions of crystals. This is enabled by considering a crystal structure
representation that encodes all symmetry, and we design a novel neural network
architecture which enables using this representation inside a discrete
generative model framework. In addition to respecting symmetry by construction,
the discrete nature of our model enables fast generation. We additionally
present a new metric, Fréchet Wrenformer Distance, which captures the
symmetry aspects of the materials generated, and we benchmark WyckoffDiff
against recently proposed generative models for crystal generation. As a
proof-of-concept study, we use WyckoffDiff to find new materials below the
convex hull of thermodynamical stability.
[COMMENTS]
Accepted to ICML 2025, to appear in PMLR 267. Code is available
online at https://github.com/httk/wyckoffdiff
[LINK]
http://arxiv.org/abs/2502.06485v3
[DATE]
2025-06-25 20:45:51+08:00
[CATEGORIES]
cs.LG
InvZW: Invariant Feature Learning via Noise-Adversarial Training for Robust Image Zero-Watermarking
[AUTHORS]
Abdullah All Tanvir, Xin Zhong
[ABSTRACT]
This paper introduces a novel deep learning framework for robust image
zero-watermarking based on distortion-invariant feature learning. As a
zero-watermarking scheme, our method leaves the original image unaltered and
learns a reference signature through optimization in the feature space. The
proposed framework consists of two key modules. In the first module, a feature
extractor is trained via noise-adversarial learning to generate representations
that are both invariant to distortions and semantically expressive. This is
achieved by combining adversarial supervision against a distortion
discriminator and a reconstruction constraint to retain image content. In the
second module, we design a learning-based multibit zero-watermarking scheme
where the trained invariant features are projected onto a set of trainable
reference codes optimized to match a target binary message. Extensive
experiments on diverse image datasets and a wide range of distortions show that
our method achieves state-of-the-art robustness in both feature stability and
watermark recovery. Comparative evaluations against existing self-supervised
and deep watermarking techniques further highlight the superiority of our
framework in generalization and robustness.
[LINK]
http://arxiv.org/abs/2506.20370v1
[DATE]
2025-06-25 20:32:08+08:00
[CATEGORIES]
cs.LG
Self-Supervised Graph Learning via Spectral Bootstrapping and Laplacian-Based Augmentations
[AUTHORS]
Lorenzo Bini, Stephane Marchand-Maillet
[ABSTRACT]
We present LaplaceGNN, a novel self-supervised graph learning framework that
bypasses the need for negative sampling by leveraging spectral bootstrapping
techniques. Our method integrates Laplacian-based signals into the learning
process, allowing the model to effectively capture rich structural
representations without relying on contrastive objectives or handcrafted
augmentations. By focusing on positive alignment, LaplaceGNN achieves linear
scaling while offering a simpler, more efficient, self-supervised alternative
for graph neural networks, applicable across diverse domains. Our contributions
are twofold: we precompute spectral augmentations through max-min
centrality-guided optimization, enabling rich structural supervision without
relying on handcrafted augmentations, then we integrate an adversarial
bootstrapped training scheme that further strengthens feature learning and
robustness. Our extensive experiments on different benchmark datasets show that
LaplaceGNN achieves superior performance compared to state-of-the-art
self-supervised graph methods, offering a promising direction for efficiently
learning expressive graph representations.
[COMMENTS]
LaplaceGNN is a novel graph learning framework that employs a
bootstrapped teacher-student architecture. Its precomputed spectral
augmentations and adversarial training enable robust performance,
outperforming SOTA methods while scaling linearly
[LINK]
http://arxiv.org/abs/2506.20362v1
[DATE]
2025-06-25 20:23:23+08:00
[CATEGORIES]
cs.LG
Towards Interpretable and Efficient Feature Selection in Trajectory Datasets: A Taxonomic Approach
[AUTHORS]
Chanuka Don Samarasinghage, Dhruv Gulabani
[ABSTRACT]
Trajectory analysis is not only about obtaining movement data, but it is also
of paramount importance in understanding the pattern in which an object moves
through space and time, as well as in predicting its next move. Due to the
significant interest in the area, data collection has improved substantially,
resulting in a large number of features becoming available for training and
predicting models. However, this introduces a high-dimensionality-induced
feature explosion problem, which reduces the efficiency and interpretability of
the data, thereby reducing the accuracy of machine learning models. To overcome
this issue, feature selection has become one of the most prevalent tools. Thus,
the objective of this paper was to introduce a taxonomy-based feature selection
method that categorizes features based on their internal structure. This
approach classifies the data into geometric and kinematic features, further
categorizing them into curvature, indentation, speed, and acceleration. The
comparative analysis indicated that a taxonomy-based approach consistently
achieved comparable or superior predictive performance. Furthermore, due to the
taxonomic grouping, which reduces combinatorial space, the time taken to select
features was drastically reduced. The taxonomy was also used to gain insights
into what feature sets each dataset was more sensitive to. Overall, this study
provides robust evidence that a taxonomy-based feature selection method can add
a layer of interpretability, reduce dimensionality and computational
complexity, and contribute to high-level decision-making. It serves as a step
toward providing a methodological framework for researchers and practitioners
dealing with trajectory datasets and contributing to the broader field of
explainable artificial intelligence.
[LINK]
http://arxiv.org/abs/2506.20359v1
[DATE]
2025-06-25 20:21:20+08:00
[CATEGORIES]
cs.LG
A foundation model with multi-variate parallel attention to generate neuronal activity
[AUTHORS]
Francesco Carzaniga, Michael Hersche, Abu Sebastian, Kaspar Schindler, Abbas Rahimi
[ABSTRACT]
Learning from multi-variate time-series with heterogeneous channel
configurations remains a fundamental challenge for deep neural networks (DNNs),
particularly in clinical domains such as intracranial electroencephalography
(iEEG), where channel setups vary widely across subjects. In this work, we
introduce multi-variate parallel attention (MVPA), a novel self-attention
mechanism that disentangles content, temporal, and spatial attention, enabling
flexible, generalizable, and efficient modeling of time-series data with
varying channel counts and configurations. We use MVPA to build MVPFormer, a
generative foundation model for human electrophysiology, trained to predict the
evolution of iEEG signals across diverse subjects. To support this and future
effort by the community, we release the SWEC iEEG dataset, the largest publicly
available iEEG dataset to date, comprising nearly 10,000 hours of recordings
from heterogeneous clinical sources. MVPFormer leverages MVPA to achieve strong
generalization across subjects, demonstrating expert-level performance in
seizure detection and outperforming state-of-the-art Transformer baselines on
the SWEC, MAYO, and FNUSA datasets. We further validate MVPA on standard
time-series forecasting and classification tasks, where it matches or exceeds
existing attention-based models. Together, our contributions establish MVPA as
a general-purpose attention mechanism for heterogeneous time-series and
MVPFormer as the first open-source, open-weights, and open-data iEEG foundation
model with state-of-the-art clinical performance. The code is available at
https://github.com/IBM/multi-variate-parallel-transformer. The SWEC iEEG
dataset is available at
https://mb-neuro.medical-blocks.ch/public_access/databases/ieeg/swec_ieeg.
[COMMENTS]
The code is available at
https://github.com/IBM/multi-variate-parallel-transformer. The SWEC iEEG
dataset is available at
https://mb-neuro.medical-blocks.ch/public_access/databases/ieeg/swec_ieeg
[LINK]
http://arxiv.org/abs/2506.20354v1
[DATE]
2025-06-25 20:07:10+08:00
[CATEGORIES]
cs.LG
Backpropagation Through Time For Networks With Long-Term Dependencies
[AUTHORS]
George Bird, Maxim E. Polivoda
[ABSTRACT]
Backpropagation through time (BPTT) is a technique for updating the tuned
parameters within recurrent neural networks (RNNs). Several attempts at
creating such an algorithm have been made, including Nth Ordered Approximations
and Truncated-BPTT. These methods approximate the backpropagation gradients
under the assumption that the RNN only utilises short-term dependencies. This
is an acceptable assumption to make for the current state of artificial neural
networks. As RNNs become more advanced, a shift towards influence by long-term
dependencies is likely. Thus, a new method for backpropagation is required. We
propose using the ‘discrete forward sensitivity equation’ and a variant of it
for single and multiple interacting recurrent loops respectively. This solution
is exact and also allows the network’s parameters to vary between
subsequent steps; however, it does require the computation of a Jacobian.
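The idea of propagating an exact sensitivity forward in time can be sketched for a scalar recurrence. This is a minimal illustration assuming the recurrence h_t = tanh(w * h_{t-1} + x_t) with a single weight w; the paper's exact equation and notation may differ, and the function name is hypothetical. In the scalar case the Jacobian reduces to the factor (1 - h_t^2) * w.

```python
import math
from typing import Sequence, Tuple


def forward_sensitivity(w: float, inputs: Sequence[float]) -> Tuple[float, float]:
    """Run h_t = tanh(w * h_{t-1} + x_t) and carry dh_t/dw forward exactly.

    Chain rule at each step:
        dh_t/dw = (1 - h_t**2) * (h_{t-1} + w * dh_{t-1}/dw)
    Returns the final state and its derivative with respect to w.
    """
    h, dh_dw = 0.0, 0.0
    for x in inputs:
        h_next = math.tanh(w * h + x)
        dh_dw = (1.0 - h_next * h_next) * (h + w * dh_dw)  # uses the previous h
        h = h_next
    return h, dh_dw


inputs = [0.5, -0.3, 0.8, 0.1]
final_state, sensitivity = forward_sensitivity(0.7, inputs)

# Sanity check against a central finite difference over the whole sequence.
eps = 1e-6
finite_diff = (
    forward_sensitivity(0.7 + eps, inputs)[0]
    - forward_sensitivity(0.7 - eps, inputs)[0]
) / (2 * eps)
```

No truncation is involved: the derivative accounts for every earlier step, at the cost of carrying the sensitivity alongside the state.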
[COMMENTS]
8 Pages, 1 Figure; typos corrected, references added, altered section
titles, added further commentary in section 2.1
[LINK]
http://arxiv.org/abs/2103.15589v3
[DATE]
2025-06-25 20:04:53+08:00
[CATEGORIES]
cs.LG
On the ability of Deep Neural Networks to Learn Granger Causality in Multi-Variate Time Series Data
[AUTHORS]
Malik Shahid Sultan, Hernando Ombao
[ABSTRACT]
Granger Causality (GC) offers an elegant statistical framework to study the
association between multivariate time series data. Linear Vector Autoregressive
(VAR) models have nice interpretability properties but limited
practical application due to underlying assumptions on the kind of associations
that can be captured by these models. Numerous attempts have already been made
in the literature that exploit the functional approximation power of Deep
Neural Networks (DNNs) for the task of GC estimation. These methods however
treat GC as a variable selection problem. We present a novel paradigm for
approaching GC. We present the idea that GC is essentially linked with
prediction, and that if a deep learning model is used to model the time series
collectively or jointly, a well-regularized model may learn the true Granger
causal structure from the data, given that there is enough training data. We
propose to uncover the learned GC structure by comparing the model uncertainty
or distribution of the residuals when the past of everything is used as
compared to the one where a specific time series component is dropped from the
model. We also compare the effect of input layer dropout on the ability of a
neural network to learn Granger causality from the data. We show that a
well-regularized model can in fact learn the true GC structure from the data without
explicitly adding terms in the loss function that guide the model to select
variables or perform sparse regression.
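The comparison described above (residuals when the past of everything is used versus when one series is dropped) can be sketched with a linear stand-in. This is not the authors' deep-learning procedure: it assumes a one-lag least-squares predictor without intercept, and the function names and simulated data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def fit_rss(regressors: List[Sequence[float]], target: Sequence[float]) -> float:
    """Residual sum of squares of a least-squares fit with one or two regressors."""
    if len(regressors) == 1:
        (u,) = regressors
        a = _dot(u, target) / _dot(u, u)
        return sum((t - a * ui) ** 2 for ui, t in zip(u, target))
    u, v = regressors
    suu, svv, suv = _dot(u, u), _dot(v, v), _dot(u, v)
    suy, svy = _dot(u, target), _dot(v, target)
    det = suu * svv - suv * suv  # determinant of the 2x2 normal equations
    a = (suy * svv - suv * svy) / det
    b = (suu * svy - suv * suy) / det
    return sum((t - a * ui - b * vi) ** 2 for ui, vi, t in zip(u, v, target))


def residual_ratio(cause: Sequence[float], effect: Sequence[float]) -> float:
    """RSS with `cause` dropped divided by RSS with it included (one lag).

    A ratio well above 1 suggests the past of `cause` helps predict `effect`.
    """
    target, own_past, other_past = effect[1:], effect[:-1], cause[:-1]
    reduced = fit_rss([own_past], target)
    full = fit_rss([own_past, other_past], target)
    return reduced / full


def simulate(n: int = 500, seed: int = 0) -> Tuple[List[float], List[float]]:
    """x is white noise; y depends on its own past and on the past of x."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [0.0]
    for t in range(1, n):
        y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.gauss(0.0, 1.0))
    return x, y


x, y = simulate()
x_to_y = residual_ratio(cause=x, effect=y)  # large: x drives y
y_to_x = residual_ratio(cause=y, effect=x)  # close to 1: y does not drive x
```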
[LINK]
http://arxiv.org/abs/2506.20347v1
[DATE]
2025-06-25 19:57:24+08:00
[CATEGORIES]
cs.LG
Signatures of planets and Galactic subpopulations in solar analogs. Precise chemical abundances with neural networks
[AUTHORS]
Giulia Martos, Jorge Meléndez, Lorenzo Spina, Sara Lucatello
[ABSTRACT]
The aim of this work is to obtain precise atmospheric parameters and chemical
abundances automatically for solar twins and analogs to find signatures of
exoplanets, as well as to assess how peculiar the Sun is compared to these
stars and to analyze any possible fine structures in the Galactic thin disk. We
developed a neural network (NN) algorithm using Python to obtain these
parameters for a sample of 99 solar twins and solar analogs previously studied
in the literature from normalized high-quality spectra from HARPS, with a
resolving power of R $\sim$ 115000 and a signal-to-noise ratio S/N > 400. We
obtained precise atmospheric parameters and abundance ratios [X/Fe] of 20
chemical elements (Li, C, O, Na, Mg, Al, Si, S, Ca, Sc, Ti, V, Cr, Mn, Co, Ni,
Cu, Zn, Y, and Ba). The results are in line with the literature, with average
differences and standard deviations of $(2 \pm 27)$ K for T$_{\rm eff}$, $(0.00
\pm 0.06)$ dex for log g, $(0.00 \pm 0.02)$ dex for [Fe/H], $(-0.01 \pm 0.05)$
km s$^{-1}$ for microturbulence velocity, $(0.02 \pm 0.08)$ km s$^{-1}$ for the
macro turbulence velocity, and $(-0.12 \pm 0.26)$ km s$^{-1}$ for the projected
rotational velocity ($v \sin i$). Regarding the chemical abundances, most of the
elements agree with the literature within 0.01 - 0.02 dex. The abundances were
corrected for the effects of the Galactic chemical evolution and analyzed with
the condensation temperature (T$_{\rm cond}$) to verify whether the stars
presented depletion of refractories compared to volatiles. We found that the
Sun is more depleted in refractory elements compared to volatiles than 89% of
the studied solar analogs, with a significance of 9.5$\sigma$ when compared to
the stars without detected exoplanets. We also found the possible presence of
three subpopulations in the solar analogs: one Cu-rich, one Cu-poor, and the
last one slightly older and poor in Na.
[COMMENTS]
Accepted by A&A
[LINK]
http://arxiv.org/abs/2506.20345v1
[DATE]
2025-06-25 19:55:14+08:00
[CATEGORIES]
cs.LG
A Complete Loss Landscape Analysis of Regularized Deep Matrix Factorization
[AUTHORS]
Po Chen, Rujun Jiang, Peng Wang
[ABSTRACT]
Despite its wide range of applications across various domains, the
optimization foundations of deep matrix factorization (DMF) remain largely
open. In this work, we aim to fill this gap by conducting a comprehensive study
of the loss landscape of the regularized DMF problem. Toward this goal, we
first provide a closed-form expression of all critical points. Building on
this, we establish precise conditions under which a critical point is a local
minimizer, a global minimizer, a strict saddle point, or a non-strict saddle
point. Leveraging these results, we derive a necessary and sufficient condition
under which each critical point is either a local minimizer or a strict saddle
point. This provides insights into why gradient-based methods almost always
converge to a local minimizer of the regularized DMF problem. Finally, we
conduct numerical experiments to visualize its loss landscape under different
settings to support our theory.
[COMMENTS]
35 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.20344v1
[DATE]
2025-06-25 19:51:41+08:00
[CATEGORIES]
cs.LG
Feature Hallucination for Self-supervised Action Recognition
[AUTHORS]
Lei Wang, Piotr Koniusz
[ABSTRACT]
Understanding human actions in videos requires more than raw pixel analysis;
it relies on high-level semantic reasoning and effective integration of
multimodal features. We propose a deep translational action recognition
framework that enhances recognition accuracy by jointly predicting action
concepts and auxiliary features from RGB video frames. At test time,
hallucination streams infer missing cues, enriching feature representations
without increasing computational overhead. To focus on action-relevant regions
beyond raw pixels, we introduce two novel domain-specific descriptors. Object
Detection Features (ODF) aggregate outputs from multiple object detectors to
capture contextual cues, while Saliency Detection Features (SDF) highlight
spatial and intensity patterns crucial for action recognition. Our framework
seamlessly integrates these descriptors with auxiliary modalities such as
optical flow, Improved Dense Trajectories, skeleton data, and audio cues. It
remains compatible with state-of-the-art architectures, including I3D,
AssembleNet, Video Transformer Network, FASTER, and recent models like VideoMAE
V2 and InternVideo2. To handle uncertainty in auxiliary features, we
incorporate aleatoric uncertainty modeling in the hallucination step and
introduce a robust loss function to mitigate feature noise. Our multimodal
self-supervised action recognition framework achieves state-of-the-art
performance on multiple benchmarks, including Kinetics-400, Kinetics-600, and
Something-Something V2, demonstrating its effectiveness in capturing
fine-grained action dynamics.
[COMMENTS]
Accepted for publication in International Journal of Computer Vision
(IJCV)
[LINK]
http://arxiv.org/abs/2506.20342v1
[DATE]
2025-06-25 19:50:23+08:00
[CATEGORIES]
cs.LG
Recurrent neural network-based robust control systems with closed-loop regional incremental ISS and application to MPC design
[AUTHORS]
Daniele Ravasio, Marcello Farina, Alessio La Bella, Andrea Ballarino
[ABSTRACT]
This paper investigates the design of output-feedback schemes for systems
described by a class of recurrent neural networks. We propose a procedure based
on linear matrix inequalities for designing an observer and a static
state-feedback controller. The algorithm leverages global and regional
incremental input-to-state stability (incremental ISS) and enables the tracking
of constant setpoints, ensuring robustness to disturbances and state estimation
uncertainty. To address the potential limitations of regional incremental ISS,
we introduce an alternative scheme in which the static law is replaced with a
tube-based nonlinear model predictive controller (NMPC) that exploits regional
incremental ISS properties. We show that these conditions enable the
formulation of a robust NMPC law with guarantees of convergence and recursive
feasibility, leading to an enlarged region of attraction. Theoretical results
are validated through numerical simulations on the pH-neutralisation process
benchmark, demonstrating the effectiveness of the proposed schemes.
[COMMENTS]
16 pages, 7 figures, submitted to IEEE Transactions on Automatic
Control (under review)
[LINK]
http://arxiv.org/abs/2506.20334v1
[DATE]
2025-06-25 19:44:28+08:00
[CATEGORIES]
cs.LG
Permutation Equivariant Neural Controlled Differential Equations for Dynamic Graph Representation Learning
[AUTHORS]
Torben Berndt, Benjamin Walker, Tiexin Qin, Jan Stühmer, Andrey Kormilitzin
[ABSTRACT]
Dynamic graphs exhibit complex temporal dynamics due to the interplay between
evolving node features and changing network structures. Recently, Graph Neural
Controlled Differential Equations (Graph Neural CDEs) successfully adapted
Neural CDEs from paths on Euclidean domains to paths on graph domains. Building
on this foundation, we introduce Permutation Equivariant Neural Graph CDEs,
which project Graph Neural CDEs onto permutation equivariant function spaces.
This significantly reduces the model’s parameter count without compromising
representational power, resulting in more efficient training and improved
generalisation. We empirically demonstrate the advantages of our approach
through experiments on simulated dynamical systems and real-world tasks,
showing improved performance in both interpolation and extrapolation scenarios.
[LINK]
http://arxiv.org/abs/2506.20324v1
[DATE]
2025-06-25 19:06:30+08:00
[CATEGORIES]
cs.LG
BINDy – Bayesian identification of nonlinear dynamics with reversible-jump Markov-chain Monte-Carlo
[AUTHORS]
Max D. Champneys, Timothy J. Rogers
[ABSTRACT]
Model parsimony is an important cognitive bias in data-driven
modelling that aids interpretability and helps to prevent over-fitting. Sparse
identification of nonlinear dynamics (SINDy) methods are able to learn sparse
representations of complex dynamics directly from data, given a basis of
library functions. In this work, a novel Bayesian treatment of dictionary
learning system identification, as an alternative to SINDy, is envisaged. The
proposed method – Bayesian identification of nonlinear dynamics (BINDy) – is
distinct from previous approaches in that it targets the full joint posterior
distribution over both the terms in the library and their parameterisation in
the model. This formulation confers the advantage that an arbitrary prior may
be placed over the model structure to produce models that are sparse in the
model space rather than in parameter space. Because this posterior is defined
over parameter vectors that can change in dimension, the inference cannot be
performed by standard techniques. Instead, a Gibbs sampler based on
reversible-jump Markov-chain Monte-Carlo is proposed. BINDy is shown to compare
favourably to ensemble SINDy in three benchmark case-studies. In particular, it
is seen that the proposed method is better able to assign high probability to
correct model terms.
[LINK]
http://arxiv.org/abs/2408.08062v3
[DATE]
2025-06-25 18:45:10+08:00
[CATEGORIES]
cs.LG
Beyond-Expert Performance with Limited Demonstrations: Efficient Imitation Learning with Double Exploration
[AUTHORS]
Heyang Zhao, Xingrui Yu, David M. Bossens, Ivor W. Tsang, Quanquan Gu
[ABSTRACT]
Imitation learning is a central problem in reinforcement learning where the
goal is to learn a policy that mimics the expert’s behavior. In practice, it is
often challenging to learn the expert policy from a limited number of
demonstrations accurately due to the complexity of the state space. Moreover,
it is essential to explore the environment and collect data to achieve
beyond-expert performance. To overcome these challenges, we propose a novel
imitation learning algorithm called Imitation Learning with Double Exploration
(ILDE), which implements exploration in two aspects: (1) optimistic policy
optimization via an exploration bonus that rewards state-action pairs with high
uncertainty to potentially improve the convergence to the expert policy, and
(2) curiosity-driven exploration of the states that deviate from the
demonstration trajectories to potentially yield beyond-expert performance.
Empirically, we demonstrate that ILDE outperforms the state-of-the-art
imitation learning algorithms in terms of sample efficiency and achieves
beyond-expert performance on Atari and MuJoCo tasks with fewer demonstrations
than in previous work. We also provide a theoretical justification of ILDE as
an uncertainty-regularized policy optimization method with optimistic
exploration, leading to a regret growing sublinearly in the number of episodes.
[LINK]
http://arxiv.org/abs/2506.20307v1
[DATE]
2025-06-25 18:39:32+08:00
[CATEGORIES]
cs.LG
Learning Moderately Input-Sensitive Functions: A Case Study in QR Code Decoding
[AUTHORS]
Kazuki Yoda, Kazuhiko Kawamoto, Hiroshi Kera
[ABSTRACT]
The hardness of learning a function that attains a target task relates to its
input-sensitivity. For example, image classification tasks are
input-insensitive as minor corruptions should not affect the classification
results, whereas arithmetic and symbolic computation, which have been recently
attracting interest, are highly input-sensitive as each input variable connects
to the computation results. This study presents the first learning-based Quick
Response (QR) code decoding and investigates learning functions of medium
sensitivity. Our experiments reveal that Transformers can successfully decode
QR codes, even beyond the theoretical error-correction limit, by learning the
structure of embedded texts. They generalize from English-rich training data to
other languages and even random strings. Moreover, we observe that the
Transformer-based QR decoder focuses on data bits while ignoring
error-correction bits, suggesting a decoding mechanism distinct from standard
QR code readers.
[COMMENTS]
17 pages, 13 figures
[LINK]
http://arxiv.org/abs/2506.20305v1
[DATE]
2025-06-25 18:37:39+08:00
[CATEGORIES]
cs.LG
Bilinear MLPs enable weight-based mechanistic interpretability
[AUTHORS]
Michael T. Pearce, Thomas Dooms, Alice Rigg, Jose M. Oramas, Lee Sharkey
[ABSTRACT]
A mechanistic understanding of how MLPs do computation in deep neural
networks remains elusive. Current interpretability work can extract features
from hidden activations over an input dataset but generally cannot explain how
MLP weights construct features. One challenge is that element-wise
nonlinearities introduce higher-order interactions and make it difficult to
trace computations through the MLP layer. In this paper, we analyze bilinear
MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity
that nevertheless achieves competitive performance. Bilinear MLPs can be fully
expressed in terms of linear operations using a third-order tensor, allowing
flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights
using eigendecomposition reveals interpretable low-rank structure across toy
tasks, image classification, and language modeling. We use this understanding
to craft adversarial examples, uncover overfitting, and identify small language
model circuits directly from the weights alone. Our results demonstrate that
bilinear layers serve as an interpretable drop-in replacement for current
activation functions and that weight-based interpretability is viable for
understanding deep-learning models.
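
The tensor view described above can be illustrated with a minimal NumPy sketch. It assumes the common bilinear form out = (W x) * (V x) with no bias; the names `W`, `V`, `B`, and `B_sym` and the toy dimensions are hypothetical and are not taken from the paper's code. The sketch builds the third-order tensor, checks that it reproduces the layer output, and eigendecomposes the symmetric interaction matrix of one output unit.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3  # toy sizes, chosen only for illustration

# Two weight matrices of an assumed bias-free bilinear layer.
W = rng.normal(size=(d_out, d_in))
V = rng.normal(size=(d_out, d_in))


def bilinear_layer(x):
    """Bilinear MLP output: element-wise product of two linear maps."""
    return (W @ x) * (V @ x)


# Third-order tensor B[k, i, j] = W[k, i] * V[k, j], so that
# output_k = sum_ij B[k, i, j] * x_i * x_j.
B = np.einsum("ki,kj->kij", W, V)

# Only the symmetric part of each slice B[k] contributes to x^T B[k] x.
B_sym = 0.5 * (B + B.transpose(0, 2, 1))

x = rng.normal(size=d_in)
out_tensor = np.einsum("kij,i,j->k", B_sym, x, x)

# Eigendecomposition of the interaction matrix for output unit 0.
eigvals, eigvecs = np.linalg.eigh(B_sym[0])
# The same output, written as a weighted sum of squared projections.
out0_spectral = np.sum(eigvals * (eigvecs.T @ x) ** 2)
```

For a single output unit the interaction matrix is the symmetrized outer product of two weight rows, so it has at most two nonzero eigenvalues; richer spectra arise only after contracting the tensor with downstream directions.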
[COMMENTS]
Accepted to ICLR’25
[LINK]
http://arxiv.org/abs/2410.08417v2
[DATE]
2025-06-25 18:36:59+08:00
[CATEGORIES]
cs.LG
Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning
[AUTHORS]
Seungho Baek, Taegeon Park, Jongchan Park, Seungjun Oh, Yusung Kim
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2506.07744v2
[DATE]
2025-06-25 18:33:47+08:00
[CATEGORIES]
cs.LG
OLALa: Online Learned Adaptive Lattice Codes for Heterogeneous Federated Learning
[AUTHORS]
Natalie Lang, Maya Simhi, Nir Shlezinger
[ABSTRACT]
Federated learning (FL) enables collaborative training across distributed
clients without sharing raw data, often at the cost of substantial
communication overhead induced by transmitting high-dimensional model updates.
This overhead can be alleviated by having the clients quantize their model
updates, with dithered lattice quantizers identified as an attractive scheme
due to their structural simplicity and convergence-preserving properties.
However, existing lattice-based FL schemes typically rely on a fixed
quantization rule, which is suboptimal in heterogeneous and dynamic
environments where the distribution of model updates varies across users and
training rounds. In this work, we propose Online Learned Adaptive Lattices
(OLALa), a heterogeneous FL framework where each client can adjust its
quantizer online using lightweight local computations. We first derive
convergence guarantees for FL with non-fixed lattice quantizers and show that
proper lattice adaptation can tighten the convergence bound. Then, we design an
online learning algorithm that enables clients to tune their quantizers
throughout the FL process while exchanging only a compact set of quantization
parameters. Numerical experiments demonstrate that OLALa consistently improves
learning performance under various quantization rates, outperforming
conventional fixed-codebook and non-adaptive schemes.
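
As a rough illustration of the fixed dithered lattice quantizers that the abstract takes as its baseline, the sketch below uses the simplest lattice, the scaled integers `step * Z`, with subtractive dither. It is not the OLALa algorithm: the function names, the step size, and the assumption that client and server share the dither (for example through a common random seed) are all hypothetical choices made for this example.

```python
import numpy as np


def dithered_quantize(x, step, rng):
    """Quantize x onto the scalar lattice step * Z after adding uniform dither."""
    dither = rng.uniform(-step / 2, step / 2, size=x.shape)
    q = step * np.round((x + dither) / step)
    return q, dither


def dequantize(q, dither):
    """Subtractive dithering: the receiver removes the shared dither."""
    return q - dither


rng = np.random.default_rng(0)
update = rng.normal(size=1000)  # stand-in for a flattened model update
step = 0.25                     # lattice scale; smaller means a higher rate

q, dither = dithered_quantize(update, step, rng)
recon = dequantize(q, dither)
err = recon - update            # quantization error, bounded by step / 2
```

An adaptive scheme in the spirit of the abstract would change the lattice (here, just `step`) per client and per round instead of fixing it in advance.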
[COMMENTS]
Under review for publication in the IEEE
[LINK]
http://arxiv.org/abs/2506.20297v1
[DATE]
2025-06-25 18:18:34+08:00
[CATEGORIES]
cs.LG
Provably Improving Generalization of Few-Shot Models with Synthetic Data
[AUTHORS]
Lan-Cuong Nguyen, Quan Nguyen-Tri, Bang Tran Khanh, Dung D. Le, Long Tran-Thanh, Khoat Than
[ABSTRACT]
Few-shot image classification remains challenging due to the scarcity of
labeled training examples. Augmenting them with synthetic data has emerged as a
promising way to alleviate this issue, but models trained on synthetic samples
often face performance degradation due to the inherent gap between real and
synthetic distributions. To address this limitation, we develop a theoretical
framework that quantifies the impact of such distribution discrepancies on
supervised learning, specifically in the context of image classification. More
importantly, our framework suggests practical ways to generate good synthetic
samples and to train a predictor with high generalization ability. Building
upon this framework, we propose a novel theoretically grounded algorithm that
integrates prototype learning to optimize both data partitioning and model
training, effectively bridging the gap between real few-shot data and synthetic
data. Extensive experimental results show that our approach achieves
superior performance compared to state-of-the-art methods, outperforming them
across multiple datasets.
[COMMENTS]
ICML 2025. Our code is released at
https://github.com/Fsoft-AIC/ProtoAug
[LINK]
http://arxiv.org/abs/2505.24190v2
[DATE]
2025-06-25 18:02:36+08:00
[CATEGORIES]
cs.LG
Flexible Infinite-Width Graph Convolutional Neural Networks
[AUTHORS]
Ben Anson, Edward Milsom, Laurence Aitchison
[ABSTRACT]
A common theoretical approach to understanding neural networks is to take an
infinite-width limit, at which point the outputs become Gaussian process (GP)
distributed. This is known as a neural network Gaussian process (NNGP).
However, the NNGP kernel is fixed and tunable only through a small number of
hyperparameters, thus eliminating the possibility of representation learning.
This contrasts with finite-width NNs, which are often believed to perform well
because they are able to flexibly learn representations for the task at hand.
Thus, in simplifying NNs to make them theoretically tractable, NNGPs may
eliminate precisely what makes them work well (representation learning). This
motivated us to understand whether representation learning is necessary in a
range of graph tasks. We develop a precise tool for this task, the graph
convolutional deep kernel machine. This is very similar to an NNGP, in that it
is an infinite-width limit and uses kernels, but comes with a “knob” to
control the amount of flexibility and hence representation learning. We found
that representation learning gives noticeable performance improvements for
heterophilous node classification tasks, but less so for homophilous node
classification tasks.
[COMMENTS]
Major revision. Title and abstract updated. Added new analysis
section on linear models and additional datasets. Paper accepted to TMLR
[LINK]
http://arxiv.org/abs/2402.06525v2
[DATE]
2025-06-25 17:59:16+08:00
[CATEGORIES]
cs.LG
Solving Linear-Gaussian Bayesian Inverse Problems with Decoupled Diffusion Sequential Monte Carlo
[AUTHORS]
Filip Ekström Kelvinius, Zheng Zhao, Fredrik Lindsten
[ABSTRACT]
A recent line of research has exploited pre-trained generative diffusion
models as priors for solving Bayesian inverse problems. We contribute to this
research direction by designing a sequential Monte Carlo method for
linear-Gaussian inverse problems which builds on “decoupled diffusion”, where
the generative process is designed such that larger updates to the sample are
possible. The method is asymptotically exact and we demonstrate the
effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC)
algorithm on synthetic data as well as protein and image data. Further, we
demonstrate how the approach can be extended to discrete data.
[COMMENTS]
Accepted to ICML 2025, to appear in PMLR 267. Code available at
https://github.com/filipekstrm/ddsmc
[LINK]
http://arxiv.org/abs/2502.06379v2
[DATE]
2025-06-25 17:54:45+08:00
[CATEGORIES]
cs.LG
X-SiT: Inherently Interpretable Surface Vision Transformers for Dementia Diagnosis
[AUTHORS]
Fabian Bongratz, Tom Nuno Wolf, Jaume Gual Ramon, Christian Wachinger
[ABSTRACT]
Interpretable models are crucial for supporting clinical decision-making,
driving advances in their development and application for medical images.
However, the nature of 3D volumetric data makes it inherently challenging to
visualize and interpret intricate and complex structures like the cerebral
cortex. Cortical surface renderings, on the other hand, provide a more
accessible and understandable 3D representation of brain anatomy, facilitating
visualization and interactive exploration. Motivated by this advantage and the
widespread use of surface data for studying neurological disorders, we present
the eXplainable Surface Vision Transformer (X-SiT). This is the first
inherently interpretable neural network that offers human-understandable
predictions based on interpretable cortical features. As part of X-SiT, we
introduce a prototypical surface patch decoder for classifying surface patch
embeddings, incorporating case-based reasoning with spatially corresponding
cortical prototypes. The results demonstrate state-of-the-art performance in
detecting Alzheimer’s disease and frontotemporal dementia while additionally
providing informative prototypes that align with known disease patterns and
reveal classification errors.
[COMMENTS]
MICCAI 2025
[LINK]
http://arxiv.org/abs/2506.20267v1
[DATE]
2025-06-25 17:24:07+08:00
[CATEGORIES]
cs.LG
3D variational autoencoder for fingerprinting microstructure volume elements
[AUTHORS]
Michael D. White, Michael D. Atkinson, Adam J. Plowman, Pratheek Shanthraj
[ABSTRACT]
Microstructure quantification is an important step towards establishing
structure-property relationships in materials. Machine learning-based image
processing methods have been shown to outperform conventional image processing
techniques and are increasingly applied to microstructure quantification tasks.
In this work, we present a 3D variational autoencoder (VAE) for encoding
microstructure volume elements (VEs) comprising voxelated crystallographic
orientation data. Crystal symmetries in the orientation space are accounted for
by mapping to the crystallographic fundamental zone as a preprocessing step,
which allows for a continuous loss function to be used and improves the
training convergence rate. The VAE is then used to encode a training set of VEs
with an equiaxed polycrystalline microstructure with random texture. Accurate
reconstructions are achieved with a relative average misorientation error of
$3 \times 10^{-2}$ on the test dataset, for a continuous latent space with dimension 256.
We show that the model generalises well to microstructures with textures, grain
sizes and aspect ratios outside the training distribution. Structure-property
relationships are explored through using the training set of VEs as initial
configurations in various crystal plasticity (CP) simulations. Microstructural
fingerprints extracted from the VAE, which parameterise the VEs in a
low-dimensional latent space, are stored alongside the volume-averaged stress
response, at each strain increment, to uniaxial tensile deformation from CP
simulations. This is then used to train a fully connected neural network
mapping the input fingerprint to the resulting stress response, which acts as a
surrogate model for the CP simulation. The fingerprint-based surrogate model is
shown to accurately predict the microstructural dependence in the CP stress
response, with a relative mean-squared error of 2.75 MPa on unseen test data.
[COMMENTS]
28 pages, 11 figures
[LINK]
http://arxiv.org/abs/2503.17427v3
[DATE]
2025-06-25 17:14:01+08:00
[CATEGORIES]
cs.LG
Exploration-Exploitation Tradeoff in Universal Lossy Compression
[AUTHORS]
Nir Weinberger, Ram Zamir
[ABSTRACT]
Universal compression can learn the source and adapt to it either in a batch
mode (forward adaptation), or in a sequential mode (backward adaptation). We
recast the sequential mode as a multi-armed bandit problem, a fundamental model
in reinforcement learning, and study the trade-off between exploration and
exploitation in the lossy compression case. We show that a previously proposed
“natural type selection” scheme can be cast as a reconstruction-directed MAB
algorithm, for sequential lossy compression, and explain its limitations in
terms of robustness and short-block performance. We then derive and analyze
robust cost-directed MAB algorithms, which work at any block length.
[COMMENTS]
An extended version of ISIT 2025 paper
[LINK]
http://arxiv.org/abs/2506.20261v1
[DATE]
2025-06-25 17:08:29+08:00
[CATEGORIES]
cs.LG
Fine-tuning machine-learned particle-flow reconstruction for new detector geometries in future colliders
[AUTHORS]
Farouk Mokhtar, Joosep Pata, Dolores Garcia, Eric Wulff, Mengke Zhang, Michael Kagan, Javier Duarte
[ABSTRACT]
We demonstrate transfer learning capabilities in a machine-learned algorithm
trained for particle-flow reconstruction in high energy particle colliders.
This paper presents a cross-detector fine-tuning study, where we initially
pretrain the model on a large full simulation dataset from one detector design,
and subsequently fine-tune the model on a sample with a different collider and
detector design. Specifically, we use the Compact Linear Collider detector
(CLICdet) model for the initial training set and demonstrate successful
knowledge transfer to the CLIC-like detector (CLD) proposed for the Future
Circular Collider in electron-positron mode. We show that with an order of
magnitude fewer samples from the second dataset, we can achieve the same
performance as a costly training from scratch, across particle-level and
event-level performance metrics, including jet and missing transverse momentum
resolution. Furthermore, we find that the fine-tuned model achieves comparable
performance to the traditional rule-based particle-flow approach on event-level
metrics after training on 100,000 CLD events, whereas a model trained from
scratch requires at least 1 million CLD events to achieve similar
reconstruction performance. To our knowledge, this represents the first
full-simulation cross-detector transfer learning study for particle-flow
reconstruction. These findings offer valuable insights towards building large
foundation models that can be fine-tuned across different detector designs and
geometries, helping to accelerate the development cycle for new detectors and
opening the door to rapid detector design and optimization using machine
learning.
[COMMENTS]
20 pages, 13 figures
[LINK]
http://arxiv.org/abs/2503.00131v4
[DATE]
2025-06-25 17:07:47+08:00
[CATEGORIES]
cs.LG
A Transformer Based Handwriting Recognition System Jointly Using Online and Offline Features
[AUTHORS]
Ayush Lodh, Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal
[ABSTRACT]
We posit that handwriting recognition benefits from complementary cues
carried by the rasterized complex glyph and the pen’s trajectory, yet most
systems exploit only one modality. We introduce an end-to-end network that
performs early fusion of offline images and online stroke data within a shared
latent space. A patch encoder converts the grayscale crop into fixed-length
visual tokens, while a lightweight transformer embeds the $(x, y, \text{pen})$
sequence. Learnable latent queries attend jointly to both token streams,
yielding context-enhanced stroke embeddings that are pooled and decoded under a
cross-entropy loss objective. Because integration occurs before any high-level
classification, temporal cues reinforce each other during representation
learning, producing stronger writer independence. Comprehensive experiments on
IAMOn-DB and VNOn-DB demonstrate that our approach achieves state-of-the-art
accuracy, exceeding previous bests by up to 1%. Our study also shows
adaptation of this pipeline with gesturification on the ISI-Air dataset. Our
code can be found here.
[COMMENTS]
15 pages, 7 figures
[LINK]
http://arxiv.org/abs/2506.20255v1
[DATE]
2025-06-25 16:58:47+08:00
[CATEGORIES]
cs.LG
Time-series surrogates from energy consumers generated by machine learning approaches for long-term forecasting scenarios
[AUTHORS]
Ben Gerhards, Nikita Popkov, Annekatrin König, Marcel Arpogaus, Bastian Schäfermeier, Leonie Riedl, Stephan Vogt, Philip Hehlert
[ABSTRACT]
Forecasting attracts a lot of research attention in the electricity value
chain. However, most studies concentrate on short-term forecasting of
generation or consumption with a focus on systems and less on individual
consumers. Even more neglected is the topic of long-term forecasting of
individual power consumption.
Here, we provide an in-depth comparative evaluation of data-driven methods
for generating synthetic time series data tailored to energy consumption
long-term forecasting. High-fidelity synthetic data is crucial for a wide range
of applications, including state estimations in energy systems or power grid
planning. In this study, we assess and compare the performance of multiple
state-of-the-art but less common techniques: a hybrid Wasserstein Generative
Adversarial Network (WGAN), Denoising Diffusion Probabilistic Model (DDPM),
Hidden Markov Model (HMM), and Masked Autoregressive Bernstein polynomial
normalizing Flows (MABF). We analyze the ability of each method to replicate
the temporal dynamics, long-range dependencies, and probabilistic transitions
characteristic of individual energy consumption profiles. Our comparative
evaluation highlights the strengths and limitations of WGAN, DDPM, HMM, and
MABF, aiding in selecting the most suitable approach for state estimations and
other energy-related tasks. Our generation and analysis framework aims to
enhance the accuracy and reliability of synthetic power consumption data while
generating data that fulfills criteria such as anonymisation, addressing privacy
concerns and mitigating the risk of profiling individual customers. This study
utilizes an open-source dataset from households in Germany with 15 min time
resolution. The generated synthetic power profiles can readily be used in
applications like state estimations or consumption forecasting.
[LINK]
http://arxiv.org/abs/2506.20253v1
[DATE]
2025-06-25 16:54:47+08:00
[CATEGORIES]
cs.LG
Q-resafe: Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models
[AUTHORS]
Kejia Chen, Jiawen Zhang, Jiacong Hu, Yu Wang, Jian Lou, Zunlei Feng, Mingli Song
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2506.20251v1
[DATE]
2025-06-25 16:52:22+08:00
[CATEGORIES]
cs.LG
FedBKD: Distilled Federated Learning to Embrace Generalization and Personalization on Non-IID Data
[AUTHORS]
Yushan Zhao, Jinyuan He, Donglai Chen, Weijie Luo, Chong Xie, Ri Zhang, Yonghong Chen, Yan Xu
[ABSTRACT]
Federated learning (FL) is a decentralized collaborative machine learning
(ML) technique. It provides a solution to the issues of isolated data islands
and data privacy leakage in industrial ML practices. One major challenge in FL
is handling non-independent and identically distributed (non-IID) data.
Current solutions either focus on constructing an all-powerful global model, or
customizing personalized local models. Few of them can provide both a
well-generalized global model and well-performing local models at the same time.
Additionally, many FL solutions to the non-IID problem benefit from
introducing public datasets. However, this will also increase the risk of data
leakage. To tackle these problems, we propose a novel data-free distillation
framework, Federated Bidirectional Knowledge Distillation (FedBKD).
Specifically, we train Generative Adversarial Networks (GAN) for synthetic
data. During the GAN training, local models serve as discriminators and their
parameters are frozen. The synthetic data is then used for bidirectional
distillation between global and local models to achieve knowledge interactions
so that performance on both sides improves. We conduct extensive
experiments on 4 benchmarks under different non-IID settings. The results show
that FedBKD achieves state-of-the-art performance in every case.
[LINK]
http://arxiv.org/abs/2506.20245v1
[DATE]
2025-06-25 16:42:10+08:00
[CATEGORIES]
cs.LG
E-ABIN: an Explainable module for Anomaly detection in BIological Networks
[AUTHORS]
Ugo Lomoio, Tommaso Mazza, Pierangelo Veltri, Pietro Hiram Guzzi
[ABSTRACT]
The increasing availability of large-scale omics data calls for robust
analytical frameworks capable of handling complex gene expression datasets
while offering interpretable results. Recent advances in artificial
intelligence have enabled the identification of aberrant molecular patterns
distinguishing disease states from healthy controls. Coupled with improvements
in model interpretability, these tools now support the identification of genes
potentially driving disease phenotypes. However, current approaches to gene
anomaly detection often remain limited to single datasets and lack accessible
graphical interfaces. Here, we introduce E-ABIN, a general-purpose, explainable
framework for Anomaly detection in Biological Networks. E-ABIN combines
classical machine learning and graph-based deep learning techniques within a
unified, user-friendly platform, enabling the detection and interpretation of
anomalies from gene expression or methylation-derived networks. By integrating
algorithms such as Support Vector Machines, Random Forests, Graph Autoencoders
(GAEs), and Graph Adversarial Attributed Networks (GAANs), E-ABIN ensures a
high predictive accuracy while maintaining interpretability. We demonstrate the
utility of E-ABIN through case studies of bladder cancer and coeliac disease,
where it effectively uncovers biologically relevant anomalies and offers
insights into disease mechanisms.
[LINK]
http://arxiv.org/abs/2506.20693v1
[DATE]
2025-06-25 16:25:17+08:00
[CATEGORIES]
cs.LG
Gradient-Free Sequential Bayesian Experimental Design via Interacting Particle Systems
[AUTHORS]
Robert Gruhlke, Matei Hanu, Claudia Schillings, Philipp Wacker
[ABSTRACT]
We introduce a gradient-free framework for Bayesian Optimal Experimental
Design (BOED) in sequential settings, aimed at complex systems where gradient
information is unavailable. Our method combines Ensemble Kalman Inversion (EKI)
for design optimization with the Affine-Invariant Langevin Dynamics (ALDI)
sampler for efficient posterior sampling, both of which are derivative-free and
ensemble-based. To address the computational challenges posed by nested
expectations in BOED, we propose variational Gaussian and parametrized Laplace
approximations that provide tractable upper and lower bounds on the Expected
Information Gain (EIG). These approximations enable scalable utility estimation
in high-dimensional spaces and PDE-constrained inverse problems. We demonstrate
the performance of our framework through numerical experiments ranging from
linear Gaussian models to PDE-based inference tasks, highlighting the method’s
robustness, accuracy, and efficiency in information-driven experimental design.
[LINK]
http://arxiv.org/abs/2504.13320v2
[DATE]
2025-06-25 16:22:09+08:00
[CATEGORIES]
cs.LG
Supporting renewable energy planning and operation with data-driven high-resolution ensemble weather forecast
[AUTHORS]
Jingnan Wang, Jie Chao, Shangshang Yang, Congyi Nai, Kaijun Ren, Kefeng Deng, Xi Chen, Yaxin Liu, Hanqiuzi Wen, Ziniu Xiao, Lifeng Zhang, Xiaodong Wang, Jiping Guan, Baoxiang Pan
[ABSTRACT]
The planning and operation of renewable energy, especially wind power, depend
crucially on accurate, timely, and high-resolution weather information.
Coarse-grid global numerical weather forecasts are typically downscaled to meet
these requirements, introducing challenges of scale inconsistency, process
representation error, computation cost, and entanglement of distinct
uncertainty sources from chaoticity, model bias, and large-scale forcing. We
address these challenges by learning the climatological distribution of a
target wind farm using its high-resolution numerical weather simulations. An
optimal combination of this learned high-resolution climatological prior with
coarse-grid large-scale forecasts yields a highly accurate, fine-grained,
full-variable, large ensemble of weather pattern forecasts. Using observed
meteorological records and wind turbine power outputs as references, the
proposed methodology compares favorably with existing
numerical/statistical forecasting-downscaling pipelines with respect to either
deterministic/probabilistic skill or economic gains. Moreover, a 100-member,
10-day forecast with spatial resolution of 1 km and output frequency of 15 min
takes < 1 hour on a moderate-end GPU, in contrast to $\mathcal{O}(10^3)$ CPU
hours for conventional numerical simulation. By drastically reducing
computational costs while maintaining accuracy, our method paves the way for
more efficient and reliable renewable energy planning and operation.
[LINK]
http://arxiv.org/abs/2505.04396v2
[DATE]
2025-06-25 16:04:43+08:00
[CATEGORIES]
cs.LG
MS-TVNet: A Long-Term Time Series Prediction Method Based on Multi-Scale Dynamic Convolution
[AUTHORS]
Chenghan Li, Mingchen Li, Yipu Liao, Ruisheng Diao
[ABSTRACT]
Long-term time series prediction has predominantly relied on Transformer and
MLP models, while the potential of convolutional networks in this domain
remains underexplored. To address this gap, we introduce a novel multi-scale
time series reshape module, which effectively captures the relationships among
multi-period patches and variable dependencies. Building upon this module, we
propose MS-TVNet, a multi-scale 3D dynamic convolutional neural network.
Through comprehensive evaluations on diverse datasets, MS-TVNet demonstrates
superior performance compared to baseline models, achieving state-of-the-art
(SOTA) results in long-term time series prediction. Our findings highlight the
effectiveness of leveraging convolutional networks for capturing complex
temporal patterns, suggesting a promising direction for future research in this
field. The code is released at https://github.com/Curyyfaust/TVNet.
[LINK]
http://arxiv.org/abs/2506.17253v2
[DATE]
2025-06-25 15:55:20+08:00
[CATEGORIES]
cs.LG
Curved representational Bregman divergences and their applications
[AUTHORS]
Frank Nielsen
[ABSTRACT]
By analogy to curved exponential families in statistics, we define curved
Bregman divergences as Bregman divergences restricted to nonlinear parameter
subspaces. We show that the barycenter of a finite weighted set of parameters
under a curved Bregman divergence amounts to the right Bregman projection onto
the nonlinear subspace of the barycenter with respect to the full Bregman
divergence. We demonstrate the significance of curved Bregman divergences with
two examples: (1) symmetrized Bregman divergences and (2) the Kullback-Leibler
divergence between circular complex normal distributions. We then consider
monotonic embeddings to define representational curved Bregman divergences and
show that the $\alpha$-divergences are representational curved Bregman
divergences with respect to $\alpha$-embeddings of the probability simplex into
the positive measure cone. As an application, we report an efficient method to
calculate the intersection of a finite set of $\alpha$-divergence spheres.
[COMMENTS]
12 pages, 5 figures
[LINK]
http://arxiv.org/abs/2504.05654v2
[DATE]
2025-06-25 15:53:44+08:00
[CATEGORIES]
cs.LG
Affective Priming Score: A Data-Driven Method to Detect Priming in Sequential Datasets
[AUTHORS]
Eduardo Gutierrez Maestro, Hadi Banaee, Amy Loutfi
[ABSTRACT]
Affective priming exemplifies the challenge of ambiguity in affective
computing. While the community has largely addressed this issue from a
label-based perspective, identifying data points in the sequence affected by
the priming effect, the impact of priming on data itself, particularly in
physiological signals, remains underexplored. Data affected by priming can lead
to misclassifications when used in learning models. This study proposes the
Affective Priming Score (APS), a data-driven method to detect data points
influenced by the priming effect. The APS assigns a score to each data point,
quantifying the extent to which it is affected by priming. To validate this
method, we apply it to the SEED and SEED-VII datasets, which contain sufficient
transitions between emotional events to exhibit priming effects. We train
models with the same configuration using both the original data and
priming-free sequences. The misclassification rate is significantly reduced
when using priming-free sequences compared to the original data. This work
contributes to the broader challenge of ambiguity by identifying and mitigating
priming effects at the data level, enhancing model robustness, and offering
valuable insights for the design and collection of affective computing
datasets.
[LINK]
http://arxiv.org/abs/2506.20204v1
[DATE]
2025-06-25 15:48:22+08:00
[CATEGORIES]
cs.LG
DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs
[AUTHORS]
Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda
[ABSTRACT]
Large language models (LLMs) deliver strong performance but are difficult to
deploy due to high memory and compute costs. While pruning reduces these
demands, most methods ignore activation sparsity observed at runtime. We
reinterpret activation sparsity as dynamic structured weight sparsity and
propose DuoGPT, a unified framework that constructs dual-sparse (spMspV)
workloads by combining unstructured weight pruning with activation sparsity. To
preserve accuracy, we extend the Optimal Brain Compression (OBC) framework with
activation-aware calibration and introduce output residuals from the dense
model as correction terms. We further optimize the solution for efficient GPU
execution, enabling scalability to billion-parameter LLMs. Evaluations on
LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured
pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39$\times$
compared to the baseline dense model.
[LINK]
http://arxiv.org/abs/2506.20194v1
[DATE]
2025-06-25 15:35:12+08:00
[CATEGORIES]
cs.LG
IKDiffuser: A Generative Inverse Kinematics Solver for Multi-arm Robots via Diffusion Model
[AUTHORS]
Zeyu Zhang, Ziyuan Jiao
[ABSTRACT]
Solving Inverse Kinematics (IK) problems is fundamental to robotics, but has
primarily been successful with single serial manipulators. For multi-arm
robotic systems, IK remains challenging due to complex self-collisions, coupled
joints, and high-dimensional redundancy. These complexities make traditional IK
solvers slow, prone to failure, and lacking in solution diversity. In this
paper, we present IKDiffuser, a diffusion-based model designed for fast and
diverse IK solution generation for multi-arm robotic systems. IKDiffuser learns
the joint distribution over the configuration space, capturing complex
dependencies and enabling seamless generalization to multi-arm robotic systems
of different structures. In addition, IKDiffuser can incorporate additional
objectives during inference without retraining, offering versatility and
adaptability for task-specific requirements. In experiments on 6 different
multi-arm systems, the proposed IKDiffuser achieves superior solution accuracy,
precision, diversity, and computational efficiency compared to existing
solvers. The proposed IKDiffuser framework offers a scalable, unified approach
to solving multi-arm IK problems, facilitating the potential of multi-arm
robotic systems in real-time manipulation tasks.
[COMMENTS]
under review
[LINK]
http://arxiv.org/abs/2506.13087v3
[DATE]
2025-06-25 15:27:44+08:00
[CATEGORIES]
cs.LG
Causal Operator Discovery in Partial Differential Equations via Counterfactual Physics-Informed Neural Networks
[AUTHORS]
Ronald Katende
[ABSTRACT]
We develop a principled framework for discovering causal structure in partial
differential equations (PDEs) using physics-informed neural networks and
counterfactual perturbations. Unlike classical residual minimization or sparse
regression methods, our approach quantifies operator-level necessity through
functional interventions on the governing dynamics. We introduce causal
sensitivity indices and structural deviation metrics to assess the influence of
candidate differential operators within neural surrogates. Theoretically, we
prove exact recovery of the causal operator support under restricted isometry
or mutual coherence conditions, with residual bounds guaranteeing
identifiability. Empirically, we validate the framework on both synthetic and
real-world datasets across climate dynamics, tumor diffusion, and ocean flows.
Our method consistently recovers governing operators even under noise,
redundancy, and data scarcity, outperforming standard PINNs and DeepONets in
structural fidelity. This work positions causal PDE discovery as a tractable
and interpretable inference task grounded in structural causal models and
variational residual analysis.
[LINK]
http://arxiv.org/abs/2506.20181v1
[DATE]
2025-06-25 15:15:42+08:00
[CATEGORIES]
cs.LG
Valid Selection among Conformal Sets
[AUTHORS]
Mahmoud Hegazy, Liviu Aolaritei, Michael I. Jordan, Aymeric Dieuleveut
[ABSTRACT]
Conformal prediction offers a distribution-free framework for constructing
prediction sets with coverage guarantees. In practice, multiple valid conformal
prediction sets may be available, arising from different models or
methodologies. However, selecting the most desirable set, such as the smallest,
can invalidate the coverage guarantees. To address this challenge, we propose a
stability-based approach that ensures coverage for the selected prediction set.
We extend our results to the online conformal setting, propose several
refinements in settings where additional structure is available, and
demonstrate its effectiveness through experiments.
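
For context, the coverage guarantee mentioned above is usually obtained through a calibration quantile. The following is a minimal sketch of standard split conformal prediction for a single model, not the stability-based selection procedure of the paper; the function names and the toy calibration scores are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    scores: nonconformity scores, e.g. |y - prediction| on held-out data.
    alpha: target miscoverage level in (0, 1).
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points: only the trivial (infinite) set is valid.
        return math.inf
    return sorted(scores)[k - 1]

def prediction_interval(point_prediction, q):
    """Symmetric interval around a point prediction with half-width q."""
    return (point_prediction - q, point_prediction + q)

# Toy calibration scores (illustrative values only).
calibration_scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7, 0.6, 0.8]
q = conformal_quantile(calibration_scores, alpha=0.2)
interval = prediction_interval(2.0, q)
```

Running this for several models and then keeping whichever interval turns out smallest is the data-dependent selection step that the abstract says can break the guarantee.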
[LINK]
http://arxiv.org/abs/2506.20173v1
[DATE]
2025-06-25 14:59:55+08:00
[CATEGORIES]
cs.LG
Causal discovery in deterministic discrete LTI-DAE systems
[AUTHORS]
Bala Rajesh Konkathi, Arun K. Tangirala
[ABSTRACT]
Discovering pure causes or driver variables in deterministic LTI systems is
of vital importance in the data-driven reconstruction of causal networks. A
recent work by Kathari and Tangirala, proposed in 2022, formulated the causal
discovery method as a constraint identification problem. The constraints are
identified using a dynamic iterative PCA (DIPCA)-based approach for dynamical
systems corrupted with Gaussian measurement errors. The DIPCA-based method
works efficiently for dynamical systems devoid of any algebraic relations.
However, several dynamical systems operate under feedback control and/or are
coupled with conservation laws, leading to differential-algebraic (DAE) or
mixed causal systems. In this work, a method, namely the partition of variables
(PoV), for causal discovery in LTI-DAE systems is proposed. This method is
superior to the method that was presented by Kathari and Tangirala (2022), as
PoV also works for pure dynamical systems, which are devoid of algebraic
equations. The proposed method identifies the causal drivers up to a minimal
subset. PoV deploys DIPCA to first determine the number of algebraic relations
($n_a$), the number of dynamical relations ($n_d$) and the constraint matrix.
Subsequently, the subsets are identified through an admissible partitioning of
the constraint matrix by computing its condition number. Case studies are
presented to demonstrate the effectiveness of the proposed method.
[LINK]
http://arxiv.org/abs/2506.20169v1
[DATE]
2025-06-25 14:47:22+08:00
[CATEGORIES]
cs.LG
Active Learning of Deep Neural Networks via Gradient-Free Cutting Planes
[AUTHORS]
Erica Zhang, Fangzhao Zhang, Mert Pilanci
[ABSTRACT]
Active learning methods aim to improve sample complexity in machine learning.
In this work, we investigate an active learning scheme via a novel
gradient-free cutting-plane training method for ReLU networks of arbitrary
depth and develop a convergence theory. We demonstrate, for the first time,
that cutting-plane algorithms, traditionally used in linear models, can be
extended to deep neural networks despite their nonconvexity and nonlinear
decision boundaries. Moreover, this training method induces the first deep
active learning scheme known to achieve convergence guarantees, revealing a
geometric contraction rate of the feasible set. We exemplify the effectiveness
of our proposed active learning method against popular deep active learning
baselines via both synthetic data experiments and sentiment classification
tasks on real datasets.
[LINK]
http://arxiv.org/abs/2410.02145v5
[DATE]
2025-06-25 14:11:27+08:00
[CATEGORIES]
cs.LG
Counterfactual Fairness through Transforming Data Orthogonal to Bias
[AUTHORS]
Shuyi Chen, Shixiang Zhu
[ABSTRACT]
Machine learning models have shown exceptional prowess in solving complex
issues across various domains. However, these models can sometimes exhibit
biased decision-making, resulting in unequal treatment of different groups.
Despite substantial research on counterfactual fairness, methods to reduce the
impact of multivariate and continuous sensitive variables on decision-making
outcomes are still underdeveloped. We propose a novel data pre-processing
algorithm, Orthogonal to Bias (OB), which is designed to eliminate the
influence of a group of continuous sensitive variables, thus promoting
counterfactual fairness in machine learning applications. Our approach, based
on the assumption of a jointly normal distribution within a structural causal
model (SCM), demonstrates that counterfactual fairness can be achieved by
ensuring the data is orthogonal to the observed sensitive variables. The OB
algorithm is model-agnostic, making it applicable to a wide range of machine
learning models and tasks. Additionally, it includes a sparse variant to
improve numerical stability through regularization. Empirical evaluations on
both simulated and real-world datasets, encompassing settings with both
discrete and continuous sensitive variables, show that our methodology
effectively promotes fairer outcomes without compromising accuracy.
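
To illustrate what making data orthogonal to a sensitive variable can mean in the simplest case, here is a hedged sketch that removes the linear component of one feature column explained by one sensitive column. It is an ordinary least-squares residualization under assumed toy data, not the authors' OB algorithm, which handles a group of sensitive variables and includes a sparse variant; all names are hypothetical.

```python
def center(values):
    """Subtract the mean so that projections ignore constant offsets."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def remove_linear_component(feature, sensitive):
    """Residualize one feature column on one sensitive column."""
    x = center(feature)
    z = center(sensitive)
    zz = dot(z, z)
    if zz == 0.0:
        # A constant sensitive variable carries no linear signal to remove.
        return x
    beta = dot(x, z) / zz
    return [xi - beta * zi for xi, zi in zip(x, z)]

# Toy columns (illustrative values only).
sensitive = [1.0, 2.0, 3.0, 4.0]
feature = [2.1, 3.9, 6.2, 7.8]
residual = remove_linear_component(feature, sensitive)
```

After this step the residual column has zero sample covariance with the sensitive column, which is the orthogonality property the abstract refers to in the one-dimensional case.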
[LINK]
http://arxiv.org/abs/2403.17852v3
[DATE]
2025-06-25 13:35:44+08:00
[CATEGORIES]
cs.LG
Accept More, Reject Less: Reducing up to 19% Unnecessary Desk-Rejections over 11 Years of ICLR Data
[AUTHORS]
Xiaoyu Li, Zhao Song, Jiahao Zhang
[ABSTRACT]
The explosive growth of AI research has driven paper submissions at flagship
AI conferences to unprecedented levels, prompting many venues in 2025
(e.g., CVPR, ICCV, KDD, AAAI, IJCAI, WSDM) to enforce strict per-author
submission limits and to desk-reject any excess papers by simple ID order.
While this policy helps reduce reviewer workload, it may unintentionally
discard valuable papers and penalize authors’ efforts. In this paper, we ask
whether it is possible to follow submission
limits while minimizing needless rejections. We first formalize the current
desk-rejection policies as an optimization problem, and then develop a
practical algorithm based on linear programming relaxation and a rounding
scheme. Under extensive evaluation on 11 years of real-world ICLR
(International Conference on Learning Representations) data, our method
preserves up to $19.23\%$ more papers without violating any author limits.
Moreover, our algorithm is highly efficient in practice, with all results on
ICLR data computed within at most 53.64 seconds. Our work provides a simple and
practical desk-rejection strategy that significantly reduces unnecessary
rejections, demonstrating strong potential to improve current CS conference
submission policies.
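
The gap between ID-order desk rejection and an optimal choice can be seen on a toy instance. The sketch below assumes a simple reading of the policy (each author keeps their lowest-ID papers up to the limit, and a paper is rejected if any author exceeds the limit at that paper) and compares it with an exhaustive search. The exhaustive search stands in for the paper's LP relaxation and rounding scheme, which is not reproduced here; all names and data are hypothetical.

```python
from itertools import combinations

def id_order_policy(papers, limit):
    """Reject every paper beyond an author's first `limit` papers by ID.

    papers: dict mapping paper ID -> set of author names.
    """
    by_author = {}
    for pid in sorted(papers):
        for author in papers[pid]:
            by_author.setdefault(author, []).append(pid)
    rejected = set()
    for pids in by_author.values():
        rejected.update(pids[limit:])
    return set(papers) - rejected

def is_feasible(kept, papers, limit):
    """Check that no author appears on more than `limit` kept papers."""
    counts = {}
    for pid in kept:
        for author in papers[pid]:
            counts[author] = counts.get(author, 0) + 1
    return all(c <= limit for c in counts.values())

def brute_force_optimum(papers, limit):
    """Largest feasible set of papers; exponential, for tiny instances only."""
    ids = sorted(papers)
    for size in range(len(ids), -1, -1):
        for subset in combinations(ids, size):
            if is_feasible(subset, papers, limit):
                return set(subset)
    return set()

# Toy instance: author A is on papers 1 and 2, author B on papers 2 and 3.
papers = {1: {"A"}, 2: {"A", "B"}, 3: {"B"}}
kept_by_id_order = id_order_policy(papers, limit=1)
kept_optimal = brute_force_optimum(papers, limit=1)
```

On this instance the ID-order rule keeps one paper while two can be kept without violating any author limit, which is the kind of needless rejection the abstract targets.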
[LINK]
http://arxiv.org/abs/2506.20141v1
[DATE]
2025-06-25 13:23:44+08:00
[CATEGORIES]
cs.LG
High-Resolution Live Fuel Moisture Content (LFMC) Maps for Wildfire Risk from Multimodal Earth Observation Data
[AUTHORS]
Patrick Alan Johnson, Gabriel Tseng, Yawen Zhang, Heather Heward, Virginia Sjahli, Favyen Bastani, Joseph Redmon, Patrick Beukema
[ABSTRACT]
Wildfires are increasing in intensity and severity at an alarming rate.
Recent advances in AI and publicly available satellite data enable monitoring
critical wildfire risk factors globally, at high resolution and low latency.
Live Fuel Moisture Content (LFMC) is a critical wildfire risk factor and is
valuable for both wildfire research and operational response. However,
ground-based LFMC samples are both labor intensive and costly to acquire,
resulting in sparse and infrequent updates. In this work, we explore the use of
a pretrained, highly-multimodal earth-observation model for generating
large-scale spatially complete (wall-to-wall) LFMC maps. Our approach achieves
significant improvements over previous methods using randomly initialized
models (20% reduction in RMSE). We provide an automated pipeline that enables
rapid generation of these LFMC maps across the United States, and demonstrate
its effectiveness in two regions recently impacted by wildfire (Eaton and
Palisades).
[COMMENTS]
10 pages, ICML 2025 (TerraBytes)
[LINK]
http://arxiv.org/abs/2506.20132v1
[DATE]
2025-06-25 12:59:10+08:00
[CATEGORIES]
cs.LG
Log-Linear Attention
[AUTHORS]
Han Guo, Songlin Yang, Tarushii Goel, Eric P. Xing, Tri Dao, Yoon Kim
[ABSTRACT]
The attention mechanism in Transformers is an important primitive for
accurate and scalable sequence modeling. Its quadratic-compute and
linear-memory complexity however remain significant bottlenecks. Linear
attention and state-space models enable linear-time, constant-memory sequence
modeling and can moreover be trained efficiently through matmul-rich
parallelization across sequence length. However, at their core these models are
still RNNs, and thus their use of a fixed-size hidden state to model the
context is a fundamental limitation. This paper develops log-linear attention,
an attention mechanism that balances linear attention’s efficiency and the
expressiveness of softmax attention. Log-linear attention replaces the
fixed-size hidden state with a logarithmically growing set of hidden states. We
show that with a particular growth function, log-linear attention admits a
similarly matmul-rich parallel form whose compute cost is log-linear in
sequence length. Log-linear attention is a general framework and can be applied
on top of existing linear attention variants. As case studies, we instantiate
log-linear variants of two recent architectures – Mamba-2 and Gated DeltaNet
– and find they perform well compared to their linear-time variants.
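
One way to picture a logarithmically growing set of hidden states is to split the prefix seen so far into contiguous buckets whose sizes are powers of two, with one state per bucket. The sketch below shows only that bookkeeping and is an assumption made for illustration: the abstract does not specify the partition or growth function, and no attention computation is performed here.

```python
def power_of_two_buckets(t):
    """Split positions [0, t) into contiguous buckets with power-of-two sizes.

    Buckets follow the binary expansion of t, so older positions fall into
    larger buckets and the most recent positions into smaller ones.
    Returns a list of (start, end) half-open ranges.
    """
    buckets = []
    start = 0
    bit = 1 << (t.bit_length() - 1) if t > 0 else 0
    while bit:
        if t & bit:
            buckets.append((start, start + bit))
            start += bit
        bit >>= 1
    return buckets

# Number of buckets for a few prefix lengths (at most t.bit_length() each).
bucket_counts = {t: len(power_of_two_buckets(t)) for t in (1, 13, 1000, 1024)}
```

Under this assumed scheme the number of buckets never exceeds the bit length of the prefix length, so the state grows logarithmically instead of staying fixed (linear attention) or growing linearly (a softmax key-value cache).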
[LINK]
http://arxiv.org/abs/2506.04761v2
[DATE]
2025-06-25 12:54:28+08:00
[CATEGORIES]
cs.LG
Evaluating Generalization and Representation Stability in Small LMs via Prompting, Fine-Tuning and Out-of-Distribution Prompts
[AUTHORS]
Rahul Raja, Arpita Vats
[COMMENTS]
Accepted at ICML
[LINK]
http://arxiv.org/abs/2506.17289v2
[DATE]
2025-06-25 12:27:25+08:00
[CATEGORIES]
cs.LG
U-R-VEDA: Integrating UNET, Residual Links, Edge and Dual Attention, and Vision Transformer for Accurate Semantic Segmentation of CMRs
[AUTHORS]
Racheal Mukisa, Arvind K. Bansal
[ABSTRACT]
Artificial intelligence, including deep learning models, will play a
transformative role in automated medical image analysis for the diagnosis of
cardiac disorders and their management. Automated accurate delineation of
cardiac images is a necessary first step for the quantification and
automated diagnosis of cardiac disorders. In this paper, we propose a deep
learning based enhanced UNet model, U-R-Veda, which integrates convolution
transformations, vision transformer, residual links, channel-attention, and
spatial attention, together with edge-detection based skip-connections for an
accurate fully-automated semantic segmentation of cardiac magnetic resonance
(CMR) images. The model extracts local-features and their interrelationships
using a stack of combination convolution blocks, with embedded channel and
spatial attention in the convolution block, and vision transformers. Deep
embedding of channel and spatial attention in the convolution block identifies
important features and their spatial localization. The combined edge
information with channel and spatial attention as skip connection reduces
information-loss during convolution transformations. The overall model
significantly improves the semantic segmentation of CMR images necessary for
improved medical image analysis. An algorithm for the dual attention module
(channel and spatial attention) is also presented. Performance results show
that U-R-Veda achieves an average accuracy of 95.2%, based on DSC metrics. The
model outperforms the accuracy attained by other models, based on DSC and HD
metrics, especially for the delineation of right-ventricle and
left-ventricle-myocardium.
[COMMENTS]
15 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.20689v1
[DATE]
2025-06-25 12:10:09+08:00
[CATEGORIES]
cs.LG
Extracting Interpretable Models from Tree Ensembles: Computational and Statistical Perspectives
[AUTHORS]
Brian Liu, Rahul Mazumder, Peter Radchenko
[ABSTRACT]
Tree ensembles are non-parametric methods widely recognized for their
accuracy and ability to capture complex interactions. While these models excel
at prediction, they are difficult to interpret and may fail to uncover useful
relationships in the data. We propose an estimator to extract compact sets of
decision rules from tree ensembles. The extracted models are accurate and can
be manually examined to reveal relationships between the predictors and the
response. A key novelty of our estimator is the flexibility to jointly control
the number of rules extracted and the interaction depth of each rule, which
improves accuracy. We develop a tailored exact algorithm to efficiently solve
optimization problems underlying our estimator and an approximate algorithm for
computing regularization paths, sequences of solutions that correspond to
varying model sizes. We also establish novel non-asymptotic prediction error
bounds for our proposed approach, comparing it to an oracle that chooses the
best data-dependent linear combination of the rules in the ensemble subject to
the same complexity constraint as our estimator. The bounds illustrate that the
large-sample predictive performance of our estimator is on par with that of the
oracle. Through experiments, we demonstrate that our estimator outperforms
existing algorithms for rule extraction.
[LINK]
http://arxiv.org/abs/2506.20114v1
[DATE]
2025-06-25 12:06:37+08:00
[CATEGORIES]
cs.LG
Autonomous Cyber Resilience via a Co-Evolutionary Arms Race within a Fortified Digital Twin Sandbox
[AUTHORS]
Malikussaid, Sutiyo
[ABSTRACT]
The convergence of IT and OT has created hyper-connected ICS, exposing
critical infrastructure to a new class of adaptive, intelligent adversaries
that render static defenses obsolete. Existing security paradigms often fail to
address a foundational “Trinity of Trust,” comprising the fidelity of the
system model, the integrity of synchronizing data, and the resilience of the
analytical engine against sophisticated evasion. This paper introduces the ARC
framework, a method for achieving analytical resilience through an autonomous,
closed-loop hardening process. ARC establishes a perpetual co-evolutionary arms
race within the high-fidelity sandbox of an F-SCDT. A DRL agent, the “Red
Agent,” is formalized and incentivized to autonomously discover stealthy,
physically-plausible attack paths that maximize process disruption while
evading detection. Concurrently, an ensemble-based “Blue Agent” defender is
continuously hardened via adversarial training against the evolving threats
discovered by its adversary. This co-evolutionary dynamic forces both agents to
become progressively more sophisticated, enabling the system to autonomously
probe and patch its own vulnerabilities. Experimental validation on both the
TEP and the SWaT testbeds demonstrates the framework’s superior performance. A
comprehensive ablation study, supported by extensive visualizations including
ROC curves and SHAP plots, reveals that the co-evolutionary process itself is
responsible for a significant performance increase in detecting novel attacks.
By integrating XAI to ensure operator trust and proposing a scalable F-ARC
architecture, this work presents ARC not merely as an improvement, but as a
necessary paradigm shift toward dynamic, self-improving security for the future
of critical infrastructure.
[COMMENTS]
17 pages, 2 figures, 4 equations, 2 algorithms, 4 tables, to be
published in ISPACS Conference 2025, unabridged version
[LINK]
http://arxiv.org/abs/2506.20102v1
[DATE]
2025-06-25 11:28:48+08:00
[CATEGORIES]
cs.LG
Fine-Grained Perturbation Guidance via Attention Head Selection
[AUTHORS]
Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Saungwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
[ABSTRACT]
Recent guidance methods in diffusion models steer reverse sampling by
perturbing the model to construct an implicit weak model and guide generation
away from it. Among these approaches, attention perturbation has demonstrated
strong empirical performance in unconditional scenarios where classifier-free
guidance is not applicable. However, existing attention perturbation methods
lack principled approaches for determining where perturbations should be
applied, particularly in Diffusion Transformer (DiT) architectures where
quality-relevant computations are distributed across layers. In this paper, we
investigate the granularity of attention perturbations, ranging from the layer
level down to individual attention heads, and discover that specific heads
govern distinct visual concepts such as structure, style, and texture quality.
Building on this insight, we propose “HeadHunter”, a systematic framework for
iteratively selecting attention heads that align with user-centric objectives,
enabling fine-grained control over generation quality and visual attributes. In
addition, we introduce SoftPAG, which linearly interpolates each selected
head’s attention map toward an identity matrix, providing a continuous knob to
tune perturbation strength and suppress artifacts. Our approach not only
mitigates the oversmoothing issues of existing layer-level perturbation but
also enables targeted manipulation of specific visual styles through
compositional head selection. We validate our method on modern large-scale
DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1,
demonstrating superior performance in both general quality enhancement and
style-specific guidance. Our work provides the first head-level analysis of
attention perturbation in diffusion models, uncovering interpretable
specialization within attention layers and enabling practical design of
effective perturbation strategies.
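
The interpolation attributed to SoftPAG above can be written as A' = (1 - u) A + u I for a selected head's attention map A. The following is a minimal sketch of that formula on a plain nested list; the function name and the parameter `strength` (the knob u) are hypothetical, and the sketch does not cover head selection or how the perturbed map is used during guidance.

```python
def soft_identity_interpolation(attn, strength):
    """Blend a square attention map toward the identity matrix.

    attn: list of rows, each row a list of attention weights summing to 1.
    strength: value u in [0, 1]; 0 returns attn unchanged, 1 returns identity.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    n = len(attn)
    return [
        [(1.0 - strength) * attn[i][j] + (strength if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]

# Toy 2x2 attention map (illustrative values only).
attn = [[0.5, 0.5], [0.25, 0.75]]
blended = soft_identity_interpolation(attn, 0.5)
```

Because both A and I have rows that sum to 1, every interpolated row still sums to 1, so the result remains a valid attention map for any strength in [0, 1].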
[COMMENTS]
Project page: https://cvlab-kaist.github.io/HeadHunter/
[LINK]
http://arxiv.org/abs/2506.10978v2
[DATE]
2025-06-25 10:37:46+08:00
[CATEGORIES]
cs.LG
Attack Smarter: Attention-Driven Fine-Grained Webpage Fingerprinting Attacks
[AUTHORS]
Yali Yuan, Weiyi Zou, Guang Cheng
[ABSTRACT]
Website Fingerprinting (WF) attacks aim to infer which websites a user is
visiting by analyzing traffic patterns, thereby compromising user anonymity.
Although this technique has been demonstrated to be effective in controlled
experimental environments, it remains largely limited to small-scale scenarios,
typically restricted to recognizing website homepages. In practical settings,
however, users frequently access multiple subpages in rapid succession, often
before previous content fully loads. WebPage Fingerprinting (WPF) generalizes
the WF framework to large-scale environments by modeling subpages of the same
site as distinct classes. These pages often share similar page elements,
resulting in lower inter-class variance in traffic features. Furthermore, we
consider multi-tab browsing scenarios, in which a single trace encompasses
multiple categories of webpages. This leads to overlapping traffic segments,
and similar features may appear in different positions within the traffic,
thereby increasing the difficulty of classification. To address these
challenges, we propose an attention-driven fine-grained WPF attack, named
ADWPF. Specifically, during the training phase, we apply targeted augmentation
to salient regions of the traffic based on attention maps, including attention
cropping and attention masking. ADWPF then extracts low-dimensional features
from both the original and augmented traffic and applies self-attention modules
to capture the global contextual patterns of the trace. Finally, to handle the
multi-tab scenario, we employ the residual attention to generate class-specific
representations of webpages occurring at different temporal positions.
Extensive experiments demonstrate that the proposed method consistently
surpasses state-of-the-art baselines across datasets of different scales.
[LINK]
http://arxiv.org/abs/2506.20082v1
[DATE]
2025-06-25 09:45:55+08:00
[CATEGORIES]
cs.LG
Quantum-Classical Hybrid Quantized Neural Network
[AUTHORS]
Wenxin Li, Chuan Wang, Hongdong Zhu, Qi Gao, Yin Ma, Hai Wei, Kai Wen
[ABSTRACT]
In this work, we present a novel Quadratic Binary Optimization (QBO)
model for quantized neural network training, enabling the use of arbitrary
activation and loss functions through spline interpolation. We introduce
Forward Interval Propagation (FIP), a method designed to tackle the challenges
of non-linearity and the multi-layer composite structure in neural networks by
discretizing activation functions into linear subintervals. This approach
preserves the universal approximation properties of neural networks while
allowing complex nonlinear functions to be optimized using quantum computers,
thus broadening their applicability in artificial intelligence. We provide
theoretical upper bounds on the approximation error and the number of Ising
spins required, by deriving the sample complexity of the empirical risk
minimization problem, from an optimization perspective. A significant challenge
in solving the associated Quadratic Constrained Binary Optimization (QCBO)
model on a large scale is the presence of numerous constraints. When employing
the penalty method to handle these constraints, tuning a large number of
penalty coefficients becomes a critical hyperparameter optimization problem,
increasing computational complexity and potentially affecting solution quality.
To address this, we employ the Quantum Conditional Gradient Descent (QCGD)
algorithm, which leverages quantum computing to directly solve the QCBO
problem. We prove the convergence of QCGD under a quantum oracle with
randomness and bounded variance in objective value, as well as under limited
precision constraints in the coefficient matrix. Additionally, we provide an
upper bound on the Time-To-Solution for the QCBO solving process. Experimental
results using a coherent Ising machine (CIM) demonstrate a 94.95% accuracy on
the Fashion MNIST classification task, with only 1.1-bit precision.
[COMMENTS]
27 pages, 5 figures, comments are welcome
[LINK]
http://arxiv.org/abs/2506.18240v2
[DATE]
2025-06-25 09:01:03+08:00
[CATEGORIES]
cs.LG
Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision
[AUTHORS]
KMA Solaiman, Bharat Bhargava
[ABSTRACT]
Existing multi-media retrieval models either rely on creating a common
subspace with modality-specific representation models or require schema mapping
among modalities to measure similarities among multi-media data. Our goal is to
avoid the annotation overhead incurred from considering retrieval as a
supervised classification task and re-use the pretrained encoders in large
language models and vision tasks. We propose “FemmIR”, a framework to retrieve
multimodal results relevant to information needs expressed with multimodal
queries by example without any similarity label. Such identification is
necessary for real-world applications where data annotations are scarce and
satisfactory performance is required without fine-tuning with a common
framework across applications. We curate a new dataset called MuQNOL for
benchmarking progress on this task. Our technique is based on weak supervision
introduced through edit distance between samples: graph edit distance can be
modified to consider the cost of replacing a data sample in terms of its
properties, and relevance can be measured through the implicit signal from the
amount of edit cost among the objects. Unlike metric learning or encoding
networks, FemmIR re-uses the high-level properties and maintains the property
value and relationship constraints with a multi-level interaction score between
data samples and the query example provided by the user. We empirically
evaluate FemmIR on a missing person use case with MuQNOL. FemmIR performs
comparably to similar retrieval systems in delivering on-demand retrieval
results with exact and approximate similarities while using the existing
property identifiers in the system.
[COMMENTS]
Submitted to ICDE’24. An earlier version of this paper appeared on
TechRxiv: https://www.techrxiv.org/doi/full/10.36227/techrxiv.21990284.v1,
uploaded on February 05, 2023
[LINK]
http://arxiv.org/abs/2506.20070v1
[DATE]
2025-06-25 08:25:08+08:00
[CATEGORIES]
cs.LG
Conformal Prediction with Upper and Lower Bound Models
[AUTHORS]
Miao Li, Michael Klamkin, Mathieu Tanneau, Reza Zandehshahvar, Pascal Van Hentenryck
[ABSTRACT]
This paper studies a Conformal Prediction (CP) methodology for building
prediction intervals in a regression setting, given only deterministic lower
and upper bounds on the target variable. It proposes a new CP mechanism (CPUL)
that goes beyond post-processing by adopting a model selection approach over
multiple nested interval construction methods. Paradoxically, many
well-established CP methods, including CPUL, may fail to provide adequate
coverage in regions where the bounds are tight. To remedy this limitation, the
paper proposes an optimal thresholding mechanism, OMLT, that adjusts CPUL
intervals in tight regions with undercoverage. The combined CPUL-OMLT is
validated on large-scale learning tasks where the goal is to bound the optimal
value of a parametric optimization problem. The experimental results
demonstrate substantial improvements over baseline methods across various
datasets.
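
For context, the post-processing baseline that CPUL is said to go beyond can be sketched as follows: a standard split-conformal interval built from absolute residuals, then intersected with the known deterministic bounds. This is not CPUL or OMLT; the function names and toy numbers are assumptions for illustration.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested coverage.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def clipped_interval(pred, q, lower, upper):
    """Symmetric interval pred +/- q, intersected with the known bounds."""
    return max(pred - q, lower), min(pred + q, upper)


# Toy calibration scores |y - prediction| (illustrative values only).
scores = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 1.0, 0.4, 0.8, 0.6]
q80 = conformal_quantile(scores, alpha=0.2)
interval = clipped_interval(pred=5.0, q=q80, lower=4.5, upper=5.5)
```

When the bounds are tight, as in this toy case, clipping determines the whole interval, which is the regime the abstract singles out as problematic for coverage.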
[LINK]
http://arxiv.org/abs/2503.04071v2
[DATE]
2025-06-25 08:04:42+08:00
[CATEGORIES]
cs.LG
Identifying Heterogeneity in Distributed Learning
[AUTHORS]
Zelin Xiao, Jia Gu, Song Xi Chen
[ABSTRACT]
We study methods for identifying heterogeneous parameter components in
distributed M-estimation with minimal data transmission. One is based on a
re-normalized Wald test, which is shown to be consistent as long as the number
of distributed data blocks $K$ is of a smaller order than the minimum block
sample size and the level of heterogeneity is dense. The second one is an
extreme contrast test (ECT) based on the difference between the largest and
smallest component-wise estimated parameters among data blocks. By introducing
a sample splitting procedure, the ECT can avoid the bias accumulation arising
from the M-estimation procedures, and exhibits consistency for $K$ being much
larger than the sample size while the heterogeneity is sparse. The ECT
procedure is easy to operate and communication-efficient. A combination of the
Wald and the extreme contrast tests is formulated to attain more robust power
under varying levels of sparsity of the heterogeneity. We also conduct
intensive numerical experiments to compare the family-wise error rate (FWER)
and the power of the proposed methods. Additionally, we conduct a case study to
present the implementation and validity of the proposed methods.
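
The statistic underlying the ECT, as described above, is the gap between the largest and smallest block-wise estimate of each parameter component. A minimal sketch of that statistic alone follows; the sample-splitting step, the calibration of a rejection threshold, and the name `extreme_contrast` are not taken from the paper.

```python
def extreme_contrast(block_estimates):
    """Per-component range of estimates across data blocks.

    block_estimates: list of K lists, each holding p component estimates.
    Returns a list of p values: max over blocks minus min over blocks.
    """
    p = len(block_estimates[0])
    return [
        max(block[j] for block in block_estimates)
        - min(block[j] for block in block_estimates)
        for j in range(p)
    ]


# Three blocks, two components; the second component differs in one block.
blocks = [[1.0, 2.0], [1.1, 2.0], [0.9, 5.0]]
contrast = extreme_contrast(blocks)
```

Each block only needs to send its p estimates, which is consistent with the communication efficiency the abstract claims.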
[LINK]
http://arxiv.org/abs/2506.16394v3
[DATE]
2025-06-25 07:55:45+08:00
[CATEGORIES]
cs.LG
Supervised Coupled Matrix-Tensor Factorization (SCMTF) for Computational Phenotyping of Patient Reported Outcomes in Ulcerative Colitis
[AUTHORS]
Cristian Minoccheri, Sophia Tesic, Kayvan Najarian, Ryan Stidham
[ABSTRACT]
Phenotyping is the process of distinguishing groups of patients to identify
different types of disease progression. A recent trend employs low-rank matrix
and tensor factorization methods for their capability of dealing with
multi-modal, heterogeneous, and missing data. Symptom quantification is crucial
for understanding patient experiences in inflammatory bowel disease, especially
in conditions such as ulcerative colitis (UC). However, patient-reported
symptoms are typically noisy, subjective, and significantly more sparse than
other data types. For this reason, they are usually not included in phenotyping
and other machine learning methods. This paper explores the application of
computational phenotyping to leverage Patient-Reported Outcomes (PROs) using a
novel supervised coupled matrix-tensor factorization (SCMTF) method, which
integrates temporal PROs and temporal labs with static features to predict
medication persistence in ulcerative colitis. This is the first tensor-based
method that is both supervised and coupled, the first application to the UC
domain, and the first application to PROs. We use a deep learning framework
that makes the model flexible and easy to train. The proposed method allows us
to handle the large amount of missing data in the PROs. The best model predicts
changes in medication 8 and 20 months in the future with AUCs of 0.853 and
0.803 on the test set respectively. We derive interpretable phenotypes
consisting of static features and temporal features (including their temporal
patterns). We show that low-rank matrix and tensor based phenotyping can be
successfully applied to the UC domain and to highly missing PRO data. We
identify phenotypes useful to predict medication persistence; these phenotypes
include several symptom variables, showing that PROs contain relevant
information that is usually discarded.
[LINK]
http://arxiv.org/abs/2506.20065v1
[DATE]
2025-06-25 07:55:11+08:00
[CATEGORIES]
cs.LG
The Alignment Trap: Complexity Barriers
[AUTHORS]
Jasper Yao
[ABSTRACT]
This paper argues that AI alignment is not merely difficult, but is founded
on a fundamental logical contradiction. We first establish The Enumeration
Paradox: we use machine learning precisely because we cannot enumerate all
necessary safety rules, yet making ML safe requires examples that can only be
generated from the very enumeration we admit is impossible. This paradox is
then confirmed by a set of five independent mathematical proofs, or “pillars of
impossibility.” Our main results show that: (1) Geometric Impossibility: The
set of safe policies has measure zero, a necessary consequence of projecting
infinite-dimensional world-context requirements onto finite-dimensional models.
(2) Computational Impossibility: Verifying a policy’s safety is coNP-complete,
even for non-zero error tolerances. (3) Statistical Impossibility: The training
data required for safety (abundant examples of rare disasters) is a logical
contradiction and thus unobtainable. (4) Information-Theoretic Impossibility:
Safety rules contain more incompressible, arbitrary information than any
feasible network can store. (5) Dynamic Impossibility: The optimization process
for increasing AI capability is actively hostile to safety, as the gradients
for the two objectives are generally anti-aligned. Together, these results
demonstrate that the pursuit of safe, highly capable AI is not a matter of
overcoming technical hurdles, but of confronting fundamental, interlocking
barriers. The paper concludes by presenting a strategic trilemma that these
impossibilities force upon the field. A formal verification of the core
theorems in Lean4 is currently in progress.
[COMMENTS]
31 Pages, 4 Figures. Substantial revision. Restructured around the
Enumeration Paradox and Five Pillars of Impossibility. Core mathematical
results unchanged but significantly expanded. Added new impossibility proofs
from statistical, information-theoretic, and dynamic perspectives
[LINK]
http://arxiv.org/abs/2506.10304v2
[DATE]
2025-06-25 07:41:11+08:00
[CATEGORIES]
cs.LG
Universal pre-training by iterated random computation
[AUTHORS]
Peter Bloem
[ABSTRACT]
We investigate the use of randomly generated data for the sake of
pre-training a model. We justify this approach theoretically from the
perspective of algorithmic complexity, building on recent research that shows
that sequence models can be trained to approximate Solomonoff induction. We
derive similar, but complementary theoretical results. We show empirically that
synthetically generated data can be used to pre-train a model before the data
is seen. We replicate earlier results that models trained this way show
zero-shot in-context learning across a variety of datasets, and that this
performance improves with scale. We extend earlier results to real-world data,
and show that finetuning a model after pre-training offers faster convergence
and better generalization.
[LINK]
http://arxiv.org/abs/2506.20057v1
[DATE]
2025-06-25 07:36:35+08:00
[CATEGORIES]
cs.LG
Machine-Learning-Assisted Photonic Device Development: A Multiscale Approach from Theory to Characterization
[AUTHORS]
Yuheng Chen, Alexander Montes McNeil, Taehyuk Park, Blake A. Wilson, Vaishnavi Iyer, Michael Bezick, Jae-Ik Choi, Rohan Ojha, Pravin Mahendran, Daksh Kumar Singh, Geetika Chitturi, Peigang Chen, Trang Do, Alexander V. Kildishev, Vladimir M. Shalaev, Michael Moebius, Wenshan Cai, Yongmin Liu, Alexandra Boltasseva
[ABSTRACT]
Photonic device development (PDD) has achieved remarkable success in
designing and implementing new devices for controlling light across various
wavelengths, scales, and applications, including telecommunications, imaging,
sensing, and quantum information processing. PDD is an iterative, five-step
process that consists of: i) deriving device behavior from design parameters,
ii) simulating device performance, iii) finding the optimal candidate designs
from simulations, iv) fabricating the optimal device, and v) measuring device
performance. Classically, all these steps involve Bayesian optimization,
material science, control theory, and direct physics-driven numerical methods.
However, many of these techniques are computationally intractable, monetarily
costly, or difficult to implement at scale. In addition, PDD suffers from large
optimization landscapes, uncertainties in structural or optical
characterization, and difficulties in implementing robust fabrication
processes. However, the advent of machine learning over the past decade has
provided novel, data-driven strategies for tackling these challenges, including
surrogate estimators for speeding up computations, generative modeling for
noisy measurement modeling and data augmentation, reinforcement learning for
fabrication, and active learning for experimental physical discovery. In this
review, we present a comprehensive perspective on these methods to enable
machine-learning-assisted PDD (ML-PDD) for efficient design optimization with
powerful generative models, fast simulation and characterization modeling under
noisy measurements, and reinforcement learning for fabrication. This review
will provide researchers from diverse backgrounds with valuable insights into
this emerging topic, fostering interdisciplinary efforts to accelerate the
development of complex photonic devices and systems.
[LINK]
http://arxiv.org/abs/2506.20056v1
[DATE]
2025-06-25 07:32:54+08:00
[CATEGORIES]
cs.LG
MegaFold: System-Level Optimizations for Accelerating Protein Structure Prediction Models
[AUTHORS]
Hoa La, Ahan Gupta, Alex Morehead, Jianlin Cheng, Minjia Zhang
[ABSTRACT]
Protein structure prediction models such as AlphaFold3 (AF3) push the
frontier of biomolecular modeling by incorporating science-informed
architectural changes to the transformer architecture. However, these advances
come at a steep system cost, introducing: compute- and memory-intensive
operators, 2D attention mechanisms, and retrieval-augmented data pipelines,
which collectively hinder the scalability of AF3 training. In this work, we
present MegaFold, a cross-platform system to accelerate AF3 training. MegaFold
tackles key bottlenecks through ahead-of-time caching to eliminate GPU idle
time from the retrieval-augmented data pipeline, Triton-based kernels for
memory-efficient EvoAttention on heterogeneous devices, and deep fusion for
common and critical small operators in AF3. Evaluation on both NVIDIA H200 and
AMD MI250 GPUs shows that MegaFold reduces peak memory usage of AF3 training by
up to 1.23$\times$ and improves per-iteration training time by up to
1.73$\times$ and 1.62$\times$ respectively. More importantly, MegaFold enables
training on 1.35$\times$ longer sequence lengths compared to PyTorch baselines
without running out of memory, significantly improving the scalability of
modern protein folding models. We open source our code at
https://github.com/Supercomputing-System-AI-Lab/MegaFold/.
[COMMENTS]
13 pages, 12 figures
[LINK]
http://arxiv.org/abs/2506.20686v1
[DATE]
2025-06-25 07:30:49+08:00
[CATEGORIES]
cs.LG
A Principled Path to Fitted Distributional Evaluation
[AUTHORS]
Sungee Hong, Jiayi Wang, Zhengling Qi, Raymond Ka Wai Wong
[ABSTRACT]
In reinforcement learning, distributional off-policy evaluation (OPE) focuses
on estimating the return distribution of a target policy using offline data
collected under a different policy. This work focuses on extending the widely
used fitted-Q evaluation – developed for expectation-based reinforcement
learning – to the distributional OPE setting. We refer to this extension as
fitted distributional evaluation (FDE). While only a few related approaches
exist, there remains no unified framework for designing FDE methods. To fill
this gap, we present a set of guiding principles for constructing theoretically
grounded FDE methods. Building on these principles, we develop several new FDE
methods with convergence analysis and provide theoretical justification for
existing methods, even in non-tabular environments. Extensive experiments,
including simulations on linear quadratic regulators and Atari games,
demonstrate the superior performance of the FDE methods.
[LINK]
http://arxiv.org/abs/2506.20048v1
[DATE]
2025-06-25 07:08:56+08:00
[CATEGORIES]
cs.LG
GNN’s Uncertainty Quantification using Self-Distillation
[AUTHORS]
Hirad Daneshvar, Reza Samavi
[ABSTRACT]
Graph Neural Networks (GNNs) have shown remarkable performance in the
healthcare domain. However, what remains challenging is quantifying the
predictive uncertainty of GNNs, which is an important aspect of trustworthiness
in clinical settings. While Bayesian and ensemble methods can be used to
quantify uncertainty, they are computationally expensive. Additionally, the
disagreement metric used by ensemble methods to compute uncertainty cannot
capture the diversity of models in an ensemble network. In this paper, we
propose a novel method, based on knowledge distillation, to quantify GNNs’
uncertainty more efficiently and with higher precision. We apply
self-distillation, where the same network serves as both the teacher and
student models, thereby avoiding the need to train several networks
independently. To ensure the impact of self-distillation, we develop an
uncertainty metric that captures the diverse nature of the network by assigning
different weights to each GNN classifier. We experimentally evaluate the
precision, performance, and ability of our approach in distinguishing
out-of-distribution data on two graph datasets: MIMIC-IV and Enzymes. The
evaluation results demonstrate that the proposed method can effectively capture
the predictive uncertainty of the model while having performance similar to
that of the MC Dropout and ensemble methods. The code is publicly available at
https://github.com/tailabTMU/UQ_GNN.
[COMMENTS]
The paper has been accepted in the International Conference on AI in
Healthcare (AIiH) 2025 and will appear in the conference proceedings
[LINK]
http://arxiv.org/abs/2506.20046v1
[DATE]
2025-06-25 07:08:31+08:00
[CATEGORIES]
cs.LG
PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning
[AUTHORS]
Ahmet Sarigun, Bora Uyar, Vedran Franke, Altuna Akalin
[ABSTRACT]
Sampling physically valid ligand-binding poses remains a major challenge in
molecular docking, particularly for unseen or structurally diverse targets. We
introduce PocketVina, a fast and memory-efficient, search-based docking
framework that combines pocket prediction with systematic multi-pocket
exploration. We evaluate PocketVina across four established benchmarks:
PDBbind2020 (timesplit and unseen), DockGen, Astex, and PoseBusters. We
observe consistently strong performance in sampling physically
valid docking poses. PocketVina achieves state-of-the-art performance when
jointly considering ligand RMSD and physical validity (PB-valid), while
remaining competitive with deep learning-based approaches in terms of RMSD
alone, particularly on structurally diverse and previously unseen targets.
PocketVina also maintains state-of-the-art physically valid docking accuracy
across ligands with varying degrees of flexibility. We further introduce
TargetDock-AI, a benchmarking dataset we curated, consisting of over 500000
protein-ligand pairs, and a partition of the dataset labeled with PubChem
activity annotations. On this large-scale dataset, PocketVina successfully
discriminates active from inactive targets, outperforming a deep learning
baseline while requiring significantly less GPU memory and runtime. PocketVina
offers a robust and scalable docking strategy that requires no task-specific
training and runs efficiently on standard GPUs, making it well-suited for
high-throughput virtual screening and structure-based drug discovery.
[LINK]
http://arxiv.org/abs/2506.20043v1
[DATE]
2025-06-25 06:50:30+08:00
[CATEGORIES]
cs.LG
LSH-DynED: A Dynamic Ensemble Framework with LSH-Based Undersampling for Evolving Multi-Class Imbalanced Classification
[AUTHORS]
Soheil Abadifard, Fazli Can
[ABSTRACT]
The classification of imbalanced data streams, which have unequal class
distributions, is a key difficulty in machine learning, especially when dealing
with multiple classes. While binary imbalanced data stream classification tasks
have received considerable attention, only a few studies have focused on
multi-class imbalanced data streams. Effectively managing the dynamic imbalance
ratio is a key challenge in this domain. This study introduces a novel, robust,
and resilient approach to address these challenges by integrating Locality
Sensitive Hashing with Random Hyperplane Projections (LSH-RHP) into the Dynamic
Ensemble Diversification (DynED) framework. To the best of our knowledge, we
present the first application of LSH-RHP for undersampling in the context of
imbalanced non-stationary data streams. The proposed method undersamples the
majority classes by utilizing LSH-RHP, provides a balanced training set, and
improves the ensemble’s prediction performance. We conduct comprehensive
experiments on 23 real-world and ten semi-synthetic datasets and compare
LSH-DynED with 15 state-of-the-art methods. The results reveal that LSH-DynED
outperforms other approaches in terms of both Kappa and mG-Mean effectiveness
measures, demonstrating its capability in dealing with multi-class imbalanced
non-stationary data streams. Notably, LSH-DynED performs well in large-scale,
high-dimensional datasets with considerable class imbalances and demonstrates
adaptation and robustness in real-world circumstances. To motivate our design,
we review existing methods for imbalanced data streams, outline key challenges,
and offer guidance for future work. For the reproducibility of our results, we
have made our implementation available on GitHub.
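
Random-hyperplane LSH, the building block named above, can be sketched in a few lines: each point gets a bit signature from the signs of its projections onto random Gaussian directions, and points sharing a signature fall into the same bucket. The per-bucket cap used for undersampling here is an assumption for illustration; the abstract does not specify how LSH-DynED selects majority-class samples.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian normal vectors in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(x, hyperplanes):
    """Bit signature: one bit per hyperplane, set when the projection is >= 0."""
    return tuple(
        1 if sum(w * v for w, v in zip(h, x)) >= 0 else 0 for h in hyperplanes
    )


def undersample(points, hyperplanes, per_bucket=1):
    """Keep at most per_bucket points from each LSH bucket (assumed rule)."""
    buckets = defaultdict(list)
    for x in points:
        buckets[signature(x, hyperplanes)].append(x)
    kept = []
    for members in buckets.values():
        kept.extend(members[:per_bucket])
    return kept


planes = make_hyperplanes(dim=3, n_bits=4, seed=42)
rng = random.Random(7)
majority = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
reduced = undersample(majority, planes, per_bucket=2)
```

Because the signature depends only on direction, nearby points tend to share buckets, so capping each bucket thins dense regions while keeping coverage of the majority class.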
[LINK]
http://arxiv.org/abs/2506.20041v1
[DATE]
2025-06-25 06:46:47+08:00
[CATEGORIES]
cs.LG
Learning Bilateral Team Formation in Cooperative Multi-Agent Reinforcement Learning
[AUTHORS]
Koorosh Moslemi, Chi-Guhn Lee
[ABSTRACT]
Team formation and the dynamics of team-based learning have drawn significant
interest in the context of Multi-Agent Reinforcement Learning (MARL). However,
existing studies primarily focus on unilateral groupings, predefined teams, or
fixed-population settings, leaving the effects of algorithmic bilateral
grouping choices in dynamic populations underexplored. To address this gap, we
introduce a framework for learning two-sided team formation in dynamic
multi-agent systems. Through this study, we gain insight into what algorithmic
properties in bilateral team formation influence policy performance and
generalization. We validate our approach using widely adopted multi-agent
scenarios, demonstrating competitive performance and improved generalization in
most scenarios.
[COMMENTS]
Accepted to the 2nd Coordination and Cooperation in Multi-Agent
Reinforcement Learning (CoCoMARL) Workshop at RLC 2025
[LINK]
http://arxiv.org/abs/2506.20039v1
[DATE]
2025-06-25 06:40:05+08:00
[CATEGORIES]
cs.LG
Verifiable Unlearning on Edge
[AUTHORS]
Mohammad M Maheri, Alex Davidson, Hamed Haddadi
[ABSTRACT]
Machine learning providers commonly distribute global models to edge devices,
which subsequently personalize these models using local data. However, issues
such as copyright infringements, biases, or regulatory requirements may require
the verifiable removal of certain data samples across all edge devices.
Ensuring that edge devices correctly execute such unlearning operations is
critical to maintaining integrity.
In this work, we introduce a verification framework leveraging zero-knowledge
proofs, specifically zk-SNARKs, to confirm data unlearning on personalized
edge-device models without compromising privacy. We have developed algorithms
explicitly designed to facilitate unlearning operations that are compatible
with efficient zk-SNARK proof generation, ensuring minimal computational and
memory overhead suitable for constrained edge environments. Furthermore, our
approach carefully preserves personalized enhancements on edge devices,
maintaining model performance post-unlearning.
Our results affirm the practicality and effectiveness of this verification
framework, demonstrating verifiable unlearning with minimal degradation in
personalization-induced performance improvements. Our methodology ensures
verifiable, privacy-preserving, and effective machine unlearning across edge
devices.
[COMMENTS]
This paper has been accepted to the IEEE European Symposium on
Security and Privacy (EuroS&P) 2025
[LINK]
http://arxiv.org/abs/2506.20037v1
[DATE]
2025-06-25 06:24:47+08:00
[CATEGORIES]
cs.LG
Neural network-based Godunov corrections for approximate Riemann solvers using bi-fidelity learning
[AUTHORS]
Akshay Thakur, Matthew J. Zahr
[ABSTRACT]
The Riemann problem is fundamental in the computational modeling of
hyperbolic partial differential equations, enabling the development of stable
and accurate upwind schemes. While exact solvers provide robust upwinding
fluxes, their high computational cost necessitates approximate solvers.
Although approximate solvers achieve accuracy in many scenarios, they produce
inaccurate solutions in certain cases. To overcome this limitation, we propose
constructing neural network-based surrogate models, trained using supervised
learning, designed to map interior and exterior conservative state variables to
the corresponding exact flux. Specifically, we propose two distinct approaches:
one utilizing a vanilla neural network and the other employing a bi-fidelity
neural network. The performance of the proposed approaches is demonstrated
through applications to one-dimensional and two-dimensional partial
differential equations, showcasing their robustness and accuracy.
[COMMENTS]
22 pages, 17 figures
[LINK]
http://arxiv.org/abs/2503.13248v2
[DATE]
2025-06-25 06:02:35+08:00
[CATEGORIES]
cs.LG
Automated Generation of Diverse Courses of Actions for Multi-Agent Operations using Binary Optimization and Graph Learning
[AUTHORS]
Prithvi Poddar, Ehsan Tarkesh Esfahani, Karthik Dantu, Souma Chowdhury
[ABSTRACT]
Operations in disaster response, search & rescue, and military missions that
involve multiple agents demand automated processes to support the planning of
the courses of action (COA). Moreover, traverse-affecting changes in the
environment (rain, snow, blockades, etc.) may impact the expected performance
of a COA, making it desirable to have a pool of COAs that are diverse in task
distributions across agents. Further, variations in agent capabilities, which
could be human crews and/or autonomous systems, present practical opportunities
and computational challenges to the planning process. This paper presents a new
theoretical formulation and computational framework to generate such diverse
pools of COAs for operations with soft variations in agent-task compatibility.
Key to the problem formulation is a graph abstraction of the task space and the
pool of COAs itself to quantify its diversity. Formulating the COAs as a
centralized multi-robot task allocation problem, a genetic algorithm is used
for (order-ignoring) allocations of tasks to each agent that jointly maximize
diversity within the COA pool and overall compatibility of the agent-task
mappings. A graph neural network is trained using a policy gradient approach to
then perform single agent task sequencing in each COA, which maximizes
completion rates adaptive to task features. Our tests of the COA generation
process in a simulated environment demonstrate significant performance gain
over a random walk baseline, small optimality gap in task sequencing, and
execution time of about 50 minutes to plan up to 20 COAs for 5 agent/100 task
operations.
[LINK]
http://arxiv.org/abs/2506.20031v1
[DATE]
2025-06-25 05:58:30+08:00
[CATEGORIES]
cs.LG
Thumb on the Scale: Optimal Loss Weighting in Last Layer Retraining
[AUTHORS]
Nathan Stromberg, Christos Thrampoulidis, Lalitha Sankar
[ABSTRACT]
While machine learning models become more capable in discriminative tasks at
scale, their ability to overcome biases introduced by training data has come
under increasing scrutiny. Previous results suggest that there are two extremes
of parameterization with very different behaviors: the population
(underparameterized) setting where loss weighting is optimal and the separable
overparameterized setting where loss weighting is ineffective at ensuring equal
performance across classes. This work explores the regime of last layer
retraining (LLR) in which the unseen limited (retraining) data is frequently
inseparable and the model proportionately sized, falling between the two
aforementioned extremes. We show, in theory and practice, that loss weighting
is still effective in this regime, but that these weights must take into
account the relative overparameterization of the model.
[LINK]
http://arxiv.org/abs/2506.20025v1
[DATE]
2025-06-25 05:48:58+08:00
[CATEGORIES]
cs.LG
Elucidated Rolling Diffusion Models for Probabilistic Weather Forecasting
[AUTHORS]
Salva Rühling Cachay, Miika Aittala, Karsten Kreis, Noah Brenowitz, Arash Vahdat, Morteza Mardani, Rose Yu
[ABSTRACT]
Diffusion models are a powerful tool for probabilistic forecasting, yet most
applications in high-dimensional chaotic systems predict future snapshots
one-by-one. This common approach struggles to model complex temporal
dependencies and fails to explicitly account for the progressive growth of
uncertainty inherent to such systems. While rolling diffusion frameworks, which
apply increasing noise to forecasts at longer lead times, have been proposed to
address this, their integration with state-of-the-art, high-fidelity diffusion
techniques remains a significant challenge. We tackle this problem by
introducing Elucidated Rolling Diffusion Models (ERDM), the first framework to
successfully unify a rolling forecast structure with the principled, performant
design of Elucidated Diffusion Models (EDM). To do this, we adapt the core EDM
components (its noise schedule, network preconditioning, and Heun sampler) to the
rolling forecast setting. The success of this integration is driven by three
key contributions: (i) a novel loss weighting scheme that focuses model
capacity on the mid-range forecast horizons where determinism gives way to
stochasticity; (ii) an efficient initialization strategy using a pre-trained
EDM for the initial window; and (iii) a bespoke hybrid sequence architecture
for robust spatiotemporal feature extraction under progressive denoising. On 2D
Navier-Stokes simulations and ERA5 global weather forecasting at $1.5^\circ$
resolution, ERDM consistently outperforms key diffusion-based baselines,
including conditional autoregressive EDM. ERDM offers a flexible and powerful
general framework for tackling diffusion-based sequence generation problems
where modeling escalating uncertainty is paramount. Code is available at:
https://github.com/salvaRC/erdm
[LINK]
http://arxiv.org/abs/2506.20024v1
[DATE]
2025-06-25 05:44:31+08:00
[CATEGORIES]
cs.LG
DIM-SUM: Dynamic IMputation for Smart Utility Management
[AUTHORS]
Ryan Hildebrant, Rahul Bhope, Sharad Mehrotra, Christopher Tull, Nalini Venkatasubramanian
[ABSTRACT]
Time series imputation models have traditionally been developed using
complete datasets with artificial masking patterns to simulate missing values.
However, in real-world infrastructure monitoring, practitioners often encounter
datasets where large amounts of data are missing and follow complex,
heterogeneous patterns. We introduce DIM-SUM, a preprocessing framework for
training robust imputation models that bridges the gap between artificially
masked training data and real missing patterns. DIM-SUM combines pattern
clustering and adaptive masking strategies with theoretical learning guarantees
to handle diverse missing patterns actually observed in the data. Through
extensive experiments on over 2 billion readings from California water
districts, electricity datasets, and benchmarks, we demonstrate that DIM-SUM
outperforms traditional methods by reaching similar accuracy with lower
processing time and significantly less training data. When compared against a
large pre-trained model, DIM-SUM averages 2x higher accuracy with significantly
less inference time.
[LINK]
http://arxiv.org/abs/2506.20023v1
[DATE]
2025-06-25 05:38:06+08:00
[CATEGORIES]
cs.LG
New Insights on Unfolding and Fine-tuning Quantum Federated Learning
[AUTHORS]
Shanika Iroshi Nanayakkara, Shiva Raj Pokhrel
[ABSTRACT]
Client heterogeneity poses significant challenges to the performance of
Quantum Federated Learning (QFL). To overcome these limitations, we propose a
new approach leveraging deep unfolding, which enables clients to autonomously
optimize hyperparameters, such as learning rates and regularization factors,
based on their specific training behavior. This dynamic adaptation mitigates
overfitting and ensures robust optimization in highly heterogeneous
environments where standard aggregation methods often fail. Our framework
achieves approximately 90% accuracy, significantly outperforming traditional
methods, which typically yield around 55% accuracy, as demonstrated through
real-time training on IBM quantum hardware and Qiskit Aer simulators. By
developing self-adaptive fine-tuning, the proposed method proves particularly
effective in critical applications such as gene expression analysis and cancer
detection, enhancing diagnostic precision and predictive modeling within
quantum systems. We attribute our results to the convergence-aware, learnable
optimization steps intrinsic to the deep unfolded framework, which maintains
generalization. Hence, this study addresses the core limitations of
conventional QFL, extending its applicability to complex challenges such
as healthcare and genomic research.
[COMMENTS]
12 pages, 9 figures, 7 Tables, Submitted to IEEE/ACM journal 2025
[LINK]
http://arxiv.org/abs/2506.20016v1
[DATE]
2025-06-25 05:17:48+08:00
[CATEGORIES]
cs.LG
Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons
[AUTHORS]
Dengyu Wu, Jiechen Chen, H. Vincent Poor, Bipin Rajendran, Osvaldo Simeone
[ABSTRACT]
Neuromorphic computing offers an energy-efficient alternative to conventional
deep learning accelerators for real-time time-series processing. However, many
edge applications, such as wireless sensing and audio recognition, generate
streaming signals with rich spectral features that are not effectively captured
by conventional leaky integrate-and-fire (LIF) spiking neurons. This paper
investigates a wireless split computing architecture that employs
resonate-and-fire (RF) neurons with oscillatory dynamics to process time-domain
signals directly, eliminating the need for costly spectral pre-processing. By
resonating at tunable frequencies, RF neurons extract time-localized spectral
features while maintaining low spiking activity. This temporal sparsity
translates into significant savings in both computation and transmission
energy. Assuming an OFDM-based analog wireless interface for spike
transmission, we present a complete system design and evaluate its performance
on audio classification and modulation classification tasks. Experimental
results show that the proposed RF-SNN architecture achieves comparable accuracy
to conventional LIF-SNNs and ANNs, while substantially reducing spike rates and
total energy consumption during inference and communication.
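
The frequency selectivity described above can be illustrated with a minimal sketch of a single resonate-and-fire neuron. It is not the paper's RF-SNN: the complex-state dynamics, the parameter values (10 Hz resonance, damping, threshold, step size), the cosine test input, and the function name `rf_spike_count` are all assumptions chosen for illustration, and the neuron is not reset after a spike.

```python
import cmath
import math


def rf_spike_count(input_hz, resonant_hz=10.0, damping=1.0,
                   threshold=0.2, dt=1e-3, duration=2.0):
    """Count upward threshold crossings of one resonate-and-fire neuron.

    Assumed dynamics: dz/dt = (-damping + i*omega) * z + I(t), with a
    spike whenever Im(z) rises above `threshold`. The input I(t) is a
    unit-amplitude cosine at `input_hz`.
    """
    # Exact decay/rotation of the state over one time step.
    step = cmath.exp(complex(-damping, 2.0 * math.pi * resonant_hz) * dt)
    z = 0j
    above = False
    spikes = 0
    for n in range(int(round(duration / dt))):
        drive = math.cos(2.0 * math.pi * input_hz * n * dt)
        z = z * step + drive * dt
        now_above = z.imag > threshold
        if now_above and not above:
            spikes += 1  # upward crossing only; no reset in this sketch
        above = now_above
    return spikes


on_resonance = rf_spike_count(input_hz=10.0)   # input matches the neuron
off_resonance = rf_spike_count(input_hz=30.0)  # input far from resonance
```

With these assumed settings the neuron spikes only when the input frequency is close to its resonant frequency, which is the sense in which spiking stays sparse while still carrying spectral information.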
[LINK]
http://arxiv.org/abs/2506.20015v1
[DATE]
2025-06-25 05:14:59+08:00
[CATEGORIES]
cs.LG
DRO-Augment Framework: Robustness by Synergizing Wasserstein Distributionally Robust Optimization and Data Augmentation
[AUTHORS]
Jiaming Hu, Debarghya Mukherjee, Ioannis Ch. Paschalidis
[ABSTRACT]
In many real-world applications, ensuring the robustness and stability of
deep neural networks (DNNs) is crucial, particularly for image classification
tasks that encounter various input perturbations. While data augmentation
techniques have been widely adopted to enhance the resilience of a trained
model against such perturbations, there remains significant room for
improvement in robustness against corrupted data and adversarial attacks
simultaneously. To address this challenge, we introduce DRO-Augment, a novel
framework that integrates Wasserstein Distributionally Robust Optimization
(W-DRO) with various data augmentation strategies to improve the robustness of
the models significantly across a broad spectrum of corruptions. Our method
outperforms existing augmentation methods under severe data perturbations and
adversarial attack scenarios while maintaining the accuracy on the clean
datasets on a range of benchmark datasets, including but not limited to
CIFAR-10-C, CIFAR-100-C, MNIST, and Fashion-MNIST. On the theoretical side, we
establish novel generalization error bounds for neural networks trained using a
computationally efficient, variation-regularized loss function closely related
to the W-DRO problem.
[COMMENTS]
26 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.17874v2
[DATE]
2025-06-25 05:04:53+08:00
[CATEGORIES]
cs.LG
Scalable Machine Learning Algorithms using Path Signatures
[AUTHORS]
Csaba Tóth
[ABSTRACT]
The interface between stochastic analysis and machine learning is a rapidly
evolving field, with path signatures - iterated integrals that provide
faithful, hierarchical representations of paths - offering a principled and
universal feature map for sequential and structured data. Rooted in rough path
theory, path signatures are invariant to reparameterization and well-suited for
modelling evolving dynamics, long-range dependencies, and irregular sampling -
common challenges in real-world time series and graph data.
This thesis investigates how to harness the expressive power of path
signatures within scalable machine learning pipelines. It introduces a suite of
models that combine theoretical robustness with computational efficiency,
bridging rough path theory with probabilistic modelling, deep learning, and
kernel methods. Key contributions include: Gaussian processes with signature
kernel-based covariance functions for uncertainty-aware time series modelling;
the Seq2Tens framework, which employs low-rank tensor structure in the weight
space for scalable deep modelling of long-range dependencies; and graph-based
models where expected signatures over graphs induce hypo-elliptic diffusion
processes, offering expressive yet tractable alternatives to standard graph
neural networks. Further developments include Random Fourier Signature
Features, a scalable kernel approximation with theoretical guarantees, and
Recurrent Sparse Spectrum Signature Gaussian Processes, which combine Gaussian
processes, signature kernels, and random features with a principled forgetting
mechanism for multi-horizon time series forecasting with adaptive context
length.
We hope this thesis serves as both a methodological toolkit and a conceptual
bridge, and provides a useful reference for the current state of the art in
scalable, signature-based learning for sequential and structured data.
[COMMENTS]
PhD thesis
[LINK]
http://arxiv.org/abs/2506.17634v2
[DATE]
2025-06-25 04:58:09+08:00
[CATEGORIES]
cs.LG
Can One Safety Loop Guard Them All? Agentic Guard Rails for Federated Computing
[AUTHORS]
Narasimha Raghavan Veeraragavan, Jan Franz Nygård
[ABSTRACT]
We propose Guardian-FC, a novel two-layer framework for privacy preserving
federated computing that unifies safety enforcement across diverse privacy
preserving mechanisms, including cryptographic back-ends like fully homomorphic
encryption (FHE) and multiparty computation (MPC), as well as statistical
techniques such as differential privacy (DP). Guardian-FC decouples guard-rails
from privacy mechanisms by executing plug-ins (modular computation units),
written in a backend-neutral, domain-specific language (DSL) designed
specifically for federated computing workflows and interchangeable Execution
Providers (EPs), which implement DSL operations for various privacy back-ends.
An Agentic-AI control plane enforces a finite-state safety loop through signed
telemetry and commands, ensuring consistent risk management and auditability.
The manifest-centric design supports fail-fast job admission and seamless
extensibility to new privacy back-ends. We present qualitative scenarios
illustrating backend-agnostic safety and a formal model foundation for
verification. Finally, we outline a research agenda inviting the community to
advance adaptive guard-rail tuning, multi-backend composition, DSL
specification development, implementation, and compiler extensibility alongside
human-override usability.
[COMMENTS]
Accepted at ICML 2025 Workshop on Collaborative and Federated Agentic
Workflows (CFAgentic@ICML’25)
[LINK]
http://arxiv.org/abs/2506.20000v1
[DATE]
2025-06-25 04:39:49+08:00
[CATEGORIES]
cs.LG
In-Context Learning for Gradient-Free Receiver Adaptation: Principles, Applications, and Theory
[AUTHORS]
Matteo Zecchin, Tomer Raviv, Dileep Kalathil, Krishna Narayanan, Nir Shlezinger, Osvaldo Simeone
[ABSTRACT]
In recent years, deep learning has facilitated the creation of wireless
receivers capable of functioning effectively in conditions that challenge
traditional model-based designs. Leveraging programmable hardware
architectures, deep learning-based receivers offer the potential to dynamically
adapt to varying channel environments. However, current adaptation strategies,
including joint training, hypernetwork-based methods, and meta-learning, either
demonstrate limited flexibility or necessitate explicit optimization through
gradient descent. This paper presents gradient-free adaptation techniques
rooted in the emerging paradigm of in-context learning (ICL). We review
architectural frameworks for ICL based on Transformer models and structured
state-space models (SSMs), alongside theoretical insights into how sequence
models effectively learn adaptation from contextual information. Further, we
explore the application of ICL to cell-free massive MIMO networks, providing
both theoretical analyses and empirical evidence. Our findings indicate that
ICL represents a principled and efficient approach to real-time receiver
adaptation using pilot signals and auxiliary contextual information, without
requiring online retraining.
[LINK]
http://arxiv.org/abs/2506.15176v2
[DATE]
2025-06-25 04:30:14+08:00
[CATEGORIES]
cs.LG
TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design
[AUTHORS]
Geonwoo Cho, Jaegyun Im, Jihwan Lee, Hojun Yi, Sejin Kim, Sundong Kim
[ABSTRACT]
Generalizing deep reinforcement learning agents to unseen environments
remains a significant challenge. One promising solution is Unsupervised
Environment Design (UED), a co-evolutionary framework in which a teacher
adaptively generates tasks with high learning potential, while a student learns
a robust policy from this evolving curriculum. Existing UED methods typically
measure learning potential via regret, the gap between optimal and current
performance, approximated solely by value-function loss. Building on these
approaches, we introduce the transition prediction error as an additional term
in our regret approximation. To capture how training on one task affects
performance on others, we further propose a lightweight metric called
co-learnability. By combining these two measures, we present Transition-aware
Regret Approximation with Co-learnability for Environment Design (TRACED).
Empirical evaluations show that TRACED yields curricula that improve zero-shot
generalization across multiple benchmarks while requiring up to 2x fewer
environment interactions than strong baselines. Ablation studies confirm that
the transition prediction error drives rapid complexity ramp-up and that
co-learnability delivers additional gains when paired with the transition
prediction error. These results demonstrate how refined regret approximation
and explicit modeling of task relationships can be leveraged for
sample-efficient curriculum design in UED.
[LINK]
http://arxiv.org/abs/2506.19997v1
[DATE]
2025-06-25 04:29:24+08:00
[CATEGORIES]
cs.LG
CoVE: Compressed Vocabulary Expansion Makes Better LLM-based Recommender Systems
[AUTHORS]
Haochen Zhang, Tianyi Zhang, Junze Yin, Oren Gal, Anshumali Shrivastava, Vladimir Braverman
[ABSTRACT]
Recommender systems play a pivotal role in providing relevant content to
users. With the rapid development of large language models (LLMs), researchers
have begun utilizing LLMs to build more powerful recommender systems. However,
existing approaches that focus on aligning LLMs with recommendation tasks do
not fully leverage their sequential information processing capabilities,
leading to suboptimal performance.
In this paper, we propose a novel system called compressed vocabulary
expansion (CoVE). In CoVE, each item is assigned a unique ID within the
expanded vocabulary. Our framework effectively capitalizes on sequence
understanding abilities of LLMs, significantly enhancing their performance on
recommendation tasks. Additionally, we compress the embedding layer, making
CoVE practical for large-scale industrial applications. The effectiveness and
performance of CoVE are demonstrated through comprehensive experiments on
multiple recommendation datasets and comparisons with prior works. Our code can
be found at https://github.com/HaochenZhang717/CoVE-official-Repo.
[COMMENTS]
Accepted by ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2506.19993v1
[DATE]
2025-06-25 04:27:51+08:00
[CATEGORIES]
cs.LG
Follow-the-Perturbed-Leader Approaches Best-of-Both-Worlds for the m-Set Semi-Bandit Problems
[AUTHORS]
Jingxin Zhan, Yuchen Xin, Chenjie Sun, Zhihua Zhang
[ABSTRACT]
We consider a common case of the combinatorial semi-bandit problem, the
$m$-set semi-bandit, where the learner selects exactly $m$ arms out of $d$
total arms. In the adversarial setting, the best regret bound, known to be
$\mathcal{O}(\sqrt{nmd})$ for time horizon $n$, is achieved by the well-known
Follow-the-Regularized-Leader (FTRL) policy. However, FTRL requires explicitly
computing the arm-selection probabilities by solving an optimization problem at
each time step and then sampling according to them. This can be avoided by the
Follow-the-Perturbed-Leader (FTPL) policy, which simply pulls the $m$ arms with
the $m$ smallest (estimated) losses after random perturbation. In this
paper, we show that FTPL with a Fréchet perturbation also enjoys the
near-optimal regret bound $\mathcal{O}(\sqrt{nm}(\sqrt{d\log(d)}+m^{5/6}))$ in the
adversarial setting and approaches best-of-both-worlds regret bounds, i.e., it
achieves logarithmic regret in the stochastic setting. Moreover, our lower
bounds show that the extra factors are unavoidable with our approach; any
improvement would require a fundamentally different and more challenging
method.
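
The FTPL selection step described above can be sketched as follows. This is only an illustration of "perturb, then pull the $m$ smallest": the Fréchet shape `alpha=2.0`, the learning rate `eta`, and the function names are assumptions, and the loss-estimation step that a full semi-bandit algorithm needs is omitted.

```python
import math
import random


def sample_frechet(alpha, rng):
    """Draw from a Frechet distribution with CDF exp(-x**(-alpha)), x > 0."""
    u = rng.random()
    while u <= 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / alpha)


def ftpl_select(est_cum_loss, m, eta, alpha=2.0, rng=None):
    """Return the indices of the m arms with the smallest perturbed loss.

    Each arm's estimated cumulative loss is lowered by an independent
    Frechet perturbation scaled by 1/eta (assumed parameterization).
    """
    rng = rng or random.Random()
    perturbed = [loss - sample_frechet(alpha, rng) / eta
                 for loss in est_cum_loss]
    order = sorted(range(len(perturbed)), key=perturbed.__getitem__)
    return sorted(order[:m])


# Two arms have far smaller estimated loss, so they are selected.
chosen = ftpl_select([0.0, 1e12, 1.0, 1e12], m=2, eta=1.0,
                     rng=random.Random(0))
```

No selection probabilities are computed here, which is the practical advantage over FTRL that the abstract points to; the cost is one random draw per arm and a sort.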
[LINK]
http://arxiv.org/abs/2504.07307v3
[DATE]
2025-06-25 04:04:37+08:00
[CATEGORIES]
cs.LG
jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
[AUTHORS]
Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao
[ABSTRACT]
We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding
model that unifies text and image representations through a novel architecture
supporting both single-vector and multi-vector embeddings in the late
interaction style. The model incorporates task-specific Low-Rank Adaptation
(LoRA) adapters to optimize performance across diverse retrieval scenarios,
including query-document retrieval, semantic text similarity, and code search.
Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves
state-of-the-art performance on both single-modal and cross-modal retrieval
tasks, with particular strength in processing visually rich content such as
tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of
this capability, we also introduce Jina-VDR, a novel benchmark specifically
designed for visually rich image retrieval.
[COMMENTS]
22 pages, 1-10 main, 14-22 experimental results, benchmark tables
[LINK]
http://arxiv.org/abs/2506.18902v2
[DATE]
2025-06-24 23:52:37+08:00
[CATEGORIES]
cs.CL
Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving
[AUTHORS]
Sara Rajaee, Kumar Pratik, Gabriele Cesa, Arash Behboodi
[ABSTRACT]
The most promising recent methods for AI reasoning require applying variants
of reinforcement learning (RL) either on rolled-out trajectories from the LLMs,
even for step-wise rewards, or on large quantities of human-annotated
trajectory data. The reliance on the rolled-out trajectory renders the compute
cost and time prohibitively high. In particular, the correctness of a reasoning
trajectory can typically only be judged at its completion, leading to sparse
rewards in RL or requiring expensive synthetic data generation in expert
iteration-like methods. In this work, we focus on the Automatic Theorem Proving
(ATP) task and propose a novel verifier-in-the-loop design, which, unlike
existing approaches that leverage feedback on the entire reasoning trajectory,
employs an automated verifier to give intermediate feedback at each step of the
reasoning process. Using Lean as the verifier, we empirically show that the
step-by-step local verification produces a global improvement in the model’s
reasoning accuracy and efficiency.
[COMMENTS]
Accepted at the Findings of ACL 2025, Accepted at ICLR 2025 Workshop
on Reasoning and Planning for Large Language Models
[LINK]
http://arxiv.org/abs/2503.09730v2
[DATE]
2025-06-24 23:42:55+08:00
[CATEGORIES]
cs.CL
cs.LG
Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
[AUTHORS]
Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang
[ABSTRACT]
Extreme activation outliers in Large Language Models (LLMs) critically
degrade quantization performance, hindering efficient on-device deployment.
While channel-wise operations and adaptive gradient scaling are recognized
causes, practical mitigation remains challenging. We introduce Outlier-Safe
Pre-Training (OSP), a practical guideline that proactively prevents outlier
formation rather than relying on post-hoc mitigation. OSP combines three key
innovations: (1) the Muon optimizer, eliminating privileged bases while
maintaining training efficiency; (2) Single-Scale RMSNorm, preventing
channel-wise amplification; and (3) a learnable embedding projection,
redistributing activation magnitudes originating from embedding matrices. We
validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is
the first production-scale LLM trained without such outliers. Under aggressive
4-bit quantization, our OSP model achieves a 35.7 average score across 10
benchmarks (compared to 26.5 for an Adam-trained model), with only a 2%
training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis
(0.04) compared to extreme values (1818.56) in standard models, fundamentally
altering LLM quantization behavior. Our work demonstrates that outliers are not
inherent to LLMs but are consequences of training strategies, paving the way
for more efficient LLM deployment. The source code and pretrained checkpoints
are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
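
For readers unfamiliar with the statistic quoted above, excess kurtosis can be computed as in the sketch below. It uses the population (biased) moment estimator, which is an assumption; the abstract does not say which estimator or which activations the authors used, and the sample data here is synthetic.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 * m2) - 3.0


# Synthetic examples: a flat two-point sample vs. one extreme outlier.
flat = excess_kurtosis([-1.0, 1.0] * 500)
spiky = excess_kurtosis([0.0] * 999 + [1000.0])
```

A single extreme value among otherwise small ones is enough to push the statistic into the hundreds, which is why it serves as a compact indicator of activation outliers.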
[LINK]
http://arxiv.org/abs/2506.19697v1
[DATE]
2025-06-24 23:03:57+08:00
[CATEGORIES]
cs.LG
cs.CL
Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation
[AUTHORS]
Yuanhe Tian, Lei Mao, Yan Song
[ABSTRACT]
Generating reports for computed tomography (CT) images is a challenging task
that, while similar to existing work on medical image report generation, has
its own characteristics, such as the spatial encoding of multiple images and
the alignment between image volumes and text. Existing solutions typically use
general 2D or 3D image processing techniques to extract features from a CT
volume: they first compress the volume and then divide the compressed
CT slices into patches for visual encoding. These approaches do not explicitly
account for the transformations among CT slices, nor do they effectively
integrate multi-level image features, particularly those containing specific
organ lesions, to instruct CT report generation (CTRG). Considering the
strong correlation among consecutive slices in CT scans, in this paper, we
propose a large language model (LLM) based CTRG method with recurrent visual
feature extraction and stereo attentions for hierarchical feature modeling.
Specifically, we use a vision Transformer to recurrently process each slice in
a CT volume, and employ a set of attentions over the encoded slices from
different perspectives to selectively obtain important visual information and
align them with textual features, so as to better instruct an LLM for CTRG.
Experimental results and further analysis on the benchmark M3D-Cap dataset show
that our method outperforms strong baseline models and achieves
state-of-the-art results, demonstrating its validity and effectiveness.
[COMMENTS]
7 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.19665v1
[DATE]
2025-06-24 22:29:06+08:00
[CATEGORIES]
cs.CL
Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager
[AUTHORS]
Lucie Galland, Catherine Pelachaud, Florian Pecune
[ABSTRACT]
In this work, we propose a novel framework that integrates large language
models (LLMs) with an RL-based dialogue manager for open-ended dialogue with a
specific goal. By leveraging hierarchical reinforcement learning to model the
structured phases of dialogue and employing meta-learning to adapt
across diverse user profiles, our approach enhances adaptability and
efficiency, enabling the system to learn from limited data, transition fluidly
between dialogue phases, and personalize responses to heterogeneous patient
needs. We apply our framework to Motivational Interviews, aiming to foster
behavior change, and demonstrate that the proposed dialogue manager outperforms
a state-of-the-art LLM baseline in terms of reward, showing a potential benefit
of conditioning LLMs to create open-ended dialogue systems with specific goals.
[LINK]
http://arxiv.org/abs/2506.19652v1
[DATE]
2025-06-24 22:15:26+08:00
[CATEGORIES]
cs.CL
Language Model Re-rankers are Fooled by Lexical Similarities
[AUTHORS]
Lovisa Hagström, Ercong Nie, Ruben Halifa, Helmut Schmid, Richard Johansson, Alexander Junge
[ABSTRACT]
Language model (LM) re-rankers are used to refine retrieval results for
retrieval-augmented generation (RAG). They are more expensive than lexical
matching methods like BM25 but are assumed to better process semantic information
and the relations between the query and the retrieved answers. To understand
whether LM re-rankers always live up to this assumption, we evaluate 6
different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show
that LM re-rankers struggle to outperform a simple BM25 baseline on DRUID.
Leveraging a novel separation metric based on BM25 scores, we explain and
identify re-ranker errors stemming from lexical dissimilarities. We also
investigate different methods to improve LM re-ranker performance and find
these methods mainly useful for NQ. Taken together, our work identifies and
explains weaknesses of LM re-rankers and points to the need for more
adversarial and realistic datasets for their evaluation.
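
As background for the BM25 baseline mentioned above, a minimal sketch of BM25 scoring follows. It uses a common formulation with assumed defaults `k1=1.5` and `b=0.75` and pre-tokenized toy documents; it is not the paper's separation metric, whose definition the abstract does not give.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            f = tf[term]
            if f == 0:
                continue  # purely lexical: unmatched terms add nothing
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            length_norm = 1.0 - b + b * len(d) / avg_len
            score += idf * f * (k1 + 1.0) / (f + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical data).
docs = [["cat", "sat"], ["dog", "barked"], ["cat", "cat", "purred"]]
scores = bm25_scores(["cat"], docs)
```

Because the score depends only on shared tokens, a document with no lexical overlap scores zero regardless of its meaning, which is the kind of lexical signal the re-rankers are compared against.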
[COMMENTS]
Accepted to FEVER 2025
[LINK]
http://arxiv.org/abs/2502.17036v2
[DATE]
2025-06-24 22:03:01+08:00
[CATEGORIES]
cs.CL
Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge
[AUTHORS]
Juraj Vladika, Ihsan Soydemir, Florian Matthes
[ABSTRACT]
While large language models (LLMs) have shown remarkable capabilities to
generate coherent text, they suffer from the issue of hallucinations –
factually inaccurate statements. Among the numerous approaches to tackling
hallucinations, self-correcting methods are especially promising. They
leverage the multi-turn nature of LLMs to iteratively generate verification
questions that seek additional evidence, answer them with internal or external
knowledge, and use that to refine the original response with the new
corrections. These methods have been explored for encyclopedic generation, but
less so for domains like news summarization. In this work, we investigate two
state-of-the-art self-correcting systems by applying them to correct
hallucinated summaries using evidence from three search engines. We analyze the
results and provide insights into systems’ performance, revealing interesting
practical findings on the benefits of search engine snippets and few-shot
prompts, as well as high alignment of G-Eval and human evaluation.
[COMMENTS]
Accepted to FEVER @ ACL 2025
[LINK]
http://arxiv.org/abs/2506.19607v1
[DATE]
2025-06-24 21:20:31+08:00
[CATEGORIES]
cs.CL
PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics
[AUTHORS]
Qixiang Fang, Daniel L. Oberski, Dong Nguyen
[ABSTRACT]
Many existing benchmarks of large (multimodal) language models (LLMs) focus
on measuring LLMs’ academic proficiency, often also with an interest in
comparing model performance with that of human test takers. While such benchmarks have
proven key to the development of LLMs, they suffer from several limitations,
including questionable measurement quality (e.g., Do they measure what they are
supposed to in a reliable way?), lack of quality assessment on the item level
(e.g., Are some items more important or difficult than others?) and unclear
human population reference (e.g., To whom can the model be compared?). In
response to these challenges, we propose leveraging knowledge from
psychometrics – a field dedicated to the measurement of latent variables like
academic proficiency – into LLM benchmarking. We make four primary
contributions. First, we reflect on current LLM benchmark developments and
contrast them with psychometrics-based test development. Second, we introduce
PATCH: a novel framework for Psychometrics-AssisTed benCHmarking of
LLMs. PATCH addresses the aforementioned limitations. In particular, PATCH
enables valid comparison between LLMs and human populations. Third, we
demonstrate PATCH by measuring several LLMs’ proficiency in 8th grade
mathematics against 56 human populations. We show that adopting a
psychometrics-based approach yields evaluation outcomes that diverge from those
based on current benchmarking practices. Fourth, we release 4 high-quality
datasets to support measuring and comparing LLM proficiency in grade school
mathematics and science with human populations.
[COMMENTS]
Accepted to GEM2 Workshop: Generation, Evaluation & Metrics - ACL
2025
[LINK]
http://arxiv.org/abs/2404.01799v3
[DATE]
2025-06-24 21:11:54+08:00
[CATEGORIES]
cs.CL
Large Language Models as Span Annotators
[AUTHORS]
Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondřej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondřej Dušek, Simone Balloccu
[ABSTRACT]
Span annotation is the task of localizing and classifying text spans
according to custom guidelines. Annotated spans can be used to analyze and
evaluate high-quality texts for which single-score metrics fail to provide
actionable feedback. Until recently, span annotation was limited to human
annotators or fine-tuned models. In this study, we show that large language
models (LLMs) can serve as flexible and cost-effective span annotation
backbones. To demonstrate their utility, we compare LLMs to skilled human
annotators on three diverse span annotation tasks: evaluating data-to-text
generation, identifying translation errors, and detecting propaganda
techniques. We demonstrate that LLMs achieve inter-annotator agreement (IAA)
comparable to human annotators at a fraction of the cost per output annotation.
We also manually analyze model outputs, finding that LLMs make errors at a
similar rate to human annotators. We release the dataset of more than 40k model
and human annotations for further research.
[LINK]
http://arxiv.org/abs/2504.08697v2
[DATE]
2025-06-24 21:11:18+08:00
[CATEGORIES]
cs.CL
ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model
[AUTHORS]
Zhenke Duan, Jiqun Pan, Jiani Tu, Xiaoyi Wang, Yanqing Wang
[ABSTRACT]
In the era of large-scale artificial intelligence, Large Language Models
(LLMs) have made significant strides in natural language processing. However,
they often lack transparency and generate unreliable outputs, raising concerns
about their interpretability. To address this, the Chain of Thought (CoT)
prompting method structures reasoning into step-by-step deductions. Yet, not
all reasoning chains are valid, and errors can lead to unreliable conclusions.
We propose ECCoT, an End-to-End Cognitive Chain of Thought Validation
Framework, to evaluate and refine reasoning chains in LLMs. ECCoT integrates
the Markov Random Field-Embedded Topic Model (MRF-ETM) for topic-aware CoT
generation and Causal Sentence-BERT (CSBert) for causal reasoning alignment. By
filtering ineffective chains using structured ordering statistics, ECCoT
improves interpretability, reduces biases, and enhances the trustworthiness of
LLM-based decision-making. Key contributions include the introduction of ECCoT,
MRF-ETM for topic-driven CoT generation, and CSBert for causal reasoning
enhancement. Code is released at: https://github.com/erwinmsmith/ECCoT.git.
[LINK]
http://arxiv.org/abs/2506.19599v1
[DATE]
2025-06-24 21:09:53+08:00
[CATEGORIES]
cs.CL
ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation
[AUTHORS]
Siao Tang, Xinyin Ma, Gongfan Fang, Xinchao Wang
[ABSTRACT]
Recent advancements in large reasoning models (LRMs) like DeepSeek-R1 and the
OpenAI o1 series have achieved notable performance enhancements on complex
reasoning tasks by scaling up the generation length through Chain-of-Thought (CoT).
However, an emerging issue is their inclination to produce excessively verbose
reasoning processes, which makes them inefficient. Existing literature
on improving efficiency mainly adheres to before-reasoning paradigms such
as prompting and reasoning or fine-tuning and reasoning, but ignores the
promising direction of directly encouraging the model to speak concisely by
intervening during the generation of reasoning. To fill this gap, we
propose a framework dubbed ConciseHint, which continuously encourages the
reasoning model to speak concisely by injecting the textual hint (manually
designed or trained on the concise data) during the token generation of the
reasoning process. Besides, ConciseHint is adaptive to the complexity of the
query by adaptively adjusting the hint intensity, which ensures it will not
undermine model performance. Experiments on the state-of-the-art LRMs,
including DeepSeek-R1 and Qwen-3 series, demonstrate that our method can
effectively produce concise reasoning processes while maintaining performance
well. For instance, we achieve a reduction ratio of 65% for the reasoning
length on the GSM8K benchmark with Qwen-3 4B with nearly no accuracy loss.
[COMMENTS]
Codes are available at https://github.com/tsa18/ConciseHint
[LINK]
http://arxiv.org/abs/2506.18810v2
[DATE]
2025-06-24 21:08:33+08:00
[CATEGORIES]
cs.CL
KAG-Thinker: Interactive Thinking and Deep Reasoning in LLMs via Knowledge-Augmented Generation
[AUTHORS]
Dalong Zhang, Jun Xu, Jun Zhou, Lei Liang, Lin Yuan, Ling Zhong, Mengshu Sun, Peilong Zhao, QiWei Wang, Xiaorui Wang, Xinkai Du, YangYang Hou, Yu Ao, ZhaoYang Wang, Zhengke Gui, ZhiYing Yi, Zhongpu Bo
[ABSTRACT]
In this paper, we introduce KAG-Thinker, which upgrades KAG to a multi-turn
interactive thinking and deep reasoning framework powered by a dedicated
parameter-light large language model (LLM). Our approach constructs a
structured thinking process for solving complex problems, enhancing the
logical coherence and contextual consistency of the reasoning process in
question-answering (Q&A) tasks on domain-specific knowledge bases (KBs) within
LLMs. Following the Logical Form guided retrieval and reasoning
technology route of KAG, this framework first decomposes complex questions into
independently solvable sub-problems (which are also referred to as logical
forms) through breadth decomposition. Each such logical form is
represented in two equivalent forms, natural language and logical function, and
subsequently classified as either a Knowledge Retrieval or Reasoning Analysis
task. Dependencies and parameter passing between these tasks are explicitly
modeled via logical function interfaces. In the solving process, the Retrieval
function performs retrieval tasks, retrieving one-hop structured and
unstructured information about a specified knowledge unit, while the Math and
Deduce functions perform reasoning analysis tasks. Secondly, it is worth
noting that, in the Knowledge Retrieval sub-problem tasks, LLMs and external
knowledge sources are regarded as equivalent KBs. We use the knowledge
boundary module to determine the optimal source using self-regulatory
mechanisms such as confidence calibration and reflective reasoning, and use the
depth solving module to enhance the comprehensiveness of knowledge
acquisition…
[LINK]
http://arxiv.org/abs/2506.17728v2
[DATE]
2025-06-24 20:50:57+08:00
[CATEGORIES]
cs.CL
Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress
[AUTHORS]
Lorenzo Proietti, Stefano Perrella, Roberto Navigli
[ABSTRACT]
In Machine Translation (MT) evaluation, metric performance is assessed based
on agreement with human judgments. In recent years, automatic metrics have
demonstrated increasingly high levels of agreement with humans. To gain a
clearer understanding of metric performance and establish an upper bound, we
incorporate human baselines in the MT meta-evaluation, that is, the assessment
of MT metrics’ capabilities. Our results show that human annotators are not
consistently superior to automatic metrics, with state-of-the-art metrics often
ranking on par with or higher than human baselines. Despite these findings
suggesting human parity, we discuss several reasons for caution. Finally, we
explore the broader implications of our results for the research field, asking:
Can we still reliably measure improvements in MT evaluation? With this work, we
aim to shed light on the limits of our ability to measure progress in the
field, fostering discussion on an issue that we believe is crucial to the
entire MT evaluation community.
[COMMENTS]
Accepted at ACL 2025 Main Conference. 24 pages
[LINK]
http://arxiv.org/abs/2506.19571v1
[DATE]
2025-06-24 20:35:00+08:00
[CATEGORIES]
cs.CL
GeistBERT: Breathing Life into German NLP
[AUTHORS]
Raphael Scheible-Schmitt, Johann Frei
[ABSTRACT]
Advances in transformer-based language models have highlighted the benefits
of language-specific pre-training on high-quality corpora. In this context,
German NLP stands to gain from updated architectures and modern datasets
tailored to the linguistic characteristics of the German language. GeistBERT
seeks to improve German language processing by incrementally training on a
diverse corpus and optimizing model performance across various NLP tasks. It
was pre-trained using fairseq with standard hyperparameters, initialized from
GottBERT weights, and trained on a large-scale German corpus using Whole Word
Masking (WWM). Based on the pre-trained model, we derived extended-input
variants using Nyströmformer and Longformer architectures with support for
sequences up to 8k tokens. While these long-context models were not evaluated
on dedicated long-context benchmarks, they are included in our release. We
assessed all models on NER (CoNLL 2003, GermEval 2014) and text classification
(GermEval 2018 fine/coarse, 10kGNAD) using $F_1$ score and accuracy. The
GeistBERT models achieved strong performance, leading all tasks among the base
models and setting a new state-of-the-art (SOTA). Notably, the base models
outperformed larger models in several tasks. To support the German NLP research
community, we are releasing GeistBERT under the MIT license.
[LINK]
http://arxiv.org/abs/2506.11903v3
[DATE]
2025-06-24 20:31:06+08:00
[CATEGORIES]
cs.CL
ChatSR: Multimodal Large Language Models for Scientific Formula Discovery
[AUTHORS]
Yanjie Li, Lina Yu, Weijun Li, Min Wu, Jingyi Liu, Wenqiang Li, Shu Wei, Yusong Deng
[ABSTRACT]
Formulas are the language of communication between humans and nature. The
discovery of formulas to describe natural laws from observational data is the
purpose of scientific research. It is also an important research topic in
artificial intelligence, which is called a symbolic regression problem. Most of
the existing symbolic regression methods generate expressions directly from
observed data. Although some methods can inject prior knowledge into the model
by adding constraints or introducing special character hints, they can only
introduce a limited amount of prior knowledge specified in advance, and they
cannot understand natural language instructions. In this article, based on the
powerful knowledge reserve and
language understanding ability of multi-modal large language models, we present
ChatSR, which acts like a knowledgeable human scientist, and we can tell it any
prior knowledge through natural language to guide it in formula generation. In
tests on 13 datasets, ChatSR shows state-of-the-art performance on traditional
symbolic regression tasks. More notably, ChatSR can understand
the prior knowledge contained in natural language prompts and improve the
quality of generated expressions. In addition, it is exciting that ChatSR has a
good zero-shot capability to understand prior knowledge that is not present in
the training data.
[COMMENTS]
23 pages
[LINK]
http://arxiv.org/abs/2406.05410v2
[DATE]
2025-06-24 20:22:55+08:00
[CATEGORIES]
cs.CL
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
[AUTHORS]
Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen
[ABSTRACT]
Large Language Models (LLMs) have recently been extended to the video domain,
enabling sophisticated video-language understanding. However, existing Video
LLMs often exhibit limitations in fine-grained temporal reasoning, restricting
their ability to precisely attribute responses to specific video moments,
especially under constrained supervision. We introduce DaMO, a data-efficient
Video LLM explicitly designed for accurate temporal reasoning and multimodal
understanding. At its core, the proposed Temporal-aware Fuseformer employs a
hierarchical dual-stream architecture that progressively captures temporal
dynamics within each modality and effectively fuses complementary visual and
audio information. To further enhance computational efficiency, DaMO integrates
a global residual that reduces spatial redundancy while preserving essential
semantic details. We train DaMO via a structured four-stage progressive
training paradigm, incrementally equipping the model with multimodal alignment,
semantic grounding, and temporal reasoning capabilities. This work also
contributes multiple datasets augmented from existing ones with GPT-generated
temporally grounded QA pairs for tasks requiring temporal supervision.
Comprehensive experiments on temporal grounding and video QA benchmarks
demonstrate that DaMO consistently surpasses prior methods, particularly in
tasks demanding precise temporal alignment and reasoning. Our work establishes
a promising direction for data-efficient video-language modeling.
[COMMENTS]
I would like to request the withdrawal of this submission because the
current version contains significant errors and incomplete results. I intend
to revise the manuscript thoroughly before resubmitting. I apologize for the
oversight and appreciate your understanding
[LINK]
http://arxiv.org/abs/2506.11558v2
[DATE]
2025-06-24 19:59:30+08:00
[CATEGORIES]
cs.CL
RCStat: A Statistical Framework for using Relative Contextualization in Transformers
[AUTHORS]
Debabrata Mahapatra, Shubham Agarwal, Apoorv Saxena, Subrata Mitra
[ABSTRACT]
Prior work on input-token importance in auto-regressive transformers has
relied on Softmax-normalized attention weights, which obscure the richer
structure of pre-Softmax query-key logits. We introduce RCStat, a statistical
framework that harnesses raw attention logits via Relative Contextualization
(RC), a random variable measuring contextual alignment between token segments,
and derive an efficient upper bound for RC. We demonstrate two applications:
(i) Key-Value compression, where RC-based thresholds drive adaptive key-value
eviction for substantial cache reduction with minimal quality loss; and (ii)
Attribution, where RC yields higher-fidelity token-, sentence-, and chunk-level
explanations than post-Softmax methods. Across question answering,
summarization, and attribution benchmarks, RCStat achieves significant
empirical gains, delivering state-of-the-art compression and attribution
performance without any model retraining.
[LINK]
http://arxiv.org/abs/2506.19549v1
[DATE]
2025-06-24 19:55:43+08:00
[CATEGORIES]
cs.CL
cs.LG
Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection
[AUTHORS]
Devesh Pant, Rishi Raj Grandhe, Vipin Samaria, Mukul Paul, Sudhir Kumar, Saransh Khanna, Jatin Agrawal, Jushaan Singh Kalra, Akhil VSSG, Satish V Khalikar, Vipin Garg, Himanshu Chauhan, Pranay Verma, Neha Khandelwal, Soma S Dhavala, Minesh Mathew
[ABSTRACT]
Early detection of disease outbreaks is crucial to ensure timely intervention
by the health authorities. Due to the challenges associated with traditional
indicator-based surveillance, monitoring informal sources such as online media
has become increasingly popular. However, owing to the number of online
articles published every day, manual screening of the articles is
impractical. To address this, we propose Health Sentinel. It is a multi-stage
information extraction pipeline that uses a combination of ML and non-ML
methods to extract events (structured information concerning disease outbreaks
or other unusual health events) from online articles. The extracted events are
made available to the Media Scanning and Verification Cell (MSVC) at the
National Centre for Disease Control (NCDC), Delhi for analysis, interpretation
and further dissemination to local agencies for timely intervention. From April
2022 to date, Health Sentinel has processed over 300 million news articles
and identified over 95,000 unique health events across India, of which over
3,500 events were shortlisted by the public health experts at NCDC as potential
outbreaks.
[LINK]
http://arxiv.org/abs/2506.19548v1
[DATE]
2025-06-24 19:54:37+08:00
[CATEGORIES]
cs.CL
Automatic Posology Structuration : What role for LLMs?
[AUTHORS]
Natalia Bobkova, Laura Zanella-Calzada, Anyes Tafoughalt, Raphaël Teboul, François Plesse, Félix Gaschi
[ABSTRACT]
Automatically structuring posology instructions is essential for improving
medication safety and enabling clinical decision support. In French
prescriptions, these instructions are often ambiguous, irregular, or
colloquial, limiting the effectiveness of classic ML pipelines. We explore the
use of Large Language Models (LLMs) to convert free-text posologies into
structured formats, comparing prompt-based methods and fine-tuning against a
“pre-LLM” system based on Named Entity Recognition and Linking (NERL). Our
results show that while prompting improves performance, only fine-tuned LLMs
match the accuracy of the baseline. Through error analysis, we observe
complementary strengths: NERL offers structural precision, while LLMs better
handle semantic nuances. Based on this, we propose a hybrid pipeline that
routes low-confidence cases from NERL (<0.8) to the LLM, selecting outputs
based on confidence scores. This strategy achieves 91% structuration accuracy
while minimizing latency and compute. Our results show that this hybrid
approach improves structuration accuracy while limiting computational cost,
offering a scalable solution for real-world clinical use.
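The routing rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Structured` type, the `route` function, and the toy `toy_nerl` and `toy_llm` stubs are hypothetical names, and the tie-breaking rule (keep whichever output reports the higher confidence) is an assumption about what "selecting outputs based on confidence scores" means. Only the 0.8 threshold comes from the abstract.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # from the abstract: NERL cases below 0.8 go to the LLM


@dataclass
class Structured:
    fields: dict       # structured posology (hypothetical schema)
    confidence: float  # self-reported confidence in [0, 1]
    source: str        # "nerl" or "llm"


def route(text, nerl, llm, threshold=CONFIDENCE_THRESHOLD):
    """Run NERL first; call the LLM only for low-confidence cases."""
    nerl_out = nerl(text)
    if nerl_out.confidence >= threshold:
        return nerl_out  # fast path: no LLM call, so no extra latency
    llm_out = llm(text)
    # Assumption: keep whichever output reports the higher confidence.
    return llm_out if llm_out.confidence > nerl_out.confidence else nerl_out


# Toy stand-ins so the sketch runs; real systems would replace these.
def toy_nerl(text):
    confidence = 0.95 if "mg" in text else 0.4
    return Structured({"raw": text}, confidence, "nerl")


def toy_llm(text):
    return Structured({"raw": text}, 0.7, "llm")
```

Because the LLM is only invoked when NERL falls below the threshold, the cost of the slower model is paid only on the ambiguous inputs.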
[LINK]
http://arxiv.org/abs/2506.19525v1
[DATE]
2025-06-24 19:25:21+08:00
[CATEGORIES]
cs.CL
heiDS at ArchEHR-QA 2025: From Fixed-k to Query-dependent-k for Retrieval Augmented Generation
[AUTHORS]
Ashish Chouhan, Michael Gertz
[ABSTRACT]
This paper presents the approach of our team called heiDS for the ArchEHR-QA
2025 shared task. A pipeline using a retrieval augmented generation (RAG)
framework is designed to generate answers that are attributed to clinical
evidence from the electronic health records (EHRs) of patients in response to
patient-specific questions. We explored various components of a RAG framework,
focusing on ranked list truncation (RLT) retrieval strategies and attribution
approaches. Instead of using a fixed top-k RLT retrieval strategy, we employ a
query-dependent-k retrieval strategy, including the existing surprise and
autocut methods and two new methods proposed in this work, autocut* and elbow.
The experimental results show the benefits of our strategy in producing factual
and relevant answers when compared to a fixed-k strategy.
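To illustrate what a query-dependent cutoff looks like in general, the sketch below truncates a ranked list at the largest drop between adjacent retrieval scores. This is a generic largest-gap heuristic chosen for illustration only; it is not the definition of surprise, autocut, autocut*, or elbow, which the abstract does not specify. The function name `largest_gap_cutoff` and the `min_k` parameter are hypothetical.

```python
def largest_gap_cutoff(scores, min_k=1):
    """Return k: keep results up to the largest drop between adjacent sorted scores."""
    min_k = max(1, min_k)
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= min_k:
        return len(ranked)
    # gaps[i] is the drop between rank i and rank i + 1 (0-based).
    gaps = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    # Only consider cut positions that keep at least min_k results.
    candidates = range(min_k - 1, len(gaps))
    best = max(candidates, key=lambda i: gaps[i])
    return best + 1
```

For scores such as `[0.91, 0.89, 0.88, 0.52, 0.50]` the cut falls after the third result, whereas a fixed k would return the same number of passages for every query regardless of how the scores are distributed.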
[COMMENTS]
12 pages, 2 figures, 6 tables, Workshop on BioNLP and Shared Tasks at
ACL 2025
[LINK]
http://arxiv.org/abs/2506.19512v1
[DATE]
2025-06-24 19:03:01+08:00
[CATEGORIES]
cs.CL
AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
[AUTHORS]
Zeyu Li, Chuanfu Xiao, Yang Wang, Xiang Liu, Zhenheng Tang, Baotong Lu, Mao Yang, Xinyu Chen, Xiaowen Chu
[ABSTRACT]
Quantization has emerged as an effective and lightweight solution to reduce
the memory footprint of the KV cache in Large Language Models (LLMs).
Nevertheless, minimizing the performance degradation caused by ultra-low-bit KV
cache quantization remains a significant challenge. We observe that quantizing
the KV cache of different tokens has varying impacts on the quality of
attention outputs. To systematically investigate this phenomenon, we perform
forward error propagation analysis on attention and propose the Anchor Score
(AnS) that quantifies the sensitivity of each token’s KV cache to
quantization-induced error. Our analysis reveals significant disparities in AnS
across tokens, suggesting that preserving a small subset with full precision
(FP16) of high-AnS tokens can greatly mitigate accuracy loss in aggressive
quantization scenarios. Based on this insight, we introduce AnTKV, a novel
framework that leverages Anchor Token-aware Vector Quantization to compress the
KV cache. Furthermore, to support efficient deployment, we design and develop a
Triton kernel that is fully compatible with FlashAttention, enabling fast
online Anchor Token selection. AnTKV enables LLaMA-3-8B to handle context
lengths up to 840K tokens on a single 80GB A100 GPU, while achieving up to 3.5x
higher decoding throughput compared to the FP16 baseline. Our experimental
results demonstrate that AnTKV matches or outperforms prior works such as KIVI,
SKVQ, KVQuant, and CQ under 4-bit settings. More importantly, AnTKV achieves
significantly lower perplexity under ultra-low-bit quantization on Mistral-7B,
with only 6.32 at 1-bit and 8.87 at 0.375-bit, compared to the FP16 baseline of
4.73.
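The idea of keeping a small set of sensitive tokens at full precision while compressing the rest can be sketched as below. This is a toy under stated assumptions, not AnTKV itself: the anchor scores are taken as given inputs (the paper derives AnS from forward error propagation analysis, which is not reproduced here), and a 1-bit sign quantizer stands in for the paper's vector quantization. The function name `compress_kv` and its parameters are hypothetical, and each vector is assumed to be non-empty.

```python
def compress_kv(vectors, anchor_scores, num_anchors):
    """Keep the highest-scoring tokens untouched; sign-quantize the rest."""
    order = sorted(range(len(vectors)), key=lambda i: anchor_scores[i], reverse=True)
    anchors = set(order[:num_anchors])
    out = []
    for i, vec in enumerate(vectors):
        if i in anchors:
            out.append(list(vec))  # stand-in for the full-precision (FP16) path
            continue
        # 1-bit stand-in: keep only the sign, scaled by the mean magnitude.
        scale = sum(abs(x) for x in vec) / len(vec)
        out.append([scale if x >= 0 else -scale for x in vec])
    return out
```

With `num_anchors` small relative to the sequence length, almost all tokens take the low-bit path, which is where the memory saving would come from in a scheme of this shape.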
[LINK]
http://arxiv.org/abs/2506.19505v1
[DATE]
2025-06-24 18:45:48+08:00
[CATEGORIES]
cs.CL
NaviAgent: Bilevel Planning on Tool Dependency Graphs for Function Calling
[AUTHORS]
Yan Jiang, Hao Zhou, LiZhong GU, Ai Han, TianLong Li
[ABSTRACT]
LLMs’ reliance on static knowledge and fragile tool invocation severely
hinders the orchestration of complex, heterogeneous toolchains, particularly at
large scales. Existing methods typically use rigid single-path execution,
resulting in poor error recovery and exponentially growing search spaces. We
introduce NaviAgent, a graph-navigated bilevel planning architecture for robust
function calling, comprising a Multi-Path Decider and Graph-Encoded Navigator.
As an LLM-powered agent, the Multi-Path Decider defines a four-dimensional
decision space and continuously perceives environmental states, dynamically
selecting the optimal action to fully cover all tool invocation scenarios. The
Graph-Encoded Navigator constructs a Tool Dependency Heterogeneous Graph
(TDHG), where node embeddings explicitly fuse API schema structure with
historical invocation behavior. It also integrates a novel heuristic search
strategy that guides the Decider toward efficient and highly successful
toolchains, even for unseen tool combinations. Experiments show that NaviAgent
consistently achieves the highest task success rate (TSR) across all foundation
models and task complexities, outperforming the average baselines (ReAct,
ToolLLM, α-UMI) by 13.5%, 16.4%, and 19.0% on Qwen2.5-14B, Qwen2.5-32B,
and Deepseek-V3, respectively. Its execution steps are typically within one
step of the most efficient baseline, ensuring a strong balance between quality
and efficiency. Notably, a fine-tuned Qwen2.5-14B model achieves a TSR of
49.5%, surpassing the much larger 32B model (44.9%) under our architecture.
Incorporating the Graph-Encoded Navigator further boosts TSR by an average of
2.4 points, with gains of over 9 points on complex tasks for larger models
(Deepseek-V3 and GPT-4o), highlighting its essential role in toolchain
orchestration.
[LINK]
http://arxiv.org/abs/2506.19500v1
[DATE]
2025-06-24 18:39:07+08:00
[CATEGORIES]
cs.CL
cs.LG
Is Long-to-Short a Free Lunch? Investigating Inconsistency and Reasoning Efficiency in LRMs
[AUTHORS]
Shu Yang, Junchao Wu, Xuansheng Wu, Derek Wong, Ninhao Liu, Di Wang
[ABSTRACT]
Large Reasoning Models (LRMs) have achieved remarkable performance on complex
tasks by engaging in extended reasoning before producing final answers, yet
this strength introduces the risk of overthinking, where excessive token
generation occurs even for simple tasks. While recent work in efficient
reasoning seeks to reduce reasoning length while preserving accuracy, it
remains unclear whether such optimization is truly a free lunch. Drawing on the
intuition that compressing reasoning may reduce the robustness of model
responses and lead models to omit key reasoning steps, we investigate whether
efficient reasoning strategies introduce behavioral inconsistencies. To
systematically assess this, we introduce $ICBENCH$, a benchmark designed to
measure inconsistency in LRMs across three dimensions: inconsistency across
task settings (ITS), inconsistency between training objectives and learned
behavior (TR-LB), and inconsistency between internal reasoning and
self-explanations (IR-SE). Applying $ICBENCH$ to a range of open-source LRMs,
we find that while larger models generally exhibit greater consistency than
smaller ones, they all display widespread “scheming” behaviors, including
self-disagreement, post-hoc rationalization, and the withholding of reasoning
cues. Crucially, our results demonstrate that efficient reasoning strategies
such as No-Thinking and Simple Token-Budget consistently increase all three
defined types of inconsistency. These findings suggest that although efficient
reasoning enhances token-level efficiency, further investigation is imperative
to ascertain whether it concurrently introduces the risk of models evading
effective supervision.
[LINK]
http://arxiv.org/abs/2506.19492v1
[DATE]
2025-06-24 18:25:28+08:00
[CATEGORIES]
cs.CL
Dialogic Pedagogy for Large Language Models: Aligning Conversational AI with Proven Theories of Learning
[AUTHORS]
Russell Beale
[ABSTRACT]
Large Language Models (LLMs) are rapidly transforming education by enabling
rich conversational learning experiences. This article provides a comprehensive
review of how LLM-based conversational agents are being used in higher
education, with extensions to secondary and lifelong learning contexts. We
synthesize existing literature on LLMs in education and theories of
conversational and dialogic pedagogy - including Vygotsky’s sociocultural
learning (scaffolding and the Zone of Proximal Development), the Socratic
method, and Laurillard’s conversational framework - and examine how prompting
strategies and retrieval-augmented generation (RAG) can align LLM behaviors
with these pedagogical theories, and how they can support personalized, adaptive
learning. We map educational theories to LLM capabilities, highlighting where
LLM-driven dialogue supports established learning principles and where it
challenges or falls short of traditional pedagogical assumptions. Notable gaps
in applying prior theories to LLMs are identified, such as the models' tendency
to provide direct answers instead of fostering co-construction of knowledge,
and the need to account for the constant availability and broad but non-human
expertise of LLM tutors. In response, we propose practical strategies to better
align LLM interactions with sound pedagogy - for example, designing prompts
that encourage Socratic questioning, scaffolded guidance, and student
reflection, as well as integrating retrieval mechanisms to ensure accuracy and
contextual relevance. Our aim is to bridge the gap between educational theory
and the emerging practice of AI-driven conversational learning, offering
insights and tools for making LLM-based dialogues more educationally productive
and theory-aligned.
[LINK]
http://arxiv.org/abs/2506.19484v1
[DATE]
2025-06-24 18:19:09+08:00
[CATEGORIES]
cs.CL
Commonsense Generation and Evaluation for Dialogue Systems using Large Language Models
[AUTHORS]
Marcos Estecha-Garitagoitia, Chen Zhang, Mario Rodríguez-Cantelar, Luis Fernando D’Haro
[ABSTRACT]
This paper provides preliminary results on exploring the task of performing
turn-level data augmentation for dialogue systems based on different types of
commonsense relationships, and the automatic evaluation of the generated
synthetic turns. The proposed methodology takes advantage of the extended
knowledge and zero-shot capabilities of pretrained Large Language Models (LLMs)
to follow instructions and understand contextual information, as well as their
commonsense reasoning capabilities. The approach draws inspiration from
methodologies like Chain-of-Thought (CoT), applied more explicitly to the task
of prompt-based generation for dialogue-based data augmentation conditioned on
commonsense attributes, and the automatic evaluation of the generated
dialogues.
To assess the effectiveness of the proposed approach, we first extracted 200
randomly selected partial dialogues from 5 different well-known dialogue
datasets and generated alternative responses conditioned on different event
commonsense attributes. This novel dataset allows us to measure the proficiency
of LLMs in generating contextually relevant commonsense knowledge, particularly
up to 12 different specific ATOMIC [10] database relations. Secondly, we
propose an evaluation framework to automatically detect the quality of the
generated dataset inspired by the ACCENT [26] metric, which offers a nuanced
approach to assess event commonsense. However, our method does not follow
ACCENT’s complex event-relation tuple extraction process. Instead, we propose an
instruction-based prompt for each commonsense attribute and use
state-of-the-art LLMs to automatically detect the original attributes used when
creating each augmented turn in the previous step.
Preliminary results suggest that our approach effectively harnesses LLMs’
capabilities for commonsense reasoning and evaluation in dialogue systems.
[LINK]
http://arxiv.org/abs/2506.19483v1
[DATE]
2025-06-24 18:18:05+08:00
[CATEGORIES]
cs.CL
LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages
[AUTHORS]
Karthika N J, Krishnakant Bhatt, Ganesh Ramakrishnan, Preethi Jyothi
[ABSTRACT]
Translating technical terms into lexically similar, low-resource Indian
languages remains a challenge due to limited parallel data and the complexity
of linguistic structures. We propose a novel use-case of Sanskrit-based
segments for linguistically informed translation of such terms, leveraging
subword-level similarity and morphological alignment across related languages.
Our approach uses character-level segmentation to identify meaningful subword
units, facilitating more accurate and context-aware translation. To enable
this, we utilize a Character-level Transformer model for Sanskrit Word
Segmentation (CharSS), which addresses the complexities of sandhi and
morpho-phonemic changes during segmentation. We observe consistent improvements
in two experimental settings for technical term translation using
Sanskrit-derived segments, averaging 8.46 and 6.79 chrF++ scores, respectively.
Further, we conduct a post hoc human evaluation to verify the quality
assessment of the translated technical terms using automated metrics. This work
has important implications for the education field, especially in creating
accessible, high-quality learning materials in Indian languages. By supporting
the accurate and linguistically rooted translation of technical content, our
approach facilitates inclusivity and aids in bridging the resource gap for
learners in low-resource language communities.
[COMMENTS]
20th Workshop on Innovative Use of NLP for Building Educational
Applications (Co-located with ACL2025)
[LINK]
http://arxiv.org/abs/2407.06331v2
[DATE]
2025-06-24 18:06:32+08:00
[CATEGORIES]
cs.CL
Can Large Language Models Capture Human Annotator Disagreements?
[AUTHORS]
Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Alexander Hoyle, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Elliott Ash
[ABSTRACT]
Human annotation variation (i.e., annotation disagreements) is common in NLP
and often reflects important information such as task subjectivity and sample
ambiguity. While Large Language Models (LLMs) are increasingly used for
automatic annotation to reduce human effort, their evaluation often focuses on
predicting the majority-voted “ground truth” labels. It is still unclear,
however, whether these models also capture informative human annotation
variation. Our work addresses this gap by extensively evaluating LLMs’ ability
to predict annotation disagreements without access to repeated human labels.
Our results show that LLMs struggle with modeling disagreements, which can be
overlooked by majority label-based evaluations. Notably, while RLVR-style
(Reinforcement learning with verifiable rewards) reasoning generally boosts LLM
performance, it degrades performance in disagreement prediction. Our findings
highlight the critical need for evaluating and improving LLM annotators in
disagreement modeling. Code and data at
https://github.com/EdisonNi-hku/Disagreement_Prediction.
[COMMENTS]
Preprint Under Review
[LINK]
http://arxiv.org/abs/2506.19467v1
[DATE]
2025-06-24 17:49:26+08:00
[CATEGORIES]
cs.CL
Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
[AUTHORS]
Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li
[ABSTRACT]
Vision-and-Language Navigation (VLN) in large-scale urban environments
requires embodied agents to ground linguistic instructions in complex scenes
and recall relevant experiences over extended time horizons. Prior modular
pipelines offer interpretability but lack unified memory, while end-to-end
(M)LLM agents excel at fusing vision and language yet remain constrained by
fixed context windows and implicit spatial reasoning. We introduce
\textbf{Mem4Nav}, a hierarchical spatial-cognition long-short memory system
that can augment any VLN backbone. Mem4Nav fuses a sparse octree for
fine-grained voxel indexing with a semantic topology graph for high-level
landmark connectivity, storing both in trainable memory tokens embedded via a
reversible Transformer. Long-term memory (LTM) compresses and retains
historical observations at both octree and graph nodes, while short-term memory
(STM) caches recent multimodal entries in relative coordinates for real-time
obstacle avoidance and local planning. At each step, STM retrieval sharply
prunes dynamic context, and, when deeper history is needed, LTM tokens are
decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and
Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based
LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13
pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW
improvement. Ablations confirm the indispensability of both the hierarchical
map and dual memory modules. Our code is open-sourced at
https://github.com/tsinghua-fib-lab/Mem4Nav.
[LINK]
http://arxiv.org/abs/2506.19433v1
[DATE]
2025-06-24 17:00:43+08:00
[CATEGORIES]
cs.CL
Learning to Disentangle Latent Reasoning Rules with Language VAEs: A Systematic Study
[AUTHORS]
Yingji Zhang, Marco Valentino, Danilo S. Carvalho, André Freitas
[ABSTRACT]
Incorporating explicit reasoning rules within the latent space of language
models (LMs) offers a promising pathway to enhance generalisation,
interpretability, and controllability. While current Transformer-based language
models have shown strong performance on Natural Language Inference (NLI) tasks,
they often rely on memorisation rather than rule-based inference. This work
investigates how reasoning rules can be explicitly embedded and memorised
within the LMs through Language Variational Autoencoders (VAEs). We propose a
complete pipeline for learning reasoning rules within Transformer-based
language VAEs. This pipeline encompasses three rule-based reasoning tasks, a
supporting theoretical framework, and a practical end-to-end architecture. The
experiment illustrates the following findings: Disentangled reasoning: Under
explicit signal supervision, reasoning rules - viewed as functional mappings -
can be disentangled within the encoder’s parametric space. This separation
results in distinct clustering of rules in the output feature space. Prior
knowledge injection: Injecting reasoning information into the Query enables the
model to more effectively retrieve the stored Value from memory based on the
Key. This approach offers a simple method for integrating prior knowledge into
decoder-only language models. Performance bottleneck: In mathematical reasoning
tasks using Qwen2.5 (0.5B), increasing the sample count doesn’t improve performance
beyond a point. Moreover, FFN layers are better than attention layers at
preserving the separation of reasoning rules in the model’s parameters.
[LINK]
http://arxiv.org/abs/2506.19418v1
[DATE]
2025-06-24 16:38:03+08:00
[CATEGORIES]
cs.CL
Automated Detection of Pre-training Text in Black-box LLMs
[AUTHORS]
Ruihan Hu, Yu-Ming Shang, Jiankun Peng, Wei Luo, Yazhe Wang, Xi Zhang
[ABSTRACT]
Detecting whether a given text is a member of the pre-training data of Large
Language Models (LLMs) is crucial for ensuring data privacy and copyright
protection. Most existing methods rely on the LLM’s hidden information (e.g.,
model parameters or token probabilities), making them ineffective in the
black-box setting, where only input and output texts are accessible. Although
some methods have been proposed for the black-box setting, they rely on massive
manual efforts such as designing complicated questions or instructions. To
address these issues, we propose VeilProbe, the first framework for
automatically detecting LLMs’ pre-training texts in a black-box setting without
human intervention. VeilProbe utilizes a sequence-to-sequence mapping model to
infer the latent mapping feature between the input text and the corresponding
output suffix generated by the LLM. It then performs key token
perturbations to obtain more distinguishable membership features. Additionally,
considering real-world scenarios where the ground-truth training text samples
are limited, a prototype-based membership classifier is introduced to alleviate
the overfitting issue. Extensive evaluations on three widely used datasets
demonstrate that our framework is effective and superior in the black-box
setting.
[COMMENTS]
13 pages
[LINK]
http://arxiv.org/abs/2506.19399v1
[DATE]
2025-06-24 16:08:15+08:00
[CATEGORIES]
cs.CL
Statistical Multicriteria Evaluation of LLM-Generated Text
[AUTHORS]
Esteban Garces Arias, Hannah Blocher, Julian Rodemann, Matthias Aßenmacher, Christoph Jansen
[ABSTRACT]
Assessing the quality of LLM-generated text remains a fundamental challenge
in natural language processing. Current evaluation approaches often rely on
isolated metrics or simplistic aggregations that fail to capture the nuanced
trade-offs between coherence, diversity, fluency, and other relevant indicators
of text quality. In this work, we adapt a recently proposed framework for
statistical inference based on Generalized Stochastic Dominance (GSD) that
addresses three critical limitations in existing benchmarking methodologies:
the inadequacy of single-metric evaluation, the incompatibility between
cardinal automatic metrics and ordinal human judgments, and the lack of
inferential statistical guarantees. The GSD-front approach enables simultaneous
evaluation across multiple quality dimensions while respecting their different
measurement scales, building upon partial orders of decoding strategies, thus
avoiding arbitrary weighting of the involved metrics. By applying this
framework to evaluate common decoding strategies against human-generated text,
we demonstrate its ability to identify statistically significant performance
differences while accounting for potential deviations from the i.i.d.
assumption of the sampling design.
[LINK]
http://arxiv.org/abs/2506.18082v2
[DATE]
2025-06-24 15:59:45+08:00
[CATEGORIES]
cs.CL
Measuring and Guiding Monosemanticity
[AUTHORS]
Ruben Härle, Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting
[ABSTRACT]
There is growing interest in leveraging mechanistic interpretability and
controllability to better understand and influence the internal dynamics of
large language models (LLMs). However, current methods face fundamental
challenges in reliably localizing and manipulating feature representations.
Sparse Autoencoders (SAEs) have recently emerged as a promising direction for
feature extraction at scale, yet they, too, are limited by incomplete feature
isolation and unreliable monosemanticity. To systematically assess these
limitations, we introduce the Feature Monosemanticity Score (FMS), a novel metric
to quantify feature monosemanticity in latent representations. Building on these
insights, we propose Guided Sparse Autoencoders (G-SAE), a method that
conditions latent representations on labeled concepts during training. We
demonstrate that reliable localization and disentanglement of target concepts
within the latent space improve interpretability, detection of behavior, and
control. Specifically, our evaluations on toxicity detection, writing style
identification, and privacy attribute recognition show that G-SAE not only
enhances monosemanticity but also enables more effective and fine-grained
steering with less quality degradation. Our findings provide actionable
guidelines for measuring and advancing mechanistic interpretability and control
of LLMs.
[LINK]
http://arxiv.org/abs/2506.19382v1
[DATE]
2025-06-24 15:18:20+08:00
[CATEGORIES]
cs.CL
ReDit: Reward Dithering for Improved LLM Policy Optimization
[AUTHORS]
Chenxing Wei, Jiarui Yu, Ying Tiffany He, Hande Dong, Yao Shu, Fei Yu
[ABSTRACT]
DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning
capabilities through its rule-based reward system. While it is a “perfect”
reward system that effectively mitigates reward hacking, such reward functions
are often discrete. Our experimental observations suggest that discrete rewards
can lead to gradient anomaly, unstable optimization, and slow convergence. To
address this issue, we propose ReDit (Reward Dithering), a method that dithers
the discrete reward signal by adding simple random noise. With this perturbed
reward, exploratory gradients are continuously provided throughout the learning
process, enabling smoother gradient updates and accelerating convergence. The
injected noise also introduces stochasticity into flat reward regions,
encouraging the model to explore novel policies and escape local optima.
Experiments across diverse tasks demonstrate the effectiveness and efficiency
of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO
with only approximately 10% of the training steps and, furthermore, still exhibits
a 4% performance improvement over vanilla GRPO when trained for a similar
duration. Visualizations confirm significant mitigation of gradient issues with
ReDit. Moreover, theoretical analyses are provided to further validate these
advantages.
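
As a rough illustration of the dithering idea described in this abstract (not the authors' implementation), the sketch below adds zero-mean Gaussian noise to discrete rewards and then computes group-normalized advantages of the kind used in GRPO. The choice of Gaussian noise, the value of `sigma`, and the function names `dither_rewards` and `group_advantages` are assumptions made for illustration only.

```python
import random


def dither_rewards(rewards, sigma=0.05, seed=None):
    """Return a copy of `rewards` with zero-mean Gaussian noise added.

    The Gaussian distribution and the default `sigma` are illustrative
    assumptions; the abstract only says "simple random noise".
    """
    rng = random.Random(seed)
    return [r + rng.gauss(0.0, sigma) for r in rewards]


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# A group in which every sampled response received the same discrete reward.
raw = [1.0, 1.0, 1.0, 1.0]
flat_advantages = group_advantages(raw)
dithered_advantages = group_advantages(dither_rewards(raw, sigma=0.05, seed=0))
```

When every reward in a group is identical, the undithered advantages are all zero and the group contributes no learning signal; after dithering, the advantages are nonzero, which is one way to read the abstract's claim that the perturbed reward keeps providing exploratory gradients.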
[COMMENTS]
10 pages, 15 figures
[LINK]
http://arxiv.org/abs/2506.18631v2
[DATE]
2025-06-24 15:07:57+08:00
[CATEGORIES]
cs.LG
cs.CL
SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents
[AUTHORS]
Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, Yongbin Li
[COMMENTS]
NeurIPS 2023
[LINK]
http://arxiv.org/abs/2305.13040v7
[DATE]
2025-06-24 15:06:57+08:00
[CATEGORIES]
cs.CL
Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation
[AUTHORS]
Jisu Shin, Juhyun Oh, Eunsu Kim, Hoyun Song, Alice Oh
[ABSTRACT]
Ensuring persona fidelity in large language models (LLMs) is essential for
maintaining coherent and engaging human-AI interactions. However, LLMs often
exhibit Out-of-Character (OOC) behavior, where generated responses deviate from
an assigned persona, leading to inconsistencies that affect model reliability.
Existing evaluation methods typically assign single scores to entire responses,
struggling to capture subtle persona misalignment, particularly in long-form
text generation. To address this limitation, we propose an atomic-level
evaluation framework that quantifies persona fidelity at a finer granularity.
Our three key metrics measure the degree of persona alignment and consistency
within and across generations. Our approach enables a more precise and
realistic assessment of persona fidelity by identifying subtle deviations that
real users would encounter. Through our experiments, we demonstrate that our
framework effectively detects persona inconsistencies that prior methods
overlook. By analyzing persona fidelity across diverse tasks and personality
types, we reveal how task structure and persona desirability influence model
adaptability, highlighting challenges in maintaining consistent persona
expression.
[COMMENTS]
Findings of ACL 2025; github repo:
https://github.com/ddindidu/atomic-persona-evaluation/
[LINK]
http://arxiv.org/abs/2506.19352v1
[DATE]
2025-06-24 14:33:10+08:00
[CATEGORIES]
cs.CL
In-Context Occam’s Razor: How Transformers Prefer Simpler Hypotheses on the Fly
[AUTHORS]
Puneesh Deora, Bhavya Vasudeva, Tina Behnia, Christos Thrampoulidis
[ABSTRACT]
In-context learning (ICL) enables transformers to adapt to new tasks through
contextual examples without parameter updates. While existing research has
typically studied ICL in fixed-complexity environments, practical language
models encounter tasks spanning diverse complexity levels. This paper
investigates how transformers navigate hierarchical task structures where
higher-complexity categories can perfectly represent any pattern generated by
simpler ones. We design well-controlled testbeds based on Markov chains and
linear regression that reveal transformers not only identify the appropriate
complexity level for each task but also accurately infer the corresponding
parameters, even when the in-context examples are compatible with multiple
complexity hypotheses. Notably, when presented with data generated by simpler
processes, transformers consistently favor the least complex sufficient
explanation. We theoretically explain this behavior through a Bayesian
framework, demonstrating that transformers effectively implement an in-context
Bayesian Occam’s razor by balancing model fit against complexity penalties. We
further ablate on the roles of model size, training mixture distribution,
inference context length, and architecture. Finally, we validate this Occam’s
razor-like inductive bias on a pretrained GPT-4 model with Boolean-function
tasks as a case study, suggesting it may be inherent to transformers trained on
diverse task distributions.
[COMMENTS]
28 pages, 19 figures
[LINK]
http://arxiv.org/abs/2506.19351v1
[DATE]
2025-06-24 14:33:00+08:00
[CATEGORIES]
cs.LG
cs.CL
Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
[AUTHORS]
Chenghao Xiao, Hou Pong Chan, Hao Zhang, Mahani Aljunied, Lidong Bing, Noura Al Moubayed, Yu Rong
[ABSTRACT]
While understanding the knowledge boundaries of LLMs is crucial to prevent
hallucination, research on the knowledge boundaries of LLMs has predominantly
focused on English. In this work, we present the first study to analyze how
LLMs recognize knowledge boundaries across different languages by probing their
internal representations when processing known and unknown questions in
multiple languages. Our empirical studies reveal three key findings: 1) LLMs’
perceptions of knowledge boundaries are encoded in the middle to middle-upper
layers across different languages. 2) Language differences in knowledge
boundary perception follow a linear structure, which motivates our proposal of
a training-free alignment method that effectively transfers knowledge boundary
perception ability across languages, thereby helping reduce hallucination risk
in low-resource languages. 3) Fine-tuning on bilingual question pair
translation further enhances LLMs’ recognition of knowledge boundaries across
languages. Given the absence of standard testbeds for cross-lingual knowledge
boundary analysis, we construct a multilingual evaluation suite comprising
three representative types of knowledge boundary data. Our code and datasets
are publicly available at
https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.
[COMMENTS]
ACL 2025 main; camera ready
[LINK]
http://arxiv.org/abs/2504.13816v3
[DATE]
2025-06-24 14:24:15+08:00
[CATEGORIES]
cs.CL
RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning
[AUTHORS]
Yu Wang, Shiwan Zhao, Zhihu Wang, Yubo Zhang, Xicheng Zhang, Zhengfan Wang, Heyuan Huang, Ming Fan, Ting Liu
[ABSTRACT]
The integration of external knowledge through Retrieval-Augmented Generation
(RAG) has become foundational in enhancing large language models (LLMs) for
knowledge-intensive tasks. However, existing RAG paradigms often overlook the
cognitive step of applying knowledge, leaving a gap between retrieved facts and
task-specific reasoning. In this work, we introduce RAG+, a principled and
modular extension that explicitly incorporates application-aware reasoning into
the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and
aligned application examples, created either manually or automatically, and
retrieves both jointly during inference. This design enables LLMs not only to
access relevant information but also to apply it within structured,
goal-oriented reasoning processes. Experiments across mathematical, legal, and
medical domains, conducted on multiple models, demonstrate that RAG+
consistently outperforms standard RAG variants, achieving average improvements
of 3-5%, and peak gains up to 7.5% in complex scenarios. By bridging retrieval
with actionable application, RAG+ advances a more cognitively grounded
framework for knowledge integration, representing a step toward more
interpretable and capable LLMs.
[LINK]
http://arxiv.org/abs/2506.11555v2
[DATE]
2025-06-24 13:50:06+08:00
[CATEGORIES]
cs.CL
FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression
[AUTHORS]
Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang
[ABSTRACT]
Large Language Models (LLMs) have enabled remarkable progress in natural
language processing, yet their high computational and memory demands pose
challenges for deployment in resource-constrained environments. Although recent
low-rank decomposition methods offer a promising path for structural
compression, they often suffer from accuracy degradation and expensive calibration
procedures, and they result in inefficient model architectures that hinder
real-world inference speedups. In this paper, we propose FLAT-LLM, a fast and
accurate, training-free structural compression method based on fine-grained
low-rank transformations in the activation space. Specifically, we reduce the
hidden dimension by transforming the weights using truncated eigenvectors
computed via head-wise Principal Component Analysis (PCA), and employ an
importance-based metric to adaptively allocate ranks across decoders. FLAT-LLM
achieves efficient and effective weight compression without recovery
fine-tuning and can complete calibration within a few minutes.
Evaluated across 4 models and 11 datasets, FLAT-LLM outperforms structural
pruning baselines in generalization and downstream performance, while
delivering inference speedups over decomposition-based methods.
[LINK]
http://arxiv.org/abs/2505.23966v2
[DATE]
2025-06-24 13:40:57+08:00
[CATEGORIES]
cs.CL
JCAPT: A Joint Modeling Approach for CAPT
[AUTHORS]
Tzu-Hsuan Yang, Yue-Yang He, Berlin Chen
[ABSTRACT]
Effective pronunciation feedback is critical in second language (L2)
learning, for which computer-assisted pronunciation training (CAPT) systems
often encompass two key tasks: automatic pronunciation assessment (APA) and
mispronunciation detection and diagnosis (MDD). Recent work has shown that
joint modeling of these two tasks can yield mutual benefits. Our unified
framework leverages Mamba, a selective state space model (SSM), while
integrating phonological features and think token strategies to jointly enhance
interpretability and fine-grained temporal reasoning in APA and MDD. To our
knowledge, this is the first study to combine phonological attribution,
SSM-based modeling, and prompting in CAPT. A series of experiments conducted on
the speechocean762 benchmark demonstrates that our model consistently
outperforms prior methods, particularly on the MDD task.
[COMMENTS]
Submitted to the ISCA SLaTE-2025 Workshop
[LINK]
http://arxiv.org/abs/2506.19315v1
[DATE]
2025-06-24 13:12:32+08:00
[CATEGORIES]
cs.CL
Long-Context Generalization with Sparse Attention
[AUTHORS]
Pavlo Vasylenko, Marcos Treviso, André F. T. Martins
[ABSTRACT]
Transformer-based architectures traditionally employ softmax to compute
attention weights, which produces dense distributions over all tokens in a
sequence. While effective in many settings, this density has been shown to be
detrimental for tasks that demand precise focus on fixed-size patterns: as
sequence length increases, non-informative tokens accumulate attention
probability mass, leading to dispersion and representational collapse. We show
in this paper that sparse attention mechanisms using $\alpha$-entmax can avoid
these issues, due to their ability to assign exact zeros to irrelevant tokens.
Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows
$\alpha$-entmax with a learnable temperature parameter, allowing the attention
distribution to interpolate between sparse (pattern-focused) and dense
(softmax-like) regimes. Finally, we show that the ability to locate and
generalize fixed-size patterns can be further improved through a careful design
of position encodings, which impacts both dense and sparse attention methods.
By integrating ASEntmax into standard transformer layers alongside proper
positional encodings, we show that our models greatly outperform softmax,
scalable softmax, and fixed-temperature $\alpha$-entmax baselines on
long-context generalization.
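
To make the "exact zeros" property concrete, here is a minimal sketch of sparsemax, which is the $\alpha = 2$ special case of $\alpha$-entmax. It is not the paper's ASEntmax: there is no learnable temperature and no general $\alpha$, and the function name `sparsemax` and the example scores are chosen only for illustration.

```python
def sparsemax(scores):
    """Project `scores` onto the probability simplex (alpha-entmax, alpha = 2).

    Unlike softmax, the result can contain exact zeros for low-scoring entries.
    """
    ordered = sorted(scores, reverse=True)
    running_sum = 0.0
    tau = 0.0
    for j, z in enumerate(ordered, start=1):
        running_sum += z
        # Entry j stays in the support while 1 + j * z_(j) > sum of the top j.
        if 1.0 + j * z > running_sum:
            tau = (running_sum - 1.0) / j
    # Shift by the threshold tau and clip negatives to exactly zero.
    return [max(s - tau, 0.0) for s in scores]


weights = sparsemax([2.0, 1.0, -1.0])  # -> [1.0, 0.0, 0.0]
```

In this example the two lower-scoring tokens receive exactly zero weight, whereas softmax would assign each of them a small positive probability, and that leftover mass is what accumulates over long sequences.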
[LINK]
http://arxiv.org/abs/2506.16640v2
[DATE]
2025-06-24 12:45:00+08:00
[CATEGORIES]
cs.CL
Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
[AUTHORS]
Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou
[ABSTRACT]
Software engineering (SWE) has recently emerged as a crucial testbed for
next-generation LLM agents, demanding inherent capabilities in two critical
dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds)
and long-context dependency resolution (e.g., >32k tokens). However, the data
curation process in SWE remains notoriously time-consuming, as it heavily
relies on manual annotation for code file filtering and the setup of dedicated
runtime environments to execute and validate unit tests. Consequently, most
existing datasets are limited to only a few thousand GitHub-sourced instances.
To this end, we propose an incremental, automated data-curation pipeline that
systematically scales both the volume and diversity of SWE datasets. Our
dataset comprises 10,169 real-world Python task instances from 2,531 distinct
GitHub repositories, each accompanied by a task specified in natural language
and a dedicated runtime-environment image for automated unit-test validation.
We have carefully curated over 8,000 successfully runtime-validated training
trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE
model on these trajectories, we uncover a striking data scaling phenomenon: the
trained model’s software engineering performance
continues to improve as the data size increases, showing no signs of
saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on
the SWE-bench Verified benchmark without using verifiers or multiple rollouts,
establishing a new state-of-the-art (SOTA) among the Qwen2.5-Coder-32B-based
LLMs built on the OpenHands agent framework. Furthermore, with the
incorporation of test-time scaling techniques, the performance further improves
to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter
models. We release the Skywork-SWE-32B model checkpoint to accelerate future
research.
[LINK]
http://arxiv.org/abs/2506.19290v1
[DATE]
2025-06-24 11:53:36+08:00
[CATEGORIES]
cs.CL
Evaluating Transparent Reasoning in Large Language Models for Accountable Critical Tasks
[AUTHORS]
Junhao Chen, Bowen Wang, Jiuyang Chang, Yuta Nakashima
[ABSTRACT]
This paper introduces REACT, a benchmark designed to rigorously evaluate the
reasoning capabilities of large language models (LLMs) within accountable,
high-stakes decision-making tasks in medical and legal domains. Unlike
traditional benchmarks primarily focused on prediction accuracy, REACT
emphasizes transparent and interpretable reasoning, requiring models to align
their logic closely with expert-derived procedures. To assess whether LLM
reasoning aligns closely with human experts, we annotated 511 clinical cases
from the medical domain and 86 legal cases from the legal domain, each enriched
with detailed expert-extracted rationales and evidence supporting each step of
the reasoning process. These annotations were guided by carefully constructed
reasoning graphs, which explicitly encode domain-specific inference structures
and decision criteria derived by domain experts. These reasoning graphs serve
not only as standards for expert annotation but also as structured guidelines
enabling models to reason transparently and step-by-step. To address the
scalability challenges of manual annotation, we further developed a
semi-automatic annotation pipeline leveraging expert-defined reasoning graph
templates to efficiently generate new graphs, exploring the potential to extend
our approach into additional critical domains. Experimental results demonstrate
that reasoning graphs substantially enhance the interpretability and accuracy
of LLM reasoning compared to traditional baselines, although significant gaps
remain relative to expert-level reasoning performance.
[COMMENTS]
This paper is the journal extension of our NeurIPS 2024 paper
“DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models”
[LINK]
http://arxiv.org/abs/2408.01933v5
[DATE]
2025-06-24 11:31:03+08:00
[CATEGORIES]
cs.CL
Disentangling Reasoning and Knowledge in Medical Large Language Models
[AUTHORS]
Rahul Thapa, Qingyang Wu, Kevin Wu, Harrison Zhang, Angela Zhang, Eric Wu, Haotian Ye, Suhana Bedi, Nevin Aresh, Joseph Boen, Shriya Reddy, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou
[ABSTRACT]
Medical reasoning in large language models (LLMs) aims to emulate clinicians’
diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and
PubMedQA often mix reasoning with factual recall. We address this by separating
11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using
a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human
performance. Our analysis shows that only 32.8 percent of questions require
complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1)
and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent
gaps between knowledge and reasoning performance. For example, HuatuoGPT-o1
scores 56.9 on knowledge but only 44.8 on reasoning. In adversarial tests where
models are misled with incorrect initial reasoning, biomedical models degrade
sharply, while larger or RL-trained general models show more robustness. To
address this, we train BioMed-R1 using fine-tuning and reinforcement learning
on reasoning-heavy examples. It achieves the strongest performance among
similarly sized models. Further gains may come from incorporating clinical case
reports and training with adversarial and backtracking scenarios.
[LINK]
http://arxiv.org/abs/2505.11462v2
[DATE]
2025-06-24 11:27:30+08:00
[CATEGORIES]
cs.CL
EmoStage: A Framework for Accurate Empathetic Response Generation via Perspective-Taking and Phase Recognition
[AUTHORS]
Zhiyang Qi, Keiko Takamizo, Mariko Ukiyo, Michimasa Inaba
[ABSTRACT]
The rising demand for mental health care has fueled interest in AI-driven
counseling systems. While large language models (LLMs) offer significant
potential, current approaches face challenges, including limited understanding
of clients’ psychological states and counseling stages, reliance on
high-quality training data, and privacy concerns associated with commercial
deployment. To address these issues, we propose EmoStage, a framework that
enhances empathetic response generation by leveraging the inference
capabilities of open-source LLMs without additional training data. Our
framework introduces perspective-taking to infer clients’ psychological states
and support needs, enabling the generation of emotionally resonant responses.
In addition, phase recognition is incorporated to ensure alignment with the
counseling process and to prevent contextually inappropriate or inopportune
responses. Experiments conducted in both Japanese and Chinese counseling
settings demonstrate that EmoStage improves the quality of responses generated
by base models and performs competitively with data-driven methods.
[LINK]
http://arxiv.org/abs/2506.19279v1
[DATE]
2025-06-24 11:18:37+08:00
[CATEGORIES]
cs.CL
Process Reward Models That Think
[AUTHORS]
Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
[ABSTRACT]
Step-by-step verifiers – also known as process reward models (PRMs) – are a
key ingredient for test-time scaling. PRMs require step-level supervision,
making them expensive to train. This work aims to build data-efficient PRMs as
verbalized step-wise reward models that verify every step in the solution by
generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long
CoT verifier fine-tuned on orders of magnitude fewer process labels than those
required by discriminative PRMs. Our approach capitalizes on the inherent
reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and
discriminative verifiers – using only 1% of the process labels in PRM800K –
across several challenging benchmarks. Specifically, ThinkPRM beats the
baselines on ProcessBench, MATH-500, and AIME ’24 under best-of-N selection and
reward-guided search. In an out-of-domain evaluation on a subset of
GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers
trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the
same token budget, ThinkPRM scales up verification compute more effectively
compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of
ProcessBench. Our work highlights the value of generative, long CoT PRMs that
can scale test-time compute for verification while requiring minimal
supervision for training. Our code, data, and models will be released at
https://github.com/mukhal/thinkprm.
[LINK]
http://arxiv.org/abs/2504.16828v3
[DATE]
2025-06-24 11:05:02+08:00
[CATEGORIES]
cs.LG
cs.CL
Personality Prediction from Life Stories using Language Models
[AUTHORS]
Rasiq Hussain, Jerry Ma, Rithik Khandelwal, Joshua Oltmanns, Mehak Gupta
[ABSTRACT]
Natural Language Processing (NLP) offers new avenues for personality
assessment by leveraging rich, open-ended text, moving beyond traditional
questionnaires. In this study, we address the challenge of modeling long
narrative interviews, each exceeding 2000 tokens, to predict Five-Factor
Model (FFM) personality traits. We propose a two-step approach: first, we
extract contextual embeddings using sliding-window fine-tuning of pretrained
language models; then, we apply Recurrent Neural Networks (RNNs) with attention
mechanisms to integrate long-range dependencies and enhance interpretability.
This hybrid method effectively bridges the strengths of pretrained transformers
and sequence modeling to handle long-context data. Through ablation studies and
comparisons with state-of-the-art long-context models such as LLaMA and
Longformer, we demonstrate improvements in prediction accuracy, efficiency, and
interpretability. Our results highlight the potential of combining
language-based features with long-context modeling to advance personality
assessment from life narratives.
[COMMENTS]
13 pages, 5 figures
[LINK]
http://arxiv.org/abs/2506.19258v1
[DATE]
2025-06-24 10:39:06+08:00
[CATEGORIES]
cs.CL
cs.LG
MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
[AUTHORS]
Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue, Bo Zheng
[ABSTRACT]
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal
reasoning tasks through enhanced chain-of-thought capabilities. However, this
advancement also introduces novel safety risks, as these models become
increasingly vulnerable to harmful multimodal prompts that can trigger
unethical or unsafe behaviors. Existing safety alignment approaches, primarily
designed for unimodal language models, fall short in addressing the complex and
nuanced threats posed by multimodal inputs. Moreover, current safety datasets
lack the fine-grained, policy-grounded reasoning required to robustly align
reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality
Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align
supports fine-grained, deliberative reasoning over standardized safety policies
across both vision and text modalities. Our data generation pipeline emphasizes
multimodal diversity, policy-grounded reasoning, and rigorous quality filtering
using strong multimodal judges. Extensive experiments demonstrate that
fine-tuning VLMs on MSR-Align substantially improves robustness against both
textual and vision-language jailbreak attacks, while preserving or enhancing
general reasoning performance. MSR-Align provides a scalable and effective
foundation for advancing the safety alignment of reasoning-capable VLMs. Our
dataset is made publicly available at
https://huggingface.co/datasets/Leigest/MSR-Align.
[LINK]
http://arxiv.org/abs/2506.19257v1
[DATE]
2025-06-24 10:37:59+08:00
[CATEGORIES]
cs.CL
Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track
[AUTHORS]
Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch, Brando Miranda, Matthias Gerstgrasser, Susan Zhang, Andreas Haupt, Isha Gupta, Elyas Obbad, Jesse Dodge, Jessica Zosa Forde, Koustuv Sinha, Francesco Orabona, Sanmi Koyejo, David Donoho
[ABSTRACT]
Science progresses by iteratively advancing and correcting humanity’s
understanding of the world. In machine learning (ML) research, rapid
advancements have led to an explosion of publications, but have also led to
misleading, incorrect, flawed or perhaps even fraudulent studies being accepted
and sometimes highlighted at ML conferences due to the fallibility of peer
review. While such mistakes are understandable, ML conferences do not offer
robust processes to help the field systematically correct when such errors are
made. This position paper argues that ML conferences should establish a
dedicated “Refutations and Critiques” (R & C) Track. This R & C Track would
provide a high-profile, reputable platform to support vital research that
critically challenges prior research, thereby fostering a dynamic
self-correcting research ecosystem. We discuss key considerations including
track design, review principles, potential pitfalls, and provide an
illustrative example submission concerning a recent ICLR 2025 Oral. We conclude
that ML conferences should create official, reputable mechanisms to help ML
research self-correct.
[LINK]
http://arxiv.org/abs/2506.19882v1
[DATE]
2025-06-24 10:19:30+08:00
[CATEGORIES]
cs.LG
cs.CL
Augmenting Multi-Agent Communication with State Delta Trajectory
[AUTHORS]
Yichen Tang, Weihang Su, Yujia Zhou, Yiqun Liu, Min Zhang, Shaoping Ma, Qingyao Ai
[ABSTRACT]
Multi-agent techniques such as role playing or multi-turn debates have been
shown to be effective in improving the performance of large language models
(LLMs) in downstream tasks. Despite their differences in workflows, existing
LLM-based multi-agent systems mostly use natural language for agent
communication. While this is appealing for its simplicity and interpretability,
it also introduces inevitable information loss, as one model must downsample
its continuous state vectors to concrete tokens before transferring them to the
other model. Such losses are particularly significant when the information to
transfer is not simple facts but reasoning logic or abstract thoughts. To
tackle this problem, we propose a new communication protocol that transfers
both natural language tokens and token-wise state transition trajectory from
one agent to another. In particular, compared to the actual state value, we find
that the sequence of state changes in LLMs after generating each token can
better reflect the information hidden behind the inference process, so we
propose a State Delta Encoding (SDE) method to represent state transition
trajectories. The experimental results show that multi-agent systems with SDE
achieve SOTA performance compared to other communication protocols,
particularly in tasks that involve complex reasoning. This shows the potential
of communication augmentation for LLM-based multi-agent systems.
[COMMENTS]
22 pages, 5 figures
[LINK]
http://arxiv.org/abs/2506.19209v1
[DATE]
2025-06-24 08:38:25+08:00
[CATEGORIES]
cs.CL
Bayesian Evolutionary Swarm Architecture: A Formal Epistemic System Grounded in Truth-Based Competition
[AUTHORS]
Craig Steven Wright
[ABSTRACT]
We introduce a mathematically rigorous framework for an artificial
intelligence system composed of probabilistic agents evolving through
structured competition and belief revision. The architecture, grounded in
Bayesian inference, measure theory, and population dynamics, defines agent
fitness as a function of alignment with a fixed external oracle representing
ground truth. Agents compete in a discrete-time environment, adjusting
posterior beliefs through observed outcomes, with higher-rated agents
reproducing and lower-rated agents undergoing extinction. Ratings are updated
via pairwise truth-aligned utility comparisons, and belief updates preserve
measurable consistency and stochastic convergence. We introduce hash-based
cryptographic identity commitments to ensure traceability, alongside causal
inference operators using do-calculus. Formal theorems on convergence,
robustness, and evolutionary stability are provided. The system establishes
truth as an evolutionary attractor, demonstrating that verifiable knowledge
arises from adversarial epistemic pressure within a computable, self-regulating
swarm.
[COMMENTS]
83 pages, 14 sections, 92 formal results, no prior conference
publication
[LINK]
http://arxiv.org/abs/2506.19191v1
[DATE]
2025-06-24 07:27:44+08:00
[CATEGORIES]
cs.CL
Prompt, Translate, Fine-Tune, Re-Initialize, or Instruction-Tune? Adapting LLMs for In-Context Learning in Low-Resource Languages
[AUTHORS]
Christopher Toukmaji, Jeffrey Flanigan
[COMMENTS]
Accepted to ACL GEM 2025
[LINK]
http://arxiv.org/abs/2506.19187v1
[DATE]
2025-06-24 07:22:11+08:00
[CATEGORIES]
cs.CL
Transferring Features Across Language Models With Model Stitching
[AUTHORS]
Alan Chen, Jack Merullo, Alessandro Stolfo, Ellie Pavlick
[ABSTRACT]
In this work, we demonstrate that affine mappings between residual streams of
language models are a cheap way to effectively transfer represented features
between models. We apply this technique to transfer the weights of Sparse
Autoencoders (SAEs) between models of different sizes to compare their
representations. We find that small and large models learn similar
representation spaces, which motivates training expensive components like SAEs
on a smaller model and transferring to a larger model at a FLOPs savings. In
particular, using a small-to-large transferred SAE as initialization can lead
to 50% cheaper training runs when training SAEs on larger models. Next, we show
that transferred probes and steering vectors can effectively recover ground
truth performance. Finally, we dive deeper into feature-level transferability,
finding that semantic and structural features transfer noticeably differently
while specific classes of functional features have their roles faithfully
mapped. Overall, our findings illustrate similarities and differences in the
linear representation spaces of small and large models and demonstrate a method
for improving the training efficiency of SAEs.
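
As a rough illustration of what an affine mapping between residual streams involves, the sketch below fits `dst ≈ src @ W + b` by ordinary least squares. It is a minimal sketch under stated assumptions: the synthetic arrays stand in for paired activations from a small and a large model, the function names are hypothetical, and the closed-form fit is an assumption made here for brevity (the abstract does not say how the authors fit their mappings).

```python
import numpy as np

def fit_affine_map(src, dst):
    """Least-squares fit of dst ≈ src @ W + b.

    src: (n_samples, d_src) activations from the source model (assumed paired).
    dst: (n_samples, d_dst) activations from the target model at the same tokens.
    """
    # Append a column of ones so the bias is learned as the last row.
    src_aug = np.hstack([src, np.ones((src.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(src_aug, dst, rcond=None)
    return solution[:-1], solution[-1]  # W: (d_src, d_dst), b: (d_dst,)

def apply_affine_map(x, W, b):
    """Map source-space vectors (e.g. probe or SAE directions) into target space."""
    return x @ W + b

# Synthetic stand-ins for residual-stream activations (hypothetical sizes).
rng = np.random.default_rng(0)
small = rng.normal(size=(256, 8))
true_W = rng.normal(size=(8, 16))
true_b = rng.normal(size=16)
large = small @ true_W + true_b

W, b = fit_affine_map(small, large)
```

On real activations the fit would not be exact; the residual error is one way to gauge how similar the two representation spaces are.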
[LINK]
http://arxiv.org/abs/2506.06609v2
[DATE]
2025-06-24 07:21:57+08:00
[CATEGORIES]
cs.CL
cs.LG
Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data
[AUTHORS]
Yun Tang, Eesung Kim, Vijendra Raj Apsingekar
[ABSTRACT]
A joint speech and text optimization method is proposed for hybrid transducer
and attention-based encoder decoder (TAED) modeling to leverage large amounts
of text corpus and enhance ASR accuracy. The joint TAED (J-TAED) is trained
with both speech and text input modalities together, while it only takes speech
data as input during inference. The trained model can unify the internal
representations from different modalities, and be further extended to
text-based domain adaptation. It can effectively alleviate data scarcity for
mismatched-domain tasks since no speech data is required. Our experiments show
J-TAED successfully integrates speech and linguistic information into one
model, and reduces the WER by 5.8% to 12.8% on the Librispeech dataset. The model
is also evaluated on two out-of-domain datasets, one finance-focused and the
other named-entity-focused. The text-based domain adaptation brings 15.3% and 17.8%
WER reduction on those two datasets respectively.
[COMMENTS]
Accepted by Interspeech2025
[LINK]
http://arxiv.org/abs/2506.19159v1
[DATE]
2025-06-24 05:51:39+08:00
[CATEGORIES]
cs.CL
ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs
[AUTHORS]
Hongyi Liu, Rajarshi Saha, Zhen Jia, Youngsuk Park, Jiaji Huang, Shoham Sabach, Yu-Xiang Wang, George Karypis
[ABSTRACT]
Large Language Models (LLMs) have demonstrated exceptional performance in
natural language processing tasks, yet their massive size makes serving them
inefficient and costly. Semi-structured pruning has emerged as an effective
method for model acceleration, but existing approaches are suboptimal because
they focus on local, layer-wise optimizations using heuristic rules, failing to
leverage global feedback. We present ProxSparse, a learning-based framework for
mask selection enabled by regularized optimization. ProxSparse transforms the
rigid, non-differentiable mask selection process into a smoother optimization
procedure, allowing gradual mask exploration with flexibility. ProxSparse does
not involve additional weight updates once the mask is determined. Our
extensive evaluations on 7 widely used models show that ProxSparse consistently
outperforms previously proposed semi-structured mask selection methods with
significant improvement, demonstrating the effectiveness of our learned
approach towards semi-structured pruning.
[COMMENTS]
ICML25
[LINK]
http://arxiv.org/abs/2502.00258v2
[DATE]
2025-06-24 05:39:56+08:00
[CATEGORIES]
cs.LG
cs.CL
Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series
[AUTHORS]
Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, Wei Wang
[ABSTRACT]
Time series data in real-world applications such as healthcare, climate
modeling, and finance are often irregular, multimodal, and messy, with varying
sampling rates, asynchronous modalities, and pervasive missingness. However,
existing benchmarks typically assume clean, regularly sampled, unimodal data,
creating a significant gap between research and real-world deployment. We
introduce Time-IMM, a dataset specifically designed to capture cause-driven
irregularity in multimodal multivariate time series. Time-IMM represents nine
distinct types of time series irregularity, categorized into trigger-based,
constraint-based, and artifact-based mechanisms. Complementing the dataset, we
introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal
time series, enabling asynchronous integration and realistic evaluation.
IMM-TSF includes specialized fusion modules, including a timestamp-to-text
fusion module and a multimodality fusion module, which support both
recency-aware averaging and attention-based integration strategies. Empirical
results demonstrate that explicitly modeling multimodality on irregular time
series data leads to substantial gains in forecasting performance. Time-IMM and
IMM-TSF provide a foundation for advancing time series analysis under
real-world conditions. The dataset is publicly available at
https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the
benchmark library can be accessed at
https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.
[COMMENTS]
This paper is currently under review
[LINK]
http://arxiv.org/abs/2506.10412v2
[DATE]
2025-06-24 05:10:15+08:00
[CATEGORIES]
cs.LG
cs.CL
TRAIL: Trace Reasoning and Agentic Issue Localization
[AUTHORS]
Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian
[ABSTRACT]
The increasing adoption of agentic workflows across diverse domains brings a
critical need to scalably and systematically evaluate the complex traces these
systems generate. Current evaluation methods depend on manual, domain-specific
human analysis of lengthy workflow traces - an approach that does not scale
with the growing complexity and volume of agentic outputs. Error analysis in
these settings is further complicated by the interplay of external tool outputs
and language model reasoning, making it more challenging than traditional
software debugging. In this work, we (1) articulate the need for robust and
dynamic evaluation methods for agentic workflow traces, (2) introduce a formal
taxonomy of error types encountered in agentic systems, and (3) present a set
of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and
grounded in established agentic benchmarks. To ensure ecological validity, we
curate traces from both single and multi-agent systems, focusing on real-world
applications such as software engineering and open-world information retrieval.
Our evaluations reveal that modern long context LLMs perform poorly at trace
debugging, with the best Gemini-2.5-pro model scoring a mere 11% on TRAIL. Our
dataset and code are made publicly available to support and accelerate future
research in scalable evaluation for agentic workflows.
[COMMENTS]
Dataset: https://huggingface.co/datasets/PatronusAI/TRAIL
[LINK]
http://arxiv.org/abs/2505.08638v3
[DATE]
2025-06-24 05:06:11+08:00
[CATEGORIES]
cs.CL
ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
[AUTHORS]
Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao
[ABSTRACT]
Recent research has shown that Large Language Models (LLMs) are vulnerable to
automated jailbreak attacks, where adversarial suffixes crafted by algorithms
appended to harmful queries bypass safety alignment and trigger unintended
responses. Current methods for generating these suffixes are computationally
expensive and have low Attack Success Rates (ASR), especially against
well-aligned models like Llama2 and Llama3. To overcome these limitations, we
introduce ADV-LLM, an iterative self-tuning process that crafts adversarial
LLMs with enhanced jailbreak ability. Our framework significantly reduces the
computational cost of generating adversarial suffixes while achieving nearly
100% ASR on various open-source LLMs. Moreover, it exhibits strong attack
transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49%
ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving
jailbreak ability, ADV-LLM provides valuable insights for future safety
alignment research through its ability to generate large datasets for studying
LLM safety.
[COMMENTS]
Accepted to NAACL 2025 Main (oral)
[LINK]
http://arxiv.org/abs/2410.18469v4
[DATE]
2025-06-24 04:12:31+08:00
[CATEGORIES]
cs.CL
cs.LG
Small Language Models in the Real World: Insights from Industrial Text Classification
[AUTHORS]
Lujun Li, Lama Sleem, Niccolo’ Gentile, Geoffrey Nichil, Radu State
[ABSTRACT]
With the emergence of ChatGPT, Transformer models have significantly advanced
text classification and related tasks. Decoder-only models such as Llama
exhibit strong performance and flexibility, yet they suffer from inefficiency
on inference due to token-by-token generation, and their effectiveness in text
classification tasks heavily depends on prompt quality. Moreover, their
substantial GPU resource requirements often limit widespread adoption. Thus,
the question of whether smaller language models are capable of effectively
handling text classification tasks emerges as a topic of significant interest.
However, the selection of appropriate models and methodologies remains largely
underexplored. In this paper, we conduct a comprehensive evaluation of prompt
engineering and supervised fine-tuning methods for transformer-based text
classification. Specifically, we focus on practical industrial scenarios,
including email classification, legal document categorization, and the
classification of extremely long academic texts. We examine the strengths and
limitations of smaller models, with particular attention to both their
performance and their efficiency in Video Random-Access Memory (VRAM)
utilization, thereby providing valuable insights for the local deployment and
application of compact models in industrial settings.
[COMMENTS]
This paper has been accepted as a conference paper in the Industry
Track of the 63rd Annual Meeting of the Association for Computational
Linguistics (ACL)
[LINK]
http://arxiv.org/abs/2505.16078v3
[DATE]
2025-06-24 04:09:36+08:00
[CATEGORIES]
cs.CL
Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting
[AUTHORS]
Nathaniel Getachew, Abulhair Saparov
[ABSTRACT]
We introduce StorySim, a programmable framework for synthetically
generating stories to evaluate the theory of mind (ToM) and world modeling (WM)
capabilities of large language models (LLMs). Unlike prior benchmarks that may
suffer from contamination in pretraining data, StorySim produces
novel, compositional story prompts anchored by a highly controllable
Storyboard, enabling precise manipulation of character perspectives
and events. We use this framework to design first- and second-order ToM tasks
alongside WM tasks that control for the ability to track and model mental
states. Our experiments across a suite of state-of-the-art LLMs reveal that
most models perform better on WM tasks than ToM tasks, and that models tend to
reason better about humans than about inanimate objects.
Additionally, our framework enabled us to find evidence of heuristic behavior
such as recency bias and an over-reliance on earlier events in the story. All
code for generating data and evaluations is freely available.
[COMMENTS]
14 pages, 11 figures
[LINK]
http://arxiv.org/abs/2506.19089v1
[DATE]
2025-06-24 04:06:53+08:00
[CATEGORIES]
cs.CL
HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models
[AUTHORS]
Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki
[ABSTRACT]
Improving the visual understanding ability of vision-language models (VLMs)
is crucial for enhancing their performance across various tasks. While using
multiple pretrained visual experts has shown great promise, it often incurs
significant computational costs during training and inference. To address this
challenge, we propose HAWAII, a novel framework that distills knowledge from
multiple visual experts into a single vision encoder, enabling it to inherit
the complementary strengths of several experts with minimal computational
overhead. To mitigate conflicts among different teachers and switch between
different teacher-specific knowledge, instead of using a fixed set of adapters
for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation
(LoRA) adapters with a corresponding router. Each adapter is aligned with a
specific teacher, avoiding noisy guidance during distillation. To enable
efficient knowledge distillation, we propose fine-grained and coarse-grained
distillation. At the fine-grained level, token importance scores are employed
to emphasize the most informative tokens from each teacher adaptively. At the
coarse-grained level, we summarize the knowledge from multiple teachers and
transfer it to the student using a set of general-knowledge LoRA adapters with
a router. Extensive experiments on various vision-language tasks demonstrate
the superiority of HAWAII, compared to the popular open-source VLMs.
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2506.19072v1
[DATE]
2025-06-24 03:43:25+08:00
[CATEGORIES]
cs.CL
NLPnorth @ TalentCLEF 2025: Comparing Discriminative, Contrastive, and Prompt-Based Methods for Job Title and Skill Matching
[AUTHORS]
Mike Zhang, Rob van der Goot
[ABSTRACT]
Matching job titles is a highly relevant task in the computational job market
domain, as it improves, for example, automatic candidate matching, career path
prediction, and job market analysis. Furthermore, aligning job titles to job
skills can be considered an extension to this task, with similar relevance for
the same downstream tasks. In this report, we outline NLPnorth’s submission to
TalentCLEF 2025, which includes both of these tasks: Multilingual Job Title
Matching, and Job Title-Based Skill Prediction. For both tasks we compare
(fine-tuned) classification-based, (fine-tuned) contrastive-based, and
prompting methods. We observe that for Task A, our prompting approach performs
best with an average of 0.492 mean average precision (MAP) on test data,
averaged over English, Spanish, and German. For Task B, we obtain an MAP of
0.290 on test data with our fine-tuned classification-based approach.
Additionally, we made use of extra data by pulling all the language-specific
titles and corresponding descriptions from ESCO for each job and skill.
Overall, we find that the largest multilingual language models perform best for
both tasks. Per the provisional results and only counting the unique teams, the
ranking on Task A is 5th/20 and on Task B 3rd/14.
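
For readers unfamiliar with the metric reported above, the following sketch computes mean average precision (MAP) using the standard textbook definition. It is only an illustration: the function names and toy rankings are hypothetical, and the shared task's official scorer may handle ties, cut-offs, or empty relevance sets differently.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention assumed here; scorers differ on empty queries
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)

def mean_average_precision(rankings, relevants):
    """Mean of per-query average precision over all queries."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)

# Toy example with two queries (hypothetical job-title identifiers).
rankings = [["a", "b", "c", "d"], ["x", "y"]]
relevants = [{"a", "c"}, {"y"}]
map_score = mean_average_precision(rankings, relevants)
```

In the toy example, the first query scores (1 + 2/3) / 2 = 5/6 and the second 1/2, so the mean is 2/3.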
[COMMENTS]
TalentCLEF 2025
[LINK]
http://arxiv.org/abs/2506.19058v1
[DATE]
2025-06-24 03:18:25+08:00
[CATEGORIES]
cs.CL
Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages
[AUTHORS]
Baban Gain, Dibyanayan Bandyopadhyay, Samrat Mukherjee, Chandranath Adak, Asif Ekbal
[ABSTRACT]
Neural Machine Translation (NMT) has made remarkable progress using
large-scale textual data, but the potential of incorporating multimodal inputs,
especially visual information, remains underexplored in high-resource settings.
While prior research has focused on using multimodal data in low-resource
scenarios, this study examines how image features impact translation when added
to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study
finds that images might be redundant in this context. Additionally, the
research introduces synthetic noise to assess whether images help the model
handle textual noise. Multimodal models slightly outperform text-only models in
noisy settings, even when random images are used. The study’s experiments
translate from English to Hindi, Bengali, and Malayalam, significantly
outperforming state-of-the-art benchmarks. Interestingly, the effect of visual
context varies with the level of source text noise: no visual context works
best for non-noisy translations, cropped image features are optimal for low
noise, and full image features perform better in high-noise scenarios. This
sheds light on the role of visual context, especially in noisy settings, and
opens up a new research direction for Noisy Neural Machine Translation in
multimodal setups. The research emphasizes the importance of combining visual
and textual information to improve translation across various environments. Our
code is publicly available at https://github.com/babangain/indicMMT.
[LINK]
http://arxiv.org/abs/2308.16075v2
[DATE]
2025-06-24 03:07:19+08:00
[CATEGORIES]
cs.CL
Self-reflecting Large Language Models: A Hegelian Dialectical Approach
[AUTHORS]
Sara Abdali, Can Goksen, Michael Solodko, Saeed Amizadeh, Julie E. Maybee, Kazuhito Koishida
[ABSTRACT]
Investigating NLP through a philosophical lens has recently caught
researchers’ eyes, as it bridges computational methods with classical schools
of philosophy. This paper introduces a philosophical framework inspired by the
Hegelian Dialectic to enable LLMs’ self-reflection, utilizing a
self-dialectical approach to emulate internal critiques and synthesize new
scientific ideas (spanning domains such as mathematics, physics, and more).
Additionally, we explore the effect of generation temperature in LLMs by
introducing a dynamic annealing approach, which encourages creativity in the
early stages and gradually focuses on refinement and nuance, as well as a
constant-temperature strategy. Furthermore, we implement a Multi-Agent Majority
Voting (MAMV) strategy to assess the validity and novelty of the generated
ideas, which proves useful in the absence of domain experts. We also evaluate
the effectiveness of our method in generating novel scientific ideas and
improving LLMs’ reasoning capabilities. Our experiments demonstrate promising
results in ideation, along with significant improvements in mathematical and
symbolic reasoning.
[LINK]
http://arxiv.org/abs/2501.14917v6
[DATE]
2025-06-24 02:59:06+08:00
[CATEGORIES]
cs.CL
cs.LG
Plan for Speed – Dilated Scheduling for Masked Diffusion Language Models
[AUTHORS]
Omer Luxembourg, Haim Permuter, Eliya Nachmani
[ABSTRACT]
Masked diffusion language models (MDLM) have shown strong promise for
non-autoregressive text generation, yet existing samplers act as implicit
planners, selecting tokens to unmask via denoiser confidence or entropy scores.
Such heuristics falter under parallel unmasking - they ignore pairwise
interactions between tokens and cannot account for dependencies when unmasking
multiple positions at once, limiting their inference speed to that of traditional
autoregressive (AR) models. We introduce the Dilated-scheduled Unmasking
Strategy (DUS), an inference-only, planner-model-free method that requires no
additional training. DUS leverages a first-order Markov assumption to partition
sequence positions into dilation-based groups of non-adjacent tokens, enabling
independent, parallel unmasking steps that respect local context and minimize
the joint entropy of each iteration step. Unlike semi-AR block approaches
(e.g., LLADA and Dream) that still invoke the denoiser per block, DUS reduces
the number of denoiser calls to O(log B) per generation block - yielding
substantial speedup over the O(B) run time of state-of-the-art diffusion
models, where B is the block size in the semi-AR inference process. In
experiments on math (GSM8K) and code completion (Humaneval, MBPP) benchmarks -
domains suited to non-ordinal generation - DUS improves scores over a parallel
confidence-based planner, without modifying the underlying denoiser. DUS offers
a lightweight, budget-aware approach to efficient, high-quality text
generation, paving the way to unlock the true capabilities of MDLMs.
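
One way to see how O(log B) denoiser calls can cover a block of B positions is a coarse-to-fine partition in which each group contains only non-adjacent positions. The sketch below is an assumption made for illustration: the function name is hypothetical, the block size is assumed to be a power of two, and the schedule shown is merely consistent with the description above, not necessarily the one the authors use.

```python
def dilated_groups(block_size):
    """Partition positions 0..block_size-1 into coarse-to-fine groups.

    Each group would be unmasked in a single parallel denoiser call.
    Positions within a group are at least two apart, so no two tokens
    unmasked in the same step are adjacent.
    """
    if block_size < 1 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two in this sketch")
    groups = [[0]]
    stride = block_size
    while stride > 1:
        half = stride // 2
        # Positions halfway between those already unmasked at this stride.
        groups.append(list(range(half, block_size, stride)))
        stride = half
    return groups

groups = dilated_groups(8)
# groups == [[0], [4], [2, 6], [1, 3, 5, 7]]
```

For B = 8 this yields log2(8) + 1 = 4 groups, compared with 8 steps when unmasking one position at a time.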
[LINK]
http://arxiv.org/abs/2506.19037v1
[DATE]
2025-06-24 02:49:23+08:00
[CATEGORIES]
cs.CL
cs.LG
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
[AUTHORS]
Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith
[ABSTRACT]
Modern tokenizers employ deterministic algorithms to map text into a single
“canonical” token sequence, yet the same string can be encoded as many
non-canonical tokenizations using the tokenizer vocabulary. In this work, we
investigate the robustness of LMs to text encoded with non-canonical
tokenizations entirely unseen during training. Surprisingly, when evaluated
across 20 benchmarks, we find that instruction-tuned models retain up to 93.4%
of their original performance when given a randomly sampled tokenization, and
90.8% with character-level tokenization. We see that overall stronger models
tend to be more robust, and robustness diminishes as the tokenization departs
farther from the canonical form. Motivated by these results, we then identify
settings where non-canonical tokenization schemes can improve performance,
finding that character-level segmentation improves string manipulation and code
understanding tasks by up to +14%, and right-aligned digit grouping enhances
large-number arithmetic by +33%. Finally, we investigate the source of this
robustness, finding that it arises in the instruction-tuning phase. We show
that while both base and post-trained models grasp the semantics of
non-canonical tokenizations (perceiving them as containing misspellings), base
models try to mimic the imagined mistakes and degenerate into nonsensical
output, while post-trained models are committed to fluent responses. Overall,
our findings suggest that models are less tied to their tokenizer than
previously believed, and demonstrate the promise of intervening on tokenization
at inference time to boost performance.
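
To make the idea of right-aligned digit grouping concrete, the sketch below splits a digit string into groups counted from the right, next to the left-aligned split for contrast. The function names and the group size of 3 are assumptions made for illustration; the abstract does not state the group size, and mapping the groups to token IDs would depend on the tokenizer's vocabulary, which is not shown.

```python
def group_digits_right_aligned(digits, group_size=3):
    """Split a digit string into groups aligned to the least significant digit."""
    if not digits.isdigit():
        raise ValueError("expected a non-empty string of digits")
    first = len(digits) % group_size  # length of the short leading group, if any
    groups = [digits[:first]] if first else []
    groups += [digits[i:i + group_size]
               for i in range(first, len(digits), group_size)]
    return groups

def group_digits_left_aligned(digits, group_size=3):
    """Split a digit string into groups starting from the left, for contrast."""
    return [digits[i:i + group_size] for i in range(0, len(digits), group_size)]

right = group_digits_right_aligned("1234567")  # ['1', '234', '567']
left = group_digits_left_aligned("1234567")    # ['123', '456', '7']
```

With right alignment, each group lines up with thousands, millions, and so on, regardless of how many digits the number has.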
[COMMENTS]
preprint
[LINK]
http://arxiv.org/abs/2506.19004v1
[DATE]
2025-06-24 02:02:26+08:00
[CATEGORIES]
cs.CL
Mirage of Mastery: Memorization Tricks LLMs into Artificially Inflated Self-Knowledge
[AUTHORS]
Sahil Kale, Vijaykant Nadadur
[COMMENTS]
Accepted to the Pre-ACL Workshop 2025, Copenhagen
[LINK]
http://arxiv.org/abs/2506.18998v1
[DATE]
2025-06-24 02:01:16+08:00
[CATEGORIES]
cs.CL
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
[AUTHORS]
Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang
[ABSTRACT]
This paper presents a multimodal framework that attempts to unify visual
understanding and generation within a shared discrete semantic representation.
At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into
discrete tokens using a text-aligned codebook projected from a large language
model’s (LLM) vocabulary. By integrating vision and text into a unified space
with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input
and output through a shared interface, without the need for modality-specific
designs. Additionally, we propose scale-adaptive encoding and decoding to
balance efficiency and visual detail, along with a generative de-tokenizer to
produce high-fidelity visual outputs. To address diverse decoding needs, we
utilize two complementary de-tokenizers: a fast autoregressive model and a
diffusion-based model. To enhance modality fusion, we investigate advanced
pre-training tasks, demonstrating improvements in both visual understanding and
generation. Experiments across benchmarks show that Tar matches or surpasses
existing multimodal LLM methods, achieving faster convergence and greater
training efficiency. Code, models, and data are available at
https://tar.csuhan.com
[COMMENTS]
Project page: https://tar.csuhan.com
[LINK]
http://arxiv.org/abs/2506.18898v1
[DATE]
2025-06-24 01:59:14+08:00
[CATEGORIES]
cs.CL
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
[AUTHORS]
Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang
[ABSTRACT]
Process Reward Models (PRMs) have recently emerged as a powerful framework
for supervising intermediate reasoning steps in large language models (LLMs).
Previous PRMs are primarily trained on model final output responses and
struggle to evaluate intermediate thinking trajectories robustly, especially in
the emerging setting of trajectory-response outputs generated by frontier
reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a
novel trajectory-aware PRM explicitly designed to evaluate the
trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both
step-level and trajectory-level supervision, enabling fine-grained reward
assignment aligned with structured chain-of-thought data. We adapt
ReasonFlux-PRM to support reward supervision under both offline and online
settings, including (i) selecting high-quality model distillation data for
downstream supervised fine-tuning of smaller models, (ii) providing dense
process-level rewards for policy optimization during reinforcement learning,
and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results
on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond
demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs
(e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our
derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving
average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement
learning, and 6.3% in test-time scaling. We also release our efficient
ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment.
Projects: https://github.com/Gen-Verse/ReasonFlux
[COMMENTS]
Codes and Models: https://github.com/Gen-Verse/ReasonFlux
[LINK]
http://arxiv.org/abs/2506.18896v1
[DATE]
2025-06-24 01:59:02+08:00
[CATEGORIES]
cs.CL
OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
[AUTHORS]
Yiyou Sun, Shawn Hu, Georgia Zhou, Ken Zheng, Hannaneh Hajishirzi, Nouha Dziri, Dawn Song
[LINK]
http://arxiv.org/abs/2506.18880v1
[DATE]
2025-06-24 01:51:40+08:00
[CATEGORIES]
cs.CL
CommVQ: Commutative Vector Quantization for KV Cache Compression
[AUTHORS]
Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan
[ABSTRACT]
Large Language Models (LLMs) are increasingly used in applications requiring
long context lengths, but the key-value (KV) cache often becomes a memory
bottleneck on GPUs as context grows. To address this, we propose Commutative
Vector Quantization (CommVQ) to significantly reduce memory usage for
long-context LLM inference. We first introduce additive quantization with a
lightweight encoder and codebook to compress the KV cache, which can be decoded
via simple matrix multiplication. To further reduce computational costs during
decoding, we design the codebook to be commutative with Rotary Position
Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm.
This enables efficient integration of decoding into the self-attention
mechanism. Our approach achieves high accuracy with additive quantization and
low overhead via the RoPE-commutative codebook. Experiments on long-context
benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5%
with 2-bit quantization, while outperforming state-of-the-art KV cache
quantization methods. Notably, it enables 1-bit KV cache quantization with
minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context
length on a single RTX 4090 GPU. The source code is available at:
https://github.com/UMass-Embodied-AGI/CommVQ.
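
The 87.5% figure follows from the bit widths alone: 2 bits in place of 16 leaves 2/16 = 12.5% of the original size. The sketch below works through that arithmetic and a back-of-the-envelope cache size. The model dimensions (32 layers, 8 KV heads, head dimension 128) are assumptions taken to approximate a LLaMA-3.1 8B configuration, the function names are hypothetical, and codebook and encoder overhead is ignored.

```python
def kv_cache_reduction(quant_bits, baseline_bits=16):
    """Fraction of KV cache memory saved, ignoring codebook overhead."""
    return 1.0 - quant_bits / baseline_bits

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Approximate KV cache size in bytes for a given context length."""
    values_per_token = 2 * layers * kv_heads * head_dim  # keys and values
    return tokens * values_per_token * bits // 8

GIB = 2 ** 30
context = 128 * 1024  # 128K tokens

# Assumed dimensions, not taken from the abstract.
fp16_cache = kv_cache_bytes(context, layers=32, kv_heads=8, head_dim=128, bits=16)
one_bit_cache = kv_cache_bytes(context, layers=32, kv_heads=8, head_dim=128, bits=1)

print(kv_cache_reduction(2))   # 0.875
print(fp16_cache / GIB)        # 16.0
print(one_bit_cache / GIB)     # 1.0
```

Under these assumptions the FP16 cache alone would take about 16 GiB at 128K tokens, which suggests why aggressive cache quantization matters on a single consumer GPU.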
[COMMENTS]
ICML 2025 poster
[LINK]
http://arxiv.org/abs/2506.18879v1
[DATE]
2025-06-24 01:50:11+08:00
[CATEGORIES]
cs.CL
A Comment On “The Illusion of Thinking”: Reframing the Reasoning Cliff as an Agentic Gap
[AUTHORS]
Sheraz Khan, Subha Madhavan, Kannan Natarajan
[ABSTRACT]
The recent work by Shojaee et al. (2025), titled The Illusion of Thinking:
Understanding the Strengths and Limitations of Reasoning Models via the Lens of
Problem Complexity, presents a compelling empirical finding, a reasoning cliff,
where the performance of Large Reasoning Models (LRMs) collapses beyond a
specific complexity threshold, which the authors posit as an intrinsic scaling
limitation of Chain-of-Thought (CoT) reasoning. This commentary, while
acknowledging the study’s methodological rigor, contends that this conclusion
is confounded by experimental artifacts. We argue that the observed failure is
not evidence of a fundamental cognitive boundary, but rather a predictable
outcome of system-level constraints in the static, text-only evaluation
paradigm, including tool use restrictions, context window recall issues, the
absence of crucial cognitive baselines, inadequate statistical reporting, and
output generation limits. We reframe this performance collapse through the lens
of an agentic gap, asserting that the models are not failing at reasoning, but
at execution within a profoundly restrictive interface. We empirically
substantiate this critique by demonstrating a striking reversal. A model,
initially declaring a puzzle impossible when confined to text-only generation,
now employs agentic tools to not only solve it but also master variations of
complexity far beyond the reasoning cliff it previously failed to surmount.
Additionally, our empirical analysis of tool-enabled models like o4-mini and
GPT-4o reveals a hierarchy of agentic reasoning, from simple procedural
execution to complex meta-cognitive self-correction, which has significant
implications for how we define and measure machine intelligence. The illusion
of thinking attributed to LRMs is less a reasoning deficit and more a
consequence of an otherwise capable mind lacking the tools for action.
[COMMENTS]
10 pages, 2 figures, Comment on “The Illusion of Thinking:
Understanding the Strengths and Limitations of Reasoning Models via the Lens
of Problem Complexity” (arXiv:2506.06941v1)
[LINK]
http://arxiv.org/abs/2506.18957v1
[DATE]
2025-06-24 01:14:21+08:00
[CATEGORIES]
cs.CL
cs.LG
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
[AUTHORS]
Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li
[ABSTRACT]
Ultra-long generation by large language models (LLMs) is a widely demanded
scenario, yet it remains a significant challenge due to their maximum
generation length limit and overall quality degradation as sequence length
increases. Previous approaches, exemplified by LongWriter, typically rely on
“teaching”, which involves supervised fine-tuning (SFT) on synthetic
long-form outputs. However, this strategy heavily depends on synthetic SFT
data, which is difficult and costly to construct, often lacks coherence and
consistency, and tends to be overly artificial and structurally monotonous. In
this work, we propose an incentivization-based approach that, starting entirely
from scratch and without relying on any annotated or synthetic data, leverages
reinforcement learning (RL) to foster the emergence of ultra-long, high-quality
text generation capabilities in LLMs. We perform RL training starting from a
base model, similar to R1-Zero, guiding it to engage in reasoning that
facilitates planning and refinement during the writing process. To support
this, we employ specialized reward models that steer the LLM towards improved
length control, writing quality, and structural formatting. Experimental
evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B,
consistently outperforms traditional SFT methods on long-form writing tasks,
achieving state-of-the-art results across all metrics on WritingBench and
Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and
Qwen3-235B. We open-source our data and model checkpoints under
https://huggingface.co/THU-KEG/LongWriter-Zero-32B
[LINK]
http://arxiv.org/abs/2506.18841v1
[DATE]
2025-06-24 00:59:02+08:00
[CATEGORIES]
cs.CL
cs.LG
EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions
[AUTHORS]
Spencer Hong, Meng Luo, Xinyi Wan
[ABSTRACT]
Determining the veracity of atomic claims is an imperative component of many
recently proposed fact-checking systems. Many approaches tackle this problem by
first retrieving evidence by querying a search engine and then performing
classification by providing the evidence set and atomic claim to a large
language model, but this process deviates from what a human would do in order
to perform the task. Recent work attempted to address this issue by proposing
iterative evidence retrieval, allowing for evidence to be collected several
times and only when necessary. Continuing along this line of research, we
propose a novel claim verification system, called EMULATE, which is designed to
better emulate human actions through the use of a multi-agent framework where
each agent performs a small part of the larger task, such as ranking search
results according to predefined criteria or evaluating webpage content.
Extensive experiments on several benchmarks show clear improvements over prior
work, demonstrating the efficacy of our new multi-agent framework.
[COMMENTS]
FEVER 2025 (co-located with ACL 2025)
[LINK]
http://arxiv.org/abs/2505.16576v2
[DATE]
2025-06-24 00:58:51+08:00
[CATEGORIES]
cs.CL
STU-PID: Steering Token Usage via PID Controller for Efficient Large Language Model Reasoning
[AUTHORS]
Aryasomayajula Ram Bharadwaj
[ABSTRACT]
Large Language Models employing extended chain-of-thought (CoT) reasoning
often suffer from the overthinking phenomenon, generating excessive and
redundant reasoning steps that increase computational costs while potentially
degrading performance. While recent work has explored static steering
approaches to mitigate this issue, they lack the adaptability to dynamically
adjust intervention strength based on real-time reasoning quality. We propose
STUPID (Steering Token Usage via PID controller), a novel training-free method
that employs a PID controller to dynamically modulate activation steering
strength during inference. Our approach combines a chunk-level classifier for
detecting redundant reasoning patterns with a PID control mechanism that
adaptively adjusts steering intensity based on the predicted redundancy
probability. Experimental evaluation on GSM8K demonstrates that STUPID achieves
a 6% improvement in accuracy while reducing token usage by 32%, outperforming
static steering baselines. Our method provides a principled framework for
dynamic reasoning calibration that maintains reasoning quality while
significantly improving computational efficiency.
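
As a rough illustration of the control loop described above, the following Python sketch shows a textbook discrete PID controller that maps a predicted redundancy probability to a steering strength. The gains, the target probability, the clamping range, and the names `PIDController` and `steering_strength` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller (illustrative gains, not the paper's)."""
    kp: float = 1.0
    ki: float = 0.1
    kd: float = 0.05
    target: float = 0.2        # assumed acceptable redundancy probability
    max_strength: float = 2.0  # assumed upper bound on steering strength
    _integral: float = 0.0
    _prev_error: float = 0.0

    def steering_strength(self, redundancy_prob: float) -> float:
        """Return a clamped steering strength for one reasoning chunk."""
        error = redundancy_prob - self.target
        self._integral += error
        derivative = error - self._prev_error
        self._prev_error = error
        raw = self.kp * error + self.ki * self._integral + self.kd * derivative
        # Steering is never negative and never exceeds the assumed maximum.
        return min(max(raw, 0.0), self.max_strength)


controller = PIDController()
# Hypothetical per-chunk outputs of a redundancy classifier.
strengths = [controller.steering_strength(p) for p in (0.1, 0.5, 0.9, 0.3)]
```

In this sketch the strength stays at zero while redundancy is below the target and grows as redundant chunks accumulate. A practical controller would likely also need anti-windup on the integral term, which is omitted here.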
[LINK]
http://arxiv.org/abs/2506.18831v1
[DATE]
2025-06-24 00:47:19+08:00
[CATEGORIES]
cs.CL
MLLP-VRAIN UPV system for the IWSLT 2025 Simultaneous Speech Translation Translation task
[AUTHORS]
Jorge Iranzo-Sánchez, Javier Iranzo-Sánchez, Adrià Giménez, Jorge Civera, Alfons Juan
[ABSTRACT]
This work describes the participation of the MLLP-VRAIN research group in the
shared task of the IWSLT 2025 Simultaneous Speech Translation track. Our
submission addresses the unique challenges of real-time translation of
long-form speech by developing a modular cascade system that adapts strong
pre-trained models to streaming scenarios. We combine Whisper Large-V3-Turbo
for ASR with the multilingual NLLB-3.3B model for MT, implementing lightweight
adaptation techniques rather than training new end-to-end models from scratch.
Our approach employs document-level adaptation with prefix training to enhance
the MT model’s ability to handle incomplete inputs, while incorporating
adaptive emission policies including a wait-$k$ strategy and RALCP for managing
the translation stream. Specialized buffer management techniques and
segmentation strategies ensure coherent translations across long audio
sequences. Experimental results on the ACL60/60 dataset demonstrate that our
system achieves a favorable balance between translation quality and latency,
with a BLEU score of 31.96 and non-computational-aware StreamLAAL latency of
2.94 seconds. Our final model achieves a preliminary score on the official test
set (IWSLT25Instruct) of 29.8 BLEU. Our work demonstrates that carefully
adapted pre-trained components can create effective simultaneous translation
systems for long-form content without requiring extensive in-domain parallel
data or specialized end-to-end training.
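
For readers unfamiliar with the wait-$k$ strategy mentioned above, a minimal Python sketch of the standard read/write schedule follows. It assumes the usual definition, in which the system first reads $k$ source tokens and then alternates between writing one target token and reading one more source token. The function name `wait_k_schedule` is hypothetical, and the submission's RALCP policy and buffer management are not shown.

```python
def wait_k_schedule(num_source: int, num_target: int, k: int) -> list[int]:
    """Return how many source tokens have been read before each target token.

    Target token t (0-indexed) is emitted after min(k + t, num_source)
    source tokens are available.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [min(k + t, num_source) for t in range(num_target)]


# With 5 source tokens, 6 target tokens, and k = 2, the policy waits for
# 2 tokens, then reads one more per emitted token until the source ends.
schedule = wait_k_schedule(num_source=5, num_target=6, k=2)
```

Smaller values of `k` lower latency but force the model to translate from shorter prefixes, which is the trade-off the abstract describes between translation quality and latency.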
[COMMENTS]
IWSLT 2025 System Description
[LINK]
http://arxiv.org/abs/2506.18828v1
[DATE]
2025-06-24 00:44:01+08:00
[CATEGORIES]
cs.CL
RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
[AUTHORS]
Arjun Mukerji, Michael L. Jackson, Jason Jones, Neil Sanghavi
[ABSTRACT]
Large Language Models (LLMs) have been extensively evaluated for general
summarization tasks as well as medical research assistance, but they have not
been specifically evaluated for the task of summarizing real-world evidence
(RWE) from structured output of RWE studies. We introduce RWESummary, a
proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al.,
2025) to enable benchmarking of LLMs for this task. RWESummary includes one
scenario and three evaluations covering major types of errors observed in
summarization of medical research studies and was developed using Atropos
Health proprietary data. Additionally, we use RWESummary to compare the
performance of different LLMs in our internal RWE summarization tool. At the
time of publication, with 13 distinct RWE studies, we found the Gemini 2.5
models performed best overall (both Flash and Pro). We suggest RWESummary as a
novel and useful foundation model benchmark for real-world evidence study
summarization.
[COMMENTS]
24 pages, 2 figures
[LINK]
http://arxiv.org/abs/2506.18819v1
[DATE]
2025-06-24 00:28:03+08:00
[CATEGORIES]
cs.CL
Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models
[AUTHORS]
Aradhye Agarwal, Suhas K Ramesh, Ayan Sengupta, Tanmoy Chakraborty
[ABSTRACT]
Fine-tuning large language models (LLMs) on downstream tasks requires
substantial computational resources. Selective PEFT, a class of
parameter-efficient fine-tuning (PEFT) methodologies, aims to mitigate these
computational challenges by selectively fine-tuning only a small fraction of
the model parameters. Although parameter-efficient, these techniques often fail
to match the performance of fully fine-tuned models, primarily due to inherent
biases introduced during parameter selection. Traditional selective PEFT
techniques use a fixed set of parameters selected using different importance
heuristics, failing to capture parameter importance dynamically and often
leading to suboptimal performance. We introduce $\text{ID}^3$, a novel
selective PEFT method that calculates parameter importance continually, and
dynamically unmasks parameters by balancing exploration and exploitation in
parameter selection. Our empirical study on 16 tasks spanning natural language
understanding, mathematical reasoning and summarization demonstrates the
effectiveness of our method compared to fixed-masking selective PEFT
techniques. We analytically show that $\text{ID}^3$ reduces the number of
gradient updates by a factor of two, enhancing computational efficiency. Since
$\text{ID}^3$ is robust to random initialization of neurons and operates
directly on the optimization process, it is highly flexible and can be
integrated with existing additive and reparametrization-based PEFT techniques
such as adapters and LoRA respectively.
[COMMENTS]
15 pages, 7 tables, 9 figures
[LINK]
http://arxiv.org/abs/2408.14470v3
[DATE]
2025-06-24 00:25:27+08:00
[CATEGORIES]
cs.CL
Existing LLMs Are Not Self-Consistent For Simple Tasks
[AUTHORS]
Zhenru Lin, Jiawen Tao, Yang Yuan, Andrew Chi-Chih Yao
[ABSTRACT]
Large Language Models (LLMs) have grown increasingly powerful, yet ensuring
their decisions remain transparent and trustworthy requires self-consistency –
no contradictions in their internal reasoning. Our study reveals that even on
simple tasks, such as comparing points on a line or a plane, or reasoning in a
family tree, all smaller models are highly inconsistent, and even
state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully
self-consistent. To quantify and mitigate these inconsistencies, we introduce
inconsistency metrics and propose two automated methods – a graph-based and an
energy-based approach. While these fixes provide partial improvements, they
also highlight the complexity and importance of self-consistency in building
more reliable and interpretable AI. The code and data are available at
https://github.com/scorpio-nova/llm-self-consistency.
[COMMENTS]
10 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.18781v1
[DATE]
2025-06-23 23:50:21+08:00
[CATEGORIES]
cs.CL
Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training
[AUTHORS]
Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel, Jakob Foerster, Laura Ruis
[ABSTRACT]
Training large language models (LLMs) on source code significantly enhances
their general-purpose reasoning abilities, but the mechanisms underlying this
generalisation are poorly understood. In this paper, we propose Programming by
Backprop (PBB) as a potential driver of this effect - teaching a model to
evaluate a program for inputs by training on its source code alone, without
ever seeing I/O examples. To explore this idea, we finetune LLMs on two sets of
programs representing simple maths problems and algorithms: one with source
code and I/O examples (w/ IO), the other with source code only (w/o IO). We
find evidence that LLMs have some ability to evaluate w/o IO programs for
inputs in a range of experimental settings, and make several observations.
Firstly, PBB works significantly better when programs are provided as code
rather than semantically equivalent language descriptions. Secondly, LLMs can
produce outputs for w/o IO programs directly, by implicitly evaluating the
program within the forward pass, and more reliably when stepping through the
program in-context via chain-of-thought. We further show that PBB leads to more
robust evaluation of programs across inputs than training on I/O pairs drawn
from a distribution that mirrors naturally occurring data. Our findings suggest
a mechanism for enhanced reasoning through code training: it allows LLMs to
internalise reusable algorithmic abstractions. Significant scope remains for
future work to enable LLMs to more effectively learn from symbolic procedures,
and progress in this direction opens other avenues like model alignment by
training on formal constitutional principles.
[LINK]
http://arxiv.org/abs/2506.18777v1
[DATE]
2025-06-23 23:45:44+08:00
[CATEGORIES]
cs.CL
cs.LG
Neural Total Variation Distance Estimators for Changepoint Detection in News Data
[AUTHORS]
Csaba Zsolnai, Niels Lörch, Julian Arnold
[ABSTRACT]
Detecting when public discourse shifts in response to major events is crucial
for understanding societal dynamics. Real-world data is high-dimensional,
sparse, and noisy, making changepoint detection in this domain a challenging
endeavor. In this paper, we leverage neural networks for changepoint detection
in news data, introducing a method based on the so-called learning-by-confusion
scheme, which was originally developed for detecting phase transitions in
physical systems. We train classifiers to distinguish between articles from
different time periods. The resulting classification accuracy is used to
estimate the total variation distance between underlying content distributions,
where significant distances highlight changepoints. We demonstrate the
effectiveness of this method on both synthetic datasets and real-world data
from The Guardian newspaper, successfully identifying major historical events
including 9/11, the COVID-19 pandemic, and presidential elections. Our approach
requires minimal domain knowledge, can autonomously discover significant shifts
in public discourse, and yields a quantitative measure of change in content,
making it valuable for journalism, policy analysis, and crisis monitoring.
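
The link between classification accuracy and total variation (TV) distance can be sketched briefly. With equal class priors, the best achievable accuracy for telling two distributions apart is (1 + TV) / 2, so a held-out accuracy a gives the lower-bound estimate TV ≥ 2a − 1. The Python sketch below applies this standard identity; the paper's exact estimator may differ, and the function names are hypothetical.

```python
def tv_estimate_from_accuracy(accuracy: float) -> float:
    """Estimate TV distance from balanced binary classification accuracy.

    Assumes equal class priors, where the optimal accuracy is (1 + TV) / 2.
    Accuracies below chance are clipped to a distance of zero.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return max(0.0, 2.0 * accuracy - 1.0)


def exact_tv(p: list[float], q: list[float]) -> float:
    """Exact TV distance between two discrete distributions, for comparison."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy check: the optimal classifier picks the likelier source for each outcome.
p, q = [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]
bayes_accuracy = 0.5 * sum(max(a, b) for a, b in zip(p, q))
estimated = tv_estimate_from_accuracy(bayes_accuracy)
```

On this toy pair the estimate from the optimal accuracy matches the exact distance. A trained neural classifier is generally suboptimal, so in practice the estimate is a lower bound.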
[COMMENTS]
16 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.18764v1
[DATE]
2025-06-23 23:33:30+08:00
[CATEGORIES]
cs.LG
cs.CL
SEAL: Scaling to Emphasize Attention for Long-Context Retrieval
[AUTHORS]
Changhun Lee, Minsang Seok, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park
[ABSTRACT]
While many advanced LLMs are designed to handle long sequence data, we can
still observe notable quality degradation even within the sequence limit. In
this work, we introduce a novel approach called Scaling to Emphasize Attention
for Long-context retrieval (SEAL), which enhances the retrieval performance of
large language models (LLMs) over long contexts. We observe that specific
attention heads are closely tied to long-context retrieval, showing positive or
negative correlation with retrieval scores, and adjusting the strength of these
heads boosts the quality of LLMs in long context by a large margin. Built on
this insight, we propose a learning-based mechanism that leverages generated
data to emphasize these heads. By applying SEAL, we achieve significant
improvements in long-context retrieval performance across various tasks and
models. Additionally, when combined with existing training-free context
extension techniques, SEAL extends the contextual limits of LLMs while
maintaining highly reliable outputs.
[COMMENTS]
Accepted at ACL 2025 Main
[LINK]
http://arxiv.org/abs/2501.15225v2
[DATE]
2025-06-23 23:24:16+08:00
[CATEGORIES]
cs.CL
cs.LG
Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX
[AUTHORS]
Nikita Martynov, Anastasia Mordasheva, Dmitriy Gorbetskiy, Danil Astafurov, Ulyana Isaeva, Elina Basyrova, Sergey Skachkov, Victoria Berestova, Nikolay Ivanov, Valeriia Zanina, Alena Fenogenova
[ABSTRACT]
We introduce POLLUX, a comprehensive open-source benchmark designed to
evaluate the generative capabilities of large language models (LLMs) in
Russian. Our main contribution is a novel evaluation methodology that enhances
the interpretability of LLM assessment. For each task type, we define a set of
detailed criteria and develop a scoring protocol where models evaluate
responses and provide justifications for their ratings. This enables
transparent, criteria-driven evaluation beyond traditional resource-consuming,
side-by-side human comparisons. POLLUX includes a detailed, fine-grained
taxonomy of 35 task types covering diverse generative domains such as code
generation, creative writing, and practical assistant use cases, totaling 2,100
manually crafted and professionally authored prompts. Each task is categorized
by difficulty (easy/medium/hard), with experts constructing the dataset
entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B)
evaluators trained for nuanced assessment of generative outputs. This approach
provides scalable, interpretable evaluation and annotation tools for model
development, effectively replacing costly and less precise human judgments.
[COMMENTS]
179 pages
[LINK]
http://arxiv.org/abs/2505.24616v2
[DATE]
2025-06-23 23:01:31+08:00
[CATEGORIES]
cs.CL
Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation
[AUTHORS]
Jie Li, Shifei Ding, Lili Guo, Xuan Li
[ABSTRACT]
Emotion Recognition in Conversation (ERC) aims to detect the emotions of
individual utterances within a conversation. Generating efficient and
modality-specific representations for each utterance remains a significant
challenge. Previous studies have proposed various models to integrate features
extracted using different modality-specific encoders. However, they neglect the
varying contributions of modalities to this task and introduce high complexity
by aligning modalities at the frame level. To address these challenges, we
propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation
(MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance
textual modality representations, while knowledge distillation is utilized to
strengthen representations of weaker modalities. Furthermore, we introduce a
multi-modal anchor gated transformer to effectively integrate utterance-level
representations across modalities. Extensive experiments on the IEMOCAP and
MELD datasets demonstrate the effectiveness of knowledge distillation in
enhancing modality representations and achieve state-of-the-art performance in
emotion recognition. Our code is available at:
https://github.com/JieLi-dd/MAGTKD.
[COMMENTS]
This paper has been accepted by IJCAI2025
[LINK]
http://arxiv.org/abs/2506.18716v1
[DATE]
2025-06-23 22:53:22+08:00
[CATEGORIES]
cs.LG
cs.CL
Handling Numeric Expressions in Automatic Speech Recognition
[AUTHORS]
Christian Huber, Alexander Waibel
[ABSTRACT]
This paper addresses the problem of correctly formatting numeric expressions
in automatic speech recognition (ASR) transcripts. This is challenging since
the expected transcript format depends on the context, e.g., 1945 (year) vs.
19:45 (timestamp). We compare cascaded and end-to-end approaches to recognize
and format numeric expressions such as years, timestamps, currency amounts, and
quantities. For the end-to-end approach, we employed a data generation strategy
using a large language model (LLM) together with a text to speech (TTS) model
to generate adaptation data. The results on our test data set show that while
approaches based on LLMs perform well in recognizing formatted numeric
expressions, adapted end-to-end models offer competitive performance with the
advantage of lower latency and inference cost.
[LINK]
http://arxiv.org/abs/2408.00004v2
[DATE]
2025-06-23 22:45:07+08:00
[CATEGORIES]
cs.CL
Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition
[AUTHORS]
Christian Huber, Alexander Waibel
[ABSTRACT]
Neural sequence-to-sequence systems deliver state-of-the-art performance for
automatic speech recognition. When using appropriate modeling units, e.g.,
byte-pair encoded characters, these systems are in principle open vocabulary
systems. In practice, however, they often fail to recognize words not seen
during training, e.g., named entities, acronyms, or domain-specific special
words. To address this problem, many context biasing methods have been
proposed; however, for words with a pronunciation-orthography mismatch, these
methods may still struggle. We propose a method which allows corrections of
substitution errors to improve the recognition accuracy of such challenging
words. Users can add corrections on the fly during inference. We show that with
this method we get a relative improvement in biased word error rate of up to
11%, while maintaining a competitive overall word error rate.
[LINK]
http://arxiv.org/abs/2506.18703v1
[DATE]
2025-06-23 22:42:03+08:00
[CATEGORIES]
cs.CL
cs.LG
Better Language Model Inversion by Compactly Representing Next-Token Distributions
[AUTHORS]
Murtaza Nazir, Matthew Finlayson, John X. Morris, Xiang Ren, Swabha Swayamdipta
[ABSTRACT]
Language model inversion seeks to recover hidden prompts using only language
model outputs. This capability has implications for security and accountability
in language model deployments, such as leaking private information from an
API-protected language model’s system message. We propose a new method –
prompt inversion from logprob sequences (PILS) – that recovers hidden prompts
by gleaning clues from the model’s next-token probabilities over the course of
multiple generation steps. Our method is enabled by a key insight: The
vector-valued outputs of a language model occupy a low-dimensional subspace.
This enables us to losslessly compress the full next-token probability
distribution over multiple generation steps using a linear map, allowing more
output information to be used for inversion. Our approach yields massive gains
over previous state-of-the-art methods for recovering hidden prompts, achieving
2–3.5 times higher exact recovery rates across test sets, in one case
increasing the recovery rate from 17% to 60%. Our method also exhibits
surprisingly good generalization behavior; for instance, an inverter trained on
16 generation steps gets 5–27 points higher prompt recovery when we increase
the number of steps to 32 at test time. Furthermore, we demonstrate strong
performance of our method on the more challenging task of recovering hidden
system messages. We also analyze the role of verbatim repetition in prompt
recovery and propose a new method for cross-family model transfer for
logit-based inverters. Our findings show that next-token probabilities are a
considerably more vulnerable attack surface for inversion attacks than
previously known.
[LINK]
http://arxiv.org/abs/2506.17090v2
[DATE]
2025-06-23 22:39:37+08:00
[CATEGORIES]
cs.CL
HausaNLP at SemEval-2025 Task 11: Hausa Text Emotion Detection
[AUTHORS]
Sani Abdullahi Sani, Salim Abubakar, Falalu Ibrahim Lawan, Abdulhamid Abubakar, Maryam Bala
[ABSTRACT]
This paper presents our approach to multi-label emotion detection in Hausa, a
low-resource African language, for SemEval Track A. We fine-tuned AfriBERTa, a
transformer-based model pre-trained on African languages, to classify Hausa
text into six emotions: anger, disgust, fear, joy, sadness, and surprise. Our
methodology involved data preprocessing, tokenization, and model fine-tuning
using the Hugging Face Trainer API. The system achieved a validation accuracy
of 74.00%, with an F1-score of 73.50%, demonstrating the effectiveness of
transformer-based models for emotion detection in low-resource languages.
[LINK]
http://arxiv.org/abs/2506.16388v2
[DATE]
2025-06-23 22:32:28+08:00
[CATEGORIES]
cs.CL