How Does Knowledge Selection Help Retrieval Augmented Generation?
[AUTHORS]
Xiangci Li, Jessica Ouyang
[ABSTRACT]
Retrieval-augmented generation (RAG) is a powerful method for enhancing
natural language generation by integrating external knowledge into a model’s
output. While prior work has demonstrated the importance of improving knowledge
retrieval for boosting generation quality, the role of knowledge selection
remains less clear. This paper empirically analyzes how knowledge selection
influences downstream generation performance in RAG systems. By simulating
different retrieval and selection conditions through a controlled mixture of
gold and distractor knowledge, we assess the impact of these factors on
generation outcomes. Our findings indicate that the downstream generator
model’s capability, as well as the complexity of the task and dataset,
significantly influence the impact of knowledge selection on the overall RAG
system performance. In typical scenarios, improving the knowledge recall score
is key to enhancing generation outcomes, with the knowledge selector providing
limited benefit when a strong generator model is used on clear, well-defined
tasks. For weaker generator models or more ambiguous tasks and datasets, the
knowledge F1 score becomes a critical factor, and the knowledge selector plays
a more prominent role in improving overall performance.
[LINK]
http://arxiv.org/abs/2410.13258v3
[DATE]
2025-05-16 01:59:42+08:00
[CATEGORIES]
cs.CL
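The knowledge recall and knowledge F1 scores discussed in the abstract above can be illustrated with a minimal Python sketch. It assumes knowledge pieces are identified by string ids and scored by set overlap with a gold set; the function name `knowledge_scores` and the example ids are hypothetical and are not taken from the paper.

```python
def knowledge_scores(selected, gold):
    """Return (precision, recall, f1) of selected knowledge ids against gold ids."""
    selected, gold = set(selected), set(gold)
    hits = len(selected & gold)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: the retriever returns 2 gold pieces and 2 distractors,
# and one gold piece ("g3") is never retrieved.
gold = ["g1", "g2", "g3"]
retrieved = ["g1", "g2", "d1", "d2"]
# A perfect selector keeps only the retrieved gold pieces.
selected = ["g1", "g2"]

print("retrieval only:", knowledge_scores(retrieved, gold))
print("after selection:", knowledge_scores(selected, gold))
```

In this toy case, dropping the two distractors leaves recall unchanged while raising precision and F1, which is the kind of selector effect the abstract reports as mattering more for weaker generators or more ambiguous tasks.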
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
[AUTHORS]
Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li
[COMMENTS]
Accepted to ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2505.10557v1
[DATE]
2025-05-16 01:59:21+08:00
[CATEGORIES]
cs.CL
Beyond ‘Aha!’: Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
[AUTHORS]
Zhiyuan Hu, Yibo Wang, Hanze Dong, Yuhui Xu, Amrita Saha, Caiming Xiong, Bryan Hooi, Junnan Li
[ABSTRACT]
Large reasoning models (LRMs) already possess a latent capacity for long
chain-of-thought reasoning. Prior work has shown that outcome-based
reinforcement learning (RL) can incidentally elicit advanced reasoning
behaviors such as self-correction, backtracking, and verification, phenomena
often referred to as the model’s “aha moment”. However, the timing and
consistency of these emergent behaviors remain unpredictable and
uncontrollable, limiting the scalability and reliability of LRMs’ reasoning
capabilities. To address these limitations, we move beyond reliance on prompts
and coincidental “aha moments”. Instead, we explicitly align models with three
meta-abilities: deduction, induction, and abduction, using automatically
generated, self-verifiable tasks. Our three-stage pipeline (individual
alignment, parameter-space merging, and domain-specific reinforcement learning)
boosts performance by over 10% relative to instruction-tuned baselines.
Furthermore, domain-specific RL from the aligned checkpoint yields an
additional 2% average gain in the performance ceiling across math, coding, and
science benchmarks, demonstrating that explicit meta-ability alignment offers a
scalable and dependable foundation for reasoning. Code is available at:
https://github.com/zhiyuanhubj/Meta-Ability-Alignment
[COMMENTS]
In Progress
[LINK]
http://arxiv.org/abs/2505.10554v1
[DATE]
2025-05-16 01:58:33+08:00
[CATEGORIES]
cs.CL
Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models
[AUTHORS]
Annie Wong, Thomas Bäck, Aske Plaat, Niki van Stein, Anna V. Kononova
[ABSTRACT]
While large language models demonstrate impressive performance on static
benchmarks, the true potential of large language models as self-learning and
reasoning agents in dynamic environments remains unclear. This study
systematically evaluates the efficacy of self-reflection, heuristic mutation,
and planning as prompting techniques to test the adaptive capabilities of
agents. We conduct experiments with various open-source language models in
dynamic environments. First, we find that larger models generally outperform
smaller ones, but that strategic prompting can close this performance gap. Second, a
too-long prompt can negatively impact smaller models on basic reactive tasks,
while larger models show more robust behaviour. Third, advanced prompting
techniques primarily benefit smaller models on complex games, but offer less
improvement for already high-performing large language models. Yet, we find
that advanced reasoning methods yield highly variable outcomes: while capable
of significantly improving performance when reasoning and decision-making
align, they also introduce instability and can lead to big performance drops.
Compared to human performance, our findings reveal little evidence of true
emergent reasoning. Instead, large language model performance exhibits
persistent limitations in crucial areas such as planning, reasoning, and
spatial coordination, suggesting that current-generation large language models
still suffer fundamental shortcomings that may not be fully overcome through
self-reflective prompting alone. Reasoning is a multi-faceted task, and while
reasoning methods like chain-of-thought improve multi-step reasoning on math
word problems, our findings using dynamic benchmarks highlight important
shortcomings in general reasoning capabilities, indicating a need to move
beyond static benchmarks to capture the complexity of reasoning.
[LINK]
http://arxiv.org/abs/2505.10543v1
[DATE]
2025-05-16 01:53:47+08:00
[CATEGORIES]
cs.CL
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning
[AUTHORS]
Yuwei Yin, Giuseppe Carenini
[ABSTRACT]
Large language models (LLMs) have demonstrated impressive capabilities on
complex evaluation benchmarks, many of which are formulated as
question-answering (QA) tasks. Enhancing the performance of LLMs in QA contexts
is becoming increasingly vital for advancing their development and
applicability. This paper introduces ARR, an intuitive, effective, and general
QA solving method that explicitly incorporates three key steps: analyzing the
intent of the question, retrieving relevant information, and reasoning step by
step. Notably, this paper is the first to introduce intent analysis in QA,
which plays a vital role in ARR. Comprehensive evaluations across 10 diverse QA
tasks demonstrate that ARR consistently outperforms the baseline methods.
Ablation and case studies further validate the positive contributions of each
ARR component. Furthermore, experiments involving variations in prompt design
indicate that ARR maintains its effectiveness regardless of the specific prompt
formulation. Additionally, extensive evaluations across various model sizes,
LLM series, and generation settings solidify the effectiveness, robustness, and
generalizability of ARR.
[COMMENTS]
21 pages. Code: https://github.com/YuweiYin/ARR
[LINK]
http://arxiv.org/abs/2502.04689v3
[DATE]
2025-05-16 01:52:51+08:00
[CATEGORIES]
cs.CL
cs.LG
WorldPM: Scaling Human Preference Modeling
[AUTHORS]
Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zhenru Zhang, Fei Huang, Chujie Zheng, Kai Dang, Yang Fan, Xingzhang Ren, An Yang, Binyuan Hui, Dayiheng Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, Junyang Lin
[ABSTRACT]
Motivated by scaling laws in language modeling that demonstrate how test loss
scales as a power law with model and dataset sizes, we find that similar laws
exist in preference modeling. We propose World Preference Modeling (WorldPM)
to emphasize this scaling potential, where World Preference embodies a unified
representation of human preferences. In this paper, we collect preference data
from public forums covering diverse user communities, and conduct extensive
training using 15M-scale data across models ranging from 1.5B to 72B
parameters. We observe distinct patterns across different evaluation metrics:
(1) Adversarial metrics (ability to identify deceptive features) consistently
scale up with increased training data and base model size; (2) Objective
metrics (objective knowledge with well-defined answers) show emergent behavior
in larger language models, highlighting WorldPM’s scalability potential; (3)
Subjective metrics (subjective preferences from a limited number of humans or
AI) do not demonstrate scaling trends. Further experiments validate the
effectiveness of WorldPM as a foundation for preference fine-tuning. Through
evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly
improves the generalization performance across human preference datasets of
varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5%
on many key subtasks. Integrating WorldPM into our internal RLHF pipeline, we
observe significant improvements on both in-house and public evaluation sets,
with notable gains of 4% to 8% in our in-house evaluations.
[LINK]
http://arxiv.org/abs/2505.10527v1
[DATE]
2025-05-16 01:38:37+08:00
[CATEGORIES]
cs.CL
Multi-Token Prediction Needs Registers
[AUTHORS]
Anastasios Gerontopoulos, Spyros Gidaris, Nikos Komodakis
[ABSTRACT]
Multi-token prediction has emerged as a promising objective for improving
language model pretraining, but its benefits have not consistently generalized
to other settings such as fine-tuning. In this paper, we propose MuToR, a
simple and effective approach to multi-token prediction that interleaves
learnable register tokens into the input sequence, each tasked with predicting
future targets. Compared to existing methods, MuToR offers several key
advantages: it introduces only a negligible number of additional parameters,
requires no architectural changes–ensuring compatibility with off-the-shelf
pretrained language models–and remains aligned with the next-token pretraining
objective, making it especially well-suited for supervised fine-tuning.
Moreover, it naturally supports scalable prediction horizons. We demonstrate
the effectiveness and versatility of MuToR across a range of use cases,
including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and
pretraining, on challenging generative tasks in both language and vision
domains. Our code will be available at: https://github.com/nasosger/MuToR.
[LINK]
http://arxiv.org/abs/2505.10518v1
[DATE]
2025-05-16 01:25:03+08:00
[CATEGORIES]
cs.CL
cs.LG
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
[AUTHORS]
Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, Wen Xiao
[ABSTRACT]
In this study, we investigate whether attention-based information flow inside
large language models (LLMs) is aggregated through noticeable patterns for long
context processing. Our observations reveal that LLMs aggregate information
through Pyramidal Information Funneling, where attention scatters widely in
lower layers, progressively consolidates within specific contexts, and
ultimately focuses on critical tokens (a.k.a. massive activation or attention
sink) in higher layers. Motivated by these insights, we developed PyramidKV, a
novel and effective KV cache compression method. This approach dynamically
adjusts the KV cache size across different layers, allocating more cache in
lower layers and less in higher ones, diverging from traditional methods that
maintain a uniform KV cache size. Our experimental evaluations, utilizing the
LongBench benchmark, show that PyramidKV matches the performance of models with
a full KV cache while retaining only 12% of the KV cache, thus significantly
reducing memory usage. In scenarios emphasizing memory efficiency, where only
0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache
compression techniques, achieving up to a 20.5 absolute accuracy improvement on
the TREC dataset. In the Needle-in-a-Haystack experiment, PyramidKV outperforms
competing methods in maintaining long-context comprehension in LLMs; notably,
retaining just 128 KV cache entries enables the LLAMA-3-70B model to achieve
100.0 accuracy.
[LINK]
http://arxiv.org/abs/2406.02069v4
[DATE]
2025-05-16 01:18:12+08:00
[CATEGORIES]
cs.CL
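The layer-wise allocation described in the PyramidKV abstract (more cache in lower layers, less in higher ones) can be sketched as follows. This is only an illustration under stated assumptions: the linear schedule, the function name `pyramid_budgets`, and the `ratio` knob are hypothetical and are not confirmed by the abstract, which does not specify the exact allocation rule.

```python
def pyramid_budgets(num_layers, total_budget, ratio=4.0):
    """Split total_budget KV entries across layers with a linearly decreasing schedule.

    ratio is a hypothetical knob: the bottom-layer weight divided by the top-layer weight.
    """
    if num_layers == 1:
        return [total_budget]
    # Weights fall linearly from `ratio` at layer 0 to 1.0 at the last layer.
    weights = [ratio - (ratio - 1.0) * i / (num_layers - 1) for i in range(num_layers)]
    scale = total_budget / sum(weights)
    budgets = [int(w * scale) for w in weights]
    # Flooring can leave a few entries unassigned; give them to the lowest layers.
    leftover = total_budget - sum(budgets)
    for i in range(leftover):
        budgets[i % num_layers] += 1
    return budgets


budgets = pyramid_budgets(num_layers=4, total_budget=100, ratio=4.0)
print(budgets)
```

A uniform baseline would instead assign total_budget / num_layers entries to every layer; the sketch keeps the same total while shifting entries toward the lower layers.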
The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks
[AUTHORS]
Benedikt Ebing, Goran Glavaš
[ABSTRACT]
Translation-based strategies for cross-lingual transfer (XLT) such as
translate-train – training on noisy target language data translated from the
source language – and translate-test – evaluating on noisy source language
data translated from the target language – are competitive XLT baselines. In
XLT for token classification tasks, however, these strategies include label
projection, the challenging step of mapping the labels from each token in the
original sentence to its counterpart(s) in the translation. Although word
aligners (WAs) are commonly used for label projection, the low-level design
decisions for applying them to translation-based XLT have not been
systematically investigated. Moreover, recent marker-based methods, which
project labeled spans by inserting tags around them before (or after)
translation, claim to outperform WAs in label projection for XLT. In this work,
we revisit WAs for label projection, systematically investigating the effects
of low-level design decisions on token-level XLT: (i) the algorithm for
projecting labels between (multi-)token spans, (ii) filtering strategies to
reduce the number of noisily mapped labels, and (iii) the pre-tokenization of
the translated sentences. We find that all of these substantially impact
translation-based XLT performance and show that, with optimized choices, XLT
with WA offers performance at least comparable to that of marker-based methods.
We then introduce a new projection strategy that ensembles translate-train and
translate-test predictions and demonstrate that it substantially outperforms
the marker-based projection. Crucially, we show that our proposed ensembling
also reduces sensitivity to low-level WA design choices, resulting in more
robust XLT for token classification tasks.
[LINK]
http://arxiv.org/abs/2505.10507v1
[DATE]
2025-05-16 01:10:50+08:00
[CATEGORIES]
cs.CL
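Label projection with a word aligner, as described in the abstract above, can be illustrated with a deliberately naive Python sketch. It assumes the aligner output is a list of (source index, target index) pairs and uses plain span-type labels rather than a BIO scheme; the function name `project_labels`, the example sentence, and the alignment pairs are hypothetical. It is not the projection algorithm or filtering strategy studied in the paper.

```python
def project_labels(src_labels, alignments, tgt_len, outside="O"):
    """Copy each labeled source token's label to the target tokens aligned with it."""
    tgt_labels = [outside] * tgt_len
    for src_idx, tgt_idx in alignments:
        label = src_labels[src_idx]
        # Keep the first non-outside label a target token receives.
        if label != outside and tgt_labels[tgt_idx] == outside:
            tgt_labels[tgt_idx] = label
    return tgt_labels


# Source: "Angela Merkel visited Paris"
src_labels = ["PER", "PER", "O", "LOC"]
# Translation: "Paris wurde von Angela Merkel besucht" (6 tokens)
# Hypothetical aligner output as (source index, target index) pairs.
alignments = [(0, 3), (1, 4), (2, 5), (3, 0)]

projected = project_labels(src_labels, alignments, tgt_len=6)
print(projected)
```

Design decisions such as how to handle multi-token spans, unaligned tokens, and conflicting alignments are exactly the low-level choices the abstract says have a substantial effect; this sketch simply picks the first label seen.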
Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs)
[AUTHORS]
Jadon Geathers, Yann Hicke, Colleen Chan, Niroop Rajashekar, Justin Sewell, Susannah Cornes, Rene F. Kizilcec, Dennis Shung
[ABSTRACT]
Objective Structured Clinical Examinations (OSCEs) are widely used to assess
medical students’ communication skills, but scoring interview-based assessments
is time-consuming and potentially subject to human bias. This study explored
the potential of large language models (LLMs) to automate OSCE evaluations
using the Master Interview Rating Scale (MIRS). We compared the performance of
four state-of-the-art LLMs (GPT-4o, Claude 3.5, Llama 3.1, and Gemini 1.5 Pro)
in evaluating OSCE transcripts across all 28 items of the MIRS under the
conditions of zero-shot, chain-of-thought (CoT), few-shot, and multi-step
prompting. The models were benchmarked against a dataset of 10 OSCE cases with
174 expert consensus scores available. Model performance was measured using
three accuracy metrics (exact, off-by-one, thresholded). Averaging across all
MIRS items and OSCE cases, LLMs performed with low exact accuracy (0.27 to
0.44), and moderate to high off-by-one accuracy (0.67 to 0.87) and thresholded
accuracy (0.75 to 0.88). A zero temperature parameter ensured high intra-rater
reliability ($\alpha$ = 0.98 for GPT-4o). CoT, few-shot, and multi-step
techniques proved valuable when tailored to specific assessment items. The
performance was consistent across MIRS items, independent of encounter phases
and communication domains. We demonstrated the feasibility of AI-assisted OSCE
evaluation and provided benchmarking of multiple LLMs across multiple prompt
techniques. Our work provides a baseline performance assessment for LLMs that
lays a foundation for future research into automated assessment of clinical
communication skills.
[COMMENTS]
12 pages + 3 pages of references, 4 figures
[LINK]
http://arxiv.org/abs/2501.13957v2
[DATE]
2025-05-16 01:09:21+08:00
[CATEGORIES]
cs.CL
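The three accuracy metrics named in the OSCE abstract (exact, off-by-one, thresholded) can be sketched in Python as below. The abstract does not define the thresholded metric, so this sketch assumes it means that the model score and the expert score fall on the same side of a cut-off; that definition, the default `threshold=3`, the function name `accuracy_metrics`, and the example scores are all assumptions for illustration.

```python
def accuracy_metrics(pred, gold, threshold=3):
    """Return (exact, off_by_one, thresholded) agreement between two score lists."""
    n = len(gold)
    pairs = list(zip(pred, gold))
    exact = sum(p == g for p, g in pairs) / n
    off_by_one = sum(abs(p - g) <= 1 for p, g in pairs) / n
    # Assumed definition: both scores are on the same side of the cut-off.
    thresholded = sum((p >= threshold) == (g >= threshold) for p, g in pairs) / n
    return exact, off_by_one, thresholded


# Hypothetical model scores and expert consensus scores on a 1-5 scale.
pred = [1, 3, 4, 5]
gold = [1, 2, 5, 3]
print(accuracy_metrics(pred, gold))
```

The example shows how a rater can have low exact agreement while still scoring much higher on the two more lenient metrics, the same pattern the abstract reports.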
Disentangling Memory and Reasoning Ability in Large Language Models
[AUTHORS]
Mingyu Jin, Weidi Luo, Sitao Cheng, Xinyi Wang, Wenyue Hua, Ruixiang Tang, William Yang Wang, Yongfeng Zhang
[COMMENTS]
Accepted by ACL 2025
[LINK]
http://arxiv.org/abs/2411.13504v3
[DATE]
2025-05-16 01:05:43+08:00
[CATEGORIES]
cs.CL
RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs
[AUTHORS]
Vibha Belavadi, Tushar Vatsa, Dewang Sultania, Suhas Suresha, Ishita Verma, Cheng Chen, Tracy Holloway King, Michael Friedrich
[ABSTRACT]
This paper addresses fine-tuning Large Language Models (LLMs) for function
calling tasks when real user interaction data is unavailable. In digital
content creation tools, where users express their needs through natural
language queries that must be mapped to API calls, the lack of real-world
task-specific data and privacy constraints for training on it necessitate
synthetic data generation. Existing approaches to synthetic data generation
fall short in diversity and complexity, failing to replicate real-world data
distributions and leading to suboptimal performance after LLM fine-tuning. We
present a novel router-based architecture that leverages domain resources like
content metadata and structured knowledge graphs, along with text-to-text and
vision-to-text language models to generate high-quality synthetic training
data. Our architecture’s flexible routing mechanism enables synthetic data
generation that matches observed real-world distributions, addressing a
fundamental limitation of traditional approaches. Evaluation on a comprehensive
set of real user queries demonstrates significant improvements in both function
classification accuracy and API parameter selection. Models fine-tuned with our
synthetic data consistently outperform traditional approaches, establishing new
benchmarks for function calling tasks.
[COMMENTS]
Proceedings of the 4th International Workshop on Knowledge-Augmented
Methods for Natural Language Processing
[LINK]
http://arxiv.org/abs/2505.10495v1
[DATE]
2025-05-16 00:53:45+08:00
[CATEGORIES]
cs.LG
cs.CL
Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective
[AUTHORS]
Yutao Mou, Xiao Deng, Yuxiao Luo, Shikun Zhang, Wei Ye
[ABSTRACT]
Code security and usability are both essential for various coding assistant
applications driven by large language models (LLMs). Current code security
benchmarks focus solely on a single evaluation task and paradigm, such as code
completion and generation, lacking comprehensive assessment across dimensions
like secure code generation, vulnerability repair and discrimination. In this
paper, we first propose CoV-Eval, a multi-task benchmark covering various tasks
such as code completion, vulnerability repair, vulnerability detection and
classification, for comprehensive evaluation of LLM code security. Besides, we
developed VC-Judge, an improved judgment model that aligns closely with human
experts and can review LLM-generated programs for vulnerabilities in a more
efficient and reliable way. We conduct a comprehensive evaluation of 20
proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable
code well, they still tend to generate insecure code and struggle with
recognizing specific vulnerability types and performing repairs. Extensive
experiments and qualitative analyses reveal key challenges and optimization
directions, offering insights for future research in LLM code security.
[COMMENTS]
Accepted by ACL2025 Main Conference
[LINK]
http://arxiv.org/abs/2505.10494v1
[DATE]
2025-05-16 00:53:41+08:00
[CATEGORIES]
cs.CL
CL-RAG: Bridging the Gap in Retrieval-Augmented Generation with Curriculum Learning
[AUTHORS]
Shaohan Wang, Licheng Zhang, Zheren Fu, Zhendong Mao
[ABSTRACT]
Retrieval-Augmented Generation (RAG) is an effective method to enhance the
capabilities of large language models (LLMs). Existing methods focus on
optimizing the retriever or generator in the RAG system by directly utilizing
the top-k retrieved documents. However, the effectiveness of documents varies
significantly across user queries, i.e., some documents provide valuable
knowledge while others totally lack critical information. This hinders the
retriever and generator’s adaptation during training. Inspired by human
cognitive learning, curriculum learning trains models using samples progressing
from easy to difficult, thus enhancing their generalization ability, and we
integrate this effective paradigm into the training of the RAG system. In this
paper, we propose a multi-stage Curriculum Learning based RAG system training
framework, named CL-RAG. We first construct training data with multiple
difficulty levels for the retriever and generator separately through sample
evolution. Then, we train the model in stages based on the curriculum learning
approach, thereby optimizing the overall performance and generalization of the
RAG system more effectively. Our CL-RAG framework demonstrates consistent
effectiveness across four open-domain QA datasets, achieving performance gains
of 2% to 4% over multiple advanced methods.
[LINK]
http://arxiv.org/abs/2505.10493v1
[DATE]
2025-05-16 00:53:04+08:00
[CATEGORIES]
cs.CL
SceneGenAgent: Precise Industrial Scene Generation with Coding Agent
[AUTHORS]
Xiao Xia, Dan Zhang, Zibo Liao, Zhenyu Hou, Tianrui Sun, Jing Li, Ling Fu, Yuxiao Dong
[ABSTRACT]
The modeling of industrial scenes is essential for simulations in industrial
manufacturing. While large language models (LLMs) have shown significant
progress in generating general 3D scenes from textual descriptions, generating
industrial scenes with LLMs poses a unique challenge due to their demand for
precise measurements and positioning, requiring complex planning over spatial
arrangement. To address this challenge, we introduce SceneGenAgent, an
LLM-based agent for generating industrial scenes through C# code. SceneGenAgent
ensures precise layout planning through a structured and calculable format,
layout verification, and iterative refinement to meet the quantitative
requirements of industrial scenarios. Experimental results demonstrate that LLMs
powered by SceneGenAgent exceed their original performance, reaching up to
81.0% success rate in real-world industrial scene generation tasks and
effectively meeting most scene generation requirements. To further enhance
accessibility, we construct SceneInstruct, a dataset designed for fine-tuning
open-source LLMs to integrate into SceneGenAgent. Experiments show that
fine-tuning open-source LLMs on SceneInstruct yields significant performance
improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our
code and data are available at https://github.com/THUDM/SceneGenAgent .
[LINK]
http://arxiv.org/abs/2410.21909v2
[DATE]
2025-05-16 00:40:39+08:00
[CATEGORIES]
cs.CL
cs.LG
Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction
[AUTHORS]
Yuanchang Ye, Weiyan Wen
[ABSTRACT]
This study addresses the critical challenge of hallucination mitigation in
Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA) tasks
through a Split Conformal Prediction (SCP) framework. While LVLMs excel in
multi-modal reasoning, their outputs often exhibit hallucinated content with
high confidence, posing risks in safety-critical applications. We propose a
model-agnostic uncertainty quantification method that integrates dynamic
threshold calibration and cross-modal consistency verification. By partitioning
data into calibration and test sets, the framework computes nonconformity
scores to construct prediction sets with statistical guarantees under
user-defined risk levels ($\alpha$). Key innovations include: (1) rigorous
control of marginal coverage to ensure empirical error rates remain
strictly below $\alpha$; (2) dynamic adjustment of prediction set sizes
inversely with $\alpha$, filtering low-confidence outputs; (3) elimination of
prior distribution assumptions and retraining requirements. Evaluations on
benchmarks (ScienceQA, MMMU) with eight LVLMs demonstrate that SCP enforces
theoretical guarantees across all $\alpha$ values. The framework achieves
stable performance across varying calibration-to-test split ratios,
underscoring its robustness for real-world deployment in healthcare, autonomous
systems, and other safety-sensitive domains. This work bridges the gap between
theoretical reliability and practical applicability in multi-modal AI systems,
offering a scalable solution for hallucination detection and uncertainty-aware
decision-making.
[COMMENTS]
Accepted by ICIPCA 2025
[LINK]
http://arxiv.org/abs/2504.17671v3
[DATE]
2025-05-16 00:24:49+08:00
[CATEGORIES]
cs.CL
cs.LG
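The calibration step of split conformal prediction mentioned in the abstract above can be sketched in a few lines of Python. The sketch assumes the common nonconformity score of 1 minus the probability the model assigns to an answer option; that score choice, the function names, and the example numbers are assumptions, not details from the paper. The standard guarantee for this procedure is marginal coverage of at least $1 - \alpha$ when calibration and test data are exchangeable.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every option must be kept.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(option_probs, qhat):
    """Keep every answer option whose nonconformity score 1 - p is within qhat."""
    return [opt for opt, p in option_probs.items() if 1 - p <= qhat]


# Hypothetical calibration scores: 1 - p(true answer) on held-out questions.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
qhat = conformal_threshold(cal_scores, alpha=0.25)

# Hypothetical model probabilities for one multiple-choice test question.
probs = {"A": 0.5, "B": 0.45, "C": 0.05}
print(qhat, prediction_set(probs, qhat))
```

A smaller $\alpha$ pushes the threshold up and therefore yields larger prediction sets, which matches the inverse relationship between set size and $\alpha$ described in the abstract.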
Parallel Scaling Law for Language Models
[AUTHORS]
Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu
[ABSTRACT]
It is commonly believed that scaling language models incurs a
significant space or time cost, by increasing the parameters (parameter
scaling) or output tokens (inference-time scaling). We introduce a third and
more inference-efficient scaling paradigm: increasing the model’s parallel
computation during both training and inference time. We apply $P$ diverse and
learnable transformations to the input, execute forward passes of the model in
parallel, and dynamically aggregate the $P$ outputs. This method, namely
parallel scaling (ParScale), scales parallel computation by reusing existing
parameters and can be applied to any model structure, optimization procedure,
data, or task. We theoretically propose a new scaling law and validate it
through large-scale pre-training, which shows that a model with $P$ parallel
streams is similar to scaling the parameters by $O(\log P)$ while showing
superior inference efficiency. For example, ParScale can use up to 22$\times$
less memory increase and 6$\times$ less latency increase compared to parameter
scaling that achieves the same performance improvement. It can also recycle an
off-the-shelf pre-trained model into a parallelly scaled one by post-training
on a small number of tokens, further reducing the training budget. The new
scaling law we discovered potentially facilitates the deployment of more
powerful models in low-resource scenarios, and provides an alternative
perspective for the role of computation in machine learning.
[LINK]
http://arxiv.org/abs/2505.10475v1
[DATE]
2025-05-16 00:24:45+08:00
[CATEGORIES]
cs.LG
cs.CL
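The three steps in the ParScale abstract ($P$ input transformations, forward passes through one shared model, and dynamic aggregation of the $P$ outputs) can be sketched with scalar stand-ins. Everything concrete here is an assumption for illustration: the real transformations and aggregation weights are learned, whereas this sketch uses fixed shifts, a `tanh` toy model, and a hand-written gate; the name `parallel_scaled_forward` is hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def parallel_scaled_forward(model, transforms, gate, x):
    """Run one shared model on P transformed inputs and mix the P outputs."""
    # P forward passes reuse the same parameters; shown sequentially here,
    # but they are independent and could be batched.
    outputs = [model(t(x)) for t in transforms]
    # Dynamic aggregation: the mixing weights depend on the outputs themselves.
    weights = softmax([gate(o) for o in outputs])
    mixed = sum(w * o for w, o in zip(weights, outputs))
    return mixed, weights


# Fixed stand-ins for components that would be learned in practice.
model = lambda h: math.tanh(h)
transforms = [lambda x, b=b: x + b for b in (-0.5, 0.0, 0.5)]  # P = 3
gate = lambda o: 2.0 * o

y, weights = parallel_scaled_forward(model, transforms, gate, 0.3)
print(y, weights)
```

The point of the sketch is that increasing $P$ adds compute through extra forward passes while the parameters of `model` are shared across all streams.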
Superposition Yields Robust Neural Scaling
[AUTHORS]
Yizhou Liu, Ziming Liu, Jeff Gore
[ABSTRACT]
The success of today’s large language models (LLMs) depends on the
observation that larger models perform better. However, the origin of this
neural scaling law – the finding that loss decreases as a power law with model
size – remains unclear. Starting from two empirical principles – that LLMs
represent more things than the model dimensions (widths) they have (i.e.,
representations are superposed), and that words or concepts in language occur
with varying frequencies – we constructed a toy model to study the loss
scaling with model size. We found that when superposition is weak, meaning only
the most frequent features are represented without interference, the scaling of
loss with model size depends on the underlying feature frequency; if feature
frequencies follow a power law, so does the loss. In contrast, under strong
superposition, where all features are represented but overlap with each other,
the loss becomes inversely proportional to the model dimension across a wide
range of feature frequency distributions. This robust scaling behavior is
explained geometrically: when many more vectors are packed into a lower
dimensional space, the interference (squared overlaps) between vectors scales
inversely with that dimension. We then analyzed four families of open-sourced
LLMs and found that they exhibit strong superposition and quantitatively match
the predictions of our toy model. The Chinchilla scaling law turned out to also
agree with our results. We conclude that representation superposition is an
important mechanism underlying the observed neural scaling laws. We anticipate
that these insights will inspire new training strategies and model
architectures to achieve better performance with less computation and fewer
parameters.
[COMMENTS]
30 pages, 23 figures
[LINK]
http://arxiv.org/abs/2505.10465v1
[DATE]
2025-05-16 00:18:13+08:00
[CATEGORIES]
cs.LG
cs.CL
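The geometric claim in the abstract above, that squared overlaps between vectors scale inversely with the dimension, can be checked numerically for one simple case. The sketch below assumes independent random unit vectors, for which the expected squared dot product in dimension m is 1/m; random vectors are an assumption of this illustration and are not the feature vectors of the paper's toy model or of real LLMs.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly at random by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_squared_overlap(dim, pairs=2000, seed=0):
    """Average squared dot product over independent pairs of random unit vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        u = random_unit_vector(dim, rng)
        v = random_unit_vector(dim, rng)
        dot = sum(a * b for a, b in zip(u, v))
        total += dot * dot
    return total / pairs


for dim in (16, 64, 256):
    # If overlap scales as 1/dim, this product should stay close to 1.
    print(dim, mean_squared_overlap(dim) * dim)
```

If the interference term dominates the loss, as the abstract argues for the strong superposition regime, this 1/m behavior is consistent with a loss that falls inversely with model dimension.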
Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models
[AUTHORS]
Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, Guo-Jun Qi
[ABSTRACT]
We introduce the Diffusion Chain of Lateral Thought (DCoLT), a
reasoning framework for diffusion language models. DCoLT treats each
intermediate step in the reverse diffusion process as a latent “thinking”
action and optimizes the entire reasoning trajectory to maximize the reward on
the correctness of the final answer with outcome-based Reinforcement Learning
(RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal,
linear thinking process, DCoLT allows bidirectional, non-linear reasoning with
no strict rule on grammatical correctness amid its intermediate steps of
thought. We implement DCoLT on two representative Diffusion Language Models
(DLMs). First, we choose SEDD as a representative continuous-time discrete
diffusion model, where its concrete score derives a probabilistic policy to
maximize the RL reward over the entire sequence of intermediate diffusion
steps. We further consider the discrete-time masked diffusion language model –
LLaDA, and find that the order in which tokens are predicted and unmasked plays an
essential role in optimizing its RL action resulting from the ranking-based Unmasking
Policy Module (UPM) defined by the Plackett-Luce model. Experiments on both
math and code generation tasks show that using only public data and 16 H800
GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even
both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%,
+5.7%, +11.4%, +19.5% on GSM8K, MATH, MBPP, and HumanEval.
[LINK]
http://arxiv.org/abs/2505.10446v1
[DATE]
2025-05-16 00:06:32+08:00
[CATEGORIES]
cs.CL
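The Plackett-Luce model named in the DCoLT abstract defines a distribution over rankings: items are drawn one at a time without replacement, each with probability proportional to the exponential of its score. The Python sketch below samples such an ordering and returns its log-probability, which is the quantity a policy-gradient method would need. It assumes per-token scores are already given; the function name `sample_unmask_order` and the example scores are hypothetical, and this is not the paper's Unmasking Policy Module.

```python
import math
import random


def sample_unmask_order(scores, rng):
    """Sample a ranking from a Plackett-Luce model and return (order, log_prob)."""
    remaining = list(range(len(scores)))
    order, log_prob = [], 0.0
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        total = sum(weights)
        # Choose the next position with probability proportional to exp(score).
        pick = rng.choices(list(range(len(remaining))), weights=weights)[0]
        log_prob += math.log(weights[pick] / total)
        order.append(remaining.pop(pick))
    return order, log_prob


rng = random.Random(0)
# Hypothetical scores for four masked positions; higher means "unmask earlier".
order, log_prob = sample_unmask_order([2.0, 0.5, -1.0, 0.0], rng)
print(order, log_prob)

# With equal scores, all 3! = 6 orderings are equally likely.
uniform_order, uniform_log_prob = sample_unmask_order([0.0, 0.0, 0.0], rng)
```

For long sequences, subtracting the maximum score before exponentiating would avoid overflow; it is omitted here to keep the sketch short.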
Neural Thermodynamic Laws for Large Language Model Training
[AUTHORS]
Ziming Liu, Yizhou Liu, Jeff Gore, Max Tegmark
[ABSTRACT]
Beyond neural scaling laws, little is known about the laws underlying large
language models (LLMs). We introduce Neural Thermodynamic Laws (NTL) – a new
framework that offers fresh insights into LLM training dynamics. On the
theoretical side, we demonstrate that key thermodynamic quantities (e.g.,
temperature, entropy, heat capacity, thermal conduction) and classical
thermodynamic principles (e.g., the three laws of thermodynamics and the
equipartition theorem) naturally emerge under river-valley loss landscape
assumptions. On the practical side, this scientific perspective yields
intuitive guidelines for designing learning rate schedules.
[COMMENTS]
18 pages, 10 figures
[LINK]
http://arxiv.org/abs/2505.10559v1
[DATE]
2025-05-16 01:59:22+08:00
[CATEGORIES]
cs.LG
An AI-driven framework for the prediction of personalised health response to air pollution
[AUTHORS]
Nazanin Zounemat Kermani, Sadjad Naderi, Claire H. Dilliway, Claire E. Heaney, Shrreya Behll, Boyang Chen, Hisham Abubakar-Waziri, Alexandra E. Porter, Marc Chadeau-Hyam, Fangxin Fang, Ian M. Adcock, Kian Fan Chung, Christopher C. Pain
[ABSTRACT]
Air pollution poses a significant threat to public health, causing or
exacerbating many respiratory and cardiovascular diseases. In addition, climate
change is bringing about more extreme weather events such as wildfires and
heatwaves, which can increase levels of pollution and worsen the effects of
pollution exposure. Recent advances in personal sensing have transformed the
collection of behavioural and physiological data, leading to the potential for
new improvements in healthcare. We wish to capitalise on this data, alongside
new capabilities in AI for making time series predictions, in order to monitor
and predict health outcomes for an individual. Thus, we present a novel
workflow for predicting personalised health responses to pollution by
integrating physiological data from wearable fitness devices with real-time
environmental exposures. The data is collected from various sources in a secure
and ethical manner, and is used to train an AI model to predict individual
health responses to pollution exposure within a cloud-based, modular framework.
We demonstrate that the AI model – an Adversarial Autoencoder neural network
in this case – accurately reconstructs time-dependent health signals and
captures nonlinear responses to pollution. Transfer learning is applied using
data from a personal smartwatch, which increases the generalisation abilities
of the AI model and illustrates the adaptability of the approach to real-world,
user-generated data.
[COMMENTS]
Kermani and Naderi share first authorship. 20 pages, 6 figures and 1
table
[LINK]
http://arxiv.org/abs/2505.10556v1
[DATE]
2025-05-16 01:59:07+08:00
[CATEGORIES]
cs.LG
Pharmacophore-Conditioned Diffusion Model for Ligand-Based De Novo Drug Design
[AUTHORS]
Amira Alakhdar, Barnabas Poczos, Newell Washburn
[ABSTRACT]
Developing bioactive molecules remains a central, time- and cost-heavy
challenge in drug discovery, particularly for novel targets lacking structural
or functional data. Pharmacophore modeling presents an alternative for
capturing the key features required for molecular bioactivity against a
biological target. In this work, we present PharmaDiff, a
pharmacophore-conditioned diffusion model for 3D molecular generation.
PharmaDiff employs a transformer-based architecture to integrate an atom-based
representation of the 3D pharmacophore into the generative process, enabling
the precise generation of 3D molecular graphs that align with predefined
pharmacophore hypotheses. Through comprehensive testing, PharmaDiff
demonstrates superior performance in matching 3D pharmacophore constraints
compared to ligand-based drug design methods. Additionally, it achieves higher
docking scores across a range of proteins in structure-based drug design,
without the need for target protein structures. By integrating pharmacophore
modeling with 3D generative techniques, PharmaDiff offers a powerful and
flexible framework for rational drug design.
[LINK]
http://arxiv.org/abs/2505.10545v1
[DATE]
2025-05-16 01:54:29+08:00
[CATEGORIES]
cs.LG
Lightspeed Geometric Dataset Distance via Sliced Optimal Transport
[AUTHORS]
Khai Nguyen, Hai Nguyen, Tuan Pham, Nhat Ho
[ABSTRACT]
We introduce sliced optimal transport dataset distance (s-OTDD), a
model-agnostic, embedding-agnostic approach for dataset comparison that
requires no training, is robust to variations in the number of classes, and can
handle disjoint label sets. The core innovation is Moment Transform Projection
(MTP), which maps a label, represented as a distribution over features, to a
real number. Using MTP, we derive a data point projection that transforms
datasets into one-dimensional distributions. The s-OTDD is defined as the
expected Wasserstein distance between the projected distributions, with respect
to random projection parameters. Leveraging the closed form solution of
one-dimensional optimal transport, s-OTDD achieves (near-)linear computational
complexity in the number of data points and feature dimensions and is
independent of the number of classes. With its geometrically meaningful
projection, s-OTDD strongly correlates with the optimal transport dataset
distance while being more efficient than existing dataset discrepancy measures.
Moreover, it correlates well with the performance gap in transfer learning and
classification accuracy in data augmentation.
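The efficiency claim rests on the closed-form solution of one-dimensional optimal transport: for two equal-size empirical samples, the optimal coupling simply matches sorted values. The sketch below illustrates that idea with a plain sliced Wasserstein distance over random linear projections. It is not s-OTDD itself: the Moment Transform Projection and the label handling are omitted, and the function names, the equal-sample-size restriction, and the default parameters are assumptions made for illustration.

```python
import math
import random


def wasserstein_1d(xs, ys, p=2):
    """p-Wasserstein distance between two equal-size 1D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In 1D the optimal transport plan pairs the i-th smallest values.
    xs, ys = sorted(xs), sorted(ys)
    cost = sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)


def random_unit_vector(dim, rng):
    """Uniform random direction obtained by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein(X, Y, n_projections=64, p=2, seed=0):
    """Monte Carlo average of 1D Wasserstein distances over random projections."""
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_unit_vector(dim, rng)
        px = [sum(t * c for t, c in zip(theta, row)) for row in X]
        py = [sum(t * c for t, c in zip(theta, row)) for row in Y]
        total += wasserstein_1d(px, py, p) ** p
    return (total / n_projections) ** (1.0 / p)


if __name__ == "__main__":
    X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    Y = [[x + 3.0, y + 4.0] for x, y in X]  # X shifted by (3, 4)
    print(sliced_wasserstein(X, Y))
```

Each projection costs one sort, which is where the (near-)linear scaling in the number of data points comes from.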
[COMMENTS]
Accepted to ICML 2025, 16 pages, 13 figures
[LINK]
http://arxiv.org/abs/2501.18901v2
[DATE]
2025-05-16 01:48:47+08:00
[CATEGORIES]
cs.LG
Knowledge capture, adaptation and composition (KCAC): A framework for cross-task curriculum learning in robotic manipulation
[AUTHORS]
Xinrui Wang, Yan Jin
[ABSTRACT]
Reinforcement learning (RL) has demonstrated remarkable potential in robotic
manipulation but faces challenges in sample inefficiency and lack of
interpretability, limiting its applicability in real-world scenarios. Enabling
the agent to gain a deeper understanding and adapt more efficiently to diverse
working scenarios is crucial, and strategic knowledge utilization is a key
factor in this process. This paper proposes a Knowledge Capture, Adaptation,
and Composition (KCAC) framework to systematically integrate knowledge transfer
into RL through cross-task curriculum learning. KCAC is evaluated using a
two-block stacking task in the CausalWorld benchmark, a complex robotic
manipulation environment. To our knowledge, existing RL approaches fail to
solve this task effectively, reflecting deficiencies in knowledge capture. In
this work, we redesign the benchmark reward function by removing rigid
constraints and strict ordering, allowing the agent to maximize total rewards
concurrently and enabling flexible task completion. Furthermore, we define two
self-designed sub-tasks and implement a structured cross-task curriculum to
facilitate efficient learning. As a result, our KCAC approach achieves a 40
percent reduction in training time while improving task success rates by 10
percent compared to traditional RL methods. Through extensive evaluation, we
identify key curriculum design parameters (subtask selection, transition timing,
and learning rate) that optimize learning efficiency and provide conceptual
guidance for curriculum-based RL frameworks. This work offers valuable insights
into curriculum design in RL and robotic learning.
[LINK]
http://arxiv.org/abs/2505.10522v1
[DATE]
2025-05-16 01:30:29+08:00
[CATEGORIES]
cs.LG
A Deep Learning-Driven Inhalation Injury Grading Assistant Using Bronchoscopy Images
[AUTHORS]
Yifan Li, Alan W Pang, Jo Woon Chong
[ABSTRACT]
Inhalation injuries present a challenge in clinical diagnosis and grading due
to conventional grading methods such as the Abbreviated Injury Score (AIS)
being subjective and lacking robust correlation with clinical parameters like
mechanical ventilation duration and patient mortality. This study introduces a
novel deep learning-based diagnosis assistant tool for grading inhalation
injuries using bronchoscopy images to overcome subjective variability and
enhance consistency in severity assessment. Our approach leverages data
augmentation techniques, including graphic transformations, Contrastive
Unpaired Translation (CUT), and CycleGAN, to address the scarcity of medical
imaging data. We evaluate the classification performance of two deep learning
models, GoogLeNet and Vision Transformer (ViT), across a dataset significantly
expanded through these augmentation methods. The results show that GoogLeNet
combined with CUT is the most effective configuration for grading inhalation
injuries from bronchoscopy images, achieving a classification accuracy of
97.8%. Histogram and frequency analyses reveal variations caused by CUT
augmentation, with distribution changes in the histogram and in the texture
details of the frequency spectrum. PCA visualizations underscore that CUT
substantially enhances class separability in the feature space. Moreover,
Grad-CAM analyses provide insight into the decision-making process; the mean
intensity of CUT heatmaps is 119.6, which significantly exceeds the 98.8 of the
original datasets. Our proposed tool leverages mechanical ventilation periods
as a novel grading standard, providing comprehensive diagnostic support.
[LINK]
http://arxiv.org/abs/2505.08517v2
[DATE]
2025-05-16 01:28:04+08:00
[CATEGORIES]
cs.LG
PnPXAI: A Universal XAI Framework Providing Automatic Explanations Across Diverse Modalities and Models
[AUTHORS]
Seongun Kim, Sol A Kim, Geonhyeong Kim, Enver Menadjiev, Chanwoo Lee, Seongwook Chung, Nari Kim, Jaesik Choi
[ABSTRACT]
Recently, post hoc explanation methods have emerged to enhance model
transparency by attributing model outputs to input features. However, these
methods face challenges due to their specificity to certain neural network
architectures and data modalities. Existing explainable artificial intelligence
(XAI) frameworks have attempted to address these challenges but suffer from
several limitations. These include limited flexibility to diverse model
architectures and data modalities due to hard-coded implementations, a
restricted number of supported XAI methods because of the requirements for
layer-specific operations of attribution methods, and sub-optimal
recommendations of explanations due to the lack of evaluation and optimization
phases. Consequently, these limitations impede the adoption of XAI technology
in real-world applications, making it difficult for practitioners to select the
optimal explanation method for their domain. To address these limitations, we
introduce PnPXAI, a universal XAI framework that supports diverse data
modalities and neural network models in a Plug-and-Play (PnP) manner. PnPXAI
automatically detects model architectures, recommends applicable explanation
methods, and optimizes hyperparameters for optimal explanations. We validate
the framework’s effectiveness through user surveys and showcase its versatility
across various domains, including medicine and finance.
[LINK]
http://arxiv.org/abs/2505.10515v1
[DATE]
2025-05-16 01:21:54+08:00
[CATEGORIES]
cs.LG
Learning Nonlinear Dynamics in Physical Modelling Synthesis using Neural Ordinary Differential Equations
[AUTHORS]
Victor Zheleznov, Stefan Bilbao, Alec Wright, Simon King
[ABSTRACT]
Modal synthesis methods are a long-standing approach for modelling
distributed musical systems. In some cases extensions are possible in order to
handle geometric nonlinearities. One such case is the high-amplitude vibration
of a string, where geometric nonlinear effects lead to perceptually important
effects including pitch glides and a dependence of brightness on striking
amplitude. A modal decomposition leads to a coupled nonlinear system of
ordinary differential equations. Recent applied machine learning approaches
(in particular neural ordinary differential equations) have been used
to model lumped dynamic systems such as electronic circuits automatically from
data. In this work, we examine how modal decomposition can be combined with
neural ordinary differential equations for modelling distributed musical
systems. The proposed model leverages the analytical solution for linear
vibration of the system’s modes and employs a neural network to account for
nonlinear dynamic behaviour. Physical parameters of a system remain easily
accessible after the training without the need for a parameter encoder in the
network architecture. As an initial proof of concept, we generate synthetic
data for a nonlinear transverse string and show that the model can be trained
to reproduce the nonlinear dynamics of the system. Sound examples are
presented.
[COMMENTS]
Accepted for publication in Proceedings of the 28th International
Conference on Digital Audio Effects (DAFx25), Ancona, Italy, September 2025
[LINK]
http://arxiv.org/abs/2505.10511v1
[DATE]
2025-05-16 01:17:21+08:00
[CATEGORIES]
cs.LG
An unsupervised method for MRI recovery: Deep image prior with structured sparsity
[AUTHORS]
Muhammad Ahmad Sultan, Chong Chen, Yingmin Liu, Katarzyna Gil, Karolina Zareba, Rizwan Ahmad
[ABSTRACT]
Objective: To propose and validate an unsupervised MRI reconstruction method
that does not require fully sampled k-space data. Materials and Methods: The
proposed method, deep image prior with structured sparsity (DISCUS), extends
the deep image prior (DIP) by introducing group sparsity to frame-specific code
vectors, enabling the discovery of a low-dimensional manifold for capturing
temporal variations. DISCUS was validated using four studies: (I) simulation
of a dynamic Shepp-Logan phantom to demonstrate its manifold discovery
capabilities, (II) comparison with compressed sensing and DIP-based methods
using simulated single-shot late gadolinium enhancement (LGE) image series from
six distinct digital cardiac phantoms in terms of normalized mean square error
(NMSE) and structural similarity index measure (SSIM), (III) evaluation on
retrospectively undersampled single-shot LGE data from eight patients, and (IV)
evaluation on prospectively undersampled single-shot LGE data from eight
patients, assessed via blind scoring from two expert readers. Results: DISCUS
outperformed competing methods, demonstrating superior reconstruction quality
in terms of NMSE and SSIM (Studies I–III) and expert reader scoring (Study
IV). Discussion: An unsupervised image reconstruction method is presented and
validated on simulated and measured data. These developments can benefit
applications where acquiring fully sampled data is challenging.
[COMMENTS]
Magn Reson Mater Phy (2025)
[LINK]
http://arxiv.org/abs/2501.01482v2
[DATE]
2025-05-16 01:15:14+08:00
[CATEGORIES]
cs.LG
Batched Nonparametric Bandits via k-Nearest Neighbor UCB
[AUTHORS]
Sakshi Arya
[ABSTRACT]
We study sequential decision-making in batched nonparametric contextual
bandits, where actions are selected over a finite horizon divided into a small
number of batches. Motivated by constraints in domains such as medicine and
marketing – where online feedback is limited – we propose a nonparametric
algorithm that combines adaptive k-nearest neighbor (k-NN) regression with the
upper confidence bound (UCB) principle. Our method, BaNk-UCB, is fully
nonparametric, adapts to the context dimension, and is simple to implement.
Unlike prior work relying on parametric or binning-based estimators, BaNk-UCB
uses local geometry to estimate rewards and adaptively balances exploration and
exploitation. We provide near-optimal regret guarantees under standard
Lipschitz smoothness and margin assumptions, using a theoretically motivated
batch schedule that balances regret across batches and achieves minimax-optimal
rates. Empirical evaluations on synthetic and real-world datasets demonstrate
that BaNk-UCB consistently outperforms binning-based baselines.
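Combining k-NN regression with the UCB principle can be pictured as follows: for each arm, estimate the reward at the current context from the k nearest previously observed contexts, then add an optimism bonus before picking the arm with the largest index. The sketch below is a generic illustration under stated assumptions, not BaNk-UCB: it uses a fixed k rather than an adaptive one, omits the batch schedule (in a batched setting the histories would only be updated at batch ends), and the bonus form, the `lipschitz` parameter, and the function names are all hypothetical.

```python
import math


def knn_ucb_index(history, context, k, t, lipschitz=1.0):
    """Optimistic index for one arm at `context`.

    history: list of (context_vector, reward) pairs observed for this arm.
    """
    if len(history) < k:
        return math.inf  # force exploration of under-sampled arms
    # Sort past observations by Euclidean distance to the query context.
    by_dist = sorted((math.dist(c, context), r) for c, r in history)
    nearest = by_dist[:k]
    mean_reward = sum(r for _, r in nearest) / k
    radius = nearest[-1][0]  # distance to the k-th nearest neighbor
    # Assumed bonus: a variance-style width plus a Lipschitz bias term.
    width = math.sqrt(2.0 * math.log(max(t, 2)) / k)
    return mean_reward + width + lipschitz * radius


def select_arm(histories, context, k, t):
    """Return the index of the arm with the largest k-NN UCB index."""
    indices = [knn_ucb_index(h, context, k, t) for h in histories]
    return max(range(len(indices)), key=indices.__getitem__)


if __name__ == "__main__":
    # Two arms observed at the same contexts; arm 0 paid more.
    histories = [
        [((0.0,), 1.0), ((0.1,), 1.0)],
        [((0.0,), 0.0), ((0.1,), 0.0)],
    ]
    print(select_arm(histories, (0.05,), k=2, t=10))
```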
[COMMENTS]
25 pages, 6 figures
[LINK]
http://arxiv.org/abs/2505.10498v1
[DATE]
2025-05-16 01:00:51+08:00
[CATEGORIES]
cs.LG
Personalized Federated Learning under Model Dissimilarity Constraints
[AUTHORS]
Samuel Erickson, Mikael Johansson
[ABSTRACT]
One of the defining challenges in federated learning is that of statistical
heterogeneity among clients. We address this problem with KARULA, a regularized
strategy for personalized federated learning, which constrains the pairwise
model dissimilarities between clients based on the difference in their
distributions, as measured by a surrogate for the 1-Wasserstein distance
adapted for the federated setting. This allows the strategy to adapt to highly
complex interrelations between clients that, e.g., clustered approaches fail to
capture. We propose an inexact projected stochastic gradient algorithm to solve
the constrained problem that the strategy defines, and show theoretically that
it converges with smooth, possibly non-convex losses to a neighborhood of a
stationary point with rate O(1/K). We demonstrate the effectiveness of KARULA
on synthetic and real federated data sets.
[LINK]
http://arxiv.org/abs/2505.07575v2
[DATE]
2025-05-16 00:50:52+08:00
[CATEGORIES]
cs.LG
Fine-tuning Diffusion Policies with Backpropagation Through Diffusion Timesteps
[AUTHORS]
Ningyuan Yang, Jiaxuan Gao, Feng Gao, Yi Wu, Chao Yu
[ABSTRACT]
Diffusion policies, widely adopted in decision-making scenarios such as
robotics, gaming and autonomous driving, are capable of learning diverse skills
from demonstration data due to their high representation power. However, the
sub-optimal and limited coverage of demonstration data could lead to diffusion
policies that generate sub-optimal trajectories and even catastrophic failures.
While reinforcement learning (RL)-based fine-tuning has emerged as a promising
solution to address these limitations, existing approaches struggle to
effectively adapt Proximal Policy Optimization (PPO) to diffusion models. This
challenge stems from the computational intractability of action likelihood
estimation during the denoising process, which leads to complicated
optimization objectives. In our experiments starting from randomly initialized
policies, we find that online tuning of Diffusion Policies demonstrates much
lower sample efficiency compared to directly applying PPO on MLP policies
(MLP+PPO). To address these challenges, we introduce NCDPO, a novel framework
that reformulates Diffusion Policy as a noise-conditioned deterministic policy.
By treating each denoising step as a differentiable transformation conditioned
on pre-sampled noise, NCDPO enables tractable likelihood evaluation and
gradient backpropagation through all diffusion timesteps. Our experiments
demonstrate that NCDPO achieves sample efficiency comparable to MLP+PPO when
training from scratch, outperforming existing methods in both sample efficiency
and final performance across diverse benchmarks, including continuous robot
control and multi-agent game scenarios. Furthermore, our experimental results
show that our method is robust to the number of denoising timesteps in the
Diffusion Policy.
[COMMENTS]
9 pages for main text, 23 pages in total, submitted to NeurIPS, 13
figures
[LINK]
http://arxiv.org/abs/2505.10482v1
[DATE]
2025-05-16 00:33:44+08:00
[CATEGORIES]
cs.LG
Unified Modeling Language Code Generation from Diagram Images Using Multimodal Large Language Models
[AUTHORS]
Averi Bates, Ryan Vavricka, Shane Carleton, Ruosi Shao, Chongle Pan
[ABSTRACT]
The Unified Modeling Language is a standardized visual language widely used
for modeling and documenting the design of software systems. Although many
tools generate UML diagrams from UML code, generating executable UML code from
image-based UML diagrams remains challenging. This paper proposes a new
approach to automatically generate UML code using a large multimodal language
model. Synthetic UML activity and sequence diagram datasets were
created to train and test the model. We compared standard fine-tuning with LoRA
techniques to optimize base models. The experiments measured code generation
accuracy across different model sizes and training strategies. The results
demonstrated that domain-adapted MM-LLMs are effective for UML code generation
automation, with the best model achieving BLEU and SSIM scores of
0.779 and 0.942, respectively, on sequence diagrams. This will enable the modernization of
legacy systems and decrease the manual effort in software development
workflows.
[COMMENTS]
Published in the Journal of Machine Learning with Applications,
Author Contributions: Averi Bates: Methodology, Development, Analysis, Data
Curation, Drafting, Review. Ryan Vavricka: Data Curation, Development,
Review. Shane Carleton: Supervision, Funding. Ruosi Shao: Review. Chongle
Pan: Supervision, Review
[LINK]
http://arxiv.org/abs/2503.12293v2
[DATE]
2025-05-16 00:29:38+08:00
[CATEGORIES]
cs.LG
Large Language Models for Cancer Communication: Evaluating Linguistic Quality, Safety, and Accessibility in Generative AI
[AUTHORS]
Agnik Saha, Victoria Churchill, Anny D. Rodriguez, Ugur Kursuncu, Muhammed Y. Idris
[ABSTRACT]
Effective communication about breast and cervical cancers remains a
persistent health challenge, with significant gaps in public understanding of
cancer prevention, screening, and treatment, potentially leading to delayed
diagnoses and inadequate treatments. This study evaluates the capabilities and
limitations of Large Language Models (LLMs) in generating accurate, safe, and
accessible cancer-related information to support patient understanding. We
evaluated five general-purpose and three medical LLMs using a mixed-methods
evaluation framework across linguistic quality, safety and trustworthiness, and
communication accessibility and affectiveness. Our approach utilized
quantitative metrics, qualitative expert ratings, and statistical analysis
using Welch’s ANOVA, Games-Howell, and Hedges’ g. Our results show that
general-purpose LLMs produce outputs of higher linguistic quality and
affectiveness, while medical LLMs demonstrate greater communication
accessibility. However, medical LLMs tend to exhibit higher levels of potential
harm, toxicity, and bias, reducing their performance in safety and
trustworthiness. Our findings indicate a duality between domain-specific
knowledge and safety in health communications. The results highlight the need
for intentional model design with targeted improvements, particularly in
mitigating harm and bias, and improving safety and affectiveness. This study
provides a comprehensive evaluation of LLMs for cancer communication, offering
critical insights for improving AI-generated health content and informing
future development of accurate, safe, and accessible digital health tools.
[LINK]
http://arxiv.org/abs/2505.10472v1
[DATE]
2025-05-16 00:23:21+08:00
[CATEGORIES]
cs.CL
cs.LG
FlowVAT: Normalizing Flow Variational Inference with Affine-Invariant Tempering
[AUTHORS]
Juehang Qin, Shixiao Liang, Christopher Tunnell
[ABSTRACT]
Multi-modal and high-dimensional posteriors present significant challenges
for variational inference, causing mode-seeking behavior and collapse despite
the theoretical expressiveness of normalizing flows. Traditional annealing
methods require temperature schedules and hyperparameter tuning, falling short
of the goal of truly black-box variational inference. We introduce FlowVAT, a
conditional tempering approach for normalizing flow variational inference that
addresses these limitations. Our method tempers both the base and target
distributions simultaneously, maintaining affine-invariance under tempering. By
conditioning the normalizing flow on temperature, we leverage overparameterized
neural networks’ generalization capabilities to train a single flow
representing the posterior across a range of temperatures. This preserves modes
identified at higher temperatures when sampling from the variational posterior
at $T = 1$, mitigating standard variational methods’ mode-seeking behavior. In
experiments with 2-, 10-, and 20-dimensional multi-modal distributions, FlowVAT
outperforms traditional and adaptive annealing methods, finding more modes and
achieving better ELBO values, particularly in higher dimensions where existing
approaches fail. Our method requires minimal hyperparameter tuning and does not
require an annealing schedule, advancing toward fully-automatic black-box
variational inference for complicated posteriors.
[COMMENTS]
10 pages, 5 figures, and 2 tables in main text, two appendices
[LINK]
http://arxiv.org/abs/2505.10466v1
[DATE]
2025-05-16 00:20:36+08:00
[CATEGORIES]
cs.LG
SEAL: Searching Expandable Architectures for Incremental Learning
[AUTHORS]
Matteo Gambella, Vicente Javier Castro Solar, Manuel Roveri
[ABSTRACT]
Incremental learning is a machine learning paradigm where a model learns from
a sequential stream of tasks. This setting poses a key challenge: balancing
plasticity (learning new tasks) and stability (preserving past knowledge).
Neural Architecture Search (NAS), a branch of AutoML, automates the design of
the architecture of Deep Neural Networks and has shown success in static
settings. However, existing NAS-based approaches to incremental learning often
rely on expanding the model at every task, making them impractical in
resource-constrained environments. In this work, we introduce SEAL, a NAS-based
framework tailored for data-incremental learning, a scenario where disjoint
data samples arrive sequentially and are not stored for future access. SEAL
adapts the model structure dynamically by expanding it only when necessary,
based on a capacity estimation metric. Stability is preserved through
cross-distillation training after each expansion step. The NAS component
jointly searches for both the architecture and the optimal expansion policy.
Experiments across multiple benchmarks demonstrate that SEAL effectively
reduces forgetting and enhances accuracy while maintaining a lower model size
compared to prior methods. These results highlight the promise of combining NAS
and selective expansion for efficient, adaptive learning in incremental
scenarios.
[COMMENTS]
8 pages, 5 figures
[LINK]
http://arxiv.org/abs/2505.10457v1
[DATE]
2025-05-16 00:14:18+08:00
[CATEGORIES]
cs.LG
Multi-Objective Optimization-Based Anonymization of Structured Data for Machine Learning Application
[AUTHORS]
Yusi Wei, Hande Y. Benson, Joseph K. Agor, Muge Capan
[LINK]
http://arxiv.org/abs/2501.01002v2
[DATE]
2025-05-16 00:07:20+08:00
[CATEGORIES]
cs.LG
Efficient MCMC Sampling with Expensive-to-Compute and Irregular Likelihoods
[AUTHORS]
Conor Rosato, Harvinder Lehal, Simon Maskell, Lee Devlin, Malcolm Strens
[ABSTRACT]
Bayesian inference with Markov Chain Monte Carlo (MCMC) is challenging when
the likelihood function is irregular and expensive to compute. We explore
several sampling algorithms that make use of subset evaluations to reduce
computational overhead. We adapt the subset samplers for this setting where
gradient information is not available or is unreliable. To achieve this, we
introduce data-driven proxies in place of Taylor expansions and define a novel
computation-cost aware adaptive controller. We undertake an extensive
evaluation for a challenging disease modelling task and a configurable task
with similar irregularity in the likelihood surface. We find our improved
version of Hierarchical Importance with Nested Training Samples (HINTS), with
adaptive proposals and a data-driven proxy, obtains the best sampling error in
a fixed computational budget. We conclude that subset evaluations can provide
cheap and naturally-tempered exploration, while a data-driven proxy can
pre-screen proposals successfully in explored regions of the state space. These
two elements combine through hierarchical delayed acceptance to achieve
efficient, exact sampling.
[COMMENTS]
45 pages
[LINK]
http://arxiv.org/abs/2505.10448v1
[DATE]
2025-05-16 00:06:44+08:00
[CATEGORIES]
cs.LG
Inferring entropy production in many-body systems using nonequilibrium MaxEnt
[AUTHORS]
Miguel Aguilera, Sosuke Ito, Artemy Kolchinsky
[ABSTRACT]
We propose a method for inferring entropy production (EP) in high-dimensional
stochastic systems, including many-body systems and non-Markovian systems with
long memory. Standard techniques for estimating EP become intractable in such
systems due to computational and statistical limitations. We infer
trajectory-level EP and lower bounds on average EP by exploiting a
nonequilibrium analogue of the Maximum Entropy principle, along with convex
duality. Our approach uses only samples of trajectory observables (such as
spatiotemporal correlation functions). It does not require reconstruction of
high-dimensional probability distributions or rate matrices, nor any special
assumptions such as discrete states or multipartite dynamics. It may be used to
compute a hierarchical decomposition of EP, reflecting contributions from
different kinds of interactions, and it has an intuitive physical
interpretation as a thermodynamic uncertainty relation. We demonstrate its
numerical performance on a disordered nonequilibrium spin model with 1000 spins
and a large neural spike-train dataset.
[LINK]
http://arxiv.org/abs/2505.10444v1
[DATE]
2025-05-16 00:05:50+08:00
[CATEGORIES]
cs.LG
PIF: Anomaly detection via preference embedding
[AUTHORS]
Filippo Leveni, Luca Magri, Giacomo Boracchi, Cesare Alippi
[ABSTRACT]
We address the problem of detecting anomalies with respect to structured
patterns. To this end, we conceive a novel anomaly detection method called PIF,
which combines the advantages of adaptive isolation methods with the flexibility
of preference embedding. Specifically, we propose to embed the data in a
high-dimensional space where an efficient tree-based method, PI-Forest, is employed
to compute an anomaly score. Experiments on synthetic and real datasets
demonstrate that PIF compares favorably with state-of-the-art anomaly detection
techniques, and confirm that PI-Forest is better at measuring arbitrary
distances and isolating points in the preference space.
[COMMENTS]
Accepted at International Conference on Pattern Recognition (ICPR
2020)
[LINK]
http://arxiv.org/abs/2505.10441v1
[DATE]
2025-05-16 00:00:31+08:00
[CATEGORIES]
cs.LG
Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization
[AUTHORS]
Dixuan Wang, Yanda Li, Junyuan Jiang, Zepeng Ding, Ziqin Luo, Guochao Jiang, Jiaqing Liang, Deqing Yang
[ABSTRACT]
Large Language Models (LLMs) have shown remarkable capabilities in language
understanding and generation. Nonetheless, it has also been observed that LLMs tend
to produce inaccurate responses to specific queries. This deficiency can be
traced to the tokenization step LLMs must undergo, which is an inevitable
limitation inherent to all LLMs. In fact, incorrect tokenization is the
critical point that hinders LLMs in understanding the input precisely, thus
leading to unsatisfactory output. This defect is more obvious in Chinese
scenarios. To demonstrate this flaw of LLMs, we construct an adversarial
dataset, named ADT (Adversarial Dataset for Tokenizer), which
draws upon the vocabularies of various open-source LLMs to challenge LLMs’
tokenization. ADT consists of two subsets: the manually constructed ADT-Human
and the automatically generated ADT-Auto. Our empirical results reveal that our
ADT is highly effective at challenging the tokenization of leading LLMs,
including GPT-4o, Llama-3, Deepseek-R1 and so on, thus degrading these LLMs’
capabilities. Moreover, our method of automatic data generation has been proven
efficient and robust, and can be applied to any open-source LLM. In this
paper, we substantially investigate LLMs’ vulnerability in terms of challenging
their token segmentation, which will shed light on the subsequent research of
improving LLMs’ capabilities through optimizing their tokenization process and
algorithms.
[LINK]
http://arxiv.org/abs/2405.17067v2
[DATE]
2025-05-15 23:57:32+08:00
[CATEGORIES]
cs.CL
Hierarchical Document Refinement for Long-context Retrieval-augmented Generation
[AUTHORS]
Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou
[ABSTRACT]
Real-world RAG applications often encounter long-context input scenarios,
where redundant information and noise result in higher inference costs and
reduced performance. To address these challenges, we propose LongRefiner, an
efficient plug-and-play refiner that leverages the inherent structural
characteristics of long documents. LongRefiner employs dual-level query
analysis, hierarchical document structuring, and adaptive refinement through
multi-task learning on a single foundation model. Experiments on seven QA
datasets demonstrate that LongRefiner achieves competitive performance in
various scenarios with 10x lower computational cost and latency
compared to the best baseline. Further analysis validates that LongRefiner is
scalable, efficient, and effective, providing practical insights for real-world
long-text RAG applications. Our code is available at
https://github.com/ignorejjj/LongRefiner.
[LINK]
http://arxiv.org/abs/2505.10413v1
[DATE]
2025-05-15 23:34:15+08:00
[CATEGORIES]
cs.CL
Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation
[AUTHORS]
Yue Guo, Jae Ho Sohn, Gondy Leroy, Trevor Cohen
[ABSTRACT]
Plain language summaries (PLSs) are essential for facilitating effective
communication between clinicians and patients by making complex medical
information easier for laypeople to understand and act upon. Large language
models (LLMs) have recently shown promise in automating PLS generation, but
their effectiveness in supporting health information comprehension remains
unclear. Prior evaluations have generally relied on automated scores that do
not measure understandability directly, or subjective Likert-scale ratings from
convenience samples with limited generalizability. To address these gaps, we
conducted a large-scale crowdsourced evaluation of LLM-generated PLSs using
Amazon Mechanical Turk with 150 participants. We assessed PLS quality through
subjective Likert-scale ratings focusing on simplicity, informativeness,
coherence, and faithfulness; and objective multiple-choice comprehension and
recall measures of reader understanding. Additionally, we examined the
alignment between 10 automated evaluation metrics and human judgments. Our
findings indicate that while LLMs can generate PLSs that appear
indistinguishable from human-written ones in subjective evaluations,
human-written PLSs lead to significantly better comprehension. Furthermore,
automated evaluation metrics fail to reflect human judgment, calling into
question their suitability for evaluating PLSs. This is the first study to
systematically evaluate LLM-generated PLSs based on both reader preferences and
comprehension outcomes. Our findings highlight the need for evaluation
frameworks that move beyond surface-level quality and for generation methods
that explicitly optimize for layperson comprehension.
[LINK]
http://arxiv.org/abs/2505.10409v1
[DATE]
2025-05-15 23:31:17+08:00
[CATEGORIES]
cs.CL
Rethinking Repetition Problems of LLMs in Code Generation
[AUTHORS]
Yihong Dong, Yuchen Liu, Xue Jiang, Zhi Jin, Ge Li
[ABSTRACT]
With the advent of neural language models, the performance of code generation
has been significantly boosted. However, the problem of repetitions during the
generation process continues to linger. Previous work has primarily focused on
content repetition, which is merely a fraction of the broader repetition
problem in code generation. A more prevalent and challenging problem is
structural repetition. In structural repetition, the repeated code appears in
various patterns but possesses a fixed structure, which can be inherently
reflected in grammar. In this paper, we formally define structural repetition
and propose an efficient decoding approach called RPG, which stands for
Repetition Penalization based on Grammar, to alleviate the repetition problems
in code generation for LLMs. Specifically, RPG first leverages grammar rules to
identify repetition problems during code generation, and then strategically
decays the likelihood of critical tokens that contribute to repetitions,
thereby mitigating them in code generation. To facilitate this study, we
construct a new dataset CodeRepetEval to comprehensively evaluate approaches
for mitigating the repetition problems in code generation. Extensive
experimental results demonstrate that RPG substantially outperforms the
best-performing baselines on CodeRepetEval dataset as well as HumanEval and
MBPP benchmarks, effectively reducing repetitions and enhancing the quality of
generated code.
[COMMENTS]
Accepted to ACL 2025 (main)
[LINK]
http://arxiv.org/abs/2505.10402v1
[DATE]
2025-05-15 23:26:32+08:00
[CATEGORIES]
cs.CL
cs.LG
Multi-domain Multilingual Sentiment Analysis in Industry: Predicting Aspect-based Opinion Quadruples
[AUTHORS]
Benjamin White, Anastasia Shimorina
[ABSTRACT]
This paper explores the design of an aspect-based sentiment analysis system
using large language models (LLMs) for real-world use. We focus on quadruple
opinion extraction – identifying aspect categories, sentiment polarity,
targets, and opinion expressions from text data across different domains and
languages. Using internal datasets, we investigate whether a single fine-tuned
model can effectively handle multiple domain-specific taxonomies
simultaneously. We demonstrate that a combined multi-domain model achieves
performance comparable to specialized single-domain models while reducing
operational complexity. We also share lessons learned for handling
non-extractive predictions and evaluating various failure modes when developing
LLM-based systems for structured prediction tasks.
[LINK]
http://arxiv.org/abs/2505.10389v1
[DATE]
2025-05-15 23:11:48+08:00
[CATEGORIES]
cs.CL
LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations
[AUTHORS]
Yile Wang, Zhanyu Shen, Hui Huang
[ABSTRACT]
Semantic text representation is a fundamental task in the field of natural
language processing. Existing text embeddings (e.g., SimCSE and LLM2Vec) have
demonstrated excellent performance, but the values of each dimension are
difficult to trace and interpret. Bag-of-words, as classic sparse interpretable
embeddings, suffers from poor performance. Recently, Benara et al. (2024)
propose interpretable text embeddings using large language models, which form
“0/1” embeddings based on responses to a series of questions. These
interpretable text embeddings are typically high-dimensional (larger than
10,000). In this work, we propose Low-dimensional (lower than 500) Dense and
Interpretable text embeddings with Relative representations (LDIR). The
numerical values of its dimensions indicate semantic relatedness to different
anchor texts through farthest point sampling, offering both semantic
representation as well as a certain level of traceability and interpretability.
We validate LDIR on multiple semantic textual similarity, retrieval, and
clustering tasks. Extensive experimental results show that LDIR performs close
to the black-box baseline models and outperforms the interpretable embeddings
baselines with far fewer dimensions. Code is available at
https://github.com/szu-tera/LDIR.
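
The two ingredients named in the abstract, farthest point sampling for choosing anchors and relative representations for scoring a text against them, can be illustrated with a minimal sketch. It is not the authors' implementation: the toy 2-D vectors stand in for real sentence embeddings, cosine similarity is an assumed relatedness measure, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices; each new pick is the point farthest
    (in cosine distance) from the anchors chosen so far."""
    chosen = [start]
    # min_dist[i] = distance from point i to its nearest chosen anchor
    min_dist = [1.0 - cosine(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], 1.0 - cosine(p, points[nxt]))
    return chosen


def relative_representation(vec, anchors):
    """Low-dimensional embedding: one relatedness score per anchor."""
    return [cosine(vec, a) for a in anchors]


if __name__ == "__main__":
    # Toy "embeddings"; a real system would obtain these from a text encoder.
    corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    idx = farthest_point_sampling(corpus, k=2)
    anchors = [corpus[i] for i in idx]
    print(idx, relative_representation([0.5, 0.5], anchors))
```

Each output dimension is tied to one anchor text, which is what makes a value traceable: a high score in a dimension means the input is close to that specific anchor.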
[COMMENTS]
ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2505.10354v1
[DATE]
2025-05-15 22:45:45+08:00
[CATEGORIES]
cs.CL
Not All Adapters Matter: Selective Adapter Freezing for Memory-Efficient Fine-Tuning of Language Models
[AUTHORS]
Hyegang Son, Yonglak Son, Changhoon Kim, Young Geun Kim
[ABSTRACT]
Transformer-based large-scale pre-trained models achieve great success.
Fine-tuning is the standard practice for leveraging these models in downstream
tasks. Among the fine-tuning methods, adapter-tuning provides a
parameter-efficient fine-tuning by introducing lightweight trainable modules
while keeping most pre-trained parameters frozen. However, existing
adapter-tuning methods still impose substantial resource usage. Through our
investigation, we show that each adapter unequally contributes to both task
performance and resource usage. Motivated by this insight, we propose Selective
Adapter FrEezing (SAFE), which gradually freezes less important adapters early
to reduce unnecessary resource usage while maintaining performance. In our
experiments, SAFE reduces memory usage, computation amount, and training time
by 42.85%, 34.59%, and 11.82%, respectively, while achieving comparable or
better task performance compared to the baseline. We also demonstrate that SAFE
induces a regularization effect, thereby smoothing the loss landscape, which
enables the model to generalize better by avoiding sharp minima.
[COMMENTS]
URL: https://aclanthology.org/2025.naacl-long.480/ Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of
the Association for Computational Linguistics: Human Language Technologies
(Volume 1: Long Papers) Year: 2025 Address: Albuquerque, New Mexico
[LINK]
http://arxiv.org/abs/2412.03587v2
[DATE]
2025-05-15 22:39:45+08:00
[CATEGORIES]
cs.CL
cs.LG
FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation
[AUTHORS]
Qianli Wang, Nils Feldhus, Simon Ostermann, Luis Felipe Villa-Arenas, Sebastian Möller, Vera Schmitt
[ABSTRACT]
Counterfactual examples are widely used in natural language processing (NLP)
as valuable data to improve models, and in explainable artificial intelligence
(XAI) to understand model behavior. The automated generation of counterfactual
examples remains a challenging task even for large language models (LLMs),
despite their impressive performance on many tasks. In this paper, we first
introduce ZeroCF, a faithful approach for leveraging important words derived
from feature attribution methods to generate counterfactual examples in a
zero-shot setting. Second, we present a new framework, FitCF, which further
verifies the aforementioned counterfactuals by label flip verification and then
inserts them as demonstrations for few-shot prompting, outperforming two
state-of-the-art baselines. Through ablation studies, we identify the
importance of each of FitCF’s core components in improving the quality of
counterfactuals, as assessed through flip rate, perplexity, and similarity
measures. Furthermore, we show the effectiveness of LIME and Integrated
Gradients as backbone attribution methods for FitCF and find that the number of
demonstrations has the largest effect on performance. Finally, we reveal a
strong correlation between the faithfulness of feature attribution scores and
the quality of generated counterfactuals.
[COMMENTS]
ACL 2025 Findings; camera-ready version
[LINK]
http://arxiv.org/abs/2501.00777v2
[DATE]
2025-05-15 22:18:58+08:00
[CATEGORIES]
cs.CL
cs.LG
Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time
[AUTHORS]
David Herel, Vojtech Bartek, Jiri Jirak, Tomas Mikolov
[ABSTRACT]
Who is the US President? The answer changes depending on when the question is
asked. While large language models (LLMs) are evaluated on various reasoning
tasks, they often miss a crucial dimension: time. In real-world scenarios, the
correctness of answers is frequently tied to temporal context. To address this
gap, we present a novel framework and dataset spanning over 8,000 events from
2018 to 2024, annotated with day-level granularity and sourced globally across
domains such as politics, science, and business. Our TimeShift evaluation
method systematically probes LLMs for temporal reasoning, revealing that base
models often outperform instruction-tuned and synthetic-trained counterparts on
time-sensitive recall. Additionally, we find that even large-scale models
exhibit brittleness in handling paraphrased facts, highlighting unresolved
challenges in temporal consistency. By identifying these limitations, our work
provides a significant step toward advancing time-aware language models capable
of adapting to the dynamic nature of real-world knowledge.
[LINK]
http://arxiv.org/abs/2409.13338v3
[DATE]
2025-05-15 22:13:36+08:00
[CATEGORIES]
cs.CL
Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models
[AUTHORS]
Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
[ABSTRACT]
We propose two simple, principled and practical algorithms that enjoy
provable scaling laws for the test-time compute of large language models
(LLMs). The first one is a two-stage knockout-style algorithm: given an input
problem, it first generates multiple candidate solutions, and then aggregates
them via a knockout tournament for the final output. Assuming that the LLM can
generate a correct solution with non-zero probability and do better than a
random guess in comparing a pair of correct and incorrect solutions, we prove
theoretically that the failure probability of this algorithm decays to zero
exponentially or by a power law (depending on the specific way of scaling) as
its test-time compute grows. The second one is a two-stage league-style
algorithm, where each candidate is evaluated by its average win rate against
multiple opponents, rather than eliminated upon loss to a single opponent.
Under analogous but more robust assumptions, we prove that its failure
probability also decays to zero exponentially with more test-time compute. Both
algorithms require a black-box LLM and nothing else (e.g., no verifier or
reward model) for a minimalistic implementation, which makes them appealing for
practical applications and easy to adapt for different tasks. Through extensive
experiments with diverse models and datasets, we validate the proposed theories
and demonstrate the outstanding scaling properties of both algorithms.
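
The two-stage knockout idea can be sketched as follows, under stated assumptions. The LLM is replaced by stubs: candidates are plain strings, and `make_compare` is a hypothetical noisy judge that prefers the correct candidate with probability `p_right` per vote and decides by majority. The pairing scheme, the bye for an odd pool, and all names are illustrative choices, not details taken from the paper.

```python
import random


def make_compare(is_correct, p_right, votes, rng):
    """Build a stub pairwise judge. When exactly one candidate is correct,
    each of `votes` comparisons favors it with probability p_right and the
    majority decides. A real system would query the LLM here instead."""
    def compare(a, b):
        if is_correct(a) == is_correct(b):
            return a  # both right or both wrong: either may advance
        good, bad = (a, b) if is_correct(a) else (b, a)
        wins = sum(rng.random() < p_right for _ in range(votes))
        return good if 2 * wins > votes else bad
    return compare


def knockout(candidates, compare, rng):
    """Single-elimination tournament over a non-empty candidate list."""
    pool = list(candidates)
    while len(pool) > 1:
        rng.shuffle(pool)
        winners = [compare(pool[i], pool[i + 1])
                   for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # unpaired candidate gets a bye
        pool = winners
    return pool[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stage 1 (stubbed): eight sampled solutions, one of them correct.
    candidates = ["wrong"] * 7 + ["right"]
    judge = make_compare(lambda s: s == "right", p_right=0.8, votes=3, rng=rng)
    # Stage 2: aggregate via the tournament.
    print(knockout(candidates, judge, rng))
```

In this toy setting, more candidates raise the chance that at least one is correct, and more votes per match raise the chance that a correct candidate survives each round, which mirrors the two ways of scaling test-time compute that the abstract mentions.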
[LINK]
http://arxiv.org/abs/2411.19477v3
[DATE]
2025-05-15 22:06:27+08:00
[CATEGORIES]
cs.CL
cs.LG
TopoLM: brain-like spatio-functional organization in a topographic language model
[AUTHORS]
Neil Rathi, Johannes Mehrer, Badr AlKhamissi, Taha Binhuraib, Nicholas M. Blauch, Martin Schrimpf
[ABSTRACT]
Neurons in the brain are spatially organized such that neighbors on tissue
often exhibit similar response profiles. In the human language system,
experimental studies have observed clusters for syntactic and semantic
categories, but the mechanisms underlying this functional organization remain
unclear. Here, building on work from the vision literature, we develop TopoLM,
a transformer language model with an explicit two-dimensional spatial
representation of model units. By combining a next-token prediction objective
with a spatial smoothness loss, representations in this model assemble into
clusters that correspond to semantically interpretable groupings of text and
closely match the functional organization in the brain’s language system.
TopoLM successfully predicts the emergence of the spatio-functional
organization of a cortical language system as well as the organization of
functional clusters selective for fine-grained linguistic features empirically
observed in human cortex. Our results suggest that the functional organization
of the human language system is driven by a unified spatial objective, and
provide a functionally and spatially aligned model of language processing in
the brain.
[LINK]
http://arxiv.org/abs/2410.11516v3
[DATE]
2025-05-15 21:50:00+08:00
[CATEGORIES]
cs.CL
StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation
[AUTHORS]
Daniel A. P. Oliveira, David Martins de Matos
[ABSTRACT]
Visual storytelling systems struggle to maintain character identity across
frames and link actions to appropriate subjects, frequently leading to
referential hallucinations. These issues can be addressed through grounding of
characters, objects, and other entities on the visual elements. We propose
StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie
images, with both structured scene analyses and grounded stories. Each story
maintains character and object consistency across frames while explicitly
modeling multi-frame relationships through structured tabular representations.
Our approach features cross-frame object re-identification using visual
similarity and face recognition, chain-of-thought reasoning for explicit
narrative modeling, and a grounding scheme that links textual elements to
visual entities across multiple frames. We establish baseline performance by
fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end
object detection, re-identification, and landmark detection while maintaining
consistent object references throughout the story. Evaluation demonstrates a
reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story when
compared to a non-fine-tuned model.
[COMMENTS]
31 pages, 14 figures
[LINK]
http://arxiv.org/abs/2505.10292v1
[DATE]
2025-05-15 21:42:14+08:00
[CATEGORIES]
cs.CL
KBAlign: Efficient Self Adaptation on Specific Knowledge Bases
[AUTHORS]
Zheni Zeng, Yuxuan Chen, Shi Yu, Ruobing Wang, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
[ABSTRACT]
Although retrieval-augmented generation (RAG) remains essential for
knowledge-based question answering (KBQA), current paradigms face critical
challenges under specific domains. Existing methods struggle with targeted
adaptation on small-scale KBs: vanilla unsupervised training exhibits poor
effectiveness, while fine-tuning incurs prohibitive costs of external signals.
We present KBAlign, a self-supervised framework that enhances RAG systems
through efficient model adaptation. Our key insight is to leverage the model’s
intrinsic capabilities for knowledge alignment through two innovative
mechanisms: multi-grained self-annotation that captures global knowledge for
data construction, and iterative tuning that accelerates convergence through
self verification. This framework enables cost-effective model adaptation to
specific textual KBs, without human supervision or external model assistance.
Experiments demonstrate that KBAlign can achieve 90% of the performance gain
obtained through GPT-4-supervised adaptation, while relying entirely on
self-annotation of much smaller models. KBAlign significantly improves
downstream QA accuracy across multiple domains with tiny costs, particularly
benefiting scenarios requiring deep knowledge integration from specialized
corpora. We release our experimental data, models, and process analyses to the
community for further exploration (https://github.com/thunlp/KBAlign).
[LINK]
http://arxiv.org/abs/2411.14790v4
[DATE]
2025-05-15 21:02:21+08:00
[CATEGORIES]
cs.CL
TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs
[AUTHORS]
Yuxuan Gu, Wuyang Zhou, Giorgos Iacovides, Danilo Mandic
[ABSTRACT]
The reasoning abilities of Large Language Models (LLMs) can be improved by
structurally denoising their weights, yet existing techniques primarily focus
on denoising the feed-forward network (FFN) of the transformer block, and
cannot efficiently utilise the Multi-head Attention (MHA) block, which is the core
of transformer architectures. To address this issue, we propose a novel
intuitive framework that, at its very core, performs MHA compression through a
multi-head tensorisation process and the Tucker decomposition. This enables
both higher-dimensional structured denoising and compression of the MHA
weights, by enforcing a shared higher-dimensional subspace across the weights
of the multiple attention heads. We demonstrate that this approach consistently
enhances the reasoning capabilities of LLMs across multiple benchmark datasets,
and for both encoder-only and decoder-only architectures, while achieving
compression rates of up to $\sim 250$ times in the MHA weights, all without
requiring any additional data, training, or fine-tuning. Furthermore, we show
that the proposed method can be seamlessly combined with existing
FFN-only-based denoising techniques to achieve further improvements in LLM
reasoning performance.
[COMMENTS]
Accepted for IEEE International Joint Conference on Neural Networks
(IJCNN 2025). The code is available at https://github.com/guyuxuan9/TensorLLM
[LINK]
http://arxiv.org/abs/2501.15674v2
[DATE]
2025-05-15 20:42:44+08:00
[CATEGORIES]
cs.CL
cs.LG
ComplexFormer: Disruptively Advancing Transformer Inference Ability via Head-Specific Complex Vector Attention
[AUTHORS]
Jintian Shao, Hongyi Huang, Jiayi Wu, Beiwen Zhang, ZhiYu Wu, You Shan, MingKai Zheng
[ABSTRACT]
Transformer models rely on self-attention to capture token dependencies but
face challenges in effectively integrating positional information while
allowing multi-head attention (MHA) flexibility. Prior methods often model
semantic and positional differences disparately or apply uniform positional
adjustments across heads, potentially limiting representational capacity. This
paper introduces ComplexFormer, featuring Complex Multi-Head Attention (CMHA).
CMHA empowers each head to independently model semantic and positional
differences unified within the complex plane, representing interactions as
rotations and scaling. ComplexFormer incorporates two key improvements: (1) a
per-head Euler transformation, converting real-valued query/key projections
into polar-form complex vectors for head-specific complex subspace operation;
and (2) a per-head adaptive differential rotation mechanism,
exp[i(Adapt(ASmn,i) + Delta(Pmn),i)], allowing each head to learn distinct
strategies for integrating semantic angle differences (ASmn,i) with relative
positional encodings (Delta(Pmn),i). Extensive experiments on language
modeling, text generation, code generation, and mathematical reasoning show
ComplexFormer achieves superior performance, significantly lower generation
perplexity, and improved long-context coherence compared to strong baselines
like RoPE-Transformers. ComplexFormer demonstrates strong parameter efficiency,
offering a more expressive, adaptable attention mechanism.
[LINK]
http://arxiv.org/abs/2505.10222v1
[DATE]
2025-05-15 20:30:33+08:00
[CATEGORIES]
cs.LG
cs.CL
Latent Action Pretraining from Videos
[AUTHORS]
Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
[ABSTRACT]
We introduce Latent Action Pretraining for general Action models (LAPA), an
unsupervised method for pretraining Vision-Language-Action (VLA) models without
ground-truth robot action labels. Existing Vision-Language-Action models
require action labels typically collected by human teleoperators during
pretraining, which significantly limits possible data sources and scale. In
this work, we propose a method to learn from internet-scale videos that do not
have robot action labels. We first train an action quantization model
leveraging a VQ-VAE-based objective to learn discrete latent actions between
image frames, then pretrain a latent VLA model to predict these latent actions
from observations and task descriptions, and finally finetune the VLA on
small-scale robot manipulation data to map from latent to robot actions.
Experimental results demonstrate that our method significantly outperforms
existing techniques that train robot manipulation policies from large-scale
videos. Furthermore, it outperforms the state-of-the-art VLA model trained with
robotic action labels on real-world manipulation tasks that require language
conditioning, generalization to unseen objects, and semantic generalization to
unseen instructions. Training only on human manipulation videos also shows
positive transfer, opening up the potential for leveraging web-scale data for
robotics foundation models.
[COMMENTS]
ICLR 2025 Website: https://latentactionpretraining.github.io
[LINK]
http://arxiv.org/abs/2410.11758v2
[DATE]
2025-05-15 20:13:37+08:00
[CATEGORIES]
cs.CL
cs.LG
VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits
[AUTHORS]
Jintian Shao, Hongyi Huang, Jiayi Wu, YiMing Cheng, ZhiYu Wu, You Shan, MingKai Zheng
[ABSTRACT]
Large Language Models (LLMs) have achieved remarkable success but face
significant computational and memory challenges, particularly due to their
extensive output vocabularies. The final linear projection layer, mapping
hidden states to vocabulary-sized logits, often constitutes a substantial
portion of the model’s parameters and computational cost during inference.
Existing methods like adaptive softmax or hierarchical softmax introduce
structural complexities. In this paper, we propose VQ-Logits, a novel approach
that leverages Vector Quantization (VQ) to drastically reduce the parameter
count and computational load of the LLM output layer. VQ-Logits replaces the
large V × d_model output embedding matrix with a small, shared codebook of K
embedding vectors (K ≪ V). Each token in the vocabulary is mapped to one of
these K codebook vectors. The LLM predicts logits over this compact codebook,
which are then efficiently “scattered” to the full vocabulary space using the
learned or preassigned mapping. We demonstrate through extensive experiments on
standard language modeling benchmarks (e.g., WikiText-103, C4) that VQ-Logits
can achieve up to 99% parameter reduction in the output layer and 6x speedup in
logit computation, with only a marginal 4% increase in perplexity compared to
full softmax baselines. We further provide detailed ablation studies on
codebook size, initialization, and learning strategies, showcasing the
robustness and effectiveness of our approach.
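
The "predict over the codebook, then scatter" step can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, the token-to-code mapping is given rather than learned, and the sizes in `output_params` are made-up round numbers chosen only to show how the parameter count scales with K versus V.

```python
def codebook_logits(hidden, codebook):
    """Logits over the K codebook vectors: one dot product per code."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in codebook]


def scatter_logits(code_logits, token_to_code):
    """Expand K codebook logits to V vocabulary logits via the mapping.
    Tokens that share a code receive the same logit in this basic form."""
    return [code_logits[c] for c in token_to_code]


def output_params(vocab_size, d_model, k):
    """Weights in a full V x d_model projection versus a K x d_model
    codebook (the integer token-to-code table is not counted)."""
    full = vocab_size * d_model
    vq = k * d_model
    return full, vq, 1.0 - vq / full


if __name__ == "__main__":
    hidden = [1.0, 2.0]                    # toy hidden state, d_model = 2
    codebook = [[1.0, 0.0], [0.0, 1.0]]    # K = 2 codebook vectors
    token_to_code = [0, 1, 1, 0, 1]        # V = 5 tokens mapped to codes
    print(scatter_logits(codebook_logits(hidden, codebook), token_to_code))
    print(output_params(vocab_size=50_000, d_model=1024, k=500))
```

The matrix product now costs K dot products instead of V, and the scatter is an index lookup, which is where the savings in this sketch come from.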
[LINK]
http://arxiv.org/abs/2505.10202v1
[DATE]
2025-05-15 19:58:04+08:00
[CATEGORIES]
cs.CL
The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
[AUTHORS]
Seongyun Lee, Seungone Kim, Minju Seo, Yongrae Jo, Dongyoung Go, Hyeonbin Hwang, Jinho Park, Xiang Yue, Sean Welleck, Graham Neubig, Moontae Lee, Minjoon Seo
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2505.10185v1
[DATE]
2025-05-15 19:31:02+08:00
[CATEGORIES]
cs.CL
Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
[AUTHORS]
Yoichi Ishibashi, Taro Yano, Masafumi Oyamada
[ABSTRACT]
Large Language Models (LLMs) have demonstrated significant improvements in
reasoning capabilities through supervised fine-tuning and reinforcement
learning. However, when training reasoning models, these approaches are
primarily applicable to specific domains such as mathematics and programming,
which imposes fundamental constraints on the breadth and scalability of
training data. In contrast, continual pretraining (CPT) offers the advantage of
not requiring task-specific signals. Nevertheless, how to effectively
synthesize training data for reasoning and how such data affect a wide range of
domains remain largely unexplored. This study provides a detailed evaluation of
Reasoning CPT, a form of CPT that uses synthetic data to reconstruct the hidden
thought processes underlying texts, based on the premise that texts are the
result of the author’s thinking process. Specifically, we apply Reasoning CPT
to Gemma2-9B using synthetic data with hidden thoughts derived from STEM and
Law corpora, and compare it to standard CPT on the MMLU benchmark. Our analysis
reveals that Reasoning CPT consistently improves performance across all
evaluated domains. Notably, reasoning skills acquired in one domain transfer
effectively to others; the performance gap with conventional methods widens as
problem difficulty increases, with gains of up to 8 points on the most
challenging problems. Furthermore, models trained with hidden thoughts learn to
adjust the depth of their reasoning according to problem difficulty.
[LINK]
http://arxiv.org/abs/2505.10182v1
[DATE]
2025-05-15 19:29:01+08:00
[CATEGORIES]
cs.CL
cs.LG
Phase Diagram of Vision Large Language Models Inference: A Perspective from Interaction across Image and Instruction
[AUTHORS]
Houjing Wei, Yuting Shi, Naoya Inoue
[ABSTRACT]
Vision Large Language Models (VLLMs) usually take input as a concatenation of
image token embeddings and text token embeddings and conduct causal modeling.
However, their internal behaviors remain underexplored, raising the question of
interaction between the two types of tokens. To investigate such multimodal
interaction during model inference, in this paper, we measure the
contextualization among the hidden state vectors of tokens from different
modalities. Our experiments uncover four-phase inference dynamics of VLLMs
against the depth of Transformer-based LMs, including (I) Alignment: In very
early layers, contextualization emerges between modalities, suggesting a
feature space alignment. (II) Intra-modal Encoding: In early layers,
intra-modal contextualization is enhanced while inter-modal interaction is
suppressed, suggesting a local encoding within modalities. (III) Inter-modal
Encoding: In later layers, contextualization across modalities is enhanced,
suggesting a deeper fusion across modalities. (IV) Output Preparation: In very
late layers, contextualization is reduced globally, and hidden states are
aligned towards the unembedding space.
[COMMENTS]
6 pages, 5 figures
[LINK]
http://arxiv.org/abs/2411.00646v2
[DATE]
2025-05-15 19:25:54+08:00
[CATEGORIES]
cs.CL
GE-Chat: A Graph Enhanced RAG Framework for Evidential Response Generation of LLMs
[AUTHORS]
Longchao Da, Parth Mitesh Shah, Kuan-Ru Liou, Jiaxing Zhang, Hua Wei
[ABSTRACT]
Large Language Models are now key assistants in human decision-making
processes. However, a common note always seems to follow: “LLMs can make
mistakes. Be careful with important info.” This points to the reality that not
all outputs from LLMs are dependable, and users must evaluate them manually.
The challenge deepens as hallucinated responses, often presented with seemingly
plausible explanations, create complications and raise trust issues among
users. To tackle this issue, this paper proposes GE-Chat, a knowledge Graph
enhanced retrieval-augmented generation framework to provide Evidence-based
response generation. Specifically, when the user uploads a material document, a
knowledge graph will be created, which helps construct a retrieval-augmented
agent, enhancing the agent’s responses with additional knowledge beyond its
training corpus. Then we leverage Chain-of-Thought (CoT) logic generation,
n-hop sub-graph searching, and entailment-based sentence generation to realize
accurate evidence retrieval. We demonstrate that our method improves the
existing models’ performance in terms of identifying the exact evidence in a
free-form context, providing a reliable way to examine the sources of an LLM’s
conclusions and to help judge their trustworthiness.
[COMMENTS]
5 pages, 4 figures, accepted to IJCAI2025 demo track
[LINK]
http://arxiv.org/abs/2505.10143v1
[DATE]
2025-05-15 18:17:35+08:00
[CATEGORIES]
cs.CL
Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering
[AUTHORS]
Yangfu Li, Hongjian Zhan, Tianyi Chen, Qi Liu, Yue Lu
[ABSTRACT]
Existing visual token pruning methods target prompt alignment and visual
preservation with static strategies, overlooking the varying relative
importance of these objectives across tasks, which leads to inconsistent
performance. To address this, we derive the first closed-form error bound for
visual token pruning based on the Hausdorff distance, uniformly characterizing
the contributions of both objectives. Moreover, leveraging $\epsilon$-covering
theory, we reveal an intrinsic trade-off between these objectives and quantify
their optimal attainment levels under a fixed budget. To practically handle
this trade-off, we propose Multi-Objective Balanced Covering (MoB), which
reformulates visual token pruning as a bi-objective covering problem. In this
framework, the attainment trade-off reduces to budget allocation via greedy
radius trading. MoB offers a provable performance bound and linear scalability
with respect to the number of input visual tokens, enabling adaptation to
challenging pruning scenarios. Extensive experiments show that MoB preserves
96.4% of performance for LLaVA-1.5-7B using only 11.1% of the original visual
tokens and accelerates LLaVA-Next-7B by 1.3-1.5$\times$ with negligible
performance loss. Additionally, evaluations on Qwen2-VL and Video-LLaVA confirm
that MoB integrates seamlessly into advanced MLLMs and diverse vision-language
tasks.
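
For readers unfamiliar with the Hausdorff distance on which the error bound is based, the standard definition can be written out as a short sketch. How the paper applies it to prompt alignment and visual preservation is not specified in the abstract; the toy 2-D points below stand in for token embeddings, and the function names are hypothetical.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(x, y) for y in b) for x in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


if __name__ == "__main__":
    all_tokens = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # toy visual tokens
    kept_tokens = [(0.0, 0.0), (1.0, 0.0)]              # a pruned subset
    print(hausdorff(all_tokens, kept_tokens))
```

When the kept tokens are a subset of the original set, the distance reduces to the covering radius: the smallest ε such that every original token lies within ε of some kept token. This is the link to the ε-covering view mentioned in the abstract.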
[COMMENTS]
31 pages, 9 figures, conference
[LINK]
http://arxiv.org/abs/2505.10118v1
[DATE]
2025-05-15 17:43:28+08:00
[CATEGORIES]
cs.CL
Learning Virtual Machine Scheduling in Cloud Computing through Language Agents
[AUTHORS]
JieHao Wu, Ziwei Wang, Junjie Sheng, Wenhao Li, Xiangfei Wang, Jun Luo
[ABSTRACT]
In cloud services, virtual machine (VM) scheduling is a typical Online
Dynamic Multidimensional Bin Packing (ODMBP) problem, characterized by
large-scale complexity and fluctuating demands. Traditional optimization
methods struggle to adapt to real-time changes, domain-expert-designed
heuristic approaches suffer from rigid strategies, and existing learning-based
methods often lack generalizability and interpretability. To address these
limitations, this paper proposes a hierarchical language agent framework named
MiCo, which provides a large language model (LLM)-driven heuristic design
paradigm for solving ODMBP. Specifically, ODMBP is formulated as a Semi-Markov
Decision Process with Options (SMDP-Option), enabling dynamic scheduling
through a two-stage architecture, i.e., Option Miner and Option Composer.
Option Miner utilizes LLMs to discover diverse and useful non-context-aware
strategies by interacting with constructed environments. Option Composer
employs LLMs to discover a composing strategy that integrates the
non-context-aware strategies with the contextual ones. Extensive experiments on
real-world enterprise datasets demonstrate that MiCo achieves a 96.9%
competitive ratio in large-scale scenarios involving more than 10,000 virtual
machines. It maintains high performance even under nonstationary request flows
and diverse configurations, thus validating its effectiveness in complex and
large-scale cloud environments.
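To make the bin-packing framing concrete, here is a minimal sketch of a first-fit placement rule over two resource dimensions (CPU and memory). It is the kind of simple hand-written heuristic the abstract contrasts with, not MiCo itself; the function name, the two-dimensional resource model, and the example capacities are assumptions, and VM departures (the "dynamic" part of ODMBP) are not modeled.

```python
def first_fit(hosts, request):
    """Place `request` (cpu, mem) on the first host with enough free capacity.

    `hosts` is a list of [free_cpu, free_mem] lists and is mutated in place.
    Returns the index of the chosen host, or None if no host can fit the VM.
    """
    for i, free in enumerate(hosts):
        if all(f >= r for f, r in zip(free, request)):
            for d, r in enumerate(request):
                free[d] -= r  # reserve the capacity on this host
            return i
    return None

# Hypothetical hosts and an online stream of VM requests.
hosts = [[4, 8], [8, 16]]
placements = [first_fit(hosts, vm) for vm in [(2, 4), (4, 4), (2, 4), (8, 8)]]
```

The last request is rejected even though total free CPU across hosts was large earlier in the stream, which illustrates why placement order matters in the online setting.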
[LINK]
http://arxiv.org/abs/2505.10117v1
[DATE]
2025-05-15 17:42:11+08:00
[CATEGORIES]
cs.LG
cs.CL
From Text to Network: Constructing a Knowledge Graph of Taiwan-Based China Studies Using Generative AI
[AUTHORS]
Hsuan-Lei Shao
[ABSTRACT]
Taiwanese China Studies (CS) has developed into a rich, interdisciplinary
research field shaped by the unique geopolitical position and long-standing
academic engagement with Mainland China. This study responds to the growing
need to systematically revisit and reorganize decades of Taiwan-based CS
scholarship by proposing an AI-assisted approach that transforms unstructured
academic texts into structured, interactive knowledge representations. We apply
generative AI (GAI) techniques and large language models (LLMs) to extract and
standardize entity-relation triples from 1,367 peer-reviewed CS articles
published between 1996 and 2019. These triples are then visualized through a
lightweight D3.js-based system, forming the foundation of a domain-specific
knowledge graph and vector database for the field. This infrastructure allows
users to explore conceptual nodes and semantic relationships across the corpus,
revealing previously uncharted intellectual trajectories, thematic clusters,
and research gaps. By decomposing textual content into graph-structured
knowledge units, our system enables a paradigm shift from linear text
consumption to network-based knowledge navigation. In doing so, it enhances
scholarly access to CS literature while offering a scalable, data-driven
alternative to traditional ontology construction. This work not only
demonstrates how generative AI can augment area studies and digital humanities
but also highlights its potential to support a reimagined scholarly
infrastructure for regional knowledge systems.
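The step from extracted triples to a navigable graph can be sketched as follows. This is only a minimal illustration of indexing (head, relation, tail) triples for neighbor lookup; the function names are hypothetical, and the triples are invented placeholders rather than data from the paper.

```python
from collections import defaultdict

def build_graph(triples):
    """Index (head, relation, tail) triples as head -> [(relation, tail), ...]."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph

def neighbors(graph, node):
    """Return the set of entities directly linked from `node`."""
    return {tail for _, tail in graph.get(node, [])}

# Placeholder triples for illustration only; not taken from the corpus.
triples = [
    ("topic_a", "related_to", "topic_b"),
    ("topic_a", "studied_by", "author_x"),
    ("topic_b", "studied_by", "author_x"),
]
graph = build_graph(triples)
linked = neighbors(graph, "topic_a")
```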
[COMMENTS]
4 pages, 4 figures
[LINK]
http://arxiv.org/abs/2505.10093v1
[DATE]
2025-05-15 16:51:53+08:00
[CATEGORIES]
cs.CL
XRAG: Cross-lingual Retrieval-Augmented Generation
[AUTHORS]
Wei Liu, Sony Trenous, Leonardo F. R. Ribeiro, Bill Byrne, Felix Hieber
[ABSTRACT]
We propose XRAG, a novel benchmark designed to evaluate the generation
abilities of LLMs in cross-lingual Retrieval-Augmented Generation (RAG)
settings where the user language does not match the retrieval results. XRAG is
constructed from recent news articles to ensure that its questions require
external knowledge to be answered. It covers the real-world scenarios of
monolingual and multilingual retrieval, and provides relevancy annotations for
each retrieved document. Our novel dataset construction pipeline results in
questions that require complex reasoning, as evidenced by the significant gap
between human and LLM performance. Consequently, XRAG serves as a valuable
benchmark for studying LLM reasoning abilities, even before considering the
additional cross-lingual complexity. Experimental results on five LLMs uncover
two previously unreported challenges in cross-lingual RAG: 1) in the
monolingual retrieval setting, all evaluated models struggle with response
language correctness; 2) in the multilingual retrieval setting, the main
challenge lies in reasoning over retrieved information across languages rather
than generation of non-English text.
[LINK]
http://arxiv.org/abs/2505.10089v1
[DATE]
2025-05-15 16:47:55+08:00
[CATEGORIES]
cs.CL
PersLLM: A Personified Training Approach for Large Language Models
[AUTHORS]
Zheni Zeng, Jiayi Chen, Huimin Chen, Yukun Yan, Yuxuan Chen, Zhenghao Liu, Zhiyuan Liu, Maosong Sun
[ABSTRACT]
Large language models (LLMs) exhibit human-like intelligence, enabling them
to simulate human behavior and support various applications that require both
humanized communication and extensive knowledge reserves. Efforts are made to
personify LLMs with special training data or hand-crafted prompts, while
correspondingly faced with challenges such as insufficient data usage or rigid
behavior patterns. Consequently, personified LLMs fail to capture personified
knowledge or express persistent opinion. To fully unlock the potential of LLM
personification, we propose PersLLM, a framework for better data construction
and model tuning. For insufficient data usage, we incorporate strategies such
as Chain-of-Thought prompting and anti-induction, improving the quality of data
construction and capturing the personality experiences, knowledge, and thoughts
more comprehensively. For rigid behavior patterns, we design the tuning process
and introduce automated DPO to enhance the specificity and dynamism of the
models’ personalities, which leads to a more natural opinion communication.
Both automated metrics and expert human evaluations demonstrate the
effectiveness of our approach. Case studies in human-machine interactions and
multi-agent systems further suggest potential application scenarios and future
directions for LLM personification.
[COMMENTS]
8 pages for main text, 5 figures
[LINK]
http://arxiv.org/abs/2407.12393v5
[DATE]
2025-05-15 16:22:08+08:00
[CATEGORIES]
cs.CL
Understanding In-context Learning of Addition via Activation Subspaces
[AUTHORS]
Xinyan Hu, Kayo Yin, Michael I. Jordan, Jacob Steinhardt, Lijie Chen
[ABSTRACT]
To perform in-context learning, language models must extract signals from
individual few-shot examples, aggregate these into a learned prediction rule,
and then apply this rule to new examples. How is this implemented in the
forward pass of modern transformer models? To study this, we consider a
structured family of few-shot learning tasks for which the true prediction rule
is to add an integer $k$ to the input. We find that Llama-3-8B attains high
accuracy on this task for a range of $k$, and localize its few-shot ability to
just three attention heads via a novel optimization approach. We further show
the extracted signals lie in a six-dimensional subspace, where four of the
dimensions track the unit digit and the other two dimensions track overall
magnitude. We finally examine how these heads extract information from
individual few-shot examples, identifying a self-correction mechanism in which
mistakes from earlier examples are suppressed by later examples. Our results
demonstrate how tracking low-dimensional subspaces across a forward pass can
provide insight into fine-grained computational structures.
[COMMENTS]
20 pages
[LINK]
http://arxiv.org/abs/2505.05145v2
[DATE]
2025-05-15 15:19:33+08:00
[CATEGORIES]
cs.LG
cs.CL
DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs
[AUTHORS]
Lake Yin, Fan Huang
[ABSTRACT]
As Large Language Models (LLMs) have risen in prominence over the past few
years, there has been concern over the potential biases in LLMs inherited from
the training data. Previous studies have examined how LLMs exhibit implicit
bias, such as when response generation changes when different social contexts
are introduced. We argue that this implicit bias is not only an ethical, but
also a technical issue, as it reveals an inability of LLMs to accommodate
extraneous information. However, unlike other measures of LLM intelligence,
there are no standard methods to benchmark this specific subset of LLM bias. To
bridge this gap, we developed a method for calculating an easily interpretable
benchmark, DIF (Demographic Implicit Fairness), by evaluating preexisting LLM
logic and math problem datasets with sociodemographic personas. We demonstrate
that this method can statistically validate the presence of implicit bias in
LLM behavior and find an inverse trend between question answering accuracy and
implicit bias, supporting our argument.
[COMMENTS]
7 pages, 1 figure
[LINK]
http://arxiv.org/abs/2505.10013v1
[DATE]
2025-05-15 14:53:37+08:00
[CATEGORIES]
cs.CL
Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners
[AUTHORS]
Yifei Gao, Jie Ou, Lei Wang, Jun Cheng, Mengchu Zhou
[ABSTRACT]
The quantization of large language models (LLMs) has been a prominent
research area aimed at enabling their lightweight deployment in practice.
Existing research on LLM quantization has mainly explored the interplay
between weights and activations, or employed auxiliary components, while
neglecting the necessity of adjusting weights during quantization.
Consequently, original weight distributions frequently fail to yield desired
results after round-to-nearest (RTN) quantization. Even though incorporating
techniques such as mixed precision and low-rank error approximation in LLM
quantization can yield improved results, they inevitably introduce additional
computational overhead. On the other hand, traditional techniques for weight
quantization, such as Generative Post-Training Quantization, rely on manually
tweaking weight distributions to minimize local errors, but they fall short of
achieving globally optimal outcomes. Although the recently proposed Learnable
Singular-value Increment improves global weight quantization by modifying
weight distributions, it disrupts the original distribution considerably. This
introduces pronounced bias toward the training data and can degrade downstream
task performance. In this paper, we introduce Singular-value Diagonal
Expansion, a more nuanced approach to refining weight distributions to achieve
better quantization alignment. Furthermore, we introduce Cross-layer Learning
that improves overall quantization outcomes by distributing errors more evenly
across layers. Our plug-and-play weight-quantization methods demonstrate
substantial performance improvements over state-of-the-art approaches,
including OmniQuant, DuQuant, and PrefixQuant.
[COMMENTS]
Efficient Quantization Methods for LLMs
[LINK]
http://arxiv.org/abs/2407.15508v3
[DATE]
2025-05-15 13:34:45+08:00
[CATEGORIES]
cs.CL
Beyond Next Token Prediction: Patch-Level Training for Large Language Models
[AUTHORS]
Chenze Shao, Fandong Meng, Jie Zhou
[ABSTRACT]
The prohibitive training costs of Large Language Models (LLMs) have emerged
as a significant bottleneck in the development of next-generation LLMs. In this
paper, we show that it is possible to significantly reduce the training costs
of LLMs without sacrificing their performance. Specifically, we introduce
patch-level training for LLMs, in which multiple tokens are aggregated into a
unit of higher information density, referred to as a ‘patch’, to serve as the
fundamental text unit for training LLMs. During patch-level training, we feed
the language model shorter sequences of patches and train it to predict the
next patch, thereby processing the majority of the training data at a
significantly reduced cost. Following this, the model continues token-level
training on the remaining training data to align with the inference mode.
Experiments on a diverse range of models (370M-2.7B parameters) demonstrate
that patch-level training can reduce the overall training costs to 0.5$\times$,
without compromising the model performance compared to token-level training.
Source code: https://github.com/shaochenze/PatchTrain.
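The 0.5x figure can be sanity-checked with simple arithmetic, assuming training cost is proportional to the number of sequence positions processed. In the sketch below, `patch_size` and `patch_fraction` (the share of data trained at patch level) are assumed values chosen because they reproduce the reported ratio; the abstract does not state them. The grouping helper only shows how tokens would be chunked, not how the model combines them.

```python
def relative_cost(patch_size, patch_fraction):
    """Cost relative to pure token-level training.

    Patch-level data costs 1/patch_size per token; the rest costs 1 per token.
    """
    return patch_fraction / patch_size + (1.0 - patch_fraction)

def to_patches(token_ids, patch_size):
    """Group consecutive token ids into patches (the last one may be shorter)."""
    return [token_ids[i:i + patch_size] for i in range(0, len(token_ids), patch_size)]

# Assumed setting: patches of 4 tokens, two thirds of the data at patch level.
cost = relative_cost(patch_size=4, patch_fraction=2 / 3)
patches = to_patches([11, 12, 13, 14, 15, 16, 17, 18, 19], patch_size=4)
```

Under these assumptions the cost works out to (2/3)/4 + 1/3 = 0.5 of the token-level baseline.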
[COMMENTS]
ICLR 2025 Spotlight
[LINK]
http://arxiv.org/abs/2407.12665v3
[DATE]
2025-05-15 13:15:13+08:00
[CATEGORIES]
cs.CL
cs.LG
MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder
[AUTHORS]
Khai Le-Duc, Phuc Phan, Tan-Hanh Pham, Bach Phan Tat, Minh-Huong Ngo, Chris Ngo, Thanh Nguyen-Tang, Truong-Son Hy
[ABSTRACT]
Multilingual automatic speech recognition (ASR) in the medical domain serves
as a foundational task for various downstream applications such as speech
translation, spoken language understanding, and voice-activated assistants.
This technology improves patient care by enabling efficient communication
across language barriers, alleviating specialized workforce shortages, and
facilitating improved diagnosis and treatment, particularly during pandemics.
In this work, we introduce MultiMed, the first multilingual medical ASR
dataset, along with the first collection of small-to-large end-to-end medical
ASR models, spanning five languages: Vietnamese, English, German, French, and
Mandarin Chinese. To the best of our knowledge, MultiMed stands as the world’s largest
medical ASR dataset across all major benchmarks: total duration, number of
recording conditions, number of accents, and number of speaking roles.
Furthermore, we present the first multilinguality study for medical ASR, which
includes reproducible empirical baselines, a monolinguality-multilinguality
analysis, an Attention Encoder Decoder (AED) vs. Hybrid comparative study, and a
linguistic analysis. We present practical ASR end-to-end training schemes
optimized for a fixed number of trainable parameters that are common in
industry settings. All code, data, and models are available online:
https://github.com/leduckhai/MultiMed/tree/master/MultiMed.
[COMMENTS]
ACL 2025, 38 pages
[LINK]
http://arxiv.org/abs/2409.14074v3
[DATE]
2025-05-15 12:35:00+08:00
[CATEGORIES]
cs.CL
RM-R1: Reward Modeling as Reasoning
[AUTHORS]
Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
[ABSTRACT]
Reward modeling is essential for aligning large language models (LLMs) with
human preferences through reinforcement learning (RL). To provide accurate
reward signals, a reward model (RM) should stimulate deep thinking and conduct
interpretable reasoning before assigning a score or a judgment. Inspired by
recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we
hypothesize and validate that integrating reasoning capabilities into reward
modeling significantly enhances RM’s interpretability and performance. To this
end, we introduce a new class of generative reward models – Reasoning Reward
Models (ReasRMs) – which formulate reward modeling as a reasoning task. We
propose a reasoning-oriented training pipeline and train a family of ReasRMs,
RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism – self-generating
sample-level chat rubrics or math/code solutions, and evaluating candidate
responses against them. The training of RM-R1 consists of two key stages: (1)
distillation of high-quality reasoning chains and (2) reinforcement learning
with verifiable rewards. Empirically, our models achieve state-of-the-art
performance across three reward model benchmarks on average, outperforming much
larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones
(e.g., GPT-4o) by up to 4.9%. Beyond final performance, we perform thorough
empirical analysis to understand the key ingredients of successful ReasRM
training. To facilitate future research, we release six ReasRM models along
with code and data at https://github.com/RM-R1-UIUC/RM-R1.
[COMMENTS]
24 pages, 8 figures
[LINK]
http://arxiv.org/abs/2505.02387v2
[DATE]
2025-05-15 12:14:49+08:00
[CATEGORIES]
cs.CL
cs.LG
Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph
[AUTHORS]
Deeksha Prahlad, Chanhee Lee, Dongha Kim, Hokeun Kim
[ABSTRACT]
The advent of large language models (LLMs) has allowed numerous applications,
including the generation of queried responses, to be leveraged in chatbots and
other conversational assistants. Being trained on a plethora of data, LLMs
often undergo high levels of over-fitting, resulting in the generation of extra
and incorrect data, thus causing hallucinations in output generation. One of
the root causes of such problems is the lack of timely, factual, and
personalized information fed to the LLM. In this paper, we propose an approach
to address these problems by introducing retrieval augmented generation (RAG)
using knowledge graphs (KGs) to assist the LLM in personalized response
generation tailored to the users. KGs have the advantage of storing
continuously updated factual information in a structured way. While our KGs can
be used for a variety of frequently updated personal data, such as calendar,
contact, and location data, we focus on calendar data in this paper. Our
experimental results show that our approach works significantly better in
understanding personal information and generating accurate responses compared
to the baseline LLMs using personal data as text inputs, with a moderate
reduction in response time.
[COMMENTS]
To appear in the Companion Proceedings of the ACM Web Conference 2025
(WWW Companion ’25)
[LINK]
http://arxiv.org/abs/2505.09945v1
[DATE]
2025-05-15 12:01:58+08:00
[CATEGORIES]
cs.CL
cs.LG
Natural Language Reinforcement Learning
[AUTHORS]
Xidong Feng, Bo Liu, Ziyu Wan, Haotian Fu, Girish A. Koushik, Zhiyuan Hu, Mengyue Yang, Ying Wen, Jun Wang
[ABSTRACT]
Reinforcement Learning (RL) mathematically formulates decision-making as a
Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable
breakthroughs across various domains, including games, robotics, and language
models. This paper seeks a new possibility, Natural Language Reinforcement
Learning (NLRL), by extending traditional MDP to natural language-based
representation space. Specifically, NLRL innovatively redefines RL principles,
including task objectives, policy, value function, Bellman equation, and policy
iteration, into their language counterparts. With recent advancements in large
language models (LLMs), NLRL can be practically implemented to achieve RL-like
policy and value improvement by either pure prompting or gradient-based
training. Experiments over Maze, Breakthrough, and Tic-Tac-Toe games
demonstrate the effectiveness, efficiency, and interpretability of the NLRL
framework among diverse use cases.
[COMMENTS]
Accepted at ICLR 2025 Workshop SSI-FM
[LINK]
http://arxiv.org/abs/2411.14251v2
[DATE]
2025-05-15 11:35:25+08:00
[CATEGORIES]
cs.LG
cs.CL
From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models
[AUTHORS]
Yidan Wang, Yubing Ren, Yanan Cao, Binxing Fang
[ABSTRACT]
The rise of Large Language Models (LLMs) has heightened concerns about the
misuse of AI-generated text, making watermarking a promising solution.
Mainstream watermarking schemes for LLMs fall into two categories: logits-based
and sampling-based. However, current schemes entail trade-offs among
robustness, text quality, and security. To mitigate this, we integrate
logits-based and sampling-based schemes, harnessing their respective strengths
to achieve synergy. In this paper, we propose a versatile symbiotic
watermarking framework with three strategies: serial, parallel, and hybrid. The
hybrid framework adaptively embeds watermarks using token entropy and semantic
entropy, optimizing the balance between detectability, robustness, text
quality, and security. Furthermore, we validate our approach through
comprehensive experiments on various datasets and models. Experimental results
indicate that our method outperforms existing baselines and achieves
state-of-the-art (SOTA) performance. We believe this framework provides novel
insights into diverse watermarking paradigms. Our code is available at
https://github.com/redwyd/SymMark.
[LINK]
http://arxiv.org/abs/2505.09924v1
[DATE]
2025-05-15 11:12:36+08:00
[CATEGORIES]
cs.CL
Temporal Scaling Law for Large Language Models
[AUTHORS]
Yizhe Xiong, Xiansheng Chen, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Zhenpeng Su, Wei Huang, Jianwei Niu, Jungong Han, Guiguang Ding
[ABSTRACT]
Recently, Large Language Models (LLMs) have been adopted in a wide
range of tasks, leading to increasing attention to research on how
scaling LLMs affects their performance. Existing works, termed Scaling Laws,
have discovered that the final test loss of LLMs scales as power-laws with
model size, computational budget, and dataset size. However, the temporal
change of the test loss of an LLM throughout its pre-training process remains
unexplored, though it is valuable in many aspects, such as selecting better
hyperparameters directly on the target LLM. In this paper, we propose
the novel concept of Temporal Scaling Law, studying how the test loss of an LLM
evolves as the training steps scale up. In contrast to modeling the test loss
as a whole in a coarse-grained manner, we break it down and dive into the
fine-grained test loss of each token position, and further develop a dynamic
hyperbolic-law. Afterwards, we derive the much more precise temporal scaling
law by studying the temporal patterns of the parameters in the dynamic
hyperbolic-law. Results on both in-distribution (ID) and out-of-distribution
(OOD) validation datasets demonstrate that our temporal scaling law accurately
predicts the test loss of LLMs across training steps. Our temporal scaling law
has broad practical applications. First, it enables direct and efficient
hyperparameter selection on the target LLM, such as data mixture proportions.
Secondly, viewing the LLM pre-training dynamics from the token position
granularity provides some insights to enhance the understanding of LLM
pre-training.
[COMMENTS]
Preprint, Currently under review
[LINK]
http://arxiv.org/abs/2404.17785v3
[DATE]
2025-05-15 10:48:26+08:00
[CATEGORIES]
cs.CL
Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Tasks
[AUTHORS]
Ziyuan Zhang, Darcy Wang, Ningyuan Chen, Rodrigo Mansur, Vahid Sarhangian
[ABSTRACT]
Large language models (LLMs) are increasingly used to simulate or automate
human behavior in complex sequential decision-making tasks. A natural question
is then whether LLMs exhibit similar decision-making behavior to humans, and
can achieve comparable (or superior) performance. In this work, we focus on the
exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic
decision-making under uncertainty. We employ canonical multi-armed bandit (MAB)
tasks introduced in the cognitive science and psychiatry literature to conduct
a comparative study of the E&E strategies of LLMs, humans, and MAB algorithms.
We use interpretable choice models to capture the E&E strategies of the agents
and investigate how explicit reasoning, through both prompting strategies and
reasoning-enhanced models, shapes LLM decision-making. We find that reasoning
shifts LLMs toward more human-like behavior, characterized by a mix of random
and directed exploration. In simple stationary tasks, reasoning-enabled LLMs
exhibit similar levels of random and directed exploration compared to humans.
However, in more complex, non-stationary environments, LLMs struggle to match
human adaptability, particularly in effective directed exploration, despite
achieving similar regret in certain scenarios. Our findings highlight both the
promise and limits of LLMs as simulators of human behavior and tools for
automated decision-making and point to potential areas of improvements.
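The distinction between random and directed exploration can be illustrated with two textbook bandit rules: epsilon-greedy (random) and UCB1 (directed, via an uncertainty bonus). These are standard algorithms used here as a sketch, not the choice models fitted in the paper; the function names and the example statistics are assumptions.

```python
import math
import random

def epsilon_greedy(means, epsilon, rng):
    """Random exploration: with probability epsilon pick a uniformly random arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=means.__getitem__)

def ucb1(means, counts, t):
    """Directed exploration: add a bonus that grows for rarely pulled arms."""
    scores = [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical state: arm 0 looks better but arm 1 was pulled only once.
means, counts = [0.6, 0.5], [10, 1]
greedy_arm = epsilon_greedy(means, epsilon=0.0, rng=random.Random(0))
directed_arm = ucb1(means, counts, t=sum(counts))
```

With no random exploration the greedy rule stays on arm 0, while the UCB1 bonus sends the agent to the under-sampled arm 1.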
[LINK]
http://arxiv.org/abs/2505.09901v1
[DATE]
2025-05-15 10:09:18+08:00
[CATEGORIES]
cs.LG
cs.CL
Construction and Application of Materials Knowledge Graph in Multidisciplinary Materials Science via Large Language Model
[AUTHORS]
Yanpeng Ye, Jie Ren, Shaozhou Wang, Yuwei Wan, Imran Razzak, Bram Hoex, Haofen Wang, Tong Xie, Wenjie Zhang
[ABSTRACT]
Knowledge in materials science is widely dispersed across extensive
scientific literature, posing significant challenges to the efficient discovery
and integration of new materials. Traditional methods, often reliant on costly
and time-consuming experimental approaches, further complicate rapid
innovation. Addressing these challenges, the integration of artificial
intelligence with materials science has opened avenues for accelerating the
discovery process, though it also demands precise annotation, data extraction,
and traceability of information. To tackle these issues, this article
introduces the Materials Knowledge Graph (MKG), which utilizes advanced natural
language processing techniques integrated with large language models to extract
and systematically organize a decade’s worth of high-quality research into
structured triples, yielding a graph with 162,605 nodes and 731,772 edges. MKG categorizes
information into comprehensive labels such as Name, Formula, and Application,
structured around a meticulously designed ontology, thus enhancing data
usability and integration. By implementing network-based algorithms, MKG not
only facilitates efficient link prediction but also significantly reduces
reliance on traditional experimental methods. This structured approach not only
streamlines materials research but also lays the groundwork for more
sophisticated science knowledge graphs.
[COMMENTS]
Accepted by 38th Conference on Neural Information Processing Systems
(NeurIPS 2024)
[LINK]
http://arxiv.org/abs/2404.03080v5
[DATE]
2025-05-15 10:03:46+08:00
[CATEGORIES]
cs.CL
RoBERTa-BiLSTM: A Context-Aware Hybrid Model for Sentiment Analysis
[AUTHORS]
Md. Mostafizer Rahman, Ariful Islam Shiplu, Yutaka Watanobe, Md. Ashad Alam
[ABSTRACT]
Effectively analyzing the comments to uncover latent intentions holds immense
value in making strategic decisions across various domains. However, several
challenges hinder the process of sentiment analysis including the lexical
diversity exhibited in comments, the presence of long dependencies within the
text, encountering unknown symbols and words, and dealing with imbalanced
datasets. Moreover, existing sentiment analysis approaches have mostly relied on
sequential models to encode long-dependent texts, which require longer
execution time because they process the text sequentially. In contrast, the
Transformer requires less execution time due to its parallel processing nature.
In this work, we introduce a novel hybrid deep learning model, RoBERTa-BiLSTM,
which combines the Robustly Optimized BERT Pretraining Approach (RoBERTa) with
Bidirectional Long Short-Term Memory (BiLSTM) networks. RoBERTa is utilized to
generate meaningful word embedding vectors, while BiLSTM effectively captures
the contextual semantics of long-dependent texts. The RoBERTa-BiLSTM hybrid
model leverages the strengths of both sequential and Transformer models to
enhance performance in sentiment analysis. We conducted experiments using
datasets from IMDb, Twitter US Airline, and Sentiment140 to evaluate the
proposed model against existing state-of-the-art methods. Our experimental
findings demonstrate that the RoBERTa-BiLSTM model surpasses baseline models
(e.g., BERT, RoBERTa-base, RoBERTa-GRU, and RoBERTa-LSTM), achieving accuracies
of 80.74%, 92.36%, and 82.25% on the Twitter US Airline, IMDb, and Sentiment140
datasets, respectively. Additionally, the model achieves F1-scores of 80.73%,
92.35%, and 82.25% on the same datasets, respectively.
[LINK]
http://arxiv.org/abs/2406.00367v2
[DATE]
2025-05-15 09:38:21+08:00
[CATEGORIES]
cs.CL
uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation in Low-Data Regimes
[AUTHORS]
Abdul Waheed, Karima Kadaoui, Bhiksha Raj, Muhammad Abdul-Mageed
[ABSTRACT]
Recent work on distilling Whisper’s knowledge into small models using
pseudo-labels shows promising performance while reducing the size by up to 50%.
This results in small, efficient, and dedicated models. However, a critical
step of distillation using pseudo-labels involves filtering high-quality
predictions and using only those during training. This step requires ground
truth labels to compare with and filter low-quality examples, making the
process dependent on human labels. Additionally, the distillation process
requires a large amount of data thereby limiting its applicability in
low-resource settings. To address this, we propose a distillation framework
that does not require any labeled data. Through experimentation, we show that
our best-distilled models outperform the teacher model by 5-7 WER points and
are on par with or outperform similar supervised data filtering setups. When
scaling the data, our models significantly outperform all zero-shot and
supervised models. Our models are also 25-50% more compute- and
memory-efficient while maintaining performance equal to or better than that of
the teacher model. For more details about our models, dataset, and other
resources, please visit our GitHub page:
https://github.com/UBC-NLP/uDistilWhisper.
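For readers unfamiliar with the metric behind "WER points", word error rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The sketch below is a standard dynamic-programming implementation, not the authors' evaluation code; the function name is hypothetical and it assumes a non-empty reference and whitespace tokenization.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

score = wer("the cat sat on the mat", "the cat sat on mat")  # one deleted word
```

Supervised pseudo-label filtering compares each prediction with a ground-truth transcript using a score like this, which is the label dependency the abstract sets out to remove.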
[COMMENTS]
Accepted to NAACL ’25 main conference
[LINK]
http://arxiv.org/abs/2407.01257v5
[DATE]
2025-05-15 09:04:11+08:00
[CATEGORIES]
cs.CL
Predictability Shapes Adaptation: An Evolutionary Perspective on Modes of Learning in Transformers
[AUTHORS]
Alexander Y. Ku, Thomas L. Griffiths, Stephanie C. Y. Chan
[ABSTRACT]
Transformer models learn in two distinct modes: in-weights learning (IWL),
encoding knowledge into model weights, and in-context learning (ICL), adapting
flexibly to context without weight modification. To better understand the
interplay between these learning modes, we draw inspiration from evolutionary
biology’s analogous adaptive strategies: genetic encoding (akin to IWL,
adapting over generations and fixed within an individual’s lifetime) and
phenotypic plasticity (akin to ICL, enabling flexible behavioral responses to
environmental cues). In evolutionary biology, environmental predictability
dictates the balance between these strategies: stability favors genetic
encoding, while reliable predictive cues promote phenotypic plasticity. We
experimentally operationalize these dimensions of predictability and
systematically investigate their influence on the ICL/IWL balance in
Transformers. Using regression and classification tasks, we show that high
environmental stability decisively favors IWL, as predicted, with a sharp
transition at maximal stability. Conversely, high cue reliability enhances ICL
efficacy, particularly when stability is low. Furthermore, learning dynamics
reveal task-contingent temporal evolution: while a canonical ICL-to-IWL shift
occurs in some settings (e.g., classification with many classes), we
demonstrate that scenarios with easier IWL (e.g., fewer classes) or slower ICL
acquisition (e.g., regression) can exhibit an initial IWL phase later yielding
to ICL dominance. These findings support a relative-cost hypothesis for
explaining these learning mode transitions, establishing predictability as a
critical factor governing adaptive strategies in Transformers, and offering
novel insights for understanding ICL and guiding training methodologies.
[LINK]
http://arxiv.org/abs/2505.09855v1
[DATE]
2025-05-15 07:31:17+08:00
[CATEGORIES]
cs.LG
cs.CL
Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLMs for Conflict Forecasting
[AUTHORS]
Apollinaire Poli Nemkova, Sarath Chandra Lingareddy, Sagnik Ray Choudhury, Mark V. Albert
[ABSTRACT]
Large Language Models (LLMs) have shown impressive performance across natural
language tasks, but their ability to forecast violent conflict remains
underexplored. We investigate whether LLMs possess meaningful parametric
knowledge, encoded in their pretrained weights, to predict conflict escalation
and fatalities without external data. This is critical for early warning
systems, humanitarian planning, and policy-making. We compare this parametric
knowledge with non-parametric capabilities, where LLMs access structured and
unstructured context from conflict datasets (e.g., ACLED, GDELT) and recent
news reports via Retrieval-Augmented Generation (RAG). Incorporating external
information could enhance model performance by providing up-to-date context
otherwise missing from pretrained weights. Our two-part evaluation framework
spans 2020-2024 across conflict-prone regions in the Horn of Africa and the
Middle East. In the parametric setting, LLMs predict conflict trends and
fatalities relying only on pretrained knowledge. In the non-parametric setting,
models receive summaries of recent conflict events, indicators, and
geopolitical developments. We compare predicted conflict trend labels (e.g.,
Escalate, Stable Conflict, De-escalate, Peace) and fatalities against
historical data. Our findings highlight the strengths and limitations of LLMs
for conflict forecasting and the benefits of augmenting them with structured
external knowledge.
[LINK]
http://arxiv.org/abs/2505.09852v1
[DATE]
2025-05-15 07:24:22+08:00
[CATEGORIES]
cs.CL
Hypernym Mercury: Token Optimization Through Semantic Field Constriction And Reconstruction From Hypernyms. A New Text Compression Method
[AUTHORS]
Chris Forrester, Octavia Sulea
[ABSTRACT]
Compute optimization using token reduction of LLM prompts is an emerging task
in the fields of NLP and next-generation agentic AI. In this white paper, we
introduce a novel (patent pending) text representation scheme and a
first-of-its-kind word-level semantic compression of paragraphs that can lead
to over 90% token reduction, while retaining high semantic similarity to the
source text. We explain how this novel compression technique can be lossless
and how the detail granularity is controllable. We discuss benchmark results
over open source data (i.e. Bram Stoker’s Dracula available through Project
Gutenberg) and show how our results hold at the paragraph level, across
multiple genres and models.
[LINK]
http://arxiv.org/abs/2505.08058v2
[DATE]
2025-05-15 04:57:31+08:00
[CATEGORIES]
cs.CL
A Survey on Large Language Models in Multimodal Recommender Systems
[AUTHORS]
Alejo Lopez-Avila, Jinhua Du
[ABSTRACT]
Multimodal recommender systems (MRS) integrate heterogeneous user and item
data, such as text, images, and structured information, to enhance
recommendation performance. The emergence of large language models (LLMs)
introduces new opportunities for MRS by enabling semantic reasoning, in-context
learning, and dynamic input handling. Compared to earlier pre-trained language
models (PLMs), LLMs offer greater flexibility and generalisation capabilities
but also introduce challenges related to scalability and model accessibility.
This survey presents a comprehensive review of recent work at the intersection
of LLMs and MRS, focusing on prompting strategies, fine-tuning methods, and
data adaptation techniques. We propose a novel taxonomy to characterise
integration patterns, identify transferable techniques from related
recommendation domains, provide an overview of evaluation metrics and datasets,
and point to possible future directions. We aim to clarify the emerging role of
LLMs in multimodal recommendation and support future research in this rapidly
evolving field.
[COMMENTS]
30 pages, 6 figures
[LINK]
http://arxiv.org/abs/2505.09777v1
[DATE]
2025-05-15 04:15:52+08:00
[CATEGORIES]
cs.CL
Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?
[AUTHORS]
Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino
[ABSTRACT]
Language model (LM) agents are increasingly used as autonomous
decision-makers who need to actively gather information to guide their
decisions. A crucial cognitive skill for such agents is the efficient
exploration and understanding of the causal structure of the world – key to
robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs
possess this capability or exhibit systematic biases leading to erroneous
conclusions. In this work, we examine LMs’ ability to explore and infer causal
relationships, using the well-established “Blicket Test” paradigm from
developmental psychology. We find that LMs reliably infer the common, intuitive
disjunctive causal relationships but systematically struggle with the unusual,
yet equally (or sometimes even more) evidenced conjunctive ones. This
“disjunctive bias” persists across model families, sizes, and prompting
strategies, and performance further declines as task complexity increases.
Interestingly, an analogous bias appears in human adults, suggesting that LMs
may have inherited deep-seated reasoning heuristics from their training data.
To this end, we quantify similarities between LMs and humans, finding that LMs
exhibit adult-like (but not child-like) inference profiles. Finally, we
propose a test-time sampling method which explicitly samples and eliminates
hypotheses about causal relationships from the LM. This scalable approach
significantly reduces the disjunctive bias and moves LMs closer to the goal of
scientific, causally rigorous reasoning.
[LINK]
http://arxiv.org/abs/2505.09614v1
[DATE]
2025-05-15 01:59:35+08:00
[CATEGORIES]
cs.CL
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
[AUTHORS]
Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, Weiming Lu, Peng Li, Yueting Zhuang
[ABSTRACT]
Recent advances in deep thinking models have demonstrated remarkable
reasoning capabilities on mathematical and coding tasks. However, their
effectiveness in embodied domains, which require continuous interaction with
environments through image-action interleaved trajectories, remains largely
unexplored. We present Embodied Reasoner, a model that extends o1-style
reasoning to interactive embodied search tasks. Unlike mathematical reasoning
that relies primarily on logical deduction, embodied scenarios demand spatial
understanding, temporal reasoning, and ongoing self-reflection based on
interaction history. To address these challenges, we synthesize 9.3k coherent
Observation-Thought-Action trajectories containing 64k interactive images and
90k diverse thinking processes (analysis, spatial reasoning, reflection,
planning, and verification). We develop a three-stage training pipeline that
progressively enhances the model’s capabilities through imitation learning,
self-exploration via rejection sampling, and self-correction through reflection
tuning. The evaluation shows that our model significantly outperforms
advanced visual reasoning models; for example, it exceeds OpenAI o1, o3-mini,
and Claude-3.7 by +9%, 24%, and +13%, respectively. Analysis reveals our model
exhibits fewer repeated searches and logical inconsistencies, with particular
advantages in complex long-horizon tasks. In real-world environments, our model
also performs better, exhibiting fewer repeated searches and logical
inconsistencies.
[COMMENTS]
Code: https://github.com/zwq2018/embodied_reasoner
Dataset: https://huggingface.co/datasets/zwq2018/embodied_reasoner
[LINK]
http://arxiv.org/abs/2503.21696v2
[DATE]
2025-05-15 01:48:02+08:00
[CATEGORIES]
cs.CL
Activation Steering in Neural Theorem Provers
[AUTHORS]
Shashank Kirtania
[ABSTRACT]
Large Language Models (LLMs) have shown promise in proving formal theorems
using proof assistants like Lean. However, current state-of-the-art language
models struggle to predict the next step in proofs, leading practitioners to use
different sampling techniques to improve LLMs' capabilities. We observe that the
LLM is capable of predicting the correct tactic; however, it faces challenges
in ranking it appropriately within the set of candidate tactics, affecting the
overall selection process. To overcome this hurdle, we use activation steering
to guide LLMs' responses and improve the generations at inference time.
Our results suggest that activation steering offers a promising lightweight
alternative to specialized fine-tuning for enhancing theorem proving
capabilities in LLMs, particularly valuable in resource-constrained
environments.
[COMMENTS]
incorrect explanation for a concept, need to revise and update!
[LINK]
http://arxiv.org/abs/2502.15507v3
[DATE]
2025-05-15 01:25:36+08:00
[CATEGORIES]
cs.LG
cs.CL
Llama-Nemotron: Efficient Reasoning Models
[AUTHORS]
Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zijia Chen, Zhilin Wang, David Mosallanezhad, Adi Renduchintala, Haifeng Qian, Dima Rekesh, Fei Jia, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad, Sean Narenthiran, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Igor Gitman, Ivan Moshkov, Wei Du, Shubham Toshniwal, George Armstrong, Branislav Kisacanin, Matvei Novikov, Daria Gitman, Evelina Bakhturina, Jane Polak Scowcroft, John Kamalu, Dan Su, Kezhi Kong, Markus Kliegl, Rabeeh Karimi, Ying Lin, Sanjeev Satheesh, Jupinder Parmar, Pritam Gundecha, Brandon Norick, Joseph Jennings, Shrimai Prabhumoye, Syeda Nahida Akter, Mostofa Patwary, Abhinav Khattar, Deepak Narayanan, Roger Waleffe, Jimmy Zhang, Bor-Yiing Su, Guyue Huang, Terry Kong, Parth Chadha, Sahil Jain, Christine Harvey, Elad Segal, Jining Huang, Sergey Kashirsky, Robert McQueen, Izzy Putterman, George Lam, Arun Venkatesan, Sherry Wu, Vinh Nguyen, Manoj Kilaru, Andrew Wang, Anna Warno, Abhilash Somasamudramath, Sandip Bhaskar, Maka Dong, Nave Assaf, Shahar Mor, Omer Ullman Argov, Scot Junkin, Oleksandr Romanenko, Pedro Larroy, Monika Katariya, Marco Rovinelli, Viji Balas, Nicholas Edelman, Anahita Bhiwandiwalla, Muthu Subramaniam, Smita Ithape, Karthik Ramamoorthy, Yuting Wu, Suguna Varshini Velury, Omri Almog, Joyjit Daw, Denys Fridman, Erick Galinkin, Michael Evans, Shaona Ghosh, Katherine Luna, Leon Derczynski, Nikki Pope, Eileen Long, Seth Schneider, Guillermo Siman, Tomasz Grzegorzek, Pablo Ribalta, Monika Katariya, Chris Alexiuk, Joey Conway, Trisha Saar, Ann Guan, Krzysztof Pawelec, Shyamala Prayaga, Oleksii Kuchaiev, Boris Ginsburg, Oluwatobi Olabiyi, Kari Briski, Jonathan Cohen, Bryan Catanzaro, Jonah Alben, Yonatan Geifman, Eric Chung
[ABSTRACT]
We introduce the Llama-Nemotron series of models, an open family of
heterogeneous reasoning models that deliver exceptional reasoning capabilities,
inference efficiency, and an open license for enterprise use. The family comes
in three sizes – Nano (8B), Super (49B), and Ultra (253B) – and performs
competitively with state-of-the-art reasoning models such as DeepSeek-R1 while
offering superior inference throughput and memory efficiency. In this report,
we discuss the training procedure for these models, which entails using neural
architecture search from Llama 3 models for accelerated inference, knowledge
distillation, and continued pretraining, followed by a reasoning-focused
post-training stage consisting of two main parts: supervised fine-tuning and
large scale reinforcement learning. Llama-Nemotron models are the first
open-source models to support a dynamic reasoning toggle, allowing users to
switch between standard chat and reasoning modes during inference. To further
support open research and facilitate model development, we provide the
following resources: 1. We release the Llama-Nemotron reasoning models –
LN-Nano, LN-Super, and LN-Ultra – under the commercially permissive NVIDIA
Open Model License Agreement. 2. We release the complete post-training dataset:
Llama-Nemotron-Post-Training-Dataset. 3. We also release our training
codebases: NeMo, NeMo-Aligner, and Megatron-LM.
[LINK]
http://arxiv.org/abs/2505.00949v3
[DATE]
2025-05-15 00:47:23+08:00
[CATEGORIES]
cs.CL
cs.LG
TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks
[AUTHORS]
Kutay Ertürk, Furkan Altınışık, İrem Sarıaltın, Ömer Nezih Gerek
[ABSTRACT]
This study presents TSLFormer, a light and robust word-level Turkish Sign
Language (TSL) recognition model that treats sign gestures as ordered,
string-like language. Instead of using raw RGB or depth videos, our method only
works with 3D joint positions - articulation points - extracted using Google’s
Mediapipe library, which focuses on the hand and torso skeletal locations. This
creates efficient input dimensionality reduction while preserving important
semantic gesture information.
Our approach revisits sign language recognition as sequence-to-sequence
translation, inspired by the linguistic nature of sign languages and the
success of transformers in natural language processing. Since TSLFormer uses
the self-attention mechanism, it effectively captures temporal co-occurrence
within gesture sequences and highlights meaningful motion patterns as words
unfold.
Evaluated on the AUTSL dataset with over 36,000 samples and 227 different
words, TSLFormer achieves competitive performance with minimal computational
cost. These results show that joint-based input is sufficient for enabling
real-time, mobile, and assistive communication systems for hearing-impaired
individuals.
[LINK]
http://arxiv.org/abs/2505.07890v2
[DATE]
2025-05-15 00:43:25+08:00
[CATEGORIES]
cs.CL
Tales of the 2025 Los Angeles Fire: Hotwash for Public Health Concerns in Reddit via LLM-Enhanced Topic Modeling
[AUTHORS]
Sulong Zhou, Qunying Huang, Shaoheng Zhou, Yun Hang, Xinyue Ye, Aodong Mei, Kathryn Phung, Yuning Ye, Uma Govindswamy, Zehan Li
[ABSTRACT]
Wildfires have become increasingly frequent, irregular, and severe in recent
years. Understanding how affected populations perceive and respond during
wildfire crises is critical for timely and empathetic disaster response. Social
media platforms offer a crowd-sourced channel to capture evolving public
discourse, providing hyperlocal information and insight into public sentiment.
This study analyzes Reddit discourse during the 2025 Los Angeles wildfires,
spanning from the onset of the disaster to full containment. We collect 385
posts and 114,879 comments related to the Palisades and Eaton fires. We adopt
topic modeling methods to identify the latent topics, enhanced by large
language models (LLMs) and human-in-the-loop (HITL) refinement. Furthermore, we
develop a hierarchical framework to categorize latent topics, consisting of two
main categories, Situational Awareness (SA) and Crisis Narratives (CN). The
volume of the SA category closely aligns with real-world fire progressions,
peaking within the first 2-5 days as the fires reach their maximum extent. The
most frequent co-occurring category set of public health and safety, loss and
damage, and emergency resources covers a wide range of health-related
latent topics, including environmental health, occupational health, and One
Health. Grief signals and mental health risks consistently accounted for 60
percent and 40 percent of CN instances, respectively, with the highest
total volume occurring at night. This study contributes the first annotated
social media dataset on the 2025 LA fires, and introduces a scalable
multi-layer framework that leverages topic modeling for crisis discourse
analysis. By identifying persistent public health concerns, our results can
inform more empathetic and adaptive strategies for disaster response, public
health communication, and future research in comparable climate-related
disaster events.
[LINK]
http://arxiv.org/abs/2505.09665v1
[DATE]
2025-05-15 00:31:08+08:00
[CATEGORIES]
cs.CL
PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning
[AUTHORS]
Zongqian Li, Yixuan Su, Nigel Collier
[ABSTRACT]
Parameter-efficient fine-tuning (PEFT) methods have shown promise in adapting
large language models, yet existing approaches exhibit counter-intuitive
phenomena: integrating a router into prompt tuning (PT) increases training
efficiency yet does not improve performance universally; parameter reduction
through matrix decomposition can improve performance in specific domains.
Motivated by these observations and the modular nature of PT, we propose
PT-MoE, a novel framework that integrates matrix decomposition with
mixture-of-experts (MoE) routing for efficient PT. Results across 17 datasets
demonstrate that PT-MoE achieves state-of-the-art performance in both question
answering (QA) and mathematical problem solving tasks, improving F1 score by
1.49 points over PT and 2.13 points over LoRA in QA tasks, while enhancing
mathematical accuracy by 10.75 points over PT and 0.44 points over LoRA, all
while using 25% fewer parameters than LoRA. Our analysis reveals that while PT
methods generally excel in QA tasks and LoRA-based methods in math datasets,
the integration of matrix decomposition and MoE in PT-MoE yields complementary
benefits: decomposition enables efficient parameter sharing across experts
while MoE provides dynamic adaptation, collectively enabling PT-MoE to
demonstrate cross-task consistency and generalization abilities. These
findings, along with ablation studies on routing mechanisms and architectural
components, provide insights for future PEFT methods.
[LINK]
http://arxiv.org/abs/2505.09519v1
[DATE]
2025-05-15 00:16:36+08:00
[CATEGORIES]
cs.CL
Identification and Optimal Nonlinear Control of Turbojet Engine Using Koopman Eigenfunction Model
[AUTHORS]
David Grasev
[ABSTRACT]
Gas turbine engines represent complex, highly nonlinear dynamical systems.
Deriving their physics-based models can be challenging as it requires
performance characteristics that are not always available, and one often has
to make many simplifying assumptions. In this paper, the limitations of
conventional experimental methods used to derive component-level and locally
linear parameter-varying models are discussed and addressed by employing
identification techniques based on data collected from standard engine
operation under closed-loop control. The rotor dynamics were estimated using
the sparse identification of nonlinear dynamics. Subsequently, the autonomous
part of the dynamics was mapped into an optimally constructed Koopman
eigenfunction space. The process included eigenvalue optimization using
metaheuristic algorithms and temporal projection, followed by gradient-based
eigenfunction identification. The resulting Koopman model was validated against
an in-house reference component-level model. A globally optimal nonlinear
feedback controller and a Kalman estimator were then designed in the
eigenfunction space and compared to the classical and gain-scheduled
proportional-integral controllers, as well as a proposed internal model control
approach. The eigenmode structure allowed targeting individual modes during the
optimization process, resulting in better performance tuning. The results
showed that the Koopman-based controller outperformed the other benchmark
controllers in both reference tracking and disturbance rejection, under
sea-level and varying flight conditions, due to its global nature.
[COMMENTS]
51 pages, 28 figures
[LINK]
http://arxiv.org/abs/2505.10438v1
[DATE]
2025-05-15 23:55:13+08:00
[CATEGORIES]
cs.LG
Score-based diffusion nowcasting of GOES imagery
[AUTHORS]
Randy J. Chase, Katherine Haynes, Lander Ver Hoef, Imme Ebert-Uphoff
[ABSTRACT]
Clouds and precipitation are important for understanding weather and climate.
Simulating clouds and precipitation with traditional numerical weather
prediction is challenging because of the sub-grid parameterizations required.
Machine learning has been explored for forecasting clouds and precipitation,
but early machine learning methods often created blurry forecasts. In this
paper, we explore a newer method, score-based diffusion, to nowcast (zero- to
three-hour forecasts) clouds and precipitation. We discuss the background and
intuition of score-based diffusion models - thus providing a starting point for
the community - while exploring the methodology’s use for nowcasting
geostationary infrared imagery. We experiment with three main types of
diffusion models: a standard score-based diffusion model (Diff); a residual
correction diffusion model (CorrDiff); and a latent diffusion model (LDM). Our
results show that the diffusion models are able to not only advect existing
clouds, but also generate and decay clouds, including convective initiation.
These results are surprising because the forecasts are initiated with only the
past 20 minutes of infrared satellite imagery. A case study qualitatively shows
the preservation of high resolution features longer into the forecast than a
conventional mean-squared error trained U-Net. The best of the three diffusion
models tested was the CorrDiff approach, outperforming all other diffusion
models, the traditional U-Net, and a persistence forecast by one to two kelvin
on root mean squared error. The diffusion models also enable out-of-the-box
ensemble generation, which shows skillful calibration, with the spread of the
ensemble correlating well to the error.
[LINK]
http://arxiv.org/abs/2505.10432v1
[DATE]
2025-05-15 23:51:41+08:00
[CATEGORIES]
cs.LG
Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
[AUTHORS]
Jingyao Wang, Wenwen Qiang, Zeen Song, Changwen Zheng, Hui Xiong
[ABSTRACT]
Large language models (LLMs) excel at complex tasks thanks to advances in
reasoning abilities. However, existing methods overlook the trade-off between
reasoning effectiveness and computational efficiency, often encouraging
unnecessarily long reasoning chains and wasting tokens. To address this, we
propose Learning to Think (L2T), an information-theoretic reinforcement
fine-tuning framework that enables LLMs to achieve optimal reasoning
with fewer tokens. Specifically, L2T treats each query-response interaction as
a hierarchical session of multiple episodes and proposes a universal dense
process reward that quantifies the episode-wise information gain in
parameters, requiring no extra annotations or task-specific evaluators. We
propose a method to quickly estimate this reward based on PAC-Bayes bounds and
the Fisher information matrix. Theoretical analyses show that it significantly
reduces computational complexity with high estimation accuracy. By immediately
rewarding each episode’s contribution and penalizing excessive updates, L2T
optimizes the model via reinforcement learning to maximize the use of each
episode and achieve effective updates. Empirical results on various reasoning
benchmarks and base models demonstrate the advantage of L2T across different
tasks, boosting both reasoning effectiveness and efficiency.
[LINK]
http://arxiv.org/abs/2505.10425v1
[DATE]
2025-05-15 23:40:25+08:00
[CATEGORIES]
cs.LG
The Power of Random Features and the Limits of Distribution-Free Gradient Descent
[AUTHORS]
Ari Karchmer, Eran Malach
[ABSTRACT]
We study the relationship between gradient-based optimization of parametric
models (e.g., neural networks) and optimization of linear combinations of
random features. Our main result shows that if a parametric model can be
learned using mini-batch stochastic gradient descent (bSGD) without making
assumptions about the data distribution, then with high probability, the target
function can also be approximated using a polynomial-sized combination of
random features. The size of this combination depends on the number of gradient
steps and numerical precision used in the bSGD process. This finding reveals
fundamental limitations of distribution-free learning in neural networks
trained by gradient descent, highlighting why making assumptions about data
distributions is often crucial in practice. Along the way, we also introduce a
new theoretical framework called average probabilistic dimension complexity
(adc), which extends the probabilistic dimension complexity developed by Kamath
et al. (2020). We prove that adc has a polynomial relationship with statistical
query dimension, and use this relationship to demonstrate an infinite
separation between adc and standard dimension complexity.
[LINK]
http://arxiv.org/abs/2505.10423v1
[DATE]
2025-05-15 23:39:28+08:00
[CATEGORIES]
cs.LG
Decomposed Inductive Procedure Learning: Learning Academic Tasks with Human-Like Data Efficiency
[AUTHORS]
Daniel Weitekamp, Christopher MacLellan, Erik Harpstead, Kenneth Koedinger
[ABSTRACT]
Human learning relies on specialization – distinct cognitive mechanisms
working together to enable rapid learning. In contrast, most modern neural
networks rely on a single mechanism: gradient descent over an objective
function. This raises the question: might human learners’ relatively rapid
learning from just tens of examples instead of tens of thousands in data-driven
deep learning arise from our ability to use multiple specialized mechanisms of
learning in combination? We investigate this question through an ablation
analysis of inductive human learning simulations in online tutoring
environments. Comparing reinforcement learning to a more data-efficient
3-mechanism symbolic rule induction approach, we find that decomposing learning
into multiple distinct mechanisms significantly improves data efficiency,
bringing it in line with human learning. Furthermore, we show that this
decomposition has a greater impact on efficiency than the distinction between
symbolic and subsymbolic learning alone. Efforts to align data-driven machine
learning with human learning often overlook the stark difference in learning
efficiency. Our findings suggest that integrating multiple specialized learning
mechanisms may be key to bridging this gap.
[COMMENTS]
To appear in CogSci 2025
[LINK]
http://arxiv.org/abs/2505.10422v1
[DATE]
2025-05-15 23:39:09+08:00
[CATEGORIES]
cs.LG
Learning Graph Representation of Agent Diffusers
[AUTHORS]
Youcef Djenouri, Nassim Belmecheri, Tomasz Michalak, Jan Dubiński, Ahmed Nabil Belbachir, Anis Yazidi
[ABSTRACT]
Diffusion-based generative models have significantly advanced text-to-image
synthesis, demonstrating impressive text comprehension and zero-shot
generalization. These models refine images from random noise based on textual
prompts, with initial reliance on text input shifting towards enhanced visual
fidelity over time. This transition suggests that static model parameters might
not optimally address the distinct phases of generation. We introduce LGR-AD
(Learning Graph Representation of Agent Diffusers), a novel multi-agent system
designed to improve adaptability in dynamic computer vision tasks. LGR-AD
models the generation process as a distributed system of interacting agents,
each representing an expert sub-model. These agents dynamically adapt to
varying conditions and collaborate through a graph neural network that encodes
their relationships and performance metrics. Our approach employs a
coordination mechanism based on top-$k$ maximum spanning trees, optimizing the
generation process. Each agent’s decision-making is guided by a meta-model that
minimizes a novel loss function, balancing accuracy and diversity. Theoretical
analysis and extensive empirical evaluations show that LGR-AD outperforms
traditional diffusion models across various benchmarks, highlighting its
potential for scalable and flexible solutions in complex image generation
tasks. Code is available at: https://github.com/YousIA/LGR_AD
[COMMENTS]
Accepted at AAMAS2025 International Conference on Autonomous Agents
and Multiagent Systems
[LINK]
http://arxiv.org/abs/2505.06761v2
[DATE]
2025-05-15 23:32:55+08:00
[CATEGORIES]
cs.LG
Two-Stage Generative Model for Intracranial Aneurysm Meshes with Morphological Marker Conditioning
[AUTHORS]
Wenhao Ding, Choon Hwai Yap, Kangjun Ji, Simão Castro
[ABSTRACT]
A generative model for the mesh geometry of intracranial aneurysms (IA) is
crucial for training networks to predict blood flow forces in real time, which
is a key factor affecting disease progression. This need arises from the
absence of large IA image datasets. Existing shape generation methods
struggle to capture realistic IA features and ignore the relationship between
IA pouches and parent vessels, limiting physiological realism, and their
generation cannot be controlled to match specific morphological measurements. We
propose AneuG, a two-stage Variational Autoencoder (VAE)-based IA mesh
generator. In the first stage, AneuG generates low-dimensional Graph Harmonic
Deformation (GHD) tokens to encode and reconstruct aneurysm pouch shapes,
constrained to morphing energy statistics truths. GHD enables more accurate
shape encoding than alternatives. In the second stage, AneuG generates parent
vessels conditioned on GHD tokens, by generating vascular centreline and
propagating the cross-section. AneuG’s IA shape generation can further be
conditioned to have specific clinically relevant morphological measurements.
This is useful for studies to understand shape variations represented by
clinical measurements, and for flow simulation studies to understand effects of
specific clinical shape parameters on fluid dynamics. Source code and
implementation details are available at
https://github.com/anonymousaneug/AneuG.
[COMMENTS]
10 pages, 2 figures
[LINK]
http://arxiv.org/abs/2505.10407v1
[DATE]
2025-05-15 23:30:41+08:00
[CATEGORIES]
cs.LG
Visual Fidelity Index for Generative Semantic Communications with Critical Information Embedding
[AUTHORS]
Jianhao Huang, Qunsong Zeng, Kaibin Huang
[ABSTRACT]
Generative semantic communication (Gen-SemCom) with large artificial
intelligence (AI) models promises a transformative paradigm for 6G networks,
which reduces communication costs by transmitting low-dimensional prompts
rather than raw data. However, purely prompt-driven generation loses
fine-grained visual details. Additionally, there is a lack of systematic
metrics to evaluate the performance of Gen-SemCom systems. To address these
issues, we develop a hybrid Gen-SemCom system with a critical information
embedding (CIE) framework, where both text prompts and semantically critical
features are extracted for transmissions. First, a novel approach of semantic
filtering is proposed to select and transmit the semantically critical features
of images relevant to the semantic label. By integrating the text prompt and
critical features, the receiver reconstructs high-fidelity images using a
diffusion-based generative model. Next, we propose the generative visual
information fidelity (GVIF) metric to evaluate the visual quality of the
generated image. By characterizing the statistical models of image features,
the GVIF metric quantifies the mutual information between the distorted
features and their original counterparts. By maximizing the GVIF metric, we
design a channel-adaptive Gen-SemCom system that adaptively controls the volume
of features and compression rate according to the channel state. Experimental
results validate the GVIF metric’s sensitivity to visual fidelity, correlating
with both the PSNR and critical information volume. In addition, the optimized
system achieves superior performance over benchmarking schemes in terms of
higher PSNR and lower FID scores.
[LINK]
http://arxiv.org/abs/2505.10405v1
[DATE]
2025-05-15 23:28:32+08:00
[CATEGORIES]
cs.LG
Unitless Unrestricted Markov-Consistent SCM Generation: Better Benchmark Datasets for Causal Discovery
[AUTHORS]
Rebecca J. Herman, Jonas Wahl, Urmi Ninad, Jakob Runge
[ABSTRACT]
Causal discovery aims to extract qualitative causal knowledge in the form of
causal graphs from data. Because causal ground truth is rarely known in the
real world, simulated data plays a vital role in evaluating the performance of
the various causal discovery algorithms proposed in the literature. But recent
work highlighted certain artifacts of commonly used data generation techniques
for a standard class of structural causal models (SCM) that may be nonphysical,
including var- and R2-sortability, where the variables’ variance and
coefficients of determination (R2) after regressing on all other variables,
respectively, increase along the causal order. Some causal methods exploit such
artifacts, leading to unrealistic expectations for their performance on
real-world data. Some modifications have been proposed to remove these
artifacts; notably, the internally-standardized structural causal model (iSCM)
avoids varsortability and largely alleviates R2-sortability on sparse causal
graphs, but exhibits a reversed R2-sortability pattern for denser graphs not
featured in their work. We analyze which sortability patterns we expect to see
in real data, and propose a method for drawing coefficients that we argue more
effectively samples the space of SCMs. Finally, we propose a novel extension of
our SCM generation method to the time series setting.
[COMMENTS]
4th Conference on Causal Learning and Reasoning
[LINK]
http://arxiv.org/abs/2503.17037v2
[DATE]
2025-05-15 23:22:41+08:00
[CATEGORIES]
cs.LG
Double Successive Over-Relaxation Q-Learning with an Extension to Deep Reinforcement Learning
[AUTHORS]
Shreyas S R
[ABSTRACT]
Q-learning is a widely used algorithm in reinforcement learning (RL), but its
convergence can be slow, especially when the discount factor is close to one.
Successive Over-Relaxation (SOR) Q-learning, which introduces a relaxation
factor to speed up convergence, addresses this issue but has two major
limitations: In the tabular setting, the relaxation parameter depends on
transition probability, making it not entirely model-free, and it suffers from
overestimation bias. To overcome these limitations, we propose a sample-based,
model-free double SOR Q-learning algorithm. Theoretically and empirically, this
algorithm is shown to be less biased than SOR Q-learning. Further, in the
tabular setting, the convergence analysis under boundedness assumptions on
iterates is discussed. The proposed algorithm is extended to large-scale
problems using deep RL. Finally, the tabular version of the proposed algorithm
is compared using roulette and grid world environments, while the deep RL
version is tested on a maximization bias example and OpenAI Gym environments.
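
The abstract does not spell out the update rule, so the following is only a
minimal sketch of how a double-estimator SOR-style tabular update could look.
It assumes the SOR target w * (r + gamma * max_b Q(s', b)) + (1 - w) * max_c Q(s, c)
and the usual double Q-learning decoupling of action selection from action
evaluation. The function name, the two-table layout, and the fixed relaxation
factor w are illustrative assumptions, not the paper's algorithm.

```python
import random

def double_sor_q_update(qa, qb, s, a, r, s_next, alpha, gamma, w, rng=random):
    """One double-estimator SOR-style update on two Q-tables (illustrative).

    qa, qb: lists of per-state action-value lists.
    w: relaxation factor (hypothetical fixed hyperparameter); w = 1 gives
       the ordinary double Q-learning target.
    Returns True if qa was updated, False if qb was updated.
    """
    # Pick which table to update with probability 1/2, as in double Q-learning.
    if rng.random() < 0.5:
        upd, other = qa, qb
    else:
        upd, other = qb, qa
    # Select greedy actions with the table being updated ...
    a_next = max(range(len(upd[s_next])), key=lambda b: upd[s_next][b])
    a_cur = max(range(len(upd[s])), key=lambda b: upd[s][b])
    # ... and evaluate them with the other table to reduce overestimation.
    target = w * (r + gamma * other[s_next][a_next]) + (1.0 - w) * other[s][a_cur]
    upd[s][a] += alpha * (target - upd[s][a])
    return upd is qa
```

Evaluating the selected action with the second table is what targets the
overestimation bias mentioned above; how the relaxation factor is chosen
without knowing transition probabilities is left to the paper.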
[LINK]
http://arxiv.org/abs/2409.06356v2
[DATE]
2025-05-15 23:16:33+08:00
[CATEGORIES]
cs.LG
Schreier-Coset Graph Propagation
[AUTHORS]
Aryan Mishra, Lizhen Lin
[ABSTRACT]
Graph Neural Networks (GNNs) offer a principled framework for learning over
graph-structured data, yet their expressive capacity is often hindered by
over-squashing, wherein information from distant nodes is compressed into
fixed-size vectors. Existing solutions, including graph rewiring and
bottleneck-resistant architectures such as Cayley and expander graphs, avoid
this problem but introduce scalability bottlenecks. In particular, the Cayley
graphs constructed over $SL(2,\mathbb{Z}_n)$ exhibit strong theoretical
properties, yet suffer from cubic node growth $O(n^3)$, leading to high memory
usage. To address this, this work introduces Schreier-Coset Graph Propagation
(SCGP), a group-theoretic augmentation method that enriches node features
through Schreier-coset embeddings without altering the input graph topology.
SCGP embeds bottleneck-free connectivity patterns into a compact feature space,
improving long-range message passing while maintaining computational
efficiency. Empirical evaluations across standard node and graph classification
benchmarks demonstrate that SCGP achieves performance comparable to, or
exceeding, expander graph and rewired GNN baselines. Furthermore, SCGP exhibits
particular advantages in processing hierarchical and modular graph structures,
offering reduced inference latency, improved scalability, and a low memory
footprint, making it suitable for real-time and resource-constrained
applications.
[COMMENTS]
9 pages, 1 figure, preprint
[LINK]
http://arxiv.org/abs/2505.10392v1
[DATE]
2025-05-15 23:14:02+08:00
[CATEGORIES]
cs.LG
ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $α$-$β$-Divergence
[AUTHORS]
Guanghui Wang, Zhiyong Yang, Zitai Wang, Shi Wang, Qianqian Xu, Qingming Huang
[ABSTRACT]
Knowledge Distillation (KD) transfers knowledge from a large teacher model to
a smaller student model by minimizing the divergence between their output
distributions, typically using forward Kullback-Leibler divergence (FKLD) or
reverse KLD (RKLD). It has become an effective training paradigm due to the
broader supervision information provided by the teacher distribution compared
to one-hot labels. We identify that the core challenge in KD lies in balancing
two mode-concentration effects: the Hardness-Concentration
effect, which refers to focusing on modes with large errors, and the
Confidence-Concentration effect, which refers to focusing on
modes with high student confidence. Through an analysis of how probabilities
are reassigned during gradient updates, we observe that these two effects are
entangled in FKLD and RKLD, but in extreme forms. Specifically, both are too
weak in FKLD, causing the student to fail to concentrate on the target class.
In contrast, both are too strong in RKLD, causing the student to overly
emphasize the target class while ignoring the broader distributional
information from the teacher. To address this imbalance, we propose ABKD, a
generic framework with $\alpha$-$\beta$-divergence. Our theoretical results
show that ABKD offers a smooth interpolation between FKLD and RKLD, achieving
an effective trade-off between these effects. Extensive experiments on 17
language/vision datasets with 12 teacher-student settings confirm its efficacy.
The code is available at https://github.com/ghwang-s/abkd.
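
For readers unfamiliar with the two baseline objectives, the sketch below
computes forward and reverse KL divergence between a teacher and a student
distribution. It covers only FKLD and RKLD as defined in standard usage; the
alpha-beta-divergence itself is not reproduced here, and the function names
and the small eps smoothing term are assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def forward_kl(teacher, student):
    """FKLD: expectation taken under the teacher distribution."""
    return kl_divergence(teacher, student)

def reverse_kl(teacher, student):
    """RKLD: expectation taken under the student distribution."""
    return kl_divergence(student, teacher)

# Hypothetical 3-class example distributions.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
fkld = forward_kl(teacher, student)
rkld = reverse_kl(teacher, student)
```

The two values generally differ because KL divergence is asymmetric, which is
why the choice between FKLD and RKLD changes where the student places its
probability mass.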
[COMMENTS]
ICML 2025 Spotlight
[LINK]
http://arxiv.org/abs/2505.04560v2
[DATE]
2025-05-15 23:13:43+08:00
[CATEGORIES]
cs.LG
CryoSAMU: Enhancing 3D Cryo-EM Density Maps of Protein Structures at Intermediate Resolution with Structure-Aware Multimodal U-Nets
[AUTHORS]
Chenwei Zhang, Khanh Dao Duc
[ABSTRACT]
Enhancing cryogenic electron microscopy (cryo-EM) 3D density maps at
intermediate resolution (4-8 Å) is crucial in protein structure
determination. Recent advances in deep learning have led to the development of
automated approaches for enhancing experimental cryo-EM density maps. Yet,
these methods are not optimized for intermediate-resolution maps and rely on
map density features alone. To address this, we propose CryoSAMU, a novel
method designed to enhance 3D cryo-EM density maps of protein structures using
structure-aware multimodal U-Nets and trained on curated
intermediate-resolution density maps. We comprehensively evaluate CryoSAMU
across various metrics and demonstrate its competitive performance compared to
state-of-the-art methods. Notably, CryoSAMU achieves significantly faster
processing speed, showing promise for future practical applications. Our code
is available at https://github.com/chenwei-zhang/CryoSAMU.
[COMMENTS]
19 pages, 6 main figures, 2 supplementary figures, 3 main tables, 4
supplementary tables
[LINK]
http://arxiv.org/abs/2503.20291v2
[DATE]
2025-05-15 23:06:46+08:00
[CATEGORIES]
cs.LG
From Uncertain to Safe: Conformal Fine-Tuning of Diffusion Models for Safe PDE Control
[AUTHORS]
Peiyan Hu, Xiaowei Qian, Wenhao Deng, Rui Wang, Haodong Feng, Ruiqi Feng, Tao Zhang, Long Wei, Yue Wang, Zhi-Ming Ma, Tailin Wu
[ABSTRACT]
The application of deep learning for partial differential equation
(PDE)-constrained control is gaining increasing attention. However, existing
methods rarely consider safety requirements crucial in real-world applications.
To address this limitation, we propose Safe Diffusion Models for PDE Control
(SafeDiffCon), which introduce the uncertainty quantile as model uncertainty
quantification to achieve optimal control under safety constraints through both
post-training and inference phases. Firstly, our approach post-trains a
pre-trained diffusion model to generate control sequences that better satisfy
safety constraints while achieving improved control objectives via a reweighted
diffusion loss, which incorporates the uncertainty quantile estimated using
conformal prediction. Secondly, during inference, the diffusion model
dynamically adjusts both its generation process and parameters through
iterative guidance and fine-tuning, conditioned on control targets while
simultaneously integrating the estimated uncertainty quantile. We evaluate
SafeDiffCon on three control tasks: 1D Burgers’ equation, 2D incompressible
fluid, and a controlled nuclear fusion problem. Results demonstrate that
SafeDiffCon is the only method that satisfies all safety constraints, whereas
other classical and deep learning baselines fail. Furthermore, while adhering
to safety constraints, SafeDiffCon achieves the best control performance.
[LINK]
http://arxiv.org/abs/2502.02205v2
[DATE]
2025-05-15 23:00:10+08:00
[CATEGORIES]
cs.LG
Are Sparse Autoencoders Useful for Java Function Bug Detection?
[AUTHORS]
Rui Melo, Claudia Mamede, Andre Catarino, Rui Abreu, Henrique Lopes Cardoso
[ABSTRACT]
Software vulnerabilities such as buffer overflows and SQL injections are a
major source of security breaches. Traditional methods for vulnerability
detection remain essential but are limited by high false positive rates,
scalability issues, and reliance on manual effort. These constraints have
driven interest in AI-based approaches to automated vulnerability detection and
secure code generation. While Large Language Models (LLMs) have opened new
avenues for classification tasks, their complexity and opacity pose challenges
for interpretability and deployment. Sparse Autoencoders (SAEs) offer a promising
solution to this problem. We explore whether SAEs can serve as a lightweight,
interpretable alternative for bug detection in Java functions. We evaluate the
effectiveness of SAEs when applied to representations from GPT-2 Small and
Gemma 2B, examining their capacity to highlight buggy behaviour without
fine-tuning the underlying LLMs. We found that SAE-derived features enable bug
detection with an F1 score of up to 89%, consistently outperforming fine-tuned
transformer encoder baselines. Our work provides the first empirical evidence
that SAEs can be used to detect software bugs directly from the internal
representations of pretrained LLMs, without any fine-tuning or task-specific
supervision.
[COMMENTS]
10 pages, 10 figures
[LINK]
http://arxiv.org/abs/2505.10375v1
[DATE]
2025-05-15 22:59:17+08:00
[CATEGORIES]
cs.LG
ILIF: Temporal Inhibitory Leaky Integrate-and-Fire Neuron for Overactivation in Spiking Neural Networks
[AUTHORS]
Kai Sun, Peibo Duan, Levin Kuhlmann, Beilun Wang, Bin Zhang
[ABSTRACT]
The Spiking Neural Network (SNN) has drawn increasing attention for its
energy-efficient, event-driven processing and biological plausibility. To train
SNNs via backpropagation, surrogate gradients are used to approximate the
non-differentiable spike function, but they only maintain nonzero derivatives
within a narrow range of membrane potentials near the firing threshold,
referred to as the surrogate gradient support width gamma. We identify a major
challenge, termed the dilemma of gamma: a relatively large gamma leads to
overactivation, characterized by excessive neuron firing, which in turn
increases energy consumption, whereas a small gamma causes vanishing gradients
and weakens temporal dependencies. To address this, we propose a temporal
Inhibitory Leaky Integrate-and-Fire (ILIF) neuron model, inspired by biological
inhibitory mechanisms. This model incorporates interconnected inhibitory units
for membrane potential and current, effectively mitigating overactivation while
preserving gradient propagation. Theoretical analysis demonstrates ILIF's
effectiveness in overcoming the gamma dilemma, and extensive experiments on
multiple datasets show that ILIF improves energy efficiency by reducing firing
rates, stabilizes training, and enhances accuracy. The code is available at
github.com/kaisun1/ILIF.
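
To make the role of the support width gamma concrete, here is a minimal sketch
using a rectangular surrogate gradient, a common choice that the abstract does
not confirm as the one used. The function names, the threshold value, and the
sample membrane potentials are illustrative assumptions.

```python
def rect_surrogate_grad(v, threshold=1.0, gamma=1.0):
    """Rectangular surrogate: 1/gamma inside the support window, 0 outside."""
    return 1.0 / gamma if abs(v - threshold) < gamma / 2.0 else 0.0

def nonzero_fraction(potentials, threshold=1.0, gamma=1.0):
    """Fraction of membrane potentials that receive a nonzero gradient."""
    grads = [rect_surrogate_grad(v, threshold, gamma) for v in potentials]
    return sum(g > 0 for g in grads) / len(grads)

# Hypothetical membrane potentials spread around a threshold of 1.0.
potentials = [0.0, 0.5, 0.9, 1.0, 1.1, 1.5, 2.0]
narrow = nonzero_fraction(potentials, gamma=0.5)  # few neurons get gradient
wide = nonzero_fraction(potentials, gamma=2.0)    # most neurons get gradient
```

With a narrow window most potentials fall outside the support and contribute
zero gradient, which illustrates the vanishing-gradient side of the trade-off.
The sketch does not model the overactivation caused by a wide window.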
[LINK]
http://arxiv.org/abs/2505.10371v1
[DATE]
2025-05-15 22:56:06+08:00
[CATEGORIES]
cs.LG
A Hybrid Strategy for Aggregated Probabilistic Forecasting and Energy Trading in HEFTCom2024
[AUTHORS]
Chuanqing Pu, Feilong Fan, Nengling Tai, Songyuan Liu, Jinming Yu
[ABSTRACT]
Obtaining accurate probabilistic energy forecasts and making effective
decisions amid diverse uncertainties are routine challenges in future energy
systems. This paper presents the solution of team GEB, which ranked 3rd in
trading, 4th in forecasting, and 1st among student teams in the IEEE Hybrid
Energy Forecasting and Trading Competition 2024 (HEFTCom2024). The solution
provides accurate probabilistic forecasts for a wind-solar hybrid system, and
achieves substantial trading revenue in the day-ahead electricity market. Key
components include: (1) a stacking-based approach combining sister forecasts
from various Numerical Weather Predictions (NWPs) to provide wind power
forecasts, (2) an online solar post-processing model to address the
distribution shift in the online test set caused by increased solar capacity,
(3) a probabilistic aggregation method for accurate quantile forecasts of
hybrid generation, and (4) a stochastic trading strategy to maximize expected
trading revenue considering uncertainties in electricity prices. This paper
also explores the potential of end-to-end learning to further enhance the
trading revenue by adjusting the distribution of forecast errors. Detailed case
studies are provided to validate the effectiveness of these proposed methods.
Code for all mentioned methods is available for reproduction and further
research in both industry and academia.
[COMMENTS]
Solution description of IEEE Hybrid Energy Forecasting and Trading
Competition (HEFTCom)
[LINK]
http://arxiv.org/abs/2505.10367v1
[DATE]
2025-05-15 22:55:11+08:00
[CATEGORIES]
cs.LG
FactsR: A Safer Method for Producing High Quality Healthcare Documentation
[AUTHORS]
Victor Petrén Bach Hansen, Lasse Krogsbøll, Jonas Lyngsø, Mathias Baltzersen, Andreas Motzfeldt, Kevin Pelgrims, Lars Maaløe
[ABSTRACT]
There are now a multitude of AI-scribing solutions for healthcare promising
the utilization of large language models for ambient documentation. However,
these AI scribes still rely on one-shot, or few-shot prompts for generating
notes after the consultation has ended, employing little to no reasoning. This
risks long notes with an increase in hallucinations, misrepresentation of the
intent of the clinician, and reliance on the proofreading of the clinician to
catch errors. This is a dangerous combination for patient safety if vigilance is
compromised by workload and fatigue. In this paper, we introduce a method for
extracting salient clinical information in real-time alongside the healthcare
consultation, denoted Facts, and use that information recursively to generate
the final note. The FactsR method results in more accurate and concise notes by
placing the clinician in the loop of note generation, while opening up new use
cases within real-time decision support.
[LINK]
http://arxiv.org/abs/2505.10360v1
[DATE]
2025-05-15 22:51:22+08:00
[CATEGORIES]
cs.LG
SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity
[AUTHORS]
Shihao Zou, Qingfeng Li, Wei Ji, Jingjing Li, Yongkui Yang, Guoqi Li, Chao Dong
[ABSTRACT]
Spiking Neural Networks (SNNs) have shown competitive performance to
Artificial Neural Networks (ANNs) in various vision tasks, while offering
superior energy efficiency. However, existing SNN-based Transformers primarily
focus on single-image tasks, emphasizing spatial features while not effectively
leveraging SNNs’ efficiency in video-based vision tasks. In this paper, we
introduce SpikeVideoFormer, an efficient spike-driven video Transformer,
featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design
a spike-driven Hamming attention (SDHA) which provides a theoretically guided
adaptation from traditional real-valued attention to spike-driven attention.
Building on SDHA, we further analyze various spike-driven space-time attention
designs and identify an optimal scheme that delivers appealing performance for
video tasks, while maintaining only linear temporal complexity. The
generalization ability and efficiency of our model are demonstrated across
diverse downstream video tasks, including classification, human pose tracking,
and semantic segmentation. Empirical results show our method achieves
state-of-the-art (SOTA) performance compared to existing SNN approaches, with
over 15% improvement on the latter two tasks. Additionally, it matches the
performance of recent ANN-based methods while offering significant efficiency
gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the
three tasks. https://github.com/JimmyZou/SpikeVideoFormer
[COMMENTS]
Accepted by ICML 2025
[LINK]
http://arxiv.org/abs/2505.10352v1
[DATE]
2025-05-15 22:43:35+08:00
[CATEGORIES]
cs.LG
Uniform Loss vs. Specialized Optimization: A Comparative Analysis in Multi-Task Learning
[AUTHORS]
Gabriel S. Gama, Valdir Grassi Jr
[ABSTRACT]
Specialized Multi-Task Optimizers (SMTOs) balance task learning in Multi-Task
Learning by addressing issues like conflicting gradients and differing gradient
norms, which hinder equal-weighted task training. However, recent critiques
suggest that equally weighted tasks can achieve competitive results compared to
SMTOs, arguing that previous SMTO results were influenced by poor
hyperparameter optimization and lack of regularization. In this work, we
evaluate these claims through an extensive empirical evaluation of SMTOs,
including some of the latest methods, on more complex multi-task problems to
clarify this behavior. Our findings indicate that SMTOs perform well compared
to uniform loss and that fixed weights can achieve competitive performance
compared to SMTOs. Furthermore, we demonstrate why uniform loss performs
similarly to SMTOs in some instances. The code will be made publicly available.
[LINK]
http://arxiv.org/abs/2505.10347v1
[DATE]
2025-05-15 22:34:36+08:00
[CATEGORIES]
cs.LG
An Introduction to Discrete Variational Autoencoders
[AUTHORS]
Alan Jeffares, Liyuan Liu
[ABSTRACT]
Variational Autoencoders (VAEs) are well-established as a principled approach
to probabilistic unsupervised learning with neural networks. Typically, an
encoder network defines the parameters of a Gaussian distributed latent space
from which we can sample and pass realizations to a decoder network. This model
is trained to reconstruct its inputs and is optimized through the evidence
lower bound. In recent years, discrete latent spaces have grown in popularity,
suggesting that they may be a natural choice for many data modalities (e.g.
text). In this tutorial, we provide a rigorous, yet practical, introduction to
discrete variational autoencoders – specifically, VAEs in which the latent
space is made up of latent variables that follow a categorical distribution. We
assume only a basic mathematical background with which we carefully derive each
step from first principles. From there, we develop a concrete training recipe
and provide an example implementation, hosted at
https://github.com/alanjeffares/discreteVAE.
[COMMENTS]
Tutorial paper
[LINK]
http://arxiv.org/abs/2505.10344v1
[DATE]
2025-05-15 22:33:31+08:00
[CATEGORIES]
cs.LG
Towards Graph Foundation Models: Training on Knowledge Graphs Enables Transferability to General Graphs
[AUTHORS]
Kai Wang, Siqiang Luo, Caihua Shan, Yifei Shen
[ABSTRACT]
Inspired by the success of large language models, there is a trend toward
developing graph foundation models to conduct diverse downstream tasks in
various domains. However, current models often require extra fine-tuning to
apply their learned structural and semantic representations to new graphs,
which limits their versatility. Recent breakthroughs in zero-shot inductive
reasoning on knowledge graphs (KGs) offer us a new perspective on extending KG
reasoning to general graph applications. In this paper, we introduce SCR, a
unified graph reasoning framework designed to train on knowledge graphs and
effectively generalize across a wide range of graph tasks and domains. We begin
by designing the task-specific KG structures to establish a unified topology
for different task formats. Then we propose semantic-conditioned message
passing, a novel mechanism addressing the inherent semantic isolation in
traditional KG reasoning, by jointly modeling structural and semantic
invariance patterns in graph representations. To demonstrate the effectiveness,
we evaluate the inductive reasoning capability of SCR using 38 diverse graph
datasets, covering node-level, link-level, and graph-level tasks across
multiple domains. Our results show substantial performance gains over existing
foundation models and supervised baselines, highlighting the efficacy and
adaptability of our approach.
[COMMENTS]
25 Pages, 5 figures
[LINK]
http://arxiv.org/abs/2410.12609v2
[DATE]
2025-05-15 22:27:59+08:00
[CATEGORIES]
cs.LG
Optimizing Power Grid Topologies with Reinforcement Learning: A Survey of Methods and Challenges
[AUTHORS]
Erica van der Sar, Alessandro Zocca, Sandjai Bhulai
[ABSTRACT]
Power grid operation is becoming increasingly complex due to the rising
integration of renewable energy sources and the need for more adaptive control
strategies. Reinforcement Learning (RL) has emerged as a promising approach to
power network control (PNC), offering the potential to enhance decision-making
in dynamic and uncertain environments. The Learning To Run a Power Network
(L2RPN) competitions have played a key role in accelerating research by
providing standardized benchmarks and problem formulations, leading to rapid
advancements in RL-based methods. This survey provides a comprehensive and
structured overview of RL applications for power grid topology optimization,
categorizing existing techniques, highlighting key design choices, and
identifying gaps in current research. Additionally, we present a comparative
numerical study evaluating the impact of commonly applied RL-based methods,
offering insights into their practical effectiveness. By consolidating existing
research and outlining open challenges, this survey aims to provide a
foundation for future advancements in RL-driven power grid optimization.
[COMMENTS]
60 pages, 26 figures, preprint
[LINK]
http://arxiv.org/abs/2504.08210v2
[DATE]
2025-05-15 22:22:35+08:00
[CATEGORIES]
cs.LG
Machine Learning with Physics Knowledge for Prediction: A Survey
[AUTHORS]
Joe Watson, Chen Song, Oliver Weeger, Theo Gruner, An T. Le, Kay Pompetzki, Ahmed Hendawy, Oleg Arenz, Will Trojak, Miles Cranmer, Carlo D’Eramo, Fabian Bülow, Tanmay Goyal, Jan Peters, Martin W. Hoffman
[ABSTRACT]
This survey examines the broad suite of methods and models for combining
machine learning with physics knowledge for prediction and forecast, with a
focus on partial differential equations. These methods have attracted
significant interest due to their potential impact on advancing scientific
research and industrial practices by improving predictive models with small- or
large-scale datasets and expressive predictive models with useful inductive
biases. The survey has two parts. The first considers incorporating physics
knowledge on an architectural level through objective functions, structured
predictive models, and data augmentation. The second considers data as physics
knowledge, which motivates looking at multi-task, meta, and contextual learning
as an alternative approach to incorporating physics knowledge in a data-driven
fashion. Finally, we also provide an industrial perspective on the application
of these methods and a survey of the open-source ecosystem for physics-informed
machine learning.
[COMMENTS]
61 pages, 8 figures, 2 tables. Accepted at the Transactions on
Machine Learning Research (TMLR)
[LINK]
http://arxiv.org/abs/2408.09840v2
[DATE]
2025-05-15 22:20:49+08:00
[CATEGORIES]
cs.LG
Emergence of Structure in Ensembles of Random Neural Networks
[AUTHORS]
Luca Muscarnera, Luigi Loreti, Giovanni Todeschini, Alessio Fumagalli, Francesco Regazzoni
[ABSTRACT]
Randomness is ubiquitous in many applications across data science and machine
learning. Remarkably, systems composed of random components often display
emergent global behaviors that appear deterministic, manifesting a transition
from microscopic disorder to macroscopic organization. In this work, we
introduce a theoretical model for studying the emergence of collective
behaviors in ensembles of random classifiers. We argue that, if the ensemble is
weighted through the Gibbs measure defined by adopting the classification loss
as an energy, then there exists a finite temperature parameter for the
distribution such that the classification is optimal, with respect to the loss
(or the energy). Interestingly, for the case in which samples are generated by
a Gaussian distribution and labels are constructed by employing a teacher
perceptron, we analytically prove and numerically confirm that such optimal
temperature depends neither on the teacher classifier (which is, by
construction of the learning problem, unknown), nor on the number of random
classifiers, highlighting the universal nature of the observed behavior.
Experiments on the MNIST dataset underline the relevance of this phenomenon in
high-quality, noiseless datasets. Finally, a physical analogy allows us to
shed light on the self-organizing nature of the studied phenomenon.
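
As a small illustration of weighting an ensemble by a Gibbs measure with the
loss as the energy, the sketch below assigns each classifier a weight
proportional to exp(-loss / T). The function name and the example losses are
assumptions; the paper's exact parameterization (for example, inverse
temperature rather than T) may differ.

```python
import math

def gibbs_weights(losses, temperature):
    """Normalized weights proportional to exp(-loss / temperature)."""
    # Subtract the minimum loss before exponentiating for numerical stability;
    # the shift cancels after normalization.
    m = min(losses)
    unnorm = [math.exp(-(loss - m) / temperature) for loss in losses]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical classification losses of three random classifiers.
losses = [0.1, 0.5, 0.9]
cold = gibbs_weights(losses, temperature=0.1)     # concentrates on the best one
hot = gibbs_weights(losses, temperature=1000.0)   # close to uniform averaging
```

Low temperatures concentrate the weight on the lowest-loss classifier, while
high temperatures approach a uniform average; the optimal temperature
discussed above lies between these extremes.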
[LINK]
http://arxiv.org/abs/2505.10331v1
[DATE]
2025-05-15 22:20:02+08:00
[CATEGORIES]
cs.LG
Efficient Adaptation of Reinforcement Learning Agents to Sudden Environmental Change
[AUTHORS]
Jonathan Clifford Balloch
[ABSTRACT]
Real-world autonomous decision-making systems, from robots to recommendation
engines, must operate in environments that change over time. While deep
reinforcement learning (RL) has shown an impressive ability to learn optimal
policies in stationary environments, most methods are data intensive and assume
a world that does not change between training and test time. As a result,
conventional RL methods struggle to adapt when conditions change. This poses a
fundamental challenge: how can RL agents efficiently adapt their behavior when
encountering novel environmental changes during deployment without
catastrophically forgetting useful prior knowledge? This dissertation
demonstrates that efficient online adaptation requires two key capabilities:
(1) prioritized exploration and sampling strategies that help identify and
learn from relevant experiences, and (2) selective preservation of prior
knowledge through structured representations that can be updated without
disruption to reusable components.
[COMMENTS]
PhD Dissertation, 131 pages
[LINK]
http://arxiv.org/abs/2505.10330v1
[DATE]
2025-05-15 22:19:01+08:00
[CATEGORIES]
cs.LG
Towards Foundation Model for Chemical Reactor Modeling: Meta-Learning with Physics-Informed Adaptation
[AUTHORS]
Zihao Wang, Zhe Wu
[ABSTRACT]
Developing accurate models for chemical reactors is often challenging due to
the complexity of reaction kinetics and process dynamics. Traditional
approaches require retraining models for each new system, limiting
generalizability and efficiency. In this work, we take a step toward foundation
models for chemical reactor modeling by introducing a neural network framework
that generalizes across diverse reactor types and rapidly adapts to new
chemical processes. Our approach leverages meta-learning to pretrain the model
on a broad set of reactor dynamics, enabling efficient adaptation to unseen
reactions with minimal data. To further enhance generalizability, we
incorporate physics-informed fine-tuning, ensuring physically consistent
adaptation to new reactor conditions. Our framework is evaluated across three
integer-order fundamental reactor types - continuous stirred tank reactors,
batch reactors, and plug flow reactors - demonstrating superior few-shot
adaptation compared to conventional data-driven, physics-informed, and transfer
learning approaches. By combining meta-learning with physics-informed
adaptation, this work lays the foundation for a generalizable modeling
framework, advancing the development of foundation models for chemical
engineering applications. Source code is available at
https://github.com/killingbear999/chemical-reactor-foundation-model.
[COMMENTS]
Chemical Engineering Research and Design
[LINK]
http://arxiv.org/abs/2405.11752v3
[DATE]
2025-05-15 22:08:49+08:00
[CATEGORIES]
cs.LG
A Representation Learning Approach to Feature Drift Detection in Wireless Networks
[AUTHORS]
Athanasios Tziouvaras, Blaz Bertalanic, George Floros, Kostas Kolomvatsos, Panagiotis Sarigiannidis, Carolina Fortuna
[ABSTRACT]
AI is foreseen to be a centerpiece in next generation wireless networks,
enabling ubiquitous communication as well as new services. However, in
real deployment, feature distribution changes may degrade the performance of AI
models and lead to undesired behaviors. To counter undetected model
degradation, we propose ALERT, a method that can detect feature distribution
changes and trigger model re-training that works well on two wireless network
use cases: wireless fingerprinting and link anomaly detection. ALERT includes
three components: representation learning, statistical testing and utility
assessment. We rely on MLP for designing the representation learning component,
on Kolmogorov-Smirnov and Population Stability Index tests for designing the
statistical testing and a new function for utility assessment. We show the
superiority of the proposed method against ten standard drift detection methods
available in the literature on two wireless network use cases.
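
The two statistics named for the statistical testing component can be sketched
in a few lines. This is a generic, standard-library illustration of the
two-sample Kolmogorov-Smirnov statistic and the Population Stability Index,
not ALERT's implementation; the function names, the equal-width binning, and
the eps floor are assumptions, and decision thresholds are deliberately left
out.

```python
import math

def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Advance both CDFs past every sample equal to the current value.
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index over equal-width bins of the reference data."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def proportions(data):
        counts = [0] * bins
        for v in data:
            k = int((v - lo) / width)
            counts[min(max(k, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(data), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

In a drift detector, both statistics would be computed between a reference
window and a recent window of learned representations; larger values indicate
a larger distribution change.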
[LINK]
http://arxiv.org/abs/2505.10325v1
[DATE]
2025-05-15 22:08:00+08:00
[CATEGORIES]
cs.LG
Intelligently Augmented Contrastive Tensor Factorization: Empowering Multi-dimensional Time Series Classification in Low-Data Environments
[AUTHORS]
Anushiya Arunan, Yan Qin, Xiaoli Li, Yuen Chau
[ABSTRACT]
Classification of multi-dimensional time series from real-world systems
requires fine-grained learning of complex features such as cross-dimensional
dependencies and intra-class variations, all under the practical challenge of
low training data availability. However, standard deep learning (DL) struggles
to learn generalizable features in low-data environments due to model
overfitting. We propose a versatile yet data-efficient framework, Intelligently
Augmented Contrastive Tensor Factorization (ITA-CTF), to learn effective
representations from multi-dimensional time series. The CTF module learns core
explanatory components of the time series (e.g., sensor factors, temporal
factors), and importantly, their joint dependencies. Notably, unlike standard
tensor factorization (TF), the CTF module incorporates a new contrastive loss
optimization to induce similarity learning and class-awareness into the learnt
representations for better classification performance. To strengthen this
contrastive learning, the preceding ITA module generates targeted but
informative augmentations that highlight realistic intra-class patterns in the
original data, while preserving class-wise properties. This is achieved by
dynamically sampling a “soft” class prototype to guide the warping of each
query data sample, which results in an augmentation that is intelligently
pattern-mixed between the “soft” class prototype and the query sample. These
augmentations enable the CTF module to recognize complex intra-class variations
despite the limited original training data, and seek out invariant class-wise
properties for accurate classification performance. The proposed method is
comprehensively evaluated on five different classification tasks. Compared to
standard TF and several DL benchmarks, notable performance improvements up to
18.7% were achieved.
[COMMENTS]
Accepted in Expert Systems with Applications
(DOI:https://doi.org/10.1016/j.eswa.2025.127889)
[LINK]
http://arxiv.org/abs/2505.03825v2
[DATE]
2025-05-15 22:07:22+08:00
[CATEGORIES]
cs.LG
Asynchronous Decentralized SGD under Non-Convexity: A Block-Coordinate Descent Framework
[AUTHORS]
Yijie Zhou, Shi Pu
[ABSTRACT]
Decentralized optimization has become vital for leveraging distributed data
without central control, enhancing scalability and privacy. However, practical
deployments face fundamental challenges due to heterogeneous computation speeds
and unpredictable communication delays. This paper introduces a refined model
of Asynchronous Decentralized Stochastic Gradient Descent (ADSGD) under
practical assumptions of bounded computation and communication times. To
understand the convergence of ADSGD, we first analyze Asynchronous Stochastic
Block Coordinate Descent (ASBCD) as a tool, and then show that ADSGD converges
under computation-delay-independent step sizes. The convergence result is
established without assuming bounded data heterogeneity. Empirical experiments
reveal that ADSGD outperforms existing methods in wall-clock convergence time
across various scenarios. With its simplicity, efficiency in memory and
communication, and resilience to communication and computation delays, ADSGD is
well-suited for real-world decentralized learning tasks.
[LINK]
http://arxiv.org/abs/2505.10322v1
[DATE]
2025-05-15 22:06:38+08:00
[CATEGORIES]
cs.LG
Optimizing Electric Bus Charging Scheduling with Uncertainties Using Hierarchical Deep Reinforcement Learning
[AUTHORS]
Jiaju Qi, Lei Lei, Thorsteinn Jonsson, Dusit Niyato
[ABSTRACT]
The growing adoption of Electric Buses (EBs) represents a significant step
toward sustainable development. By utilizing Internet of Things (IoT) systems,
charging stations can autonomously determine charging schedules based on
real-time data. However, optimizing EB charging schedules remains a critical
challenge due to uncertainties in travel time, energy consumption, and
fluctuating electricity prices. Moreover, to address real-world complexities,
charging policies must make decisions efficiently across multiple time scales
and remain scalable for large EB fleets. In this paper, we propose a
Hierarchical Deep Reinforcement Learning (HDRL) approach that reformulates the
original Markov Decision Process (MDP) into two augmented MDPs. To solve these
MDPs and enable multi-timescale decision-making, we introduce a novel HDRL
algorithm, namely Double Actor-Critic Multi-Agent Proximal Policy Optimization
Enhancement (DAC-MAPPO-E). Scalability challenges of the Double Actor-Critic
(DAC) algorithm for large-scale EB fleets are addressed through enhancements at
both decision levels. At the high level, we redesign the decentralized actor
network and integrate an attention mechanism to extract relevant global state
information for each EB, decreasing the size of neural networks. At the low
level, the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm is
incorporated into the DAC framework, enabling decentralized and coordinated
charging power decisions, reducing computational complexity and enhancing
convergence speed. Extensive experiments with real-world data demonstrate the
superior performance and scalability of DAC-MAPPO-E in optimizing EB fleet
charging schedules.
[LINK]
http://arxiv.org/abs/2505.10296v1
[DATE]
2025-05-15 21:44:27+08:00
[CATEGORIES]
cs.LG
Estimating the number of household TV profiles based in customer behaviour using Gaussian mixture model averaging
[AUTHORS]
Gabriel R. Palma, Sally McClean, Brahim Allan, Zeeshan Tariq, Rafael A. Moral
[ABSTRACT]
TV customers today face many choices from many live channels and on-demand
services. Providing a personalised experience that saves customers time when
discovering content is essential for TV providers. However, a reliable
understanding of their behaviour and preferences is key. When creating
personalised recommendations for TV, the biggest challenge is understanding
viewing behaviour within households when multiple people are watching. The
objective is to detect and combine individual profiles to make
better-personalised recommendations for group viewing. Our challenge is that we
have little explicit information about who is watching the devices at any time
(individuals or groups). Also, we do not have a way to combine more than one
individual profile to make better recommendations for group viewing. We propose
a novel framework using Gaussian mixture model averaging to obtain point
estimates for the number of household TV profiles and a Bayesian random walk
model to introduce uncertainty. We applied our approach using data from real
customers whose TV-watching data totalled approximately half a million
observations. Our results indicate that combining our framework with the
selected features provides a means to estimate the number of household TV
profiles and their characteristics, including shifts over time and
quantification of uncertainty.
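
One common way to average over mixture models with different numbers of components can be sketched as follows. The abstract does not say how the averaging weights are computed, so the BIC-based weights here are an assumption, and the function names and BIC values are hypothetical. The sketch only shows how a point estimate of the component count can come from a weighted average rather than from a single best model.

```python
import math

def bic_weights(bic_by_k):
    """Turn BIC scores (lower is better) into normalized model weights."""
    best = min(bic_by_k.values())
    # Subtract the best score first so the exponentials stay numerically stable.
    raw = {k: math.exp(-0.5 * (bic - best)) for k, bic in bic_by_k.items()}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}

def averaged_component_count(bic_by_k):
    """Weighted average of the candidate component counts."""
    weights = bic_weights(bic_by_k)
    return sum(k * w for k, w in weights.items())

# Hypothetical BIC scores for mixtures with 1 to 4 components (not from the paper).
example_bic = {1: 1250.0, 2: 1204.0, 3: 1203.0, 4: 1215.0}
estimate = averaged_component_count(example_bic)
```

With these made-up scores the three-component model gets the largest weight, and the averaged estimate falls between 2 and 3 because the two-component model is almost as well supported.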
[COMMENTS]
21 pages
[LINK]
http://arxiv.org/abs/2505.10279v1
[DATE]
2025-05-15 21:27:32+08:00
[CATEGORIES]
cs.LG
System Log Parsing with Large Language Models: A Review
[AUTHORS]
Viktor Beck, Max Landauer, Markus Wurzenberger, Florian Skopik, Andreas Rauber
[ABSTRACT]
Log data provides crucial insights for tasks like monitoring, root cause
analysis, and anomaly detection. Due to the vast volume of logs, automated log
parsing is essential to transform semi-structured log messages into structured
representations. Recent advances in large language models (LLMs) have
introduced the new research field of LLM-based log parsing. Despite promising
results, there is no structured overview of the approaches in this relatively
new research field with the earliest advances published in late 2023. This work
systematically reviews 29 LLM-based log parsing methods. We benchmark seven of
them on public datasets and critically assess their comparability and the
reproducibility of their reported results. Our findings summarize the advances
of this new research field, with insights on how to report results, which data
sets, metrics and which terminology to use, and which inconsistencies to avoid,
with code and results made publicly available for transparency.
[COMMENTS]
36 pages, 11 figures
[LINK]
http://arxiv.org/abs/2504.04877v2
[DATE]
2025-05-15 21:27:26+08:00
[CATEGORIES]
cs.LG
Spike-timing-dependent Hebbian learning as noisy gradient descent
[AUTHORS]
Niklas Dexheimer, Sascha Gaudlitz, Johannes Schmidt-Hieber
[ABSTRACT]
Hebbian learning is a key principle underlying learning in biological neural
networks. It postulates that synaptic changes occur locally, depending on the
activities of pre- and postsynaptic neurons. While Hebbian learning based on
neuronal firing rates is well explored, much less is known about learning rules
that account for precise spike-timing. We relate a Hebbian
spike-timing-dependent plasticity rule to noisy gradient descent with respect
to a natural loss function on the probability simplex. This connection allows
us to prove that the learning rule eventually identifies the presynaptic neuron
with the highest activity. We also discover an intrinsic connection to noisy
mirror descent.
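
As a rough illustration of noisy mirror descent on the probability simplex, the sketch below runs a generic exponentiated-gradient update with noisy activity observations. It is not the paper's plasticity rule: the linear loss, the Gaussian noise model, and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def noisy_mirror_descent(rates, steps=2000, lr=0.1, noise_std=0.5, seed=0):
    """Exponentiated-gradient updates on the simplex with noisy gradients."""
    rng = random.Random(seed)
    n = len(rates)
    weights = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(steps):
        # Noisy observation of each input's activity (assumed Gaussian noise).
        noisy = [r + rng.gauss(0.0, noise_std) for r in rates]
        # Multiplicative update: mirror descent with the entropy mirror map
        # applied to the assumed linear loss -sum(w_i * rate_i).
        scaled = [w * math.exp(lr * g) for w, g in zip(weights, noisy)]
        total = sum(scaled)
        weights = [s / total for s in scaled]  # renormalize onto the simplex
    return weights

# Hypothetical mean activities of three presynaptic inputs.
final_weights = noisy_mirror_descent([0.2, 0.5, 0.9])
```

In this toy setting the weight vector concentrates on the input with the highest mean activity, which is the qualitative behaviour the abstract describes.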
[LINK]
http://arxiv.org/abs/2505.10272v1
[DATE]
2025-05-15 21:23:16+08:00
[CATEGORIES]
cs.LG
RainPro-8: An Efficient Deep Learning Model to Estimate Rainfall Probabilities Over 8 Hours
[AUTHORS]
Rafael Pablos Sarabia, Joachim Nyborg, Morten Birk, Jeppe Liborius Sjørup, Anders Lillevang Vesterholt, Ira Assent
[ABSTRACT]
We present a deep learning model for high-resolution probabilistic
precipitation forecasting over an 8-hour horizon in Europe, overcoming the
limitations of radar-only deep learning models with short forecast lead times.
Our model efficiently integrates multiple data sources - including radar,
satellite, and physics-based numerical weather prediction (NWP) - while
capturing long-range interactions, resulting in accurate forecasts with robust
uncertainty quantification through consistent probabilistic maps. Featuring a
compact architecture, it enables more efficient training and faster inference
than existing models. Extensive experiments demonstrate that our model
surpasses current operational NWP systems, extrapolation-based methods, and
deep-learning nowcasting models, setting a new standard for high-resolution
precipitation forecasting in Europe, ensuring a balance between accuracy,
interpretability, and computational efficiency.
[LINK]
http://arxiv.org/abs/2505.10271v1
[DATE]
2025-05-15 21:22:20+08:00
[CATEGORIES]
cs.LG
TimeBridge: Non-Stationarity Matters for Long-term Time Series Forecasting
[AUTHORS]
Peiyuan Liu, Beiliang Wu, Yifan Hu, Naiqi Li, Tao Dai, Jigang Bao, Shu-tao Xia
[ABSTRACT]
Non-stationarity poses significant challenges for multivariate time series
forecasting due to the inherent short-term fluctuations and long-term trends
that can lead to spurious regressions or obscure essential long-term
relationships. Most existing methods either eliminate or retain
non-stationarity without adequately addressing its distinct impacts on
short-term and long-term modeling. Eliminating non-stationarity is essential
for avoiding spurious regressions and capturing local dependencies in
short-term modeling, while preserving it is crucial for revealing long-term
cointegration across variates. In this paper, we propose TimeBridge, a novel
framework designed to bridge the gap between non-stationarity and dependency
modeling in long-term time series forecasting. By segmenting input series into
smaller patches, TimeBridge applies Integrated Attention to mitigate short-term
non-stationarity and capture stable dependencies within each variate, while
Cointegrated Attention preserves non-stationarity to model long-term
cointegration across variates. Extensive experiments show that TimeBridge
consistently achieves state-of-the-art performance in both short-term and
long-term forecasting. Additionally, TimeBridge demonstrates exceptional
performance in financial forecasting on the CSI 500 and S&P 500 indices,
further validating its robustness and effectiveness. Code is available at
https://github.com/Hank0626/TimeBridge.
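
The patch segmentation step mentioned above can be sketched in a few lines. This is a minimal illustration with a hypothetical helper name and parameters; it does not reproduce the TimeBridge code or its attention modules.

```python
def split_into_patches(series, patch_len, stride=None):
    """Split a 1-D series into fixed-length patches; a trailing remainder is dropped."""
    stride = patch_len if stride is None else stride
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]

# Non-overlapping patches of length 4 from a toy series of 10 points.
patches = split_into_patches(list(range(10)), patch_len=4)
# Overlapping patches: same length, but a stride of 2.
overlapping = split_into_patches(list(range(10)), patch_len=4, stride=2)
```

A stride smaller than the patch length gives overlapping patches, which trades more patches per series for redundancy between neighbours.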
[LINK]
http://arxiv.org/abs/2410.04442v4
[DATE]
2025-05-15 21:21:39+08:00
[CATEGORIES]
cs.LG
Electric Bus Charging Schedules Relying on Real Data-Driven Targets Based on Hierarchical Deep Reinforcement Learning
[AUTHORS]
Jiaju Qi, Lei Lei, Thorsteinn Jonsson, Lajos Hanzo
[ABSTRACT]
The charging scheduling problem of Electric Buses (EBs) is investigated based
on Deep Reinforcement Learning (DRL). A Markov Decision Process (MDP) is
conceived, where the time horizon includes multiple charging and operating
periods in a day, while each period is further divided into multiple time
steps. To overcome the challenge of long-range multi-phase planning with sparse
reward, we conceive Hierarchical DRL (HDRL) for decoupling the original MDP
into a high-level Semi-MDP (SMDP) and multiple low-level MDPs. The Hierarchical
Double Deep Q-Network (HDDQN)-Hindsight Experience Replay (HER) algorithm is
proposed for simultaneously solving the decision problems arising at different
temporal resolutions. As a result, the high-level agent learns an effective
policy for prescribing the charging targets for every charging period, while
the low-level agent learns an optimal policy for setting the charging power of
every time step within a single charging period, with the aim of minimizing the
charging costs while meeting the charging target. It is proved that the flat
policy constructed by superimposing the optimal high-level policy and the
optimal low-level policy performs as well as the optimal policy of the original
MDP. Since jointly learning both levels of policies is challenging due to the
non-stationarity of the high-level agent and the sampling inefficiency of the
low-level agent, we divide the joint learning process into two phases and
exploit our new HER algorithm to manipulate the experience replay buffers for
both levels of agents. Numerical experiments are performed with the aid of
real-world data to evaluate the performance of the proposed algorithm.
[LINK]
http://arxiv.org/abs/2505.10262v1
[DATE]
2025-05-15 21:13:41+08:00
[CATEGORIES]
cs.LG
SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained Devices
[AUTHORS]
Xiangwen Zhuge, Xu Shen, Zeyu Wang, Fan Dang, Xuan Ding, Danyang Li, Yahui Han, Tianxiang Hao, Zheng Yang
[ABSTRACT]
Efficient LLM inference on resource-constrained devices presents significant
challenges in compute and memory utilization. Due to limited GPU memory,
existing systems offload model weights to CPU memory, incurring substantial I/O
overhead between the CPU and GPU. This leads to two major inefficiencies: (1)
GPU cores are underutilized, often remaining idle while waiting for data to be
loaded; and (2) GPU memory has low impact on performance, as reducing its
capacity has minimal effect on overall throughput. In this paper, we propose
SpecOffload, a high-throughput inference engine that embeds speculative
decoding into offloading. Our key idea is to unlock latent GPU resources for
storing and executing a draft model used for speculative decoding, thus
accelerating inference at near-zero additional cost. To support this, we
carefully orchestrate the interleaved execution of target and draft models in
speculative decoding within the offloading pipeline, and propose a planner to
manage tensor placement and select optimal parameters. Compared to the best
baseline, SpecOffload improves GPU core utilization by 4.49x and boosts
inference throughput by 2.54x. Our code is available at
https://github.com/MobiSense/SpecOffload.
[LINK]
http://arxiv.org/abs/2505.10259v1
[DATE]
2025-05-15 21:10:31+08:00
[CATEGORIES]
cs.LG
Improving Fine-Grained Control via Aggregation of Multiple Diffusion Models
[AUTHORS]
Conghan Yue, Zhengwei Peng, Shiyan Du, Zhi Ji, Chuangjian Cai, Le Wan, Dongyu Zhang
[ABSTRACT]
While many diffusion models perform well when controlling a particular
aspect among style, character, and interaction, they struggle with fine-grained
control due to dataset limitations and intricate model architecture design.
This paper first introduces a novel training-free algorithm in fine-grained
generation, Aggregation of Multiple Diffusion Models (AMDM), which integrates
features from multiple diffusion models into a specified model to activate
specific features and enable fine-grained control. Experimental results
demonstrate that AMDM significantly improves fine-grained control without
training, validating its effectiveness. Additionally, it reveals that diffusion
models initially focus on features such as position, attributes, and style,
with later stages improving generation quality and consistency. AMDM offers a
new perspective for tackling the challenges of fine-grained conditional control
generation in diffusion models: We can fully utilize existing or develop new
conditional diffusion models that control specific aspects, and then aggregate
them using the AMDM algorithm. This eliminates the need for constructing complex
datasets, designing intricate model architectures, and incurring high training
costs. Code is available at: https://github.com/Hammour-steak/AMDM.
[LINK]
http://arxiv.org/abs/2410.01262v3
[DATE]
2025-05-15 20:59:09+08:00
[CATEGORIES]
cs.LG
R2VF: A Two-Step Regularization Algorithm to Cluster Categories in GLMs
[AUTHORS]
Yuval Ben Dror
[ABSTRACT]
Over recent decades, extensive research has aimed to overcome the restrictive
underlying assumptions required for a Generalized Linear Model to generate
accurate and meaningful predictions. These efforts include regularizing
coefficients, selecting features, and clustering ordinal categories, among
other approaches. Despite these advances, efficiently clustering nominal
categories in GLMs without incurring high computational costs remains a
challenge. This paper introduces Ranking to Variable Fusion (R2VF), a two-step
method designed to efficiently fuse nominal and ordinal categories in GLMs. By
first transforming nominal features into an ordinal framework via regularized
regression and then applying variable fusion, R2VF strikes a balance between
model complexity and interpretability. We demonstrate the effectiveness of R2VF
through comparisons with other methods, highlighting its performance in
addressing overfitting and identifying an appropriate set of covariates.
[LINK]
http://arxiv.org/abs/2503.01521v2
[DATE]
2025-05-15 20:45:16+08:00
[CATEGORIES]
cs.LG
Representation Convergence: Mutual Distillation is Secretly a Form of Regularization
[AUTHORS]
Zhengpeng Xie, Jiahang Cao, Qiang Zhang, Jianxiong Zhang, Changwei Wang, Renjing Xu
[ABSTRACT]
In this paper, we argue that mutual distillation between reinforcement
learning policies serves as an implicit regularization, preventing them from
overfitting to irrelevant features. We highlight two key contributions: (a)
Theoretically, for the first time, we prove that enhancing the policy
robustness to irrelevant features leads to improved generalization performance.
(b) Empirically, we demonstrate that mutual distillation between policies
contributes to such robustness, enabling the spontaneous emergence of invariant
representations over pixel inputs. Overall, our findings challenge the
conventional view of distillation as merely a means of knowledge transfer,
offering a novel perspective on the generalization in deep reinforcement
learning.
[LINK]
http://arxiv.org/abs/2501.02481v4
[DATE]
2025-05-15 20:40:27+08:00
[CATEGORIES]
cs.LG
Informed Forecasting: Leveraging Auxiliary Knowledge to Boost LLM Performance on Time Series Forecasting
[AUTHORS]
Mohammadmahdi Ghasemloo, Alireza Moradi
[ABSTRACT]
With the widespread adoption of Large Language Models (LLMs), there is a
growing need to establish best practices for leveraging their capabilities
beyond traditional natural language tasks. In this paper, a novel cross-domain
knowledge transfer framework is proposed to enhance the performance of LLMs in
time series forecasting – a task of increasing relevance in fields such as
energy systems, finance, and healthcare. The approach systematically infuses
LLMs with structured temporal information to improve their forecasting
accuracy. This study evaluates the proposed method on a real-world time series
dataset and compares it to a naive baseline where the LLM receives no auxiliary
information. Results show that knowledge-informed forecasting significantly
outperforms the uninformed baseline in terms of predictive accuracy and
generalization. These findings highlight the potential of knowledge transfer
strategies to bridge the gap between LLMs and domain-specific forecasting
tasks.
[LINK]
http://arxiv.org/abs/2505.10213v1
[DATE]
2025-05-15 20:17:52+08:00
[CATEGORIES]
cs.LG
Collaborative Speculative Inference for Efficient LLM Inference Serving
[AUTHORS]
Luyao Gao, Jianchun Liu, Hongli Xu, Xichong Zhang, Yunming Liao, Liusheng Huang
[ABSTRACT]
Speculative inference is a promising paradigm employing small speculative
models (SSMs) as drafters to generate draft tokens, which are subsequently
verified in parallel by the target large language model (LLM). This approach
enhances the efficiency of inference serving by reducing LLM inference latency
and costs while preserving generation quality. However, existing speculative
methods face critical challenges, including inefficient resource utilization
and limited draft acceptance, which constrain their scalability and overall
effectiveness. To overcome these obstacles, we present CoSine, a novel
speculative inference system that decouples sequential speculative decoding
from parallel verification, enabling efficient collaboration among multiple
nodes. Specifically, CoSine routes inference requests to specialized drafters
based on their expertise and incorporates a confidence-based token fusion
mechanism to synthesize outputs from cooperating drafters, ensuring
high-quality draft generation. Additionally, CoSine dynamically orchestrates
the execution of speculative decoding and verification in a pipelined manner,
employing batch scheduling to selectively group requests and adaptive
speculation control to minimize idle periods. By optimizing parallel workflows
through heterogeneous node collaboration, CoSine balances draft generation and
verification throughput in real-time, thereby maximizing resource utilization.
Experimental results demonstrate that CoSine achieves superior performance
compared to state-of-the-art speculative approaches. Notably, with equivalent
resource costs, CoSine achieves up to a 23.2% decrease in latency and a 32.5%
increase in throughput compared to baseline methods.
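
The draft-then-verify loop behind speculative inference can be sketched with toy models. This is a greedy, single-drafter simplification under stated assumptions: `target_next` and `draft_next` are hypothetical callables that return the next token for a context, and none of CoSine's routing, token fusion, or scheduling is modelled.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, draft_len=4):
    """Greedy draft-then-verify decoding over lists of tokens."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) The cheap drafter proposes a short continuation.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2) The target checks every drafted position. A real system would
        #    run these checks as one batched forward pass.
        accepted = []
        for i, tok in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token was accepted: the target adds one more token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:goal]

def toy_target(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10

def toy_draft(ctx):
    """Toy drafter: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

result = speculative_decode(toy_target, toy_draft, [0], 12, draft_len=3)
```

Because every kept token is one the target itself would have produced, the output matches plain greedy decoding with the target alone. The drafter only changes how many target calls can be grouped together.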
[LINK]
http://arxiv.org/abs/2503.10325v2
[DATE]
2025-05-15 20:02:56+08:00
[CATEGORIES]
cs.LG
A multi-head deep fusion model for recognition of cattle foraging events using sound and movement signals
[AUTHORS]
Mariano Ferrero, José Omar Chelotti, Luciano Sebastián Martinez-Rau, Leandro Vignolo, Martín Pires, Julio Ricardo Galli, Leonardo Luis Giovanini, Hugo Leonardo Rufiner
[ABSTRACT]
Monitoring feeding behaviour is a relevant task for efficient herd management
and the effective use of available resources in grazing cattle. The ability to
automatically recognise animals’ feeding activities through the identification
of specific jaw movements allows for the improvement of diet formulation, as
well as early detection of metabolic problems and symptoms of animal
discomfort, among other benefits. The use of sensors to obtain signals for such
monitoring has become popular in the last two decades. The most frequently
employed sensors include accelerometers, microphones, and cameras, each with
its own set of advantages and drawbacks. An unexplored aspect is the
simultaneous use of multiple sensors with the aim of combining signals in order
to enhance the precision of the estimations. In this direction, this work
introduces a deep neural network based on the fusion of acoustic and inertial
signals, composed of convolutional, recurrent, and dense layers. The main
advantage of this model is the combination of signals through the automatic
extraction of features independently from each of them. The model has emerged
from an exploration and comparison of different neural network architectures
proposed in this work, which carry out information fusion at different levels.
Feature-level fusion outperformed data-level and decision-level fusion by at
least 0.14 in F1-score. Moreover, a comparison with
state-of-the-art machine learning methods is presented, including traditional
and deep learning approaches. The proposed model yielded an F1-score value of
0.802, representing a 14% increase compared to previous methods. Finally,
results from an ablation study and post-training quantization evaluation are
also reported.
[COMMENTS]
Preprint submitted to Engineering Applications of Artificial
Intelligence
[LINK]
http://arxiv.org/abs/2505.10198v1
[DATE]
2025-05-15 19:55:16+08:00
[CATEGORIES]
cs.LG
LanTu: Dynamics-Enhanced Deep Learning for Eddy-Resolving Ocean Forecasting
[AUTHORS]
Qingyu Zheng, Qi Shao, Guijun Han, Wei Li, Hong Li, Xuan Wang
[ABSTRACT]
Mesoscale eddies dominate the spatiotemporal multiscale variability of the
ocean, and their impact on the energy cascade of the global ocean cannot be
ignored. Eddy-resolving ocean forecasting is providing more reliable protection
for fisheries and navigational safety, but also presents significant scientific
challenges and high computational costs for traditional numerical models.
Artificial intelligence (AI)-based weather and ocean forecasting systems are
becoming powerful tools that balance forecast performance with computational
efficiency. However, the complex multiscale features in the ocean dynamical
system make AI models still face many challenges in mesoscale eddy forecasting
(especially regional modelling). Here, we develop LanTu, a regional
eddy-resolving ocean forecasting system based on dynamics-enhanced deep
learning. We incorporate cross-scale interactions into LanTu and construct
a multiscale physical constraint for optimising LanTu guided by knowledge of eddy
dynamics in order to improve the forecasting skill of LanTu for mesoscale
evolution. The results show that LanTu outperforms the existing advanced
operational numerical ocean forecasting system (NOFS) and AI-based ocean
forecasting system (AI-OFS) in temperature, salinity, sea level anomaly and
current prediction, with a lead time of more than 10 days. Our study highlights
that dynamics-enhanced deep learning (LanTu) can be a powerful paradigm for
eddy-resolving ocean forecasting.
[COMMENTS]
22 pages, 6 figures
[LINK]
http://arxiv.org/abs/2505.10191v1
[DATE]
2025-05-15 19:47:54+08:00
[CATEGORIES]
cs.LG
A systematic review of challenges and proposed solutions in modeling multimodal data
[AUTHORS]
Maryam Farhadizadeh, Maria Weymann, Michael Blaß, Johann Kraus, Christopher Gundler, Sebastian Walter, Noah Hempen, Harald Binder, Nadine Binder
[ABSTRACT]
Multimodal data modeling has emerged as a powerful approach in clinical
research, enabling the integration of diverse data types such as imaging,
genomics, wearable sensors, and electronic health records. Despite its
potential to improve diagnostic accuracy and support personalized care,
modeling such heterogeneous data presents significant technical challenges.
This systematic review synthesizes findings from 69 studies to identify common
obstacles, including missing modalities, limited sample sizes, dimensionality
imbalance, interpretability issues, and finding the optimal fusion techniques.
We highlight recent methodological advances, such as transfer learning,
generative models, attention mechanisms, and neural architecture search that
offer promising solutions. By mapping current trends and innovations, this
review provides a comprehensive overview of the field and offers practical
insights to guide future research and development in multimodal modeling for
medical applications.
[LINK]
http://arxiv.org/abs/2505.06945v2
[DATE]
2025-05-15 19:38:48+08:00
[CATEGORIES]
cs.LG
Learning Progress Driven Multi-Agent Curriculum
[AUTHORS]
Wenshuai Zhao, Zhiyuan Li, Joni Pajarinen
[ABSTRACT]
The number of agents can be an effective curriculum variable for controlling
the difficulty of multi-agent reinforcement learning (MARL) tasks. Existing
work typically uses manually defined curricula such as linear schemes. We
identify two potential flaws while applying existing reward-based automatic
curriculum learning methods in MARL: (1) The expected episode return used to
measure task difficulty has high variance; (2) Credit assignment difficulty can
be exacerbated in tasks where increasing the number of agents yields higher
returns, which is common in many MARL tasks. To address these issues, we propose
to control the curriculum by using a TD-error based learning progress measure
and by letting the curriculum proceed from an initial context distribution to
the final task specific one. Since our approach maintains a distribution over
the number of agents and measures learning progress rather than absolute
performance, which often increases with the number of agents, we alleviate
problem (2). Moreover, the learning progress measure naturally alleviates
problem (1) by aggregating returns. In three challenging sparse-reward MARL
benchmarks, our approach outperforms state-of-the-art baselines.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2205.10016v3
[DATE]
2025-05-15 19:37:19+08:00
[CATEGORIES]
cs.LG
Does Scaling Law Apply in Time Series Forecasting?
[AUTHORS]
Zeyan Li, Libing Chen, Yin Tang
[ABSTRACT]
Rapid expansion of model size has emerged as a key challenge in time series
forecasting. From early Transformer with tens of megabytes to recent
architectures like TimesNet with thousands of megabytes, performance gains have
often come at the cost of exponentially increasing parameter counts. But is
this scaling truly necessary? To question the applicability of the scaling law
in time series forecasting, we propose Alinear, an ultra-lightweight
forecasting model that achieves competitive performance using only k-level
parameters. We introduce a horizon-aware adaptive decomposition mechanism that
dynamically rebalances component emphasis across different forecast lengths,
alongside a progressive frequency attenuation strategy that achieves stable
prediction in various forecasting horizons without incurring the computational
overhead of attention mechanisms. Extensive experiments on seven benchmark
datasets demonstrate that Alinear consistently outperforms large-scale models
while using less than 1% of their parameters, maintaining strong accuracy
across both short and ultra-long forecasting horizons. Moreover, to more fairly
evaluate model efficiency, we propose a new parameter-aware evaluation metric
that highlights the superiority of Alinear under constrained model budgets. Our
analysis reveals that the relative importance of trend and seasonal components
varies depending on data characteristics rather than following a fixed pattern,
validating the necessity of our adaptive design. This work challenges the
prevailing belief that larger models are inherently better and suggests a
paradigm shift toward more efficient time series modeling.
[LINK]
http://arxiv.org/abs/2505.10172v1
[DATE]
2025-05-15 19:04:39+08:00
[CATEGORIES]
cs.LG
Modeling Saliency Dataset Bias
[AUTHORS]
Matthias Kümmerer, Harneet Khanuja, Matthias Bethge
[ABSTRACT]
Recent advances in image-based saliency prediction are approaching gold
standard performance levels on existing benchmarks. Despite this success, we
show that predicting fixations across multiple saliency datasets remains
challenging due to dataset bias. We find a significant performance drop (around
40%) when models trained on one dataset are applied to another. Surprisingly,
increasing dataset diversity does not resolve this inter-dataset gap, with
close to 60% attributed to dataset-specific biases. To address this remaining
generalization gap, we propose a novel architecture extending a mostly
dataset-agnostic encoder-decoder structure with fewer than 20 dataset-specific
parameters that govern interpretable mechanisms such as multi-scale structure,
center bias, and fixation spread. Adapting only these parameters to new data
accounts for more than 75% of the generalization gap, with a large fraction of
the improvement achieved with as few as 50 samples. Our model sets a new
state-of-the-art on all three datasets of the MIT/Tuebingen Saliency Benchmark
(MIT300, CAT2000, and COCO-Freeview), even when purely generalizing from
unrelated datasets, but with a substantial boost when adapting to the
respective training datasets. The model also provides valuable insights into
spatial saliency properties, revealing complex multi-scale effects that combine
both absolute and relative sizes.
[LINK]
http://arxiv.org/abs/2505.10169v1
[DATE]
2025-05-15 18:55:47+08:00
[CATEGORIES]
cs.LG
QuXAI: Explainers for Hybrid Quantum Machine Learning Models
[AUTHORS]
Saikat Barua, Mostafizur Rahman, Shehenaz Khaled, Md Jafor Sadek, Rafiul Islam, Shahnewaz Siddique
[ABSTRACT]
The emergence of hybrid quantum-classical machine learning (HQML) models
opens new horizons of computational intelligence but their fundamental
complexity frequently leads to black box behavior that undermines transparency
and reliability in their application. Although XAI for quantum systems is still in
its infancy, a major research gap is evident in robust global and local
explainability approaches that are designed for HQML architectures that employ
quantized feature encoding followed by classical learning. The gap is the focus
of this work, which introduces QuXAI, a framework based upon Q-MEDLEY, an
explainer for explaining feature importance in these hybrid systems. Our model
entails the creation of HQML models incorporating quantum feature maps, the use
of Q-MEDLEY, which combines feature based inferences, preserving the quantum
transformation stage and visualizing the resulting attributions. Our results
show that Q-MEDLEY delineates influential classical aspects in HQML models,
separates out their noise, and competes well against established XAI
techniques in classical validation settings. Ablation studies further
expose the virtues of the composite structure used in Q-MEDLEY.
The implications of this work are critically important, as it provides a route
to improve the interpretability and reliability of HQML models, thus promoting
greater confidence and being able to engage in safer and more responsible use
of quantum-enhanced AI technology.
[COMMENTS]
16 pages, 6 figures, 7 equations
[LINK]
http://arxiv.org/abs/2505.10167v1
[DATE]
2025-05-15 18:51:34+08:00
[CATEGORIES]
cs.LG
Multi-Objective Hyperparameter Selection via Hypothesis Testing on Reliability Graphs
[AUTHORS]
Amirmohammad Farzaneh, Osvaldo Simeone
[ABSTRACT]
The selection of hyperparameters, such as prompt templates in large language
models (LLMs), must often strike a balance between reliability and cost. In
many cases, structural relationships between the expected reliability levels of
the hyperparameters can be inferred from prior information and held-out data –
e.g., longer prompt templates may be more detailed and thus more reliable.
However, existing hyperparameter selection methods either do not provide formal
reliability guarantees or are unable to incorporate structured knowledge in the
hyperparameter space. This paper introduces reliability graph-based Pareto
testing (RG-PT), a novel multi-objective hyperparameter selection framework
that maintains formal reliability guarantees in terms of false discovery rate
(FDR), while accounting for known relationships among hyperparameters via a
directed acyclic graph. Edges in the graph reflect expected reliability and
cost trade-offs among hyperparameters, which are inferred via the Bradley-Terry
(BT) ranking model from prior information and held-out data. Experimental
evaluations demonstrate that RG-PT significantly outperforms existing methods
such as learn-then-test (LTT) and Pareto testing (PT) through a more efficient
exploration of the hyperparameter space.
[LINK]
http://arxiv.org/abs/2501.13018v2
[DATE]
2025-05-15 18:49:09+08:00
[CATEGORIES]
cs.LG
One-Stage Top-$k$ Learning-to-Defer: Score-Based Surrogates with Theoretical Guarantees
[AUTHORS]
Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi
[ABSTRACT]
We introduce the first one-stage Top-$k$ Learning-to-Defer framework, which
unifies prediction and deferral by learning a shared score-based model that
selects the $k$ most cost-effective entities (labels or experts) per input. While
existing one-stage L2D methods are limited to deferring to a single expert, our
approach jointly optimizes prediction and deferral across multiple entities
through a single end-to-end objective. We define a cost-sensitive loss and
derive a novel convex surrogate that is independent of the cardinality
parameter $k$, enabling generalization across Top-$k$ regimes without
retraining. Our formulation recovers the Top-1 deferral policy of prior
score-based methods as a special case, and we prove that our surrogate is both
Bayes-consistent and $\mathcal{H}$-consistent under mild assumptions. We
further introduce an adaptive variant, Top-$k(x)$, which dynamically selects
the number of consulted entities per input to balance predictive accuracy and
consultation cost. Experiments on CIFAR-10 and SVHN confirm that our one-stage
Top-$k$ method strictly outperforms Top-1 deferral, while Top-$k(x)$ achieves
superior accuracy-cost trade-offs by tailoring allocations to input complexity.
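
The selection step described above can be illustrated with a minimal sketch.
It assumes a single score vector over all entities (labels followed by
experts) has already been produced by some model; the function name
select_top_k and the entity names are hypothetical and not taken from the
paper, and the learned surrogate loss is not shown.

```python
import numpy as np

def select_top_k(scores, k):
    """Return the indices of the k highest-scoring entities, best first."""
    scores = np.asarray(scores, dtype=float)
    if not 1 <= k <= scores.size:
        raise ValueError("k must be between 1 and the number of entities")
    # A stable sort on the negated scores gives descending order,
    # with ties broken in favour of the lower index.
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()

# Hypothetical setup: three class labels followed by two experts.
entities = ["label_0", "label_1", "label_2", "expert_A", "expert_B"]
scores = [0.1, 2.0, -0.5, 1.5, 0.3]

top2 = [entities[i] for i in select_top_k(scores, 2)]
top1 = [entities[i] for i in select_top_k(scores, 1)]
```

With k = 1 the selection reduces to a plain argmax over the shared scores,
which mirrors how a Top-1 policy appears as a special case.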
[LINK]
http://arxiv.org/abs/2505.10160v1
[DATE]
2025-05-15 18:41:16+08:00
[CATEGORIES]
cs.LG
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
[AUTHORS]
Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li
[ABSTRACT]
A generalist robot should perform effectively across various environments.
However, most existing approaches heavily rely on scaling action-annotated data
to enhance their capabilities. Consequently, they are often limited to a single
physical specification and struggle to learn transferable knowledge across
different embodiments and environments. To confront these limitations, we
propose UniVLA, a new framework for learning cross-embodiment
vision-language-action (VLA) policies. Our key innovation is to derive
task-centric action representations from videos with a latent action model.
This enables us to exploit extensive data across a wide spectrum of embodiments
and perspectives. To mitigate the effect of task-irrelevant dynamics, we
incorporate language instructions and establish a latent action model within
the DINO feature space. Learned from internet-scale videos, the generalist
policy can be deployed to various robots through efficient latent action
decoding. We obtain state-of-the-art results across multiple manipulation and
navigation benchmarks, as well as real-robot deployments. UniVLA achieves
superior performance over OpenVLA with less than 1/20 of pretraining compute
and 1/10 of downstream data. Continuous performance improvements are observed
as heterogeneous data, even including human videos, are incorporated into the
training pipeline. The results underscore UniVLA’s potential to facilitate
scalable and efficient robot policy learning.
[COMMENTS]
Accepted to RSS 2025. Code is available at
https://github.com/OpenDriveLab/UniVLA
[LINK]
http://arxiv.org/abs/2505.06111v2
[DATE]
2025-05-15 18:31:45+08:00
[CATEGORIES]
cs.LG
Why Ask One When You Can Ask $k$? Two-Stage Learning-to-Defer to the Top-$k$ Experts
[AUTHORS]
Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi
[ABSTRACT]
Although existing Learning-to-Defer (L2D) frameworks support multiple
experts, they allocate each query to a single expert, limiting their ability to
leverage collective expertise in complex decision-making scenarios. To address
this, we introduce the first framework for Top-$k$ Learning-to-Defer, enabling
systems to defer each query to the $k$ most cost-effective experts. Our
formulation strictly generalizes classical two-stage L2D by supporting
multi-expert deferral, a capability absent in prior work. We further propose
Top-$k(x)$ Learning-to-Defer, an adaptive extension that learns the optimal
number of experts per query based on input complexity, expert quality, and
consultation cost. We introduce a novel surrogate loss that is
Bayes-consistent, $(\mathcal{R}, \mathcal{G})$-consistent, and independent of
the cardinality parameter $k$, enabling efficient reuse across different values
of $k$. We show that classical model cascades arise as a special case of our
method, situating our framework as a strict generalization of both selective
deferral and cascaded inference. Experiments on classification and regression
demonstrate that Top-$k$ and Top-$k(x)$ yield improved accuracy–cost
trade-offs, establishing a new direction for multi-expert deferral in
Learning-to-Defer.
[LINK]
http://arxiv.org/abs/2504.12988v3
[DATE]
2025-05-15 18:25:18+08:00
[CATEGORIES]
cs.LG
Near Optimal Best Arm Identification for Clustered Bandits
[AUTHORS]
Yash, Nikhil Karamchandani, Avishek Ghosh
[ABSTRACT]
This work investigates the problem of best arm identification for multi-agent
multi-armed bandits. We consider $N$ agents grouped into $M$ clusters, where
each cluster solves a stochastic bandit problem. The mapping between agents and
bandits is a priori unknown. Each bandit is associated with $K$ arms, and the
goal is to identify the best arm for each agent under a $\delta$-probably
correct ($\delta$-PC) framework, while minimizing sample complexity and
communication overhead.
We propose two novel algorithms: Clustering then Best Arm Identification
(Cl-BAI) and Best Arm Identification then Clustering (BAI-Cl). Cl-BAI uses a
two-phase approach that first clusters agents based on the bandit problems they
are learning, followed by identifying the best arm for each cluster. BAI-Cl
reverses the sequence by identifying the best arms first and then clustering
agents accordingly. Both algorithms leverage the successive elimination
framework to ensure computational efficiency and high accuracy.
We establish $\delta$-PC guarantees for both methods, derive bounds on their
sample complexity, and provide a lower bound for this problem class. Moreover,
when $M$ is small (a constant), we show that the sample complexity of a variant
of BAI-Cl is minimax optimal in an order-wise sense. Experiments on synthetic
and real-world datasets (MovieLens, Yelp) demonstrate the superior performance
of the proposed algorithms in terms of sample and communication efficiency,
particularly in settings where $M \ll N$.
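
For readers unfamiliar with the successive elimination framework mentioned
above, here is a minimal single-agent sketch of the textbook procedure. It is
not the Cl-BAI or BAI-Cl algorithm: clustering and communication are omitted,
rewards are assumed to lie in [0, 1], and the confidence radius is one common
Hoeffding-based choice picked for this illustration.

```python
import math
import random

def successive_elimination(pull, n_arms, delta=0.05, max_rounds=100_000):
    """Return the index of the estimated best arm.

    pull(i) must return a reward in [0, 1] for arm i.
    """
    active = list(range(n_arms))
    sums = [0.0] * n_arms
    t = 0
    while len(active) > 1 and t < max_rounds:
        t += 1
        # Every surviving arm is pulled once per round, so each has t samples.
        for i in active:
            sums[i] += pull(i)
        # Anytime confidence radius: Hoeffding's inequality with a union
        # bound over arms and rounds.
        radius = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
        means = {i: sums[i] / t for i in active}
        best = max(means.values())
        # Keep only arms whose confidence interval still overlaps the leader's.
        active = [i for i in active if best - means[i] < 2 * radius]
    return max(active, key=lambda i: sums[i])

if __name__ == "__main__":
    rng = random.Random(0)
    true_means = [0.2, 0.5, 0.9]  # hypothetical Bernoulli arms
    arm = successive_elimination(
        lambda i: 1.0 if rng.random() < true_means[i] else 0.0, len(true_means)
    )
    print("estimated best arm:", arm)
```

The empirical leader is never eliminated, so the active set stays non-empty
and the loop stops once a single arm survives or the round budget runs out.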
[COMMENTS]
To be published in ICML 2025
[LINK]
http://arxiv.org/abs/2505.10147v1
[DATE]
2025-05-15 18:20:26+08:00
[CATEGORIES]
cs.LG
Path Gradients after Flow Matching
[AUTHORS]
Lorenz Vaitl, Leon Klein
[ABSTRACT]
Boltzmann Generators have emerged as a promising machine learning tool for
generating samples from equilibrium distributions of molecular systems using
Normalizing Flows and importance weighting. Recently, Flow Matching has helped
speed up Continuous Normalizing Flows (CNFs), scale them to more complex
molecular systems, and minimize the length of the flow integration
trajectories. We investigate the benefits of using path gradients to fine-tune
CNFs initially trained by Flow Matching, in the setting where a target energy
is known. Our experiments show that this hybrid approach yields up to a
threefold increase in sampling efficiency for molecular systems, all while
using the same model, a similar computational budget and without the need for
additional sampling. Furthermore, by measuring the length of the flow
trajectories during fine-tuning, we show that path gradients largely preserve
the learned structure of the flow.
[LINK]
http://arxiv.org/abs/2505.10139v1
[DATE]
2025-05-15 18:13:45+08:00
[CATEGORIES]
cs.LG
Large Wireless Localization Model (LWLM): A Foundation Model for Positioning in 6G Networks
[AUTHORS]
Guangjin Pan, Kaixuan Huang, Hui Chen, Shunqing Zhang, Christian Häger, Henk Wymeersch
[ABSTRACT]
Accurate and robust localization is a critical enabler for emerging 5G and 6G
applications, including autonomous driving, extended reality (XR), and smart
manufacturing. While data-driven approaches have shown promise, most existing
models require large amounts of labeled data and struggle to generalize across
deployment scenarios and wireless configurations. To address these limitations,
we propose a foundation-model-based solution tailored for wireless
localization. We first analyze how different self-supervised learning (SSL)
tasks acquire general-purpose and task-specific semantic features based on
information bottleneck (IB) theory. Building on this foundation, we design a
pretraining methodology for the proposed Large Wireless Localization Model
(LWLM). Specifically, we propose an SSL framework that jointly optimizes three
complementary objectives: (i) spatial-frequency masked channel modeling
(SF-MCM), (ii) domain-transformation invariance (DTI), and (iii)
position-invariant contrastive learning (PICL). These objectives jointly
capture the underlying semantics of the wireless channel from multiple
perspectives. We further design lightweight decoders for key downstream tasks,
including time-of-arrival (ToA) estimation, angle-of-arrival (AoA) estimation,
single base station (BS) localization, and multiple BS localization.
Comprehensive experimental results confirm that LWLM consistently surpasses
both model-based and supervised learning baselines across all localization
tasks. In particular, LWLM achieves 26.0%–87.5% improvement over transformer
models without pretraining, and exhibits strong generalization under
label-limited fine-tuning and unseen BS configurations, confirming its
potential as a foundation model for wireless localization.
[COMMENTS]
13 pages, 16 figures. This work has been submitted to the IEEE for
possible publication
[LINK]
http://arxiv.org/abs/2505.10134v1
[DATE]
2025-05-15 18:04:44+08:00
[CATEGORIES]
cs.LG
Robust Federated Learning on Edge Devices with Domain Heterogeneity
[AUTHORS]
Huy Q. Le, Latif U. Khan, Choong Seon Hong
[ABSTRACT]
Federated Learning (FL) allows collaborative training while ensuring data
privacy across distributed edge devices, making it a popular solution for
privacy-sensitive applications. However, FL faces significant challenges due to
statistical heterogeneity, particularly domain heterogeneity, which impedes the
global model’s convergence. In this study, we introduce a new framework to
address this challenge by improving the generalization ability of the FL global
model under domain heterogeneity, using prototype augmentation. Specifically,
we introduce FedAPC (Federated Augmented Prototype Contrastive Learning), a
prototype-based FL framework designed to enhance feature diversity and model
robustness. FedAPC leverages prototypes derived from the mean features of
augmented data to capture richer representations. By aligning local features
with global prototypes, we enable the model to learn meaningful semantic
features while reducing overfitting to any specific domain. Experimental
results on the Office-10 and Digits datasets illustrate that our framework
outperforms SOTA baselines, demonstrating superior performance.
[COMMENTS]
IWCMC 2025
[LINK]
http://arxiv.org/abs/2505.10128v1
[DATE]
2025-05-15 17:53:14+08:00
[CATEGORIES]
cs.LG
All You Need Is Synthetic Task Augmentation
[AUTHORS]
Guillaume Godin
[ABSTRACT]
Injecting rule-based models like Random Forests into differentiable neural
network frameworks remains an open challenge in machine learning. Recent
advancements have demonstrated that pretrained models can generate efficient
molecular embeddings. However, these approaches often require extensive
pretraining and additional techniques, such as incorporating posterior
probabilities, to boost performance. In our study, we propose a novel strategy
that jointly trains a single Graph Transformer neural network on both sparse
multitask molecular property experimental targets and synthetic targets derived
from XGBoost models trained on Osmordred molecular descriptors. These synthetic
tasks serve as independent auxiliary tasks. Our results show consistent and
significant performance improvement across all 19 molecular property prediction
tasks. For 16 out of 19 targets, the multitask Graph Transformer outperforms
the XGBoost single-task learner. This demonstrates that synthetic task
augmentation is an effective method for enhancing neural model performance in
multitask molecular property prediction without the need for feature injection
or pretraining.
[COMMENTS]
14 pages, 3 Figures, 6 tables
[LINK]
http://arxiv.org/abs/2505.10120v1
[DATE]
2025-05-15 17:46:27+08:00
[CATEGORIES]
cs.LG
Cape: Context-Aware Prompt Perturbation Mechanism with Differential Privacy
[AUTHORS]
Haoqi Wu, Wei Dai, Li Wang, Qiang Yan
[ABSTRACT]
Large Language Models (LLMs) have gained significant popularity due to their
remarkable capabilities in text understanding and generation. However, despite
their widespread deployment in inference services such as ChatGPT, concerns
about the potential leakage of sensitive user data have arisen. Existing
solutions primarily rely on privacy-enhancing technologies to mitigate such
risks, facing the trade-off among efficiency, privacy, and utility. To narrow
this gap, we propose Cape, a context-aware prompt perturbation mechanism based
on differential privacy, to enable efficient inference with an improved
privacy-utility trade-off. Concretely, we introduce a hybrid utility function
that better captures the token similarity. Additionally, we propose a
bucketized sampling mechanism to handle the large sampling space, which might lead
to long-tail phenomena. Extensive experiments across multiple datasets, along
with ablation studies, demonstrate that Cape achieves a better privacy-utility
trade-off compared to prior state-of-the-art works.
[COMMENTS]
to be published in ICML 2025
[LINK]
http://arxiv.org/abs/2505.05922v2
[DATE]
2025-05-15 17:31:11+08:00
[CATEGORIES]
cs.LG
Mirror Descent Under Generalized Smoothness
[AUTHORS]
Dingzhi Yu, Wei Jiang, Yuanyu Wan, Lijun Zhang
[ABSTRACT]
Smoothness is crucial for attaining fast rates in first-order optimization.
However, many optimization problems in modern machine learning involve
non-smooth objectives. Recent studies relax the smoothness assumption by
allowing the Lipschitz constant of the gradient to grow with respect to the
gradient norm, which accommodates a broad range of objectives in practice.
Despite this progress, existing generalizations of smoothness are restricted to
Euclidean geometry with $\ell_2$-norm and only have theoretical guarantees for
optimization in the Euclidean space. In this paper, we address this limitation
by introducing a new $\ell*$-smoothness concept that measures the norm of
Hessians in terms of a general norm and its dual, and establish convergence for
mirror-descent-type algorithms, matching the rates under the classic
smoothness. Notably, we propose a generalized self-bounding property that
facilitates bounding the gradients via controlling suboptimality gaps, serving
as a principal component for convergence analysis. Beyond deterministic
optimization, we establish an anytime convergence for stochastic mirror descent
based on a new bounded noise condition that encompasses the widely adopted
bounded or affine noise assumptions.
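
As background for the mirror-descent-type algorithms discussed above, the
following is a minimal sketch of classical mirror descent on the probability
simplex with the negative-entropy mirror map (the exponentiated-gradient
update). It uses a fixed step size and does not implement the paper's
generalized-smoothness step rule; the function name and the toy linear
objective are assumptions made for illustration.

```python
import numpy as np

def entropic_mirror_descent(grad, x0, step, n_steps):
    """Minimise a function over the probability simplex.

    grad(x) returns the gradient at x; x0 must have positive entries.
    """
    x = np.asarray(x0, dtype=float)
    x = x / x.sum()
    for _ in range(n_steps):
        g = np.asarray(grad(x), dtype=float)
        # Multiplicative update; shifting by g.min() keeps the exponent
        # non-positive for stability and cancels in the normalisation.
        w = x * np.exp(-step * (g - g.min()))
        x = w / w.sum()
    return x

# Toy problem: minimise the linear cost <c, x> over the simplex.
c = np.array([3.0, 1.0, 2.0])
x_final = entropic_mirror_descent(lambda x: c, np.ones(3), step=0.5, n_steps=200)
```

For a linear cost the iterates concentrate on the coordinate with the
smallest cost, which makes the toy problem easy to check by hand.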
[LINK]
http://arxiv.org/abs/2502.00753v2
[DATE]
2025-05-15 17:09:06+08:00
[CATEGORIES]
cs.LG
A Scalable Gradient-Based Optimization Framework for Sparse Minimum-Variance Portfolio Selection
[AUTHORS]
Sarat Moka, Matias Quiroz, Vali Asimit, Samuel Muller
[ABSTRACT]
Portfolio optimization involves selecting asset weights to minimize a
risk-reward objective, such as the portfolio variance in the classical
minimum-variance framework. Sparse portfolio selection extends this by imposing
a cardinality constraint: only $k$ assets from a universe of $p$ may be
included. The standard approach models this problem as a mixed-integer
quadratic program and relies on commercial solvers to find the optimal
solution. However, the computational costs of such methods increase
exponentially with $k$ and $p$, making them too slow for problems of even
moderate size. We propose a fast and scalable gradient-based approach that
transforms the combinatorial sparse selection problem into a constrained
continuous optimization task via Boolean relaxation, while preserving
equivalence with the original problem on the set of binary points. Our
algorithm employs a tunable parameter that transmutes the auxiliary objective
from a convex to a concave function. This allows a stable convex starting
point, followed by a controlled path toward a sparse binary solution as the
tuning parameter increases and the objective moves toward concavity. In
practice, our method matches commercial solvers in asset selection for most
instances and, in rare instances, the solution differs by a few assets whilst
showing a negligible error in portfolio variance.
[LINK]
http://arxiv.org/abs/2505.10099v1
[DATE]
2025-05-15 17:01:07+08:00
[CATEGORIES]
cs.LG
Role of scrambling and noise in temporal information processing with quantum systems
[AUTHORS]
Weijie Xiong, Zoë Holmes, Armando Angrisani, Yudai Suzuki, Thiparat Chotibut, Supanut Thanasilp
[ABSTRACT]
Scrambling quantum systems have been demonstrated as effective substrates for
temporal information processing. While their role in providing rich feature
maps has been widely studied, a theoretical understanding of their performance
in temporal tasks is still lacking. Here we consider a general quantum
reservoir processing framework that captures a broad range of physical
computing models with quantum systems. We examine the scalability and memory
retention of the model with scrambling reservoirs modelled by high-order
unitary designs in both noiseless and noisy settings. In the former regime, we
show that measurement readouts become exponentially concentrated with
increasing reservoir size, yet strikingly do not worsen with the reservoir
iterations. Thus, while repeatedly reusing a small scrambling reservoir with
quantum data might be viable, scaling up the problem size deteriorates
generalization unless one can afford an exponential shot overhead. In contrast,
the memory of early inputs and initial states decays exponentially in both
reservoir size and reservoir iterations. In the noisy regime, we also prove
exponential memory decays with iterations for local noisy channels. Proving
these results required us to introduce new proof techniques for bounding
concentration in temporal quantum learning models.
[COMMENTS]
14+35 pages, 6+5 figures, 1 table
[LINK]
http://arxiv.org/abs/2505.10080v1
[DATE]
2025-05-15 16:35:10+08:00
[CATEGORIES]
cs.LG
Scaling Laws for Black box Adversarial Attacks
[AUTHORS]
Chuan Liu, Huanran Chen, Yichi Zhang, Yinpeng Dong, Jun Zhu
[ABSTRACT]
Adversarial examples usually exhibit good cross-model transferability,
enabling attacks on black-box models with limited information about their
architectures and parameters, which are highly threatening in commercial
black-box scenarios. Model ensembling is an effective strategy to improve the
transferability of adversarial examples by attacking multiple surrogate models.
However, since prior studies usually adopt few models in the ensemble, there
remains an open question of whether scaling the number of models can further
improve black-box attacks. Inspired by the scaling law of large foundation
models, we investigate the scaling laws of black-box adversarial attacks in
this work. Through theoretical analysis and empirical evaluations, we conclude
with clear scaling laws that using more surrogate models enhances adversarial
transferability. Comprehensive experiments verify the claims on standard image
classifiers, diverse defended models and multimodal large language models using
various adversarial attack methods. Specifically, by exploiting the scaling law, we
achieve a 90%+ transfer attack success rate even on proprietary models like GPT-4o.
Further visualization indicates that there is also a scaling law on the
interpretability and semantics of adversarial perturbations.
[LINK]
http://arxiv.org/abs/2411.16782v2
[DATE]
2025-05-15 16:18:43+08:00
[CATEGORIES]
cs.LG
Malliavin Calculus for Score-based Diffusion Models
[AUTHORS]
Ehsan Mirafzali, Utkarsh Gupta, Patrick Wyrod, Frank Proske, Daniele Venturi, Razvan Marinescu
[ABSTRACT]
We introduce a new framework based on Malliavin calculus to derive exact
analytical expressions for the score function $\nabla \log p_t(x)$, i.e., the
gradient of the log-density associated with the solution to stochastic
differential equations (SDEs). Our approach combines classical
integration-by-parts techniques with modern stochastic analysis tools, such as
Bismut’s formula and Malliavin calculus, and it works for both linear and
nonlinear SDEs. In doing so, we establish a rigorous connection between the
Malliavin derivative, its adjoint, the Malliavin divergence (Skorokhod
integral), and diffusion generative models, thereby providing a systematic
method for computing $\nabla \log p_t(x)$. In the linear case, we present a
detailed analysis showing that our formula coincides with the analytical score
function derived from the solution of the Fokker–Planck equation. For
nonlinear SDEs with state-independent diffusion coefficients, we derive a
closed-form expression for $\nabla \log p_t(x)$. We evaluate the proposed
framework across multiple generative tasks and find that its performance is
comparable to state-of-the-art methods. These results can be generalised to
broader classes of SDEs, paving the way for new score-based diffusion
generative models.
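
As a worked instance of the linear case, consider the Ornstein-Uhlenbeck
process with a deterministic starting point. This is a standard textbook
example chosen for illustration, not a formula quoted from the paper; the
symbols $\theta$, $\sigma$, $x_0$, $m_t$, and $v_t$ are introduced here. The
transition density is Gaussian, so the score is available in closed form:

```latex
\begin{align*}
dX_t &= -\theta X_t\,dt + \sigma\,dW_t, \qquad X_0 = x_0,\ \theta > 0,\ \sigma > 0,\\
X_t &\sim \mathcal{N}(m_t, v_t), \qquad m_t = x_0 e^{-\theta t}, \qquad
v_t = \frac{\sigma^2}{2\theta}\left(1 - e^{-2\theta t}\right),\\
\nabla \log p_t(x) &= -\frac{x - m_t}{v_t}, \qquad t > 0.
\end{align*}
```

Any general score formula for linear SDEs should reduce to this expression
when specialised to the Ornstein-Uhlenbeck case.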
[LINK]
http://arxiv.org/abs/2503.16917v2
[DATE]
2025-05-15 16:12:52+08:00
[CATEGORIES]
cs.LG
Single View Garment Reconstruction Using Diffusion Mapping Via Pattern Coordinates
[AUTHORS]
Ren Li, Cong Cao, Corentin Dumery, Yingxuan You, Hao Li, Pascal Fua
[ABSTRACT]
Reconstructing 3D clothed humans from images is fundamental to applications
like virtual try-on, avatar creation, and mixed reality. While recent advances
have enhanced human body recovery, accurate reconstruction of garment geometry
– especially for loose-fitting clothing – remains an open challenge. We
present a novel method for high-fidelity 3D garment reconstruction from single
images that bridges 2D and 3D representations. Our approach combines Implicit
Sewing Patterns (ISP) with a generative diffusion model to learn rich garment
shape priors in a 2D UV space. A key innovation is our mapping model that
establishes correspondences between 2D image pixels, UV pattern coordinates,
and 3D geometry, enabling joint optimization of both 3D garment meshes and the
corresponding 2D patterns by aligning learned priors with image observations.
Despite training exclusively on synthetically simulated cloth data, our method
generalizes effectively to real-world images, outperforming existing approaches
on both tight- and loose-fitting garments. The reconstructed garments maintain
physical plausibility while capturing fine geometric details, enabling
downstream applications including garment retargeting and texture manipulation.
[COMMENTS]
SIGGRAPH 2025
[LINK]
http://arxiv.org/abs/2504.08353v2
[DATE]
2025-05-15 15:51:50+08:00
[CATEGORIES]
cs.LG
Instance-Prototype Affinity Learning for Non-Exemplar Continual Graph Learning
[AUTHORS]
Lei Song, Jiaxing Li, Shihan Guan, Youyong Kong
[ABSTRACT]
Graph Neural Networks (GNNs) suffer from catastrophic forgetting, undermining their
capacity to preserve previously acquired knowledge amid the assimilation of
novel information. Rehearsal-based techniques, which revisit historical examples,
are adopted as a principal strategy to alleviate this phenomenon. However, memory
explosion and privacy infringements impose significant constraints on their
utility. Non-Exemplar methods circumvent the prior issues through Prototype
Replay (PR), yet feature drift presents new challenges. In this paper, our
empirical findings reveal that Prototype Contrastive Learning (PCL) exhibits
less pronounced drift than conventional PR. Drawing upon PCL, we propose
Instance-Prototype Affinity Learning (IPAL), a novel paradigm for Non-Exemplar
Continual Graph Learning (NECGL). Exploiting graph structural information, we
formulate Topology-Integrated Gaussian Prototypes (TIGP), guiding feature
distributions towards high-impact nodes to augment the model’s capacity for
assimilating new knowledge. Instance-Prototype Affinity Distillation (IPAD)
safeguards task memory by regularizing discontinuities in class relationships.
Moreover, we embed a Decision Boundary Perception (DBP) mechanism within PCL,
fostering greater inter-class discriminability. Evaluations on four node
classification benchmark datasets demonstrate that our method outperforms
existing state-of-the-art methods, achieving a better trade-off between
plasticity and stability.
[LINK]
http://arxiv.org/abs/2505.10040v1
[DATE]
2025-05-15 15:35:27+08:00
[CATEGORIES]
cs.LG
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
[AUTHORS]
Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang
[ABSTRACT]
Circuit discovery has gradually become one of the prominent methods for
mechanistic interpretability, and research on circuit completeness has also
garnered increasing attention. Methods of circuit discovery that do not
guarantee completeness not only result in circuits that are not fixed across
different runs but also cause key mechanisms to be omitted. The nature of
incompleteness arises from the presence of OR gates within the circuit, which
are often only partially detected in standard circuit discovery methods. To
this end, we systematically introduce three types of logic gates: AND, OR, and
ADDER gates, and decompose the circuit into combinations of these logical
gates. Through the concept of these gates, we derive the minimum requirements
necessary to achieve faithfulness and completeness. Furthermore, we propose a
framework that combines noising-based and denoising-based interventions, which
can be easily integrated into existing circuit discovery methods without
significantly increasing computational complexity. This framework is capable of
fully identifying the logic gates and distinguishing them within the circuit.
In addition to the extensive experimental validation of the framework’s ability
to restore the faithfulness, completeness, and sparsity of circuits, using this
framework, we uncover fundamental properties of the three logic gates, such as
their proportions and contributions to the output, and explore how they behave
among the functionalities of language models.
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2505.10039v1
[DATE]
2025-05-15 15:35:14+08:00
[CATEGORIES]
cs.LG
Optimal normalization in quantum-classical hybrid models for anti-cancer drug response prediction
[AUTHORS]
Takafumi Ito, Lysenko Artem, Tatsuhiko Tsunoda
[ABSTRACT]
Quantum-classical Hybrid Machine Learning (QHML) models are recognized for
their robust performance and high generalization ability even for relatively
small datasets. These qualities offer unique advantages for anti-cancer drug
response prediction, where the number of available samples is typically small.
However, such hybrid models appear to be very sensitive to the data encoding
used at the interface of a neural network and a quantum circuit, with
suboptimal choices leading to stability issues. To address this problem, we
propose a novel strategy that uses a normalization function based on a
moderated gradient version of the $\tanh$. This method transforms the outputs
of the neural networks without concentrating them at the extreme value ranges.
Our idea was evaluated on a dataset of gene expression and drug response
measurements for various cancer cell lines, where we compared the prediction
performance of a classical deep learning model and several QHML models. These
results confirmed that QHML performed better than the classical models when
data was optimally normalized. This study opens up new possibilities for
biomedical data analysis using quantum computers.
[COMMENTS]
10 pages, 3 figures
[LINK]
http://arxiv.org/abs/2505.10037v1
[DATE]
2025-05-15 15:33:41+08:00
[CATEGORIES]
cs.LG
DeepSeqCoco: A Robust Mobile Friendly Deep Learning Model for Detection of Diseases in Cocos nucifera
[AUTHORS]
Miit Daga, Dhriti Parikh, Swarna Priya Ramu
[ABSTRACT]
Coconut tree diseases are a serious risk to agricultural yield, particularly
in developing countries where conventional farming practices restrict early
diagnosis and intervention. Current disease identification methods are manual,
labor-intensive, and non-scalable. In response to these limitations, we propose
DeepSeqCoco, a deep learning-based model for accurate and automatic
disease identification from coconut tree images. The model was tested under
various optimizer settings, such as SGD, Adam, and hybrid configurations, to
identify the optimal balance between accuracy, minimization of loss, and
computational cost. Experimental results indicate that DeepSeqCoco achieves
up to 99.5% accuracy (up to 5% higher than existing models), with the hybrid
SGD-Adam configuration showing the lowest validation loss of 2.81%. It also
reduces training time by up to 18% and prediction time for input images by up
to 85%. The results highlight the promise of the
model to improve precision agriculture through an AI-based, scalable, and
efficient disease monitoring system.
[COMMENTS]
This paper is accepted for publication in IEEE Access journal and is
currently pending revisions before publication
[LINK]
http://arxiv.org/abs/2505.10030v1
[DATE]
2025-05-15 15:25:43+08:00
[CATEGORIES]
cs.LG
Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning
[AUTHORS]
Zijun Chen, Shengbo Wang, Nian Si
[ABSTRACT]
Motivated by practical applications where stable long-term performance is
critical, such as robotics, operations research, and healthcare, we study the
problem of distributionally robust (DR) average-reward reinforcement learning.
We propose two algorithms that achieve near-optimal sample complexity. The
first reduces the problem to a DR discounted Markov decision process (MDP),
while the second, Anchored DR Average-Reward MDP, introduces an anchoring state
to stabilize the controlled transition kernels within the uncertainty set.
Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms
attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}|
t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as
well as the robust average reward under KL and $f_k$-divergence-based
uncertainty sets, provided the uncertainty radius is sufficiently small. Here,
$\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote
the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing
time of the nominal MDP. This represents the first finite-sample convergence
guarantee for DR average-reward reinforcement learning. We further validate the
convergence rates of our algorithms through numerical experiments.
[LINK]
http://arxiv.org/abs/2505.10007v1
[DATE]
2025-05-15 14:42:25+08:00
[CATEGORIES]
cs.LG
TSINR: Capturing Temporal Continuity via Implicit Neural Representations for Time Series Anomaly Detection
[AUTHORS]
Mengxuan Li, Ke Liu, Hongyang Chen, Jiajun Bu, Hongwei Wang, Haishuai Wang
[ABSTRACT]
Time series anomaly detection aims to identify unusual patterns in data or
deviations from systems’ expected behavior. Reconstruction-based methods, which
learn point-wise representations via unsupervised learning, are the mainstream
approach to this task. However, unlabeled anomaly points in the training data
may cause these reconstruction-based methods to learn and reconstruct anomalous
data, making it difficult to capture normal patterns. In this paper,
we propose a time series anomaly detection method based on implicit neural
representation (INR) reconstruction, named TSINR, to address this challenge.
Owing to spectral bias, TSINR prioritizes low-frequency signals and performs
worse on high-frequency abnormal data.
Specifically, we adopt INR to parameterize time series data as a continuous
function and employ a transformer-based architecture to predict the INR of
given data. As a result, the proposed TSINR method captures temporal
continuity and is thus more sensitive to discontinuous anomalous data. In
addition, we design a novel form of INR continuous
function to learn inter- and intra-channel information, and leverage a
pre-trained large language model to amplify the intense fluctuations in
anomalies. Extensive experiments demonstrate that TSINR achieves superior
overall performance on both univariate and multivariate time series anomaly
detection benchmarks compared to other state-of-the-art reconstruction-based
methods. Our codes are available.
[COMMENTS]
Accepted by SIGKDD 2025
[LINK]
http://arxiv.org/abs/2411.11641v3
[DATE]
2025-05-15 14:30:38+08:00
[CATEGORIES]
cs.LG
Towards More Efficient, Robust, Instance-adaptive, and Generalizable Sequential Decision making
[AUTHORS]
Zhiyong Wang
[ABSTRACT]
The primary goal of my Ph.D. study is to develop provably efficient and
practical algorithms for data-driven sequential decision-making under
uncertainty. My work focuses on reinforcement learning (RL), multi-armed
bandits, and their applications, including recommendation systems, computer
networks, video analytics, and large language models (LLMs). Sequential
decision-making methods, such as bandits and RL, have demonstrated remarkable
success, ranging from outperforming human players in complex games like Atari
and Go to advancing robotics, recommendation systems, and fine-tuning LLMs.
Despite these successes, many established algorithms rely on idealized models
that can fail under model misspecifications or adversarial perturbations,
particularly in settings where accurate prior knowledge of the underlying model
class is unavailable or where malicious users operate within dynamic systems.
These challenges are pervasive in real-world applications, where robust and
adaptive solutions are critical. Furthermore, while worst-case guarantees
provide theoretical reliability, they often fail to capture instance-dependent
performance, which can lead to more efficient and practical solutions. Another
key challenge lies in generalizing to new, unseen environments, a crucial
requirement for deploying these methods in dynamic and unpredictable settings.
To address these limitations, my research aims to develop more efficient,
robust, instance-adaptive, and generalizable sequential decision-making
algorithms for both reinforcement learning and bandits. Towards this end, I
focus on developing more efficient, robust, instance-adaptive, and
generalizable algorithms for both general reinforcement learning (RL) and bandits.
[COMMENTS]
Ph.D. Thesis
[LINK]
http://arxiv.org/abs/2504.09192v4
[DATE]
2025-05-15 14:21:11+08:00
[CATEGORIES]
cs.LG
Sybil-based Virtual Data Poisoning Attacks in Federated Learning
[AUTHORS]
Changxun Zhu, Qilong Wu, Lingjuan Lyu, Shibei Xue
[ABSTRACT]
Federated learning is vulnerable to poisoning attacks by malicious
adversaries. Existing methods often involve high costs to achieve effective
attacks. To address this challenge, we propose a sybil-based virtual data
poisoning attack, where a malicious client generates sybil nodes to amplify the
poisoning model’s impact. To reduce neural network computational complexity, we
develop a virtual data generation method based on gradient matching. We also
design three schemes for target model acquisition, applicable to online local,
online global, and offline scenarios. In simulations, our method outperforms
other attack algorithms because it can obtain a global target model under
non-independent uniformly distributed data.
[COMMENTS]
7 pages, 6 figures, accepted by IEEE Codit 2025
[LINK]
http://arxiv.org/abs/2505.09983v1
[DATE]
2025-05-15 13:46:59+08:00
[CATEGORIES]
cs.LG
Graph Neural Network-based Spectral Filtering Mechanism for Imbalance Classification in Network Digital Twin
[AUTHORS]
Abubakar Isah, Ibrahim Aliyu, Sulaiman Muhammad Rashid, Jaehyung Park, Minsoo Hahn, Jinsul Kim
[ABSTRACT]
Graph neural networks are gaining attention in fifth-generation (5G) core
network digital twins, which are data-driven complex systems with numerous
components. Analyzing these data can be challenging due to rare failure types,
leading to imbalanced classification in multiclass settings. Digital twins of
5G networks increasingly employ graph classification as the main method for
identifying failure types. However, the skewed distribution of failure
occurrences is a significant class-imbalance problem that prevents practical
graph data mining. Previous studies have not sufficiently addressed this
complex problem. This paper proposes class-Fourier GNN (CF-GNN), which
introduces a class-oriented spectral filtering mechanism to ensure precise
classification by estimating a unique spectral filter for each class. This work
employs eigenvalue and eigenvector spectral filtering to capture and adapt to
variations in minority classes, ensuring accurate class-specific feature
discrimination, and is adept at graph representation learning for complex local
structures among neighbors in an end-to-end setting. Extensive experiments
demonstrate that the proposed CF-GNN could help create new techniques for
enhancing classifiers and investigate the characteristics of the multiclass
imbalanced data in a network digital twin system.
[COMMENTS]
arXiv admin note: substantial text overlap with arXiv:2406.06595
[LINK]
http://arxiv.org/abs/2502.11505v2
[DATE]
2025-05-15 13:27:59+08:00
[CATEGORIES]
cs.LG
A Comprehensive Machine Learning Framework for Heart Disease Prediction: Performance Evaluation and Future Perspectives
[AUTHORS]
Ali Azimi Lamir, Shiva Razzagzadeh, Zeynab Rezaei
[ABSTRACT]
This study presents a machine learning-based framework for heart disease
prediction using the heart-disease dataset, comprising 303 samples with 14
features. The methodology involves data preprocessing, model training, and
evaluation using three classifiers: Logistic Regression, K-Nearest Neighbors
(KNN), and Random Forest. Hyperparameter tuning with GridSearchCV and
RandomizedSearchCV was employed to enhance model performance. The Random Forest
classifier outperformed other models, achieving an accuracy of 91% and an
F1-score of 0.89. Evaluation metrics, including precision, recall, and
confusion matrix, revealed balanced performance across classes. The proposed
model demonstrates strong potential for aiding clinical decision-making by
effectively predicting heart disease. Limitations such as dataset size and
generalizability underscore the need for future studies using larger and more
diverse datasets. This work highlights the utility of machine learning in
healthcare, offering insights for further advancements in predictive
diagnostics.
[LINK]
http://arxiv.org/abs/2505.09969v1
[DATE]
2025-05-15 13:13:38+08:00
[CATEGORIES]
cs.LG
Commute Graph Neural Networks
[AUTHORS]
Wei Zhuo, Han Yu, Guang Tan, Xiaoxiao Li
[ABSTRACT]
Graph Neural Networks (GNNs) have shown remarkable success in learning from
graph-structured data. However, their application to directed graphs (digraphs)
presents unique challenges, primarily due to the inherent asymmetry in node
relationships. Traditional GNNs are adept at capturing unidirectional relations
but fall short in encoding the mutual path dependencies between nodes, such as
asymmetrical shortest paths typically found in digraphs. Recognizing this gap,
we introduce Commute Graph Neural Networks (CGNN), an approach that seamlessly
integrates node-wise commute time into the message passing scheme. The
cornerstone of CGNN is an efficient method for computing commute time using a
newly formulated digraph Laplacian. Commute time is then integrated into the
neighborhood aggregation process, with neighbor contributions weighted
according to their respective commute time to the central node in each layer.
It enables CGNN to directly capture the mutual, asymmetric relationships in
digraphs. Extensive experiments on 8 benchmarking datasets confirm the
superiority of CGNN against 13 state-of-the-art methods.
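
To make the notion of commute time on a digraph concrete, the following minimal sketch uses the textbook random-walk definition: the commute time between two nodes is the expected number of steps to walk from one to the other and back. It is not CGNN's Laplacian-based computation, which the abstract does not spell out. The function names, the adjacency-list input, and the fixed-point iteration are all assumptions made for illustration, and the graph is assumed to be strongly connected.

```python
def hitting_times(adj, target, iters=10_000, tol=1e-12):
    """Expected steps for a uniform random walk to reach `target` from each node.

    `adj[u]` lists the out-neighbors of node u (hypothetical input format).
    Solves h(u) = 1 + mean(h(v) for v in adj[u]), h(target) = 0, by iteration.
    """
    n = len(adj)
    h = [0.0] * n
    for _ in range(iters):
        new = [0.0] * n
        for u in range(n):
            if u == target:
                continue  # hitting time from the target to itself is zero
            new[u] = 1.0 + sum(h[v] for v in adj[u]) / len(adj[u])
        if max(abs(a - b) for a, b in zip(new, h)) < tol:
            return new
        h = new
    return h


def commute_time(adj, i, j):
    """Commute time = hitting time i -> j plus hitting time j -> i."""
    return hitting_times(adj, j)[i] + hitting_times(adj, i)[j]


# Directed 3-cycle 0 -> 1 -> 2 -> 0: one step from 0 to 1, two steps back.
cycle = [[1], [2], [0]]
forward = hitting_times(cycle, 1)[0]
backward = hitting_times(cycle, 0)[1]
total = commute_time(cycle, 0, 1)
```

On the directed cycle the two hitting times differ, while their sum is symmetric in the two nodes; this is the kind of mutual path dependency the abstract refers to.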
[COMMENTS]
Published in International Conference on Machine Learning (ICML),
2025
[LINK]
http://arxiv.org/abs/2407.01635v7
[DATE]
2025-05-15 13:02:59+08:00
[CATEGORIES]
cs.LG
Saliency-Motion Guided Trunk-Collateral Network for Unsupervised Video Object Segmentation
[AUTHORS]
Xiangyu Zheng, Wanyun Li, Songcheng He, Jianping Fan, Xiaoqiang Li, We Zhang
[ABSTRACT]
Recent mainstream unsupervised video object segmentation (UVOS)
motion-appearance approaches use either the bi-encoder structure to separately
encode motion and appearance features, or the uni-encoder structure for joint
encoding. However, these methods fail to properly balance the motion-appearance
relationship. Consequently, even with complex fusion modules for
motion-appearance integration, the extracted suboptimal features degrade the
models’ overall performance. Moreover, the quality of optical flow varies
across scenarios, making it insufficient to rely solely on optical flow to
achieve high-quality segmentation results. To address these challenges, we
propose the Saliency-Motion guided Trunk-Collateral Network (SMTC-Net), which
better balances the motion-appearance relationship and incorporates the model’s
intrinsic saliency information to enhance segmentation performance.
Specifically, because optical flow maps are derived from RGB images, the two
share both commonalities and differences. Accordingly, we propose a novel
Trunk-Collateral structure for motion-appearance UVOS. The shared trunk
backbone captures the motion-appearance commonality, while the collateral
branch learns the uniqueness of motion features. Furthermore, an Intrinsic
Saliency guided Refinement Module (ISRM) is devised to efficiently leverage the
model’s intrinsic saliency information to refine high-level features, and
provide pixel-level guidance for motion-appearance fusion, thereby enhancing
performance without additional input. Experimental results show that SMTC-Net
achieves state-of-the-art performance on three UVOS datasets (89.2% J&F on
DAVIS-16, 76% J on YouTube-Objects, 86.4% J on FBMS) and on four standard video
salient object detection (VSOD) benchmarks with notable gains, demonstrating
its effectiveness and superiority over previous methods.
[LINK]
http://arxiv.org/abs/2504.05904v2
[DATE]
2025-05-15 13:01:49+08:00
[CATEGORIES]
cs.LG
On the Power of Learning-Augmented Search Trees
[AUTHORS]
Jingbang Chen, Xinyuan Cao, Alicia Stepin, Li Chen
[ABSTRACT]
We study learning-augmented binary search trees (BSTs) via Treaps with
carefully designed priorities. The result is a simple search tree in which the
depth of each item $x$ is determined by its predicted weight $w_x$.
Specifically, each item $x$ is assigned a composite priority of
$-\lfloor\log\log(1/w_x)\rfloor + U(0, 1)$ where $U(0, 1)$ is the uniform
random variable. By choosing $w_x$ as the relative frequency of $x$, the
resulting search trees achieve static optimality. This approach generalizes the
recent learning-augmented BSTs [Lin-Luo-Woodruff ICML ‘22], which only work for
Zipfian distributions, by extending them to arbitrary input distributions.
Furthermore, we demonstrate that our method can be generalized to a B-Tree data
structure using the B-Treap approach [Golovin ICALP ‘09]. Our search trees are
also capable of leveraging localities in the access sequence through online
self-reorganization, thereby achieving the working-set property. Additionally,
they are robust to prediction errors and support dynamic operations, such as
insertions, deletions, and prediction updates. We complement our analysis with
an empirical study, demonstrating that our method outperforms prior work and
classic data structures.
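
The composite priority above can be sketched in a few lines. This is only an illustration of the formula, not the authors' implementation: the function names are hypothetical, base-2 logarithms are an assumption (the abstract does not state the base), and weights are assumed to lie strictly between 0 and 1 so that both logarithms are defined. It also assumes a Treap in which larger priorities sit closer to the root.

```python
import math
import random


def priority_tier(w):
    """Integer part of the composite priority: -floor(log log(1/w))."""
    if not 0.0 < w < 1.0:
        raise ValueError("predicted weight must lie strictly between 0 and 1")
    # Assumption: base-2 logarithms; the abstract does not state the base.
    return -math.floor(math.log2(math.log2(1.0 / w)))


def composite_priority(w, rng=random):
    """Tier plus a U(0, 1) tie-breaker, following the formula in the abstract."""
    return priority_tier(w) + rng.random()


# Items with larger predicted weight land in a higher (less negative) tier.
tiers = {w: priority_tier(w) for w in (0.5, 0.25, 1.0 / 16, 1.0 / 256)}
```

Because the tier changes only when log(1/w) doubles, items whose predicted weights are within a polynomial factor of each other share a tier and are ordered by the random term alone.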
[COMMENTS]
Accepted by ICML25
[LINK]
http://arxiv.org/abs/2211.09251v3
[DATE]
2025-05-15 12:46:39+08:00
[CATEGORIES]
cs.LG
Approximated Behavioral Metric-based State Projection for Federated Reinforcement Learning
[AUTHORS]
Zengxia Guo, Bohui An, Zhongqi Lu
[ABSTRACT]
Federated reinforcement learning (FRL) methods usually share the encrypted
local state or policy information and help each client to learn from others
while preserving everyone’s privacy. In this work, we propose that sharing the
approximated behavior metric-based state projection function is a promising way
to enhance the performance of FRL while also providing effective protection of
sensitive information. We introduce FedRAG, an FRL framework that learns a
computationally practical projection function of states for each client and
aggregates the parameters of the projection functions at a central server. The
FedRAG approach shares no sensitive task-specific information, yet provides
information gain for each client. We conduct extensive experiments on the
DeepMind Control Suite to demonstrate insightful results.
[LINK]
http://arxiv.org/abs/2505.09959v1
[DATE]
2025-05-15 12:41:21+08:00
[CATEGORIES]
cs.LG
TransPL: VQ-Code Transition Matrices for Pseudo-Labeling of Time Series Unsupervised Domain Adaptation
[AUTHORS]
Jaeho Kim, Seulki Lee
[ABSTRACT]
Unsupervised domain adaptation (UDA) for time series data remains a critical
challenge in deep learning, with traditional pseudo-labeling strategies failing
to capture temporal patterns and channel-wise shifts between domains, producing
sub-optimal pseudo-labels. As such, we introduce TransPL, a novel approach that
addresses these limitations by modeling the joint distribution $P(\mathbf{X},
y)$ of the source domain through code transition matrices, where the codes are
derived from vector quantization (VQ) of time series patches. Our method
constructs class- and channel-wise code transition matrices from the source
domain and employs Bayes’ rule for target domain adaptation, generating
pseudo-labels based on channel-wise weighted class-conditional likelihoods.
TransPL offers three key advantages: explicit modeling of temporal transitions
and channel-wise shifts between different domains, versatility towards
different UDA scenarios (e.g., weakly-supervised UDA), and explainable
pseudo-label generation. We validate TransPL’s effectiveness through extensive
analysis on four time series UDA benchmarks and confirm that it consistently
outperforms state-of-the-art pseudo-labeling methods by a strong margin (6.1%
accuracy improvement, 4.9% F1 improvement), while providing interpretable
insights into the domain adaptation process through its learned code transition
matrices.
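
The pseudo-labeling idea can be illustrated with a heavily simplified sketch: estimate one code transition matrix per class from source sequences, then apply Bayes' rule to score a target sequence. This is not the TransPL implementation. It covers a single channel only (no channel-wise weighting), it assumes the VQ codes are already given as integers, and the function names and the additive smoothing are assumptions made for illustration.

```python
import math


def fit_transition_matrices(sequences, labels, num_codes, smoothing=1.0):
    """Row-normalized code transition matrix per class (single channel)."""
    counts = {}
    for seq, y in zip(sequences, labels):
        m = counts.setdefault(
            y, [[smoothing] * num_codes for _ in range(num_codes)]
        )
        for a, b in zip(seq, seq[1:]):
            m[a][b] += 1.0  # count the transition a -> b for class y
    return {
        y: [[c / sum(row) for c in row] for row in m] for y, m in counts.items()
    }


def class_posterior(seq, matrices, priors):
    """Bayes' rule: P(y | seq) is proportional to P(y) * prod T_y[a][b]."""
    log_scores = {
        y: math.log(priors[y])
        + sum(math.log(T[a][b]) for a, b in zip(seq, seq[1:]))
        for y, T in matrices.items()
    }
    top = max(log_scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - top) for s in log_scores.values())
    return {y: math.exp(s - top) / z for y, s in log_scores.items()}


# Toy source domain: class 0 alternates between codes, class 1 stays put.
source = [[0, 1, 0, 1, 0], [0, 0, 0, 0, 0], [1, 1, 1, 1, 1]]
source_labels = [0, 1, 1]
matrices = fit_transition_matrices(source, source_labels, num_codes=2)
posterior = class_posterior([1, 0, 1, 0], matrices, {0: 0.5, 1: 0.5})
pseudo_label = max(posterior, key=posterior.get)
```

The learned matrices are directly inspectable, which is one way to read the abstract's claim of explainable pseudo-label generation.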
[COMMENTS]
ICML 2025 Accept
[LINK]
http://arxiv.org/abs/2505.09955v1
[DATE]
2025-05-15 12:27:48+08:00
[CATEGORIES]
cs.LG
A Trust-Guided Approach to MR Image Reconstruction with Side Information
[AUTHORS]
Arda Atalık, Sumit Chopra, Daniel K. Sodickson
[ABSTRACT]
Reducing MRI scan times can improve patient care and lower healthcare costs.
Many acceleration methods are designed to reconstruct diagnostic-quality images
from sparse k-space data, via an ill-posed or ill-conditioned linear inverse
problem (LIP). To address the resulting ambiguities, it is crucial to
incorporate prior knowledge into the optimization problem, e.g., in the form of
regularization. Another form of prior knowledge less commonly used in medical
imaging is the readily available auxiliary data (a.k.a. side information)
obtained from sources other than the current acquisition. In this paper, we
present the Trust-Guided Variational Network (TGVN), an end-to-end deep
learning framework that effectively and reliably integrates side information
into LIPs. We demonstrate its effectiveness in multi-coil, multi-contrast MRI
reconstruction, where incomplete or low-SNR measurements from one contrast are
used as side information to reconstruct high-quality images of another contrast
from heavily under-sampled data. TGVN is robust across different contrasts,
anatomies, and field strengths. Compared to baselines utilizing side
information, TGVN achieves superior image quality while preserving subtle
pathological features even at challenging acceleration levels, drastically
speeding up acquisition while minimizing hallucinations. Source code and
dataset splits are available on github.com/sodicksonlab/TGVN.
[COMMENTS]
27 pages, 9 figures
[LINK]
http://arxiv.org/abs/2501.03021v2
[DATE]
2025-05-15 12:15:14+08:00
[CATEGORIES]
cs.LG
Efficient Transformed Gaussian Process State-Space Models for Non-Stationary High-Dimensional Dynamical Systems
[AUTHORS]
Zhidi Lin, Ying Li, Feng Yin, Juan Maroñas, Alexandre H. Thiéry
[ABSTRACT]
Gaussian process state-space models (GPSSMs) offer a principled framework for
learning and inference in nonlinear dynamical systems with uncertainty
quantification. However, existing GPSSMs are limited by the use of multiple
independent stationary Gaussian processes (GPs), leading to prohibitive
computational and parametric complexity in high-dimensional settings and
restricted modeling capacity for non-stationary dynamics. To address these
challenges, we propose an efficient transformed Gaussian process state-space
model (ETGPSSM) for scalable and flexible modeling of high-dimensional,
non-stationary dynamical systems. Specifically, our ETGPSSM integrates a single
shared GP with input-dependent normalizing flows, yielding an expressive
implicit process prior that captures complex, non-stationary transition
dynamics while significantly reducing model complexity. For the inference of
the implicit process, we develop a variational inference algorithm that jointly
approximates the posterior over the underlying GP and the neural network
parameters defining the normalizing flows. To avoid explicit variational
parameterization of the latent states, we further incorporate the ensemble
Kalman filter (EnKF) into the variational framework, enabling accurate and
efficient state estimation. Extensive empirical evaluations on synthetic and
real-world datasets demonstrate the superior performance of our ETGPSSM in
system dynamics learning, high-dimensional state estimation, and time-series
forecasting, outperforming existing GPSSMs and neural network-based SSMs in
terms of computational efficiency and accuracy.
[COMMENTS]
15 pages, 6 figures
[LINK]
http://arxiv.org/abs/2503.18309v3
[DATE]
2025-05-15 11:55:55+08:00
[CATEGORIES]
cs.LG
VRU-CIPI: Crossing Intention Prediction at Intersections for Improving Vulnerable Road Users Safety
[AUTHORS]
Ahmed S. Abdelrahman, Mohamed Abdel-Aty, Quoc Dai Tran
[ABSTRACT]
Understanding and predicting human behavior in the wild, particularly at urban
intersections, remains crucial for enhancing interaction safety between road
users. Among the most critical behaviors are crossing intentions of Vulnerable
Road Users (VRUs), where misinterpretation may result in dangerous conflicts
with oncoming vehicles. In this work, we propose the VRU-CIPI framework with a
sequential attention-based model designed to predict VRU crossing intentions at
intersections. VRU-CIPI employs a Gated Recurrent Unit (GRU) to capture temporal
dynamics in VRU movements, combined with a multi-head Transformer
self-attention mechanism to encode contextual and spatial dependencies critical
for predicting crossing direction. Evaluated on the UCF-VRU dataset, our proposed
framework achieves state-of-the-art performance with an accuracy of 96.45% and a
real-time inference speed of 33 frames per second. Furthermore, by
integrating with Infrastructure-to-Vehicles (I2V) communication, our approach
can proactively enhance intersection safety through timely activation of
crossing signals and providing early warnings to connected vehicles, ensuring
smoother and safer interactions for all road users.
[LINK]
http://arxiv.org/abs/2505.09935v1
[DATE]
2025-05-15 11:40:29+08:00
[CATEGORIES]
cs.LG
DELTA: Dual Consistency Delving with Topological Uncertainty for Active Graph Domain Adaptation
[AUTHORS]
Pengyun Wang, Yadi Cao, Chris Russell, Yanxin Shen, Junyu Luo, Ming Zhang, Siyu Heng, Xiao Luo
[ABSTRACT]
Graph domain adaptation has recently enabled knowledge transfer across
different graphs. However, without the semantic information on target graphs,
the performance on target graphs is still far from satisfactory. To address the
issue, we study the problem of active graph domain adaptation, which selects a
small quantity of informative nodes on the target graph for extra
annotation. This problem is highly challenging due to the complicated
topological relationships and the distribution discrepancy across graphs. In
this paper, we propose a novel approach named Dual Consistency Delving with
Topological Uncertainty (DELTA) for active graph domain adaptation. Our DELTA
consists of an edge-oriented graph subnetwork and a path-oriented graph
subnetwork, which can explore topological semantics from complementary
perspectives. In particular, our edge-oriented graph subnetwork utilizes the
message passing mechanism to learn neighborhood information, while our
path-oriented graph subnetwork explores high-order relationships from
sub-structures. To jointly learn from the two subnetworks, we coarsely select
informative candidate nodes by considering consistency across the two
subnetworks. Then, we aggregate local semantics from each candidate’s K-hop
subgraph based
on node degrees for topological uncertainty estimation. To overcome potential
distribution shifts, we compare target nodes and their corresponding source
nodes for discrepancy scores as an additional component for fine selection.
Extensive experiments on benchmark datasets demonstrate that DELTA outperforms
various state-of-the-art approaches. The code implementation of DELTA is
available at https://github.com/goose315/DELTA.
[LINK]
http://arxiv.org/abs/2409.08946v2
[DATE]
2025-05-15 11:38:43+08:00
[CATEGORIES]
cs.LG
Demystifying AI Agents: The Final Generation of Intelligence
[AUTHORS]
Kevin J McNamara, Rhea Pritham Marpu
[ABSTRACT]
The trajectory of artificial intelligence (AI) has been one of relentless
acceleration, evolving from rudimentary rule-based systems to sophisticated,
autonomous agents capable of complex reasoning and interaction. This whitepaper
chronicles this remarkable journey, charting the key technological
milestones–advancements in prompting, training methodologies, hardware
capabilities, and architectural innovations–that have converged to create the
AI agents of today. We argue that these agents, exemplified by systems like
OpenAI’s ChatGPT with plugins and xAI’s Grok, represent a culminating phase in
AI development, potentially constituting the “final generation” of intelligence
as we currently conceive it. We explore the capabilities and underlying
technologies of these agents, grounded in practical examples, while also
examining the profound societal implications and the unprecedented pace of
progress that suggests intelligence is now doubling approximately every six
months. The paper concludes by underscoring the critical need for wisdom and
foresight in navigating the opportunities and challenges presented by this
powerful new era of intelligence.
[LINK]
http://arxiv.org/abs/2505.09932v1
[DATE]
2025-05-15 11:35:12+08:00
[CATEGORIES]
cs.LG
Rehearsal-Free Continual Federated Learning with Synergistic Synaptic Intelligence
[AUTHORS]
Yichen Li, Yuying Wang, Haozhao Wang, Yining Qi, Tianzhe Xiao, Ruixuan Li
[ABSTRACT]
Continual Federated Learning (CFL) allows distributed devices to
collaboratively learn novel concepts from continuously shifting training data
while avoiding knowledge forgetting of previously seen tasks. To tackle this
challenge, most current CFL approaches rely on extensive rehearsal of previous
data. Despite its effectiveness, rehearsal comes at a cost to memory, and it may
also violate data privacy. In light of this, we seek to apply regularization
techniques to CFL, since they are cost-efficient and do not
require sample caching or rehearsal. Specifically, we first apply traditional
regularization techniques to CFL and observe that existing regularization
techniques, especially synaptic intelligence, can achieve promising results
under homogeneous data distribution but fail when the data is heterogeneous.
Based on this observation, we propose a simple yet effective regularization
algorithm for CFL named FedSSI, which tailors synaptic intelligence to CFL
settings with heterogeneous data. FedSSI not only reduces computational
overhead without rehearsal but also addresses the data heterogeneity issue.
Extensive experiments show that FedSSI achieves superior performance compared
to state-of-the-art methods.
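
For readers unfamiliar with synaptic intelligence, the sketch below shows the standard single-client version of the regularizer: a per-parameter importance is accumulated along the training path and later used to penalize drift from the previous task's solution. It is not FedSSI, whose modifications for heterogeneous data the abstract does not detail. The function names, the plain gradient-descent loop, and the toy loss are assumptions made for illustration.

```python
def train_task_with_si(theta, grad_fn, lr=0.1, steps=50, xi=1e-3):
    """Gradient descent on one task while accumulating SI path integrals.

    Returns the final parameters and a per-parameter importance
    omega_k / ((total change in theta_k)^2 + xi).
    """
    start = list(theta)
    omega = [0.0] * len(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        new = [t - lr * gk for t, gk in zip(theta, g)]
        for k in range(len(theta)):
            # Contribution of parameter k to the loss decrease on this step.
            omega[k] += -g[k] * (new[k] - theta[k])
        theta = new
    importance = [
        w / ((t - s) ** 2 + xi) for w, t, s in zip(omega, theta, start)
    ]
    return theta, importance


def si_penalty(theta, anchor, importance, c=1.0):
    """Quadratic penalty added to the next task's loss."""
    return c * sum(
        imp * (t - a) ** 2 for imp, t, a in zip(importance, theta, anchor)
    )


# Toy task: loss 0.5 * (theta[0] - 2)^2; theta[1] does not affect the loss.
def toy_grad(theta):
    return [theta[0] - 2.0, 0.0]


anchor, importance = train_task_with_si([0.0, 0.0], toy_grad)
drift_cost = si_penalty([anchor[0] + 1.0, anchor[1] + 1.0], anchor, importance)
```

Only the parameter that actually reduced the loss receives nonzero importance, so moving the unused parameter on a later task incurs no penalty and no stored samples are needed.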
[COMMENTS]
arXiv admin note: text overlap with arXiv:2403.05890
[LINK]
http://arxiv.org/abs/2412.13779v3
[DATE]
2025-05-15 11:29:15+08:00
[CATEGORIES]
cs.LG
TGDT: A Temporal Graph-based Digital Twin for Urban Traffic Corridors
[AUTHORS]
Nooshin Yousefzadeh, Rahul Sengupta, Jeremy Dilmore, Sanjay Ranka
[ABSTRACT]
Urban congestion at signalized intersections leads to significant delays,
economic losses, and increased emissions. Existing deep learning models often
lack spatial generalizability, rely on complex architectures, and struggle with
real-time deployment. To address these limitations, we propose the Temporal
Graph-based Digital Twin (TGDT), a scalable framework that integrates Temporal
Convolutional Networks and Attentional Graph Neural Networks for dynamic,
direction-aware traffic modeling and assessment at urban corridors. TGDT
estimates key Measures of Effectiveness (MOEs) for traffic flow optimization at
both the intersection level (e.g., queue length, waiting time) and the corridor
level (e.g., traffic volume, travel time). Its modular architecture and
sequential optimization scheme enable easy extension to any number of
intersections and MOEs. The model outperforms state-of-the-art baselines by
accurately producing high-dimensional, concurrent multi-output estimates. It
also demonstrates high robustness and accuracy across diverse traffic
conditions, including extreme scenarios, while relying on only a minimal set of
traffic features. Fully parallelized, TGDT can simulate over a thousand
scenarios within a matter of seconds, offering a cost-effective, interpretable,
and real-time solution for urban traffic management and optimization.
[COMMENTS]
8 pages, 4 figures, 1 table
[LINK]
http://arxiv.org/abs/2504.18008v2
[DATE]
2025-05-15 11:23:36+08:00
[CATEGORIES]
cs.LG
Reinforced Interactive Continual Learning via Real-time Noisy Human Feedback
[AUTHORS]
Yutao Yang, Jie Zhou, Junsong Li, Qianjun Pan, Bihao Zhan, Qin Chen, Xipeng Qiu, Liang He
[ABSTRACT]
This paper introduces an interactive continual learning paradigm where AI
models dynamically learn new skills from real-time human feedback while
retaining prior knowledge. This paradigm distinctively addresses two major
limitations of traditional continual learning: (1) dynamic model updates using
streaming, real-time human-annotated data, rather than static datasets with
fixed labels, and (2) the assumption of clean labels, by explicitly handling
the noisy feedback common in real-world interactions. To tackle these problems,
we propose RiCL, a Reinforced interactive Continual Learning framework
leveraging Large Language Models (LLMs) to learn new skills effectively from
dynamic feedback. RiCL incorporates three key components: a temporal
consistency-aware purifier to automatically discern clean from noisy samples in
data streams; an interaction-aware direct preference optimization strategy to
align model behavior with human intent by reconciling AI-generated and
human-provided feedback; and a noise-resistant contrastive learning module that
captures robust representations by exploiting inherent data relationships, thus
avoiding reliance on potentially unreliable labels. Extensive experiments on
two benchmark datasets (FewRel and TACRED), contaminated with realistic noise
patterns, demonstrate that our RiCL approach substantially outperforms existing
combinations of state-of-the-art online continual learning and noisy-label
learning methods.
[LINK]
http://arxiv.org/abs/2505.09925v1
[DATE]
2025-05-15 11:22:03+08:00
[CATEGORIES]
cs.LG
Hierarchical Learning and Computing over Space-Ground Integrated Networks
[AUTHORS]
Jingyang Zhu, Yuanming Shi, Yong Zhou, Chunxiao Jiang, Linling Kuang
[ABSTRACT]
Space-ground integrated networks hold great promise for providing global
connectivity, particularly in remote areas where large amounts of valuable data
are generated by Internet of Things (IoT) devices but terrestrial
communication infrastructure is lacking. The massive data is conventionally
transferred to the cloud server for centralized artificial intelligence (AI)
model training, incurring huge communication overhead and raising privacy
concerns. To address this, we
propose a hierarchical learning and computing framework, which leverages the
low-latency characteristic of low-earth-orbit (LEO) satellites and the global
coverage of geostationary-earth-orbit (GEO) satellites, to provide global
aggregation services for locally trained models on ground IoT devices. Due to
the time-varying nature of satellite network topology and the energy
constraints of LEO satellites, efficiently aggregating the received local
models from ground devices on LEO satellites is highly challenging. By
leveraging the predictability of inter-satellite connectivity and modeling the
space network as a directed graph, we formulate a network energy minimization
problem for model aggregation, which turns out to be a Directed Steiner Tree
(DST) problem. We propose a topology-aware energy-efficient routing (TAEER)
algorithm to solve the DST problem by finding a minimum spanning arborescence
on a substitute directed graph. Extensive simulations under real-world
space-ground integrated network settings demonstrate that the proposed TAEER
algorithm significantly reduces energy consumption and outperforms benchmarks.
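
The abstract reduces model aggregation to finding a minimum spanning arborescence on a directed graph. The following is a minimal sketch of the textbook Chu-Liu/Edmonds algorithm for that subproblem, not the TAEER algorithm itself. The substitute-graph construction is not described in the abstract, and the function name, the integer node labels, and the toy edge weights are all assumptions made for illustration.

```python
def min_arborescence_cost(n, root, edges):
    """Total weight of a minimum spanning arborescence rooted at `root`.

    n     -- number of nodes, labeled 0..n-1 (hypothetical labeling)
    edges -- list of (u, v, w): directed edge u -> v with weight w
    Returns None if some node cannot be reached from the root.
    """
    INF = float("inf")
    total = 0
    edges = list(edges)
    while True:
        # Step 1: pick the cheapest incoming edge of every non-root node.
        in_w = [INF] * n
        pre = [-1] * n
        for u, v, w in edges:
            if u != v and w < in_w[v]:
                in_w[v] = w
                pre[v] = u
        for v in range(n):
            if v != root and in_w[v] == INF:
                return None
        in_w[root] = 0

        # Step 2: add the chosen weights and detect cycles among chosen edges.
        comp = [-1] * n      # id of the contracted component of each node
        visited = [-1] * n   # start node of the walk that last visited a node
        count = 0
        for v in range(n):
            total += in_w[v]
            x = v
            while visited[x] != v and comp[x] == -1 and x != root:
                visited[x] = v
                x = pre[x]
            if x != root and comp[x] == -1:
                # x lies on a cycle: give the whole cycle one component id.
                y = pre[x]
                while y != x:
                    comp[y] = count
                    y = pre[y]
                comp[x] = count
                count += 1
        if count == 0:
            return total  # no cycles: the chosen edges form an arborescence

        # Step 3: contract cycles and reweight edges entering each node.
        for v in range(n):
            if comp[v] == -1:
                comp[v] = count
                count += 1
        edges = [(comp[u], comp[v], w - in_w[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root = comp[root]
        n = count


# Toy graph: nodes 1 and 2 form a cheap cycle that must be broken.
toy_edges = [(0, 1, 10), (0, 2, 10), (1, 2, 1), (2, 1, 1), (2, 3, 2)]
toy_cost = min_arborescence_cost(4, 0, toy_edges)
```

On the toy graph, the cheapest incoming edges of nodes 1 and 2 form a cycle, so the sketch contracts it and pays for one of the expensive edges from the root. In an energy-minimization setting the weights would stand for per-link energy costs, which is an interpretation assumed here rather than taken from the paper.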
[COMMENTS]
Accepted by IEEE Transactions on Mobile Computing
[LINK]
http://arxiv.org/abs/2408.14116v2
[DATE]
2025-05-15 11:20:48+08:00
[CATEGORIES]
cs.LG
MTDT: A Multi-Task Deep Learning Digital Twin
[AUTHORS]
Nooshin Yousefzadeh, Rahul Sengupta, Yashaswi Karnati, Anand Rangarajan, Sanjay Ranka
[ABSTRACT]
Traffic congestion has significant impacts on both the economy and the
environment. Measures of Effectiveness (MOEs) have long been the standard for
evaluating traffic intersections’ level of service and operational efficiency.
However, the scarcity of traditional high-resolution loop detector data (ATSPM)
presents challenges in accurately measuring MOEs or capturing the intricate
spatiotemporal characteristics inherent in urban intersection traffic. To
address this challenge, we present a comprehensive intersection traffic flow
simulation that utilizes a multi-task learning paradigm. This approach combines
graph convolutions for the primary task of estimating lane-wise exit and inflow
with time series convolutions for the secondary task of assessing
multi-directional queue lengths and travel time distributions through an
arbitrary urban traffic intersection.
Compared to existing deep learning methodologies, the proposed Multi-Task Deep
Learning Digital Twin (MTDT) distinguishes itself through its adaptability to
local temporal and spatial features, such as signal timing plans, intersection
topology, driving behaviors, and turning movement counts. We also show the
benefit of multi-task learning in the effectiveness of individual traffic
simulation tasks. Furthermore, our approach facilitates sequential computation
and provides complete parallelization through GPU implementation. This not only
streamlines the computational process but also enhances scalability and
performance.
[COMMENTS]
8 pages, 2 figures, 4 tables
[LINK]
http://arxiv.org/abs/2405.00922v2
[DATE]
2025-05-15 11:16:20+08:00
[CATEGORIES]
cs.LG
Diffusion-assisted Model Predictive Control Optimization for Power System Real-Time Operation
[AUTHORS]
Linna Xu, Yongli Zhu
[ABSTRACT]
This paper presents a modified model predictive control (MPC) framework for
real-time power system operation. The framework incorporates a diffusion model
tailored for time series generation to enhance the accuracy of the load
forecasting module used in system operation. In the absence of an explicit
state transition law, a model-identification procedure is leveraged to derive
the system dynamics, thereby eliminating a barrier when applying MPC to a
renewables-dominated power system. Case study results on an industry park
system and the IEEE 30-bus system demonstrate that using the diffusion model to
augment the training dataset significantly improves load-forecasting accuracy,
and the inferred system dynamics are applicable to the real-time grid operation
with solar and wind.
[COMMENTS]
This paper has been accepted by the 2025 IEEE PES General Meeting
(PESGM), which will be held in Austin, TX, July 27-31, 2025
[LINK]
http://arxiv.org/abs/2505.08535v2
[DATE]
2025-05-15 11:16:05+08:00
[CATEGORIES]
cs.LG
Towards Fair In-Context Learning with Tabular Foundation Models
[AUTHORS]
Patrik Kenfack, Samira Ebrahimi Kahou, Ulrich Aïvodji
[ABSTRACT]
Tabular foundational models have exhibited strong in-context learning (ICL)
capabilities on structured data, allowing them to make accurate predictions on
test sets without parameter updates, using training examples as context. This
emerging approach positions itself as a competitive alternative to traditional
gradient-boosted tree methods. However, while biases in conventional machine
learning models are well documented, it remains unclear how these biases
manifest in tabular ICL. The paper investigates the fairness implications of
tabular ICL and explores three preprocessing strategies (correlation removal,
group-balanced demonstration selection, and uncertainty-based demonstration
selection) to address bias. Comprehensive experiments indicate that
uncertainty-based demonstration selection consistently enhances group fairness
of in-context predictions. The source code for reproducing the results of this
work can be found at https://github.com/patrikken/Fair-TabICL.
[COMMENTS]
24 pages, 10 figures, 4 tables
[LINK]
http://arxiv.org/abs/2505.09503v2
[DATE]
2025-05-15 11:13:43+08:00
[CATEGORIES]
cs.LG
Improving the Euclidean Diffusion Generation of Manifold Data by Mitigating Score Function Singularity
[AUTHORS]
Zichen Liu, Wei Zhang, Tiejun Li
[ABSTRACT]
Euclidean diffusion models have achieved remarkable success in generative
modeling across diverse domains, and they have been extended to the manifold case
in recent advances. Instead of explicitly utilizing the structure of special
manifolds as studied in previous works, we investigate direct sampling of the
Euclidean diffusion models for general manifold-constrained data in this paper.
We reveal the multiscale singularity of the score function in the embedded
space of the manifold, which hinders the accuracy of diffusion-generated samples.
We then present an elaborate theoretical analysis of the singularity structure
of the score function by separating it along the tangential and normal
directions of the manifold. To mitigate the singularity and improve the
sampling accuracy, we propose two novel methods: (1) Niso-DM, which introduces
non-isotropic noise along the normal direction to reduce scale discrepancies,
and (2) Tango-DM, which trains only the tangential component of the score
function using a tangential-only loss function. Numerical experiments
demonstrate that our methods achieve superior performance on distributions over
various manifolds with complex geometries.
[COMMENTS]
22 pages
[LINK]
http://arxiv.org/abs/2505.09922v1
[DATE]
2025-05-15 11:12:27+08:00
[CATEGORIES]
cs.LG
Self-cross Feature based Spiking Neural Networks for Efficient Few-shot Learning
[AUTHORS]
Qi Xu, Junyang Zhu, Dongdong Zhou, Hao Chen, Yang Liu, Jiangrong Shen, Qiang Zhang
[ABSTRACT]
Deep neural networks (DNNs) excel in computer vision tasks, especially
few-shot learning (FSL), which is increasingly important for generalizing from
limited examples. However, DNNs are computationally expensive and face
scalability issues in the real world. Spiking Neural Networks (SNNs), with their event-driven
nature and low energy consumption, are particularly efficient in processing
sparse and dynamic data, though they still encounter difficulties in capturing
complex spatiotemporal features and performing accurate cross-class
comparisons. To further enhance the performance and efficiency of SNNs in
few-shot learning, we propose a few-shot learning framework based on SNNs,
which combines a self-feature extractor module and a cross-feature contrastive
module to refine feature representation and reduce power consumption. We apply
the combination of temporal efficient training loss and InfoNCE loss to
optimize the temporal dynamics of spike trains and enhance the discriminative
power. Experimental results show that the proposed FSL-SNN significantly
improves the classification performance on the neuromorphic dataset N-Omniglot,
and also achieves competitive performance to ANNs on static datasets such as
CUB and miniImageNet with low power consumption.
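
The abstract names the InfoNCE loss as one of the two training objectives. As a rough illustration of that loss alone, the sketch below computes InfoNCE for one anchor from precomputed similarity scores. The function name, the temperature default, and the similarity values are assumptions; the paper's combination with the temporal efficient training loss and its spiking feature extractors are not reproduced here.

```python
import math


def info_nce(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE loss for one anchor.

    sim_pos  -- similarity between the anchor and its positive sample
    sim_negs -- similarities between the anchor and each negative sample
    Returns -log( exp(s+/t) / sum_i exp(s_i/t) ), computed stably.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max before exponentiating (log-sum-exp trick)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


# When the positive cannot be told apart from 3 negatives, the loss is log(4).
uninformative = info_nce(0.5, [0.5, 0.5, 0.5])
# A well-separated positive gives a much smaller loss than a confused one.
separated = info_nce(0.9, [0.1, 0.0])
confused = info_nce(0.1, [0.9, 0.0])
```

The loss falls as the positive similarity rises relative to the negatives, which is the sense in which it sharpens discriminative power.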
[LINK]
http://arxiv.org/abs/2505.07921v2
[DATE]
2025-05-15 10:56:21+08:00
[CATEGORIES]
cs.LG
Shallow AutoEncoding Recommender with Cold Start Handling via Side Features
[AUTHORS]
Edward DongBo Cui, Lu Zhang, William Ping-hsun Lee
[ABSTRACT]
User and item cold starts present significant challenges in industrial
applications of recommendation systems. Supplementing user-item interaction
data with metadata is a common solution, but often at the cost of introducing
additional biases. In this work, we introduce an augmented EASE model that
seamlessly integrates both user and item side information to address these cold
start issues. Our straightforward, autoencoder-based method produces a
closed-form solution that leverages rich content signals for cold items while
refining user representations in data-sparse environments. Importantly, our
method strikes a balance by effectively recommending cold start items and
handling cold start users without incurring extra bias, and it maintains strong
performance in warm settings. Experimental results demonstrate improved
recommendation accuracy and robustness compared to previous collaborative
filtering approaches. Moreover, our model serves as a strong baseline for
future comparative studies.
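
For context on the closed-form solution mentioned above, the block below states the objective and solution of the original EASE model, on which the abstract's augmented variant builds. It is background only: the abstract does not say how user and item side features enter the objective, so none of that is shown. Here $X$ is the user-item interaction matrix, $B$ the item-item weight matrix, and $\lambda$ the L2 regularization strength.

```latex
\hat{B} \;=\; \arg\min_{B}\; \lVert X - XB \rVert_F^2 + \lambda \lVert B \rVert_F^2
\quad \text{s.t.} \quad \operatorname{diag}(B) = 0,
\qquad
P = \left(X^\top X + \lambda I\right)^{-1},
\qquad
\hat{B}_{ij} =
\begin{cases}
0, & i = j, \\[2pt]
-\,P_{ij} / P_{jj}, & i \neq j.
\end{cases}
```

The zero-diagonal constraint stops each item from trivially predicting itself, and the solution needs only one matrix inversion.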
[COMMENTS]
Preparing submission to CIKM 2025; 2 Figures; 4 Tables; 13 pages;
Python code implementation example
[LINK]
http://arxiv.org/abs/2504.02288v4
[DATE]
2025-05-15 10:47:32+08:00
[CATEGORIES]
cs.LG
Demystifying AI Platform Design for Distributed Inference of Next-Generation LLM models
[AUTHORS]
Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Suvinay Subramanian, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna
[ABSTRACT]
Large language models (LLMs) have shown remarkable performance across a wide
range of applications, often outperforming human experts. However, deploying
these gigantic models efficiently for diverse inference use cases requires
carefully designed hardware platforms with ample computing, memory, and network
resources. With constant innovation in LLM serving optimizations and model
architecture evolving at breakneck speed, the hardware requirements to meet
Service Level Objectives (SLOs) remain an open research question.
To answer the question, we present an analytical tool, GenZ, to efficiently
navigate the relationship between diverse LLM model architectures (Dense, GQA,
MoE, Mamba), LLM serving optimizations (Chunking, Speculative decoding,
quantization), and AI platform design parameters. Our tool estimates LLM
inference performance metrics for the given scenario. We have validated it against
real hardware platforms running various LLM models, achieving a max
geomean error of 5.82. We use GenZ to identify compute, memory capacity, memory
bandwidth, network latency, and network bandwidth requirements across diverse
LLM inference use cases. We also study diverse architectural choices in use
today (inspired by LLM serving platforms from several vendors) to help inform
computer architects designing next-generation AI hardware accelerators and
platforms. The trends and insights derived from GenZ can guide AI engineers
deploying LLMs as well as computer architects designing next-generation
hardware accelerators and platforms. Ultimately, this work sheds light on the
platform design considerations for unlocking the full potential of large
language models across a spectrum of applications. The source code is available
at https://github.com/abhibambhaniya/GenZ-LLM-Analyzer. Users can also
try it at https://genz-llm-analyzer.streamlit.app/ in a web browser without any
setup.
[COMMENTS]
19 Pages, https://github.com/abhibambhaniya/GenZ-LLM-Analyzer,
https://genz-llm-analyzer.streamlit.app/
[LINK]
http://arxiv.org/abs/2406.01698v3
[DATE]
2025-05-15 10:46:53+08:00
[CATEGORIES]
cs.LG
Autoencoder-Based Hybrid Replay for Class-Incremental Learning
[AUTHORS]
Milad Khademi Nori, Il-Min Kim, Guanghui Wang
[ABSTRACT]
In class-incremental learning (CIL), effective incremental learning
strategies are essential to mitigate task confusion and catastrophic
forgetting, especially as the number of tasks $t$ increases. Current exemplar
replay strategies impose $\mathcal{O}(t)$ memory/compute complexities. We
propose an autoencoder-based hybrid replay (AHR) strategy that leverages our
new hybrid autoencoder (HAE) to function as a compressor to alleviate the
requirement for large memory, achieving $\mathcal{O}(0.1 t)$ in the worst case
with a computing complexity of $\mathcal{O}(t)$ while accomplishing
state-of-the-art performance. The decoder later recovers the exemplar data
stored in the latent space, rather than in raw format. Additionally, HAE is
designed for both discriminative and generative modeling, enabling
classification and replay capabilities, respectively. HAE adopts the charged
particle system energy minimization equations and repulsive force algorithm for
the incremental embedding and distribution of new class centroids in its latent
space. Our results demonstrate that AHR consistently outperforms recent
baselines across multiple benchmarks while operating with the same
memory/compute budgets. The source code is included in the supplementary
material and will be open-sourced upon publication.
[COMMENTS]
Accepted ICML 2025
[LINK]
http://arxiv.org/abs/2505.05926v2
[DATE]
2025-05-15 10:46:39+08:00
[CATEGORIES]
cs.LG
Avocado Price Prediction Using a Hybrid Deep Learning Model: TCN-MLP-Attention Architecture
[AUTHORS]
Linwei Zhang, LuFeng, Ruijia Liang
[ABSTRACT]
With the growing demand for healthy foods, agricultural product price
forecasting has become increasingly important. Hass avocados, as a high-value
crop, exhibit complex price fluctuations influenced by factors such as
seasonality, region, and weather. Traditional prediction models often struggle
with highly nonlinear and dynamic data. To address this, we propose a hybrid
deep learning model, TCN-MLP-Attention Architecture, combining Temporal
Convolutional Networks (TCN) for sequential feature extraction, Multi-Layer
Perceptrons (MLP) for nonlinear interactions, and an Attention mechanism for
dynamic feature weighting. The dataset used covers over 50,000 records of Hass
avocado sales across the U.S. from 2015 to 2018, including variables such as
sales volume, average price, time, region, weather, and variety type, collected
from point-of-sale systems and the Hass Avocado Board. After systematic
preprocessing, including missing value imputation and feature normalization,
the proposed model was trained and evaluated. Experimental results demonstrate
that the TCN-MLP-Attention model achieves excellent predictive performance,
with an RMSE of 1.23 and an MSE of 1.51, outperforming traditional methods.
This research provides a scalable and effective approach for time series
forecasting in agricultural markets and offers valuable insights for
intelligent supply chain management and price strategy optimization.
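
The two error figures reported above are related by RMSE = sqrt(MSE). The short sketch below defines both metrics and checks that the reported values agree with each other. The helper names and the toy prediction lists are assumptions for illustration; no data from the paper is used.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Consistency check on the reported figures: squaring the RMSE of 1.23
# gives about 1.5129, which rounds to the reported MSE of 1.51.
reported_rmse = 1.23
implied_mse = reported_rmse ** 2

# Toy example (hypothetical values, not from the dataset).
toy_true = [1.0, 2.0]
toy_pred = [1.0, 4.0]
toy_mse = mse(toy_true, toy_pred)    # (0 + 4) / 2
toy_rmse = rmse(toy_true, toy_pred)  # sqrt(2)
```

Because RMSE is expressed in the same units as the target, it is usually the easier of the two to read against an average price.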
[LINK]
http://arxiv.org/abs/2505.09907v1
[DATE]
2025-05-15 10:26:22+08:00
[CATEGORIES]
cs.LG
Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenchel-Young Losses
[AUTHORS]
Yuzhou Cao, Han Bao, Lei Feng, Bo An
[ABSTRACT]
Surrogate regret bounds, also known as excess risk bounds, bridge the gap
between the convergence rates of surrogate and target losses, with linear
bounds favorable for their lossless regret transfer. While convex smooth
surrogate losses are particularly appealing for their efficient estimation
and optimization, a trade-off between smoothness and a linear regret bound
is widely believed to exist in the community. That is to say, the
better optimization and estimation properties of convex smooth surrogate losses
may inevitably deteriorate after undergoing the regret transfer onto a target
loss. We overcome this dilemma for arbitrary discrete target losses by
constructing a convex smooth surrogate loss, which entails a linear surrogate
regret bound composed with a tailored prediction link. The construction is
based on Fenchel-Young losses generated by the convolutional negentropy, which
are equivalent to the infimal convolution of a generalized negentropy and the
target Bayes risk. Consequently, the infimal convolution enables us to derive a
smooth loss while maintaining the surrogate regret bound linear. We
additionally benefit from the infimal convolution to have a consistent
estimator of the underlying class probability. Our results are overall a novel
demonstration of how convex analysis penetrates into optimization and
statistical efficiency in risk minimization.
[LINK]
http://arxiv.org/abs/2505.09432v2
[DATE]
2025-05-15 10:26:10+08:00
[CATEGORIES]
cs.LG
Efficient Parallelization of Message Passing Neural Networks
[AUTHORS]
Junfan Xia, Bin Jiang
[ABSTRACT]
Machine learning potentials have achieved great success in accelerating
atomistic simulations. Many of them rely on local descriptors that readily
allow parallelization. More recent message passing neural network (MPNN) models
have demonstrated their superior accuracy and become increasingly popular.
However, parallelizing MPNN models for large-scale simulations across compute
nodes remains a challenge, owing to their previously argued poor scalability with
the number of MP layers and the need for data communication. Here, we propose
an efficient parallel algorithm for MPNN models, in which additional data
communication is minimized among local atoms only in each MP layer without
redundant computation, thus scaling linearly with the layer number. Integrated
with our recursively embedded atom neural network model, this algorithm
demonstrates excellent strong scaling and weak scaling behaviors in several
benchmark systems. This approach enables massive molecular dynamics simulations
on MPNN models for hundreds of millions of atoms as fast as on strictly local
models, vastly extending the applicability of the MPNN potential to an
unprecedented scale. This general parallelization framework can empower various
MPNN models to efficiently simulate very large and complex systems.
[COMMENTS]
34 pages, 8 figures
[LINK]
http://arxiv.org/abs/2505.06711v2
[DATE]
2025-05-15 10:22:45+08:00
[CATEGORIES]
cs.LG
Fast and Robust: Task Sampling with Posterior and Diversity Synergies for Adaptive Decision-Makers in Randomized Environments
[AUTHORS]
Yun Qu, Qi Cheems Wang, Yixiu Mao, Yiqin Lv, Xiangyang Ji
[ABSTRACT]
Task robust adaptation is a long-standing pursuit in sequential
decision-making. Some risk-averse strategies, e.g., the conditional
value-at-risk principle, are incorporated in domain randomization or meta
reinforcement learning to prioritize difficult tasks in optimization, which
demand costly intensive evaluations. The efficiency issue prompts the
development of robust active task sampling to train adaptive policies, where
risk-predictive models are used to surrogate policy evaluation. This work
characterizes the optimization pipeline of robust active task sampling as a
Markov decision process, posits theoretical and practical insights, and
constitutes robustness concepts in risk-averse scenarios. Importantly, we
propose an easy-to-implement method, referred to as Posterior and Diversity
Synergized Task Sampling (PDTS), to accommodate fast and robust sequential
decision-making. Extensive experiments show that PDTS unlocks the potential of
robust active task sampling, significantly improves the zero-shot and few-shot
adaptation robustness in challenging tasks, and even accelerates the learning
process under certain scenarios. Our project website is at
https://thu-rllab.github.io/PDTS_project_page.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2504.19139v3
[DATE]
2025-05-15 09:51:26+08:00
[CATEGORIES]
cs.LG
Generating time-consistent dynamics with discriminator-guided image diffusion models
[AUTHORS]
Philipp Hess, Maximilian Gelbrecht, Christof Schötz, Michael Aich, Yu Huang, Shangshang Yang, Niklas Boers
[ABSTRACT]
Realistic temporal dynamics are crucial for many video generation, processing
and modelling applications, e.g. in computational fluid dynamics, weather
prediction, or long-term climate simulations. Video diffusion models (VDMs) are
the current state-of-the-art method for generating highly realistic dynamics.
However, training VDMs from scratch can be challenging and requires large
computational resources, limiting their wider application. Here, we propose a
time-consistency discriminator that enables pretrained image diffusion models
to generate realistic spatiotemporal dynamics. The discriminator guides the
sampling inference process and does not require extensions or finetuning of the
image diffusion model. We compare our approach against a VDM trained from
scratch on an idealized turbulence simulation and a real-world global
precipitation dataset. Our approach performs equally well in terms of temporal
consistency, shows improved uncertainty calibration and lower biases compared
to the VDM, and achieves stable centennial-scale climate simulations at daily
time steps.
[LINK]
http://arxiv.org/abs/2505.09089v2
[DATE]
2025-05-15 08:55:20+08:00
[CATEGORIES]
cs.LG
X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real
[AUTHORS]
Prithwish Dan, Kushal Kedia, Angela Chao, Edward Weiyi Duan, Maximus Adrian Pace, Wei-Chiu Ma, Sanjiban Choudhury
[ABSTRACT]
Human videos offer a scalable way to train robot manipulation policies, but
lack the action labels needed by standard imitation learning algorithms.
Existing cross-embodiment approaches try to map human motion to robot actions,
but often fail when the embodiments differ significantly. We propose X-Sim, a
real-to-sim-to-real framework that uses object motion as a dense and
transferable signal for learning robot policies. X-Sim starts by reconstructing
a photorealistic simulation from an RGBD human video and tracking object
trajectories to define object-centric rewards. These rewards are used to train
a reinforcement learning (RL) policy in simulation. The learned policy is then
distilled into an image-conditioned diffusion policy using synthetic rollouts
rendered with varied viewpoints and lighting. To transfer to the real world,
X-Sim introduces an online domain adaptation technique that aligns real and
simulated observations during deployment. Importantly, X-Sim does not require
any robot teleoperation data. We evaluate it across 5 manipulation tasks in 2
environments and show that it: (1) improves task progress by 30% on average
over hand-tracking and sim-to-real baselines, (2) matches behavior cloning with
10x less data collection time, and (3) generalizes to new camera viewpoints and
test-time changes. Code and videos are available at
https://portal-cornell.github.io/X-Sim/.
[LINK]
http://arxiv.org/abs/2505.07096v2
[DATE]
2025-05-15 08:43:19+08:00
[CATEGORIES]
cs.LG
Benchmarking Self-Supervised Learning Methods for Accelerated MRI Reconstruction
[AUTHORS]
Andrew Wang, Steven McDonagh, Mike Davies
[ABSTRACT]
Reconstructing MRI from highly undersampled measurements is crucial for
accelerating medical imaging, but is challenging due to the ill-posedness of
the inverse problem. While supervised deep learning (DL) approaches have shown
remarkable success, they traditionally rely on fully-sampled ground truth (GT)
images, which are expensive or impossible to obtain in real scenarios. This
problem has created a recent surge in interest in self-supervised learning
methods that do not require GT. Although recent methods are now fast
approaching “oracle” supervised performance, the lack of systematic comparison
and standard experimental setups is hindering targeted methodological research
and precluding widespread trustworthy industry adoption. We present SSIBench, a
modular and flexible comparison framework to unify and thoroughly benchmark
Self-Supervised Imaging methods (SSI) without GT. We evaluate 18 methods across
4 realistic MRI scenarios on real data, showing a wide performance landscape
whose method ranking differs across scenarios and metrics, exposing the need
for further SSI research. Our insights also show how complementary methods
could be compounded for future improvements, exemplified by a novel loss we
propose, Multi-Operator Equivariant Imaging. To accelerate reproducible
research and lower the barrier to entry, we provide the extensible benchmark
and open-source reimplementations of all methods at
https://andrewwango.github.io/ssibench, allowing researchers to rapidly and
fairly contribute and evaluate new methods on the standardised setup for
potential leaderboard ranking, or benchmark existing methods on custom
datasets, forward operators, or models, unlocking the application of SSI to
other valuable GT-free domains such as 4D MRI and other nascent scientific
imaging modalities.
[COMMENTS]
Preprint. Live benchmark site available at
https://andrewwango.github.io/ssibench
[LINK]
http://arxiv.org/abs/2502.14009v4
[DATE]
2025-05-15 08:10:29+08:00
[CATEGORIES]
cs.LG
BINGO: A Novel Pruning Mechanism to Reduce the Size of Neural Networks
[AUTHORS]
Aditya Panangat
[ABSTRACT]
Over the past decade, the use of machine learning has increased
exponentially. Models are far more complex than ever before, growing to
gargantuan sizes and housing millions of weights. Unfortunately, the fact that
large models have become the state of the art means that it often costs
millions of dollars to train and operate them. These expenses not only hurt
companies but also bar non-wealthy individuals from contributing to new
developments and force consumers to pay greater prices for AI. Current methods
used to prune models, such as iterative magnitude pruning, have shown great
accuracy but require an iterative training sequence that is incredibly
computationally and environmentally taxing. To solve this problem, BINGO is
introduced. BINGO, during the training pass, studies specific subsets of a
neural network one at a time to gauge how significant a role each weight
plays in contributing to a network’s accuracy. By the time training is done,
BINGO generates a significance score for each weight, allowing for
insignificant weights to be pruned in one shot. BINGO provides an
accuracy-preserving pruning technique that is less computationally intensive
than current methods, allowing for a world where AI growth does not have to
mean model growth, as well.
[COMMENTS]
6 pages, 0 figures, 2 tables
[LINK]
http://arxiv.org/abs/2505.09864v1
[DATE]
2025-05-15 08:00:19+08:00
[CATEGORIES]
cs.LG
LiDDA: Data Driven Attribution at LinkedIn
[AUTHORS]
John Bencina, Erkut Aykutlug, Yue Chen, Zerui Zhang, Stephanie Sorenson, Shao Tang, Changshuai Wei
[ABSTRACT]
Data Driven Attribution, which assigns conversion credits to marketing
interactions based on causal patterns learned from data, is the foundation of
modern marketing intelligence and vital to any marketing business and
advertising platform. In this paper, we introduce a unified transformer-based
attribution approach that can handle member-level data, aggregate-level data,
and integration of external macro factors. We detail the large scale
implementation of the approach at LinkedIn, showcasing significant impact. We
also share learning and insights that are broadly applicable to the marketing
and ad tech fields.
[LINK]
http://arxiv.org/abs/2505.09861v1
[DATE]
2025-05-15 07:54:57+08:00
[CATEGORIES]
cs.LG
ZENN: A Thermodynamics-Inspired Computational Framework for Heterogeneous Data-Driven Modeling
[AUTHORS]
Shun Wang, Shun-Li Shang, Zi-Kui Liu, Wenrui Hao
[ABSTRACT]
Traditional entropy-based methods, such as cross-entropy loss in
classification problems, have long been essential tools for quantifying
uncertainty and disorder in data and developing artificial intelligence
algorithms. However, the rapid growth of data across various domains has
introduced new challenges, particularly the integration of heterogeneous
datasets with intrinsic disparities. In this paper, we extend zentropy theory
into the data science domain by introducing intrinsic entropy, enabling more
effective learning from heterogeneous data sources. We propose a
zentropy-enhanced neural network (ZENN) that simultaneously learns both energy
and intrinsic entropy components, capturing the underlying structure of
multi-source data. To support this, we redesign the neural network architecture
to better reflect the intrinsic properties and variability inherent in diverse
datasets. We demonstrate the effectiveness of ZENN on classification tasks and
energy landscape reconstructions, showing its superior generalization
capabilities and robustness, particularly in predicting high-order derivatives.
As a practical application, we employ ZENN to reconstruct the Helmholtz energy
landscape of Fe3Pt using data generated from DFT and capture key material
behaviors, including negative thermal expansion and the critical point in the
temperature-pressure space. Overall, our study introduces a novel approach for
data-driven machine learning grounded in zentropy theory, highlighting ZENN as
a versatile and robust deep learning framework for scientific problems
involving complex, heterogeneous datasets.
[COMMENTS]
9 pages, 4 figures
[LINK]
http://arxiv.org/abs/2505.09851v1
[DATE]
2025-05-15 07:23:28+08:00
[CATEGORIES]
cs.LG
Radiogenomic Bipartite Graph Representation Learning for Alzheimer’s Disease Detection
[AUTHORS]
Aditya Raj, Golrokh Mirzaei
[ABSTRACT]
Imaging and genomic data offer distinct and rich features, and their
integration can unveil new insights into the complex landscape of diseases. In
this study, we present a novel approach utilizing radiogenomic data including
structural MRI images and gene expression data, for Alzheimer’s disease
detection. Our framework introduces a novel heterogeneous bipartite graph
representation learning featuring two distinct node types: genes and images.
The network can effectively classify Alzheimer’s disease (AD) into three
distinct stages: AD, Mild Cognitive Impairment (MCI), and Cognitive Normal (CN)
classes, utilizing a small dataset. Additionally, it identified which genes
play a significant role in each of these classification groups. We evaluate the
performance of our approach using metrics including classification accuracy,
recall, precision, and F1 score. The proposed technique holds potential for
extending radiogenomic-based classification to other diseases.
[COMMENTS]
11 pages
[LINK]
http://arxiv.org/abs/2505.09848v1
[DATE]
2025-05-15 07:13:35+08:00
[CATEGORIES]
cs.LG
Causal Predictive Optimization and Generation for Business AI
[AUTHORS]
Liyang Zhao, Olurotimi Seton, Himadeep Reddy Reddivari, Suvendu Jena, Shadow Zhao, Rachit Kumar, Changshuai Wei
[ABSTRACT]
The sales process involves sales functions converting leads or opportunities
to customers and selling more products to existing customers. The optimization
of the sales process is thus key to the success of any B2B business. In this work,
we introduce a principled approach to sales optimization and business AI,
namely the Causal Predictive Optimization and Generation, which includes three
layers: 1) a prediction layer with causal ML, 2) an optimization layer with
constraint optimization and contextual bandits, and 3) a serving layer with
Generative AI and a feedback loop for system enhancement. We detail the implementation and
deployment of the system at LinkedIn, showcasing significant wins over legacy
systems and sharing learning and insight broadly applicable to this field.
[LINK]
http://arxiv.org/abs/2505.09847v1
[DATE]
2025-05-15 07:12:20+08:00
[CATEGORIES]
cs.LG
Noise Sensitivity and Learning Lower Bounds for Hierarchical Functions
[AUTHORS]
Rupert Li, Elchanan Mossel
[ABSTRACT]
Recent works explore deep learning’s success by examining functions or data
with hierarchical structure. To study the learning complexity of functions with
hierarchical structure, we study the noise stability of functions with tree
hierarchical structure on independent inputs. We show that if each function in
the hierarchy is $\varepsilon$-far from linear, the noise stability is
exponentially small in the depth of the hierarchy.
Our results have immediate applications for learning. In the Boolean setting,
using the results of Dachman-Soled, Feldman, Tan, Wan and Wimmer (2014), our
results provide Statistical Query super-polynomial lower bounds for learning
classes that are based on hierarchical functions. Similarly, using the results
of Diakonikolas, Kane, Pittas and Zarifis (2021), our results provide
super-polynomial lower bounds for SQ learning under the Gaussian measure. Using
the results of Abbe, Bengio, Cornacchia, Kleinberg, Lotfi, Raghu and Zhang
(2022), our results imply sample complexity lower bounds for learning
hierarchical functions with gradient descent on fully connected neural
networks.
[COMMENTS]
18 pages
[LINK]
http://arxiv.org/abs/2502.05073v2
[DATE]
2025-05-15 06:45:07+08:00
[CATEGORIES]
cs.LG
Integrating Protein Sequence and Expression Level to Analysis Molecular Characterization of Breast Cancer Subtypes
[AUTHORS]
Hossein Sholehrasa
[ABSTRACT]
Breast cancer’s complexity and variability pose significant challenges in
understanding its progression and guiding effective treatment. This study aims
to integrate protein sequence data with expression levels to improve the
molecular characterization of breast cancer subtypes and predict clinical
outcomes. Using ProtGPT2, a language model designed for protein sequences, we
generated embeddings that capture the functional and structural properties of
protein sequences. These embeddings were integrated with protein expression
levels to form enriched biological representations, which were analyzed using
machine learning methods like ensemble K-means for clustering and XGBoost for
classification. Our approach enabled successful clustering of patients into
biologically distinct groups and accurately predicted clinical outcomes such as
survival and biomarker status, achieving high performance metrics, notably an
F1 score of 0.88 for survival and 0.87 for biomarker status prediction.
Feature importance analysis identified KMT2C, CLASP2, and MYO1B as key proteins
involved in hormone signaling, cytoskeletal remodeling, and therapy resistance
in hormone receptor-positive and triple-negative breast cancer, with potential
influence on breast cancer subtype behavior and progression. Furthermore,
protein-protein interaction networks and correlation analyses revealed
functional interdependencies among proteins that may influence breast cancer
subtype behavior and progression. These findings suggest that integrating
protein sequence and expression data provides valuable insights into tumor
biology and has significant potential to enhance personalized treatment
strategies in breast cancer care.
[LINK]
http://arxiv.org/abs/2410.01755v2
[DATE]
2025-05-15 05:57:11+08:00
[CATEGORIES]
cs.LG
Learning Kronecker-Structured Graphs from Smooth Signals
[AUTHORS]
Changhao Shi, Gal Mishne
[ABSTRACT]
Graph learning, or network inference, is a prominent problem in graph signal
processing (GSP). GSP generalizes the Fourier transform to non-Euclidean
domains, and graph learning is pivotal to applying GSP when these domains are
unknown. With the recent prevalence of multi-way data, there has been growing
interest in product graphs that naturally factorize dependencies across
different ways. However, the types of graph products that can be learned are
still limited for modeling diverse dependency structures. In this paper, we
study the problem of learning a Kronecker-structured product graph from smooth
signals. Unlike the more commonly used Cartesian product, the Kronecker product
models dependencies in a more intricate, non-separable way, but imposes harder
constraints on the graph learning problem. To tackle this non-convex problem,
we propose an alternating scheme to optimize each factor graph and provide
theoretical guarantees for its asymptotic convergence. The proposed algorithm
is also modified to learn factor graphs of the strong product. We conduct
experiments on synthetic and real-world graphs and demonstrate our approach’s
efficacy and superior performance compared to existing methods.
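To make the product-graph terminology concrete, the following is a minimal sketch (not the authors' learning algorithm) of how the Cartesian, Kronecker, and strong products of two factor graphs can be built from their adjacency matrices. The helper names and the two-node path graph are illustrative assumptions; only the standard definitions of the three products are used.

```python
# Illustrative sketch: product-graph adjacency matrices from two factor graphs.
# All function names here are hypothetical helpers, not taken from the paper.

def kron(a, b):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def identity(n):
    """n x n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def add(x, y):
    """Element-wise sum of two matrices of equal shape."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def cartesian_product_adj(a, b):
    """Cartesian product: A (x) I + I (x) B (move along one factor at a time)."""
    return add(kron(a, identity(len(b))), kron(identity(len(a)), b))

def kronecker_product_adj(a, b):
    """Kronecker product: A (x) B (move along both factors simultaneously)."""
    return kron(a, b)

def strong_product_adj(a, b):
    """Strong product: union of the Cartesian and Kronecker edge sets."""
    return add(cartesian_product_adj(a, b), kronecker_product_adj(a, b))

# Two-node path graph used as both factors (an assumed toy example).
path2 = [[0, 1], [1, 0]]
cartesian = cartesian_product_adj(path2, path2)   # a 4-cycle
kronecker = kronecker_product_adj(path2, path2)   # two disjoint edges
strong = strong_product_adj(path2, path2)         # the complete graph on 4 nodes
```

In this toy case the Kronecker product contains only the "diagonal" edges that the Cartesian product lacks, which illustrates why the two products encode different dependency structures.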
[LINK]
http://arxiv.org/abs/2505.09822v1
[DATE]
2025-05-15 05:53:37+08:00
[CATEGORIES]
cs.LG
Adversarial Attack on Large Language Models using Exponentiated Gradient Descent
[AUTHORS]
Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu
[ABSTRACT]
As Large Language Models (LLMs) are widely used, understanding them
systematically is key to improving their safety and realizing their full
potential. Although many models are aligned using techniques such as
reinforcement learning from human feedback (RLHF), they are still vulnerable to
jailbreaking attacks. Some of the existing adversarial attack methods search
for discrete tokens that may jailbreak a target model while others try to
optimize the continuous space represented by the tokens of the model’s
vocabulary. While techniques based on the discrete space may prove to be
inefficient, optimization of continuous token embeddings requires projections
to produce discrete tokens, which might render them ineffective. To fully
utilize the constraints and the structures of the space, we develop an
intrinsic optimization technique using exponentiated gradient descent with the
Bregman projection method to ensure that the optimized one-hot encoding always
stays within the probability simplex. We prove the convergence of the technique
and implement an efficient algorithm that is effective in jailbreaking several
widely used LLMs. We demonstrate the efficacy of the proposed technique using
five open-source LLMs on four openly available datasets. The results show that
the technique achieves a higher success rate with great efficiency compared to
three other state-of-the-art jailbreaking techniques. The source code for our
implementation is available at:
https://github.com/sbamit/Exponentiated-Gradient-Descent-LLM-Attack
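As a rough illustration of why exponentiated gradient descent keeps iterates on the probability simplex, here is a minimal sketch on a toy linear objective. It is not the authors' attack: the cost vector, step size, and function names are assumptions chosen for illustration, and the normalization step plays the role of the Bregman (KL) projection.

```python
import math

def eg_step(x, grad, lr):
    """One exponentiated-gradient step: multiplicative update, then renormalize.

    The multiplicative form keeps every coordinate non-negative, and the
    normalization returns the point to the simplex (entries sum to 1).
    """
    weights = [xi * math.exp(-lr * g) for xi, g in zip(x, grad)]
    total = sum(weights)
    return [w / total for w in weights]

def minimize_linear(cost, steps=200, lr=0.5):
    """Minimize <cost, x> over the simplex, starting from the uniform point.

    The gradient of a linear objective is the cost vector itself.
    """
    x = [1.0 / len(cost)] * len(cost)
    for _ in range(steps):
        x = eg_step(x, cost, lr)
    return x

# Toy cost vector (hypothetical): the mass should concentrate on index 1.
solution = minimize_linear([3.0, 1.0, 2.0])
```

Because no explicit clipping or Euclidean projection is needed, the iterate remains a valid distribution over vocabulary entries at every step, which is the property the abstract highlights.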
[COMMENTS]
Accepted to International Joint Conference on Neural Networks (IJCNN)
2025
[LINK]
http://arxiv.org/abs/2505.09820v1
[DATE]
2025-05-15 05:50:46+08:00
[CATEGORIES]
cs.LG
cs.CL
Visual Feedback of Pattern Separability Improves Myoelectric Decoding Performance of Upper Limb Prostheses
[AUTHORS]
Ruichen Yang, György M. Lévay, Christopher L. Hunt, Dániel Czeiner, Megan C. Hodgson, Damini Agarwal, Rahul R. Kaliki, Nitish V. Thakor
[ABSTRACT]
State-of-the-art upper limb myoelectric prostheses often use pattern
recognition (PR) control systems that translate electromyography (EMG) signals
into desired movements. As prosthesis movement complexity increases, users
often struggle to produce sufficiently distinct EMG patterns for reliable
classification. Existing training typically involves heuristic, trial-and-error
user adjustments to static decoder boundaries. Goal: We introduce the Reviewer,
a 3D visual interface projecting EMG signals directly into the decoder’s
classification space, providing intuitive, real-time insight into PR algorithm
behavior. This structured feedback reduces cognitive load and fosters mutual,
data-driven adaptation between user-generated EMG patterns and decoder
boundaries. Methods: A 10-session study with 12 able-bodied participants
compared PR performance after motor-based training and updating using the
Reviewer versus conventional virtual arm visualization. Performance was
assessed using a Fitts' law task that involved the aperture of the cursor and
the control of orientation. Results: Participants trained with the Reviewer
achieved higher completion rates, reduced overshoot, and improved path
efficiency and throughput compared to the standard visualization group.
Significance: The Reviewer introduces decoder-informed motor training,
facilitating immediate and consistent PR-based myoelectric control
improvements. By iteratively refining control through real-time feedback, this
approach reduces reliance on trial-and-error recalibration, enabling a more
adaptive, self-correcting training framework. Conclusion: The 3D visual
feedback significantly improves PR control in novice operators through
structured training, enabling feedback-driven adaptation and reducing reliance
on extensive heuristic adjustments.
[LINK]
http://arxiv.org/abs/2505.09819v1
[DATE]
2025-05-15 05:47:28+08:00
[CATEGORIES]
cs.LG
Heterogeneous graph neural networks for species distribution modeling
[AUTHORS]
Lauren Harrell, Christine Kaeser-Chen, Burcu Karagol Ayan, Keith Anderson, Michelangelo Conserva, Elise Kleeman, Maxim Neumann, Matt Overlan, Melissa Chapman, Drew Purves
[ABSTRACT]
Species distribution models (SDMs) are necessary for measuring and predicting
occurrences and habitat suitability of species and their relationship with
environmental factors. We introduce a novel presence-only SDM with graph neural
networks (GNN). In our model, species and locations are treated as two distinct
node sets, and the learning task is predicting detection records as the edges
that connect locations to species. Using GNN for SDM allows us to model
fine-grained interactions between species and the environment. We evaluate the
potential of this methodology on the six-region dataset compiled by the National
Center for Ecological Analysis and Synthesis (NCEAS) for benchmarking SDMs. For
each of the regions, the heterogeneous GNN model is comparable to or
outperforms previously-benchmarked single-species SDMs as well as a
feed-forward neural network baseline model.
[COMMENTS]
13 pages, 3 figures
[LINK]
http://arxiv.org/abs/2503.11900v3
[DATE]
2025-05-15 05:32:38+08:00
[CATEGORIES]
cs.LG
Comparative Analysis of Stroke Prediction Models Using Machine Learning
[AUTHORS]
Anastasija Tashkova, Stefan Eftimov, Bojan Ristov, Slobodan Kalajdziski
[ABSTRACT]
Stroke remains one of the most critical global health challenges, ranking as
the second leading cause of death and the third leading cause of disability
worldwide. This study explores the effectiveness of machine learning algorithms
in predicting stroke risk using demographic, clinical, and lifestyle data from
the Stroke Prediction Dataset. By addressing key methodological challenges such
as class imbalance and missing data, we evaluated the performance of multiple
models, including Logistic Regression, Random Forest, and XGBoost. Our results
demonstrate that while these models achieve high accuracy, sensitivity remains
a limiting factor for real-world clinical applications. In addition, we
identify the most influential predictive features and propose strategies to
improve machine learning-based stroke prediction. These findings contribute to
the development of more reliable and interpretable models for the early
assessment of stroke risk.
[LINK]
http://arxiv.org/abs/2505.09812v1
[DATE]
2025-05-15 05:27:19+08:00
[CATEGORIES]
cs.LG
Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models
[AUTHORS]
Aditya Nagori, Ayush Gautam, Matthew O. Wiens, Vuong Nguyen, Nathan Kenya Mugisha, Jerome Kabakyenga, Niranjan Kissoon, John Mark Ansermino, Rishikesan Kamaleswaran
[ABSTRACT]
Clustering patient subgroups is essential for personalized care and efficient
resource use. Traditional clustering methods struggle with high-dimensional,
heterogeneous healthcare data and lack contextual understanding. This study
evaluates Large Language Model (LLM) based clustering against classical methods
using a pediatric sepsis dataset from a low-income country (LIC), containing
2,686 records with 28 numerical and 119 categorical variables. Patient records
were serialized into text with and without a clustering objective. Embeddings
were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with
low-rank adaptation (LoRA), and Stella-En-400M-V5 models. K-means clustering was
applied to these embeddings. Classical comparisons included K-Medoids
clustering on UMAP and FAMD-reduced mixed data. Silhouette scores and
statistical tests evaluated cluster quality and distinctiveness.
Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B
with the clustering objective performed better with a higher number of clusters,
identifying subgroups with distinct nutritional, clinical, and socioeconomic
profiles. LLM-based methods outperformed classical techniques by capturing
richer context and prioritizing key features. These results highlight the potential
of LLMs for contextual phenotyping and informed decision-making in
resource-limited settings.
[COMMENTS]
11 pages, 2 Figures, 1 Table
[LINK]
http://arxiv.org/abs/2505.09805v1
[DATE]
2025-05-15 05:05:40+08:00
[CATEGORIES]
cs.LG
LatticeVision: Image to Image Networks for Modeling Non-Stationary Spatial Data
[AUTHORS]
Antony Sikorski, Michael Ivanitskiy, Nathan Lenssen, Douglas Nychka, Daniel McKenzie
[ABSTRACT]
In many scientific and industrial applications, we are given a handful of
instances (a ‘small ensemble’) of a spatially distributed quantity (a ‘field’)
but would like to acquire many more. For example, a large ensemble of global
temperature sensitivity fields from a climate model can help farmers, insurers,
and governments plan appropriately. When acquiring more data is prohibitively
expensive – as is the case with climate models – statistical emulation offers
an efficient alternative for simulating synthetic yet realistic fields.
However, parameter inference using maximum likelihood estimation (MLE) is
computationally prohibitive, especially for large, non-stationary fields. Thus,
many recent works train neural networks to estimate parameters given spatial
fields as input, sidestepping MLE completely. In this work we focus on a
popular class of parametric, spatially autoregressive (SAR) models. We make a
simple yet impactful observation: because the SAR parameters can be arranged on
a regular grid, both inputs (spatial fields) and outputs (model parameters) can
be viewed as images. Using this insight, we demonstrate that image-to-image
(I2I) networks enable faster and more accurate parameter estimation for a class
of non-stationary SAR models with unprecedented complexity.
[LINK]
http://arxiv.org/abs/2505.09803v1
[DATE]
2025-05-15 04:59:10+08:00
[CATEGORIES]
cs.LG
Ontology-Based Structuring and Analysis of North Macedonian Public Procurement Contracts
[AUTHORS]
Bojan Ristov, Stefan Eftimov, Milena Trajanoska, Dimitar Trajanov
[ABSTRACT]
Public procurement plays a critical role in government operations, ensuring
the efficient allocation of resources and fostering economic growth. However,
traditional procurement data is often stored in rigid, tabular formats,
limiting its analytical potential and hindering transparency. This research
presents a methodological framework for transforming structured procurement
data into a semantic knowledge graph, leveraging ontological modeling and
automated data transformation techniques. By integrating RDF and SPARQL-based
querying, the system enhances the accessibility and interpretability of
procurement records, enabling complex semantic queries and advanced analytics.
Furthermore, by incorporating machine learning-driven predictive modeling, the
system extends beyond conventional data analysis, offering insights into
procurement trends and risk assessment. This work contributes to the broader
field of public procurement intelligence by improving data transparency,
supporting evidence-based decision-making, and enabling in-depth analysis of
procurement activities in North Macedonia.
[LINK]
http://arxiv.org/abs/2505.09798v1
[DATE]
2025-05-15 04:51:26+08:00
[CATEGORIES]
cs.LG
Interim Report on Human-Guided Adaptive Hyperparameter Optimization with Multi-Fidelity Sprints
[AUTHORS]
Michael Kamfonas
[ABSTRACT]
This case study applies a phased hyperparameter optimization process to
compare multitask natural language model variants that utilize multiphase
learning rate scheduling and optimizer parameter grouping. We employ short
Bayesian optimization sessions that leverage multi-fidelity, hyperparameter
space pruning, progressive halving, and a degree of human guidance. We utilize
the Optuna TPE sampler and Hyperband pruner, as well as the Scikit-Learn
Gaussian process minimization. Initially, we use efficient low-fidelity sprints
to prune the hyperparameter space. Subsequent sprints progressively increase
their model fidelity and employ hyperband pruning for efficiency. A second
aspect of our approach is using a meta-learner to tune threshold values to
resolve classification probabilities during inference. We demonstrate our
method on a collection of variants of the 2021 Joint Entity and Relation
Extraction model proposed by Eberts and Ulges.
[LINK]
http://arxiv.org/abs/2505.09792v1
[DATE]
2025-05-15 04:38:44+08:00
[CATEGORIES]
cs.CL
cs.LG
Aligning Transformers with Continuous Feedback via Energy Rank Alignment
[AUTHORS]
Shriram Chennakesavalu, Frank Hu, Sebastian Ibarraran, Grant M. Rotskoff
[ABSTRACT]
Searching through chemical space is an exceptionally challenging problem
because the number of possible molecules grows combinatorially with the number
of atoms. Large, autoregressive models trained on databases of chemical
compounds have yielded powerful generators, but we still lack robust strategies
for generating molecules with desired properties. This molecular search problem
closely resembles the “alignment” problem for large language models, though for
many chemical tasks we have a specific and easily evaluable reward function.
Here, we introduce an algorithm called energy rank alignment (ERA) that
leverages an explicit reward function to produce a gradient-based objective
that we use to optimize autoregressive policies. We show theoretically that
this algorithm is closely related to proximal policy optimization (PPO) and
direct preference optimization (DPO), but has a minimizer that converges to an
ideal Gibbs-Boltzmann distribution with the reward playing the role of an
energy function. Furthermore, this algorithm is highly scalable, does not
require reinforcement learning, and performs well relative to DPO when the
number of preference observations per pairing is small. We deploy this approach
to align molecular transformers and protein language models to generate
molecules and protein sequences, respectively, with externally specified
properties and find that it does so robustly, searching through diverse parts
of chemical space.
[LINK]
http://arxiv.org/abs/2405.12961v2
[DATE]
2025-05-15 04:23:34+08:00
[CATEGORIES]
cs.LG
Pure Component Property Estimation Framework Using Explainable Machine Learning Methods
[AUTHORS]
Jianfeng Jiao, Xi Gao, Jie Li
[ABSTRACT]
Accurate prediction of pure component physicochemical properties is crucial
for process integration, multiscale modeling, and optimization. In this work,
an enhanced framework for pure component property prediction by using
explainable machine learning methods is proposed. In this framework, the
molecular representation method based on the connectivity matrix effectively
considers atomic bonding relationships to automatically generate features. The
supervised machine learning model random forest is applied for feature ranking
and pooling. The adjusted R2 is introduced to penalize the inclusion of
additional features, providing an assessment of the true contribution of
features. The prediction results for normal boiling point (Tb), liquid molar
volume, critical temperature (Tc) and critical pressure (Pc) obtained using
Artificial Neural Network and Gaussian Process Regression models confirm the
accuracy of the molecular representation method. Comparison with GC-based
models shows that the root-mean-square error on the test set can be reduced by
up to 83.8%. To enhance the interpretability of the model, a feature analysis
method based on Shapley values is employed to determine the contribution of
each feature to the property predictions. The results indicate that using the
feature pooling method reduces the number of features from 13316 to 100 without
compromising model accuracy. The feature analysis results for Tb, Tc, and Pc
confirm that different molecular properties are influenced by different
structural features, aligning with mechanistic interpretations. In conclusion,
the proposed framework is demonstrated to be feasible and provides a solid
foundation for mixture component reconstruction and process integration
modelling.
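For readers unfamiliar with the adjusted R2 mentioned above, the following sketch shows the standard textbook definition, which discounts R2 as more features are added. Whether the paper uses exactly this form is an assumption; the function names and sample numbers are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    The penalty grows with the number of features p, so adding a feature
    only raises the score if it improves the fit by more than chance would.
    """
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Hypothetical numbers: the same raw R^2 is worth less with more features.
few_features = adjusted_r_squared(0.9, n_samples=101, n_features=10)
many_features = adjusted_r_squared(0.9, n_samples=101, n_features=50)
```

With the same raw R2 of 0.9, the 50-feature model receives a lower adjusted score than the 10-feature model, which is the penalizing behaviour the abstract relies on for feature pooling.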
[LINK]
http://arxiv.org/abs/2505.09783v1
[DATE]
2025-05-15 04:21:23+08:00
[CATEGORIES]
cs.LG
Centrally Coordinated Multi-Agent Reinforcement Learning for Power Grid Topology Control
[AUTHORS]
Barbera de Mol, Davide Barbieri, Jan Viebahn, Davide Grossi
[ABSTRACT]
Power grid operation is becoming more complex due to the increase in
generation of renewable energy. The recent series of Learning To Run a Power
Network (L2RPN) competitions have encouraged the use of artificial agents to
assist human dispatchers in operating power grids. However, the combinatorial
nature of the action space poses a challenge to both conventional optimizers
and learned controllers. Action space factorization, which breaks down
decision-making into smaller sub-tasks, is one approach to tackle the curse of
dimensionality. In this study, we propose a centrally coordinated multi-agent
(CCMA) architecture for action space factorization. In this approach, regional
agents propose actions and subsequently a coordinating agent selects the final
action. We investigate several implementations of the CCMA architecture, and
benchmark in different experimental settings against various L2RPN baseline
approaches. The CCMA architecture exhibits higher sample efficiency and
better final performance than the baseline approaches. The results suggest
high potential of the CCMA approach for further application in
higher-dimensional L2RPN as well as real-world power grid settings.
[COMMENTS]
Accepted version to The 16th ACM International Conference on Future
and Sustainable Energy Systems. The final published version is available at
10.1145/3679240.3734602
[LINK]
http://arxiv.org/abs/2502.08681v2
[DATE]
2025-05-15 04:06:33+08:00
[CATEGORIES]
cs.LG
Mechanisms of Projective Composition of Diffusion Models
[AUTHORS]
Arwen Bradley, Preetum Nakkiran, David Berthelot, James Thornton, Joshua M. Susskind
[ABSTRACT]
We study the theoretical foundations of composition in diffusion models, with
a particular focus on out-of-distribution extrapolation and
length-generalization. Prior work has shown that composing distributions via
linear score combination can achieve promising results, including
length-generalization in some cases (Du et al., 2023; Liu et al., 2022).
However, our theoretical understanding of how and why such compositions work
remains incomplete. In fact, it is not even entirely clear what it means for
composition to “work”. This paper starts to address these fundamental gaps. We
begin by precisely defining one possible desired result of composition, which
we call projective composition. Then, we investigate: (1) when linear score
combinations provably achieve projective composition, (2) whether
reverse-diffusion sampling can generate the desired composition, and (3) the
conditions under which composition fails. We connect our theoretical analysis
to prior empirical observations where composition has either worked or failed,
for reasons that were unclear at the time. Finally, we propose a simple
heuristic to help predict the success or failure of new compositions.
[COMMENTS]
10 pages, 8 figures. The first two authors contributed equally
[LINK]
http://arxiv.org/abs/2502.04549v2
[DATE]
2025-05-15 04:06:09+08:00
[CATEGORIES]
cs.LG
Self-Consuming Generative Models with Adversarially Curated Data
[AUTHORS]
Xiukun Wei, Xueru Zhang
[ABSTRACT]
Recent advances in generative models have made it increasingly difficult to
distinguish real data from model-generated synthetic data. Using synthetic data
for successive training of future model generations creates “self-consuming
loops”, which may lead to model collapse or training instability. Furthermore,
synthetic data is often subject to human feedback and curated by users based on
their preferences. Ferbach et al. (2024) recently showed that when data is
curated according to user preferences, the self-consuming retraining loop
drives the model to converge toward a distribution that optimizes those
preferences. However, in practice, data curation is often noisy or
adversarially manipulated. For example, competing platforms may recruit
malicious users to adversarially curate data and disrupt rival models. In this
paper, we study how generative models evolve under self-consuming retraining
loops with noisy and adversarially curated data. We theoretically analyze the
impact of such noisy data curation on generative models and identify conditions
for the robustness of the retraining process. Building on this analysis, we
design attack algorithms for competitive adversarial scenarios, where a
platform with a limited budget employs malicious users to misalign a rival’s
model from actual user preferences. Experiments on both synthetic and
real-world datasets demonstrate the effectiveness of the proposed algorithms.
[LINK]
http://arxiv.org/abs/2505.09768v1
[DATE]
2025-05-15 03:54:55+08:00
[CATEGORIES]
cs.LG
Exploring Best Practices for ECG Pre-Processing in Machine Learning
[AUTHORS]
Amir Salimi, Sunil Vasu Kalmady, Abram Hindle, Osmar Zaiane, Padma Kaul
[ABSTRACT]
In this work we search for best practices in pre-processing of
Electrocardiogram (ECG) signals in order to train better classifiers for the
diagnosis of heart conditions. State-of-the-art machine learning algorithms
have achieved remarkable results in classification of some heart conditions
using ECG data, yet there appears to be no consensus on pre-processing best
practices. Is this lack of consensus due to different conditions and
architectures requiring different processing steps for optimal performance? Is
it possible that state-of-the-art deep-learning models have rendered
pre-processing unnecessary? In this work we apply down-sampling, normalization,
and filtering functions to 3 different multi-label ECG datasets and measure
their effects on 3 different high-performing time-series classifiers. We find
that sampling rates as low as 50 Hz can yield comparable results to the commonly
used 500 Hz. This is significant as smaller sampling rates will result in
smaller datasets and models, which require less time and resources to train.
Additionally, despite their common usage, we found min-max normalization to be
slightly detrimental overall, and band-passing to make no measurable
difference. We found the blind approach to pre-processing of ECGs for
multi-label classification to be ineffective, with the exception of sample rate
reduction which reliably reduces computational resources, but does not increase
accuracy.
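The two simplest pre-processing steps discussed above can be sketched as follows. This is an illustrative toy, not the authors' pipeline: the function names are hypothetical, and the down-sampling shown is naive decimation without an anti-aliasing filter, which a real pipeline would normally add.

```python
def downsample(signal, factor):
    """Keep every `factor`-th sample (naive decimation, no low-pass filter)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return signal[::factor]

def min_max_normalize(signal):
    """Rescale a signal linearly so its values lie in [0, 1]."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        # A constant signal has no range to rescale; map it to zeros.
        return [0.0] * len(signal)
    return [(v - lo) / (hi - lo) for v in signal]

# Ten seconds of a synthetic 500 Hz trace reduced to 50 Hz (factor of 10).
trace_500hz = [float(i % 7) for i in range(5000)]
trace_50hz = downsample(trace_500hz, 10)
normalized = min_max_normalize(trace_50hz)
```

Reducing 500 Hz to 50 Hz shrinks each recording tenfold, which is the source of the compute savings the abstract describes.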
[LINK]
http://arxiv.org/abs/2311.04229v2
[DATE]
2025-05-15 03:49:48+08:00
[CATEGORIES]
cs.LG
Community-based Multi-Agent Reinforcement Learning with Transfer and Active Exploration
[AUTHORS]
Zhaoyang Shi
[ABSTRACT]
We propose a new framework for multi-agent reinforcement learning (MARL),
where the agents cooperate in a time-evolving network with latent community
structures and mixed memberships. Unlike traditional neighbor-based or fixed
interaction graphs, our community-based framework captures flexible and
abstract coordination patterns by allowing each agent to belong to multiple
overlapping communities. Each community maintains shared policy and value
functions, which are aggregated by individual agents according to personalized
membership weights. We also design actor-critic algorithms that exploit this
structure: agents inherit community-level estimates for policy updates and
value learning, enabling structured information sharing without requiring
access to other agents’ policies. Importantly, our approach supports both
transfer learning by adapting to new agents or tasks via membership estimation,
and active learning by prioritizing uncertain communities during exploration.
Theoretically, we establish convergence guarantees under linear function
approximation for both actor and critic updates. To our knowledge, this is the
first MARL framework that integrates community structure, transferability, and
active learning with provable guarantees.
[LINK]
http://arxiv.org/abs/2505.09756v1
[DATE]
2025-05-15 03:42:43+08:00
[CATEGORIES]
cs.LG
On-Robot Reinforcement Learning with Goal-Contrastive Rewards
[AUTHORS]
Ondrej Biza, Thomas Weng, Lingfeng Sun, Karl Schmeckpeper, Tarik Kelestemur, Yecheng Jason Ma, Robert Platt, Jan-Willem van de Meent, Lawson L. S. Wong
[ABSTRACT]
Reinforcement Learning (RL) has the potential to enable robots to learn from
their own actions in the real world. Unfortunately, RL can be prohibitively
expensive, in terms of on-robot runtime, due to inefficient exploration when
learning from a sparse reward signal. Designing dense reward functions is
labour-intensive and requires domain expertise. In our work, we propose GCR
(Goal-Contrastive Rewards), a dense reward function learning method that can be
trained on passive video demonstrations. By using videos without actions, our
method is easier to scale, as we can use arbitrary videos. GCR combines two
loss functions, an implicit value loss function that models how the reward
increases when traversing a successful trajectory, and a goal-contrastive loss
that discriminates between successful and failed trajectories. We perform
experiments in simulated manipulation environments across RoboMimic and
MimicGen tasks, as well as in the real world using a Franka arm and a Spot
quadruped. We find that GCR leads to more sample-efficient RL, enabling
model-free RL to solve about twice as many tasks as our baseline reward
learning methods. We also demonstrate positive cross-embodiment transfer from
videos of people and of other robots performing a task. Website:
https://gcr-robot.github.io/.
[LINK]
http://arxiv.org/abs/2410.19989v2
[DATE]
2025-05-15 03:20:06+08:00
[CATEGORIES]
cs.LG
Learning Multi-Attribute Differential Graphs with Non-Convex Penalties
[AUTHORS]
Jitendra K Tugnait
[ABSTRACT]
We consider the problem of estimating differences in two multi-attribute
Gaussian graphical models (GGMs) which are known to have similar structure,
using a penalized D-trace loss function with non-convex penalties. The GGM
structure is encoded in its precision (inverse covariance) matrix. Existing
methods for multi-attribute differential graph estimation are based on a group
lasso penalized loss function. In this paper, we consider a penalized D-trace
loss function with non-convex (log-sum and smoothly clipped absolute deviation
(SCAD)) penalties. Two proximal gradient descent methods are presented to
optimize the objective function. Theoretical analysis establishing sufficient
conditions for consistency in support recovery, convexity and estimation in
high-dimensional settings is provided. We illustrate our approaches with
numerical examples based on synthetic and real data.
[COMMENTS]
14 pages, 1 figure, 2 tables, published in IEEE Access, pp.
67065-67078, 2025
[LINK]
http://arxiv.org/abs/2505.09748v1
[DATE]
2025-05-15 03:19:09+08:00
[CATEGORIES]
cs.LG
Inductive Moment Matching
[AUTHORS]
Linqi Zhou, Stefano Ermon, Jiaming Song
[ABSTRACT]
Diffusion models and Flow Matching generate high-quality samples but are slow
at inference, and distilling them into few-step models often leads to
instability and extensive tuning. To resolve these trade-offs, we propose
Inductive Moment Matching (IMM), a new class of generative models for one- or
few-step sampling with a single-stage training procedure. Unlike distillation,
IMM does not require pre-training initialization and optimization of two
networks; and unlike Consistency Models, IMM guarantees distribution-level
convergence and remains stable under various hyperparameters and standard model
architectures. IMM surpasses diffusion models on ImageNet-256x256 with 1.99 FID
using only 8 inference steps and achieves state-of-the-art 2-step FID of 1.98
on CIFAR-10 for a model trained from scratch.
[LINK]
http://arxiv.org/abs/2503.07565v7
[DATE]
2025-05-15 03:11:44+08:00
[CATEGORIES]
cs.LG
A Generative Neural Annealer for Black-Box Combinatorial Optimization
[AUTHORS]
Yuan-Hang Zhang, Massimiliano Di Ventra
[ABSTRACT]
We propose a generative, end-to-end solver for black-box combinatorial
optimization that emphasizes both sample efficiency and solution quality on NP
problems. Drawing inspiration from annealing-based algorithms, we treat the
black-box objective as an energy function and train a neural network to model
the associated Boltzmann distribution. By conditioning on temperature, the
network captures a continuum of distributions–from near-uniform at high
temperatures to sharply peaked around global optima at low
temperatures–thereby learning the structure of the energy landscape and
facilitating global optimization. When queries are expensive, the
temperature-dependent distributions naturally enable data augmentation and
improve sample efficiency. When queries are cheap but the problem remains hard,
the model learns implicit variable interactions, effectively “opening” the
black box. We validate our approach on challenging combinatorial tasks under
both limited and unlimited query budgets, showing competitive performance
against state-of-the-art black-box optimizers.
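The temperature dependence described above can be illustrated with a minimal sketch that computes an exact Boltzmann distribution over a handful of candidate solutions. This is not the authors' neural model: the energy values and the function name are hypothetical, and the paper's network would approximate such distributions rather than enumerate them.

```python
import math


def boltzmann(energies, temperature):
    """Return probabilities p_i proportional to exp(-E_i / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Shift by the minimum energy for numerical stability; the shift
    # cancels after normalization.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


energies = [0.0, 1.0, 2.0, 3.0]   # hypothetical objective values
hot = boltzmann(energies, 100.0)  # close to uniform
cold = boltzmann(energies, 0.05)  # concentrated on the minimum-energy entry
```

At high temperature the probabilities are nearly equal, while at low temperature almost all mass sits on the lowest-energy candidate, matching the continuum the abstract describes.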
[COMMENTS]
15 pages, 3 figures
[LINK]
http://arxiv.org/abs/2505.09742v1
[DATE]
2025-05-15 03:05:19+08:00
[CATEGORIES]
cs.LG
Risk-Aware Safe Reinforcement Learning for Control of Stochastic Linear Systems
[AUTHORS]
Babak Esmaeili, Nariman Niknejad, Hamidreza Modares
[ABSTRACT]
This paper presents a risk-aware safe reinforcement learning (RL) control
design for stochastic discrete-time linear systems. Rather than using a safety
certifier to myopically intervene with the RL controller, a risk-informed safe
controller is learned alongside the RL controller, and the two controllers are
combined. Several advantages come with this approach: 1) high-confidence
safety can be certified using only limited data and without relying on a
high-fidelity system model, 2) myopic
interventions and convergence to an undesired equilibrium can be avoided by
deciding on the contribution of two stabilizing controllers, and 3) highly
efficient and computationally tractable solutions can be provided by optimizing
over a scalar decision variable and linear programming polyhedral sets. To
learn safe controllers with a large invariant set, piecewise affine controllers
are learned instead of linear controllers. To this end, the closed-loop system
is first represented using collected data, a decision variable, and noise. The
effect of the decision variable on the variance of the safety violation of the
closed-loop system is formalized. The decision variable is then designed such
that the probability of safety violation for the learned closed-loop system is
minimized. It is shown that this control-oriented approach reduces the data
requirements and can also reduce the variance of safety violations. Finally, to
integrate the safe and RL controllers, a new data-driven interpolation
technique is introduced. This method aims to maintain the RL agent’s optimal
implementation while ensuring its safety within environments characterized by
noise. The study concludes with a simulation example that serves to validate
the theoretical results.
[COMMENTS]
Submitted to Asian Journal of Control
[LINK]
http://arxiv.org/abs/2505.09734v1
[DATE]
2025-05-15 02:49:32+08:00
[CATEGORIES]
cs.LG
Robust Federated Learning with Confidence-Weighted Filtering and GAN-Based Completion under Noisy and Incomplete Data
[AUTHORS]
Alpaslan Gokcen, Ali Boyaci
[ABSTRACT]
Federated learning (FL) presents an effective solution for collaborative
model training while maintaining data privacy across decentralized client
datasets. However, data quality issues such as noisy labels, missing classes,
and imbalanced distributions significantly challenge its effectiveness. This
study proposes a federated learning methodology that systematically addresses
data quality issues, including noise, class imbalance, and missing labels. The
proposed approach systematically enhances data integrity through adaptive noise
cleaning, collaborative conditional GAN-based synthetic data generation, and
robust federated model training. Experimental evaluations conducted on
benchmark datasets (MNIST and Fashion-MNIST) demonstrate significant
improvements in federated model performance, particularly macro-F1 Score, under
varying noise and class imbalance conditions. Additionally, the proposed
framework carefully balances computational feasibility and substantial
performance gains, ensuring practicality for resource constrained edge devices
while rigorously maintaining data privacy. Our results indicate that this
method effectively mitigates common data quality challenges, providing a
robust, scalable, and privacy compliant solution suitable for diverse
real-world federated learning scenarios.
[LINK]
http://arxiv.org/abs/2505.09733v1
[DATE]
2025-05-15 02:49:18+08:00
[CATEGORIES]
cs.LG
Out-of-distribution generalisation is hard: evidence from ARC-like tasks
[AUTHORS]
George Dimitriadis, Spyridon Samothrakis
[ABSTRACT]
Out-of-distribution (OOD) generalisation is considered a hallmark of human
and animal intelligence. To achieve OOD through composition, a system must
discover the environment-invariant properties of experienced input-output
mappings and transfer them to novel inputs. This can be realised if an
intelligent system can identify appropriate, task-invariant, and composable
input features, as well as the composition methods, thus allowing it to act
based not on the interpolation between learnt data points but on the
task-invariant composition of those features. We propose that in order to
confirm that an algorithm does indeed learn compositional structures from data,
it is not enough to just test on an OOD setup, but one also needs to confirm
that the features identified are indeed compositional. We showcase this by
exploring two tasks with clearly defined OOD metrics that are not OOD solvable
by three commonly used neural networks: a Multi-Layer Perceptron (MLP), a
Convolutional Neural Network (CNN), and a Transformer. In addition, we develop
two novel network architectures imbued with biases that allow them to be
successful in OOD scenarios. We show that even with correct biases and almost
perfect OOD performance, an algorithm can still fail to learn the correct
features for compositional generalisation.
[COMMENTS]
Submission to NeurIPS 2025
[LINK]
http://arxiv.org/abs/2505.09716v1
[DATE]
2025-05-15 02:21:21+08:00
[CATEGORIES]
cs.LG
Temporal-Difference Variational Continual Learning
[AUTHORS]
Luckeciano C. Melo, Alessandro Abate, Yarin Gal
[ABSTRACT]
Machine Learning models in real-world applications must continuously learn
new tasks to adapt to shifts in the data-generating distribution. Yet, for
Continual Learning (CL), models often struggle to balance learning new tasks
(plasticity) with retaining previous knowledge (memory stability).
Consequently, they are susceptible to Catastrophic Forgetting, which degrades
performance and undermines the reliability of deployed systems. In the Bayesian
CL literature, variational methods tackle this challenge by employing a
learning objective that recursively updates the posterior distribution while
constraining it to stay close to its previous estimate. Nonetheless, we argue
that these methods may be ineffective due to compounding approximation errors
over successive recursions. To mitigate this, we propose new learning
objectives that integrate the regularization effects of multiple previous
posterior estimations, preventing individual errors from dominating future
posterior updates and compounding over time. We reveal insightful connections
between these objectives and Temporal-Difference methods, a popular learning
mechanism in Reinforcement Learning and Neuroscience. Experiments on
challenging CL benchmarks show that our approach effectively mitigates
Catastrophic Forgetting, outperforming strong Variational CL methods.
[LINK]
http://arxiv.org/abs/2410.07812v2
[DATE]
2025-05-15 02:18:45+08:00
[CATEGORIES]
cs.LG
Forests for Differences: Robust Causal Inference Beyond Parametric DiD
[AUTHORS]
Hugo Gobato Souto, Francisco Louzada Neto
[ABSTRACT]
This paper introduces the Difference-in-Differences Bayesian Causal Forest
(DiD-BCF), a novel non-parametric model addressing key challenges in DiD
estimation, such as staggered adoption and heterogeneous treatment effects.
DiD-BCF provides a unified framework for estimating Average (ATE),
Group-Average (GATE), and Conditional Average Treatment Effects (CATE). A core
innovation, its Parallel Trends Assumption (PTA)-based reparameterization,
enhances estimation accuracy and stability in complex panel data settings.
Extensive simulations demonstrate DiD-BCF’s superior performance over
established benchmarks, particularly under non-linearity, selection biases, and
effect heterogeneity. Applied to U.S. minimum wage policy, the model uncovers
significant conditional treatment effect heterogeneity related to county
population, insights obscured by traditional methods. DiD-BCF offers a robust
and versatile tool for more nuanced causal inference in modern DiD
applications.
[LINK]
http://arxiv.org/abs/2505.09706v1
[DATE]
2025-05-15 02:06:51+08:00
[CATEGORIES]
cs.LG
Energy-Efficient Federated Learning for AIoT using Clustering Methods
[AUTHORS]
Roberto Pereira, Fernanda Famá, Charalampos Kalalas, Paolo Dini
[ABSTRACT]
While substantial research has been devoted to optimizing model performance,
convergence rates, and communication efficiency, the energy implications of
federated learning (FL) within Artificial Intelligence of Things (AIoT)
scenarios are often overlooked in the existing literature. This study examines
the energy consumed during the FL process, focusing on three main
energy-intensive processes: pre-processing, communication, and local learning,
all contributing to the overall energy footprint. We rely on the observation
that device/client selection is crucial for speeding up the convergence of
model training in a distributed AIoT setting and propose two
clustering-informed methods. These clustering solutions are designed to group
AIoT devices with similar label distributions, resulting in clusters composed
of nearly homogeneous devices. Hence, our methods alleviate the heterogeneity
often encountered in real-world distributed learning applications. Through
extensive numerical experimentation, we demonstrate that our clustering
strategies typically achieve high convergence rates while maintaining low
energy consumption when compared to other recent approaches available in the
literature.
[LINK]
http://arxiv.org/abs/2505.09704v1
[DATE]
2025-05-15 02:04:58+08:00
[CATEGORIES]
cs.LG
DataMIL: Selecting Data for Robot Imitation Learning with Datamodels
[AUTHORS]
Shivin Dass, Alaa Khaddaj, Logan Engstrom, Aleksander Madry, Andrew Ilyas, Roberto Martín-Martín
[ABSTRACT]
Recently, the robotics community has amassed ever larger and more diverse
datasets to train generalist robot policies. However, while these policies
achieve strong mean performance across a variety of tasks, they often
underperform on individual, specialized tasks and require further tuning on
newly acquired task-specific data. Combining task-specific data with carefully
curated subsets of large prior datasets via co-training can produce better
specialized policies, but selecting data naively may actually harm downstream
performance. To address this, we introduce DataMIL, a policy-driven data
selection framework built on the datamodels paradigm that reasons about data
selection in an end-to-end manner, using the policy itself to identify which
data points will most improve performance. Unlike standard practices that
filter data using human notions of quality (e.g., based on semantic or visual
similarity), DataMIL directly optimizes data selection for task success,
allowing us to select data that enhance the policy while dropping data that
degrade it. To avoid performing expensive rollouts in the environment during
selection, we use a novel surrogate loss function on task-specific data,
allowing us to use DataMIL in the real world without degrading performance. We
validate our approach on a suite of more than 60 simulation and real-world
manipulation tasks, most notably showing successful data selection from the
Open X-Embodiment datasets, demonstrating consistent gains in success rates and
superior performance over multiple baselines. Our results underscore the
importance of end-to-end, performance-aware data selection for unlocking the
potential of large prior datasets in robotics. More information at
https://robin-lab.cs.utexas.edu/datamodels4imitation/
[LINK]
http://arxiv.org/abs/2505.09603v1
[DATE]
2025-05-15 01:55:10+08:00
[CATEGORIES]
cs.LG
Adversarial Suffix Filtering: a Defense Pipeline for LLMs
[AUTHORS]
David Khachaturov, Robert Mullins
[ABSTRACT]
Large Language Models (LLMs) are increasingly embedded in autonomous systems
and public-facing environments, yet they remain susceptible to jailbreak
vulnerabilities that may undermine their security and trustworthiness.
Adversarial suffixes are considered to be the current state-of-the-art
jailbreak, consistently outperforming simpler methods and frequently succeeding
even in black-box settings. Existing defenses rely on access to the internal
architecture of models (limiting diverse deployment), increase memory and
computation footprints dramatically, or can be bypassed with simple prompt
engineering methods. We introduce Adversarial Suffix Filtering
(ASF), a lightweight novel model-agnostic defensive pipeline designed to
protect LLMs against adversarial suffix attacks. ASF functions as an input
preprocessor and sanitizer that detects and filters adversarially crafted
suffixes in prompts, effectively neutralizing malicious injections. We
demonstrate that ASF provides comprehensive defense capabilities across both
black-box and white-box attack settings, reducing the attack efficacy of
state-of-the-art adversarial suffix generation methods to below 4%, while only
minimally affecting the target model’s capabilities in non-adversarial
scenarios.
[LINK]
http://arxiv.org/abs/2505.09602v1
[DATE]
2025-05-15 01:52:10+08:00
[CATEGORIES]
cs.LG
Decoding Futures Price Dynamics: A Regularized Sparse Autoencoder for Interpretable Multi-Horizon Forecasting and Factor Discovery
[AUTHORS]
Abhijit Gupta
[ABSTRACT]
Commodity price volatility creates economic challenges, necessitating
accurate multi-horizon forecasting. Predicting prices for commodities like
copper and crude oil is complicated by diverse interacting factors
(macroeconomic, supply/demand, geopolitical, etc.). Current models often lack
transparency, limiting strategic use. This paper presents a Regularized Sparse
Autoencoder (RSAE), a deep learning framework for simultaneous multi-horizon
commodity price prediction and discovery of interpretable latent market
drivers. The RSAE forecasts prices at multiple horizons (e.g., 1-day, 1-week,
1-month) using multivariate time series. Crucially, L1 regularization
($\|\mathbf{z}\|_1$) on its latent vector $\mathbf{z}$ enforces sparsity,
promoting parsimonious explanations of market dynamics through learned factors
representing underlying drivers (e.g., demand, supply shocks). Drawing from
energy-based models and sparse coding, the RSAE optimizes predictive accuracy
while learning sparse representations. Evaluated on historical Copper and Crude
Oil data with numerous indicators, our findings indicate the RSAE offers
competitive multi-horizon forecasting accuracy and data-driven insights into
price dynamics via its interpretable latent space, a key advantage over
traditional black-box approaches.
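The role of the L1 term can be sketched as follows. The abstract does not give the exact objective, so this assumes a simple form: mean squared forecasting error plus a weighted L1 norm of the latent vector. The function names and the `sparsity_weight` parameter are hypothetical.

```python
def l1_norm(z):
    """Sum of absolute values of the latent vector entries."""
    return sum(abs(v) for v in z)


def prediction_plus_l1_loss(predictions, targets, latent, sparsity_weight=0.1):
    """Assumed objective: mean squared error plus sparsity_weight * ||z||_1."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    return mse + sparsity_weight * l1_norm(latent)


# Hypothetical two-horizon forecast with a three-dimensional latent vector.
loss = prediction_plus_l1_loss([1.0, 2.0], [1.0, 4.0], [0.5, -1.5, 0.0])
```

Because the L1 term penalizes every nonzero latent entry in proportion to its magnitude, minimizing such an objective tends to drive unneeded latent factors to exactly zero, which is what makes the remaining factors easier to interpret.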
[LINK]
http://arxiv.org/abs/2505.06795v3
[DATE]
2025-05-15 01:49:51+08:00
[CATEGORIES]
cs.LG
Online Isolation Forest
[AUTHORS]
Filippo Leveni, Guilherme Weigert Cassales, Bernhard Pfahringer, Albert Bifet, Giacomo Boracchi
[COMMENTS]
Accepted at International Conference on Machine Learning (ICML 2024)
[LINK]
http://arxiv.org/abs/2505.09593v1
[DATE]
2025-05-15 01:42:50+08:00
[CATEGORIES]
cs.LG
Rhomboid Tiling for Geometric Graph Deep Learning
[AUTHORS]
Yipeng Zhang, Longlong Li, Kelin Xia
[ABSTRACT]
Graph Neural Networks (GNNs) have proven effective for learning from
graph-structured data through their neighborhood-based message passing
framework. Many hierarchical graph clustering pooling methods modify this
framework by introducing clustering-based strategies, enabling the construction
of more expressive and powerful models. However, all of these message passing
frameworks rely heavily on the connectivity structure of graphs, limiting their
ability to capture the rich geometric features inherent in geometric graphs. To
address this, we propose Rhomboid Tiling (RT) clustering, a novel clustering
method based on the rhomboid tiling structure, which performs clustering by
leveraging the complex geometric information of the data and effectively
extracts its higher-order geometric structures. Moreover, we design RTPool, a
hierarchical graph clustering pooling model based on RT clustering for graph
classification tasks. The proposed model demonstrates superior performance,
outperforming 21 state-of-the-art competitors on all the 7 benchmark datasets.
[LINK]
http://arxiv.org/abs/2505.09586v1
[DATE]
2025-05-15 01:37:15+08:00
[CATEGORIES]
cs.LG
TuneNSearch: a hybrid transfer learning and local search approach for solving vehicle routing problems
[AUTHORS]
Arthur Corrêa, Cristóvão Silva, Liming Xu, Alexandra Brintrup, Samuel Moniz
[ABSTRACT]
This paper introduces TuneNSearch, a hybrid transfer learning and local
search approach for addressing different variants of vehicle routing problems
(VRP). Recently, multi-task learning has gained much attention for solving VRP
variants. However, this adaptability often compromises the performance of the
models. To address this challenge, we first pre-train a reinforcement learning
model on the multi-depot VRP, followed by a short fine-tuning phase to adapt it
to different variants. By leveraging the complexity of the multi-depot VRP, the
pre-trained model learns richer node representations and gains more
transferable knowledge compared to models trained on simpler routing problems,
such as the traveling salesman problem. TuneNSearch employs, in the first
stage, a Transformer-based architecture, augmented with a residual edge-graph
attention network to capture the impact of edge distances and residual
connections between layers. This architecture allows for a more precise capture
of graph-structured data, improving the encoding of VRP’s features. After
inference, our model is also coupled with a second stage composed of a local
search algorithm, which yields substantial performance gains with minimal
computational overhead added. Results show that TuneNSearch outperforms many
existing state-of-the-art models trained for each VRP variant, requiring only
one-fifth of the training epochs. Our approach demonstrates strong
generalization, achieving high performance across different tasks,
distributions and problem sizes, thus addressing a long-standing gap in the
literature.
[LINK]
http://arxiv.org/abs/2503.12662v2
[DATE]
2025-05-15 01:20:26+08:00
[CATEGORIES]
cs.LG
SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures
[AUTHORS]
Julian Kranz, Davide Gallon, Steffen Dereich, Arnulf Jentzen
[ABSTRACT]
We study gradient flows for loss landscapes of fully connected feed forward
neural networks with commonly used continuously differentiable activation
functions such as the logistic, hyperbolic tangent, softplus or GELU function.
We prove that the gradient flow either converges to a critical point or
diverges to infinity while the loss converges to an asymptotic critical value.
Moreover, we prove the existence of a threshold $\varepsilon>0$ such that the
loss value of any gradient flow initialized at most $\varepsilon$ above the
optimal level converges to it. For polynomial target functions and sufficiently
big architecture and data set, we prove that the optimal loss value is zero and
can only be realized asymptotically. From this setting, we deduce our main
result that any gradient flow with sufficiently good initialization diverges to
infinity. Our proof heavily relies on the geometry of o-minimal structures. We
confirm these theoretical findings with numerical experiments and extend our
investigation to real-world scenarios, where we observe an analogous behavior.
[COMMENTS]
27 pages, 4 figures
[LINK]
http://arxiv.org/abs/2505.09572v1
[DATE]
2025-05-15 01:15:11+08:00
[CATEGORIES]
cs.LG
Don’t be lazy: CompleteP enables compute-efficient deep transformers
[AUTHORS]
Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness
[ABSTRACT]
We study compute efficiency of LLM training when using different
parameterizations, i.e., rules for adjusting model and optimizer
hyperparameters (HPs) as model size changes. Some parameterizations fail to
transfer optimal base HPs (such as learning rate) across changes in model
depth, requiring practitioners to either re-tune these HPs as they scale up
(expensive), or accept sub-optimal training when re-tuning is prohibitive. Even
when they achieve HP transfer, we develop theory to show parameterizations may
still exist in the lazy learning regime where layers learn only features close
to their linearization, preventing effective use of depth and nonlinearity.
Finally, we identify and adopt the parameterization we call CompleteP that
achieves both depth-wise HP transfer and non-lazy learning in all layers.
CompleteP enables a wider range of model width/depth ratios to remain
compute-efficient, unlocking shapes better suited for different hardware
settings and operational contexts. Moreover, CompleteP enables 12-34% compute
efficiency improvements over the prior state-of-the-art.
[COMMENTS]
10 main pages, 16 appendix pages, 13 figures
[LINK]
http://arxiv.org/abs/2505.01618v2
[DATE]
2025-05-15 01:09:58+08:00
[CATEGORIES]
cs.LG
SpecSphere: Dual-Pass Spectral-Spatial Graph Neural Networks with Certified Robustness
[AUTHORS]
Yoonhyuk Choi, Chong-Kwon Kim
[ABSTRACT]
We introduce SpecSphere, the first dual-pass spectral-spatial GNN that
certifies every prediction against both $\ell_{0}$ edge flips and
$\ell_{\infty}$ feature perturbations, adapts to the full
homophily-heterophily spectrum, and surpasses the expressive power of
1-Weisfeiler-Lehman while retaining linear-time complexity. Our model couples a
Chebyshev-polynomial spectral branch with an attention-gated spatial branch and
fuses their representations through a lightweight MLP trained in a
cooperative-adversarial min-max game. We further establish (i) a uniform
Chebyshev approximation theorem, (ii) minimax-optimal risk across the
homophily-heterophily spectrum, (iii) closed-form robustness certificates, and
(iv) universal approximation strictly beyond 1-WL. SpecSphere achieves
state-of-the-art node-classification accuracy and delivers tighter certified
robustness guarantees on real-world benchmarks. These results demonstrate that
high expressivity, heterophily adaptation, and provable robustness can coexist
within a single, scalable architecture.
[LINK]
http://arxiv.org/abs/2505.08320v2
[DATE]
2025-05-15 01:07:37+08:00
[CATEGORIES]
cs.LG
Graph-structured Small Molecule Drug Discovery Through Deep Learning: Progress, Challenges, and Opportunities
[AUTHORS]
Kun Li, Yida Xiong, Hongzhi Zhang, Xiantao Cai, Jia Wu, Bo Du, Wenbin Hu
[ABSTRACT]
Due to their excellent drug-like and pharmacokinetic properties, small
molecule drugs are widely used to treat various diseases, making them a
critical component of drug discovery. In recent years, with the rapid
development of deep learning (DL) techniques, DL-based small molecule drug
discovery methods have achieved excellent performance in prediction accuracy,
speed, and complex molecular relationship modeling compared to traditional
machine learning approaches. These advancements enhance drug screening
efficiency and optimization and provide more precise and effective solutions
for various drug discovery tasks. Contributing to this field’s development,
this paper aims to systematically summarize and generalize the key tasks
and representative techniques in graph-structured small molecule drug discovery
from recent years. Specifically, we provide an overview of the major tasks in
small molecule drug discovery and their interrelationships. Next, we analyze
the six core tasks, summarizing the related methods, commonly used datasets,
and technological development trends. Finally, we discuss key challenges, such
as interpretability and out-of-distribution generalization, and offer our
insights into future research directions for small molecule drug discovery.
[COMMENTS]
10 pages, 1 figure, 8 tables
[LINK]
http://arxiv.org/abs/2502.08975v2
[DATE]
2025-05-15 01:05:32+08:00
[CATEGORIES]
cs.LG
Learning Long-Context Diffusion Policies via Past-Token Prediction
[AUTHORS]
Marcel Torne, Andy Tang, Yuejiang Liu, Chelsea Finn
[ABSTRACT]
Reasoning over long sequences of observations and actions is essential for
many robotic tasks. Yet, learning effective long-context policies from
demonstrations remains challenging. As context length increases, training
becomes increasingly expensive due to rising memory demands, and policy
performance often degrades as a result of spurious correlations. Recent methods
typically sidestep these issues by truncating context length, discarding
historical information that may be critical for subsequent decisions. In this
paper, we propose an alternative approach that explicitly regularizes the
retention of past information. We first revisit the copycat problem in
imitation learning and identify an opposite challenge in recent diffusion
policies: rather than over-relying on prior actions, they often fail to capture
essential dependencies between past and future actions. To address this, we
introduce Past-Token Prediction (PTP), an auxiliary task in which the policy
learns to predict past action tokens alongside future ones. This regularization
significantly improves temporal modeling in the policy head, with minimal
reliance on visual representations. Building on this observation, we further
introduce a multistage training strategy: pre-train the visual encoder with
short contexts, and fine-tune the policy head using cached long-context
embeddings. This strategy preserves the benefits of PTP while greatly reducing
memory and computational overhead. Finally, we extend PTP into a
self-verification mechanism at test time, enabling the policy to score and
select candidates consistent with past actions during inference. Experiments
across four real-world and six simulated tasks demonstrate that our proposed
method improves the performance of long-context diffusion policies by 3x and
accelerates policy training by more than 10x.
[COMMENTS]
Videos are available at https://long-context-dp.github.io
[LINK]
http://arxiv.org/abs/2505.09561v1
[DATE]
2025-05-15 01:00:47+08:00
[CATEGORIES]
cs.LG
WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
[AUTHORS]
Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao
[ABSTRACT]
End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered
significant attention in the speech domain. However, the evaluation of spoken
dialogue models’ conversational performance has largely been overlooked. This
is primarily because intelligent chatbots convey a wealth of non-textual
information which cannot be easily measured using text-based language models
like ChatGPT. To address this gap, we propose WavReward, a reward feedback
model based on audio language models that can evaluate both the IQ and EQ of
spoken dialogue systems with speech input. Specifically, 1) based on audio
language models, WavReward incorporates the deep reasoning process and the
nonlinear reward mechanism for post-training. By utilizing multi-sample
feedback via the reinforcement learning algorithm, we construct a specialized
evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a
preference dataset used to train WavReward. ChatReward-30K includes both
comprehension and generation aspects of spoken dialogue models. These scenarios
span various tasks, such as text-based chats, nine acoustic attributes of
instruction chats, and implicit chats. WavReward outperforms previous
state-of-the-art evaluation models across multiple spoken dialogue scenarios,
achieving a substantial improvement over Qwen2.5-Omni in objective accuracy
from 55.1% to 91.5%. In subjective A/B testing, WavReward also leads by a
margin of 83%. Comprehensive ablation studies confirm the necessity of each
component of WavReward. All data and code will be made publicly available at
https://github.com/jishengpeng/WavReward after the paper is accepted.
[LINK]
http://arxiv.org/abs/2505.09558v1
[DATE]
2025-05-15 00:54:15+08:00
[CATEGORIES]
cs.LG
Scalable Computations for Generalized Mixed Effects Models with Crossed Random Effects Using Krylov Subspace Methods
[AUTHORS]
Pascal Kündig, Fabio Sigrist
[ABSTRACT]
Mixed effects models are widely used for modeling data with hierarchically
grouped structures and high-cardinality categorical predictor variables.
However, for high-dimensional crossed random effects, current standard
computations relying on Cholesky decompositions can become prohibitively slow.
In this work, we present novel Krylov subspace-based methods that address
several existing computational bottlenecks. Among other things, we
theoretically analyze and empirically evaluate various preconditioners for the
conjugate gradient and stochastic Lanczos quadrature methods, derive new
convergence results, and develop computationally efficient methods for
calculating predictive variances. Extensive experiments using simulated and
real-world data sets show that our proposed methods scale much better than
Cholesky-based computations, for instance, achieving a runtime reduction of
approximately two orders of magnitude for both estimation and prediction.
Moreover, our software implementation is up to 10,000 times faster and more
stable than state-of-the-art implementations such as lme4 and glmmTMB when
using default settings. Our methods are implemented in the free C++ software
library GPBoost with high-level Python and R packages.
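For readers unfamiliar with Krylov subspace solvers, the following is a minimal, unpreconditioned conjugate gradient sketch for a symmetric positive definite system. It is a textbook version written for illustration, not the GPBoost implementation, and it omits the preconditioners the abstract analyzes; the function name and the example matrix are assumptions.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n
    x = [0.0] * n
    r = list(b)  # residual b - A x for the initial guess x = 0
    p = list(r)  # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(v * v for v in r)
        # New direction is conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]  # small example matrix (symmetric positive definite)


def apply_A(v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]


solution = conjugate_gradient(apply_A, [1.0, 2.0])
```

The solver only needs matrix-vector products, never a factorization, which is why methods of this kind can scale better than Cholesky-based computations when the matrix is large and sparse.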
[LINK]
http://arxiv.org/abs/2505.09552v1
[DATE]
2025-05-15 00:50:19+08:00
[CATEGORIES]
cs.LG
Distilling Realizable Students from Unrealizable Teachers
[AUTHORS]
Yujin Kim, Nathaniel Chin, Arnav Vasudev, Sanjiban Choudhury
[ABSTRACT]
We study policy distillation under privileged information, where a student
policy with only partial observations must learn from a teacher with full-state
access. A key challenge is information asymmetry: the student cannot directly
access the teacher’s state space, leading to distributional shifts and policy
degradation. Existing approaches either modify the teacher to produce
realizable but sub-optimal demonstrations or rely on the student to explore
missing information independently, both of which are inefficient. Our key
insight is that the student should strategically interact with the teacher,
querying only when necessary and resetting from recovery states, to stay on
a recoverable path within its own observation space. We introduce two methods:
(i) an imitation learning approach that adaptively determines when the student
should query the teacher for corrections, and (ii) a reinforcement learning
approach that selects where to initialize training for efficient exploration.
We validate our methods in both simulated and real-world robotic tasks,
demonstrating significant improvements over standard teacher-student baselines
in training efficiency and final performance. The project website is available
at: https://portal-cornell.github.io/CritiQ_ReTRy/
[LINK]
http://arxiv.org/abs/2505.09546v1
[DATE]
2025-05-15 00:45:51+08:00
[CATEGORIES]
cs.LG
Detecting Multimedia Generated by Large AI Models: A Survey
[AUTHORS]
Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, Shu Hu
[ABSTRACT]
The rapid advancement of Large AI Models (LAIMs), particularly diffusion
models and large language models, has marked a new era where AI-generated
multimedia is increasingly integrated into various aspects of daily life.
Although beneficial in numerous fields, this content presents significant
risks, including potential misuse, societal disruptions, and ethical concerns.
Consequently, detecting multimedia generated by LAIMs has become crucial, with
a marked rise in related research. Despite this, there remains a notable gap in
systematic surveys that focus specifically on detecting LAIM-generated
multimedia. Addressing this, we provide the first survey to comprehensively
cover existing research on detecting multimedia (such as text, images, videos,
audio, and multimodal content) created by LAIMs. Specifically, we introduce a
novel taxonomy for detection methods, categorized by media modality, and
aligned with two perspectives: pure detection (aiming to enhance detection
performance) and beyond detection (adding attributes like generalizability,
robustness, and interpretability to detectors). Additionally, we present
a brief overview of generation mechanisms, public datasets, online detection
tools, and evaluation metrics to provide a valuable resource for researchers
and practitioners in this field. Most importantly, we offer a focused analysis
from a social media perspective to highlight their broader societal impact.
Furthermore, we identify current challenges in detection and propose directions
for future research that address unexplored, ongoing, and emerging issues in
detecting multimedia generated by LAIMs. Our aim for this survey is to fill an
academic gap and contribute to global AI security efforts, helping to ensure
the integrity of information in the digital realm. The project link is
https://github.com/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey.
[LINK]
http://arxiv.org/abs/2402.00045v5
[DATE]
2025-05-15 00:37:28+08:00
[CATEGORIES]
cs.LG
Contactless Cardiac Pulse Monitoring Using Event Cameras
[AUTHORS]
Mohamed Moustafa, Joseph Lemley, Peter Corcoran
[ABSTRACT]
Time event cameras are a novel technology for recording scene information at
extremely low latency and with low power consumption. Event cameras output a
stream of events that encapsulate pixel-level light intensity changes within
the scene, capturing information with a higher dynamic range and temporal
resolution than traditional cameras. This study investigates the contact-free
reconstruction of an individual’s cardiac pulse signal from time event
recording of their face using a supervised convolutional neural network (CNN)
model. An end-to-end model is trained to extract the cardiac signal from a
two-dimensional representation of the event stream, with model performance
evaluated based on the accuracy of the calculated heart rate. The experimental
results confirm that physiological cardiac information in the facial region is
effectively preserved within the event stream, showcasing the potential of this
novel sensor for remote heart rate monitoring. The model trained on event
frames achieves a root mean square error (RMSE) of 3.32 beats per minute (bpm)
compared to the RMSE of 2.92 bpm achieved by the baseline model trained on
standard camera frames. Furthermore, models trained on event frames generated
at 60 and 120 FPS outperformed the 30 FPS standard camera results, achieving an
RMSE of 2.54 and 2.13 bpm, respectively.
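As a reading aid, the heart-rate RMSE metric quoted above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the helper name `heart_rate_rmse` and the bpm values are hypothetical and are not data from the paper.

```python
import math


def heart_rate_rmse(predicted_bpm, reference_bpm):
    """Root mean square error between predicted and reference heart rates (bpm)."""
    if len(predicted_bpm) != len(reference_bpm) or not predicted_bpm:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted_bpm, reference_bpm)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only (assumed, not results from the paper).
predicted = [72.0, 80.0, 65.0, 90.0]
reference = [70.0, 83.0, 65.0, 86.0]
rmse = heart_rate_rmse(predicted, reference)
```

Because the errors are squared before averaging, a single badly estimated recording raises the RMSE more than several small errors of the same total size.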
[COMMENTS]
This paper is a preprint of a paper submitted to IEEE Access and is
currently under review
[LINK]
http://arxiv.org/abs/2505.09529v1
[DATE]
2025-05-15 00:24:22+08:00
[CATEGORIES]
cs.LG
Multi-Objective-Guided Discrete Flow Matching for Controllable Biological Sequence Design
[AUTHORS]
Tong Chen, Yinuo Zhang, Sophia Tang, Pranam Chatterjee
[ABSTRACT]
Designing biological sequences that satisfy multiple, often conflicting,
functional and biophysical criteria remains a central challenge in biomolecule
engineering. While discrete flow matching models have recently shown promise
for efficient sampling in high-dimensional sequence spaces, existing approaches
address only single objectives or require continuous embeddings that can
distort discrete distributions. We present Multi-Objective-Guided Discrete Flow
Matching (MOG-DFM), a general framework to steer any pretrained discrete flow
matching generator toward Pareto-efficient trade-offs across multiple scalar
objectives. At each sampling step, MOG-DFM computes a hybrid rank-directional
score for candidate transitions and applies an adaptive hypercone filter to
enforce consistent multi-objective progression. We also trained two
unconditional discrete flow matching models, PepDFM for diverse peptide
generation and EnhancerDFM for functional enhancer DNA generation, as base
generation models for MOG-DFM. We demonstrate MOG-DFM’s effectiveness in
generating peptide binders optimized across five properties (hemolysis,
non-fouling, solubility, half-life, and binding affinity), and in designing DNA
sequences with specific enhancer classes and DNA shapes. In total, MOG-DFM
proves to be a powerful tool for multi-property-guided biomolecule sequence
design.
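The notion of a Pareto-efficient trade-off mentioned above can be illustrated with a small non-dominated filter. This is a generic sketch under stated assumptions, not the MOG-DFM sampling procedure: the sequence names and scores are hypothetical, and higher is assumed to be better for every objective.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b in every objective
    and strictly better in at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the (name, scores) pairs that no other candidate dominates."""
    return [
        (name, scores)
        for name, scores in candidates
        if not any(dominates(other, scores) for _, other in candidates)
    ]


# Hypothetical candidates scored on two objectives (illustrative values only).
candidates = [
    ("seq_a", (0.9, 0.2)),
    ("seq_b", (0.6, 0.6)),
    ("seq_c", (0.5, 0.5)),  # dominated by seq_b in both objectives
    ("seq_d", (0.1, 0.95)),
]
front = pareto_front(candidates)
```

Candidates on the front cannot be improved in one objective without losing ground in another, which is the sense in which multi-objective guidance aims at trade-offs rather than a single optimum.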
[LINK]
http://arxiv.org/abs/2505.07086v2
[DATE]
2025-05-15 00:19:40+08:00
[CATEGORIES]
cs.LG
rfPG: Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs
[AUTHORS]
Maris F. L. Galesloot, Roman Andriushchenko, Milan Češka, Sebastian Junges, Nils Jansen
[ABSTRACT]
Partially observable Markov decision processes (POMDPs) model specific
environments in sequential decision-making under uncertainty. Critically,
optimal policies for POMDPs may not be robust against perturbations in the
environment. Hidden-model POMDPs (HM-POMDPs) capture sets of different
environment models, that is, POMDPs with a shared action and observation space.
The intuition is that the true model is hidden among a set of potential models,
and it is unknown which model will be the environment at execution time. A
policy is robust for a given HM-POMDP if it achieves sufficient performance for
each of its POMDPs. We compute such robust policies by combining two orthogonal
techniques: (1) a deductive formal verification technique that supports
tractable robust policy evaluation by computing a worst-case POMDP within the
HM-POMDP and (2) subgradient ascent to optimize the candidate policy for a
worst-case POMDP. The empirical evaluation shows that, compared to various
baselines, our approach (1) produces policies that are more robust and
generalize better to unseen POMDPs and (2) scales to HM-POMDPs that consist of
over a hundred thousand environments.
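The alternation described above (pick a worst-case model, then take an ascent step against it) can be sketched on a toy max-min problem. This is only an illustration of subgradient ascent on a worst-case objective, not the rfPG algorithm: here each "model" is a hypothetical linear value function of a single policy parameter `theta` in [0, 1], and the worst case is found by enumeration rather than by formal verification.

```python
def worst_case(models, theta):
    """Return (value, slope) of the model with the lowest value at theta.

    Each model is a pair (a, b) standing for the toy value function a * theta + b.
    """
    return min(((a * theta + b, a) for a, b in models), key=lambda pair: pair[0])


def robust_subgradient_ascent(models, theta=0.1, iterations=500):
    """Maximise min_i (a_i * theta + b_i) over theta in [0, 1]."""
    for t in range(iterations):
        # The slope of the currently worst model is a subgradient of the minimum.
        _, slope = worst_case(models, theta)
        step = 0.5 / (t + 1)  # diminishing step size
        theta = min(1.0, max(0.0, theta + step * slope))
    return theta


# Two hypothetical environments with opposing preferences (illustrative only).
models = [(1.0, 0.0), (-1.0, 1.0)]
theta = robust_subgradient_ascent(models)
robust_value, _ = worst_case(models, theta)
```

In this toy setting the iterate settles near the point where both value functions meet, since moving further in either direction would lower the value under one of the two models.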
[COMMENTS]
Accepted for publication at IJCAI 2025
[LINK]
http://arxiv.org/abs/2505.09518v1
[DATE]
2025-05-15 00:15:58+08:00
[CATEGORIES]
cs.LG
IAEmu: Learning Galaxy Intrinsic Alignment Correlations
[AUTHORS]
Sneh Pandya, Yuanyuan Yang, Nicholas Van Alfen, Jonathan Blazek, Robin Walters
[ABSTRACT]
The intrinsic alignments (IA) of galaxies, a key contaminant in weak lensing
analyses, arise from correlations in galaxy shapes driven by tidal interactions
and galaxy formation processes. Accurate IA modeling is essential for robust
cosmological inference, but current approaches rely on perturbative methods
that break down on nonlinear scales or on expensive simulations. We introduce
IAEmu, a neural network-based emulator that predicts the galaxy
position-position ($\xi$), position-orientation ($\omega$), and
orientation-orientation ($\eta$) correlation functions and their uncertainties
using mock catalogs based on the halo occupation distribution (HOD) framework.
Compared to simulations, IAEmu achieves ~3% average error for $\xi$ and ~5% for
$\omega$, while capturing the stochasticity of $\eta$ without overfitting. The
emulator provides both aleatoric and epistemic uncertainties, helping identify
regions where predictions may be less reliable. We also demonstrate
generalization to non-HOD alignment signals by fitting to IllustrisTNG
hydrodynamical simulation data. As a fully differentiable neural network, IAEmu
enables $\sim$10,000$\times$ speed-ups in mapping HOD parameters to correlation
functions on GPUs, compared to CPU-based simulations. This acceleration
facilitates inverse modeling via gradient-based sampling, making IAEmu a
powerful surrogate model for galaxy bias and IA studies with direct
applications to Stage IV weak lensing surveys.
[COMMENTS]
16 pages, 10 figures, 1 table
[LINK]
http://arxiv.org/abs/2504.05235v2
[DATE]
2025-05-15 00:12:07+08:00
[CATEGORIES]
cs.LG
Depth-Based Local Center Clustering: A Framework for Handling Different Clustering Scenarios
[AUTHORS]
Siyi Wang, Alexandre Leblanc, Paul D. McNicholas
[ABSTRACT]
Cluster analysis, or clustering, plays a crucial role across numerous
scientific and engineering domains. Despite the wealth of clustering methods
proposed over the past decades, each method is typically designed for specific
scenarios and presents certain limitations in practical applications. In this
paper, we propose depth-based local center clustering (DLCC). This novel method
makes use of data depth, which is known to produce a center-outward ordering of
sample points in a multivariate space. However, data depth typically fails to
capture the multimodal characteristics of data, something of the utmost
importance in the context of clustering. To overcome this, DLCC makes use of a
local version of data depth that is based on subsets of data. From this,
local centers can be identified as well as clusters of varying shapes.
Furthermore, we propose a new internal metric based on density-based clustering
to evaluate clustering performance on non-convex clusters. Overall, DLCC is a
flexible clustering approach that seems to overcome some limitations of
traditional clustering methods, thereby enhancing data analysis capabilities
across a wide range of application scenarios.
[LINK]
http://arxiv.org/abs/2505.09516v1
[DATE]
2025-05-15 00:08:11+08:00
[CATEGORIES]
cs.LG
CXMArena: Unified Dataset to benchmark performance in realistic CXM Scenarios
[AUTHORS]
Raghav Garg, Kapil Sharma, Karan Gupta
[ABSTRACT]
Large Language Models (LLMs) hold immense potential for revolutionizing
Customer Experience Management (CXM), particularly in contact center
operations. However, evaluating their practical utility in complex operational
environments is hindered by data scarcity (due to privacy concerns) and the
limitations of current benchmarks. Existing benchmarks often lack realism,
failing to incorporate deep knowledge base (KB) integration, real-world noise,
or critical operational tasks beyond conversational fluency. To bridge this
gap, we introduce CXMArena, a novel, large-scale synthetic benchmark dataset
specifically designed for evaluating AI in operational CXM contexts. Given the
diversity in possible contact center features, we have developed a scalable
LLM-powered pipeline that simulates the brand’s CXM entities that form the
foundation of our datasets, such as knowledge articles including product
specifications, issue taxonomies, and contact center conversations. The
entities closely represent real-world distribution because of controlled noise
injection (informed by domain experts) and rigorous automated validation.
Building on this, we release CXMArena, which provides dedicated benchmarks
targeting five important operational tasks: Knowledge Base Refinement, Intent
Prediction, Agent Quality Adherence, Article Search, and Multi-turn RAG with
Integrated Tools. Our baseline experiments underscore the benchmark’s
difficulty: even state-of-the-art embedding and generation models achieve only
68% accuracy on article search, while standard embedding methods yield a low F1
score of 0.3 for knowledge base refinement, highlighting significant challenges
for current models necessitating complex pipelines and solutions over
conventional techniques.
[LINK]
http://arxiv.org/abs/2505.09436v1
[DATE]
2025-05-14 22:44:30+08:00
[CATEGORIES]
cs.LG
cs.CL
Multilingual Machine Translation with Quantum Encoder Decoder Attention-based Convolutional Variational Circuits
[AUTHORS]
Subrit Dikshit, Ritu Tiwari, Priyank Jain
[ABSTRACT]
Cloud-based multilingual translation services like Google Translate and
Microsoft Translator achieve state-of-the-art translation capabilities. These
services inherently use large multilingual language models such as GRU, LSTM,
BERT, GPT, T5, or similar encoder-decoder architectures with attention
mechanisms as the backbone. Also, new age natural language systems, for
instance ChatGPT and DeepSeek, have established huge potential in multiple
tasks in natural language processing. At the same time, they also possess
outstanding multilingual translation capabilities. However, these models use
the classical computing realm as a backend. QEDACVC (Quantum Encoder Decoder
Attention-based Convolutional Variational Circuits) is an alternate solution
that explores the quantum computing realm instead of the classical computing
realm to study and demonstrate multilingual machine translation. QEDACVC
introduces the quantum encoder-decoder architecture that simulates and runs on
quantum computing hardware via quantum convolution, quantum pooling, quantum
variational circuit, and quantum attention as software alterations. QEDACVC
achieves an accuracy of 82% when trained on the OPUS dataset for English,
French, German, and Hindi corpora for multilingual translations.
[COMMENTS]
12 pages, 12 figures
[LINK]
http://arxiv.org/abs/2505.09407v1
[DATE]
2025-05-14 22:04:44+08:00
[CATEGORIES]
cs.CL
Hakim: Farsi Text Embedding Model
[AUTHORS]
Mehran Sarmadi, Morteza Alikhani, Erfan Zinvandi, Zahra Pourbahman
[ABSTRACT]
Recent advancements in text embedding have significantly improved natural
language understanding across many languages, yet Persian remains notably
underrepresented in large-scale embedding research. In this paper, we present
Hakim, a novel state-of-the-art Persian text embedding model that achieves an
8.5% performance improvement over existing approaches on the FaMTEB benchmark,
outperforming all previously developed Persian language models. As part of this
work, we introduce three new datasets - Corpesia, Pairsia-sup, and
Pairsia-unsup - to support supervised and unsupervised training scenarios.
Additionally, Hakim is designed for applications in chatbots and
retrieval-augmented generation (RAG) systems, particularly addressing retrieval
tasks that require incorporating message history within these systems. We also
propose a new baseline model built on the BERT architecture. Our language model
consistently achieves higher accuracy across various Persian NLP tasks, while
the RetroMAE-based model proves particularly effective for textual information
retrieval applications. Together, these contributions establish a new
foundation for advancing Persian language understanding.
[LINK]
http://arxiv.org/abs/2505.08435v2
[DATE]
2025-05-14 21:47:12+08:00
[CATEGORIES]
cs.CL
cs.LG
Qwen3 Technical Report
[AUTHORS]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu
[ABSTRACT]
In this work, we present Qwen3, the latest version of the Qwen model family.
Qwen3 comprises a series of large language models (LLMs) designed to advance
performance, efficiency, and multilingual capabilities. The Qwen3 series
includes models of both dense and Mixture-of-Experts (MoE) architectures, with
parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is
the integration of thinking mode (for complex, multi-step reasoning) and
non-thinking mode (for rapid, context-driven responses) into a unified
framework. This eliminates the need to switch between different models, such as
chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g.,
QwQ-32B), and enables dynamic mode switching based on user queries or chat
templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing
users to allocate computational resources adaptively during inference, thereby
balancing latency and performance based on task complexity. Moreover, by
leveraging the knowledge from the flagship models, we significantly reduce the
computational resources required to build smaller-scale models, while ensuring
their highly competitive performance. Empirical evaluations demonstrate that
Qwen3 achieves state-of-the-art results across diverse benchmarks, including
tasks in code generation, mathematical reasoning, agent tasks, etc.,
competitive against larger MoE models and proprietary models. Compared to its
predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119
languages and dialects, enhancing global accessibility through improved
cross-lingual understanding and generation capabilities. To facilitate
reproducibility and community-driven research and development, all Qwen3 models
are publicly accessible under Apache 2.0.
[LINK]
http://arxiv.org/abs/2505.09388v1
[DATE]
2025-05-14 21:41:34+08:00
[CATEGORIES]
cs.CL
Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs
[AUTHORS]
Jingcheng Niu, Xingdi Yuan, Tong Wang, Hamidreza Saghir, Amir H. Abdi
[ABSTRACT]
We observe a novel phenomenon, contextual entrainment, across a wide range of
language models (LMs) and prompt settings, providing a new mechanistic
perspective on how LMs become distracted by “irrelevant” contextual
information in the input prompt. Specifically, LMs assign significantly higher
logits (or probabilities) to any tokens that have previously appeared in the
context prompt, even for random tokens. This suggests that contextual
entrainment is a mechanistic phenomenon, occurring independently of the
relevance or semantic relation of the tokens to the question or the rest of the
sentence. We find statistically significant evidence that the magnitude of
contextual entrainment is influenced by semantic factors. Counterfactual
prompts have a greater effect compared to factual ones, suggesting that while
contextual entrainment is a mechanistic phenomenon, it is modulated by semantic
factors.
We hypothesise that there is a circuit of attention heads – the entrainment
heads – that corresponds to the contextual entrainment phenomenon. Using a
novel entrainment head discovery method based on differentiable masking, we
identify these heads across various settings. When we “turn off” these heads,
i.e., set their outputs to zero, the effect of contextual entrainment is
significantly attenuated, causing the model to generate output that capitulates
to what it would produce if no distracting context were provided. Our discovery
of contextual entrainment, along with our investigation into LM distraction via
the entrainment heads, marks a key step towards the mechanistic analysis and
mitigation of the distraction problem.
[LINK]
http://arxiv.org/abs/2505.09338v1
[DATE]
2025-05-14 20:33:05+08:00
[CATEGORIES]
cs.CL
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
[AUTHORS]
Nathalie Kirch, Constantin Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper
[ABSTRACT]
Jailbreaks have been a central focus of research regarding the safety and
reliability of large language models (LLMs), yet the mechanisms underlying
these attacks remain poorly understood. While previous studies have
predominantly relied on linear methods to detect jailbreak attempts and model
refusals, we take a different approach by examining both linear and non-linear
features in prompts that lead to successful jailbreaks. First, we introduce a
novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack
methods. Leveraging this dataset, we train probes to classify successful from
unsuccessful jailbreaks using the latent representations corresponding to
prompt tokens. Notably, we find that even when probes achieve high accuracy in
predicting the success of jailbreaks, their performance often fails to
generalize to unseen attack methods. This reveals that different jailbreaking
strategies exploit different non-linear, non-universal features. Next, we
demonstrate that non-linear probes provide a powerful tool for steering model
behavior. Specifically, we use these probes to guide targeted latent space
perturbations, enabling us to effectively modulate the model’s robustness
against jailbreaks. Overall, our findings challenge the assumption that
jailbreaks can be fully understood through linear or simple universal prompt
features alone, highlighting the importance of a nuanced understanding of the
mechanisms behind LLM vulnerabilities.
[LINK]
http://arxiv.org/abs/2411.03343v2
[DATE]
2025-05-14 20:32:17+08:00
[CATEGORIES]
cs.CL
Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging
[AUTHORS]
Hongjin Qian, Zheng Liu
[ABSTRACT]
Augmenting large language models (LLMs) with external retrieval has become a
standard method to address their inherent knowledge cutoff limitations.
However, traditional retrieval-augmented generation methods employ static,
pre-inference retrieval strategies, making them inadequate for complex tasks
involving ambiguous, multi-step, or evolving information needs. Recent advances
in test-time scaling techniques have demonstrated significant potential in
enabling LLMs to dynamically interact with external tools, motivating the shift
toward adaptive inference-time retrieval. Inspired by Information Foraging
Theory (IFT), we propose InForage, a reinforcement learning framework that
formalizes retrieval-augmented reasoning as a dynamic information-seeking
process. Unlike existing approaches, InForage explicitly rewards intermediate
retrieval quality, encouraging LLMs to iteratively gather and integrate
information through adaptive search behaviors. To facilitate training, we
construct a human-guided dataset capturing iterative search and reasoning
trajectories for complex, real-world web tasks. Extensive evaluations across
general question answering, multi-hop reasoning tasks, and a newly developed
real-time web QA dataset demonstrate InForage’s superior performance over
baseline methods. These results highlight InForage’s effectiveness in building
robust, adaptive, and efficient reasoning agents.
[COMMENTS]
16 pages
[LINK]
http://arxiv.org/abs/2505.09316v1
[DATE]
2025-05-14 20:13:38+08:00
[CATEGORIES]
cs.CL
A Scalable Unsupervised Framework for multi-aspect labeling of Multilingual and Multi-Domain Review Data
[AUTHORS]
Jiin Park, Misuk Kim
[ABSTRACT]
Effectively analyzing online review data is essential across industries.
However, many existing studies are limited to specific domains and languages or
depend on supervised learning approaches that require large-scale labeled
datasets. To address these limitations, we propose a multilingual, scalable,
and unsupervised framework for cross-domain aspect detection. This framework is
designed for multi-aspect labeling of multilingual and multi-domain review
data. In this study, we apply automatic labeling to Korean and English review
datasets spanning various domains and assess the quality of the generated
labels through extensive experiments. Aspect category candidates are first
extracted through clustering, and each review is then represented as an
aspect-aware embedding vector using negative sampling. To evaluate the
framework, we conduct multi-aspect labeling and fine-tune several pretrained
language models to measure the effectiveness of the automatically generated
labels. Results show that these models achieve high performance, demonstrating
that the labels are suitable for training. Furthermore, comparisons with
publicly available large language models highlight the framework’s superior
consistency and scalability when processing large-scale data. A human
evaluation also confirms that the quality of the automatic labels is comparable
to that of manually created labels. This study demonstrates the potential of a robust
multi-aspect labeling approach that overcomes limitations of supervised methods
and is adaptable to multilingual, multi-domain environments. Future research
will explore automatic review summarization and the integration of artificial
intelligence agents to further improve the efficiency and depth of review
analysis.
[COMMENTS]
36 pages, 3 figures
[LINK]
http://arxiv.org/abs/2505.09286v1
[DATE]
2025-05-14 19:11:17+08:00
[CATEGORIES]
cs.CL
Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark
[AUTHORS]
Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, Xuanna Li, Yuze Tang, Jiexian Qiu, Xiaolin Lu, Hongji Yu, Shuang Chen, Yuhua Bi, Xiaofei Zeng, Yixian Chen, Junrong Chen, Lin Yao
[ABSTRACT]
Large Language Models (LLMs) have demonstrated considerable potential in
general practice. However, existing benchmarks and evaluation frameworks
primarily depend on exam-style or simplified question-answer formats, lacking a
competency-based structure aligned with the real-world clinical
responsibilities encountered in general practice. Consequently, the extent to
which LLMs can reliably fulfill the duties of general practitioners (GPs)
remains uncertain. In this work, we propose a novel evaluation framework to
assess the capability of LLMs to function as GPs. Based on this framework, we
introduce a general practice benchmark (GPBench), whose data are meticulously
annotated by domain experts in accordance with routine clinical practice
standards. We evaluate ten state-of-the-art LLMs and analyze their
competencies. Our findings indicate that current LLMs are not yet ready for
deployment in such settings without human oversight, and further optimization
specifically tailored to the daily responsibilities of GPs is essential.
[LINK]
http://arxiv.org/abs/2503.17599v2
[DATE]
2025-05-14 18:25:11+08:00
[CATEGORIES]
cs.CL
Focus, Merge, Rank: Improved Question Answering Based on Semi-structured Knowledge Bases
[AUTHORS]
Derian Boer, Stephen Roth, Stefan Kramer
[ABSTRACT]
In many real-world settings, machine learning models and interactive systems
have access to both structured knowledge, e.g., knowledge graphs or tables, and
unstructured content, e.g., natural language documents. However, most rely on
only one of the two. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking
unstructured content to nodes within structured data, thereby enabling new
strategies for knowledge access and use. In this work, we present
FocusedRetriever, a modular SKB-based framework for multi-hop question
answering. It integrates components (VSS-based entity search, LLM-based
generation of Cypher queries and pairwise re-ranking) in a way that enables it
to outperform state-of-the-art methods across all three STaRK benchmark test
sets, covering diverse domains and multiple performance metrics. The average
first-hit rate exceeds that of the second-best method by 25.7%.
FocusedRetriever leverages (1) the capacity of Large Language Models (LLMs) to
extract relational facts and entity attributes from unstructured text, (2) node
set joins to filter answer candidates based on these extracted triplets and
constraints, (3) vector similarity search to retrieve and rank relevant
unstructured content, and (4) the contextual capabilities of LLMs to finally
rank the top-k answers. For generality, we only incorporate base LLMs in
FocusedRetriever in our evaluation. However, our analysis of intermediate
results highlights several opportunities for further upgrades including
finetuning. The source code is publicly available at
https://github.com/kramerlab/FocusedRetriever .
[LINK]
http://arxiv.org/abs/2505.09246v1
[DATE]
2025-05-14 17:35:56+08:00
[CATEGORIES]
cs.CL
PropNet: a White-Box and Human-Like Network for Sentence Representation
[AUTHORS]
Fei Yang
[ABSTRACT]
Transformer-based embedding methods have dominated the field of sentence
representation in recent years. Although they have achieved remarkable
performance on NLP tasks such as semantic textual similarity (STS),
their black-box nature and large-data-driven training style have raised
concerns, including issues related to bias, trust, and safety. Many efforts
have been made to improve the interpretability of embedding models, but these
problems have not been fundamentally resolved. To achieve inherent
interpretability, we propose a purely white-box and human-like sentence
representation network, PropNet. Inspired by findings from cognitive science,
PropNet constructs a hierarchical network based on the propositions contained
in a sentence. While experiments indicate that PropNet has a significant gap
compared to state-of-the-art (SOTA) embedding models in STS tasks, case studies
reveal substantial room for improvement. Additionally, PropNet enables us to
analyze and understand the human cognitive processes underlying STS benchmarks.
[COMMENTS]
Clarified some ambiguities in the previous version
[LINK]
http://arxiv.org/abs/2502.10725v3
[DATE]
2025-05-14 16:07:08+08:00
[CATEGORIES]
cs.CL
LLM-based NLG Evaluation: Current Status and Challenges
[AUTHORS]
Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, Xiaojun Wan
[ABSTRACT]
Evaluating natural language generation (NLG) is a vital but challenging
problem in natural language processing. Traditional evaluation metrics mainly
capturing content (e.g. n-gram) overlap between system outputs and references
are far from satisfactory, and large language models (LLMs) such as ChatGPT
have demonstrated great potential in NLG evaluation in recent years. Various
automatic evaluation methods based on LLMs have been proposed, including
metrics derived from LLMs, prompting LLMs, fine-tuning LLMs, and human-LLM
collaborative evaluation. In this survey, we first give a taxonomy of LLM-based
NLG evaluation methods, and discuss their pros and cons, respectively. Lastly,
we discuss several open problems in this area and point out future research
directions.
[LINK]
http://arxiv.org/abs/2402.01383v3
[DATE]
2025-05-14 14:05:53+08:00
[CATEGORIES]
cs.CL
FAS: Fast ANN-SNN Conversion for Spiking Large Language Models
[AUTHORS]
Long Chen, Xiaotian Song, Andy Song, BaDong Chen, Jiancheng Lv, Yanan Sun
[ABSTRACT]
Spiking Large Language Models have been shown to be a good alternative to LLMs
in various scenarios. Existing methods for creating Spiking LLMs, i.e., direct
training and ANN-SNN conversion, often suffer from performance degradation and
relatively high computational costs. To address these issues, we propose a
novel Fast ANN-SNN conversion strategy (FAS) that transforms LLMs into spiking
LLMs in two stages. The first stage employs a full-parameter fine-tuning of
pre-trained models, so it does not need any direct training from scratch. The
second stage introduces a coarse-to-fine calibration method to reduce
conversion errors and improve accuracy. Experiments on both language and
vision-language tasks across four different scales of LLMs demonstrate that FAS
can achieve state-of-the-art performance yet with significantly reduced
inference latency and computational costs. Notably, FAS only takes eight
timesteps to achieve an accuracy 3% higher than that of the OPT-7B model,
while reducing energy consumption by 96.63%. The source code is available at
https://github.com/lc783/FAS
[LINK]
http://arxiv.org/abs/2502.04405v2
[DATE]
2025-05-14 13:23:45+08:00
[CATEGORIES]
cs.LG
cs.CL
Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction
[AUTHORS]
Xiaowei Zhu, Yubing Ren, Yanan Cao, Xixun Lin, Fang Fang, Yangxi Li
[ABSTRACT]
The rapid advancement of large language models has raised significant
concerns regarding their potential misuse by malicious actors. As a result,
developing effective detectors to mitigate these risks has become a critical
priority. However, most existing detection methods focus excessively on
detection accuracy, often neglecting the societal risks posed by high false
positive rates (FPRs). This paper addresses this issue by leveraging Conformal
Prediction (CP), which effectively constrains the upper bound of FPRs. While
directly applying CP constrains FPRs, it also leads to a significant reduction
in detection performance. To overcome this trade-off, this paper proposes a
Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal
Prediction (MCP), which both enforces the FPR constraint and improves detection
performance. This paper also introduces RealDet, a high-quality dataset that
spans a wide range of domains, ensuring realistic calibration and enabling
superior detection performance when combined with MCP. Empirical evaluations
demonstrate that MCP effectively constrains FPRs, significantly enhances
detection performance, and increases robustness against adversarial attacks
across multiple detectors and datasets.
[LINK]
http://arxiv.org/abs/2505.05084v2
[DATE]
2025-05-14 12:38:15+08:00
[CATEGORIES]
cs.CL
CEC-Zero: Chinese Error Correction Solution Based on LLM
[AUTHORS]
Sophie Zhang, Zhiming Lin
[ABSTRACT]
Recent advancements in large language models (LLMs) demonstrate exceptional
Chinese text processing capabilities, particularly in Chinese Spelling
Correction (CSC). While LLMs outperform traditional BERT-based models in
accuracy and robustness, challenges persist in reliability and generalization.
This paper proposes CEC-Zero, a novel reinforcement learning (RL) framework
enabling LLMs to self-correct through autonomous error strategy learning
without external supervision. By integrating RL with LLMs’ generative power,
the method eliminates dependency on annotated data or auxiliary models.
Experiments reveal RL-enhanced LLMs achieve industry-viable accuracy and
superior cross-domain generalization, offering a scalable solution for
reliability optimization in Chinese NLP applications. This breakthrough
facilitates LLM deployment in practical Chinese text correction scenarios while
establishing a new paradigm for self-improving language models.
[LINK]
http://arxiv.org/abs/2505.09082v1
[DATE]
2025-05-14 10:35:47+08:00
[CATEGORIES]
cs.CL
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs
[AUTHORS]
Yidan Zhang, Yu Wan, Boyi Deng, Baosong Yang, Haoran Wei, Fei Huang, Bowen Yu, Junyang Lin, Fei Huang, Jingren Zhou
[ABSTRACT]
Recent advancements in large language models (LLMs) showcase varied
multilingual capabilities across tasks like translation, code generation, and
reasoning. Previous assessments often limited their scope to fundamental
natural language processing (NLP) or isolated capability-specific tasks. To
alleviate this drawback, we aim to present a comprehensive multilingual
multitask benchmark. First, we introduce P-MMEval, a large-scale benchmark
covering effective fundamental and capability-specialized datasets.
Furthermore, P-MMEval delivers consistent language coverage across various
datasets and provides parallel samples. Finally, we conduct extensive
experiments on representative multilingual model series to compare performances
across models and tasks, explore the relationship between multilingual
performances and factors such as tasks, model sizes, languages, and prompts,
and examine the effectiveness of knowledge transfer from English to other
languages. The resulting insights are intended to offer valuable guidance for
future research. The dataset is available at
https://huggingface.co/datasets/Qwen/P-MMEval.
[LINK]
http://arxiv.org/abs/2411.09116v2
[DATE]
2025-05-14 10:29:41+08:00
[CATEGORIES]
cs.CL
DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models
[AUTHORS]
Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
[ABSTRACT]
Recent advances in reinforcement learning for language model post-training,
such as Group Relative Policy Optimization (GRPO), have shown promise in
low-resource settings. However, GRPO typically relies on solution-level and
scalar reward signals that fail to capture the semantic diversity among sampled
completions. This leads to what we identify as a diversity-quality
inconsistency, where distinct reasoning paths may receive indistinguishable
rewards. To address this limitation, we propose $\textit{Diversity-aware Reward
Adjustment}$ (DRA), a method that explicitly incorporates semantic diversity
into the reward computation. DRA uses Submodular Mutual Information (SMI) to
downweight redundant completions and amplify rewards for diverse ones. This
encourages better exploration during learning, while maintaining stable
exploitation of high-quality samples. Our method integrates seamlessly with
both GRPO and its variant DR.~GRPO, resulting in $\textit{DRA-GRPO}$ and
$\textit{DGA-DR.~GRPO}$. We evaluate our method on five mathematical reasoning
benchmarks and find that it outperforms recent strong baselines. It achieves
state-of-the-art performance with an average accuracy of 58.2%, using only
7,000 fine-tuning samples and a total training cost of approximately $55. The
code is available at https://github.com/xiwenc1/DRA-GRPO.
[LINK]
http://arxiv.org/abs/2505.09655v1
[DATE]
2025-05-14 10:02:32+08:00
[CATEGORIES]
cs.CL
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
[AUTHORS]
Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, Rema Padman
[ABSTRACT]
Recent advancements in large language models (LLMs) have revolutionized their
ability to handle single-turn tasks, yet real-world applications demand
sophisticated multi-turn interactions. This survey provides a comprehensive
review of recent advancements in evaluating and enhancing multi-turn
interactions in LLMs. Focusing on task-specific scenarios, from instruction
following in diverse domains such as math and coding to complex conversational
engagements in roleplay, healthcare, education, and even adversarial jailbreak
settings, we systematically examine the challenges of maintaining context,
coherence, fairness, and responsiveness over prolonged dialogues. The paper
organizes current benchmarks and datasets into coherent categories that reflect
the evolving landscape of multi-turn dialogue evaluation. In addition, we
review a range of enhancement methodologies under multi-turn settings,
including model-centric strategies (contextual learning, supervised
fine-tuning, reinforcement learning, and new architectures), external
integration approaches (memory-augmented, retrieval-based methods, and
knowledge graph), and agent-based techniques for collaborative interactions.
Finally, we discuss open challenges and propose future directions for research
to further advance the robustness and effectiveness of multi-turn interactions
in LLMs. Related resources and papers are available at
https://github.com/yubol-cmu/Awesome-Multi-Turn-LLMs.
[LINK]
http://arxiv.org/abs/2504.04717v4
[DATE]
2025-05-14 09:48:30+08:00
[CATEGORIES]
cs.CL
Fusing Bidirectional Chains of Thought and Reward Mechanisms: A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage
[AUTHORS]
Ruilin Liu, Zhixiao Zhao, Jieqiong Li, Chang Liu, Dongbo Wang
[ABSTRACT]
The rapid development of large language models (LLMs) has provided
significant support and opportunities for the advancement of domain-specific
LLMs. However, fine-tuning these large models using Intangible Cultural
Heritage (ICH) data inevitably faces challenges such as bias, incorrect
knowledge inheritance, and catastrophic forgetting. To address these issues, we
propose a novel training method that integrates bidirectional chains of
thought and a reward mechanism. This method is built upon ICH-Qwen, a large
language model specifically designed for the field of intangible cultural
heritage. The proposed method enables the model to not only perform forward
reasoning but also enhances the accuracy of the generated answers by utilizing
reverse questioning and reverse reasoning to activate the model’s latent
knowledge. Additionally, a reward mechanism is introduced during training to
optimize the decision-making process. This mechanism improves the quality of
the model’s outputs through structural and content evaluations with different
weighting schemes. We conduct comparative experiments on ICH-Qwen, with results
demonstrating that our method outperforms 0-shot, step-by-step reasoning,
knowledge distillation, and question augmentation methods in terms of accuracy,
Bleu-4, and Rouge-L scores on the question-answering task. Furthermore, the
paper highlights the effectiveness of combining the bidirectional chains of
thought and reward mechanism through ablation experiments. In addition, a
series of generalizability experiments are conducted, with results showing that
the proposed method yields improvements on various domain-specific datasets and
advanced models in areas such as Finance, Wikidata, and StrategyQA. This
demonstrates that the method is adaptable to multiple domains and provides a
valuable approach for model training in future applications across diverse
fields.
[COMMENTS]
22 pages, 5 figures
[LINK]
http://arxiv.org/abs/2505.08167v2
[DATE]
2025-05-14 09:35:33+08:00
[CATEGORIES]
cs.CL
A Comprehensive Analysis of Large Language Model Outputs: Similarity, Diversity, and Bias
[AUTHORS]
Brandon Smith, Mohamed Reda Bouadjenek, Tahsin Alamgir Kheya, Phillip Dawson, Sunil Aryal
[ABSTRACT]
Large Language Models (LLMs) represent a major step toward artificial general
intelligence, significantly advancing our ability to interact with technology.
While LLMs perform well on Natural Language Processing tasks – such as
translation, generation, code writing, and summarization – questions remain
about their output similarity, variability, and ethical implications. For
instance, how similar are texts generated by the same model? How does this
compare across different models? And which models best uphold ethical
standards? To investigate, we used 5,000 prompts spanning diverse tasks like
generation, explanation, and rewriting. This resulted in approximately 3
million texts from 12 LLMs, including proprietary and open-source systems from
OpenAI, Google, Microsoft, Meta, and Mistral. Key findings include: (1) outputs
from the same LLM are more similar to each other than to human-written texts;
(2) models like WizardLM-2-8x22b generate highly similar outputs, while GPT-4
produces more varied responses; (3) LLM writing styles differ significantly,
with Llama 3 and Mistral showing higher similarity, and GPT-4 standing out for
distinctiveness; (4) differences in vocabulary and tone underscore the
linguistic uniqueness of LLM-generated content; (5) some LLMs demonstrate
greater gender balance and reduced bias. These results offer new insights into
the behavior and diversity of LLM outputs, helping guide future development and
ethical evaluation.
[LINK]
http://arxiv.org/abs/2505.09056v1
[DATE]
2025-05-14 09:21:46+08:00
[CATEGORIES]
cs.CL
Atomic Consistency Preference Optimization for Long-Form Question Answering
[AUTHORS]
Jingfeng Chen, Raghuveer Thirukovalluru, Junlin Wang, Kaiwei Luo, Bhuwan Dhingra
[ABSTRACT]
Large Language Models (LLMs) frequently produce factoid hallucinations -
plausible yet incorrect answers. A common mitigation strategy is model
alignment, which improves factual accuracy by training on curated factual and
non-factual pairs. However, this approach often relies on a stronger model
(e.g., GPT-4) or an external knowledge base to assess factual correctness,
which may not always be accessible. To address this, we propose Atomic
Consistency Preference Optimization (ACPO), a self-supervised preference-tuning
method that enhances factual accuracy without external supervision. ACPO
leverages atomic consistency signals, i.e., the agreement of individual facts
across multiple stochastic responses, to identify high- and low-quality data
pairs for model alignment. By eliminating the need for costly GPT calls, ACPO
provides a scalable and efficient approach to improving factoid
question-answering. Despite being self-supervised, empirical results
demonstrate that ACPO outperforms FactAlign, a strong supervised alignment
baseline, by 1.95 points on the LongFact and BioGen datasets, highlighting its
effectiveness in enhancing factual reliability without relying on external
models or knowledge bases.
[COMMENTS]
16 pages, 2 figures
[LINK]
http://arxiv.org/abs/2505.09039v1
[DATE]
2025-05-14 08:39:47+08:00
[CATEGORIES]
cs.CL
Improving the Reliability of LLMs: Combining CoT, RAG, Self-Consistency, and Self-Verification
[AUTHORS]
Adarsh Kumar, Hwiyoon Kim, Jawahar Sai Nathani, Neil Roy
[ABSTRACT]
Hallucination, where large language models (LLMs) generate confident but
incorrect or irrelevant information, remains a key limitation in their
application to complex, open-ended tasks. Chain-of-thought (CoT) prompting has
emerged as a promising method for improving multistep reasoning by guiding
models through intermediate steps. However, CoT alone does not fully address
the hallucination problem. In this work, we investigate how combining CoT with
retrieval-augmented generation (RAG), as well as applying self-consistency and
self-verification strategies, can reduce hallucinations and improve factual
accuracy. By incorporating external knowledge sources during reasoning and
enabling models to verify or revise their own outputs, we aim to generate more
accurate and coherent responses. We present a comparative evaluation of
baseline LLMs against CoT, CoT+RAG, self-consistency, and self-verification
techniques. Our results highlight the effectiveness of each method and identify
the most robust approach for minimizing hallucinations while preserving fluency
and reasoning depth.
[LINK]
http://arxiv.org/abs/2505.09031v1
[DATE]
2025-05-14 07:57:02+08:00
[CATEGORIES]
cs.CL
Automated Meta Prompt Engineering for Alignment with the Theory of Mind
[AUTHORS]
Aaron Baughman, Rahul Agarwal, Eduardo Morales, Gozde Akay
[ABSTRACT]
We introduce a method of meta-prompting that jointly produces fluent text for
complex tasks while optimizing the similarity of neural states between a
human’s mental expectation and a Large Language Model’s (LLM) neural
processing. A technique of agentic reinforcement learning is applied, in which
an LLM as a Judge (LLMaaJ) teaches another LLM, through in-context learning,
how to produce content by interpreting the intended and unintended generated
text traits. To measure human mental beliefs around content production, users
modify long form AI-generated text articles before publication at the US Open
2024 tennis Grand Slam. Now, an LLMaaJ can solve the Theory of Mind (ToM)
alignment problem by anticipating and including human edits within the creation
of text from an LLM. Throughout experimentation and by interpreting the results
of a live production system, the expectations of human content reviewers were
100% aligned with the AI 53.8% of the time, with an average iteration count of
4.38. The geometric interpretation of content traits such as factualness,
novelty, repetitiveness, and relevancy over a Hilbert vector space, which combines
spatial volume (all trait importance) with vertices alignment (individual trait
relevance), enabled the LLMaaJ to optimize on Human ToM. This resulted in an
increase in content quality by extending the coverage of tennis action. Our
work that was deployed at the US Open 2024 has been used across other live
events within sports and entertainment.
[COMMENTS]
9 pages, 6 figures, 3 tables
[LINK]
http://arxiv.org/abs/2505.09024v1
[DATE]
2025-05-14 07:42:36+08:00
[CATEGORIES]
cs.CL
cs.LG
For GPT-4 as with Humans: Information Structure Predicts Acceptability of Long-Distance Dependencies
[AUTHORS]
Nicole Cuneo, Eleanor Graves, Supantho Rakshit, Adele E. Goldberg
[ABSTRACT]
It remains debated how well any LM understands natural language or generates
reliable metalinguistic judgments. Moreover, relatively little work has
demonstrated that LMs can represent and respect subtle relationships between
form and function proposed by linguists. We here focus on a particular such
relationship established in recent work: English speakers’ judgments about the
information structure of canonical sentences predict independently collected
acceptability ratings on corresponding ‘long distance dependency’ [LDD]
constructions, across a wide array of base constructions and multiple types of
LDDs. To determine whether any LM captures this relationship, we probe GPT-4 on
the same tasks used with humans and new extensions. Results reveal reliable
metalinguistic skill on the information structure and acceptability tasks,
replicating a striking interaction between the two, despite the zero-shot,
explicit nature of the tasks, and little to no chance of contamination [Studies
1a, 1b]. Study 2 manipulates the information structure of base sentences and
confirms a causal relationship: increasing the prominence of a constituent in a
context sentence increases the subsequent acceptability ratings on an LDD
construction. The findings suggest a tight relationship between natural and
GPT-4 generated English, and between information structure and syntax, which
begs for further exploration.
[LINK]
http://arxiv.org/abs/2505.09005v1
[DATE]
2025-05-14 06:41:13+08:00
[CATEGORIES]
cs.CL
An Analytical Emotion Framework of Rumour Threads on Social Media
[AUTHORS]
Rui Xing, Boyang Sun, Kun Zhang, Preslav Nakov, Timothy Baldwin, Jey Han Lau
[ABSTRACT]
Rumours in online social media pose significant risks to modern society,
motivating the need for better understanding of how they develop. We focus
specifically on the interface between emotion and rumours in threaded
discourses, building on the surprisingly sparse literature on the topic which
has largely focused on a single aspect of emotions within the original rumour
posts themselves, and largely overlooked the comparative differences between
rumours and non-rumours. In this work, we take one step further to provide a
comprehensive analytical emotion framework with multi-aspect emotion detection,
contrasting rumour and non-rumour threads and providing both correlation and
causal analysis of emotions. We applied our framework on existing widely-used
rumour datasets to further understand the emotion dynamics in online social
media threads. Our framework reveals that rumours trigger more negative
emotions (e.g., anger, fear, pessimism), while non-rumours evoke more positive
ones. Emotions are contagious: rumours spread negativity, while non-rumours spread
positivity. Causal analysis shows surprise bridges rumours and other emotions;
pessimism comes from sadness and fear, while optimism arises from joy and love.
[COMMENTS]
Accepted to ICWSM 2025 MisD Workshop
[LINK]
http://arxiv.org/abs/2502.16560v2
[DATE]
2025-05-14 06:37:48+08:00
[CATEGORIES]
cs.CL
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
[AUTHORS]
Yangyi Chen, Hao Peng, Tong Zhang, Heng Ji
[ABSTRACT]
In standard large vision-language models (LVLMs) pre-training, the model
typically maximizes the joint probability of the caption conditioned on the
image via next-token prediction (NTP); however, since only a small subset of
caption tokens directly relates to the visual content, this naive NTP
unintentionally fits the model to noise and increases the risk of
hallucination. We present PRIOR, a simple vision-language pre-training approach
that addresses this issue by prioritizing image-related tokens through
differential weighting in the NTP loss, drawing from the importance sampling
framework. PRIOR introduces a reference model, a text-only large language model
(LLM) trained on the captions without image inputs, to weight each token based
on its probability for LVLM training. Intuitively, tokens that are directly
related to the visual inputs are harder to predict without the image and thus
receive lower probabilities from the text-only reference LLM. During training,
we implement a token-specific re-weighting term based on the importance scores
to adjust each token’s loss. We implement PRIOR in two distinct settings: LVLMs
with visual encoders and LVLMs without visual encoders. We observe 19% and 8%
average relative improvement, respectively, on several vision-language
benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling
properties, as demonstrated by significantly higher scaling coefficients,
indicating greater potential for performance gains compared to NTP given
increasing compute and data.
[COMMENTS]
The code will be available at https://github.com/Yangyi-Chen/PRIOR
[LINK]
http://arxiv.org/abs/2505.08971v1
[DATE]
2025-05-14 05:27:52+08:00
[CATEGORIES]
cs.CL
cs.LG
ForeCite: Adapting Pre-Trained Language Models to Predict Future Citation Rates of Academic Papers
[AUTHORS]
Gavin Hull, Alex Bihlo
[ABSTRACT]
Predicting the future citation rates of academic papers is an important step
toward the automation of research evaluation and the acceleration of scientific
progress. We present $\textbf{ForeCite}$, a simple but powerful framework to
append pre-trained causal language models with a linear head for average
monthly citation rate prediction. Adapting transformers for regression tasks,
ForeCite achieves a test correlation of $\rho = 0.826$ on a curated dataset of
900K+ biomedical papers published between 2000 and 2024, a 27-point improvement
over the previous state-of-the-art. Comprehensive scaling-law analysis reveals
consistent gains across model sizes and data volumes, while temporal holdout
experiments confirm practical robustness. Gradient-based saliency heatmaps
suggest a potentially undue reliance on titles and abstract texts. These
results establish a new state-of-the-art in forecasting the long-term influence
of academic research and lay the groundwork for the automated, high-fidelity
evaluation of scientific contributions.
[COMMENTS]
16 pages, 13 figures
[LINK]
http://arxiv.org/abs/2505.08941v1
[DATE]
2025-05-14 04:10:00+08:00
[CATEGORIES]
cs.LG
cs.CL
Simulating and Analysing Human Survey Responses with Large Language Models: A Case Study in Energy Stated Preference
[AUTHORS]
Han Wang, Jacek Pawlak, Aruna Sivakumar
[ABSTRACT]
Survey research plays a crucial role in studies by capturing consumer
preferences and informing policy decisions. Stated preference (SP) surveys help
researchers understand how individuals make trade-offs in hypothetical,
potentially futuristic, scenarios. However, traditional methods are costly,
time-consuming, and affected by respondent fatigue and ethical constraints.
Large language models (LLMs) have shown remarkable capabilities in generating
human-like responses, prompting interest in their use in survey research. This
study investigates LLMs for simulating consumer choices in energy-related SP
surveys and explores their integration into data collection and analysis
workflows. Test scenarios were designed to assess the simulation performance of
several LLMs (LLaMA 3.1, Mistral, GPT-3.5, DeepSeek-R1) at individual and
aggregated levels, considering prompt design, in-context learning (ICL),
chain-of-thought (CoT) reasoning, model types, integration with traditional
choice models, and potential biases. While LLMs achieve accuracy above random
guessing, performance remains insufficient for practical simulation use.
Cloud-based LLMs do not consistently outperform smaller local models.
DeepSeek-R1 achieves the highest average accuracy (77%) and outperforms
non-reasoning LLMs in accuracy, factor identification, and choice distribution
alignment. Previous SP choices are the most effective input; longer prompts
with more factors reduce accuracy. Mixed logit models can support LLM prompt
refinement. Reasoning LLMs show potential in data analysis by indicating factor
significance, offering a qualitative complement to statistical models. Despite
limitations, pre-trained LLMs offer scalability and require minimal historical
data. Future work should refine prompts, further explore CoT reasoning, and
investigate fine-tuning techniques.
[LINK]
http://arxiv.org/abs/2503.10652v2
[DATE]
2025-05-14 03:38:19+08:00
[CATEGORIES]
cs.CL
Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples
[AUTHORS]
Chengqian Gao, Haonan Li, Liu Liu, Zeke Xie, Peilin Zhao, Zhiqiang Xu
[COMMENTS]
Accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2502.09650v2
[DATE]
2025-05-14 02:54:09+08:00
[CATEGORIES]
cs.CL
cs.LG
Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora
[AUTHORS]
Michael Majurski, Cynthia Matuszek
[ABSTRACT]
Language Models (LMs) continue to advance, improving response quality and
coherence. Given Internet-scale training datasets, LMs have likely encountered
much of what users might ask them to generate in some form during their
training. A plethora of evaluation benchmarks have been constructed to assess
model quality, response appropriateness, and reasoning capabilities. However,
the human effort required for benchmark construction is limited and being
rapidly outpaced by the size and scope of the models under evaluation.
Additionally, having humans build a benchmark for every possible domain of
interest is impractical. Therefore, we propose a methodology for automating the
construction of fact-based synthetic data model evaluations grounded in
document populations. This work leverages those very same LMs to evaluate
domain-specific knowledge automatically, using only grounding documents (e.g.,
a textbook) as input. This synthetic data benchmarking approach corresponds
well with human curated questions with a Spearman ranking correlation of 0.96
and a benchmark evaluation Pearson accuracy correlation of 0.79. This novel
tool supports generating both multiple choice and open-ended synthetic data
questions to gain diagnostic insight of LM capability. We apply this
methodology to evaluate model performance on a recent relevant arXiv preprint,
discovering a surprisingly strong performance from Gemma3 models.
[LINK]
http://arxiv.org/abs/2505.08905v1
[DATE]
2025-05-14 02:50:03+08:00
[CATEGORIES]
cs.CL
Performance Gains of LLMs With Humans in a World of LLMs Versus Humans
[AUTHORS]
Lucas McCullum, Pelagie Ami Agassi, Leo Anthony Celi, Daniel K. Ebner, Chrystinne Oliveira Fernandes, Rachel S. Hicklen, Mkliwa Koumbia, Lisa Soleymani Lehmann, David Restrepo
[ABSTRACT]
Currently, a considerable research effort is devoted to comparing LLMs to a
group of human experts, where the term “expert” is often ill-defined or
variable, at best, in a state of constantly updating LLM releases. Without
proper safeguards in place, LLMs will threaten to cause harm to the established
structure of safe delivery of patient care which has been carefully developed
throughout history to keep the safety of the patient at the forefront. A key
driver of LLM innovation is founded on community research efforts which, if
continuing to operate under “humans versus LLMs” principles, will expedite this
trend. Therefore, research efforts moving forward must focus on effectively
characterizing the safe use of LLMs in clinical settings that persist across
the rapid development of novel LLM models. In this communication, we
demonstrate that rather than comparing LLMs to humans, there is a need to
develop strategies enabling efficient work of humans with LLMs in an almost
symbiotic manner.
[LINK]
http://arxiv.org/abs/2505.08902v1
[DATE]
2025-05-14 02:44:22+08:00
[CATEGORIES]
cs.CL
Clicking some of the silly options: Exploring Player Motivation in Static and Dynamic Educational Interactive Narratives
[AUTHORS]
Daeun Hwang, Samuel Shields, Alex Calderwood, Shi Johnson-Bey, Michael Mateas, Noah Wardrip-Fruin, Edward F. Melcer
[ABSTRACT]
Motivation is an important factor underlying successful learning. Previous
research has demonstrated the positive effects that static interactive
narrative games can have on motivation. Concurrently, advances in AI have made
dynamic and adaptive approaches to interactive narrative increasingly
accessible. However, limited work has explored the impact that dynamic
narratives can have on learner motivation. In this paper, we compare two
versions of Academical, a choice-based educational interactive narrative game
about research ethics. One version employs a traditional hand-authored
branching plot (i.e., static narrative) while the other dynamically sequences
plots during play (i.e., dynamic narrative). Results highlight the importance
of responsive content and a variety of choices for player engagement, while
also illustrating the challenge of balancing pedagogical goals with the dynamic
aspects of narrative. We also discuss design implications that arise from these
findings. Ultimately, this work provides initial steps to illuminate the
emerging potential of AI-driven dynamic narrative in educational games.
[COMMENTS]
8 pages, 3 figures, 1 table, 1 appendix. Workshop paper, CHI 2025
Augmented Educators and AI
[LINK]
http://arxiv.org/abs/2505.08891v1
[DATE]
2025-05-14 02:27:25+08:00
[CATEGORIES]
cs.CL
InductionBench: LLMs Fail in the Simplest Complexity Class
[AUTHORS]
Wenyue Hua, Tyler Wong, Sun Fei, Liangming Pan, Adam Jardine, William Yang Wang
[ABSTRACT]
Large language models (LLMs) have shown remarkable improvements in reasoning
and many existing benchmarks have been addressed by models such as o1 and o3
either fully or partially. However, a majority of these benchmarks emphasize
deductive reasoning, including mathematical and coding tasks in which rules
such as mathematical axioms or programming syntax are clearly defined, based on
which LLMs can plan and apply these rules to arrive at a solution. In contrast,
inductive reasoning, where one infers the underlying rules from observed data,
remains less explored. Such inductive processes lie at the heart of scientific
discovery, as they enable researchers to extract general principles from
empirical observations. To assess whether LLMs possess this capacity, we
introduce InductionBench, a new benchmark designed to evaluate the inductive
reasoning ability of LLMs. Our experimental findings reveal that even the most
advanced models available struggle to master the simplest complexity classes
within the subregular hierarchy of functions, highlighting a notable deficiency
in current LLMs’ inductive reasoning capabilities. Code and data are available at
https://github.com/Wenyueh/inductive_reasoning_benchmark.
[COMMENTS]
25 pages, 10 figures, more details including examples and prompts are
added
[LINK]
http://arxiv.org/abs/2502.15823v4
[DATE]
2025-05-14 02:06:09+08:00
[CATEGORIES]
cs.LG
cs.CL
CodePDE: An Inference Framework for LLM-driven PDE Solver Generation
[AUTHORS]
Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar
[ABSTRACT]
Partial differential equations (PDEs) are fundamental to modeling physical
systems, yet solving them remains a complex challenge. Traditional numerical
solvers rely on expert knowledge to implement and are computationally
expensive, while neural-network-based solvers require large training datasets
and often lack interpretability. In this work, we frame PDE solving as a code
generation task and introduce CodePDE, the first inference framework for
generating PDE solvers using large language models (LLMs). Leveraging advanced
inference-time algorithms and scaling strategies, CodePDE unlocks critical
capacities of LLMs for PDE solving: reasoning, debugging, self-refinement, and
test-time scaling – all without task-specific tuning. CodePDE achieves
superhuman performance across a range of representative PDE problems. We also
present a systematic empirical analysis of LLM-generated solvers, analyzing
their accuracy, efficiency, and numerical scheme choices. Our findings
highlight the promise and the current limitations of LLMs in PDE solving,
offering a new perspective on solver design and opportunities for future model
development. Our code is available at https://github.com/LithiumDA/CodePDE.
[LINK]
http://arxiv.org/abs/2505.08783v1
[DATE]
2025-05-14 01:58:08+08:00
[CATEGORIES]
cs.LG
cs.CL
Graph RAG for Legal Norms: A Hierarchical and Temporal Approach
[AUTHORS]
Hudson de Martim
[ABSTRACT]
This article proposes an adaptation of Graph Retrieval Augmented Generation
(Graph RAG) specifically designed for the analysis and comprehension of legal
norms, which are characterized by their predefined hierarchical structure,
extensive network of internal and external references and multiple temporal
versions. By combining structured knowledge graphs with contextually enriched
text segments, Graph RAG offers a promising solution to address the inherent
complexity and vast volume of legal data. The integration of hierarchical
structure and temporal evolution into knowledge graphs - along with the concept
of comprehensive Text Units - facilitates the construction of richer,
interconnected representations of legal knowledge. Through a detailed analysis
of Graph RAG and its application to legal norm datasets, this article aims to
advance the field of Artificial Intelligence applied to Law, creating
opportunities for more effective systems in legal research, legislative
analysis, and decision support.
[LINK]
http://arxiv.org/abs/2505.00039v2
[DATE]
2025-05-14 01:19:55+08:00
[CATEGORIES]
cs.CL
Self-reflecting Large Language Models: A Hegelian Dialectical Approach
[AUTHORS]
Sara Abdali, Can Goksen, Saeed Amizadeh, Julie E. Maybee, Kazuhito Koishida
[ABSTRACT]
Investigating NLP through a philosophical lens has recently caught
researchers’ attention as it connects computational methods with classical schools
of philosophy. This paper introduces a philosophical approach inspired by the
Hegelian Dialectic for LLMs’ self-reflection, utilizing a
self-dialectical approach to emulate internal critiques and then synthesize new
ideas by resolving the opposing points of view. Moreover, this paper
investigates the effect of LLMs’ temperature for generation by establishing a
dynamic annealing approach, which promotes the creativity in the early stages
and gradually refines it by focusing on the nuances, as well as a
fixed-temperature strategy for generation. We assess the effectiveness of our
proposed method in generating novel ideas and in improving the reasoning
abilities of LLMs during problem-solving. Moreover, we implement a Multi-Agent
Majority Voting (MAMV) strategy to assess the validity and novelty of the
generated ideas, which proves useful in the absence of domain experts. Our
experiments demonstrate promising results in generating ideas and enhancing
problem-solving performance.
[LINK]
http://arxiv.org/abs/2501.14917v5
[DATE]
2025-05-14 01:06:22+08:00
[CATEGORIES]
cs.CL
cs.LG
Aya Vision: Advancing the Frontier of Multilingual Multimodality
[AUTHORS]
Saurabh Dash, Yiyang Nan, John Dang, Arash Ahmadian, Shivalika Singh, Madeline Smith, Bharat Venkitesh, Vlad Shmyhlo, Viraat Aryabumi, Walter Beller-Morales, Jeremy Pekmez, Jason Ozuzu, Pierre Richemond, Acyr Locatelli, Nick Frosst, Phil Blunsom, Aidan Gomez, Ivan Zhang, Marzieh Fadaee, Manoj Govindassamy, Sudip Roy, Matthias Gallé, Beyza Ermis, Ahmet Üstün, Sara Hooker
[ABSTRACT]
Building multimodal language models is fundamentally challenging: it requires
aligning vision and language modalities, curating high-quality instruction
data, and avoiding the degradation of existing text-only capabilities once
vision is introduced. These difficulties are further magnified in the
multilingual setting, where the need for multimodal data in different languages
exacerbates existing data scarcity, machine translation often distorts meaning,
and catastrophic forgetting is more pronounced. To address the aforementioned
challenges, we introduce novel techniques spanning both data and modeling.
First, we develop a synthetic annotation framework that curates high-quality,
diverse multilingual multimodal instruction data, enabling Aya Vision models to
produce natural, human-preferred responses to multimodal inputs across many
languages. Complementing this, we propose a cross-modal model merging technique
that mitigates catastrophic forgetting, effectively preserving text-only
capabilities while simultaneously enhancing multimodal generative performance.
Aya-Vision-8B achieves best-in-class performance compared to strong multimodal
models such as Qwen-2.5-VL-7B, Pixtral-12B, and even much larger
Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which
outperforms models more than twice its size, such as Molmo-72B and
LLaMA-3.2-90B-Vision. Our work advances multilingual progress on the
multi-modal frontier, and provides insights into techniques that effectively
bend the need for compute while delivering extremely high performance.
[LINK]
http://arxiv.org/abs/2505.08751v1
[DATE]
2025-05-14 01:03:48+08:00
[CATEGORIES]
cs.CL
cs.LG
AC-Reason: Towards Theory-Guided Actual Causality Reasoning with Large Language Models
[AUTHORS]
Yanxi Zhang, Xin Cong, Zhong Zhang, Xiao Liu, Dongyan Zhao, Yesai Wu
[ABSTRACT]
Actual causality (AC), a fundamental aspect of causal reasoning (CR), is
responsible for attribution and responsibility assignment in real-world
scenarios. However, existing LLM-based methods lack grounding in formal AC
theory, resulting in limited interpretability. Therefore, we propose AC-Reason,
a semi-formal reasoning framework that identifies causally relevant events
within an AC scenario, infers the values of their formal causal factors (e.g.,
sufficiency, necessity, and normality), and answers AC queries via a
theory-guided algorithm with explanations. While AC-Reason does not explicitly
construct a causal graph, it operates over variables in the underlying causal
structure to support principled reasoning. To enable comprehensive evaluation,
we introduce AC-Bench, a new benchmark built upon and substantially extending
Big-Bench Hard Causal Judgment (BBH-CJ). AC-Bench comprises ~1K carefully
annotated samples, each with detailed reasoning steps, and focuses solely on
actual causation. The case study shows that synthesized samples in AC-Bench
present greater challenges for LLMs. Extensive experiments on BBH-CJ and
AC-Bench show that AC-Reason consistently improves LLM performance over
baselines. On BBH-CJ, all tested LLMs surpass the average human rater accuracy
of 69.60%, with GPT-4 + AC-Reason achieving 75.04%. On AC-Bench, GPT-4 +
AC-Reason again achieves the highest accuracy of 71.82%. AC-Bench further
enables fine-grained analysis of reasoning faithfulness, revealing that only
Qwen-2.5-72B-Instruct, Claude-3.5-Sonnet, and GPT-4o exhibit faithful
reasoning, whereas GPT-4 tends to exploit shortcuts. Finally, our ablation
study proves that integrating AC theory into LLMs is highly effective, with the
proposed algorithm contributing the most significant performance gains.
[LINK]
http://arxiv.org/abs/2505.08750v1
[DATE]
2025-05-14 01:02:33+08:00
[CATEGORIES]
cs.CL
Probability Consistency in Large Language Models: Theoretical Foundations Meet Empirical Discrepancies
[AUTHORS]
Xiaoliang Luo, Xinyi Xu, Michael Ramscar, Bradley C. Love
[ABSTRACT]
Can autoregressive large language models (LLMs) learn consistent probability
distributions when trained on sequences in different token orders? We prove
formally that for any well-defined probability distribution, sequence
perplexity is invariant under any factorization, including forward, backward,
or arbitrary permutations. This result establishes a rigorous theoretical
foundation for studying how LLMs learn from data and defines principled
protocols for empirical evaluation. Applying these protocols, we show that
prior studies examining ordering effects suffer from critical methodological
flaws. We retrain GPT-2 models across forward, backward, and arbitrary permuted
orders on scientific text. We find systematic deviations from theoretical
invariance across all orderings, with arbitrary permutations strongly deviating
from both forward and backward models, which largely (but not completely)
agreed with one another. Deviations were traceable to differences in
self-attention, reflecting positional and locality biases in processing. Our
theoretical and empirical results provide novel avenues for understanding
positional biases in LLMs and suggest methods for detecting when LLMs’
probability distributions are inconsistent and therefore untrustworthy.
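The invariance claim can be sketched from the chain rule of probability. The notation below ($x_t$, $T$, $\sigma$, $\mathrm{PPL}$) is assumed for illustration and is not taken from the paper:

```latex
% Chain rule: every ordering of the conditionals yields the same joint.
\begin{align}
P(x_1,\dots,x_T)
  &= \prod_{t=1}^{T} P(x_t \mid x_1,\dots,x_{t-1}) && \text{(forward)} \\
  &= \prod_{t=1}^{T} P(x_t \mid x_{t+1},\dots,x_T) && \text{(backward)} \\
  &= \prod_{t=1}^{T} P\bigl(x_{\sigma(t)} \mid x_{\sigma(1)},\dots,x_{\sigma(t-1)}\bigr)
     && \text{(any permutation } \sigma\text{)}
\end{align}
% Sequence perplexity depends only on the joint probability,
% so it is the same under every factorization above.
\[
\mathrm{PPL}(x_{1:T}) = P(x_1,\dots,x_T)^{-1/T}
\]
```

A model trained in one token order only approximates these conditionals, which is where empirical deviations from the identity can arise.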
[LINK]
http://arxiv.org/abs/2505.08739v1
[DATE]
2025-05-14 00:52:19+08:00
[CATEGORIES]
cs.CL
NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context
[AUTHORS]
Ben Yao, Qiuchi Li, Yazhou Zhang, Siyu Yang, Bohan Zhang, Prayag Tiwari, Jing Qin
[ABSTRACT]
This work introduces the first benchmark for nursing value alignment,
consisting of five core value dimensions distilled from international nursing
codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. The
benchmark comprises 1,100 real-world nursing behavior instances collected
through a five-month longitudinal field study across three hospitals of varying
tiers. These instances are annotated by five clinical nurses and then augmented
with LLM-generated counterfactuals with reversed ethic polarity. Each original
case is paired with a value-aligned and a value-violating version, resulting in
2,200 labeled instances that constitute the Easy-Level dataset. To increase
adversarial complexity, each instance is further transformed into a
dialogue-based format that embeds contextual cues and subtle misleading
signals, yielding a Hard-Level dataset. We evaluate 23 state-of-the-art (SoTA)
LLMs on their alignment with nursing values. Our findings reveal three key
insights: (1) DeepSeek-V3 achieves the highest performance on the Easy-Level
dataset (94.55), whereas Claude 3.5 Sonnet outperforms other models on the
Hard-Level dataset (89.43), significantly surpassing the medical LLMs; (2)
Justice is consistently the most difficult nursing value dimension to evaluate;
and (3) in-context learning significantly improves alignment. This work aims to
provide a foundation for value-sensitive LLM development in clinical settings.
The dataset and the code are available at
https://huggingface.co/datasets/Ben012345/NurValues.
[COMMENTS]
25 pages, 10 figures, 16 tables
[LINK]
http://arxiv.org/abs/2505.08734v1
[DATE]
2025-05-14 00:46:25+08:00
[CATEGORIES]
cs.CL
AI Hiring with LLMs: A Context-Aware and Explainable Multi-Agent Framework for Resume Screening
[AUTHORS]
Frank P. -W. Lo, Jianing Qiu, Zeyu Wang, Haibao Yu, Yeming Chen, Gao Zhang, Benny Lo
[ABSTRACT]
Resume screening is a critical yet time-intensive process in talent
acquisition, requiring recruiters to analyze vast volumes of job applications
while remaining objective, accurate, and fair. With the advancements in Large
Language Models (LLMs), their reasoning capabilities and extensive knowledge
bases demonstrate new opportunities to streamline and automate recruitment
workflows. In this work, we propose a multi-agent framework for resume
screening using LLMs to systematically process and evaluate resumes. The
framework consists of four core agents, including a resume extractor, an
evaluator, a summarizer, and a score formatter. To enhance the contextual
relevance of candidate assessments, we integrate Retrieval-Augmented Generation
(RAG) within the resume evaluator, allowing incorporation of external knowledge
sources, such as industry-specific expertise, professional certifications,
university rankings, and company-specific hiring criteria. This dynamic
adaptation enables personalized recruitment, bridging the gap between AI
automation and talent acquisition. We assess the effectiveness of our approach
by comparing AI-generated scores with ratings provided by HR professionals on a
dataset of anonymized online resumes. The findings highlight the potential of
multi-agent RAG-LLM systems in automating resume screening, enabling more
efficient and scalable hiring workflows.
[COMMENTS]
Accepted by CVPR 2025 Workshop
[LINK]
http://arxiv.org/abs/2504.02870v2
[DATE]
2025-05-14 00:41:54+08:00
[CATEGORIES]
cs.CL
Why do LLMs attend to the first token?
[AUTHORS]
Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, Razvan Pascanu
[ABSTRACT]
Large Language Models (LLMs) tend to attend heavily to the first token in the
sequence – creating a so-called attention sink. Many works have studied this
phenomenon in detail, proposing various ways to either leverage or alleviate
it. Attention sinks have been connected to quantisation difficulties, security
issues, and streaming attention. Yet, while many works have provided conditions
in which they occur or not, a critical question remains shallowly answered: Why
do LLMs learn such patterns and how are they being used? In this work, we argue
theoretically and empirically that this mechanism provides a method for LLMs to
avoid over-mixing, connecting this to existing lines of work that study
mathematically how information propagates in Transformers. We conduct
experiments to validate our theoretical intuitions and show how choices such as
context length, depth, and data packing influence the sink behaviour. We hope
that this study provides a new practical perspective on why attention sinks are
useful in LLMs, leading to a better understanding of the attention patterns
that form during training.
[LINK]
http://arxiv.org/abs/2504.02732v3
[DATE]
2025-05-14 00:38:34+08:00
[CATEGORIES]
cs.CL
Memorization-Compression Cycles Improve Generalization
[AUTHORS]
Fangyuan Yu
[ABSTRACT]
We prove theoretically that generalization improves not only through data
scaling but also by compressing internal representations. To operationalize
this insight, we introduce the Information Bottleneck Language Modeling (IBLM)
objective, which reframes language modeling as a constrained optimization
problem: minimizing representation entropy subject to optimal prediction
performance. Empirically, we observe an emergent memorization-compression cycle
during LLM pretraining, evidenced by oscillating positive/negative gradient
alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of
representation entropy. This pattern closely mirrors the predictive-compressive
trade-off prescribed by IBLM and also parallels the biological alternation
between awake learning and sleep consolidation. Motivated by this observation,
we propose Gated Phase Transition (GAPT), a training algorithm that adaptively
switches between memorization and compression phases. When applied to GPT-2
pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves
cross-entropy by 4.8%. GAPT improves OOD generalization by 35% in a pretraining
task on arithmetic multiplication. In a setting designed to simulate
catastrophic forgetting, GAPT reduces interference by compressing and
separating representations, achieving a 97% improvement in separation -
paralleling the functional role of sleep consolidation.
[COMMENTS]
12 pages, 6 figures
[LINK]
http://arxiv.org/abs/2505.08727v1
[DATE]
2025-05-14 00:37:54+08:00
[CATEGORIES]
cs.LG
cs.CL
Deep-SITAR: A SITAR-Based Deep Learning Framework for Growth Curve Modeling via Autoencoders
[AUTHORS]
María Alejandra Hernández, Oscar Rodriguez, Dae-Jin Lee
[ABSTRACT]
Several approaches have been developed to capture the complexity and
nonlinearity of human growth. One widely used approach is the Super Imposition by
Translation and Rotation (SITAR) model, which has become popular in studies of
adolescent growth. SITAR is a shape-invariant mixed-effects model that
represents the shared growth pattern of a population using a natural cubic
spline mean curve while incorporating three subject-specific random effects –
timing, size, and growth intensity – to account for variations among
individuals. In this work, we introduce a supervised deep learning framework
based on an autoencoder architecture that integrates a deep neural network
with a B-spline model to estimate the SITAR model. In this
approach, the encoder estimates the random effects for each individual, while
the decoder performs a fitting based on B-splines similar to the classic SITAR
model. We refer to this method as the Deep-SITAR model. This innovative
approach enables the prediction of the random effects of new individuals
entering a population without requiring a full model re-estimation. As a
result, Deep-SITAR offers a powerful approach to predicting growth
trajectories, combining the flexibility and efficiency of deep learning with
the interpretability of traditional mixed-effects models.
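For orientation, the SITAR model is commonly written in the form below. The symbols ($y_{it}$, $h$, $\alpha_i$, $\beta_i$, $\gamma_i$, $\varepsilon_{it}$) are assumed notation for this sketch and are not quoted from the abstract:

```latex
% y_{it}: measurement of subject i at age t
% h:      shared natural cubic spline mean curve
% alpha_i (size), beta_i (timing), gamma_i (growth intensity):
%         subject-specific random effects
\[
y_{it} = \alpha_i + h\bigl((t - \beta_i)\,e^{\gamma_i}\bigr) + \varepsilon_{it}
\]
```

In the autoencoder reading described above, the encoder would output estimates of $(\alpha_i, \beta_i, \gamma_i)$ and the decoder would evaluate the spline term.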
[COMMENTS]
Pre-print
[LINK]
http://arxiv.org/abs/2505.09506v1
[DATE]
2025-05-14 23:55:16+08:00
[CATEGORIES]
cs.LG
Layered Unlearning for Adversarial Relearning
[AUTHORS]
Timothy Qian, Vinith Suriyakumar, Ashia Wilson, Dylan Hadfield-Menell
[ABSTRACT]
Our goal is to understand how post-training methods, such as fine-tuning,
alignment, and unlearning, modify language model behavior and representations.
We are particularly interested in the brittle nature of these modifications
that makes them easy to bypass through prompt engineering or relearning. Recent
results suggest that post-training induces shallow context-dependent
“circuits” that suppress specific response patterns. This could be one
explanation for the brittleness of post-training. To test this hypothesis, we
design an unlearning algorithm, Layered Unlearning (LU), that creates distinct
inhibitory mechanisms for a growing subset of the data. By unlearning the first
$i$ folds while retaining the remaining $k - i$ at the $i$th of $k$ stages, LU
limits the ability of relearning on a subset of data to recover the full
dataset. We evaluate LU through a combination of synthetic and large language
model (LLM) experiments. We find that LU improves robustness to adversarial
relearning for several different unlearning methods. Our results contribute to
the state-of-the-art of machine unlearning and provide insight into the effect
of post-training updates.
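The fold schedule described above can be sketched as follows. This is a minimal illustration of the staging only, assuming zero-based fold indices; the function name `layered_unlearning_schedule` is hypothetical and the unlearning step itself is not shown:

```python
def layered_unlearning_schedule(k):
    """Return (forget, retain) fold indices for each stage i = 1..k.

    Hypothetical helper: at stage i the first i folds are unlearned
    and the remaining k - i folds are retained.
    """
    schedule = []
    for i in range(1, k + 1):
        forget = list(range(i))      # folds 0..i-1, i.e. the first i folds
        retain = list(range(i, k))   # the remaining k - i folds
        schedule.append((forget, retain))
    return schedule


if __name__ == "__main__":
    for stage, (forget, retain) in enumerate(layered_unlearning_schedule(4), start=1):
        print(f"stage {stage}: unlearn folds {forget}, retain folds {retain}")
```

Each stage enlarges the forget set by one fold, so every fold ends up unlearned against a different retain set.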
[COMMENTS]
37 pages, 8 figures
[LINK]
http://arxiv.org/abs/2505.09500v1
[DATE]
2025-05-14 23:50:45+08:00
[CATEGORIES]
cs.LG
Reinforcement Learning for Individual Optimal Policy from Heterogeneous Data
[AUTHORS]
Rui Miao, Babak Shahbaba, Annie Qu
[ABSTRACT]
Offline reinforcement learning (RL) aims to find optimal policies in dynamic
environments in order to maximize the expected total rewards by leveraging
pre-collected data. Learning from heterogeneous data is one of the fundamental
challenges in offline RL. Traditional methods focus on learning an optimal
policy for all individuals with pre-collected data from a single episode or
homogeneous batch episodes, and thus, may result in a suboptimal policy for a
heterogeneous population. In this paper, we propose an individualized offline
policy optimization framework for heterogeneous time-stationary Markov decision
processes (MDPs). The proposed heterogeneous model with individual latent
variables enables us to efficiently estimate the individual Q-functions, and
our Penalized Pessimistic Personalized Policy Learning (P4L) algorithm
guarantees a fast rate on the average regret under a weak partial coverage
assumption on behavior policies. In addition, our simulation studies and a real
data application demonstrate the superior numerical performance of the proposed
method compared with existing methods.
[LINK]
http://arxiv.org/abs/2505.09496v1
[DATE]
2025-05-14 23:44:10+08:00
[CATEGORIES]
cs.LG
Preserving Plasticity in Continual Learning with Adaptive Linearity Injection
[AUTHORS]
Seyed Roozbeh Razavi Rohani, Khashayar Khajavi, Wesley Chung, Mo Chen, Sharan Vaswani
[ABSTRACT]
Loss of plasticity in deep neural networks is the gradual reduction in a
model’s capacity to incrementally learn and has been identified as a key
obstacle to learning in non-stationary problem settings. Recent work has shown
that deep linear networks tend to be resilient towards loss of plasticity.
Motivated by this observation, we propose Adaptive Linearization (AdaLin), a
general approach that dynamically adapts each neuron’s activation function to
mitigate plasticity loss. Unlike prior methods that rely on regularization or
periodic resets, AdaLin equips every neuron with a learnable parameter and a
gating mechanism that injects linearity into the activation function based on
its gradient flow. This adaptive modulation ensures sufficient gradient signal
and sustains continual learning without introducing additional hyperparameters
or requiring explicit task boundaries. When used with conventional activation
functions like ReLU, Tanh, and GeLU, we demonstrate that AdaLin can
significantly improve performance on standard benchmarks, including Random
Label and Permuted MNIST, Random Label and Shuffled CIFAR-10, and Class-Split
CIFAR-100. Furthermore, its efficacy is shown in more complex scenarios, such
as class-incremental learning on CIFAR-100 with a ResNet-18 backbone, and in
mitigating plasticity loss in off-policy reinforcement learning agents. We
perform a systematic set of ablations that show that neuron-level adaptation is
crucial for good performance and analyze a number of metrics in the network
that might be correlated to loss of plasticity.
[COMMENTS]
Accepted in 4th Conference on Lifelong Learning Agents (CoLLAs), 2025
[LINK]
http://arxiv.org/abs/2505.09486v1
[DATE]
2025-05-14 23:36:51+08:00
[CATEGORIES]
cs.LG
Sensitivity-Constrained Fourier Neural Operators for Forward and Inverse Problems in Parametric Differential Equations
[AUTHORS]
Abdolmehdi Behroozi, Chaopeng Shen, Daniel Kifer
[ABSTRACT]
Parametric differential equations of the form du/dt = f(u, x, t, p) are
fundamental in science and engineering. While deep learning frameworks such as
the Fourier Neural Operator (FNO) can efficiently approximate solutions, they
struggle with inverse problems, sensitivity estimation (du/dp), and concept
drift. We address these limitations by introducing a sensitivity-based
regularization strategy, called Sensitivity-Constrained Fourier Neural
Operators (SC-FNO). SC-FNO achieves high accuracy in predicting solution paths
and consistently outperforms standard FNO and FNO with physics-informed
regularization. It improves performance in parameter inversion tasks, scales to
high-dimensional parameter spaces (tested with up to 82 parameters), and
reduces both data and training requirements. These gains are achieved with a
modest increase in training time (30% to 130% per epoch) and generalize across
various types of differential equations and neural operators. Code and selected
experiments are available at: https://github.com/AMBehroozi/SC_Neural_Operators
[LINK]
http://arxiv.org/abs/2505.08740v2
[DATE]
2025-05-14 23:24:15+08:00
[CATEGORIES]
cs.LG
Fairness-aware Bayes optimal functional classification
[AUTHORS]
Xiaoyu Hu, Gengyu Xue, Zhenhua Lin, Yi Yu
[ABSTRACT]
Algorithmic fairness has become a central topic in machine learning, and
mitigating disparities across different subpopulations has emerged as a rapidly
growing research area. In this paper, we systematically study the
classification of functional data under fairness constraints, ensuring the
disparity level of the classifier is controlled below a pre-specified
threshold. We propose a unified framework for fairness-aware functional
classification, tackling an infinite-dimensional functional space, addressing
key challenges from the absence of density ratios and intractability of
posterior probabilities, and discussing unique phenomena in functional
classification. We further design a post-processing algorithm, Fair Functional
Linear Discriminant Analysis classifier (Fair-FLDA), which targets
homoscedastic Gaussian processes and achieves fairness via group-wise
thresholding. Under weak structural assumptions on eigenspace, theoretical
guarantees on fairness and excess risk controls are established. As a
byproduct, our results cover the excess risk control of the standard FLDA as a
special case, which, to the best of our knowledge, is seen for the first time. Our
theoretical findings are complemented by extensive numerical experiments on
synthetic and real datasets, highlighting the practicality of our designed
algorithm.
[LINK]
http://arxiv.org/abs/2505.09471v1
[DATE]
2025-05-14 23:22:09+08:00
[CATEGORIES]
cs.LG
Accelerating Multiscale Modeling with Hybrid Solvers: Coupling FEM and Neural Operators with Domain Decomposition
[AUTHORS]
Wei Wang, Maryam Hakimzadeh, Haihui Ruan, Somdatta Goswami
[ABSTRACT]
Numerical solvers for PDEs face challenges in balancing computational cost
and accuracy, particularly for multiscale and dynamical systems. Neural
operators (NOs) can significantly speed up simulations; however, they face
challenges such as error accumulation for dynamical systems and limited
generalization in multiphysics problems. This work introduces a novel hybrid
framework that integrates PI-NO with finite element method (FE) through domain
decomposition and leverages numerical analysis for time marching. The core
innovation lies in efficiently coupling FE and NO subdomains via a Schwarz
alternating method: regions with complex, nonlinear, or high-gradient behavior
are resolved using a pretrained NO, while the remainder is handled by
conventional FE. To address the challenges of dynamic systems, we embed a
time-stepping scheme directly into the NO architecture, substantially reducing
long-term error propagation. Also, an adaptive subdomain evolution strategy
enables the ML-resolved region to expand dynamically, capturing emerging fine
scale features without remeshing. The framework’s efficacy has been validated
across a range of problems, spanning static, quasi-static, and dynamic regimes
(e.g., linear elasticity, hyperelasticity, and elastodynamics), demonstrating
accelerated convergence (up to 20% improvement in convergence compared to
conventional FE coupling) while preserving solution fidelity with error margins
consistently below 1%. Our study shows that our hybrid solver: (1) maintains
solution continuity across subdomain interfaces, (2) reduces computational
costs by eliminating fine mesh requirements, (3) mitigates error accumulation
in time dependent simulations, and (4) enables automatic adaptation to evolving
physical phenomena. This work bridges the gap between numerical methods and
AI-driven surrogates, offering a scalable pathway for high-fidelity multiscale
simulations.
[LINK]
http://arxiv.org/abs/2504.11383v3
[DATE]
2025-05-14 23:14:24+08:00
[CATEGORIES]
cs.LG
Variational Rank Reduction Autoencoder
[AUTHORS]
Jad Mounayer, Alicia Tierz, Jerome Tomezyk, Chady Ghnatios, Francisco Chinesta
[ABSTRACT]
Deterministic Rank Reduction Autoencoders (RRAEs) enforce by construction a
regularization on the latent space by applying a truncated SVD. While this
regularization makes Autoencoders more powerful, using them for generative
purposes is counter-intuitive due to their deterministic nature. On the other
hand, Variational Autoencoders (VAEs) are well known for their generative
abilities by learning a probabilistic latent space. In this paper, we present
Variational Rank Reduction Autoencoders (VRRAEs), a model that leverages the
advantages of both RRAEs and VAEs. Our claims and results show that when
carefully sampling the latent space of RRAEs and further regularizing with the
Kullback-Leibler (KL) divergence (similarly to VAEs), VRRAEs outperform RRAEs
and VAEs. Additionally, we show that the regularization induced by the SVD not
only makes VRRAEs better generators than VAEs, but also reduces the possibility
of posterior collapse. Our results include a synthetic dataset of a small size
that showcases the robustness of VRRAEs against collapse, and three real-world
datasets (MNIST, CelebA, and CIFAR-10), over which VRRAEs are shown to
outperform both VAEs and RRAEs on many random generation and interpolation
tasks based on the FID score.
[LINK]
http://arxiv.org/abs/2505.09458v1
[DATE]
2025-05-14 23:08:28+08:00
[CATEGORIES]
cs.LG
Quantum state-agnostic work extraction (almost) without dissipation
[AUTHORS]
Josep Lumbreras, Ruo Cheng Huang, Yanglin Hu, Mile Gu, Marco Tomamichel
[ABSTRACT]
We investigate work extraction protocols designed to transfer the maximum
possible energy to a battery using sequential access to $N$ copies of an
unknown pure qubit state. The core challenge is designing interactions to
optimally balance two competing goals: charging of the battery optimally using
the qubit in hand, and acquiring more information by qubit to improve energy
harvesting in subsequent rounds. Here, we leverage the exploration-exploitation
trade-off in reinforcement learning to develop adaptive strategies achieving
energy dissipation that scales only poly-logarithmically in $N$. This
represents an exponential improvement over current protocols based on full
state tomography.
[COMMENTS]
5 pages+14 pages, 2 figures
[LINK]
http://arxiv.org/abs/2505.09456v1
[DATE]
2025-05-14 23:07:58+08:00
[CATEGORIES]
cs.LG
MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification
[AUTHORS]
Anh-Tien Nguyen, Duy Minh Ho Nguyen, Nghiem Tuong Diep, Trung Quoc Nguyen, Nhat Ho, Jacqueline Michelle Metsch, Miriam Cindy Maurer, Daniel Sonntag, Hanibal Bohnenberger, Anne-Christin Hauschild
[ABSTRACT]
Whole slide pathology image classification presents challenges due to
gigapixel image sizes and limited annotation labels, hindering model
generalization. This paper introduces a prompt learning method to adapt large
vision-language models for few-shot pathology classification. We first extend
the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology
image tiles, into a vision-language model by adding adaptors and aligning it
with medical text encoders via contrastive learning on 923K image-text pairs.
The model is then used to extract visual features and text embeddings from
few-shot annotations and is fine-tuned with learnable prompt embeddings. Unlike
prior methods that combine prompts with frozen features using prefix embeddings
or self-attention, we propose multi-granular attention that compares
interactions between learnable prompts with individual image patches and groups
of them. This approach improves the model’s ability to capture both
fine-grained details and broader context, enhancing its recognition of complex
patterns across sub-regions. To further improve accuracy, we leverage
(unbalanced) optimal transport-based visual-text distance to secure model
robustness by mitigating perturbations that might occur during the data
augmentation process. Empirical experiments on lung, kidney, and breast
pathology modalities validate the effectiveness of our approach; thereby, we
surpass several of the latest competitors and consistently improve performance
across diverse architectures, including CLIP, PLIP, and Prov-GigaPath
integrated PLIP. We release our implementations and pre-trained models as
MGPATH.
[LINK]
http://arxiv.org/abs/2502.07409v3
[DATE]
2025-05-14 22:57:00+08:00
[CATEGORIES]
cs.LG
Analog Foundation Models
[AUTHORS]
Julian Büchel, Iason Chalas, Giovanni Acampa, An Chen, Omobayode Fagbohungbe, Sidney Tsai, Kaoutar El Maghraoui, Manuel Le Gallo, Abbas Rahimi, Abu Sebastian
[ABSTRACT]
Analog in-memory computing (AIMC) is a promising compute paradigm to improve
speed and power efficiency of neural network inference beyond the limits of
conventional von Neumann-based architectures. However, AIMC introduces
fundamental challenges such as noisy computations and strict constraints on
input and output quantization. Because of these constraints and imprecisions,
off-the-shelf LLMs are not able to achieve 4-bit-level performance when
deployed on AIMC-based hardware. While researchers previously investigated
recovering this accuracy gap on small, mostly vision-based models, a generic
method applicable to LLMs pre-trained on trillions of tokens does not yet
exist. In this work, we introduce a general and scalable method to robustly
adapt LLMs for execution on noisy, low-precision analog hardware. Our approach
enables state-of-the-art models – including
Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct – to retain
performance comparable to 4-bit weight, 8-bit activation baselines, despite the
presence of analog noise and quantization constraints. Additionally, we show
that as a byproduct of our training methodology, analog foundation models can
be quantized for inference on low-precision digital hardware. Finally, we show
that our models also benefit from test-time compute scaling, showing better
scaling behavior than models trained with 4-bit weight and 8-bit static input
quantization. Our work bridges the gap between high-capacity LLMs and efficient
analog hardware, offering a path toward energy-efficient foundation models.
Code is available at https://github.com/IBM/analog-foundation-models .
[COMMENTS]
43 pages, 8 figures, under review
[LINK]
http://arxiv.org/abs/2505.09663v1
[DATE]
2025-05-14 22:52:22+08:00
[CATEGORIES]
cs.LG
Time Can Invalidate Algorithmic Recourse
[AUTHORS]
Giovanni De Toni, Stefano Teso, Bruno Lepri, Andrea Passerini
[ABSTRACT]
Algorithmic Recourse (AR) aims to provide users with actionable steps to
overturn unfavourable decisions made by machine learning predictors. However,
these actions often take time to implement (e.g., getting a degree can take
years), and their effects may vary as the world evolves. Thus, it is natural to
ask for recourse that remains valid in a dynamic environment. In this paper, we
study the robustness of algorithmic recourse over time by casting the problem
through the lens of causality. We demonstrate theoretically and empirically
that (even robust) causal AR methods can fail over time, except in the –
unlikely – case that the world is stationary. Even more critically, unless the
world is fully deterministic, counterfactual AR cannot be solved optimally. To
account for this, we propose a simple yet effective algorithm for temporal AR
that explicitly accounts for time under the assumption of having access to an
estimator approximating the stochastic process. Our simulations on synthetic
and realistic datasets show how considering time produces more resilient
solutions to potential trends in the data distribution.
[COMMENTS]
This is a preprint of a paper accepted at FAccT 2025. The content is
identical to the published version, apart from minor cosmetic changes
[LINK]
http://arxiv.org/abs/2410.08007v3
[DATE]
2025-05-14 22:50:15+08:00
[CATEGORIES]
cs.LG
Considerations in the use of ML interaction potentials for free energy calculations
[AUTHORS]
Orlando A. Mendible, Jonathan K. Whitmer, Yamil J. Colón
[ABSTRACT]
Machine learning force fields (MLFFs) promise to accurately describe the
potential energy surface of molecules at the ab initio level of theory with
improved computational efficiency. Within MLFFs, equivariant graph neural
networks (EQNNs) have shown great promise in accuracy and performance and are
the focus of this work. The capability of EQNNs to recover free energy surfaces
(FES) remains to be thoroughly investigated. In this work, we investigate the
impact of collective variables (CVs) distribution within the training data on
the accuracy of EQNNs predicting the FES of butane and alanine dipeptide (ADP).
A generalizable workflow is presented in which training configurations are
generated with classical molecular dynamics simulations, and energies and
forces are obtained with ab initio calculations. We evaluate how bond and angle
constraints in the training data influence the accuracy of EQNN force fields in
reproducing the FES of the molecules at both classical and ab initio levels of
theory. Results indicate that the model’s accuracy is unaffected by the
distribution of sampled CVs during training, given that the training data
includes configurations from characteristic regions of the system’s FES.
However, when the training data is obtained from classical simulations, the
EQNN struggles to extrapolate the free energy for configurations with high free
energy. In contrast, models trained with the same configurations on ab initio
data show improved extrapolation accuracy. The findings underscore the
difficulties in creating a comprehensive training dataset for EQNNs to predict
FESs and highlight the importance of prior knowledge of the system’s FES.
[LINK]
http://arxiv.org/abs/2403.13952v3
[DATE]
2025-05-14 22:50:01+08:00
[CATEGORIES]
cs.LG
Train a Multi-Task Diffusion Policy on RLBench-18 in One Day with One GPU
[AUTHORS]
Yutong Hu, Pinhao Song, Kehan Wen, Renaud Detry
[ABSTRACT]
We present a method for training multi-task vision-language robotic diffusion
policies that reduces training time and memory usage by an order of magnitude.
This improvement arises from a previously underexplored distinction between
action diffusion and the image diffusion techniques that inspired it: image
generation targets are high-dimensional, while robot actions lie in a much
lower-dimensional space. Meanwhile, the vision-language conditions for action
generation remain high-dimensional. Our approach, Mini-Diffuser, exploits this
asymmetry by introducing Level-2 minibatching, which pairs multiple noised
action samples with each vision-language condition, instead of the conventional
one-to-one sampling strategy. To support this batching scheme, we introduce
architectural adaptations to the diffusion transformer that prevent information
leakage across samples while maintaining full conditioning access. In RLBench
simulations, Mini-Diffuser achieves 95% of the performance of state-of-the-art
multi-task diffusion policies, while using only 5% of the training time and
7% of the memory. Real-world experiments further validate that Mini-Diffuser
preserves the key strengths of diffusion-based policies, including the ability
to model multimodal action distributions and produce behavior conditioned on
diverse perceptual inputs. Code available at
github.com/utomm/mini-diffuse-actor.
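The pairing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the simplified noising rule, and the toy data are assumptions made here for illustration. It only shows how each vision-language condition can be reused across several independently noised copies of its action, so the expensive condition is handled once per group rather than once per sample.

```python
import random


def noised_action(action, noise_level, rng):
    """Mix a clean action with Gaussian noise (a simplified forward-diffusion step)."""
    keep = (1.0 - noise_level) ** 0.5
    mix = noise_level ** 0.5
    return [keep * a + mix * rng.gauss(0.0, 1.0) for a in action]


def level2_minibatch(conditions, actions, samples_per_condition, rng):
    """Pair every condition with several independently noised copies of its action.

    Each entry stores only the index of its condition, so a condition would be
    encoded once and shared by all of its noised action samples.
    """
    batch = []
    for cond_index, (_condition, action) in enumerate(zip(conditions, actions)):
        for _ in range(samples_per_condition):
            level = rng.random()  # hypothetical noise level in [0, 1)
            batch.append((cond_index, level, noised_action(action, level, rng)))
    return batch


# Toy data: two conditions (placeholders for image and language features)
# and their low-dimensional target actions.
conditions = ["condition_0", "condition_1"]
actions = [[0.1, 0.2], [0.3, 0.4]]
batch = level2_minibatch(conditions, actions, samples_per_condition=4, rng=random.Random(0))
```

With two conditions and four samples per condition, the batch holds eight training pairs while only two conditions need to be encoded.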
[LINK]
http://arxiv.org/abs/2505.09430v1
[DATE]
2025-05-14 22:34:40+08:00
[CATEGORIES]
cs.LG
Pushing the Limits of the Reactive Affine Shaker Algorithm to Higher Dimensions
[AUTHORS]
Roberto Battiti, Mauro Brunato
[ABSTRACT]
Bayesian Optimization (BO) for the minimization of expensive functions of
continuous variables uses all the knowledge acquired from previous samples
(${\boldsymbol x}_i$ and $f({\boldsymbol x}_i)$ values) to build a surrogate
model based on Gaussian processes. The surrogate is then exploited to define
the next point to sample, through a careful balance of exploration and
exploitation. Initially intended for low-dimensional spaces, BO has recently
been modified and used also for very large-dimensional spaces (up to about one
thousand dimensions).
In this paper we consider a much simpler algorithm, called “Reactive Affine
Shaker” (RAS). The next sample is always generated with a uniform probability
distribution inside a parallelepiped (the “box”). At each iteration, the form
of the box is adapted during the search through an affine transformation, based
only on the point $\boldsymbol x$ position and on the success or failure in
improving the function. The function values are therefore not used directly to
modify the search area and to generate the next sample. The entire
dimensionality is kept (no active subspaces).
Despite its extreme simplicity and its use of only stochastic local search,
the results are, surprisingly, comparable to the state-of-the-art results of
high-dimensional versions of BO, although they require somewhat more
function evaluations.
An ablation study and an analysis of probability distribution of directions
(improving steps and prevailing box orientation) in very large-dimensional
spaces are conducted to understand more about the behavior of RAS and to assess
the relative importance of the algorithmic building blocks for the final
results.
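The search loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the expansion factor `rho`, the mirrored second attempt, the iteration budget, and the sphere test function are choices made here. The sketch keeps the properties named in the abstract: samples are uniform inside a parallelepiped, and function values are used only to decide success or failure.

```python
import random


def reactive_affine_shaker(f, x0, half_width=1.0, rho=2.0, iterations=2000, seed=0):
    """Minimize f by uniform sampling in a box that is reshaped after each step."""
    rng = random.Random(seed)
    n = len(x0)
    # The box is spanned by n basis vectors (rows); it starts as an axis-aligned cube.
    basis = [[half_width if i == j else 0.0 for j in range(n)] for i in range(n)]
    x, fx = list(x0), f(x0)
    for _ in range(iterations):
        # Uniform sample inside the parallelepiped spanned by the basis vectors.
        coeffs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        delta = [sum(coeffs[i] * basis[i][j] for i in range(n)) for j in range(n)]
        norm2 = sum(d * d for d in delta)
        if norm2 == 0.0:
            continue
        improved = False
        for sign in (1.0, -1.0):  # try the step, then its mirror image
            candidate = [xi + sign * d for xi, d in zip(x, delta)]
            fc = f(candidate)
            if fc < fx:  # only the success/failure outcome is used
                x, fx, improved = candidate, fc, True
                break
        # Affine update: stretch the box along delta after a success,
        # shrink it along delta after a failure.
        factor = rho if improved else 1.0 / rho
        for b in basis:
            proj = sum(bi * d for bi, d in zip(b, delta)) / norm2
            for j in range(n):
                b[j] += (factor - 1.0) * proj * delta[j]
    return x, fx


def sphere(x):
    """Toy objective used only for this example."""
    return sum(v * v for v in x)


best_x, best_f = reactive_affine_shaker(sphere, [3.0, -2.0])
```

Because a candidate is accepted only when it improves the function value, the returned value can never be worse than the starting point.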
[COMMENTS]
Accepted at: the 19th Learning and Intelligent Optimization
Conference (LION19), June 15-19 2025, Prague, Czech Republic
(https://lion19.org/)
[LINK]
http://arxiv.org/abs/2502.12877v2
[DATE]
2025-05-14 22:31:05+08:00
[CATEGORIES]
cs.LG
CAT Merging: A Training-Free Approach for Resolving Conflicts in Model Merging
[AUTHORS]
Wenju Sun, Qingyong Li, Yangli-ao Geng, Boyang Li
[ABSTRACT]
Multi-task model merging offers a promising paradigm for integrating multiple
expert models into a unified model without additional training. Existing
state-of-the-art techniques, such as Task Arithmetic and its variants, merge
models by accumulating task vectors – the parameter differences between
pretrained and finetuned models. However, task vector accumulation is often
hindered by knowledge conflicts, leading to performance degradation. To address
this challenge, we propose Conflict-Aware Task Merging (CAT Merging), a novel
training-free framework that selectively trims conflict-prone components from
the task vectors. CAT Merging introduces several parameter-specific strategies,
including projection for linear weights and masking for scaling and shifting
parameters in normalization layers. Extensive experiments on vision, language,
and vision-language tasks demonstrate that CAT Merging effectively suppresses
knowledge conflicts, achieving average accuracy improvements of up to 2.5%
(ViT-B/32) and 2.0% (ViT-L/14) over state-of-the-art methods.
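For context, the baseline Task Arithmetic step mentioned above (not CAT Merging itself) can be sketched as below. Parameters are modeled as plain dictionaries of float lists, and the names `task_vector`, `merge`, and `scale` are hypothetical choices made for this illustration.

```python
def task_vector(pretrained, finetuned):
    """Parameter difference between a finetuned model and the pretrained model."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def merge(pretrained, task_vectors, scale=1.0):
    """Add the scaled sum of all task vectors to the pretrained parameters."""
    merged = {}
    for name, base in pretrained.items():
        total = [sum(tv[name][i] for tv in task_vectors) for i in range(len(base))]
        merged[name] = [b + scale * t for b, t in zip(base, total)]
    return merged


# Toy example with a single two-element weight and two finetuned experts.
pretrained = {"w": [1.0, 2.0]}
model_a = {"w": [1.5, 2.0]}
model_b = {"w": [1.0, 1.0]}
vectors = [task_vector(pretrained, m) for m in (model_a, model_b)]
merged = merge(pretrained, vectors, scale=0.5)
```

Plain accumulation of this kind is where the knowledge conflicts described in the abstract can arise, since every task vector is added in full regardless of how it interacts with the others.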
[LINK]
http://arxiv.org/abs/2505.06977v2
[DATE]
2025-05-14 22:11:52+08:00
[CATEGORIES]
cs.LG
Data-driven multiscale modeling for correcting dynamical systems
[AUTHORS]
Karl Otness, Laure Zanna, Joan Bruna
[ABSTRACT]
We propose a multiscale approach for predicting quantities in dynamical
systems which is explicitly structured to extract information in both
fine-to-coarse and coarse-to-fine directions. We envision this method being
generally applicable to problems with significant self-similarity or in which
the prediction task is challenging and where stability of a learned model’s
impact on the target dynamical system is important. We evaluate our approach on
a climate subgrid parameterization task in which our multiscale networks
correct chaotic underlying models to reflect the contributions of unresolved,
fine-scale dynamics.
[COMMENTS]
Extended with additional experiments
[LINK]
http://arxiv.org/abs/2303.17496v2
[DATE]
2025-05-14 22:04:18+08:00
[CATEGORIES]
cs.LG
Quantum-Enhanced Parameter-Efficient Learning for Typhoon Trajectory Forecasting
[AUTHORS]
Chen-Yu Liu, Kuan-Cheng Chen, Yi-Chien Chen, Samuel Yen-Chi Chen, Wei-Hao Huang, Wei-Jia Huang, Yen-Jui Chang
[ABSTRACT]
Typhoon trajectory forecasting is essential for disaster preparedness but
remains computationally demanding due to the complexity of atmospheric dynamics
and the resource requirements of deep learning models. Quantum-Train (QT) is a
hybrid quantum-classical framework that leverages quantum neural networks
(QNNs) to generate trainable parameters exclusively during training,
eliminating the need for quantum hardware at inference time. Building on QT’s
success across multiple domains, including image classification, reinforcement
learning, flood prediction, and large language model (LLM) fine-tuning, we
introduce Quantum Parameter Adaptation (QPA) for efficient typhoon forecasting
model learning. Integrated with an Attention-based Multi-ConvGRU model, QPA
enables parameter-efficient training while maintaining predictive accuracy.
This work represents the first application of quantum machine learning (QML) to
large-scale typhoon trajectory prediction, offering a scalable and
energy-efficient approach to climate modeling. Our results demonstrate that QPA
significantly reduces the number of trainable parameters while preserving
performance, making high-performance forecasting more accessible and
sustainable through hybrid quantum-classical learning.
[LINK]
http://arxiv.org/abs/2505.09395v1
[DATE]
2025-05-14 21:50:44+08:00
[CATEGORIES]
cs.LG
Examining Deployment and Refinement of the VIOLA-AI Intracranial Hemorrhage Model Using an Interactive NeoMedSys Platform
[AUTHORS]
Qinghui Liu, Jon Nesvold, Hanna Raaum, Elakkyen Murugesu, Martin Røvang, Bradley J Maclntosh, Atle Bjørnerud, Karoline Skogen
[ABSTRACT]
Background: There are many challenges and opportunities in the clinical
deployment of AI tools in radiology. The current study describes a radiology
software platform called NeoMedSys that can enable efficient deployment and
refinements of AI models. We evaluated the feasibility and effectiveness of
running NeoMedSys for three months in real-world clinical settings and focused
on improving the performance of an in-house developed AI model (VIOLA-AI)
designed for intracranial hemorrhage (ICH) detection.
Methods: NeoMedSys integrates tools for deploying, testing, and optimizing AI
models with a web-based medical image viewer, annotation system, and
hospital-wide radiology information systems. A pragmatic investigation was
deployed using clinical cases of patients presenting to the largest Emergency
Department in Norway (site-1) with suspected traumatic brain injury (TBI) or
patients with suspected stroke (site-2). We assessed ICH classification
performance as VIOLA-AI encountered new data and underwent pre-planned model
retraining. Performance metrics included sensitivity, specificity, accuracy,
and the area under the receiver operating characteristic curve (AUC).
Results: NeoMedSys facilitated iterative improvements in the AI model,
significantly enhancing its diagnostic accuracy. Automated bleed detection and
segmentation were reviewed in near real-time to facilitate re-training
VIOLA-AI. The iterative refinement process yielded a marked improvement in
classification sensitivity, rising to 90.3% (from 79.2%), and specificity that
reached 89.3% (from 80.7%). The bleed detection ROC analysis for the entire
sample demonstrated a high area-under-the-curve (AUC) of 0.949 (from 0.873).
Model refinement stages were associated with notable gains, highlighting the
value of real-time radiologist feedback.
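As a reminder of how the reported classification metrics are defined, here is a small sketch that computes sensitivity, specificity, and accuracy from binary labels. The labels and predictions are made-up toy values, not data from the study.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for binary labels (1 = bleed present)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity_specificity_accuracy(labels, predictions):
    """Return (sensitivity, specificity, accuracy) for binary predictions."""
    tp, tn, fp, fn = confusion_counts(labels, predictions)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return sensitivity, specificity, accuracy


# Toy values for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = sensitivity_specificity_accuracy(labels, predictions)
```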
[COMMENTS]
19 pages, 11 figures, on submission to BMC Methods
[LINK]
http://arxiv.org/abs/2505.09380v1
[DATE]
2025-05-14 21:33:38+08:00
[CATEGORIES]
cs.LG
Gradient Attention Map Based Verification of Deep Convolutional Neural Networks with Application to X-ray Image Datasets
[AUTHORS]
Omid Halimi Milani, Amanda Nikho, Lauren Mills, Marouane Tliba, Ahmet Enis Cetin, Mohammed H. Elnagar
[ABSTRACT]
Deep learning models have great potential in medical imaging, including
orthodontics and skeletal maturity assessment. However, applying a model to
data different from its training set can lead to unreliable predictions that
may impact patient care. To address this, we propose a comprehensive
verification framework that evaluates model suitability through multiple
complementary strategies. First, we introduce a Gradient Attention Map
(GAM)-based approach that analyzes attention patterns using Grad-CAM and
compares them via similarity metrics such as IoU, Dice Similarity, SSIM, Cosine
Similarity, Pearson Correlation, KL Divergence, and Wasserstein Distance.
Second, we extend verification to early convolutional feature maps, capturing
structural misalignments missed by attention alone. Finally, we incorporate an
additional garbage class into the classification model to explicitly reject
out-of-distribution inputs. Experimental results demonstrate that these
combined methods effectively identify unsuitable models and inputs, promoting
safer and more reliable deployment of deep learning in medical imaging.
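Three of the similarity metrics listed above can be sketched for two attention maps as follows. This is an illustrative sketch only: the 0.5 binarization threshold, the function names, and the toy 2x2 maps are assumptions, and the paper's actual Grad-CAM pipeline is not reproduced here.

```python
import math


def binarize(attention_map, threshold=0.5):
    """Turn a map of attention scores into a flat 0/1 mask."""
    return [1 if value >= threshold else 0 for row in attention_map for value in row]


def iou_and_dice(map_a, map_b, threshold=0.5):
    """Intersection over Union and Dice similarity of the two binarized maps."""
    a, b = binarize(map_a, threshold), binarize(map_b, threshold)
    intersection = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / total if total else 1.0
    return iou, dice


def cosine_similarity(map_a, map_b):
    """Cosine similarity of the raw (non-binarized) attention scores."""
    a = [value for row in map_a for value in row]
    b = [value for row in map_b for value in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy attention maps: one from in-distribution data, one from a new input.
reference = [[0.9, 0.8], [0.1, 0.0]]
candidate = [[0.7, 0.2], [0.6, 0.0]]
iou, dice = iou_and_dice(reference, candidate)
cosine = cosine_similarity(reference, candidate)
```

Low overlap between the two maps would suggest that the model attends to different regions on the new input than on its training data.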
[COMMENTS]
13 pages, 7 figures, accepted at IEEE VLSI Test Symposium (VTS) 2025
[LINK]
http://arxiv.org/abs/2504.21227v2
[DATE]
2025-05-14 21:30:48+08:00
[CATEGORIES]
cs.LG
Think Smart, Act SMARL! Analyzing Probabilistic Logic Shields for Multi-Agent Reinforcement Learning
[AUTHORS]
Satchit Chatterji, Erman Acar
[ABSTRACT]
Safe reinforcement learning (RL) is crucial for real-world applications, and
multi-agent interactions introduce additional safety challenges. While
Probabilistic Logic Shields (PLS) have been a powerful proposal to enforce
safety in single-agent RL, their generalizability to multi-agent settings
remains unexplored. In this paper, we address this gap by conducting extensive
analyses of PLS within decentralized, multi-agent environments, and in doing
so, propose Shielded Multi-Agent Reinforcement Learning (SMARL) as a general
framework for steering MARL towards norm-compliant outcomes. Our key
contributions are: (1) a novel Probabilistic Logic Temporal Difference (PLTD)
update for shielded, independent Q-learning, which incorporates probabilistic
constraints directly into the value update process; (2) a probabilistic logic
policy gradient method for shielded PPO with formal safety guarantees for MARL;
and (3) a comprehensive evaluation across symmetrically and asymmetrically shielded
$n$-player game-theoretic benchmarks, demonstrating fewer constraint violations
and significantly better cooperation under normative constraints. These results
position SMARL as an effective mechanism for equilibrium selection, paving the
way toward safer, socially aligned multi-agent systems.
[COMMENTS]
21 pages, 16 figures, Earlier title: “Analyzing Probabilistic Logic
Driven Safety in Multi-Agent Reinforcement Learning” (changed for specificity
and clarity)
[LINK]
http://arxiv.org/abs/2411.04867v2
[DATE]
2025-05-14 21:30:31+08:00
[CATEGORIES]
cs.LG
TensorRL-QAS: Reinforcement learning with tensor networks for scalable quantum architecture search
[AUTHORS]
Akash Kundu, Stefano Mangini
[ABSTRACT]
Variational quantum algorithms hold the promise to address meaningful quantum
problems already on noisy intermediate-scale quantum hardware, but they face
the challenge of designing quantum circuits that both solve the target problem
and comply with device limitations. Quantum architecture search (QAS) automates
this design process, with reinforcement learning (RL) emerging as a promising
approach. Yet, RL-based QAS methods encounter significant scalability issues,
as computational and training costs grow rapidly with the number of qubits,
circuit depth, and noise, severely impacting performance. To address these
challenges, we introduce TensorRL-QAS, a scalable framework that
combines tensor network (TN) methods with RL for designing quantum circuits. By
warm-starting the architecture search with a matrix product state approximation
of the target solution, TensorRL-QAS effectively narrows the search space to
physically meaningful circuits, accelerating convergence to the desired
solution. Tested on several quantum chemistry problems of up to 12 qubits,
TensorRL-QAS achieves up to a 10-fold reduction in CNOT count and circuit depth
compared to baseline methods, while maintaining or surpassing chemical
accuracy. It reduces function evaluations by up to 100-fold, accelerates
training episodes by up to 98%, and achieves up to 50% success
probability for 10-qubit systems, far exceeding the <1% rates of baseline
approaches. Robustness and versatility are demonstrated in both the noiseless
and noisy scenarios, where we report simulations of up to 8 qubits. These
advancements establish TensorRL-QAS as a promising candidate for a scalable and
efficient quantum circuit discovery protocol on near-term quantum hardware.
[COMMENTS]
The code will be available soon! Comments are welcomed!
[LINK]
http://arxiv.org/abs/2505.09371v1
[DATE]
2025-05-14 21:23:34+08:00
[CATEGORIES]
cs.LG
RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo
[AUTHORS]
Jenny Schmalfuss, Victor Oei, Lukas Mehl, Madlen Bartsch, Shashank Agnihotri, Margret Keuper, Andrés Bruhn
[ABSTRACT]
Standard benchmarks for optical flow, scene flow, and stereo vision
algorithms generally focus on model accuracy rather than robustness to image
corruptions like noise or rain. Hence, the resilience of models to such
real-world perturbations is largely unquantified. To address this, we present
RobustSpring, a comprehensive dataset and benchmark for evaluating robustness
to image corruptions for optical flow, scene flow, and stereo models.
RobustSpring applies 20 different image corruptions, including noise, blur,
color changes, quality degradations, and weather distortions, in a time-,
stereo-, and depth-consistent manner to the high-resolution Spring dataset,
creating a suite of 20,000 corrupted images that reflect challenging
conditions. RobustSpring enables comparisons of model robustness via a new
corruption robustness metric. Integration with the Spring benchmark enables
public two-axis evaluations of both accuracy and robustness. We benchmark a
curated selection of initial models, observing that accurate models are not
necessarily robust and that robustness varies widely by corruption type.
RobustSpring is a new computer vision benchmark that treats robustness as a
first-class citizen to foster models that combine accuracy with resilience. It
will be available at https://spring-benchmark.org.
[LINK]
http://arxiv.org/abs/2505.09368v1
[DATE]
2025-05-14 21:21:34+08:00
[CATEGORIES]
cs.LG
ARCANE – Early Detection of Interplanetary Coronal Mass Ejections
[AUTHORS]
H. T. Rüdisser, G. Nguyen, J. Le Louëdec, C. Möstl
[ABSTRACT]
Interplanetary coronal mass ejections (ICMEs) are major drivers of space
weather disturbances, posing risks to both technological infrastructure and
human activities. Automatic detection of ICMEs in solar wind in situ data is
essential for early warning systems. While several methods have been proposed
to identify these structures in time series data, robust real-time detection
remains a significant challenge. In this work, we present ARCANE, the first
framework explicitly designed for early ICME detection in streaming solar wind
data under realistic operational constraints, enabling event identification
without requiring observation of the full structure. Our approach evaluates the
strengths and limitations of detection models by comparing a machine
learning-based method to a threshold-based baseline. The ResUNet++ model,
previously validated on science data, significantly outperforms the baseline,
particularly in detecting high-impact events, while retaining solid performance
on lower-impact cases. Notably, we find that using real-time solar wind (RTSW)
data instead of high-resolution science data leads to only minimal performance
degradation. Despite the challenges of operational settings, our detection
pipeline achieves an F1 score of 0.53, with an average detection delay of 21.5%
of the event’s duration while only seeing a minimal amount of data. As more
data becomes available, the performance increases significantly. These results
mark a substantial step forward in automated space weather monitoring and lay
the groundwork for enhanced real-time forecasting capabilities.
[COMMENTS]
25 pages, 9 figures, 1 table, submitted to AGU Space Weather on 14th
May 2025
[LINK]
http://arxiv.org/abs/2505.09365v1
[DATE]
2025-05-14 21:17:45+08:00
[CATEGORIES]
cs.LG
Diffusion Recommender Models and the Illusion of Progress: A Concerning Study of Reproducibility and a Conceptual Mismatch
[AUTHORS]
Michael Benigni, Maurizio Ferrari Dacrema, Dietmar Jannach
[ABSTRACT]
Countless new machine learning models are published every year and are
reported to significantly advance the state-of-the-art in top-n
recommendation. However, earlier reproducibility studies indicate that progress
in this area may be quite limited. Specifically, various widespread
methodological issues, e.g., comparisons with untuned baseline models, have led
to an illusion of progress. In this work, our goal is to examine whether
these problems persist in today’s research. To this end, we aim to reproduce
the latest advancements reported from applying modern Denoising Diffusion
Probabilistic Models to recommender systems, focusing on four models published
at the top-ranked SIGIR conference in 2023 and 2024. Our findings are
concerning, revealing persistent methodological problems. Alarmingly, through
experiments, we find that the latest recommendation techniques based on
diffusion models, despite their computational complexity and substantial carbon
footprint, are consistently outperformed by simpler existing models.
Furthermore, we identify key mismatches between the characteristics of
diffusion models and those of the traditional top-n recommendation task,
raising doubts about their suitability for recommendation. We also note that,
in the papers we analyze, the generative capabilities of these models are
constrained to a minimum. Overall, our results and continued methodological
issues call for greater scientific rigor and a disruptive change in the
research and publication culture in this area.
[LINK]
http://arxiv.org/abs/2505.09364v1
[DATE]
2025-05-14 21:13:53+08:00
[CATEGORIES]
cs.LG
Full-waveform earthquake source inversion using simulation-based inference
[AUTHORS]
A. A. Saoulis, D. Piras, A. Spurio Mancini, B. Joachimi, A. M. G. Ferreira
[ABSTRACT]
This paper presents a novel framework for full-waveform seismic source
inversion using simulation-based inference (SBI). Traditional probabilistic
approaches often rely on simplifying assumptions about data errors, which we
show can lead to inaccurate uncertainty quantification. SBI addresses this
limitation by building an empirical probabilistic model of the data errors
using machine learning models, known as neural density estimators, which can
then be integrated into the Bayesian inference framework. We apply the SBI
framework to point-source moment tensor inversions as well as joint moment
tensor and time-location inversions. We construct a range of synthetic examples
to explore the quality of the SBI solutions, as well as to compare the SBI
results with standard Gaussian likelihood-based Bayesian inversions. We then
demonstrate that under real seismic noise, common Gaussian likelihood
assumptions for treating full-waveform data yield overconfident posterior
distributions that underestimate the moment tensor component uncertainties by
up to a factor of 3. We contrast this with SBI, which produces well-calibrated
posteriors that generally agree with the true seismic source parameters, and
offers an order-of-magnitude reduction in the number of simulations required to
perform inference compared to standard Monte Carlo techniques. Finally, we
apply our methodology to a pair of moderate magnitude earthquakes in the North
Atlantic. We utilise seismic waveforms recorded by the recent UPFLOW ocean
bottom seismometer array as well as by regional land stations in the Azores,
comparing full moment tensor and source-time location posteriors between SBI
and a Gaussian likelihood approach. We find that our adaptation of SBI can be
directly applied to real earthquake sources to efficiently produce high quality
posterior distributions that significantly improve upon Gaussian likelihood
approaches.
[COMMENTS]
22 + 11 pages, 11 + 11 figures. Now published in GJI
[LINK]
http://arxiv.org/abs/2410.23238v2
[DATE]
2025-05-14 21:12:57+08:00
[CATEGORIES]
cs.LG
Efficient Mixed Precision Quantization in Graph Neural Networks
[AUTHORS]
Samir Moustafa, Nils M. Kriege, Wilfried N. Gansterer
[ABSTRACT]
Graph Neural Networks (GNNs) have become essential for handling large-scale
graph applications. However, the computational demands of GNNs necessitate the
development of efficient methods to accelerate inference. Mixed precision
quantization emerges as a promising solution to enhance the efficiency of GNN
architectures without compromising prediction performance. Compared to
conventional deep learning architectures, GNN layers contain a wider set of
components that can be quantized, including message passing functions,
aggregation functions, update functions, the inputs, learnable parameters, and
outputs of these functions. In this paper, we introduce a theorem for efficient
quantized message passing to aggregate integer messages. It guarantees
numerical equality of the aggregated messages using integer values with respect
to those obtained with full (FP32) precision. Based on this theorem, we
introduce the Mixed Precision Quantization for GNN (MixQ-GNN) framework, which
flexibly selects effective integer bit-widths for all components within GNN
layers. Our approach systematically navigates the wide set of possible
bit-width combinations, addressing the challenge of optimizing efficiency while
aiming at maintaining comparable prediction performance. MixQ-GNN integrates
with existing GNN quantization methods, utilizing their graph structure
advantages to achieve higher prediction performance. On average, MixQ-GNN
achieved reductions in bit operations of 5.5x for node classification and 5.1x
for graph classification compared to architectures represented in FP32
precision.
[LINK]
http://arxiv.org/abs/2505.09361v1
[DATE]
2025-05-14 21:11:39+08:00
[CATEGORIES]
cs.LG
Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis
[AUTHORS]
Bingxin Ke, Kevin Qu, Tianfu Wang, Nando Metzger, Shengyu Huang, Bo Li, Anton Obukhov, Konrad Schindler
[ABSTRACT]
The success of deep learning in computer vision over the past decade has
hinged on large labeled datasets and strong pretrained models. In data-scarce
settings, the quality of these pretrained models becomes crucial for effective
transfer learning. Image classification and self-supervised learning have
traditionally been the primary methods for pretraining CNNs and
transformer-based architectures. Recently, the rise of text-to-image generative
models, particularly those using denoising diffusion in a latent space, has
introduced a new class of foundational models trained on massive, captioned
image datasets. These models’ ability to generate realistic images of unseen
content suggests they possess a deep understanding of the visual world. In this
work, we present Marigold, a family of conditional generative models and a
fine-tuning protocol that extracts the knowledge from pretrained latent
diffusion models like Stable Diffusion and adapts them for dense image analysis
tasks, including monocular depth estimation, surface normals prediction, and
intrinsic decomposition. Marigold requires minimal modification of the
pre-trained latent diffusion model’s architecture, trains with small synthetic
datasets on a single GPU over a few days, and demonstrates state-of-the-art
zero-shot generalization. Project page:
https://marigoldcomputervision.github.io
[COMMENTS]
Journal extension of our CVPR 2024 paper, featuring new tasks,
improved efficiency, high-resolution capabilities, and enhanced accessibility
[LINK]
http://arxiv.org/abs/2505.09358v1
[DATE]
2025-05-14 21:07:03+08:00
[CATEGORIES]
cs.LG
Exploiting the Potential Supervision Information of Clean Samples in Partial Label Learning
[AUTHORS]
Guangtai Wang, Chi-Man Vong, Jintao Huang
[ABSTRACT]
Diminishing the impact of false-positive labels is critical for conducting
disambiguation in partial label learning. However, the existing disambiguation
strategies mainly focus on exploiting the characteristics of individual partial
label instances while neglecting the strong supervision information of clean
samples randomly lying in the datasets. In this work, we show that clean
samples can be collected to offer guidance and enhance the confidence of the
most probable candidates. Motivated by the differentiable count loss strategy
and the K-Nearest-Neighbor algorithm, we propose a new calibration strategy
called CleanSE. Specifically, we attribute the most
reliable candidates with higher significance under the assumption that for each
clean sample, if its label is one of the candidates of its nearest neighbor in
the representation space, it is more likely to be the ground truth of its
neighbor. Moreover, clean samples offer help in characterizing the sample
distributions by restricting the label counts of each label to a specific
interval. Extensive experiments on 3 synthetic benchmarks and 5 real-world PLL
datasets show that this calibration strategy can be applied to most
state-of-the-art PLL methods and enhances their performance.
[LINK]
http://arxiv.org/abs/2505.09354v1
[DATE]
2025-05-14 21:04:55+08:00
[CATEGORIES]
cs.LG
Easz: An Agile Transformer-based Image Compression Framework for Resource-constrained IoTs
[AUTHORS]
Yu Mao, Jingzong Li, Jun Wang, Hong Xu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue
[ABSTRACT]
Neural image compression, necessary in various machine-to-machine
communication scenarios, suffers from its heavy encode-decode structures and
inflexibility in switching between different compression levels. Consequently,
it is challenging to deploy neural image compression methods, which are
developed for powerful servers with high computational and storage capacities,
on edge devices. We take a step to solve the challenges by proposing a
new transformer-based edge-compute-free image coding framework called Easz.
Easz shifts the computational overhead to the server, and hence avoids the
heavy encoding and model switching overhead on the edge. Easz utilizes a
patch-erase algorithm to selectively remove image contents using a conditional
uniform-based sampler. The erased pixels are reconstructed on the receiver side
through a transformer-based framework. To further reduce the computational
overhead on the receiver, we then introduce a lightweight transformer-based
reconstruction structure to reduce the reconstruction load on the receiver
side. Extensive evaluations conducted on a real-world testbed demonstrate
multiple advantages of Easz over existing compression approaches, in terms of
adaptability to different compression levels, computational efficiency, and
image reconstruction quality.
[LINK]
http://arxiv.org/abs/2505.01742v2
[DATE]
2025-05-14 21:02:05+08:00
[CATEGORIES]
cs.LG
A General Graph Spectral Wavelet Convolution via Chebyshev Order Decomposition
[AUTHORS]
Nian Liu, Xiaoxin He, Thomas Laurent, Francesco Di Giovanni, Michael M. Bronstein, Xavier Bresson
[ABSTRACT]
Spectral graph convolution, an important tool of data filtering on graphs,
relies on two essential decisions: selecting spectral bases for signal
transformation and parameterizing the kernel for frequency analysis. While
recent techniques mainly focus on standard Fourier transform and vector-valued
spectral functions, they fall short in flexibility to model signal
distributions over large spatial ranges and in the capacity of the spectral function. In
this paper, we present a novel wavelet-based graph convolution network, namely
WaveGC, which integrates multi-resolution spectral bases and a matrix-valued
filter kernel. Theoretically, we establish that WaveGC can effectively capture
and decouple short-range and long-range information, providing superior
filtering flexibility, surpassing existing graph wavelet neural networks. To
instantiate WaveGC, we introduce a novel technique for learning general graph
wavelets by separately combining odd and even terms of Chebyshev polynomials.
This approach strictly satisfies wavelet admissibility criteria. Our numerical
experiments showcase the consistent improvements in both short-range and
long-range tasks. This underscores the effectiveness of the proposed model in
handling different scenarios. Our code is available at
https://github.com/liun-online/WaveGC.
[COMMENTS]
This paper is accepted by ICML 2025
[LINK]
http://arxiv.org/abs/2405.13806v2
[DATE]
2025-05-14 21:00:05+08:00
[CATEGORIES]
cs.LG
On Measuring Intrinsic Causal Attributions in Deep Neural Networks
[AUTHORS]
Saptarshi Saha, Dhruv Vansraj Rathore, Soumadeep Saha, Utpal Garain, David Doermann
[ABSTRACT]
Quantifying the causal influence of input features within neural networks has
become a topic of increasing interest. Existing approaches typically assess
direct, indirect, and total causal effects. This work treats NNs as structural
causal models (SCMs) and extends our focus to include intrinsic causal
contributions (ICC). We propose an identifiable generative post-hoc framework
for quantifying ICC. We also draw a relationship between ICC and Sobol’
indices. Our experiments on synthetic and real-world datasets demonstrate that
ICC generates more intuitive and reliable explanations compared to existing
global explanation techniques.
[LINK]
http://arxiv.org/abs/2505.09660v1
[DATE]
2025-05-14 20:59:04+08:00
[CATEGORIES]
cs.LG
Handling Missing Data in Downstream Tasks With Distribution-Preserving Guarantees
[AUTHORS]
Rahul Bordoloi, Clémence Réda, Saptarshi Bej, Olaf Wolkenhauer
[ABSTRACT]
Missing feature values are a significant hurdle for downstream
machine-learning tasks such as classification. However, imputation methods for
classification might be time-consuming for high-dimensional data, and offer few
theoretical guarantees on the preservation of the data distribution and
imputation quality, especially for not-missing-at-random mechanisms. First, we
propose an imputation approach named F3I based on the iterative improvement of
a K-nearest neighbor imputation, where neighbor-specific weights are learned
through the optimization of a novel concave, differentiable objective function
related to the preservation of the data distribution on non-missing values. F3I
can then be chained to and jointly trained with any classifier architecture.
Second, we provide a theoretical analysis of imputation quality and data
distribution preservation by F3I for several types of missing mechanisms.
Finally, we demonstrate the superior performance of F3I on several imputation
and classification tasks, with applications to drug repurposing and
handwritten-digit recognition data.
[LINK]
http://arxiv.org/abs/2501.13786v2
[DATE]
2025-05-14 20:51:30+08:00
[CATEGORIES]
cs.LG
GreenFactory: Ensembling Zero-Cost Proxies to Estimate Performance of Neural Networks
[AUTHORS]
Gabriel Cortês, Nuno Lourenço, Paolo Romano, Penousal Machado
[ABSTRACT]
Determining the performance of a Deep Neural Network during Neural
Architecture Search processes is essential for identifying optimal
architectures and hyperparameters. Traditionally, this process requires
training and evaluation of each network, which is time-consuming and
resource-intensive. Zero-cost proxies estimate performance without training,
serving as an alternative to traditional training. However, recent proxies
often lack generalization across diverse scenarios and provide only relative
rankings rather than predicted accuracies. To address these limitations, we
propose GreenFactory, an ensemble of zero-cost proxies that leverages a random
forest regressor to combine multiple predictors’ strengths and directly predict
model test accuracy. We evaluate GreenFactory on NATS-Bench, achieving robust
results across multiple datasets. Specifically, GreenFactory achieves high
Kendall correlations on NATS-Bench-SSS, indicating substantial agreement
between its predicted scores and actual performance: 0.907 for CIFAR-10, 0.945
for CIFAR-100, and 0.920 for ImageNet-16-120. Similarly, on NATS-Bench-TSS, we
achieve correlations of 0.921 for CIFAR-10, 0.929 for CIFAR-100, and 0.908 for
ImageNet-16-120, showcasing its reliability in both search spaces.
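
To make the reported metric concrete, here is a minimal sketch of a Kendall rank correlation between predicted and measured accuracies. It is not GreenFactory's evaluation code: the abstract does not say which Kendall variant is used, so the tie-ignoring tau-a form, the function name `kendall_tau_a`, and the sample numbers are all assumptions made for illustration.

```python
def kendall_tau_a(predicted, actual):
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    Assumption: tau-a, which counts tied pairs as neither concordant nor
    discordant. The abstract does not state which variant the paper uses.
    """
    if len(predicted) != len(actual):
        raise ValueError("both sequences must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("at least two items are required")
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # The sign of the product tells whether the pair is ordered the
            # same way by the predictor and by the measured accuracy.
            s = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical predicted vs. measured test accuracies for four architectures.
predicted_acc = [0.91, 0.85, 0.88, 0.70]
measured_acc = [0.93, 0.86, 0.84, 0.72]
tau = kendall_tau_a(predicted_acc, measured_acc)
```

A value near 1 means the predictor orders architectures almost exactly as their measured accuracies do, which is how correlations above 0.9 should be read.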
[LINK]
http://arxiv.org/abs/2505.09344v1
[DATE]
2025-05-14 20:40:34+08:00
[CATEGORIES]
cs.LG
Evaluating the Robustness of Adversarial Defenses in Malware Detection Systems
[AUTHORS]
Mostafa Jafari, Alireza Shameli-Sendi
[ABSTRACT]
Machine learning is a key tool for Android malware detection, effectively
identifying malicious patterns in apps. However, ML-based detectors are
vulnerable to evasion attacks, where small, crafted changes bypass detection.
Despite progress in adversarial defenses, the lack of comprehensive evaluation
frameworks in binary-constrained domains limits understanding of their
robustness. We introduce two key contributions. First, Prioritized Binary
Rounding, a technique to convert continuous perturbations into binary feature
spaces while preserving high attack success and low perturbation size. Second,
the sigma-binary attack, a novel adversarial method for binary domains,
designed to achieve attack goals with minimal feature changes. Experiments on
the Malscan dataset show that sigma-binary outperforms existing attacks and
exposes key vulnerabilities in state-of-the-art defenses. Defenses equipped
with adversary detectors, such as KDE, DLA, DNN+, and ICNN, exhibit significant
brittleness, with attack success rates exceeding 90% using fewer than 10
feature modifications and reaching 100% with just 20. Adversarially trained
defenses, including AT-rFGSM-k and AT-MaxMA, improve robustness under small
budgets but remain vulnerable to unrestricted perturbations, with attack
success rates of 99.45% and 96.62%, respectively. Although PAD-SMA demonstrates
strong robustness against state-of-the-art gradient-based adversarial attacks
by maintaining an attack success rate below 16.55%, the sigma-binary attack
significantly outperforms these methods, achieving a 94.56% success rate under
unrestricted perturbations. These findings highlight the critical need for
precise methods like sigma-binary to expose hidden vulnerabilities in existing
defenses and support the development of more resilient malware detection
systems.
[COMMENTS]
Submitted to IEEE Transactions on Information Forensics and Security
(T-IFS), 13 pages, 4 figures
[LINK]
http://arxiv.org/abs/2505.09342v1
[DATE]
2025-05-14 20:38:43+08:00
[CATEGORIES]
cs.LG
TREET: TRansfer Entropy Estimation via Transformers
[AUTHORS]
Omer Luxembourg, Dor Tsur, Haim Permuter
[ABSTRACT]
Transfer entropy (TE) is an information theoretic measure that reveals the
directional flow of information between processes, providing valuable insights
for a wide range of real-world applications. This work proposes Transfer
Entropy Estimation via Transformers (TREET), a novel attention-based approach
for estimating TE for stationary processes. The proposed approach applies the
Donsker-Varadhan representation to TE and leverages the attention mechanism for
the task of neural estimation. We present a detailed theoretical and empirical
study of TREET, comparing it to existing methods on a dedicated estimation
benchmark. To increase its applicability, we design an estimated TE
optimization scheme that is motivated by the functional representation lemma,
and use it to estimate the capacity of communication channels with memory,
which is a canonical optimization problem in information theory. We further
demonstrate how an optimized TREET can be used to estimate underlying
densities, providing experimental results. Finally, we apply TREET to feature
analysis of patients with Apnea, demonstrating its applicability to real-world
physiological data. Our work, applied with state-of-the-art deep learning
methods, opens a new door for communication problems which are yet to be
solved.
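
For orientation, the two standard identities behind this kind of estimator can be written out as below. This is a generic sketch, not the paper's exact formulation: the history notation $X^{-}$, $Y^{-}$ and the critic function $T$ are assumed symbols, and the abstract does not specify the memory lengths or how TREET parameterizes $T$.

```latex
% Transfer entropy from X to Y as a conditional mutual information,
% with X^- and Y^- denoting (finite) past windows of each process:
\mathrm{TE}_{X \to Y}
  = I\!\left(X^{-};\, Y_t \mid Y^{-}\right)
  = H\!\left(Y_t \mid Y^{-}\right) - H\!\left(Y_t \mid Y^{-}, X^{-}\right)

% Donsker-Varadhan variational representation of the KL divergence,
% where the supremum runs over functions T with finite expectations:
D_{\mathrm{KL}}(P \,\|\, Q)
  = \sup_{T}\; \mathbb{E}_{P}\!\left[T\right] - \log \mathbb{E}_{Q}\!\left[e^{T}\right]
```

Because each conditional entropy term can be expressed through a KL divergence, restricting $T$ to a trainable network (here, an attention-based one) gives a lower bound that can be maximized by gradient ascent.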
[COMMENTS]
This work has been submitted to the IEEE for possible publication
[LINK]
http://arxiv.org/abs/2402.06919v3
[DATE]
2025-05-14 20:35:16+08:00
[CATEGORIES]
cs.LG
MUST: Multi-Scale Structural-Temporal Link Prediction Model for UAV Ad Hoc Networks
[AUTHORS]
Cunlai Pu, Fangrui Wu, Rajput Ramiz Sharafat, Guangzhao Dai, Xiangbo Shu
[ABSTRACT]
Link prediction in unmanned aerial vehicle (UAV) ad hoc networks (UANETs)
aims to predict the potential formation of future links between UAVs. In
adversarial environments where the route information of UAVs is unavailable,
predicting future links must rely solely on the observed historical topological
information of UANETs. However, the highly dynamic and sparse nature of UANET
topologies presents substantial challenges in effectively capturing meaningful
structural and temporal patterns for accurate link prediction. Most existing
link prediction methods focus on temporal dynamics at a single structural scale
while neglecting the effects of sparsity, resulting in insufficient information
capture and limited applicability to UANETs. In this paper, we propose a
multi-scale structural-temporal link prediction model (MUST) for UANETs.
Specifically, we first employ graph attention networks (GATs) to capture
structural features at multiple levels, including the individual UAV level, the
UAV community level, and the overall network level. Then, we use long
short-term memory (LSTM) networks to learn the temporal dynamics of these
multi-scale structural features. Additionally, we address the impact of
sparsity by introducing a sophisticated loss function during model
optimization. We validate the performance of MUST using several UANET datasets
generated through simulations. Extensive experimental results demonstrate that
MUST achieves state-of-the-art link prediction performance in highly dynamic
and sparse UANETs.
[LINK]
http://arxiv.org/abs/2505.09331v1
[DATE]
2025-05-14 20:26:46+08:00
[CATEGORIES]
cs.LG
Efficient Prior Calibration From Indirect Data
[AUTHORS]
O. Deniz Akyildiz, Mark Girolami, Andrew M. Stuart, Arnaud Vadeboncoeur
[ABSTRACT]
Bayesian inversion is central to the quantification of uncertainty within
problems arising from numerous applications in science and engineering. To
formulate the approach, four ingredients are required: a forward model mapping
the unknown parameter to an element of a solution space, often the solution
space for a differential equation; an observation operator mapping an element
of the solution space to the data space; a noise model describing how noise
pollutes the observations; and a prior model describing knowledge about the
unknown parameter before the data is acquired. This paper is concerned with
learning the prior model from data; in particular, learning the prior from
multiple realizations of indirect data obtained through the noisy observation
process. The prior is represented, using a generative model, as the pushforward
of a Gaussian in a latent space; the pushforward map is learned by minimizing
an appropriate loss function. A metric that is well-defined under empirical
approximation is used to define the loss function for the pushforward map to
make an implementable methodology. Furthermore, an efficient residual-based
neural operator approximation of the forward model is proposed and it is shown
that this may be learned concurrently with the pushforward map, using a bilevel
optimization formulation of the problem; this use of neural operator
approximation has the potential to make prior learning from indirect data more
computationally efficient, especially when the observation process is
expensive, non-smooth or not known. The ideas are illustrated with the Darcy
flow inverse problem of finding permeability from piezometric head
measurements.
[LINK]
http://arxiv.org/abs/2405.17955v2
[DATE]
2025-05-14 20:25:27+08:00
[CATEGORIES]
cs.LG
Accelerating Machine Learning Systems via Category Theory: Applications to Spherical Attention for Gene Regulatory Networks
[AUTHORS]
Vincent Abbott, Kotaro Kamiya, Gerard Glowacki, Yu Atsumi, Gioele Zardini, Yoshihiro Maruyama
[ABSTRACT]
How do we enable artificial intelligence models to improve themselves? This
is central to exponentially improving generalized artificial intelligence
models, which can improve their own architecture to handle new problem domains
in an efficient manner that leverages the latest hardware. However, current
automated compilation methods are poor, and efficient algorithms require years
of human development. In this paper, we use neural circuit diagrams, based in
category theory, to prove a general theorem related to deep learning
algorithms, guide the development of a novel attention algorithm catered to the
domain of gene regulatory networks, and produce a corresponding efficient
kernel. The algorithm we propose, spherical attention, shows that neural
circuit diagrams enable a principled and systematic method for reasoning about
deep learning architectures and providing high-performance code. By replacing
SoftMax with an $L^2$ norm as suggested by diagrams, it overcomes the special
function unit bottleneck of standard attention while retaining the streaming
property essential to high performance. Our diagrammatically derived
FlashSign kernel achieves comparable performance to the
state-of-the-art, fine-tuned FlashAttention algorithm on an A100, and
$3.6\times$ the performance of PyTorch. Overall, this investigation shows
neural circuit diagrams’ suitability as a high-level framework for the
automated development of efficient, novel artificial intelligence
architectures.
[LINK]
http://arxiv.org/abs/2505.09326v1
[DATE]
2025-05-14 20:24:22+08:00
[CATEGORIES]
cs.LG
Neural Video Compression using 2D Gaussian Splatting
[AUTHORS]
Lakshya Gupta, Imran N. Junejo
[ABSTRACT]
The computer vision and image processing research community has been involved
in standardizing video data communications for several decades, leading
to standards such as AVC, HEVC, VVC, AV1, AV2, etc. However, recent
groundbreaking works have focused on employing deep learning-based techniques
to replace the traditional video codec pipeline to greater effect. Neural
video codecs (NVC) create an end-to-end ML-based solution that does not rely on
any handcrafted features (motion or edge-based) and have the ability to learn
content-aware compression strategies, offering better adaptability and higher
compression efficiency than traditional methods. This holds great potential
not only for hardware design, but also for various video streaming platforms
and applications, especially video conferencing applications such as MS-Teams
or Zoom that have found extensive usage in classrooms and workplaces. However,
their high computational demands currently limit their use in real-time
applications like video conferencing. To address this, we propose a
region-of-interest (ROI) based neural video compression model that leverages 2D
Gaussian Splatting. Unlike traditional codecs, 2D Gaussian Splatting is capable
of real-time decoding and can be optimized using fewer data points, requiring
only thousands of Gaussians for decent quality outputs as opposed to millions
in 3D scenes. In this work, we designed a video pipeline that speeds up the
encoding time of the previous Gaussian splatting-based image codec by 88% by
using a content-aware initialization strategy paired with a novel Gaussian
inter-frame redundancy-reduction mechanism, enabling Gaussian splatting to be
used for a video-codec solution, the first of its kind in the neural
video codec space.
[COMMENTS]
9 pages, 8 figures
[LINK]
http://arxiv.org/abs/2505.09324v1
[DATE]
2025-05-14 20:23:53+08:00
[CATEGORIES]
cs.LG
TransDiffuser: End-to-end Trajectory Generation with Decorrelated Multi-modal Representation for Autonomous Driving
[AUTHORS]
Xuefeng Jiang, Yuan Ma, Pengxiang Li, Leimeng Xu, Xin Wen, Kun Zhan, Zhongpu Xia, Peng Jia, XianPeng Lang, Sheng Sun
[ABSTRACT]
In recent years, diffusion models have shown their potential across diverse
domains from vision generation to language modeling. Transferring their
capabilities to modern autonomous driving systems has also emerged as a
promising direction. In this work, we propose TransDiffuser, an encoder-decoder
based generative trajectory planning model for end-to-end autonomous driving.
The encoded scene information serves as the multi-modal conditional input of
the denoising decoder. To tackle the mode collapse dilemma in generating
high-quality diverse trajectories, we introduce a simple yet effective
multi-modal representation decorrelation optimization mechanism during the
training process. TransDiffuser achieves a PDMS of 94.85 on the NAVSIM benchmark,
surpassing previous state-of-the-art methods without any anchor-based prior
trajectories.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2505.09315v1
[DATE]
2025-05-14 20:10:41+08:00
[CATEGORIES]
cs.LG
Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling
[AUTHORS]
Michal Balcerak, Tamaz Amiranashvili, Antonio Terpin, Suprosanna Shit, Sebastian Kaltenbach, Petros Koumoutsakos, Bjoern Menze
[ABSTRACT]
The most widely used generative models map noise and data distributions by
matching flows or scores. However, they struggle to incorporate partial
observations and additional priors–something energy-based models (EBMs) handle
elegantly by simply adding corresponding scalar energy terms. We address this
issue by proposing Energy Matching, a framework that endows flow-based
approaches with the flexibility of EBMs. Far from the data manifold, samples
move along curl-free, optimal transport paths from noise to data. As they
approach the data manifold, an entropic energy term guides the system into a
Boltzmann equilibrium distribution, explicitly capturing the underlying
likelihood structure of the data. We parameterize this dynamic with a single
time-independent scalar field, which serves as both a powerful generator and a
flexible prior for effective regularization of inverse problems. Our method
substantially outperforms existing EBMs on CIFAR-10 and ImageNet generation in
terms of fidelity, while retaining simulation-free training of transport-based
approaches away from the data manifold. Furthermore, we leverage the method’s
flexibility to introduce an interaction energy that supports diverse mode
exploration, which we demonstrate in a controlled protein-generation setting.
Our approach focuses on learning a scalar potential energy–without
time-conditioning, auxiliary generators, or additional networks–which marks a
significant departure from recent EBM methods. We believe that this simplified
framework significantly advances EBMs' capabilities and paves the way for their
wider adoption in generative modeling across diverse domains.
[LINK]
http://arxiv.org/abs/2504.10612v2
[DATE]
2025-05-14 20:10:11+08:00
[CATEGORIES]
cs.LG
Detecting Sybil Addresses in Blockchain Airdrops: A Subgraph-based Feature Propagation and Fusion Approach
[AUTHORS]
Qiangqiang Liu, Qian Huang, Frank Fan, Haishan Wu, Xueyan Tang
[ABSTRACT]
Sybil attacks pose a significant security threat to blockchain ecosystems,
particularly in token airdrop events. This paper proposes a novel sybil address
identification method based on subgraph feature extraction and LightGBM. The method
first constructs a two-layer deep transaction subgraph for each address, then
extracts key event operation features according to the lifecycle of sybil
addresses, including the time of first transaction, first gas acquisition,
participation in airdrop activities, and last transaction. These temporal
features effectively capture the consistency of sybil address behavior
operations. Additionally, the method extracts amount and network structure
features, comprehensively describing address behavior patterns and network
topology through feature propagation and fusion. Experiments conducted on a
dataset containing 193,701 addresses (including 23,240 sybil addresses) show
that this method outperforms existing approaches in terms of precision, recall,
F1 score, and AUC, with all metrics exceeding 0.9. The methods and results of
this study can be further applied to broader blockchain security areas such as
transaction manipulation identification and token liquidity risk assessment,
contributing to the construction of a more secure and fair blockchain
ecosystem.
[COMMENTS]
IEEE International Conference on Blockchain and Cryptocurrency (Proc.
IEEE ICBC 2025)
[LINK]
http://arxiv.org/abs/2505.09313v1
[DATE]
2025-05-14 20:04:26+08:00
[CATEGORIES]
cs.LG
Properties of Discrete Sliced Wasserstein Losses
[AUTHORS]
Eloi Tanguy, Rémi Flamary, Julie Delon
[ABSTRACT]
The Sliced Wasserstein (SW) distance has become a popular alternative to the
Wasserstein distance for comparing probability measures. Widespread
applications include image processing, domain adaptation and generative
modelling, where it is common to optimise some parameters in order to minimise
SW, which serves as a loss function between discrete probability measures
(since measures admitting densities are numerically unattainable). All these
optimisation problems bear the same sub-problem, which is minimising the Sliced
Wasserstein energy. In this paper we study the properties of $\mathcal{E}: Y
\longmapsto \mathrm{SW}_2^2(\gamma_Y, \gamma_Z)$, i.e. the SW distance between
two uniform discrete measures with the same number of points as a function of
the support $Y \in \mathbb{R}^{n \times d}$ of one of the measures. We
investigate the regularity and optimisation properties of this energy, as well
as its Monte-Carlo approximation $\mathcal{E}_p$ (estimating the expectation in
SW using only $p$ samples) and show convergence results on the critical points
of $\mathcal{E}_p$ to those of $\mathcal{E}$, as well as an almost-sure uniform
convergence and a uniform Central Limit result on the process
$\mathcal{E}_p(Y)$. Finally, we show that in a certain sense, Stochastic
Gradient Descent methods minimising $\mathcal{E}$ and $\mathcal{E}_p$ converge
towards (Clarke) critical points of these energies.
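
The Monte-Carlo energy $\mathcal{E}_p$ described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the function names, the seeded generator, and the choice to sample directions by normalising Gaussian vectors are all hypothetical. It relies on the standard fact that, in one dimension, optimal transport between two uniform measures with the same number of points matches the sorted projections in order.

```python
import math
import random


def random_direction(d, rng):
    """Draw a direction uniformly on the unit sphere of R^d (normalised Gaussian)."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]


def sliced_w2_squared_mc(Y, Z, p, seed=0):
    """Monte-Carlo estimate of SW_2^2 between the uniform measures on Y and Z.

    Y and Z are lists of n points in R^d with the same n. For each of the p
    random directions, both supports are projected onto the line, sorted,
    and matched in order (the optimal 1D transport plan).
    """
    if len(Y) != len(Z):
        raise ValueError("Y and Z must contain the same number of points")
    n, d = len(Y), len(Y[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(p):
        theta = random_direction(d, rng)
        proj_y = sorted(sum(t * c for t, c in zip(theta, y)) for y in Y)
        proj_z = sorted(sum(t * c for t, c in zip(theta, z)) for z in Z)
        # Squared 1D Wasserstein-2 distance for this projection direction.
        total += sum((a - b) ** 2 for a, b in zip(proj_y, proj_z)) / n
    return total / p


# Hypothetical supports: Z is Y translated by the vector (3, 4).
Y = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
Z = [[3.0, 4.0], [4.0, 4.0], [3.0, 6.0]]
estimate = sliced_w2_squared_mc(Y, Z, p=200)
```

The sorting step makes the estimate piecewise smooth in $Y$, which is why the regularity and critical-point questions studied in the paper are non-trivial.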
[LINK]
http://arxiv.org/abs/2307.10352v7
[DATE]
2025-05-14 20:02:25+08:00
[CATEGORIES]
cs.LG
Bayesian computation with generative diffusion models by Multilevel Monte Carlo
[AUTHORS]
Abdul-Lateef Haji-Ali, Marcelo Pereyra, Luke Shaw, Konstantinos Zygalakis
[ABSTRACT]
Generative diffusion models have recently emerged as a powerful strategy to
perform stochastic sampling in Bayesian inverse problems, delivering remarkably
accurate solutions for a wide range of challenging applications. However,
diffusion models often require a large number of neural function evaluations
per sample in order to deliver accurate posterior samples. As a result, using
diffusion models as stochastic samplers for Monte Carlo integration in Bayesian
computation can be highly computationally expensive, particularly in
applications that require a substantial number of Monte Carlo samples for
conducting uncertainty quantification analyses. This cost is especially high in
large-scale inverse problems such as computational imaging, which rely on large
neural networks that are expensive to evaluate. With quantitative imaging
applications in mind, this paper presents a Multilevel Monte Carlo strategy
that significantly reduces the cost of Bayesian computation with diffusion
models. This is achieved by exploiting cost-accuracy trade-offs inherent to
diffusion models to carefully couple models of different levels of accuracy in
a manner that significantly reduces the overall cost of the calculation,
without reducing the final accuracy. The proposed approach achieves a
$4\times$-to-$8\times$ reduction in computational cost w.r.t. standard
techniques across three benchmark imaging problems.
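
The cost-accuracy coupling mentioned above rests on the standard Multilevel Monte Carlo telescoping identity, sketched below. The notation is assumed for illustration: $\varphi_\ell$ stands for the quantity of interest computed with the level-$\ell$ sampler (level $L$ being the most accurate), and $N_\ell$ is the number of samples at that level. How the paper defines its levels is not stated in the abstract.

```latex
% Telescoping decomposition of the finest-level expectation:
\mathbb{E}[\varphi_L]
  = \mathbb{E}[\varphi_0]
  + \sum_{\ell=1}^{L} \mathbb{E}\!\left[\varphi_\ell - \varphi_{\ell-1}\right]

% Multilevel estimator: each term is averaged over its own independent batch,
% with \varphi_\ell^{(i)} and \varphi_{\ell-1}^{(i)} generated from coupled randomness:
\widehat{\varphi}_{\mathrm{MLMC}}
  = \frac{1}{N_0} \sum_{i=1}^{N_0} \varphi_0^{(i)}
  + \sum_{\ell=1}^{L} \frac{1}{N_\ell} \sum_{i=1}^{N_\ell}
    \left( \varphi_\ell^{(i)} - \varphi_{\ell-1}^{(i)} \right)
```

When the coupling makes the variance of $\varphi_\ell - \varphi_{\ell-1}$ small, most samples can be drawn at the cheap coarse levels and only a few at the expensive fine levels, which is where the cost reduction comes from.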
[COMMENTS]
13 images
[LINK]
http://arxiv.org/abs/2409.15511v4
[DATE]
2025-05-14 19:56:59+08:00
[CATEGORIES]
cs.LG
Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model
[AUTHORS]
George Andriopoulos, Soyuj Jung Basnet, Juan Guevara, Li Guo, Keith Ross
[ABSTRACT]
The Unconstrained Feature Model (UFM) is a mathematical framework that
enables closed-form approximations for minimal training loss and related
performance measures in deep neural networks (DNNs). This paper leverages the
UFM to provide qualitative insights into neural multivariate regression, a
critical task in imitation learning, robotics, and reinforcement learning.
Specifically, we address two key questions: (1) How do multi-task models
compare to multiple single-task models in terms of training performance? (2)
Can whitening and normalizing regression targets improve training performance?
The UFM theory predicts that multi-task models achieve strictly smaller
training MSE than multiple single-task models when the same or stronger
regularization is applied to the latter, and our empirical results confirm
these findings. Regarding whitening and normalizing regression targets, the UFM
theory predicts that they reduce training MSE when the average variance across
the target dimensions is less than one, and our empirical results once again
confirm these findings. These findings highlight the UFM as a powerful
framework for deriving actionable insights into DNN design and data
pre-processing strategies.
[COMMENTS]
31 pages, 8 figures
[LINK]
http://arxiv.org/abs/2505.09308v1
[DATE]
2025-05-14 19:52:45+08:00
[CATEGORIES]
cs.LG
Predicting butterfly species presence from satellite imagery using soft contrastive regularisation
[AUTHORS]
Thijs L van der Plas, Stephen Law, Michael JO Pocock
[ABSTRACT]
The growing demand for scalable biodiversity monitoring methods has fuelled
interest in remote sensing data, due to its widespread availability and
extensive coverage. Traditionally, the application of remote sensing to
biodiversity research has focused on mapping and monitoring habitats, but with
increasing availability of large-scale citizen-science wildlife observation
data, recent methods have started to explore predicting multi-species presence
directly from satellite images. This paper presents a new data set for
predicting butterfly species presence from satellite data in the United
Kingdom. We experimentally optimise a ResNet-based model to predict
multi-species presence from 4-band satellite images, and find that this model
especially outperforms the mean rate baseline for locations with high species
biodiversity. To improve performance, we develop a soft, supervised contrastive
regularisation loss that is tailored to probabilistic labels (such as
species-presence data), and demonstrate that this improves prediction accuracy.
In summary, our new data set and contrastive regularisation method contribute
to the open challenge of accurately predicting species biodiversity from remote
sensing data, which is key for efficient biodiversity monitoring.
[COMMENTS]
To be published in the 2025 CVPR FGVC12 workshop
[LINK]
http://arxiv.org/abs/2505.09306v1
[DATE]
2025-05-14 19:42:09+08:00
[CATEGORIES]
cs.LG
Adaptive Noise Resilient Keyword Spotting Using One-Shot Learning
[AUTHORS]
Luciano Sebastian Martinez-Rau, Quynh Nguyen Phuong Vu, Yuxuan Zhang, Bengt Oelmann, Sebastian Bader
[ABSTRACT]
Keyword spotting (KWS) is a key component of smart devices, enabling
efficient and intuitive audio interaction. However, standard KWS systems
deployed on embedded devices often suffer performance degradation under
real-world operating conditions. Resilient KWS systems address this issue by
enabling dynamic adaptation, with applications such as adding or replacing
keywords, adjusting to specific users, and improving noise robustness. However,
deploying resilient, standalone KWS systems with low latency on
resource-constrained devices remains challenging due to limited memory and
computational resources. This study proposes a computationally lightweight approach for
continuous noise adaptation of pretrained neural networks used for KWS
classification, requiring only 1-shot learning and one epoch. The proposed
method was assessed using two pretrained models and three real-world noise
sources at signal-to-noise ratios (SNRs) ranging from 24 to -3 dB. The adapted
models consistently outperformed the pretrained models across all scenarios,
especially at SNR $\leq$ 18 dB, achieving accuracy improvements of 4.9% to
46.0%. These results highlight the efficacy of the proposed methodology while
being lightweight enough for deployment on resource-constrained devices.
[COMMENTS]
Preprint submitted to the IEEE 11th World Forum on Internet of Things
[LINK]
http://arxiv.org/abs/2505.09304v1
[DATE]
2025-05-14 19:39:47+08:00
[CATEGORIES]
cs.LG
Toward Fair Federated Learning under Demographic Disparities and Data Imbalance
[AUTHORS]
Qiming Wu, Siqi Li, Doudou Zhou, Nan Liu
[ABSTRACT]
Ensuring fairness is critical when applying artificial intelligence to
high-stakes domains such as healthcare, where predictive models trained on
imbalanced and demographically skewed data risk exacerbating existing
disparities. Federated learning (FL) enables privacy-preserving collaboration
across institutions, but remains vulnerable to both algorithmic bias and
subgroup imbalance - particularly when multiple sensitive attributes intersect.
We propose FedIDA (Federated Learning for Imbalance and Disparity Awareness),
a framework-agnostic method that combines fairness-aware
regularization with group-conditional oversampling. FedIDA supports multiple
sensitive attributes and heterogeneous data distributions without altering the
convergence behavior of the underlying FL algorithm. We provide theoretical
analysis establishing fairness improvement bounds using Lipschitz continuity
and concentration inequalities, and show that FedIDA reduces the variance of
fairness metrics across test sets. Empirical results on both benchmark and
real-world clinical datasets confirm that FedIDA consistently improves fairness
while maintaining competitive predictive performance, demonstrating its
effectiveness for equitable and privacy-preserving modeling in healthcare. The
source code is available on GitHub.
[LINK]
http://arxiv.org/abs/2505.09295v1
[DATE]
2025-05-14 19:22:54+08:00
[CATEGORIES]
cs.LG
Ranking-Based At-Risk Student Prediction Using Federated Learning and Differential Features
[AUTHORS]
Shunsuke Yoneda, Valdemar Švábenský, Gen Li, Daisuke Deguchi, Atsushi Shimada
[ABSTRACT]
Digital textbooks are widely used in various educational contexts, such as
university courses and online lectures. Such textbooks yield learning log data
that have been used in numerous educational data mining (EDM) studies for
student behavior analysis and performance prediction. However, these studies
have faced challenges in integrating confidential data, such as academic
records and learning logs, across schools due to privacy concerns.
Consequently, analyses are often conducted with data limited to a single
school, which makes developing high-performing and generalizable models
difficult. This study proposes a method that combines federated learning and
differential features to address these issues. Federated learning enables model
training without centralizing data, thereby preserving student privacy.
Differential features, which utilize relative values instead of absolute
values, enhance model performance and generalizability. To evaluate the
proposed method, a model for predicting at-risk students was trained using data
from 1,136 students across 12 courses conducted over 4 years, and validated on
hold-out test data from 5 other courses. Experimental results demonstrated that
the proposed method addresses privacy concerns while achieving performance
comparable to that of models trained via centralized learning in terms of Top-n
precision, nDCG, and PR-AUC. Furthermore, using differential features improved
prediction performance across all evaluation datasets compared to
non-differential approaches. The trained models were also applicable for early
prediction, achieving high performance in detecting at-risk students in earlier
stages of the semester within the validation datasets.
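
For readers unfamiliar with the ranking metrics named above, the following is a minimal sketch of Top-n precision and nDCG for binary at-risk labels. It assumes the standard textbook definitions (binary relevance, log2 rank discount); the abstract does not state which variants the authors use, and the function names and toy data are hypothetical.

```python
import math

def top_n_precision(scores, labels, n):
    """Fraction of truly at-risk students among the n highest-scored ones."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sum(labels[i] for i in order) / n

def ndcg_at_n(scores, labels, n):
    """Normalized discounted cumulative gain over the top-n ranked students."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    # Rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on.
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sorted(labels, reverse=True)[:n]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data (hypothetical): predicted risk scores and true at-risk labels.
scores = [0.9, 0.8, 0.3, 0.6, 0.1]
labels = [1, 0, 0, 1, 0]
precision_3 = top_n_precision(scores, labels, 3)
ndcg_3 = ndcg_at_n(scores, labels, 3)
```

Both metrics reward placing at-risk students near the top of the ranking; nDCG additionally penalizes at-risk students who appear lower within the top n.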
[COMMENTS]
To appear in the Proceedings of the 18th Educational Data Mining
Conference (EDM 2025)
[LINK]
http://arxiv.org/abs/2505.09287v1
[DATE]
2025-05-14 19:12:30+08:00
[CATEGORIES]
cs.LG
Generating Full-field Evolution of Physical Dynamics from Irregular Sparse Observations
[AUTHORS]
Panqi Chen, Yifan Sun, Lei Cheng, Yang Yang, Weichang Li, Yang Liu, Weiqing Liu, Jiang Bian, Shikai Fang
[ABSTRACT]
Modeling and reconstructing multidimensional physical dynamics from sparse
and off-grid observations presents a fundamental challenge in scientific
research. Recently, diffusion-based generative modeling has shown promising
potential for physical simulation. However, current approaches typically
operate on on-grid data with preset spatiotemporal resolution, but struggle
with the sparsely observed and continuous nature of real-world physical
dynamics. To fill the gaps, we present SDIFT, Sequential DIffusion in
Functional Tucker space, a novel framework that generates full-field evolution
of physical dynamics from irregular sparse observations. SDIFT leverages the
functional Tucker model as the latent space representer with proven universal
approximation property, and represents observations as latent functions and
Tucker core sequences. We then construct a sequential diffusion model with
temporally augmented UNet in the functional Tucker space, denoising noise drawn
from a Gaussian process to generate the sequence of core tensors.
At the posterior sampling stage, we propose a Message-Passing Posterior
Sampling mechanism, enabling conditional generation of the entire sequence
guided by observations at limited time steps. We validate SDIFT on three
physical systems spanning astronomical (supernova explosions, light-year
scale), environmental (ocean sound speed fields, kilometer scale), and
molecular (organic liquid, millimeter scale) domains, demonstrating significant
improvements in both reconstruction accuracy and computational efficiency
compared to state-of-the-art approaches.
[LINK]
http://arxiv.org/abs/2505.09284v1
[DATE]
2025-05-14 19:09:15+08:00
[CATEGORIES]
cs.LG
OLinear: A Linear Model for Time Series Forecasting in Orthogonally Transformed Domain
[AUTHORS]
Wenzhen Yue, Yong Liu, Haoxuan Li, Hao Wang, Xianghua Ying, Ruohao Guo, Bowei Xing, Ji Shi
[ABSTRACT]
This paper presents $\mathbf{OLinear}$, a $\mathbf{linear}$-based
multivariate time series forecasting model that operates in an
$\mathbf{o}$rthogonally transformed domain. Recent forecasting models typically
adopt the temporal forecast (TF) paradigm, which directly encodes and decodes
time series in the time domain. However, the entangled step-wise dependencies
in series data can hinder the performance of TF. To address this, some
forecasters conduct encoding and decoding in the transformed domain using
fixed, dataset-independent bases (e.g., sine and cosine signals in the Fourier
transform). In contrast, we utilize $\mathbf{OrthoTrans}$, a data-adaptive
transformation based on an orthogonal matrix that diagonalizes the series’
temporal Pearson correlation matrix. This approach enables more effective
encoding and decoding in the decorrelated feature domain and can serve as a
plug-in module to enhance existing forecasters. To enhance the representation
learning for multivariate time series, we introduce a customized linear layer,
$\mathbf{NormLin}$, which employs a normalized weight matrix to capture
multivariate dependencies. Empirically, the NormLin module shows a surprising
performance advantage over multi-head self-attention, while requiring nearly
half the FLOPs. Extensive experiments on 24 benchmarks and 140 forecasting
tasks demonstrate that OLinear consistently achieves state-of-the-art
performance with high efficiency. Notably, as a plug-in replacement for
self-attention, the NormLin module consistently enhances Transformer-based
forecasters. The code and datasets are available at
https://anonymous.4open.science/r/OLinear
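
To illustrate the idea of an orthogonal matrix that diagonalizes a temporal Pearson correlation matrix, here is a minimal NumPy sketch. The sliding-window construction, the window length, the synthetic random-walk series, and the name `ortho_basis` are assumptions made for illustration; the abstract does not specify how the authors estimate or apply the matrix.

```python
import numpy as np

def ortho_basis(windows):
    """Eigendecompose the temporal Pearson correlation matrix.

    windows: array of shape (num_windows, T), one row per window.
    Returns (Q, eigvals, corr), where Q is orthogonal and
    Q.T @ corr @ Q is diagonal with eigvals on the diagonal.
    """
    # Columns are time steps, so rowvar=False gives a (T, T) matrix.
    corr = np.corrcoef(windows, rowvar=False)
    # eigh is meant for symmetric matrices and returns orthonormal eigenvectors.
    eigvals, Q = np.linalg.eigh(corr)
    return Q, eigvals, corr

# Synthetic series (assumption): a random walk with strong temporal correlation.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))
T = 8
windows = np.lib.stride_tricks.sliding_window_view(series, T)

Q, eigvals, corr = ortho_basis(windows)
# Project each window onto the orthogonal basis.
transformed = windows @ Q
```

Because `Q` is orthogonal, the transform is invertible with `transformed @ Q.T`, so a forecaster could in principle operate in the transformed domain and map its predictions back.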
[LINK]
http://arxiv.org/abs/2505.08550v2
[DATE]
2025-05-14 19:00:57+08:00
[CATEGORIES]
cs.LG
AdaWorld: Learning Adaptable World Models with Latent Actions
[AUTHORS]
Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, Chuang Gan
[ABSTRACT]
World models aim to learn action-controlled future prediction and have proven
essential for the development of intelligent agents. However, most existing
world models rely heavily on substantial action-labeled data and costly
training, making it challenging to adapt to novel environments with
heterogeneous actions through limited interactions. This limitation can hinder
their applicability across broader domains. To overcome this limitation, we
propose AdaWorld, an innovative world model learning approach that enables
efficient adaptation. The key idea is to incorporate action information during
the pretraining of world models. This is achieved by extracting latent actions
from videos in a self-supervised manner, capturing the most critical
transitions between frames. We then develop an autoregressive world model that
conditions on these latent actions. This learning paradigm enables highly
adaptable world models, facilitating efficient transfer and learning of new
actions even with limited interactions and finetuning. Our comprehensive
experiments across multiple environments demonstrate that AdaWorld achieves
superior performance in both simulation quality and visual planning.
[COMMENTS]
ICML 2025. Project page: https://adaptable-world-model.github.io/,
code: https://github.com/Little-Podi/AdaWorld, model:
https://huggingface.co/Little-Podi/AdaWorld
[LINK]
http://arxiv.org/abs/2503.18938v3
[DATE]
2025-05-14 18:26:17+08:00
[CATEGORIES]
cs.LG
EDBench: Large-Scale Electron Density Data for Molecular Modeling
[AUTHORS]
Hongxin Xiang, Ke Li, Mingquan Liu, Zhixiang Cheng, Bin Yao, Wenjie Du, Jun Xia, Li Zeng, Xin Jin, Xiangxiang Zeng
[ABSTRACT]
Existing molecular machine learning force fields (MLFFs) generally focus on
the learning of atoms, molecules, and simple quantum chemical properties (such
as energy and force), but ignore the importance of electron density (ED)
$\rho(r)$ in accurately understanding molecular force fields (MFFs). ED
describes the probability of finding electrons at specific locations around
atoms or molecules, which uniquely determines all ground state properties (such
as energy, molecular structure, etc.) of interactive multi-particle systems
according to the Hohenberg-Kohn theorem. However, the calculation of ED relies
on the time-consuming first-principles density functional theory (DFT) which
leads to the lack of large-scale ED data and limits its application in MLFFs.
In this paper, we introduce EDBench, a large-scale, high-quality dataset of ED
designed to advance learning-based research at the electronic scale. Built upon
PCQM4Mv2, EDBench provides accurate ED data, covering 3.3 million
molecules. To comprehensively evaluate the ability of models to understand and
utilize electronic information, we design a suite of ED-centric benchmark tasks
spanning prediction, retrieval, and generation. Our evaluation on several
state-of-the-art methods demonstrates that learning from EDBench is not only
feasible but also achieves high accuracy. Moreover, we show that learning-based
methods can efficiently calculate ED with comparable precision while
significantly reducing the computational cost relative to traditional DFT
calculations. All data and benchmarks from EDBench will be freely available,
laying a robust foundation for ED-driven drug discovery and materials science.
[LINK]
http://arxiv.org/abs/2505.09262v1
[DATE]
2025-05-14 18:23:22+08:00
[CATEGORIES]
cs.LG
Stable and Convexified Information Bottleneck Optimization via Symbolic Continuation and Entropy-Regularized Trajectories
[AUTHORS]
Faruk Alpay
[ABSTRACT]
The Information Bottleneck (IB) method frequently suffers from unstable
optimization, characterized by abrupt representation shifts near critical
points of the IB trade-off parameter, beta. In this paper, I introduce a novel
approach to achieve stable and convex IB optimization through symbolic
continuation and entropy-regularized trajectories. I analytically prove
convexity and uniqueness of the IB solution path when an entropy regularization
term is included, and demonstrate how this stabilizes representation learning
across a wide range of beta values. Additionally, I provide extensive
sensitivity analyses around critical points (beta) with statistically robust
uncertainty quantification (95% confidence intervals). The open-source
implementation, experimental results, and reproducibility framework included in
this work offer a clear path for practical deployment and future extension of
my proposed method.
[COMMENTS]
23 pages, 11 figures, includes analytical proofs, sensitivity
analysis (95% CI), and JAX-based open-source implementation available at:
https://github.com/farukalpay/information-bottleneck-beta-optimization
[LINK]
http://arxiv.org/abs/2505.09239v1
[DATE]
2025-05-14 17:27:09+08:00
[CATEGORIES]
cs.LG
Simulating Dynamic Tumor Contrast Enhancement in Breast MRI using Conditional Generative Adversarial Networks
[AUTHORS]
Richard Osuala, Smriti Joshi, Apostolia Tsirikoglou, Lidia Garrucho, Walter H. L. Pinaya, Daniel M. Lang, Julia A. Schnabel, Oliver Diaz, Karim Lekadir
[ABSTRACT]
This paper presents a method for virtual contrast enhancement in breast MRI,
offering a promising non-invasive alternative to traditional contrast
agent-based DCE-MRI acquisition. Using a conditional generative adversarial
network, we predict DCE-MRI images, including jointly-generated sequences of
multiple corresponding DCE-MRI timepoints, from non-contrast-enhanced MRIs,
enabling tumor localization and characterization without the associated health
risks. Furthermore, we qualitatively and quantitatively evaluate the synthetic
DCE-MRI images, proposing a multi-metric Scaled Aggregate Measure (SAMe),
assessing their utility in a tumor segmentation downstream task, and conclude
with an analysis of the temporal patterns in multi-sequence DCE-MRI generation.
Our approach demonstrates promising results in generating realistic and useful
DCE-MRI sequences, highlighting the potential of virtual contrast enhancement
for improving breast cancer diagnosis and treatment, particularly for patients
where contrast agent administration is contraindicated.
[LINK]
http://arxiv.org/abs/2409.18872v2
[DATE]
2025-05-14 17:15:25+08:00
[CATEGORIES]
cs.LG
Optimal Transport-Based Domain Adaptation for Rotated Linear Regression
[AUTHORS]
Brian Britos, Mathias Bourel
[ABSTRACT]
Optimal Transport (OT) has proven effective for domain adaptation (DA) by
aligning distributions across domains with differing statistical properties.
Building on the approach of Courty et al. (2016), who mapped source data to the
target domain for improved model transfer, we focus on a supervised DA problem
involving linear regression models under rotational shifts. This ongoing work
considers cases where source and target domains are related by a
rotation, common in applications like sensor calibration or image orientation.
We show that in $\mathbb{R}^2$, when using a p-norm cost with $p \ge 2$, the
optimal transport map recovers the underlying rotation. Based on this, we
propose an algorithm that combines K-means clustering, OT, and singular value
decomposition (SVD) to estimate the rotation angle and adapt the regression
model. This method is particularly effective when the target domain is sparsely
sampled, leveraging abundant source data for improved generalization. Our
contributions offer both theoretical and practical insights into OT-based model
adaptation under geometric transformations.
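
As a rough illustration of the SVD step only, the sketch below recovers a 2D rotation from matched point pairs using the standard orthogonal Procrustes (Kabsch) solution. It assumes the source-to-target matching is already given (in the described algorithm this role would presumably be played by K-means and OT) and that the rotation is about the origin; the function name and synthetic data are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def estimate_rotation(source, target):
    """Least-squares rotation R such that target[i] is close to R @ source[i].

    source, target: arrays of shape (n, 2) with rows already matched.
    Returns (R, angle_in_radians).
    """
    H = source.T @ target              # 2x2 cross-covariance of the pairs
    U, _, Vt = np.linalg.svd(H)
    # Force det(R) = +1 so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    angle = np.arctan2(R[1, 0], R[0, 0])
    return R, angle

# Synthetic check (assumption): rotate random points by a known angle.
rng = np.random.default_rng(0)
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
source = rng.normal(size=(50, 2))
target = source @ R_true.T
R_est, angle_est = estimate_rotation(source, target)
```

Given the estimated rotation, a linear regression model fitted on source inputs could be applied to target inputs after rotating them back with `R_est.T`, which is one plausible way to adapt the model.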
[LINK]
http://arxiv.org/abs/2505.09229v1
[DATE]
2025-05-14 17:06:40+08:00
[CATEGORIES]
cs.LG
Learning Traffic Anomalies from Generative Models on Real-Time Observations
[AUTHORS]
Fotis I. Giasemis, Alexandros Sopasakis
[ABSTRACT]
Accurate detection of traffic anomalies is crucial for effective urban
traffic management and congestion mitigation. We use the Spatiotemporal
Generative Adversarial Network (STGAN) framework combining Graph Neural
Networks and Long Short-Term Memory networks to capture complex spatial and
temporal dependencies in traffic data. We apply STGAN to real-time,
minute-by-minute observations from 42 traffic cameras across Gothenburg,
Sweden, collected over several months in 2020. The images are processed to
compute a flow metric representing vehicle density, which serves as input for
the model. Training is conducted on data from April to November 2020, and
validation is performed on a separate dataset from November 14 to 23, 2020. Our
results demonstrate that the model effectively detects traffic anomalies with
high precision and low false positive rates. The detected anomalies include
camera signal interruptions, visual artifacts, and extreme weather conditions
affecting traffic flow.
[LINK]
http://arxiv.org/abs/2502.01391v2
[DATE]
2025-05-14 17:00:33+08:00
[CATEGORIES]
cs.LG
Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods
[AUTHORS]
Alexander Tyurin, Danil Sivtsov
[ABSTRACT]
We propose a new unifying framework, Birch SGD, for analyzing and designing
distributed SGD methods. The central idea is to represent each method as a
weighted directed tree, referred to as a computation tree. Leveraging this
representation, we introduce a general theoretical result that reduces
convergence analysis to studying the geometry of these trees. This perspective
yields a purely graph-based interpretation of optimization dynamics, offering a
new and intuitive foundation for method development. Using Birch SGD, we design
eight new methods and analyze them alongside previously known ones, with at
least six of the new methods shown to have optimal computational time
complexity. Our research leads to two key insights: (i) all methods share the
same “iteration rate” of $O\left(\frac{(R + 1) L \Delta}{\varepsilon} +
\frac{\sigma^2 L \Delta}{\varepsilon^2}\right)$, where $R$ is the maximum “tree
distance” along the main branch of a tree; and (ii) different methods exhibit
different trade-offs: for example, some update iterates more frequently,
improving practical performance, while others are more communication-efficient
or focus on other aspects. Birch SGD serves as a unifying framework for
navigating these trade-offs. We believe these results provide a unified
foundation for understanding, analyzing, and designing efficient asynchronous
and parallel optimization methods.
[LINK]
http://arxiv.org/abs/2505.09218v1
[DATE]
2025-05-14 16:37:45+08:00
[CATEGORIES]
cs.LG
The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks
[AUTHORS]
Zhonghao Lyu, Ming Xiao, Jie Xu, Mikael Skoglund, Marco Di Renzo
[ABSTRACT]
The growing demand for large artificial intelligence model (LAIM) services is
driving a paradigm shift from traditional cloud-based inference to edge-based
inference for low-latency, privacy-preserving applications. In particular,
edge-device co-inference, which partitions LAIMs between edge devices and
servers, has emerged as a promising strategy for resource-efficient LAIM
execution in wireless networks. In this paper, we investigate a pruning-aware
LAIM co-inference scheme, where a pre-trained LAIM is pruned and partitioned
into on-device and on-server sub-models for deployment. For analysis, we first
prove that the LAIM output distortion is upper bounded by its parameter
distortion. Then, we derive a lower bound on parameter distortion via
rate-distortion theory, analytically capturing the relationship between pruning
ratio and co-inference performance. Next, based on the analytical results, we
formulate an LAIM co-inference distortion bound minimization problem by jointly
optimizing the pruning ratio, transmit power, and computation frequency under
system latency, energy, and available resource constraints. Moreover, we
propose an efficient algorithm to tackle the considered highly non-convex
problem. Finally, extensive simulations demonstrate the effectiveness of the
proposed design. In particular, model parameter distortion is shown to provide
a reliable bound on output distortion. Also, the proposed joint pruning ratio
and resource management design achieves superior performance in balancing
trade-offs among inference performance, system latency, and energy consumption
compared with benchmark schemes, such as fully on-device and on-server
inference. Moreover, the split point is shown to play a critical role in system
performance optimization under heterogeneous and resource-limited edge
environments.
[LINK]
http://arxiv.org/abs/2505.09214v1
[DATE]
2025-05-14 16:18:55+08:00
[CATEGORIES]
cs.LG
A physics-informed transformer neural operator for learning generalized solutions of initial boundary value problems
[AUTHORS]
Sumanth Kumar Boya, Deepak Subramani
[ABSTRACT]
Initial boundary value problems arise commonly in applications with
engineering and natural systems governed by nonlinear partial differential
equations (PDEs). Operator learning is an emerging field for solving these
equations by using a neural network to learn a map between infinite dimensional
input and output function spaces. These neural operators are trained using a
combination of data (observations or simulations) and PDE-residuals
(physics-loss). A major drawback of existing neural approaches is the
requirement to retrain with new initial/boundary conditions, and the necessity
for a large amount of simulation data for training. We develop a
physics-informed transformer neural operator (named PINTO) that efficiently
generalizes to unseen initial and boundary conditions, trained in a
simulation-free setting using only physics loss. The main innovation lies in
our new iterative kernel integral operator units, implemented using
cross-attention, to transform the PDE solution’s domain points into an
initial/boundary condition-aware representation vector, enabling efficient
learning of the solution function for new scenarios. The PINTO architecture is
applied to simulate the solutions of important equations used in engineering
applications: advection, Burgers, and steady and unsteady Navier-Stokes
equations (three flow scenarios). For these five test cases, we show that the
relative errors during testing under challenging conditions of unseen
initial/boundary conditions are only one-fifth to one-third of other leading
physics-informed operator learning methods. Moreover, our PINTO model is able
to accurately solve the advection and Burgers equations at time steps that are
not included in the training collocation points. The code is available at
https://github.com/quest-lab-iisc/PINTO
[COMMENTS]
30 pages, 14 figures, 9 tables
[LINK]
http://arxiv.org/abs/2412.09009v4
[DATE]
2025-05-14 16:00:18+08:00
[CATEGORIES]
cs.LG
InvDesFlow-AL: Active Learning-based Workflow for Inverse Design of Functional Materials
[AUTHORS]
Xiao-Qi Han, Peng-Jie Guo, Ze-Feng Gao, Hao Sun, Zhong-Yi Lu
[ABSTRACT]
Developing inverse design methods for functional materials with specific
properties is critical to advancing fields like renewable energy, catalysis,
energy storage, and carbon capture. Generative models based on diffusion
principles can directly produce new materials that meet performance
constraints, thereby significantly accelerating the material design process.
However, existing methods for generating and predicting crystal structures
often remain limited by low success rates. In this work, we propose a novel
inverse material design generative framework called InvDesFlow-AL, which is
based on active learning strategies. This framework can iteratively optimize
the material generation process to gradually guide it towards desired
performance characteristics. In terms of crystal structure prediction, the
InvDesFlow-AL model achieves an RMSE of 0.0423 Å, representing a 32.96%
improvement in performance compared to existing generative models.
Additionally, InvDesFlow-AL has been successfully validated in the design of
low-formation-energy and low-Ehull materials. It can systematically generate
materials with progressively lower formation energies while continuously
expanding the exploration across diverse chemical spaces. These results fully
demonstrate the effectiveness of the proposed active learning-driven generative
model in accelerating material discovery and inverse design. To further
demonstrate the effectiveness of this method, we applied InvDesFlow-AL to the
search for BCS superconductors under ambient pressure. As a result, we
successfully identified Li$_2$AuH$_6$ as a conventional BCS superconductor
with an ultra-high transition temperature of 140 K. This discovery provides
strong empirical support for the application of inverse design in materials
science.
[COMMENTS]
29 pages, 11 figures
[LINK]
http://arxiv.org/abs/2505.09203v1
[DATE]
2025-05-14 15:29:06+08:00
[CATEGORIES]
cs.LG
Least Squares and Marginal Log-Likelihood Model Predictive Control using Normalizing Flows
[AUTHORS]
Eike Cramer
[ABSTRACT]
Real-world (bio)chemical processes often exhibit stochastic dynamics with
non-trivial correlations and state-dependent fluctuations. Model predictive
control (MPC) often must consider these fluctuations to achieve reliable
performance. However, most process models simply add stationary noise terms to
a deterministic prediction. This work proposes using conditional normalizing
flows as discrete-time models to learn stochastic dynamics. Normalizing flows
learn the probability density function (PDF) of the states explicitly, given
prior states and control inputs. In addition to standard least squares (LSQ)
objectives, this work derives a marginal log-likelihood (MLL) objective based
on the explicit PDF and Markov chain simulations. In a reactor study, the
normalizing flow MPC reduces the setpoint error in open and closed-loop cases
to half that of a nominal controller. Furthermore, the chance constraints lead
to fewer constraint violations than the nominal controller. The MLL objective
yields slightly more stable results than the LSQ, particularly for small
scenario sets.
[COMMENTS]
16 pages, 7 Figures, 10 Tables
[LINK]
http://arxiv.org/abs/2409.17632v2
[DATE]
2025-05-14 15:02:59+08:00
[CATEGORIES]
cs.LG
Transfer Learning of CATE with Kernel Ridge Regression
[AUTHORS]
Seok-Jin Kim, Hongjie Liu, Molei Liu, Kaizheng Wang
[ABSTRACT]
The proliferation of data has sparked significant interest in leveraging
findings from one study to estimate treatment effects in a different target
population without direct outcome observations. However, the transfer learning
process is frequently hindered by substantial covariate shift and limited
overlap between (i) the source and target populations, as well as (ii) the
treatment and control groups within the source. We propose a novel method for
overlap-adaptive transfer learning of conditional average treatment effect
(CATE) using kernel ridge regression (KRR). Our approach involves partitioning
the labeled source data into two subsets. The first one is used to train
candidate CATE models based on regression adjustment and pseudo-outcomes. An
optimal model is then selected using the second subset and unlabeled target
data, employing another pseudo-outcome-based strategy. We provide a theoretical
justification for our method through sharp non-asymptotic MSE bounds,
highlighting its adaptivity to both weak overlaps and the complexity of the CATE
function. Extensive numerical studies confirm that our method achieves superior
finite-sample efficiency and adaptability. We conclude by demonstrating the
effectiveness of our approach using a 401(k) eligibility dataset.
[LINK]
http://arxiv.org/abs/2502.11331v3
[DATE]
2025-05-14 14:54:57+08:00
[CATEGORIES]
cs.LG
LAS: Loss-less ANN-SNN Conversion for Fully Spike-Driven Large Language Models
[AUTHORS]
Long Chen, Xiaotian Song, Yanan Sun
[ABSTRACT]
Spiking Large Language Models (LLMs) have emerged as an energy-efficient
alternative to conventional LLMs through their event-driven computation. To
effectively obtain spiking LLMs, researchers develop different ANN-to-SNN
conversion methods by leveraging pre-trained ANN parameters while inheriting
the energy efficiency of SNN. However, existing conversion methods struggle
with extreme activation outliers and incompatible nonlinear operations of
ANN-based LLMs. To address this, we propose a loss-less ANN-SNN conversion for
fully spike-driven LLMs, termed LAS. Specifically, LAS introduces two novel
neurons to convert the activation outlier and nonlinear operation of ANN-based
LLMs. Moreover, LAS tailors the spike-equivalent Transformer components for
spiking LLMs, which can ensure full spiking conversion without any loss of
performance. Experimental results on six language models and two
vision-language models demonstrate that LAS achieves loss-less conversion.
Notably, on OPT-66B, LAS even improves accuracy by 2% on the WSC task. In
addition, the parameter and ablation studies further verify the effectiveness
of LAS. The source code is available at https://github.com/lc783/LAS
[LINK]
http://arxiv.org/abs/2505.09659v1
[DATE]
2025-05-14 14:18:08+08:00
[CATEGORIES]
cs.LG
cs.CL
Optimizing Urban Critical Green Space Development Using Machine Learning
[AUTHORS]
Mohammad Ganjirad, Mahmoud Reza Delavar, Hossein Bagheri, Mohammad Mehdi Azizi
[ABSTRACT]
This paper presents a novel framework for prioritizing urban green space
development in Tehran using diverse socio-economic, environmental, and
sensitivity indices. The indices were derived from various sources including
Google Earth Engine, air pollution measurements, municipal reports and the
Weather Research & Forecasting (WRF) model. The WRF model was used to estimate
the air temperature at a 1 km resolution due to insufficient meteorological
stations, yielding RMSE and MAE values of 0.96°C and 0.92°C,
respectively. After data preparation, several machine learning models were used
for binary vegetation cover classification including XGBoost, LightGBM, Random
Forest (RF) and Extra Trees. RF achieved the highest performance, exceeding 94%
in Overall Accuracy, Recall, and F1-score. Then, the probability of areas
lacking vegetation cover was assessed using socio-economic, environmental and
sensitivity indices. This resulted in the RF generating an urban green space
development prioritization map. Feature Importance Analysis revealed that the
most significant indices were nightly land surface temperature (LST) and
sensitive population. Finally, the framework performance was validated through
microclimate simulation to assess the critical areas before and after green
space development with green roofs. The simulation showed an air temperature
reduction of up to 0.67°C after applying green roof technology in
critical areas. As a result, this framework provides a valuable tool for urban
planners to develop green spaces.
[LINK]
http://arxiv.org/abs/2505.09175v1
[DATE]
2025-05-14 14:13:23+08:00
[CATEGORIES]
cs.LG
Quotient Complex Transformer (QCformer) for Perovskite Data Analysis
[AUTHORS]
Xinyu You, Xiang Liu, Chuan-Shen Hu, Kelin Xia, Tze Chien Sum
[ABSTRACT]
The discovery of novel functional materials is crucial in addressing the
challenges of sustainable energy generation and climate change. Hybrid
organic-inorganic perovskites (HOIPs) have gained attention for their
exceptional optoelectronic properties in photovoltaics. Recently, geometric
deep learning, particularly graph neural networks (GNNs), has shown strong
potential in predicting material properties and guiding material design.
However, traditional GNNs often struggle to capture the periodic structures and
higher-order interactions prevalent in such systems. To address these
limitations, we propose a novel representation based on quotient complexes
(QCs) and introduce the Quotient Complex Transformer (QCformer) for material
property prediction. A material structure is modeled as a quotient complex,
which encodes both pairwise and many-body interactions via simplices of varying
dimensions and captures material periodicity through a quotient operation. Our
model leverages higher-order features defined on simplices and processes them
using a simplex-based Transformer module. We pretrain QCformer on benchmark
datasets such as the Materials Project and JARVIS, and fine-tune it on HOIP
datasets. The results show that QCformer outperforms state-of-the-art models in
HOIP property prediction, demonstrating its effectiveness. The quotient complex
representation and QCformer model together contribute a powerful new tool for
predictive modeling of perovskite materials.
[LINK]
http://arxiv.org/abs/2505.09174v1
[DATE]
2025-05-14 14:13:14+08:00
[CATEGORIES]
cs.LG
Fast Text-to-Audio Generation with Adversarial Post-Training
[AUTHORS]
Zachary Novack, Zach Evans, Zack Zukowski, Josiah Taylor, CJ Carr, Julian Parker, Adnan Al-Sinan, Gian Marco Iodice, Julian McAuley, Taylor Berg-Kirkpatrick, Jordi Pons
[ABSTRACT]
Text-to-audio systems, while increasingly performant, are slow at inference
time, thus making their latency impractical for many creative applications. We
present Adversarial Relativistic-Contrastive (ARC) post-training, the first
adversarial acceleration algorithm for diffusion/flow models not based on
distillation. While past adversarial post-training methods have struggled to
compare against their expensive distillation counterparts, ARC post-training is
a simple procedure that (1) extends a recent relativistic adversarial
formulation to diffusion/flow post-training and (2) combines it with a novel
contrastive discriminator objective to encourage better prompt adherence. We
pair ARC post-training with a number of optimizations to Stable Audio Open and
build a model capable of generating $\approx$12s of 44.1kHz stereo audio in
$\approx$75ms on an H100, and $\approx$7s on a mobile edge-device, the fastest
text-to-audio model to our knowledge.
[LINK]
http://arxiv.org/abs/2505.08175v2
[DATE]
2025-05-14 14:07:26+08:00
[CATEGORIES]
cs.LG
Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization
[AUTHORS]
Xinhao Yao, Hongjin Qian, Xiaolin Hu, Gengze Xu, Wei Liu, Jian Luan, Bin Wang, Yong Liu
[ABSTRACT]
Large Language Models (LLMs), built on Transformer architectures, exhibit
remarkable generalization across a wide range of tasks. However, fine-tuning
these models for specific tasks remains resource-intensive due to their
extensive parameterization. In this paper, we explore two remarkable phenomena
related to the attention mechanism during the fine-tuning of LLMs (where
$\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$ denote the weights of the
query, key, and value layers, respectively). The first phenomenon, termed
“Unequal Importance of Attention Matrices”, highlights the impact of
fine-tuning different weight matrices. It shows that optimizing the
$\mathbf{W}_v$ matrix yields significantly better performance than optimizing
the $\mathbf{W}_k$ matrix. Fine-tuning only the $\mathbf{W}_q$ and
$\mathbf{W}_v$ matrices is computationally efficient while delivering results
comparable to, or even better than, fine-tuning all three matrices
($\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$). The second
phenomenon, “Attention Matrices with Customized Learning Rate Lead to Better
Convergence”, emphasizes the importance of assigning distinct learning rates to
these matrices. Specifically, a higher learning rate for the $\mathbf{W}_v$
matrix compared to $\mathbf{W}_q$ and $\mathbf{W}_k$ accelerates convergence
and improves performance. Building on these insights, we propose a new strategy
that improves fine-tuning efficiency in terms of both storage and time.
Experimental results on benchmark datasets validate the effectiveness of this
approach, supporting our theoretical findings. Our analysis lays the
theoretical groundwork for configuring and improving algorithms in LLM
fine-tuning.
[COMMENTS]
IJCAI 2025
[LINK]
http://arxiv.org/abs/2410.02247v3
[DATE]
2025-05-14 14:06:20+08:00
[CATEGORIES]
cs.LG
Online Learning of Neural Networks
[AUTHORS]
Amit Daniely, Idan Mehalel, Elchanan Mossel
[ABSTRACT]
We study online learning of feedforward neural networks with the sign
activation function that implement functions from the unit ball in
$\mathbb{R}^d$ to a finite label set $\{1, \ldots, Y\}$.
First, we characterize a margin condition that is sufficient and in some
cases necessary for online learnability of a neural network: Every neuron in
the first hidden layer classifies all instances with some margin $\gamma$
bounded away from zero. Quantitatively, we prove that for any net, the optimal
mistake bound is at most approximately $\mathtt{TS}(d,\gamma)$, which is the
$(d,\gamma)$-totally-separable-packing number, a more restricted variation of
the standard $(d,\gamma)$-packing number. We complement this result by
constructing a net on which any learner makes $\mathtt{TS}(d,\gamma)$ many
mistakes. We also give a quantitative lower bound of approximately
$\mathtt{TS}(d,\gamma) \geq \max\{1/(\gamma \sqrt{d})^d, d\}$ when $\gamma \geq
1/2$, implying that for some nets and input sequences every learner will err
for $\exp(d)$ many times, and that a dimension-free mistake bound is almost
always impossible.
To remedy this inevitable dependence on $d$, it is natural to seek additional
natural restrictions to be placed on the network, so that the dependence on $d$
is removed. We study two such restrictions. The first is the multi-index model,
in which the function computed by the net depends only on $k \ll d$ orthonormal
directions. We prove a mistake bound of approximately $(1.5/\gamma)^{k + 2}$ in
this model. The second is the extended margin assumption. In this setting, we
assume that all neurons (in all layers) in the network classify every incoming
input from the previous layer with margin $\gamma$ bounded away from zero. In this
model, we prove a mistake bound of approximately $(\log Y)/ \gamma^{O(L)}$,
where $L$ is the depth of the network.
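As a rough numerical illustration of the multi-index bound quoted above, the sketch below evaluates $(1.5/\gamma)^{k+2}$ for given values. The function name is hypothetical, and the abstract states the bound only "approximately", so any constants or lower-order terms are ignored here.

```python
def multi_index_mistake_bound(gamma: float, k: int) -> float:
    """Evaluate (1.5 / gamma) ** (k + 2), the approximate bound from the abstract.

    Hypothetical helper for illustration only: constants and lower-order
    terms hidden by the word "approximately" are not modeled.
    """
    if gamma <= 0:
        raise ValueError("margin gamma must be positive")
    if k < 0:
        raise ValueError("number of directions k must be non-negative")
    return (1.5 / gamma) ** (k + 2)


# The value depends only on gamma and k, not on the ambient dimension d.
bound = multi_index_mistake_bound(gamma=0.5, k=2)  # (1.5 / 0.5) ** 4
```

The point of the sketch is only that the ambient dimension $d$ does not appear among the arguments.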
[LINK]
http://arxiv.org/abs/2505.09167v1
[DATE]
2025-05-14 14:03:07+08:00
[CATEGORIES]
cs.LG
Bridging Theory and Experiment in Materials Discovery: Machine-Learning-Assisted Prediction of Synthesizable Structures
[AUTHORS]
Yu Xin, Peng Liu, Zhuohang Xie, Wenhui Mi, Pengyue Gao, Hong Jian Zhao, Jian Lv, Yanchao Wang, Yanming Ma
[ABSTRACT]
Even though thermodynamic energy-based crystal structure prediction (CSP) has
revolutionized materials discovery, the energy-driven CSP approaches often
struggle to identify experimentally realizable metastable materials synthesized
through kinetically controlled pathways, creating a critical gap between
theoretical predictions and experimental synthesis. Here, we propose a
synthesizability-driven CSP framework that integrates symmetry-guided structure
derivation with a Wyckoff encode-based machine-learning model, allowing for the
efficient localization of subspaces likely to yield highly synthesizable
structures. Within the identified promising subspaces, a structure-based
synthesizability evaluation model, fine-tuned using recently synthesized
structures to enhance predictive accuracy, is employed in conjunction with ab
initio calculations to systematically identify synthesizable candidates. The
framework successfully reproduces 13 experimentally known XSe (X = Sc, Ti, Mn,
Fe, Ni, Cu, Zn) structures, demonstrating its effectiveness in predicting
synthesizable structures. Notably, 92,310 structures are filtered from the
554,054 candidates predicted by GNoME, exhibiting great potential for promising
synthesizability. Additionally, eight thermodynamically favorable Hf-X-O (X =
Ti, V, and Mn) structures have been identified, among which three HfV$_2$O$_7$
candidates exhibit high synthesizability, presenting viable candidates for
experimental realization and potentially associated with experimentally
observed temperature-induced phase transitions. This work establishes a
data-driven paradigm for machine-learning-assisted inorganic materials
synthesis, highlighting its potential to bridge the gap between computational
predictions and experimental realization while unlocking new opportunities for
the targeted discovery of novel functional materials.
[LINK]
http://arxiv.org/abs/2505.09161v1
[DATE]
2025-05-14 13:48:55+08:00
[CATEGORIES]
cs.LG
A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning
[AUTHORS]
Berkay Guler, Giovanni Geraci, Hamid Jafarkhani
[ABSTRACT]
Current applications of self-supervised learning to wireless channel
representation often borrow paradigms developed for text and image processing,
without fully addressing the unique characteristics and constraints of wireless
communications. Aiming to fill this gap, we first propose WiMAE (Wireless
Masked Autoencoder), a transformer-based encoder-decoder foundation model
pretrained on a realistic open-source multi-antenna wireless channel dataset.
Building upon this foundation, we develop ContraWiMAE, which enhances WiMAE by
incorporating a contrastive learning objective alongside the reconstruction
task in a unified multi-task framework. By warm-starting from pretrained WiMAE
weights and generating positive pairs via noise injection, the contrastive
component enables the model to capture both structural and discriminative
features, enhancing representation quality beyond what reconstruction alone can
achieve. Through extensive evaluation on unseen scenarios, we demonstrate the
effectiveness of both approaches across multiple downstream tasks, with
ContraWiMAE showing further improvements in linear separability and
adaptability in diverse wireless environments. Comparative evaluations against
a state-of-the-art wireless channel foundation model confirm the superior
performance and data efficiency of our models, highlighting their potential as
powerful baselines for future research in self-supervised wireless channel
representation learning.
[LINK]
http://arxiv.org/abs/2505.09160v1
[DATE]
2025-05-14 13:45:22+08:00
[CATEGORIES]
cs.LG
Morphological-Symmetry-Equivariant Heterogeneous Graph Neural Network for Robotic Dynamics Learning
[AUTHORS]
Fengze Xie, Sizhe Wei, Yue Song, Yisong Yue, Lu Gan
[ABSTRACT]
We present a morphological-symmetry-equivariant heterogeneous graph neural
network, namely MS-HGNN, for robotic dynamics learning, that integrates robotic
kinematic structures and morphological symmetries into a single graph network.
These structural priors are embedded into the learning architecture as
constraints, ensuring high generalizability as well as sample and model efficiency. The
proposed MS-HGNN is a versatile and general architecture that is applicable to
various multi-body dynamic systems and a wide range of dynamics learning
problems. We formally prove the morphological-symmetry-equivariant property of
our MS-HGNN and validate its effectiveness across multiple quadruped robot
learning problems using both real-world and simulated data. Our code is made
publicly available at https://github.com/lunarlab-gatech/MorphSym-HGNN/.
[LINK]
http://arxiv.org/abs/2412.01297v2
[DATE]
2025-05-14 12:48:21+08:00
[CATEGORIES]
cs.LG
Scaling Gaussian Process Regression with Full Derivative Observations
[AUTHORS]
Daniel Huang
[ABSTRACT]
We present DSoftKI, a scalable Gaussian Process (GP) method that can fit and
predict full derivative observations. It extends SoftKI, a method that
approximates a kernel via softmax interpolation from learned interpolation
point locations, to the setting with derivatives. DSoftKI enhances SoftKI’s
interpolation scheme to incorporate the directional orientation of
interpolation points relative to the data. This enables the construction of a
scalable approximate kernel, including its first and second-order derivatives,
through interpolation. We evaluate DSoftKI on a synthetic function benchmark
and high-dimensional molecular force field prediction (100-1000 dimensions),
demonstrating that DSoftKI is accurate and can scale to larger datasets with
full derivative observations than previously possible.
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2505.09134v1
[DATE]
2025-05-14 12:35:26+08:00
[CATEGORIES]
cs.LG
Fair Clustering via Alignment
[AUTHORS]
Kunwoong Kim, Jihu Lee, Sangchul Park, Yongdai Kim
[ABSTRACT]
Algorithmic fairness in clustering aims to balance the proportions of
instances assigned to each cluster with respect to a given sensitive attribute.
While recently developed fair clustering algorithms optimize clustering
objectives under specific fairness constraints, their inherent complexity or
approximation often results in suboptimal clustering utility or numerical
instability in practice. To resolve these limitations, we propose a new fair
clustering algorithm based on a novel decomposition of the fair K-means
clustering objective function. The proposed algorithm, called Fair Clustering
via Alignment (FCA), operates by alternately (i) finding a joint probability
distribution to align the data from different protected groups, and (ii)
optimizing cluster centers in the aligned space. A key advantage of FCA is that
it theoretically guarantees approximately optimal clustering utility for any
given fairness level without complex constraints, thereby enabling high-utility
fair clustering in practice. Experiments show that FCA outperforms existing
methods by (i) attaining a superior trade-off between fairness level and
clustering utility, and (ii) achieving near-perfect fairness without numerical
instability.
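The fairness notion in the first sentence of this abstract (balanced proportions of a sensitive attribute within each cluster) can be made concrete with a small sketch. This is not the FCA algorithm. It computes a commonly used balance score, and the function name and the choice of the min/max count ratio are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def balance(assignments, groups):
    """Worst-case ratio of smallest to largest group count over all clusters.

    assignments[i] is the cluster of instance i; groups[i] is its sensitive
    attribute value. Returns 1.0 when every cluster holds equal counts of
    each group, and 0.0 when some cluster is missing a group entirely.
    """
    all_groups = sorted(set(groups))
    counts = defaultdict(Counter)
    for cluster, group in zip(assignments, groups):
        counts[cluster][group] += 1

    worst = 1.0
    for per_cluster in counts.values():
        # Counter returns 0 for groups absent from this cluster.
        sizes = [per_cluster[g] for g in all_groups]
        worst = min(worst, min(sizes) / max(sizes))
    return worst


mixed = balance([0, 0, 0, 1, 1, 1], ["a", "a", "b", "a", "b", "b"])
segregated = balance([0, 0, 1, 1], ["a", "a", "b", "b"])
```

A fair clustering method would aim to keep this score high while also keeping the clustering objective low.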
[COMMENTS]
Accepted at ICML 2025. This is the version submitted for review and
will be replaced by the camera-ready version soon
[LINK]
http://arxiv.org/abs/2505.09131v1
[DATE]
2025-05-14 12:29:09+08:00
[CATEGORIES]
cs.LG
BridgePure: Limited Protection Leakage Can Break Black-Box Data Protection
[AUTHORS]
Yihan Wang, Yiwei Lu, Xiao-Shan Gao, Gautam Kamath, Yaoliang Yu
[ABSTRACT]
Availability attacks, or unlearnable examples, are defensive techniques that
allow data owners to modify their datasets in ways that prevent unauthorized
machine learning models from learning effectively while maintaining the data’s
intended functionality. This has led to the release of popular black-box tools
(e.g., APIs) for users to upload personal data and receive protected
counterparts. In this work, we show that such black-box protections can be
substantially compromised if a small set of unprotected in-distribution data is
available. Specifically, we propose a novel threat model of protection leakage,
where an adversary can (1) easily acquire (unprotected, protected) pairs by
querying the black-box protections with a small unprotected dataset; and (2)
train a diffusion bridge model to build a mapping between unprotected and
protected data. This mapping, termed BridgePure, can effectively remove the
protection from any previously unseen data within the same distribution.
BridgePure demonstrates superior purification performance on classification and
style mimicry tasks, exposing critical vulnerabilities in black-box data
protection. We suggest that practitioners implement multi-level countermeasures
to mitigate such risks.
[COMMENTS]
29 pages, 18 figures
[LINK]
http://arxiv.org/abs/2412.21061v2
[DATE]
2025-05-14 12:17:54+08:00
[CATEGORIES]
cs.LG
Beyond the Known: Decision Making with Counterfactual Reasoning Decision Transformer
[AUTHORS]
Minh Hoang Nguyen, Linh Le Pham Van, Thommen George Karimpanal, Sunil Gupta, Hung Le
[ABSTRACT]
Decision Transformers (DT) play a crucial role in modern reinforcement
learning, leveraging offline datasets to achieve impressive results across
various domains. However, DT requires high-quality, comprehensive data to
perform optimally. In real-world applications, the lack of training data and
the scarcity of optimal behaviours make training on offline datasets
challenging, as suboptimal data can hinder performance. To address this, we
propose the Counterfactual Reasoning Decision Transformer (CRDT), a novel
framework inspired by counterfactual reasoning. CRDT enhances DT’s ability to
reason beyond known data by generating and utilizing counterfactual
experiences, enabling improved decision-making in unseen scenarios. Experiments
across Atari and D4RL benchmarks, including scenarios with limited data and
altered dynamics, demonstrate that CRDT outperforms conventional DT approaches.
Additionally, reasoning counterfactually allows the DT agent to obtain
stitching abilities, combining suboptimal trajectories, without architectural
modifications. These results highlight the potential of counterfactual
reasoning to enhance reinforcement learning agents’ performance and
generalization capabilities.
[LINK]
http://arxiv.org/abs/2505.09114v1
[DATE]
2025-05-14 11:45:16+08:00
[CATEGORIES]
cs.LG
Sequential Treatment Effect Estimation with Unmeasured Confounders
[AUTHORS]
Yingrong Wang, Anpeng Wu, Baohong Li, Ziyang Xiao, Ruoxuan Xiong, Qing Han, Kun Kuang
[ABSTRACT]
This paper studies the cumulative causal effects of sequential treatments in
the presence of unmeasured confounders. It is a critical issue in sequential
decision-making scenarios where treatment decisions and outcomes dynamically
evolve over time. Advanced causal methods use a transformer as a backbone to
model such time sequences, which excels at capturing long-term
dependence and periodic patterns via the attention mechanism. However, even though they
control for the observed confounding, these estimators still suffer from unmeasured
confounders, which influence both treatment assignments and outcomes. How to
adjust the latent confounding bias in sequential treatment effect estimation
remains an open challenge. Therefore, we propose a novel Decomposing Sequential
Instrumental Variable framework for CounterFactual Regression (DSIV-CFR),
relying on a common negative control assumption. Specifically, an instrumental
variable (IV) is a special negative control exposure, while the previous
outcome serves as a negative control outcome. This allows us to recover the IVs
latent in observation variables and estimate sequential treatment effects via a
generalized moment condition. We conducted experiments on 4 datasets and
achieved strong performance in one- and multi-step prediction, which
supports identifying optimal treatments for dynamic systems.
[LINK]
http://arxiv.org/abs/2505.09113v1
[DATE]
2025-05-14 11:42:43+08:00
[CATEGORIES]
cs.LG
Diffusion Factor Models: Generating High-Dimensional Returns with Factor Structure
[AUTHORS]
Minshuo Chen, Renyuan Xu, Yumin Xu, Ruixun Zhang
[ABSTRACT]
Financial scenario simulation is essential for risk management and portfolio
optimization, yet it remains challenging, especially in high-dimensional and
small data settings common in finance. We propose a diffusion factor model that
integrates latent factor structure into generative diffusion processes,
bridging econometrics with modern generative AI to address the challenges of
the curse of dimensionality and data scarcity in financial simulation. By
exploiting the low-dimensional factor structure inherent in asset returns, we
decompose the score function, a key component in diffusion models, using
time-varying orthogonal projections, and this decomposition is incorporated
into the design of neural network architectures. We derive rigorous statistical
guarantees, establishing nonasymptotic error bounds for both score estimation
at $O(d^{5/2} n^{-2/(k+5)})$ and generated distribution at $O(d^{5/4}
n^{-1/2(k+5)})$, primarily driven by the intrinsic factor dimension $k$ rather
than the number of assets $d$, surpassing the dimension-dependent limits in the
classical nonparametric statistics literature and making the framework viable
for markets with thousands of assets. Numerical studies confirm superior
performance in latent subspace recovery under small data regimes. Empirical
analysis demonstrates the economic significance of our framework in
constructing mean-variance optimal portfolios and factor portfolios. This work
presents the first theoretical integration of factor structure with diffusion
models, offering a principled approach for high-dimensional financial
simulation with limited data. Our code is available at
https://github.com/xymmmm00/diffusion_factor_model.
[LINK]
http://arxiv.org/abs/2504.06566v2
[DATE]
2025-05-14 11:29:54+08:00
[CATEGORIES]
cs.LG
Statistical Mean Estimation with Coded Relayed Observations
[AUTHORS]
Yan Hao Ling, Zhouhao Yang, Jonathan Scarlett
[ABSTRACT]
We consider a problem of statistical mean estimation in which the samples are
not observed directly, but are instead observed by a relay (“teacher”) that
transmits information through a memoryless channel to the decoder
(“student”), who then produces the final estimate. We consider the minimax
estimation error in the large deviations regime, and establish achievable error
exponents that are tight in broad regimes of the estimation accuracy and
channel quality. In contrast, two natural baseline methods are shown to yield
strictly suboptimal error exponents. We initially focus on Bernoulli sources
and binary symmetric channels, and then generalize to sub-Gaussian and
heavy-tailed settings along with arbitrary discrete memoryless channels.
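To make the Bernoulli source and binary symmetric channel (BSC) setting concrete, the sketch below shows how a crossover probability $p$ shifts the mean of forwarded bits and how that shift can be inverted. This is a naive illustration under the assumption that raw bits are forwarded one by one. It is not claimed to be the paper's scheme or either of its baselines, and the function names are hypothetical.

```python
def bsc_observed_mean(mu: float, p: float) -> float:
    """Mean of a Bernoulli(mu) bit after a BSC that flips it with probability p."""
    return mu * (1 - p) + (1 - mu) * p


def debias_bsc_mean(q: float, p: float) -> float:
    """Invert q = mu * (1 - 2p) + p to recover mu from the observed mean q."""
    if p == 0.5:
        raise ValueError("a BSC with p = 0.5 carries no information")
    return (q - p) / (1 - 2 * p)


q = bsc_observed_mean(mu=0.3, p=0.1)  # channel output mean
mu_hat = debias_bsc_mean(q, p=0.1)    # recovers the source mean
```

The division by $1 - 2p$ amplifies estimation noise as the channel degrades, which hints at why forwarding raw samples can be a poor use of the channel.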
[LINK]
http://arxiv.org/abs/2505.09098v1
[DATE]
2025-05-14 11:07:05+08:00
[CATEGORIES]
cs.LG
DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis
[AUTHORS]
Zeeshan Ahmad, Shudi Bao, Meng Chen
[ABSTRACT]
In recent years, generative adversarial networks (GANs) have made significant
progress in generating audio sequences. However, these models typically rely on
bandwidth-limited mel-spectrograms, which constrain the resolution of generated
audio sequences, and lead to mode collapse during conditional generation. To
address this issue, we propose Deformable Periodic Network based GAN (DPN-GAN),
a novel GAN architecture that incorporates a kernel-based periodic ReLU
activation function to induce periodic bias in audio generation. This
innovative approach enhances the model’s ability to capture and reproduce
intricate audio patterns. In particular, our proposed model features a DPN
module for multi-resolution generation utilizing deformable convolution
operations, allowing for adaptive receptive fields that improve the quality and
fidelity of the synthetic audio. Additionally, we enhance the discriminator
network using deformable convolution to better distinguish between real and
generated samples, further refining the audio quality. We trained two versions
of the model: DPN-GAN small (38.67M parameters) and DPN-GAN large (124M
parameters). For evaluation, we use five different datasets, covering both
speech synthesis and music generation tasks, to demonstrate the efficiency of
the DPN-GAN. The experimental results demonstrate that DPN-GAN delivers
superior performance on both out-of-distribution and noisy data, showcasing its
robustness and adaptability. Trained across various datasets, DPN-GAN
outperforms state-of-the-art GAN architectures on standard evaluation metrics,
and exhibits increased robustness in synthesized audio.
[LINK]
http://arxiv.org/abs/2505.09091v1
[DATE]
2025-05-14 10:52:16+08:00
[CATEGORIES]
cs.LG
A Comparative Review of RNA Language Models
[AUTHORS]
He Wang, Yikun Zhang, Jie Chen, Jian Zhan, Yaoqi Zhou
[ABSTRACT]
Given the usefulness of protein language models (LMs) in structure and functional
inference, RNA LMs have received increased attention in the last few years.
However, these RNA models are often not compared against the same standard.
Here, we divided RNA LMs into three classes: LMs pretrained on multiple RNA types
(especially noncoding RNAs), LMs pretrained on specific-purpose RNAs, and LMs that
unify RNA with DNA, proteins, or both. We compared 13 RNA LMs, along with 3 DNA
LMs and 1 protein LM as controls, in zero-shot prediction of RNA secondary
structure and functional classification. The results show that the models doing well on
secondary structure prediction often perform worse in function classification
or vice versa, suggesting that more balanced unsupervised training is needed.
[LINK]
http://arxiv.org/abs/2505.09087v1
[DATE]
2025-05-14 10:40:13+08:00
[CATEGORIES]
cs.LG
Human-like Cognitive Generalization for Large Models via Brain-in-the-loop Supervision
[AUTHORS]
Jiaxuan Chen, Yu Qi, Yueming Wang, Gang Pan
[ABSTRACT]
Recent advancements in deep neural networks (DNNs), particularly large-scale
language models, have demonstrated remarkable capabilities in image and natural
language understanding. Although scaling up model parameters with increasing
volume of training data has progressively improved DNN capabilities, achieving
complex cognitive abilities (such as understanding abstract concepts,
reasoning, and adapting to novel scenarios, which are intrinsic to human
cognition) remains a major challenge. In this study, we show that
brain-in-the-loop supervised learning, utilizing a small set of brain signals,
can effectively transfer human conceptual structures to DNNs, significantly
enhancing their comprehension of abstract and even unseen concepts.
Experimental results further indicate that the enhanced cognitive capabilities
lead to substantial performance gains in challenging tasks, including
few-shot/zero-shot learning and out-of-distribution recognition, while also
yielding highly interpretable concept representations. These findings highlight
that human-in-the-loop supervision can effectively augment the complex
cognitive abilities of large models, offering a promising pathway toward
developing more human-like cognitive abilities in artificial systems.
[LINK]
http://arxiv.org/abs/2505.09085v1
[DATE]
2025-05-14 10:39:10+08:00
[CATEGORIES]
cs.LG
Combinatorial Logistic Bandits
[AUTHORS]
Xutong Liu, Xiangxiang Dai, Xuchuang Wang, Mohammad Hajiesmaili, John C. S. Lui
[ABSTRACT]
We introduce a novel framework called combinatorial logistic bandits (CLogB),
where in each round, a subset of base arms (called the super arm) is selected,
with the outcome of each base arm being binary and its expectation following a
logistic parametric model. The feedback is governed by a general arm triggering
process. Our study covers CLogB with reward functions satisfying two smoothness
conditions, capturing application scenarios such as online content delivery,
online learning to rank, and dynamic channel allocation. We first propose a
simple yet efficient algorithm, CLogUCB, utilizing a variance-agnostic
exploration bonus. Under the 1-norm triggering probability modulated (TPM)
smoothness condition, CLogUCB achieves a regret bound of
$\tilde{O}(d\sqrt{\kappa KT})$, where $\tilde{O}$ ignores logarithmic factors,
$d$ is the dimension of the feature vector, $\kappa$ represents the
nonlinearity of the logistic model, and $K$ is the maximum number of base arms
a super arm can trigger. This result improves on prior work by a factor of
$\tilde{O}(\sqrt{\kappa})$. We then enhance CLogUCB with a variance-adaptive
version, VA-CLogUCB, which attains a regret bound of $\tilde{O}(d\sqrt{KT})$
under the same 1-norm TPM condition, improving by another
$\tilde{O}(\sqrt{\kappa})$ factor. VA-CLogUCB shows even greater promise under
the stronger triggering probability and variance modulated (TPVM) condition,
achieving a leading $\tilde{O}(d\sqrt{T})$ regret, thus removing the additional
dependency on the action-size $K$. Furthermore, we enhance the computational
efficiency of VA-CLogUCB by eliminating the nonconvex optimization process when
the context feature map is time-invariant while maintaining the tight
$\tilde{O}(d\sqrt{T})$ regret. Finally, experiments on synthetic and real-world
datasets demonstrate the superior performance of our algorithms compared to
benchmark algorithms.
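The outcome model described at the start of this abstract (binary base-arm outcomes whose expectation follows a logistic parametric model) can be sketched as follows. The parameter vector, feature vectors, and function names are hypothetical placeholders. The sketch does not implement CLogUCB, its exploration bonus, or the arm triggering process.

```python
import math


def expected_outcome(theta, x):
    """Logistic model: expected binary outcome sigmoid(<theta, x>) of one base arm."""
    z = sum(t * f for t, f in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def super_arm_expectations(theta, features, super_arm):
    """Expected outcome of every base arm in a chosen subset (the super arm)."""
    return {arm: expected_outcome(theta, features[arm]) for arm in super_arm}


# Hypothetical 2-dimensional features for three base arms.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 0.0]}
theta = [2.0, -2.0]
means = super_arm_expectations(theta, features, super_arm=[0, 1, 2])
```

A learner in this setting does not know `theta` and must estimate it from the binary feedback of the arms that are actually triggered.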
[COMMENTS]
Accepted in ACM SIGMETRICS 2025
[LINK]
http://arxiv.org/abs/2410.17075v3
[DATE]
2025-05-14 10:28:36+08:00
[CATEGORIES]
cs.LG
AdaFortiTran: An Adaptive Transformer Model for Robust OFDM Channel Estimation
[AUTHORS]
Berkay Guler, Hamid Jafarkhani
[ABSTRACT]
Deep learning models for channel estimation in Orthogonal Frequency Division
Multiplexing (OFDM) systems often suffer from performance degradation under
fast-fading channels and low-SNR scenarios. To address these limitations, we
introduce the Adaptive Fortified Transformer (AdaFortiTran), a novel model
specifically designed to enhance channel estimation in challenging
environments. Our approach employs convolutional layers that exploit locality
bias to capture strong correlations between neighboring channel elements,
combined with a transformer encoder that applies the global attention mechanism
to channel patches. This approach effectively models both long-range
dependencies and spectro-temporal interactions within single OFDM frames. We
further augment the model’s adaptability by integrating nonlinear
representations of available channel statistics (SNR, delay spread, and Doppler
shift) as priors. A residual connection is employed to merge global features
from the transformer with local features from early convolutional processing,
followed by final convolutional layers to refine the hierarchical channel
representation. Despite its compact architecture, AdaFortiTran achieves up to 6
dB reduction in mean squared error (MSE) compared to state-of-the-art models.
Tested across a wide range of Doppler shifts (200-1000 Hz), SNRs (0 to 25 dB),
and delay spreads (50-300 ns), it demonstrates superior robustness in
high-mobility environments.
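For readers less used to decibel figures, the sketch below converts a dB reduction in MSE into a linear ratio using the standard power-ratio relation $10^{-\mathrm{dB}/10}$. The function name is hypothetical, and the calculation only restates the reported "up to 6 dB" figure in linear terms.

```python
def db_reduction_to_ratio(db: float) -> float:
    """Linear ratio new_mse / old_mse that corresponds to a reduction of `db` decibels."""
    return 10 ** (-db / 10)


ratio = db_reduction_to_ratio(6.0)  # about 0.25
```

Under this relation, a 6 dB reduction means the MSE falls to roughly one quarter of the reference value.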
[LINK]
http://arxiv.org/abs/2505.09076v1
[DATE]
2025-05-14 10:22:37+08:00
[CATEGORIES]
cs.LG
Single-shot prediction of parametric partial differential equations
[AUTHORS]
Khalid Rafiq, Wenjing Liao, Aditya G. Nair
[ABSTRACT]
We introduce Flexi-VAE, a data-driven framework for efficient single-shot
forecasting of nonlinear parametric partial differential equations (PDEs),
eliminating the need for iterative time-stepping while maintaining high
accuracy and stability. Flexi-VAE incorporates a neural propagator that
advances latent representations forward in time, aligning latent evolution with
physical state reconstruction in a variational autoencoder setting. We evaluate
two propagation strategies, the Direct Concatenation Propagator (DCP) and the
Positional Encoding Propagator (PEP), and demonstrate, through
representation-theoretic analysis, that DCP offers superior long-term
generalization by fostering disentangled and physically meaningful latent
spaces. Geometric diagnostics, including Jacobian spectral analysis, reveal
that propagated latent states reside in regions of lower decoder sensitivity
and more stable local geometry than those derived via direct encoding,
enhancing robustness for long-horizon predictions. We validate Flexi-VAE on
canonical PDE benchmarks, the 1D viscous Burgers equation and the 2D
advection-diffusion equation, achieving accurate forecasts across wide
parametric ranges. The model delivers over 50x CPU and 90x GPU speedups
compared to autoencoder-LSTM baselines for large temporal shifts. These results
position Flexi-VAE as a scalable and interpretable surrogate modeling tool for
accelerating high-fidelity simulations in computational fluid dynamics (CFD)
and other parametric PDE-driven applications, with extensibility to
higher-dimensional and more complex systems.
[COMMENTS]
35 pages, 17 figures
[LINK]
http://arxiv.org/abs/2505.09063v1
[DATE]
2025-05-14 09:48:26+08:00
[CATEGORIES]
cs.LG
Variational Prefix Tuning for Diverse and Accurate Code Summarization Using Pre-trained Language Models
[AUTHORS]
Junda Zhao, Yuliang Song, Eldan Cohen
[ABSTRACT]
Recent advancements in source code summarization have leveraged
transformer-based pre-trained models, including Large Language Models of Code
(LLMCs), to automate and improve the generation of code summaries. However,
existing methods often focus on generating a single high-quality summary for a
given source code, neglecting scenarios where the generated summary might be
inadequate and alternative options are needed. In this paper, we introduce
Variational Prefix Tuning (VPT), a novel approach that enhances pre-trained
models’ ability to generate diverse yet accurate sets of summaries, allowing
the user to choose the most suitable one for the given source code. Our method
integrates a Conditional Variational Autoencoder (CVAE) framework as a modular
component into pre-trained models, enabling us to model the distribution of
observed target summaries and sample continuous embeddings to be used as
prefixes to steer the generation of diverse outputs during decoding.
Importantly, we construct our method in a parameter-efficient manner,
eliminating the need for expensive model retraining, especially when using
LLMCs. Furthermore, we employ a bi-criteria reranking method to select a subset
of generated summaries, optimizing both the diversity and the accuracy of the
options presented to users. We present extensive experimental evaluations using
widely used datasets and current state-of-the-art pre-trained code
summarization models to demonstrate the effectiveness of our approach and its
adaptability across models.
[COMMENTS]
Accepted by the Journal of Systems and Software
[LINK]
http://arxiv.org/abs/2505.09062v1
[DATE]
2025-05-14 09:46:56+08:00
[CATEGORIES]
cs.LG
Making Small Language Models Efficient Reasoners: Intervention, Supervision, Reinforcement
[AUTHORS]
Xuechen Zhang, Zijian Huang, Chenshun Ni, Ziyang Xiong, Jiasi Chen, Samet Oymak
[ABSTRACT]
Recent research enhances language model reasoning by scaling test-time
compute via longer chain-of-thought traces. This often improves accuracy but
also introduces redundancy and high computational cost, especially for small
language models distilled with supervised fine-tuning (SFT). In this work, we
propose new algorithms to improve token-efficient reasoning with small-scale
models by effectively trading off accuracy and computation. We first show that
the post-SFT model fails to determine the optimal stopping point of the
reasoning process, resulting in verbose and repetitive outputs. Verbosity also
significantly varies across wrong vs correct responses. To address these
issues, we propose two solutions: (1) Temperature scaling (TS) to control the
stopping point for the thinking phase and thereby trace length, and (2) TLDR: a
length-regularized reinforcement learning method based on GRPO that facilitates
multi-level trace length control (e.g. short, medium, long reasoning).
Experiments on four reasoning benchmarks, MATH500, AMC, AIME24 and
OlympiadBench, demonstrate that TS is highly effective compared to s1’s budget
forcing approach and TLDR significantly improves token efficiency by about 50%
with minimal to no accuracy loss over the SFT baseline. Moreover, TLDR also
facilitates flexible control over the response length, offering a practical and
effective solution for token-efficient reasoning in small models. Ultimately,
our work reveals the importance of stopping time control, highlights
shortcomings of pure SFT, and provides effective algorithmic recipes.
[LINK]
http://arxiv.org/abs/2505.07961v2
[DATE]
2025-05-14 09:42:08+08:00
[CATEGORIES]
cs.LG
Signed Latent Factors for Spamming Activity Detection
[AUTHORS]
Yuli Liu
[ABSTRACT]
Due to the increasing trend of performing spamming activities (e.g., Web
spam, deceptive reviews, fake followers, etc.) on various online platforms to
gain undeserved benefits, spam detection has emerged as a hot research issue.
Previous attempts to combat spam mainly employ features related to metadata,
user behaviors, or relational ties. These studies have made considerable
progress in understanding and filtering spamming campaigns. However, this
problem remains far from fully solved. Almost all the proposed features focus
on a limited number of observed attributes or explainable phenomena, making it
difficult for existing methods to achieve further improvement. To broaden the
vision about solving the spam problem and address long-standing challenges
(class imbalance and graph incompleteness) in the spam detection area, we
propose a new approach that uses signed latent factors to filter fraudulent
activities. The spam-contaminated relational datasets of multiple online
applications in this scenario are interpreted by the unified signed network.
Two competitive and highly dissimilar algorithms of latent factors mining (LFM)
models are designed based on multi-relational likelihoods estimation (LFM-MRLE)
and signed pairwise ranking (LFM-SPR), respectively. We then explore how to
apply the mined latent factors to spam detection tasks. Experiments on
real-world datasets of different kinds of Web applications (social media and
Web forum) indicate that LFM models outperform state-of-the-art baselines in
detecting spamming activities. By specifically manipulating the experimental
data, we validate the effectiveness of our methods in dealing with the
challenges of incompleteness and imbalance.
[LINK]
http://arxiv.org/abs/2209.13814v2
[DATE]
2025-05-14 09:21:55+08:00
[CATEGORIES]
cs.LG
Convolutional Fourier Analysis Network (CFAN): A Unified Time-Frequency Approach for ECG Classification
[AUTHORS]
Sam Jeong, Hae Yong Kim
[ABSTRACT]
Machine learning has revolutionized biomedical signal analysis, particularly
in electrocardiogram (ECG) classification. While convolutional neural networks
(CNNs) excel at automatic feature extraction, the optimal integration of time-
and frequency-domain information remains unresolved. This study introduces the
Convolutional Fourier Analysis Network (CFAN), a novel architecture that
unifies time-frequency analysis by embedding Fourier principles directly into
CNN layers. We evaluate CFAN against four benchmarks - spectrogram-based 2D CNN
(SPECT); 1D CNN (CNN1D); Fourier-based 1D CNN (FFT1D); and CNN1D with
integrated Fourier Analysis Network (CNN1D-FAN) - across three ECG tasks:
arrhythmia classification (MIT-BIH), identity recognition (ECG-ID), and apnea
detection (Apnea-ECG). CFAN achieved state-of-the-art performance, surpassing
all competing methods with accuracies of 98.95% (MIT-BIH), 96.83% (ECG-ID), and
95.01% (Apnea-ECG). Notably, on ECG-ID and Apnea-ECG, CFAN demonstrated
statistically significant improvements over the second-best method (CNN1D-FAN,
$p \leq 0.02$), further validating its superior performance. Key innovations
include CONV-FAN blocks that combine sine, cosine and GELU activations in
convolutional layers to capture periodic features and joint time-frequency
learning without spectrogram conversion. Our results highlight CFAN’s potential
for broader biomedical and signal classification applications.
[LINK]
http://arxiv.org/abs/2502.00497v3
[DATE]
2025-05-14 08:45:02+08:00
[CATEGORIES]
cs.LG
Monte Carlo Beam Search for Actor-Critic Reinforcement Learning in Continuous Control
[AUTHORS]
Hazim Alzorgan, Abolfazl Razi
[ABSTRACT]
Actor-critic methods, like Twin Delayed Deep Deterministic Policy Gradient
(TD3), depend on basic noise-based exploration, which can result in less than
optimal policy convergence. In this study, we introduce Monte Carlo Beam Search
(MCBS), a new hybrid method that combines beam search and Monte Carlo rollouts
with TD3 to improve exploration and action selection. MCBS produces several
candidate actions around the policy’s output and assesses them through
short-horizon rollouts, enabling the agent to make better-informed choices. We
test MCBS across various continuous-control benchmarks, including
HalfCheetah-v4, Walker2d-v5, and Swimmer-v5, showing enhanced sample efficiency
and performance compared to standard TD3 and other baseline methods like SAC,
PPO, and A2C. Our findings emphasize MCBS’s capability to enhance policy
learning through structured look-ahead search while ensuring computational
efficiency. Additionally, we offer a detailed analysis of crucial
hyperparameters, such as beam width and rollout depth, and explore adaptive
strategies to optimize MCBS for complex control tasks. Our method shows a
higher convergence rate across different environments compared to TD3, SAC,
PPO, and A2C. For instance, we achieved 90% of the maximum achievable reward
within around 200 thousand timesteps compared to 400 thousand timesteps for the
second-best method.
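The candidate-and-rollout idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it leaves out the beam search and the TD3 critic, and the names `rollout_return`, `select_action`, `toy_step`, and `toy_policy`, the Gaussian perturbation, and all parameter values are assumptions chosen for illustration only.

```python
import numpy as np

def rollout_return(state, first_action, policy, step, horizon, gamma=0.99):
    """Discounted return of taking first_action, then following the policy."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(horizon):
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
        action = policy(state)
    return total

def select_action(state, policy, step, rng, n_candidates=8, noise_std=0.2, horizon=5):
    """Score noisy candidates around the policy output with short rollouts."""
    base = policy(state)
    # The unperturbed policy action is always kept as one of the candidates.
    candidates = [base] + [
        np.clip(base + rng.normal(0.0, noise_std, size=base.shape), -1.0, 1.0)
        for _ in range(n_candidates - 1)
    ]
    scores = [rollout_return(state, a, policy, step, horizon) for a in candidates]
    return candidates[int(np.argmax(scores))]

# Hypothetical toy environment: a 1-D point whose reward is minus its distance to 0.
def toy_step(state, action):
    next_state = state + 0.1 * action
    return next_state, -float(np.abs(next_state).sum())

def toy_policy(state):
    return np.clip(-0.5 * state, -1.0, 1.0)  # deliberately weak actor

rng = np.random.default_rng(0)
state = np.array([1.0])
chosen = select_action(state, toy_policy, toy_step, rng)
```

Because the unperturbed policy action is among the candidates and the toy rollouts are deterministic, the selected action never scores worse than the policy's own action under this rollout estimate.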
[LINK]
http://arxiv.org/abs/2505.09029v1
[DATE]
2025-05-14 07:56:12+08:00
[CATEGORIES]
cs.LG
Probabilistic Wind Power Forecasting via Non-Stationary Gaussian Processes
[AUTHORS]
Domniki Ladopoulou, Dat Minh Hong, Petros Dellaportas
[ABSTRACT]
Accurate probabilistic forecasting of wind power is essential for maintaining
grid stability and enabling efficient integration of renewable energy sources.
Gaussian Process (GP) models offer a principled framework for quantifying
uncertainty; however, conventional approaches rely on stationary kernels, which
are inadequate for modeling the inherently non-stationary nature of wind speed
and power output. We propose a non-stationary GP framework that incorporates
the generalized spectral mixture (GSM) kernel, enabling the model to capture
time-varying patterns and heteroscedastic behaviors in wind speed and wind
power data. We evaluate the performance of the proposed model on real-world
SCADA data across short-, medium-, and long-term forecasting horizons.
Compared to standard radial basis function and spectral mixture kernels, the
GSM-based model performs better, particularly in short-term forecasts. These
results highlight the necessity of modeling non-stationarity in wind power
forecasting and demonstrate the practical value of non-stationary GP models in
operational settings.
[COMMENTS]
11 pages, 3 figures, 2 tables
[LINK]
http://arxiv.org/abs/2505.09026v1
[DATE]
2025-05-14 07:46:33+08:00
[CATEGORIES]
cs.LG
Introduction to Machine Learning
[AUTHORS]
Laurent Younes
[ABSTRACT]
This book introduces the mathematical foundations and techniques that lead to
the development and analysis of many of the algorithms that are used in machine
learning. It starts with an introductory chapter that describes notation used
throughout the book and serves as a reminder of basic concepts in calculus,
linear algebra and probability and also introduces some measure theoretic
terminology, which can be used as a reading guide for the sections that use
these tools. The introductory chapters also provide background material on
matrix analysis and optimization. The latter chapter provides theoretical
support to many algorithms that are used in the book, including stochastic
gradient descent, proximal methods, etc. After discussing basic concepts for
statistical prediction, the book includes an introduction to reproducing kernel
theory and Hilbert space techniques, which are used in many places, before
addressing the description of various algorithms for supervised statistical
learning, including linear methods, support vector machines, decision trees,
boosting, or neural networks. The subject then switches to generative methods,
starting with a chapter that presents sampling methods and an introduction to
the theory of Markov chains. The following chapters describe the theory of
graphical models, an introduction to variational methods for models with latent
variables, and to deep-learning based generative models. The next chapters
focus on unsupervised learning methods, for clustering, factor analysis and
manifold learning. The final chapter of the book is theory-oriented and
discusses concentration inequalities and generalization bounds.
[COMMENTS]
textbook
[LINK]
http://arxiv.org/abs/2409.02668v2
[DATE]
2025-05-14 07:40:29+08:00
[CATEGORIES]
cs.LG
Block-Biased Mamba for Long-Range Sequence Processing
[AUTHORS]
Annan Yu, N. Benjamin Erichson
[LINK]
http://arxiv.org/abs/2505.09022v1
[DATE]
2025-05-14 07:34:09+08:00
[CATEGORIES]
cs.LG
DyGSSM: Multi-view Dynamic Graph Embeddings with State Space Model Gradient Update
[AUTHORS]
Bizhan Alipour Pijan, Serdar Bozdag
[ABSTRACT]
Most of the dynamic graph representation learning methods involve dividing a
dynamic graph into discrete snapshots to capture the evolving behavior of nodes
over time. Existing methods primarily capture only local or global structures
of each node within a snapshot using message-passing and random walk-based
methods. Then, they utilize sequence-based models (e.g., transformers) to
encode the temporal evolution of node embeddings, and meta-learning techniques
to update the model parameters. However, these approaches have two limitations.
First, they neglect the extraction of global and local information
simultaneously in each snapshot. Second, they fail to consider the model’s
performance in the current snapshot during parameter updates, resulting in a
lack of temporal dependency management. Recently, the HiPPO (High-order
Polynomial Projection Operators) algorithm has gained attention for its ability
to optimize and preserve sequence history in State Space Models (SSMs). To address
the aforementioned limitations in dynamic graph representation learning, we
propose a novel method called Multi-view Dynamic Graph Embeddings with State
Space Model Gradient Update (DyGSSM). Our approach combines Graph Convolution
Networks (GCN) for local feature extraction and random walk with Gated
Recurrent Unit (GRU) for global feature extraction in each snapshot. We then
integrate the local and global features using a cross-attention mechanism.
Additionally, we incorporate an SSM based on the HiPPO algorithm to account for
long-term dependencies when updating model parameters, ensuring that model
performance in each snapshot informs subsequent updates. Experiments on five
public datasets show that our method outperforms existing baseline and
state-of-the-art (SOTA) methods in 17 out of 20 cases.
[LINK]
http://arxiv.org/abs/2505.09017v1
[DATE]
2025-05-14 07:12:07+08:00
[CATEGORIES]
cs.LG
Signal-based AI-driven software solution for automated quantification of metastatic bone disease and treatment response assessment using Whole-Body Diffusion-Weighted MRI (WB-DWI) biomarkers in Advanced Prostate Cancer
[AUTHORS]
Antonio Candito, Matthew D Blackledge, Richard Holbrey, Nuria Porta, Ana Ribeiro, Fabio Zugni, Luca D’Erme, Francesca Castagnoli, Alina Dragan, Ricardo Donners, Christina Messiou, Nina Tunariu, Dow-Mu Koh
[ABSTRACT]
We developed an AI-driven software solution to quantify metastatic bone
disease from WB-DWI scans. Core technologies include: (i) a weakly-supervised
Residual U-Net model generating a skeleton probability map to isolate bone;
(ii) a statistical framework for WB-DWI intensity normalisation, obtaining a
signal-normalised b=900s/mm^2 (b900) image; and (iii) a shallow convolutional
neural network that processes outputs from (i) and (ii) to generate a mask of
suspected bone lesions, characterised by higher b900 signal intensity due to
restricted water diffusion. This mask is applied to the gADC map to extract TDV
and gADC statistics. We tested the tool using expert-defined metastatic bone
disease delineations on 66 datasets, assessed repeatability of imaging
biomarkers (N=10), and compared software-based response assessment with a
construct reference standard based on clinical, laboratory and imaging
assessments (N=118). Dice score between manual and automated delineations was
0.6 for lesions within pelvis and spine, with an average surface distance of
2mm. Relative differences for log-transformed TDV (log-TDV) and median gADC
were below 9% and 5%, respectively. Repeatability analysis showed coefficients
of variation of 4.57% for log-TDV and 3.54% for median gADC, with intraclass
correlation coefficients above 0.9. The software achieved 80.5% accuracy, 84.3%
sensitivity, and 85.7% specificity in assessing response to treatment compared
to the construct reference standard. Computation time generating a mask
averaged 90 seconds per scan. Our software enables reproducible TDV and gADC
quantification from WB-DWI scans for monitoring metastatic bone disease
response, thus providing potentially useful measurements for clinical
decision-making in APC patients.
[LINK]
http://arxiv.org/abs/2505.09011v1
[DATE]
2025-05-14 06:57:49+08:00
[CATEGORIES]
cs.LG
Lower Bounds on the MMSE of Adversarially Inferring Sensitive Features
[AUTHORS]
Monica Welfert, Nathan Stromberg, Mario Diaz, Lalitha Sankar
[ABSTRACT]
We propose an adversarial evaluation framework for sensitive feature
inference based on minimum mean-squared error (MMSE) estimation with a finite
sample size and linear predictive models. Our approach establishes theoretical
lower bounds on the true MMSE of inferring sensitive features from noisy
observations of other correlated features. These bounds are expressed in terms
of the empirical MMSE under a restricted hypothesis class and a non-negative
error term. The error term captures both the estimation error due to a finite
number of samples and the approximation error from using a restricted
hypothesis class. For linear predictive models, we derive closed-form bounds,
which are order optimal in terms of the noise variance, on the approximation
error for several classes of relationships between the sensitive and
non-sensitive features, including linear mappings, binary symmetric channels,
and class-conditional multi-variate Gaussian distributions. We also present a
new lower bound that relies on the MSE computed on a hold-out validation
dataset of the MMSE estimator learned on finite samples and a restricted
hypothesis class. Through empirical evaluation, we demonstrate that our
framework serves as an effective tool for MMSE-based adversarial evaluation of
sensitive feature inference that balances theoretical guarantees with practical
efficiency.
[COMMENTS]
submitted to IEEE Transactions on Information Theory
[LINK]
http://arxiv.org/abs/2505.09004v1
[DATE]
2025-05-14 06:39:24+08:00
[CATEGORIES]
cs.LG
Continual Reinforcement Learning via Autoencoder-Driven Task and New Environment Recognition
[AUTHORS]
Zeki Doruk Erden, Donia Gasmi, Boi Faltings
[COMMENTS]
Published in the Autonomous Robots and Multirobot Systems (ARMS)
workshop at AAMAS 2025
[LINK]
http://arxiv.org/abs/2505.09003v1
[DATE]
2025-05-14 06:38:54+08:00
[CATEGORIES]
cs.LG
Forecasting intermittent time series with Gaussian Processes and Tweedie likelihood
[AUTHORS]
Stefano Damato, Dario Azzimonti, Giorgio Corani
[ABSTRACT]
We adopt Gaussian Processes (GPs) as latent functions for probabilistic
forecasting of intermittent time series. The model is trained in a Bayesian
framework that accounts for the uncertainty about the latent function and
marginalizes it out when making predictions. We couple the latent GP variable
with two types of forecast distributions: the negative binomial (NegBinGP) and
the Tweedie distribution (TweedieGP). While the negative binomial has already
been used in forecasting intermittent time series, this is the first time in
which a fully parameterized Tweedie density is used for intermittent time
series. We properly evaluate the Tweedie density, which has both a point mass
at zero and heavy tails, avoiding simplifying assumptions made in existing
models. We test our models on thousands of intermittent count time series.
Results show that our models provide consistently better probabilistic
forecasts than the competitors. In particular, TweedieGP obtains the best
estimates of the highest quantiles, thus showing that it is more flexible than
NegBinGP.
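The point mass at zero mentioned in this abstract can be seen in the standard compound Poisson-gamma representation of the Tweedie distribution, which holds for power parameters 1 < p < 2. The sketch below illustrates that representation only and is not the authors' model: the function names, the mean/dispersion/power parameterisation (`mu`, `phi`, `p`), and the example values are assumptions.

```python
import numpy as np

def tweedie_cpg_params(mu, phi, p):
    """Map Tweedie(mean mu, dispersion phi, power 1 < p < 2) to Poisson-gamma parameters."""
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))   # Poisson rate of the number of jumps
    shape = (2.0 - p) / (p - 1.0)               # gamma shape of each jump
    scale = phi * (p - 1.0) * mu ** (p - 1.0)   # gamma scale of each jump
    return lam, shape, scale

def sample_tweedie(mu, phi, p, size, rng):
    """Draw samples as a Poisson number of summed gamma jumps."""
    lam, shape, scale = tweedie_cpg_params(mu, phi, p)
    counts = rng.poisson(lam, size=size)
    out = np.zeros(size)
    positive = counts > 0
    # A sum of n iid Gamma(shape, scale) draws is Gamma(n * shape, scale);
    # zero jumps leaves the sample at exactly 0, which is the point mass.
    out[positive] = rng.gamma(counts[positive] * shape, scale)
    return out

rng = np.random.default_rng(0)
mu, phi, p = 0.5, 1.0, 1.5                      # illustrative values only
samples = sample_tweedie(mu, phi, p, 200_000, rng)
lam, _, _ = tweedie_cpg_params(mu, phi, p)
zero_fraction = float(np.mean(samples == 0.0))  # should be close to exp(-lam)
```

Under this representation the probability of an exact zero is exp(-lam), and the mean and variance work out to mu and phi * mu**p respectively.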
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2502.19086v3
[DATE]
2025-05-14 06:38:37+08:00
[CATEGORIES]
cs.LG
Shifting Work Patterns with Generative AI
[AUTHORS]
Eleanor Wiske Dillon, Sonia Jaffe, Nicole Immorlica, Christopher T. Stanton
[ABSTRACT]
We present evidence on how generative AI changes the work patterns of
knowledge workers using data from a 6-month-long, cross-industry, randomized
field experiment. Half of the 7,137 workers in the study received access to a
generative AI tool integrated into the applications they already used for
emails, document creation, and meetings. We find that access to the AI tool
during the first year of its release primarily impacted behaviors that workers
could change independently and not behaviors that require coordination to
change: workers who used the tool in more than half of the sample weeks spent
3.6 fewer hours, or 31% less time on email each week (intent to treat estimate
is 1.3 hours) and completed documents moderately faster, but did not
significantly change time spent in meetings.
[LINK]
http://arxiv.org/abs/2504.11436v2
[DATE]
2025-05-14 06:28:06+08:00
[CATEGORIES]
cs.LG
Learning to Be Cautious
[AUTHORS]
Montaser Mohammedalamen, Dustin Morrill, Alexander Sieusahai, Yash Satsangi, Michael Bowling
[ABSTRACT]
A key challenge in the field of reinforcement learning is to develop agents
that behave cautiously in novel situations. It is generally impossible to
anticipate all situations that an autonomous system may face or what behavior
would best avoid bad outcomes. An agent that can learn to be cautious would
overcome this challenge by discovering for itself when and how to behave
cautiously. In contrast, current approaches typically embed task-specific
safety information or explicit cautious behaviors into the system, which is
error-prone and imposes extra burdens on practitioners. In this paper, we
present both a sequence of tasks where cautious behavior becomes increasingly
non-obvious, as well as an algorithm to demonstrate that it is possible for a
system to learn to be cautious. The essential features of our algorithm are
that it characterizes reward function uncertainty without task-specific safety
information and uses this uncertainty to construct a robust policy.
Specifically, we construct robust policies with a k-of-N counterfactual regret
minimization (CFR) subroutine given learned reward function uncertainty
represented by a neural network ensemble. These policies exhibit caution in
each of our tasks without any task-specific safety tuning.
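As a rough illustration of the k-of-N idea, the sketch below scores a policy by its average expected reward over the k worst of N sampled reward functions. It is a simplified bandit-style example under stated assumptions, not the authors' method: there is no CFR subroutine and no neural network ensemble, and the function name `k_of_n_value` and the reward table are hypothetical.

```python
import numpy as np

def k_of_n_value(policy_probs, sampled_rewards, k):
    """Average expected reward over the k worst of N sampled reward vectors."""
    values = sampled_rewards @ policy_probs   # expected reward per sample, shape (N,)
    worst_k = np.sort(values)[:k]             # the k least favourable samples
    return float(worst_k.mean())

# Hypothetical ensemble of N = 4 reward samples over two actions:
# column 0 is a "safe" action, column 1 is a "risky" action that is
# usually better but occasionally very bad.
sampled_rewards = np.array([
    [0.5, 1.0],
    [0.5, 1.0],
    [0.5, 1.0],
    [0.5, -0.6],
])
safe = np.array([1.0, 0.0])
risky = np.array([0.0, 1.0])

cautious_safe = k_of_n_value(safe, sampled_rewards, k=1)
cautious_risky = k_of_n_value(risky, sampled_rewards, k=1)
average_safe = k_of_n_value(safe, sampled_rewards, k=4)
average_risky = k_of_n_value(risky, sampled_rewards, k=4)
```

With k = N the score reduces to the plain average, which favours the risky action here, while k = 1 evaluates only the worst sample and favours the safe one. Smaller k therefore corresponds to more cautious behaviour under reward uncertainty.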
[COMMENTS]
Under Review
[LINK]
http://arxiv.org/abs/2110.15907v2
[DATE]
2025-05-14 06:20:19+08:00
[CATEGORIES]
cs.LG
Enhancing Aerial Combat Tactics through Hierarchical Multi-Agent Reinforcement Learning
[AUTHORS]
Ardian Selmonaj, Oleg Szehr, Giacomo Del Rio, Alessandro Antonucci, Adrian Schneider, Michael Rüegsegger
[ABSTRACT]
This work presents a Hierarchical Multi-Agent Reinforcement Learning
framework for analyzing simulated air combat scenarios involving heterogeneous
agents. The objective is to identify effective Courses of Action that lead to
mission success within preset simulations, thereby enabling the exploration of
real-world defense scenarios at low cost and in a safe-to-fail setting.
Applying deep Reinforcement Learning in this context poses specific challenges,
such as complex flight dynamics, the exponential size of the state and action
spaces in multi-agent systems, and the capability to integrate real-time
control of individual units with look-ahead planning. To address these
challenges, the decision-making process is split into two levels of
abstraction: low-level policies control individual units, while a high-level
commander policy issues macro commands aligned with the overall mission
targets. This hierarchical structure facilitates the training process by
exploiting policy symmetries of individual agents and by separating control
from command tasks. The low-level policies are trained for individual combat
control in a curriculum of increasing complexity. The high-level commander is
then trained on mission targets given pre-trained control policies. The
empirical validation confirms the advantages of the proposed framework.
[COMMENTS]
Published as journal chapter in Deep Learning Applications, Vol. 1,
by Taylor & Francis
[LINK]
http://arxiv.org/abs/2505.08995v1
[DATE]
2025-05-14 06:13:48+08:00
[CATEGORIES]
cs.LG
ChicGrasp: Imitation-Learning based Customized Dual-Jaw Gripper Control for Delicate, Irregular Bio-products Manipulation
[AUTHORS]
Amirreza Davar, Zhengtong Xu, Siavash Mahmoudi, Pouya Sohrabipour, Chaitanya Pallerla, Yu She, Wan Shou, Philip Crandall, Dongyi Wang
[ABSTRACT]
Automated poultry processing lines still rely on humans to lift slippery,
easily bruised carcasses onto a shackle conveyor. Deformability, anatomical
variance, and strict hygiene rules make conventional suction and scripted
motions unreliable. We present ChicGrasp, an end-to-end hardware-software
co-design for this task. An independently actuated dual-jaw pneumatic gripper
clamps both chicken legs, while a conditional diffusion-policy controller,
trained from only 50 multi-view teleoperation demonstrations (RGB +
proprioception), plans 5-DoF end-effector motion, including jaw commands,
in one shot. On individually presented raw broiler carcasses, our system
achieves a 40.6% grasp-and-lift success rate and completes the pick-to-shackle
cycle in 38 s, whereas state-of-the-art implicit behaviour cloning
(IBC) and LSTM-GMM baselines fail entirely. All CAD, code, and datasets will be
open-source. ChicGrasp shows that imitation learning can bridge the gap between
rigid hardware and variable bio-products, offering a reproducible benchmark
and a public dataset for researchers in agricultural engineering and robot
learning.
[COMMENTS]
Submitted for journal review
[LINK]
http://arxiv.org/abs/2505.08986v1
[DATE]
2025-05-14 05:56:44+08:00
[CATEGORIES]
cs.LG
Model-free Online Learning for the Kalman Filter: Forgetting Factor and Logarithmic Regret
[AUTHORS]
Jiachen Qian, Yang Zheng
[ABSTRACT]
We consider the problem of online prediction for an unknown, non-explosive
linear stochastic system. With a known system model, the optimal predictor is
the celebrated Kalman filter. In the case of unknown systems, existing
approaches based on recursive least squares and its variants may suffer from
degraded performance due to the highly imbalanced nature of the regression
model. This imbalance can easily lead to overfitting and thus degrade
prediction accuracy. We tackle this problem by injecting an inductive bias into
the regression model via exponential forgetting. While exponential forgetting
is common wisdom in online learning, it is typically used for re-weighting
data. In contrast, our approach focuses on balancing the regression model. This
achieves a better trade-off between regression and regularization errors,
and simultaneously reduces the accumulation error. With new proof techniques,
we also provide a sharper logarithmic regret bound of $O(\log^3 N)$, where $N$
is the number of observations.
[LINK]
http://arxiv.org/abs/2505.08982v1
[DATE]
2025-05-14 05:49:56+08:00
[CATEGORIES]
cs.LG
Deep-MacroFin: Informed Equilibrium Neural Network for Continuous Time Economic Models
[AUTHORS]
Yuntao Wu, Jiayuan Guo, Goutham Gopalakrishna, Zissis Poulos
[ABSTRACT]
In this paper, we present Deep-MacroFin, a comprehensive framework designed
to solve partial differential equations, with a particular focus on models in
continuous time economics. This framework leverages deep learning
methodologies, including Multi-Layer Perceptrons and the newly developed
Kolmogorov-Arnold Networks. It is optimized using economic information
encapsulated by Hamilton-Jacobi-Bellman (HJB) equations and coupled algebraic
equations. The application of neural networks holds the promise of accurately
resolving high-dimensional problems with fewer computational demands and
limitations compared to other numerical methods. This framework can be readily
adapted for systems of partial differential equations in high dimensions.
Importantly, it offers a more efficient (5$\times$ less CUDA memory and
40$\times$ fewer FLOPs in 100D problems) and user-friendly implementation than
existing libraries. We also incorporate a time-stepping scheme to enhance
training stability for nonlinear HJB equations, enabling the solution of 50D
economic models.
[COMMENTS]
30 pages, 13 figures
[LINK]
http://arxiv.org/abs/2408.10368v4
[DATE]
2025-05-14 05:40:38+08:00
[CATEGORIES]
cs.LG
SaFARi: State-Space Models for Frame-Agnostic Representation
[AUTHORS]
Hossein Babaei, Mel White, Sina Alemohammad, Richard G. Baraniuk
[ABSTRACT]
State-Space Models (SSMs) have re-emerged as a powerful tool for online
function approximation, and as the backbone of machine learning models for
long-range dependent data. However, to date, only a few polynomial bases have
been explored for this purpose, and the state-of-the-art implementations were
built upon the best of a few limited options. In this paper, we present a
generalized method for building an SSM with any frame or basis, rather than
being restricted to polynomials. This framework encompasses the approach known
as HiPPO, but also permits an infinite diversity of other possible “species”
within the SSM architecture. We dub this approach SaFARi: SSMs for
Frame-Agnostic Representation.
[COMMENTS]
13 pages, 5 figures
[LINK]
http://arxiv.org/abs/2505.08977v1
[DATE]
2025-05-14 05:39:40+08:00
[CATEGORIES]
cs.LG
A Comprehensive Social Bias Audit of Contrastive Vision Language Models
[AUTHORS]
Zahraa Al Sahili, Ioannis Patras, Matthew Purver
[ABSTRACT]
In the domain of text-to-image generative models, biases inherent in training
datasets often propagate into generated content, posing significant ethical
challenges, particularly in socially sensitive contexts. We introduce FairCoT,
a novel framework that enhances fairness in text-to-image models through
Chain-of-Thought (CoT) reasoning within multimodal generative large language
models. FairCoT employs iterative CoT refinement to systematically mitigate
biases, and dynamically adjusts textual prompts in real time, ensuring diverse
and equitable representation in generated images. By integrating iterative
reasoning processes, FairCoT addresses the limitations of zero-shot CoT in
sensitive scenarios, balancing creativity with ethical responsibility.
Experimental evaluations across popular text-to-image systems, including DALL-E
and various Stable Diffusion variants, demonstrate that FairCoT significantly
enhances fairness and diversity without sacrificing image quality or semantic
fidelity. By combining robust reasoning, lightweight deployment, and
extensibility to multiple models, FairCoT represents a promising step toward
more socially responsible and transparent AI-driven content generation.
[LINK]
http://arxiv.org/abs/2501.13223v3
[DATE]
2025-05-14 05:39:21+08:00
[CATEGORIES]
cs.LG
Accelerated Stochastic Min-Max Optimization Based on Bias-corrected Momentum
[AUTHORS]
Haoyuan Cai, Sulaiman A. Alghunaim, Ali H. Sayed
[ABSTRACT]
Lower-bound analyses for nonconvex strongly-concave minimax optimization
problems have shown that stochastic first-order algorithms require at least
$\mathcal{O}(\varepsilon^{-4})$ oracle complexity to find an
$\varepsilon$-stationary point. Some works indicate that this complexity can be
improved to $\mathcal{O}(\varepsilon^{-3})$ when the loss gradient is Lipschitz
continuous. The question of achieving enhanced convergence rates under distinct
conditions remains unresolved. In this work, we address this question for
optimization problems that are nonconvex in the minimization variable and
strongly concave or Polyak-Lojasiewicz (PL) in the maximization variable. We
introduce novel bias-corrected momentum algorithms utilizing efficient
Hessian-vector products. We establish convergence conditions and demonstrate a
lower iteration complexity of $\mathcal{O}(\varepsilon^{-3})$ for the proposed
algorithms. The effectiveness of the method is validated through applications
to robust logistic regression using real-world datasets.
[LINK]
http://arxiv.org/abs/2406.13041v2
[DATE]
2025-05-14 05:28:36+08:00
[CATEGORIES]
cs.LG
GPML: Graph Processing for Machine Learning
[AUTHORS]
Majed Jaber, Julien Michel, Nicolas Boutry, Pierre Parrend
[ABSTRACT]
The dramatic increase of complex, multi-step, and rapidly evolving attacks in
dynamic networks calls for advanced cyber-threat detectors. The GPML (Graph
Processing for Machine Learning) library addresses this need by transforming
raw network traffic traces into graph representations, enabling advanced
insights into network behaviors. The library provides tools to detect anomalies
in interaction and community shifts in dynamic networks. GPML supports
community and spectral metrics extraction, enhancing both real-time detection
and historical forensics analysis. This library supports modern cybersecurity
challenges with a robust, graph-based approach.
[LINK]
http://arxiv.org/abs/2505.08964v1
[DATE]
2025-05-14 05:10:46+08:00
[CATEGORIES]
cs.LG
Differentiable Channel Selection in Self-Attention For Person Re-Identification
[AUTHORS]
Yancheng Wang, Nebojsa Jojic, Yingzhen Yang
[ABSTRACT]
In this paper, we propose a novel attention module termed the Differentiable
Channel Selection Attention module, or the DCS-Attention module. In contrast
with conventional self-attention, the DCS-Attention module features selection
of informative channels in the computation of the attention weights. The
selection of the feature channels is performed in a differentiable manner,
enabling seamless integration with DNN training. Our DCS-Attention is
compatible with either fixed neural network backbones or learnable backbones
with Differentiable Neural Architecture Search (DNAS), leading to DCS with
Fixed Backbone (DCS-FB) and DCS-DNAS, respectively. Importantly, our
DCS-Attention is motivated by the principle of Information Bottleneck (IB), and
a novel variational upper bound for the IB loss, which can be optimized by SGD,
is derived and incorporated into the training loss of the networks with the
DCS-Attention modules. In this manner, a neural network with DCS-Attention
modules is capable of selecting the most informative channels for feature
extraction so that it enjoys state-of-the-art performance for the Re-ID task.
Extensive experiments on multiple person Re-ID benchmarks using both DCS-FB and
DCS-DNAS show that DCS-Attention significantly enhances the prediction accuracy
of DNNs for person Re-ID, which demonstrates the effectiveness of DCS-Attention
in learning discriminative features critical to identifying person identities.
The code of our work is available at
https://github.com/Statistical-Deep-Learning/DCS-Attention.
[LINK]
http://arxiv.org/abs/2505.08961v1
[DATE]
2025-05-14 05:01:53+08:00
[CATEGORIES]
cs.LG
DiffCloud: Real-to-Sim from Point Clouds with Differentiable Simulation and Rendering of Deformable Objects
[AUTHORS]
Priya Sundaresan, Rika Antonova, Jeannette Bohg
[ABSTRACT]
Research in manipulation of deformable objects is typically conducted on a
limited range of scenarios, because handling each scenario on hardware takes
significant effort. Realistic simulators with support for various types of
deformations and interactions have the potential to speed up experimentation
with novel tasks and algorithms. However, for highly deformable objects it is
challenging to align the output of a simulator with the behavior of real
objects. Manual tuning is not intuitive, hence automated methods are needed. We
view this alignment problem as a joint perception-inference challenge and
demonstrate how to use recent neural network architectures to successfully
perform simulation parameter inference from real point clouds. We analyze the
performance of various architectures, comparing their data and training
requirements. Furthermore, we propose to leverage differentiable point cloud
sampling and differentiable simulation to significantly reduce the time to
achieve the alignment. We employ an efficient way to propagate gradients from
point clouds to simulated meshes and further through to the physical simulation
parameters, such as mass and stiffness. Experiments with highly deformable
objects show that our method can achieve comparable or better alignment with
real object behavior, while reducing the time needed to achieve this by more
than an order of magnitude. Videos and supplementary material are available at
https://diffcloud.github.io.
[LINK]
http://arxiv.org/abs/2204.03139v2
[DATE]
2025-05-14 04:31:59+08:00
[CATEGORIES]
cs.LG
NeurIPS 2024 Ariel Data Challenge: Characterisation of Exoplanetary Atmospheres Using a Data-Centric Approach
[AUTHORS]
Jeremie Blanchard, Lisa Casino, Jordan Gierschendorf
[ABSTRACT]
The characterization of exoplanetary atmospheres through spectral analysis is
a complex challenge. The NeurIPS 2024 Ariel Data Challenge, in collaboration
with the European Space Agency’s (ESA) Ariel mission, provided an opportunity
to explore machine learning techniques for extracting atmospheric compositions
from simulated spectral data. In this work, we focus on a data-centric business
approach, prioritizing generalization over competition-specific optimization.
We briefly outline multiple experimental axes, including feature extraction,
signal transformation, and heteroskedastic uncertainty modeling. Our
experiments demonstrate that uncertainty estimation plays a crucial role in the
Gaussian Log-Likelihood (GLL) score, impacting performance by several
percentage points. Despite improving the GLL score by 11%, our results
highlight the inherent limitations of tabular modeling and feature engineering
for this task, as well as the constraints of a business-driven approach within
a Kaggle-style competition framework. Our findings emphasize the trade-offs
between model simplicity, interpretability, and generalization in astrophysical
data analysis.
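
As a rough illustration of why uncertainty estimates weigh so heavily in a
Gaussian log-likelihood score, the sketch below evaluates the same point
predictions under three different predicted standard deviations. The function
name and the toy numbers are assumptions made for illustration; the official
challenge score applies its own normalization, which is not reproduced here.

```python
import math

def gaussian_log_likelihood(y_true, mu, sigma):
    """Mean per-point log-likelihood of y_true under N(mu, sigma^2)."""
    total = 0.0
    for y, m, s in zip(y_true, mu, sigma):
        total += -0.5 * (math.log(2.0 * math.pi * s * s) + ((y - m) / s) ** 2)
    return total / len(y_true)

# Toy data (hypothetical): identical predictions, different uncertainty claims.
y_true = [1.0, 2.0, 3.0]
mu = [1.1, 1.9, 3.2]
well_calibrated = gaussian_log_likelihood(y_true, mu, [0.15] * 3)
overconfident = gaussian_log_likelihood(y_true, mu, [0.01] * 3)
too_wide = gaussian_log_likelihood(y_true, mu, [5.0] * 3)
```

With the point predictions held fixed, the overconfident variant is penalized
far more heavily than the overly wide one, and both score below the variant
whose sigma roughly matches the actual errors.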
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2505.08940v1
[DATE]
2025-05-14 04:09:22+08:00
[CATEGORIES]
cs.LG
Optimal navigation of magnetic artificial microswimmers in blood capillaries with deep reinforcement learning
[AUTHORS]
Lucas Amoudruz, Sergey Litvinov, Petros Koumoutsakos
[ABSTRACT]
Biomedical applications such as targeted drug delivery, microsurgery, and
sensing rely on reaching precise areas within the body in a minimally invasive
way. Artificial bacterial flagella (ABFs) have emerged as potential tools for
this task by navigating through the circulatory system with the help of
external magnetic fields. While their swimming characteristics are well
understood in simple settings, their controlled navigation through realistic
capillary networks remains a significant challenge due to the complexity of
blood flow and the high computational cost of detailed simulations. We address
this challenge by conducting numerical simulations of ABFs in retinal
capillaries, propelled by an external magnetic field. The simulations are based
on a validated blood model that predicts the dynamics of individual red blood
cells and their hydrodynamic interactions with ABFs. The magnetic field follows
a control policy that brings the ABF to a prescribed target. The control policy
is learned with an actor-critic, off-policy reinforcement learning algorithm
coupled with a reduced-order model of the system. We show that the same policy
robustly guides the ABF to a prescribed target in both the reduced-order model
and the fine-grained blood simulations. This approach is suitable for designing
robust control policies for personalized medicine at moderate computational
cost.
[LINK]
http://arxiv.org/abs/2404.02171v2
[DATE]
2025-05-14 03:26:11+08:00
[CATEGORIES]
cs.LG
An Analytical Characterization of Sloppiness in Neural Networks: Insights from Linear Models
[AUTHORS]
Jialin Mao, Itay Griniasty, Yan Sun, Mark K. Transtrum, James P. Sethna, Pratik Chaudhari
[ABSTRACT]
Recent experiments have shown that training trajectories of multiple deep
neural networks with different architectures, optimization algorithms,
hyper-parameter settings, and regularization methods evolve on a remarkably
low-dimensional “hyper-ribbon-like” manifold in the space of probability
distributions. Inspired by the similarities in the training trajectories of
deep networks and linear networks, we analytically characterize this phenomenon
for the latter. We show, using tools in dynamical systems theory, that the
geometry of this low-dimensional manifold is controlled by (i) the decay rate
of the eigenvalues of the input correlation matrix of the training data, (ii)
the relative scale of the ground-truth output to the weights at the beginning
of training, and (iii) the number of steps of gradient descent. By analytically
computing and bounding the contributions of these quantities, we characterize
phase boundaries of the region where hyper-ribbons are to be expected. We also
extend our analysis to kernel machines and linear models that are trained with
stochastic gradient descent.
[LINK]
http://arxiv.org/abs/2505.08915v1
[DATE]
2025-05-14 03:20:19+08:00
[CATEGORIES]
cs.LG
Reinforcement Learning-based Heuristics to Guide Domain-Independent Dynamic Programming
[AUTHORS]
Minori Narita, Ryo Kuroiwa, J. Christopher Beck
[ABSTRACT]
Domain-Independent Dynamic Programming (DIDP) is a state-space search
paradigm based on dynamic programming for combinatorial optimization. In its
current implementation, DIDP guides the search using user-defined dual bounds.
Reinforcement learning (RL) is increasingly being applied to combinatorial
optimization problems and shares several key structures with DP, being
represented by the Bellman equation and state-based transition systems. We
propose using reinforcement learning to obtain a heuristic function to guide
the search in DIDP. We develop two RL-based guidance approaches: value-based
guidance using Deep Q-Networks and policy-based guidance using Proximal Policy
Optimization. Our experiments indicate that RL-based guidance significantly
outperforms standard DIDP and problem-specific greedy heuristics with the same
number of node expansions. Further, despite longer node evaluation times, RL
guidance achieves better run-time performance than standard DIDP on three of
four benchmark domains.
[COMMENTS]
24 pages, 4 figures, to be published in CPAIOR 2025
(https://sites.google.com/view/cpaior2025)
[LINK]
http://arxiv.org/abs/2503.16371v2
[DATE]
2025-05-14 03:08:33+08:00
[CATEGORIES]
cs.LG
Differentiable Quantum Architecture Search in Quantum-Enhanced Neural Network Parameter Generation
[AUTHORS]
Samuel Yen-Chi Chen, Chen-Yu Liu, Kuan-Cheng Chen, Wei-Jia Huang, Yen-Jui Chang, Wei-Hao Huang
[ABSTRACT]
The rapid advancements in quantum computing (QC) and machine learning (ML)
have led to the emergence of quantum machine learning (QML), which integrates
the strengths of both fields. Among QML approaches, variational quantum
circuits (VQCs), also known as quantum neural networks (QNNs), have shown
promise both empirically and theoretically. However, their broader adoption is
hindered by reliance on quantum hardware during inference. Hardware
imperfections and limited access to quantum devices pose practical challenges.
To address this, the Quantum-Train (QT) framework leverages the exponential
scaling of quantum amplitudes to generate classical neural network parameters,
enabling inference without quantum hardware and achieving significant parameter
compression. Yet, designing effective quantum circuit architectures for such
quantum-enhanced neural programmers remains non-trivial and often requires
expertise in quantum information science. In this paper, we propose an
automated solution using differentiable optimization. Our method jointly
optimizes both conventional circuit parameters and architectural parameters in
an end-to-end manner via automatic differentiation. We evaluate the proposed
framework on classification, time-series prediction, and reinforcement learning
tasks. Simulation results show that our method matches or outperforms manually
designed QNN architectures. This work offers a scalable and automated pathway
for designing QNNs that can generate classical neural network parameters across
diverse applications.
[LINK]
http://arxiv.org/abs/2505.09653v1
[DATE]
2025-05-14 03:01:08+08:00
[CATEGORIES]
cs.LG
Learning Cocoercive Conservative Denoisers via Helmholtz Decomposition for Poisson Inverse Problems
[AUTHORS]
Deliang Wei, Peng Chen, Haobo Xu, Jiale Yao, Fang Li, Tieyong Zeng
[ABSTRACT]
Plug-and-play (PnP) methods with deep denoisers have shown impressive results
in imaging problems. They typically require strong convexity or smoothness of
the fidelity term and a (residual) non-expansive denoiser for convergence.
These assumptions, however, are violated in Poisson inverse problems, and
non-expansiveness can hinder denoising performance. To address these
challenges, we propose a cocoercive conservative (CoCo) denoiser, which may be
(residual) expansive, leading to improved denoising. By leveraging the
generalized Helmholtz decomposition, we introduce a novel training strategy
that combines Hamiltonian regularization to promote conservativeness and
spectral regularization to ensure cocoerciveness. We prove that CoCo denoiser
is a proximal operator of a weakly convex function, enabling a restoration
model with an implicit weakly convex prior. The global convergence of PnP
methods to a stationary point of this restoration model is established.
Extensive experimental results demonstrate that our approach outperforms
closely related methods in both visual quality and quantitative metrics.
[COMMENTS]
31 pages
[LINK]
http://arxiv.org/abs/2505.08909v1
[DATE]
2025-05-14 03:00:55+08:00
[CATEGORIES]
cs.LG
Bounding Neyman-Pearson Region with $f$-Divergences
[AUTHORS]
Andrew Mullhaupt, Cheng Peng
[ABSTRACT]
The Neyman-Pearson region of a simple binary hypothesis test is the set of
points whose coordinates represent the false positive rate and false negative
rate of some test. The lower boundary of this region is given by the
Neyman-Pearson lemma and is, up to a coordinate change, equivalent to the
optimal ROC curve. We establish a novel lower bound for the boundary in terms
of any $f$-divergence. Since the bound generated by hockey-stick
$f$-divergences characterizes the Neyman-Pearson boundary, this bound is best
possible. In the case of KL divergence, this bound improves Pinsker’s
inequality. Furthermore, we obtain a closed-form refined upper bound for the
Neyman-Pearson boundary in terms of the Chernoff $\alpha$-coefficient. Finally,
we present methods for constructing pairs of distributions that can
approximately or exactly realize any given Neyman-Pearson boundary.
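
For intuition only, the following sketch computes the vertices of the lower
Neyman-Pearson boundary for two distributions on a finite outcome space by
ordering outcomes by likelihood ratio, and checks the classical
total-variation bound (false positive rate + false negative rate >= 1 - TV).
This is the textbook bound, not the new $f$-divergence bound of the paper; the
function names and the example distributions are assumptions.

```python
def np_boundary(p, q):
    """Vertices (false positive rate, false negative rate) of the lower
    Neyman-Pearson boundary for testing H0: p against H1: q.
    Randomized tests fill in the segments between consecutive vertices."""
    def ratio(i):
        return q[i] / p[i] if p[i] > 0 else float("inf")

    # Reject H0 first on the outcomes where q/p is largest.
    order = sorted(range(len(p)), key=ratio, reverse=True)
    fpr, fnr = 0.0, 1.0
    points = [(fpr, fnr)]
    for i in order:
        fpr += p[i]  # H0 mass added to the rejection region
        fnr -= q[i]  # H1 mass that is no longer missed
        points.append((fpr, fnr))
    return points

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
boundary = np_boundary(p, q)
tv = total_variation(p, q)
best_sum = min(a + b for a, b in boundary)  # equals 1 - tv up to rounding
```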
[LINK]
http://arxiv.org/abs/2505.08899v1
[DATE]
2025-05-14 02:42:10+08:00
[CATEGORIES]
cs.LG
PCS-UQ: Uncertainty Quantification via the Predictability-Computability-Stability Framework
[AUTHORS]
Abhineet Agarwal, Michael Xiao, Rebecca Barter, Omer Ronen, Boyu Fan, Bin Yu
[ABSTRACT]
As machine learning (ML) models are increasingly deployed in high-stakes
domains, trustworthy uncertainty quantification (UQ) is critical for ensuring
the safety and reliability of these models. Traditional UQ methods rely on
specifying a true generative model and are not robust to misspecification. On
the other hand, conformal inference allows for arbitrary ML models but does not
consider model selection, which leads to large interval sizes. We tackle these
drawbacks by proposing a UQ method based on the predictability, computability,
and stability (PCS) framework for veridical data science proposed by Yu and
Kumbier. Specifically, PCS-UQ addresses model selection by using a prediction
check to screen out unsuitable models. PCS-UQ then fits these screened
algorithms across multiple bootstraps to assess inter-sample variability and
algorithmic instability, enabling more reliable uncertainty estimates. Further,
we propose a novel calibration scheme that improves local adaptivity of our
prediction sets. Experiments across $17$ regression and $6$ classification
datasets show that PCS-UQ achieves the desired coverage and reduces width over
conformal approaches by $\approx 20\%$. Further, our local analysis shows
PCS-UQ often achieves target coverage across subgroups while conformal methods
fail to do so. For large deep-learning models, we propose computationally
efficient approximation schemes that avoid the expensive multiple bootstrap
trainings of PCS-UQ. Across three computer vision benchmarks, PCS-UQ reduces
prediction set size over conformal methods by $20\%$. Theoretically, we show a
modified PCS-UQ algorithm is a form of split conformal inference and achieves
the desired coverage with exchangeable data.
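
Since the abstract relates a modified PCS-UQ to split conformal inference, a
minimal sketch of standard split conformal regression may help. It is not the
PCS-UQ procedure itself: there is no prediction check, bootstrap, or local
calibration here, and the function name and toy residuals are assumptions.

```python
import math

def split_conformal_halfwidth(cal_residuals, alpha):
    """Half-width q such that [pred - q, pred + q] has marginal coverage
    of at least 1 - alpha when calibration and test points are exchangeable."""
    n = len(cal_residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(abs(r) for r in cal_residuals)[k - 1]

# Hypothetical residuals (y - prediction) on a held-out calibration split.
cal_residuals = [0.1, -0.4, 0.3, -0.2, 0.5, 0.05, -0.6, 0.25, -0.15]
q = split_conformal_halfwidth(cal_residuals, alpha=0.25)
interval = (2.0 - q, 2.0 + q)  # interval around a new point prediction of 2.0
```

The interval width is constant across inputs in this basic form, which is the
kind of limitation that local calibration schemes aim to address.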
[LINK]
http://arxiv.org/abs/2505.08784v1
[DATE]
2025-05-14 01:58:16+08:00
[CATEGORIES]
cs.LG
Addressing the Current Challenges of Quantum Machine Learning through Multi-Chip Ensembles
[AUTHORS]
Junghoon Justin Park, Jiook Cha, Samuel Yen-Chi Chen, Huan-Hsin Tseng, Shinjae Yoo
[ABSTRACT]
Quantum Machine Learning (QML) holds significant promise for solving
computational challenges across diverse domains. However, its practical
deployment is constrained by the limitations of noisy intermediate-scale
quantum (NISQ) devices, including noise, limited scalability, and trainability
issues in variational quantum circuits (VQCs). We introduce the multi-chip
ensemble VQC framework, which partitions high-dimensional computations across
smaller quantum chips to enhance scalability, trainability, and noise
resilience. We show that this approach mitigates barren plateaus, reduces
quantum error bias and variance, and maintains robust generalization through
controlled entanglement. Designed to align with current and emerging quantum
hardware, the framework demonstrates strong potential for enabling scalable QML
on near-term devices, as validated by experiments on standard benchmark
datasets (MNIST, FashionMNIST, CIFAR-10) and a real-world dataset (PhysioNet
EEG).
[LINK]
http://arxiv.org/abs/2505.08782v1
[DATE]
2025-05-14 01:57:53+08:00
[CATEGORIES]
cs.LG
GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration
[AUTHORS]
Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, Priyadarshini Panda
[ABSTRACT]
We introduce GPTAQ, a novel finetuning-free quantization method for
compressing large-scale transformer architectures. Unlike the previous GPTQ
method, which independently calibrates each layer, we always match the
quantized layer’s output to the exact output in the full-precision model,
resulting in a scheme that we call asymmetric calibration. Such a scheme can
effectively reduce the quantization error accumulated in previous layers. We
analyze this problem using optimal brain compression to derive a closed-form
solution. The new solution explicitly minimizes the quantization error as well
as the accumulated asymmetry error. Furthermore, we utilize various techniques
to parallelize the solution calculation, including channel parallelization,
neuron decomposition, and Cholesky reformulation for matrix fusion. As a
result, GPTAQ is easy to implement, requiring only 20 more lines of code than
GPTQ while improving its performance under low-bit quantization. Remarkably, on
a single GPU, we quantize a 405B language transformer as well as EVA-02, the
top-ranked vision transformer, which achieves 90% pretraining ImageNet
accuracy. Code is available on GitHub.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2504.02692v3
[DATE]
2025-05-14 01:54:56+08:00
[CATEGORIES]
cs.LG
Generative Molecular Design with Steerable and Granular Synthesizability Control
[AUTHORS]
Jeff Guo, Víctor Sabanza-Gil, Zlatko Jončev, Jeremy S. Luterbacher, Philippe Schwaller
[ABSTRACT]
Synthesizability in small molecule generative design remains a bottleneck.
Existing works that do consider synthesizability can output predicted synthesis
routes for generated molecules. However, there has been minimal attention to
addressing the ease of synthesis and enabling flexibility to incorporate
desired reaction constraints. In this work, we propose a small molecule
generative design framework that enables steerable and granular
synthesizability control. Generated molecules satisfy arbitrary multi-parameter
optimization objectives with predicted synthesis routes containing pre-defined
allowed reactions, while optionally avoiding others. One can also enforce that
all reactions belong to a pre-defined set. We show the capability to
mix-and-match these reaction constraints across the most common medicinal
chemistry transformations. Next, we show how our framework can be used to
valorize industrial byproducts towards de novo optimized molecules. Going
further, we demonstrate how granular control over synthesizability constraints
can loosely mimic virtual screening of ultra-large make-on-demand libraries.
Using only a single GPU, we generate and dock 15k molecules to identify
promising candidates in Freedom 4.0 constituting 142B make-on-demand molecules
(assessing only 0.00001% of the library). Generated molecules satisfying the
reaction constraints have > 90% exact match rate. Lastly, we benchmark our
framework against recent synthesizability-constrained generative models and
demonstrate the highest sample efficiency even when imposing the additional
constraint that all molecules must be synthesizable from a single reaction
type. The main theme is demonstrating that a pre-trained generalist molecular
generative model can be incentivized to generate property-optimized small
molecules under challenging synthesizability constraints through reinforcement
learning.
[LINK]
http://arxiv.org/abs/2505.08774v1
[DATE]
2025-05-14 01:53:54+08:00
[CATEGORIES]
cs.LG
SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models
[AUTHORS]
Suhan Guo, Jiahong Deng, Mengjun Yi, Furao Shen, Jian Zhao
[ABSTRACT]
Attention-based architectures have achieved superior performance in
multivariate time series forecasting but are computationally expensive.
Techniques such as patching and adaptive masking have been developed to reduce
their sizes and latencies. In this work, we propose a structured pruning
method, SPAT ($\textbf{S}$ensitivity $\textbf{P}$runer for
$\textbf{At}$tention), which selectively removes redundant attention mechanisms
and yields highly effective models. Different from previous approaches, SPAT
aims to remove the entire attention module, which reduces the risk of
overfitting and enables speed-up without demanding specialized hardware. We
propose a dynamic sensitivity metric, $\textbf{S}$ensitivity
$\textbf{E}$nhanced $\textbf{N}$ormalized $\textbf{D}$ispersion (SEND) that
measures the importance of each attention module during the pre-training phase.
Experiments on multivariate datasets demonstrate that SPAT-pruned models
achieve reductions of 2.842% in MSE, 1.996% in MAE, and 35.274% in FLOPs.
Furthermore, SPAT-pruned models outperform existing lightweight, Mamba-based
and LLM-based SOTA methods in both standard and zero-shot inference,
highlighting the importance of retaining only the most effective attention
mechanisms. We have made our code publicly available at
https://anonymous.4open.science/r/SPAT-6042.
[LINK]
http://arxiv.org/abs/2505.08768v1
[DATE]
2025-05-14 01:39:31+08:00
[CATEGORIES]
cs.LG
PRIMER: Perception-Aware Robust Learning-based Multiagent Trajectory Planner
[AUTHORS]
Kota Kondo, Claudius T. Tewari, Andrea Tagliabue, Jesus Tordesillas, Parker C. Lusk, Mason B. Peterson, Jonathan P. How
[ABSTRACT]
In decentralized multiagent trajectory planners, agents need to communicate
and exchange their positions to generate collision-free trajectories. However,
due to localization errors/uncertainties, trajectory deconfliction can fail
even if trajectories are perfectly shared between agents. To address this
issue, we first present PARM and PARM*, perception-aware, decentralized,
asynchronous multiagent trajectory planners that enable a team of agents to
navigate uncertain environments while deconflicting trajectories and avoiding
obstacles using perception information. PARM* differs from PARM as it is less
conservative, using more computation to find closer-to-optimal solutions. While
these methods achieve state-of-the-art performance, they suffer from high
computational costs as they need to solve large optimization problems onboard,
making it difficult for agents to replan at high rates. To overcome this
challenge, we present our second key contribution, PRIMER, a learning-based
planner trained with imitation learning (IL) using PARM* as the expert
demonstrator. PRIMER leverages the low computational requirements at deployment
of neural networks and achieves a computation speed up to 5500 times faster
than optimization-based approaches.
[COMMENTS]
7 pages, 3 figures
[LINK]
http://arxiv.org/abs/2406.10060v4
[DATE]
2025-05-14 01:18:07+08:00
[CATEGORIES]
cs.LG
Towards Foundation Models for Experimental Readout Systems Combining Discrete and Continuous Data
[AUTHORS]
James Giroux, Cristiano Fanelli
[ABSTRACT]
We present a (proto) Foundation Model for Nuclear Physics, capable of
operating on low-level detector inputs from Imaging Cherenkov Detectors at the
future Electron Ion Collider. To address limitations in existing next-token
prediction approaches, namely resolution loss from VQ-VAE tokenization and lack
of conditional generation, we propose three key innovations: (i) separate
vocabularies for discrete spatial features and continuous variates, combined
via Causal Multi-Head Cross-Attention (CMHCA), (ii) continuous kinematic
conditioning through prepended context embeddings, and (iii) scalable and
simple, high-resolution continuous variate tokenization without joint
vocabulary inflation. Our model enables fast, high-fidelity generation of pixel
and time sequences for Cherenkov photons, validated through closure tests in
the High Performance DIRC. We also show that our model generalizes to
reconstruction tasks such as pion and kaon identification, where we demonstrate
its ability to leverage fine-tuning.
[COMMENTS]
19 pages; 14 figures
[LINK]
http://arxiv.org/abs/2505.08736v1
[DATE]
2025-05-14 00:49:45+08:00
[CATEGORIES]
cs.LG
Deep Representation Learning for Unsupervised Clustering of Myocardial Fiber Trajectories in Cardiac Diffusion Tensor Imaging
[AUTHORS]
Mohini Anand, Xavier Tricoche
[ABSTRACT]
Understanding the complex myocardial architecture is critical for diagnosing
and treating heart disease. However, existing methods often struggle to
accurately capture this intricate structure from Diffusion Tensor Imaging (DTI)
data, particularly due to the lack of ground truth labels and the ambiguous,
intertwined nature of fiber trajectories. We present a novel deep learning
framework for unsupervised clustering of myocardial fibers, providing a
data-driven approach to identifying distinct fiber bundles. We uniquely combine
a Bidirectional Long Short-Term Memory network to capture local sequential
information along fibers, with a Transformer autoencoder to learn global shape
features, with pointwise incorporation of essential anatomical context.
Clustering these representations using a density-based algorithm identifies 33
to 62 robust clusters, successfully capturing the subtle distinctions in fiber
trajectories with varying levels of granularity. Our framework offers a new,
flexible, and quantitative way to analyze myocardial structure, achieving a
level of delineation that, to our knowledge, has not been previously achieved,
with potential applications in improving surgical planning, characterizing
disease-related remodeling, and ultimately, advancing personalized cardiac
care.
[COMMENTS]
10 pages, 5 figures. An extended journal manuscript is in preparation
[LINK]
http://arxiv.org/abs/2504.01953v2
[DATE]
2025-05-14 00:47:56+08:00
[CATEGORIES]
cs.LG
Preference Optimization for Combinatorial Optimization Problems
[AUTHORS]
Mingjun Pan, Guanquan Lin, You-Wei Luo, Bin Zhu, Zhien Dai, Lijun Sun, Chun Yuan
[ABSTRACT]
Reinforcement Learning (RL) has emerged as a powerful tool for neural
combinatorial optimization, enabling models to learn heuristics that solve
complex problems without requiring expert knowledge. Despite significant
progress, existing RL approaches face challenges such as diminishing reward
signals and inefficient exploration in vast combinatorial action spaces,
leading to inefficiency. In this paper, we propose Preference Optimization, a
novel method that transforms quantitative reward signals into qualitative
preference signals via statistical comparison modeling, emphasizing the
superiority among sampled solutions. Methodologically, by reparameterizing the
reward function in terms of policy and utilizing preference models, we
formulate an entropy-regularized RL objective that aligns the policy directly
with preferences while avoiding intractable computations. Furthermore, we
integrate local search techniques into the fine-tuning rather than
post-processing to generate high-quality preference pairs, helping the policy
escape local optima. Empirical results on various benchmarks, such as the
Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem
(CVRP) and the Flexible Flow Shop Problem (FFSP), demonstrate that our method
significantly outperforms existing RL algorithms, achieving superior
convergence efficiency and solution quality.
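
To make the idea of turning reward comparisons into preference signals
concrete, here is a minimal sketch of a Bradley-Terry style preference loss in
which the implicit reward of a solution is a scaled log-probability under the
policy. This is one common preference model, assumed here for illustration;
the paper's exact objective, the scaling parameter alpha, and the function
name are not taken from the abstract.

```python
import math

def preference_loss(logp_better, logp_worse, alpha=1.0):
    """-log sigmoid(alpha * (logp_better - logp_worse)), computed stably.
    logp_* are policy log-probabilities of two sampled solutions, where
    'better' is the one with the lower tour length / cost."""
    margin = alpha * (logp_better - logp_worse)
    # log(1 + exp(-margin)) without overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities of two sampled solutions.
agrees = preference_loss(-1.0, -3.0)     # policy already prefers the better one
disagrees = preference_loss(-3.0, -1.0)  # policy prefers the worse one
neutral = preference_loss(-2.0, -2.0)    # no preference: loss is log(2)
```

Only the ordering of the two solutions enters the loss, so the signal does not
shrink when the numerical reward gap between good solutions becomes small.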
[COMMENTS]
This paper has been accepted by ICML 2025
[LINK]
http://arxiv.org/abs/2505.08735v1
[DATE]
2025-05-14 00:47:00+08:00
[CATEGORIES]
cs.LG
PWC-MoE: Privacy-Aware Wireless Collaborative Mixture of Experts
[AUTHORS]
Yang Su, Na Yan, Yansha Deng, Robert Schober
[ABSTRACT]
Large language models (LLMs) hosted on cloud servers alleviate the
computational and storage burdens on local devices but raise privacy concerns
due to sensitive data transmission and require substantial communication
bandwidth, which is challenging in constrained environments. In contrast, small
language models (SLMs) running locally enhance privacy but suffer from limited
performance on complex tasks. To balance computational cost, performance, and
privacy protection under bandwidth constraints, we propose a privacy-aware
wireless collaborative mixture of experts (PWC-MoE) framework. Specifically,
PWC-MoE employs a sparse privacy-aware gating network to dynamically route
sensitive tokens to privacy experts located on local clients, while
non-sensitive tokens are routed to non-privacy experts located at the remote
base station. To achieve computational efficiency, the gating network ensures
that each token is dynamically routed to and processed by only one expert. To
enhance scalability and prevent overloading of specific experts, we introduce a
group-wise load-balancing mechanism for the gating network that evenly
distributes sensitive tokens among privacy experts and non-sensitive tokens
among non-privacy experts. To adapt to bandwidth constraints while preserving
model performance, we propose a bandwidth-adaptive and importance-aware token
offloading scheme. This scheme incorporates an importance predictor to evaluate
the importance scores of non-sensitive tokens, prioritizing the most important
tokens for transmission to the base station based on their predicted importance
and the available bandwidth. Experiments demonstrate that the PWC-MoE framework
effectively preserves privacy and maintains high performance even in
bandwidth-constrained environments, offering a practical solution for deploying
LLMs in privacy-sensitive and bandwidth-limited scenarios.
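
The routing logic described above can be sketched roughly as follows. The
token fields, the function name, and the handling of tokens that exceed the
bandwidth budget (they are simply reported as skipped) are all assumptions for
illustration; the load-balancing mechanism and the learned importance
predictor are not modeled.

```python
def route_tokens(tokens, budget):
    """Top-1 routing with a privacy split and a bandwidth budget.

    tokens: list of dicts with keys 'id', 'sensitive' (bool), 'scores'
            (gate scores over the experts of the token's own group) and
            'importance' (used only for non-sensitive tokens).
    budget: maximum number of non-sensitive tokens sent to the base station.
    """
    def top1(scores):
        return max(range(len(scores)), key=scores.__getitem__)

    local, candidates = {}, []
    for t in tokens:
        if t["sensitive"]:
            local[t["id"]] = top1(t["scores"])  # stays on the local client
        else:
            candidates.append(t)

    # Transmit the most important non-sensitive tokens first.
    candidates.sort(key=lambda t: t["importance"], reverse=True)
    remote = {t["id"]: top1(t["scores"]) for t in candidates[:budget]}
    skipped = [t["id"] for t in candidates[budget:]]
    return local, remote, skipped

tokens = [
    {"id": "t0", "sensitive": True, "scores": [0.2, 0.8], "importance": 0.0},
    {"id": "t1", "sensitive": False, "scores": [0.6, 0.3, 0.1], "importance": 0.9},
    {"id": "t2", "sensitive": False, "scores": [0.1, 0.2, 0.7], "importance": 0.4},
    {"id": "t3", "sensitive": False, "scores": [0.3, 0.5, 0.2], "importance": 0.7},
]
local, remote, skipped = route_tokens(tokens, budget=2)
```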
[LINK]
http://arxiv.org/abs/2505.08719v1
[DATE]
2025-05-14 00:27:07+08:00
[CATEGORIES]
cs.LG
Open-Source LLM-Driven Federated Transformer for Predictive IoV Management
[AUTHORS]
Yazan Otoum, Arghavan Asad, Ishtiaq Ahmad
[ABSTRACT]
The proliferation of connected vehicles within the Internet of Vehicles (IoV)
ecosystem presents critical challenges in ensuring scalable, real-time, and
privacy-preserving traffic management. Existing centralized IoV solutions often
suffer from high latency, limited scalability, and reliance on proprietary
Artificial Intelligence (AI) models, creating significant barriers to
widespread deployment, particularly in dynamic and privacy-sensitive
environments. Meanwhile, integrating Large Language Models (LLMs) in vehicular
systems remains underexplored, especially concerning prompt optimization and
effective utilization in federated contexts. To address these challenges, we
propose the Federated Prompt-Optimized Traffic Transformer (FPoTT), a novel
framework that leverages open-source LLMs for predictive IoV management. FPoTT
introduces a dynamic prompt optimization mechanism that iteratively refines
textual prompts to enhance trajectory prediction. The architecture employs a
dual-layer federated learning paradigm, combining lightweight edge models for
real-time inference with cloud-based LLMs to retain global intelligence. A
Transformer-driven synthetic data generator is incorporated to augment training
with diverse, high-fidelity traffic scenarios in the Next Generation Simulation
(NGSIM) format. Extensive evaluations demonstrate that FPoTT, utilizing
EleutherAI Pythia-1B, achieves 99.86% prediction accuracy on real-world data
while maintaining high performance on synthetic datasets. These results
underscore the potential of open-source LLMs in enabling secure, adaptive, and
scalable IoV management, offering a promising alternative to proprietary
solutions in smart mobility ecosystems.
[COMMENTS]
Preprint version; submitted for academic peer review
[LINK]
http://arxiv.org/abs/2505.00651v2
[DATE]
2025-05-14 00:24:54+08:00
[CATEGORIES]
cs.LG
Wilsonian Renormalization of Neural Network Gaussian Processes
[AUTHORS]
Jessica N. Howard, Ro Jefferson, Anindita Maiti, Zohar Ringel
[ABSTRACT]
Separating relevant and irrelevant information is key to any modeling process
or scientific inquiry. Theoretical physics offers a powerful tool for achieving
this in the form of the renormalization group (RG). Here we demonstrate a
practical approach to performing Wilsonian RG in the context of Gaussian
Process (GP) Regression. We systematically integrate out the unlearnable modes
of the GP kernel, thereby obtaining an RG flow of the GP in which the data sets
the IR scale. In simple cases, this results in a universal flow of the ridge
parameter, which becomes input-dependent in the richer scenario in which
non-Gaussianities are included. In addition to being analytically tractable,
this approach goes beyond structural analogies between RG and neural networks
by providing a natural connection between RG flow and learnable vs. unlearnable
modes. Studying such flows may improve our understanding of feature learning in
deep neural networks, and enable us to identify potential universality classes
in these models.
[COMMENTS]
Accepted by Machine Learning: Science and Technology; 45 pages, 6
figures; expanded neural scaling law results with empirical experiments,
clarified intermediate derivation steps, added references, added appendices
[LINK]
http://arxiv.org/abs/2405.06008v3
[DATE]
2025-05-14 00:20:02+08:00
[CATEGORIES]
cs.LG
Improved Algorithms for Differentially Private Language Model Alignment
[AUTHORS]
Keyu Chen, Hao Tang, Qinglin Liu, Yizhao Xu
[ABSTRACT]
Language model alignment is crucial for ensuring that large language models
(LLMs) align with human preferences, yet it often involves sensitive user data,
raising significant privacy concerns. While prior work has integrated
differential privacy (DP) with alignment techniques, their performance remains
limited. In this paper, we propose novel algorithms for privacy-preserving
alignment and rigorously analyze their effectiveness across varying privacy
budgets and models. Our framework can be deployed on two celebrated alignment
techniques, namely direct preference optimization (DPO) and reinforcement
learning from human feedback (RLHF). Through systematic experiments on
large-scale language models, we demonstrate that our approach achieves
state-of-the-art performance. Notably, one of our algorithms, DP-AdamW,
combined with DPO, surpasses existing methods, improving alignment quality by
up to 15% under moderate privacy budgets ($\epsilon$=2-5). We further
investigate the interplay between privacy guarantees, alignment efficacy, and
computational demands, providing practical guidelines for optimizing these
trade-offs.
[LINK]
http://arxiv.org/abs/2505.08849v1
[DATE]
2025-05-14 00:18:59+08:00
[CATEGORIES]
cs.LG
Contrastive Normalizing Flows for Uncertainty-Aware Parameter Estimation
[AUTHORS]
Ibrahim Elsharkawy, Yonatan Kahn
[ABSTRACT]
Estimating physical parameters from data is a crucial application of machine
learning (ML) in the physical sciences. However, systematic uncertainties, such
as detector miscalibration, induce data distribution distortions that can erode
statistical precision. In both high-energy physics (HEP) and broader ML
contexts, achieving uncertainty-aware parameter estimation under these domain
shifts remains an open problem. In this work, we address this challenge of
uncertainty-aware parameter estimation for a broad set of tasks critical for
HEP. We introduce a novel approach based on Contrastive Normalizing Flows
(CNFs), which achieves top performance on the HiggsML Uncertainty Challenge
dataset. Building on the insight that a binary classifier can approximate the
model parameter likelihood ratio, we address the practical limitations of
expressivity and the high cost of simulating high-dimensional parameter grids
by embedding data and parameters in a learned CNF mapping. This mapping yields
a tunable contrastive distribution that enables robust classification under
shifted data distributions. Through a combination of theoretical analysis and
empirical evaluations, we demonstrate that CNFs, when coupled with a classifier
and established frequentist techniques, provide principled parameter estimation
and uncertainty quantification through classification that is robust to data
distribution distortions.
[COMMENTS]
9 + 8 pages, 2 tables, 10 figures; Contribution to the FAIR Universe
Higgs Uncertainty Challenge, winning first place ex aequo
[LINK]
http://arxiv.org/abs/2505.08709v1
[DATE]
2025-05-14 00:14:34+08:00
[CATEGORIES]
cs.LG
Adaptive Schema-aware Event Extraction with Retrieval-Augmented Generation
[AUTHORS]
Sheng Liang, Hang Lv, Zhihao Wen, Yaxiong Wu, Yongyue Zhang, Hao Wang, Yong Liu
[ABSTRACT]
Event extraction (EE) is a fundamental task in natural language processing
(NLP) that involves identifying and extracting event information from
unstructured text. Effective EE in real-world scenarios requires two key steps:
selecting appropriate schemas from hundreds of candidates and executing the
extraction process. Existing research exhibits two critical gaps: (1) the rigid
schema fixation in existing pipeline systems, and (2) the absence of benchmarks
for evaluating joint schema matching and extraction. Although large language
models (LLMs) offer potential solutions, their schema hallucination tendencies
and context window limitations pose challenges for practical deployment. In
response, we propose Adaptive Schema-aware Event Extraction (ASEE), a novel
paradigm combining schema paraphrasing with schema retrieval-augmented
generation. ASEE adeptly retrieves paraphrased schemas and accurately generates
targeted structures. To facilitate rigorous evaluation, we construct the
Multi-Dimensional Schema-aware Event Extraction (MD-SEE) benchmark, which
systematically consolidates 12 datasets across diverse domains, complexity
levels, and language settings. Extensive evaluations on MD-SEE show that our
proposed ASEE demonstrates strong adaptability across various scenarios,
significantly improving the accuracy of event extraction.
[COMMENTS]
15 pages, 3 figures
[LINK]
http://arxiv.org/abs/2505.08690v1
[DATE]
2025-05-13 23:47:54+08:00
[CATEGORIES]
cs.CL
Scaling Context, Not Parameters: Training a Compact 7B Language Model for Efficient Long-Context Processing
[AUTHORS]
Chen Wu, Yin Song
[COMMENTS]
8 pages, 6 figures, ACL 2025 (Industry Track)
[LINK]
http://arxiv.org/abs/2505.08651v1
[DATE]
2025-05-13 23:13:15+08:00
[CATEGORIES]
cs.CL
cs.LG
Integrating Single-Cell Foundation Models with Graph Neural Networks for Drug Response Prediction
[AUTHORS]
Till Rossner, Ziteng Li, Jonas Balke, Nikoo Salehfard, Tom Seifert, Ming Tang
[ABSTRACT]
AI-driven drug response prediction holds great promise for advancing
personalized cancer treatment. However, the inherent heterogeneity of cancer and
the high cost of data generation make accurate prediction challenging. In this
study, we investigate whether incorporating the pretrained foundation model
scGPT can enhance the performance of existing drug response prediction
frameworks. Our approach builds on the DeepCDR framework, which encodes drug
representations from graph structures and cell representations from multi-omics
profiles. We adapt this framework by leveraging scGPT to generate enriched cell
representations using its pretrained knowledge to compensate for the limited amount
of data. We evaluate our modified framework on IC$_{50}$ values using the Pearson
correlation coefficient (PCC) and a leave-one-drug-out validation strategy,
comparing it against the original DeepCDR framework and a prior
scFoundation-based approach. scGPT not only outperforms previous approaches but
also exhibits greater training stability, highlighting the value of leveraging
scGPT-derived knowledge in this domain.
[COMMENTS]
8 pages, 6 figures
[LINK]
http://arxiv.org/abs/2504.14361v2
[DATE]
2025-05-13 23:04:50+08:00
[CATEGORIES]
cs.LG
cs.CL
TRAIL: Trace Reasoning and Agentic Issue Localization
[AUTHORS]
Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian
[ABSTRACT]
The increasing adoption of agentic workflows across diverse domains brings a
critical need to scalably and systematically evaluate the complex traces these
systems generate. Current evaluation methods depend on manual, domain-specific
human analysis of lengthy workflow traces - an approach that does not scale
with the growing complexity and volume of agentic outputs. Error analysis in
these settings is further complicated by the interplay of external tool outputs
and language model reasoning, making it more challenging than traditional
software debugging. In this work, we (1) articulate the need for robust and
dynamic evaluation methods for agentic workflow traces, (2) introduce a formal
taxonomy of error types encountered in agentic systems, and (3) present a set
of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and
grounded in established agentic benchmarks. To ensure ecological validity, we
curate traces from both single and multi-agent systems, focusing on real-world
applications such as software engineering and open-world information retrieval.
Our evaluations reveal that modern long context LLMs perform poorly at trace
debugging, with the best Gemini-2.5-pro model scoring a mere 11% on TRAIL. Our
dataset and code are made publicly available to support and accelerate future
research in scalable evaluation for agentic workflows.
[COMMENTS]
Dataset link: https://huggingface.co/datasets/PatronusAI/TRAIL
[LINK]
http://arxiv.org/abs/2505.08638v1
[DATE]
2025-05-13 22:55:31+08:00
[CATEGORIES]
cs.CL
Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
[AUTHORS]
Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim
[ABSTRACT]
Text-to-image generative models like DALL-E and Stable Diffusion have
revolutionized visual content creation across various applications, including
advertising, personalized media, and design prototyping. However, crafting
effective textual prompts to guide these models remains challenging, often
requiring extensive trial and error. Existing prompt inversion approaches, such
as soft and hard prompt techniques, are not very effective because of their limited
interpretability and incoherent prompt generation. To address these issues, we
propose Visually Guided Decoding (VGD), a gradient-free approach that leverages
large language models (LLMs) and CLIP-based guidance to generate coherent and
semantically aligned prompts. In essence, VGD utilizes the robust text
generation capabilities of LLMs to produce human-readable prompts. Further, by
employing CLIP scores to ensure alignment with user-specified visual concepts,
VGD enhances the interpretability, generalization, and flexibility of prompt
generation without the need for additional training. Our experiments
demonstrate that VGD outperforms existing prompt inversion techniques in
generating understandable and contextually relevant prompts, facilitating more
intuitive and controllable interactions with text-to-image models.
[COMMENTS]
ICLR 2025
[LINK]
http://arxiv.org/abs/2505.08622v1
[DATE]
2025-05-13 22:40:22+08:00
[CATEGORIES]
cs.CL
SMI: An Information-Theoretic Metric for Predicting Model Knowledge Solely from Pre-Training Signals
[AUTHORS]
Changhao Jiang, Ming Zhang, Junjie Ye, Xiaoran Fan, Yifei Cao, Jiajun Sun, Zhiheng Xi, Shihan Dou, Yi Dong, Yujiong Shen, Jingqi Tong, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang
[ABSTRACT]
The GPT-4 technical report highlights the possibility of predicting model
performance on downstream tasks using only pre-training signals, though
detailed methodologies are absent. Such predictive capabilities are essential
for resource-efficient pre-training and the construction of task-aligned
datasets. In this paper, we aim to predict performance in closed-book question
answering (QA), a vital downstream task indicative of a model’s internal
knowledge. We address three primary challenges: (1) limited access to and
understanding of pre-training corpora, (2) limitations of current evaluation
methods for pre-trained models, and (3) limitations of frequency-based metrics
in predicting model performance. In response to these challenges, we conduct
large-scale retrieval and semantic analysis across the pre-training corpora of
21 publicly available and 3 custom-trained large language models. Subsequently,
we develop a multi-template QA evaluation framework incorporating paraphrased
question variants. Building on these foundations, we propose Size-dependent
Mutual Information (SMI), an information-theoretic metric that linearly
correlates pre-training data characteristics, model size, and QA accuracy,
without requiring any additional training. The experimental results demonstrate
that SMI outperforms co-occurrence-based baselines, achieving $R^2$ > 0.75 on
models with over one billion parameters. Theoretical analysis further reveals
the marginal benefits of scaling model size and optimizing data, indicating
that the upper limit of specific QA task accuracy is approximately 80%. Our
project is available at https://github.com/yuhui1038/SMI.
[LINK]
http://arxiv.org/abs/2502.04066v2
[DATE]
2025-05-13 22:19:37+08:00
[CATEGORIES]
cs.CL
CursorCore: Assist Programming through Aligning Anything
[AUTHORS]
Hao Jiang, Qi Liu, Rui Li, Shengyu Ye, Shijin Wang
[ABSTRACT]
Large language models have been successfully applied to programming
assistance tasks, such as code completion, code insertion, and instructional
code editing. However, these applications remain insufficiently automated and
struggle to effectively integrate various types of information during the
programming process, including coding history, current code, and user
instructions. In this work, we propose a new conversational framework that
comprehensively integrates these information sources, collect data to train our
models and evaluate their performance. Firstly, to thoroughly evaluate how well
models align with different types of information and the quality of their
outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to
comprehensively assess the performance of models in programming assistance
tasks. Then, for data collection, we develop a data generation pipeline,
Programming-Instruct, which synthesizes training data from diverse sources,
such as GitHub and online judge platforms. This pipeline can automatically
generate various types of messages throughout the programming process. Finally,
using this pipeline, we generate 219K samples, fine-tune multiple models, and
develop the CursorCore series. We show that CursorCore outperforms other models
of comparable size. This framework unifies applications such as inline chat and
automated editing, and contributes to the advancement of coding assistants. Code,
models and data are freely available at
https://github.com/TechxGenus/CursorCore.
[LINK]
http://arxiv.org/abs/2410.07002v3
[DATE]
2025-05-13 22:13:13+08:00
[CATEGORIES]
cs.CL
Round and Round We Go! What makes Rotary Positional Encodings useful?
[AUTHORS]
Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković
[ABSTRACT]
Positional Encodings (PEs) are a critical component of Transformer-based
Large Language Models (LLMs), providing the attention mechanism with important
sequence-position information. One of the most popular types of encoding used
today in LLMs is Rotary Positional Encodings (RoPE), which rotate the queries
and keys based on their relative distance. A common belief is that RoPE is
useful because it helps to decay token dependency as relative distance
increases. In this work, we argue that this is unlikely to be the core reason.
We study the internals of a trained Gemma 7B model to understand how RoPE is
being used at a mechanical level. We find that Gemma learns to use RoPE to
construct robust “positional” attention patterns by exploiting the highest
frequencies. We also find that, in general, Gemma greatly prefers to use the
lowest frequencies of RoPE, which we suspect are used to carry semantic
information. We mathematically prove interesting behaviours of RoPE and conduct
experiments to verify our findings, proposing a modification of RoPE that fixes
some highlighted issues and improves performance. We believe that this work
represents an interesting step in better understanding PEs in LLMs, which we
believe holds crucial value for scaling LLMs to large sizes and context
lengths.
[LINK]
http://arxiv.org/abs/2410.06205v3
[DATE]
2025-05-13 22:11:59+08:00
[CATEGORIES]
cs.CL
cs.LG
Enhancing Thyroid Cytology Diagnosis with RAG-Optimized LLMs and Pathology Foundation Models
[AUTHORS]
Hussien Al-Asi, Jordan P Reynolds, Shweta Agarwal, Bryan J Dangott, Aziza Nassar, Zeynettin Akkus
[ABSTRACT]
Advancements in artificial intelligence (AI) are transforming pathology by
integrating large language models (LLMs) with retrieval-augmented generation
(RAG) and domain-specific foundation models. This study explores the
application of RAG-enhanced LLMs coupled with pathology foundation models for
thyroid cytology diagnosis, addressing challenges in cytological
interpretation, standardization, and diagnostic accuracy. By leveraging a
curated knowledge base, RAG facilitates dynamic retrieval of relevant case
studies, diagnostic criteria, and expert interpretation, improving the
contextual understanding of LLMs. Meanwhile, pathology foundation models,
trained on high-resolution pathology images, refine feature extraction and
classification capabilities. The fusion of these AI-driven approaches enhances
diagnostic consistency, reduces variability, and supports pathologists in
distinguishing benign from malignant thyroid lesions. Our results demonstrate
that integrating RAG with pathology-specific LLMs significantly improves
diagnostic efficiency and interpretability, paving the way for AI-assisted
thyroid cytopathology, with foundation model UNI achieving AUC 0.73-0.93 for
correct prediction of surgical pathology diagnosis from thyroid cytology
samples.
[LINK]
http://arxiv.org/abs/2505.08590v1
[DATE]
2025-05-13 22:01:35+08:00
[CATEGORIES]
cs.CL
Scaling Laws for Floating Point Quantization Training
[AUTHORS]
Xingwu Sun, Shuaipeng Li, Ruobing Xie, Weidong Han, Kan Wu, Zhen Yang, Yixing Li, An Wang, Shuai Li, Jinbao Xue, Yu Cheng, Yangyu Tao, Zhanhui Kang, Chengzhong Xu, Di Wang, Jie Jiang
[ABSTRACT]
Low-precision training is considered an effective strategy for reducing both
training and downstream inference costs. Previous scaling laws for precision
mainly focus on integer quantization and pay less attention to the
constituents of floating-point (FP) quantization, and thus cannot fit the
LLM losses well in this scenario. In contrast, while FP quantization training is
more commonly implemented in production, research on it has been relatively
superficial. In this paper, we thoroughly explore the effects of FP
quantization targets, exponent bits, mantissa bits, and the calculation
granularity of the scaling factor in FP quantization training performance of
LLM models. In addition to an accurate FP quantization unified scaling law, we
also provide valuable suggestions for the community: (1) Exponent bits
contribute slightly more to the model performance than mantissa bits. We
provide the optimal exponent-mantissa bit ratio for different bit numbers,
which is available for future reference by hardware manufacturers; (2) We
discover the formation of the critical data size in low-precision LLM training.
Training data beyond the critical data size will instead
degrade LLM performance; (3) The optimal FP quantization precision is
directly proportional to the computational power, but within a wide
computational power range. We estimate that the best cost-performance precision
should lie between 4-8 bits.
[LINK]
http://arxiv.org/abs/2501.02423v2
[DATE]
2025-05-13 21:19:32+08:00
[CATEGORIES]
cs.LG
cs.CL
Are We Paying Attention to Her? Investigating Gender Disambiguation and Attention in Machine Translation
[AUTHORS]
Chiara Manna, Afra Alishahi, Frédéric Blain, Eva Vanmassenhove
[ABSTRACT]
While gender bias in modern Neural Machine Translation (NMT) systems has
received much attention, traditional evaluation metrics do not fully capture
the extent to which these systems integrate contextual gender cues. We propose
a novel evaluation metric called Minimal Pair Accuracy (MPA), which measures
the reliance of models on gender cues for gender disambiguation. MPA is
designed to go beyond surface-level gender accuracy metrics by focusing on
whether models adapt to gender cues in minimal pairs – sentence pairs that
differ solely in the gendered pronoun, namely the explicit indicator of the
target entity’s gender in the source language (EN). We evaluate a number of NMT
models on the English-Italian (EN–IT) language pair using this metric and show
that they ignore available gender cues in most cases in favor of (statistical)
stereotypical gender interpretation. We further show that in anti-stereotypical
cases, these models tend to more consistently take masculine gender cues into
account while ignoring the feminine cues. Furthermore, we analyze the attention
head weights in the encoder component and show that while all models encode
gender information to some extent, masculine cues elicit a more diffused
response compared to the more concentrated and specialized responses to
feminine gender cues.
[LINK]
http://arxiv.org/abs/2505.08546v1
[DATE]
2025-05-13 21:17:23+08:00
[CATEGORIES]
cs.CL
Can (A)I Change Your Mind?
[AUTHORS]
Miriam Havin, Timna Wharton Kleinman, Moran Koren, Yaniv Dover, Ariel Goldstein
[ABSTRACT]
The increasing integration of large language model (LLM)-based
conversational agents into everyday life raises critical cognitive and social
questions about their potential to influence human opinions. Although previous
studies have shown that LLM-based agents can generate persuasive content, these
typically involve controlled English-language settings. Addressing this, our
preregistered study explored LLMs’ persuasive capabilities in more ecological,
unconstrained scenarios, examining both static (written paragraphs) and dynamic
(conversations via Telegram) interaction types. Conducted entirely in Hebrew
with 200 participants, the study assessed the persuasive effects of both LLM
and human interlocutors on controversial civil policy topics. Results indicated
that participants adopted LLM and human perspectives similarly, with
significant opinion changes evident across all conditions, regardless of
interlocutor type or interaction mode. Confidence levels increased
significantly in most scenarios. These findings demonstrate LLM-based agents’
robust persuasive capabilities across diverse sources and settings,
highlighting their potential impact on shaping public opinions.
[COMMENTS]
Accepted to CogSci 2025
[LINK]
http://arxiv.org/abs/2503.01844v3
[DATE]
2025-05-13 20:45:16+08:00
[CATEGORIES]
cs.CL
Reassessing Graph Linearization for Sequence-to-sequence AMR Parsing: On the Advantages and Limitations of Triple-Based Encoding
[AUTHORS]
Jeongwoo Kang, Maximin Coavoux, Cédric Lopez, Didier Schwab
[ABSTRACT]
Sequence-to-sequence models are widely used to train Abstract Meaning
Representation (AMR; Banarescu et al., 2013) parsers. To train such models, AMR
graphs have to be linearized into a one-line text format. While Penman encoding
is typically used for this purpose, we argue that it has limitations: (1) for
deep graphs, some closely related nodes are located far apart in the linearized
text; (2) Penman’s tree-based encoding necessitates inverse roles to handle node
re-entrancy, doubling the number of relation types to predict. To address these
issues, we propose a triple-based linearization method and compare its
efficiency with Penman linearization. Although triples are well suited to
represent a graph, our results suggest room for improvement in triple encoding
to better compete with Penman’s concise and explicit representation of a nested
graph structure.
[COMMENTS]
published at Insights from Negative Results in NLP (workshop EMNLP
2025)
[LINK]
http://arxiv.org/abs/2505.08504v1
[DATE]
2025-05-13 20:36:02+08:00
[CATEGORIES]
cs.CL
Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?
[AUTHORS]
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang
[COMMENTS]
Accepted at ACL 2025 Industry Track
[LINK]
http://arxiv.org/abs/2505.08468v1
[DATE]
2025-05-13 19:50:08+08:00
[CATEGORIES]
cs.CL
Large Language Models Meet Stance Detection: A Survey of Tasks, Methods, Applications, Challenges and Future Directions
[AUTHORS]
Lata Pangtey, Anukriti Bhatnagar, Shubhi Bansal, Shahid Shafi Dar, Nagendra Kumar
[ABSTRACT]
Stance detection is essential for understanding subjective content across
various platforms such as social media, news articles, and online reviews.
Recent advances in Large Language Models (LLMs) have revolutionized stance
detection by introducing novel capabilities in contextual understanding,
cross-domain generalization, and multimodal analysis. Despite these
advances, existing surveys often lack comprehensive coverage of approaches
that specifically leverage LLMs for stance detection. To bridge this critical
gap, our review article conducts a systematic analysis of stance detection,
comprehensively examining recent advancements of LLMs transforming the field,
including foundational concepts, methodologies, datasets, applications, and
emerging challenges. We present a novel taxonomy for LLM-based stance detection
approaches, structured along three key dimensions: 1) learning methods,
including supervised, unsupervised, few-shot, and zero-shot; 2) data
modalities, such as unimodal, multimodal, and hybrid; and 3) target
relationships, encompassing in-target, cross-target, and multi-target
scenarios. Furthermore, we discuss the evaluation techniques and analyze
benchmark datasets and performance trends, highlighting the strengths and
limitations of different architectures. Key applications in misinformation
detection, political analysis, public health monitoring, and social media
moderation are discussed. Finally, we identify critical challenges such as
implicit stance expression, cultural biases, and computational constraints,
while outlining promising future directions, including explainable stance
reasoning, low-resource adaptation, and real-time deployment frameworks. Our
survey highlights emerging trends, open challenges, and future directions to
guide researchers and practitioners in developing next-generation stance
detection systems powered by large language models.
[LINK]
http://arxiv.org/abs/2505.08464v1
[DATE]
2025-05-13 19:47:49+08:00
[CATEGORIES]
cs.CL
cs.LG
RepCali: High Efficient Fine-tuning Via Representation Calibration in Latent Space for Pre-trained Language Models
[AUTHORS]
Fujun Zhang, XiangDong Su
[ABSTRACT]
Fine-tuning pre-trained language models (PLMs) has become a dominant paradigm
in applying PLMs to downstream tasks. However, with limited fine-tuning, PLMs
still struggle with the discrepancies between the representation obtained from
the PLMs’ encoder and the optimal input to the PLMs’ decoder. This paper
tackles this challenge by learning to calibrate the representation of PLMs in
the latent space. In the proposed representation calibration method (RepCali),
we integrate a specific calibration block to the latent space after the encoder
and use the calibrated output as the decoder input. The merits of the proposed
RepCali include its universality to all PLMs with encoder-decoder
architectures, its plug-and-play nature, and ease of implementation. Extensive
experiments on 25 PLM-based models across 8 tasks (including both English and
Chinese datasets) demonstrate that the proposed RepCali offers desirable
enhancements to PLMs (including LLMs) and significantly improves the
performance of downstream tasks. Comparison experiments across 4 benchmark
tasks indicate that RepCali is superior to the representative fine-tuning
baselines.
[COMMENTS]
13 pages, 4 figures
[LINK]
http://arxiv.org/abs/2505.08463v1
[DATE]
2025-05-13 19:47:00+08:00
[CATEGORIES]
cs.CL
IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation
[AUTHORS]
Kazuki Hayashi, Hidetaka Kamigaito, Shinya Kouda, Taro Watanabe
[ABSTRACT]
Retrieval-Augmented Generation (RAG) has emerged as a way to complement the
in-context knowledge of Large Language Models (LLMs) by integrating external
documents. However, real-world applications demand not only accuracy but also
interpretability. While dense retrieval methods provide high accuracy, they
lack interpretability; conversely, sparse retrieval methods offer transparency
but often fail to capture the full intent of queries due to their reliance on
keyword matching. To address these issues, we introduce IterKey, an LLM-driven
iterative keyword generation framework that enhances RAG via sparse retrieval.
IterKey consists of three LLM-driven stages: generating keywords for retrieval,
generating answers based on retrieved documents, and validating the answers. If
validation fails, the process iteratively repeats with refined keywords. Across
four QA tasks, experimental results show that IterKey achieves 5% to 20%
accuracy improvements over BM25-based RAG and simple baselines. Its performance
is comparable to dense retrieval-based RAG and prior iterative query refinement
methods using dense models. In summary, IterKey is a novel BM25-based approach
leveraging LLMs to iteratively refine RAG, effectively balancing accuracy with
interpretability.
[LINK]
http://arxiv.org/abs/2505.08450v1
[DATE]
2025-05-13 19:25:15+08:00
[CATEGORIES]
cs.CL
Optimizing Retrieval-Augmented Generation: Analysis of Hyperparameter Impact on Performance and Efficiency
[AUTHORS]
Adel Ammar, Anis Koubaa, Omer Nacar, Wadii Boulila
[ABSTRACT]
Large language models achieve high task performance yet often hallucinate or
rely on outdated knowledge. Retrieval-augmented generation (RAG) addresses
these gaps by coupling generation with external search. We analyse how
hyperparameters influence speed and quality in RAG systems, covering Chroma and
Faiss vector stores, chunking policies, cross-encoder re-ranking, and
temperature, and we evaluate six metrics: faithfulness, answer correctness,
answer relevancy, context precision, context recall, and answer similarity.
Chroma processes queries 13% faster, whereas Faiss yields higher retrieval
precision, revealing a clear speed-accuracy trade-off. Naive fixed-length
chunking with small windows and minimal overlap outperforms semantic
segmentation while remaining the quickest option. Re-ranking provides modest
gains in retrieval quality yet increases runtime by roughly a factor of 5, so
its usefulness depends on latency constraints. These results help practitioners
balance computational cost and accuracy when tuning RAG systems for
transparent, up-to-date responses. Finally, we re-evaluate the top
configurations with a corrective RAG workflow and show that their advantages
persist when the model can iteratively request additional evidence. We obtain a
near-perfect context precision (99%), which demonstrates that RAG systems can
achieve extremely high retrieval accuracy with the right combination of
hyperparameters, with significant implications for applications where retrieval
quality directly impacts downstream task performance, such as clinical decision
support in healthcare.
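For readers unfamiliar with the chunking policy mentioned above, a minimal sketch of naive fixed-length chunking with overlap follows. The function name, the token-list input, and the parameter names `size` and `overlap` are illustrative assumptions, not the configuration used in the paper.

```python
from typing import List


def chunk_fixed(tokens: List[str], size: int, overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of at most `size` tokens.

    Consecutive windows share `overlap` tokens.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks
```

For example, `chunk_fixed(list("abcdefghij"), size=4, overlap=1)` yields three windows, each sharing one token with its predecessor.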
[LINK]
http://arxiv.org/abs/2505.08445v1
[DATE]
2025-05-13 19:13:27+08:00
[CATEGORIES]
cs.LG
cs.CL
IndicSQuAD: A Comprehensive Multilingual Question Answering Dataset for Indic Languages
[AUTHORS]
Sharvi Endait, Ruturaj Ghatage, Aditya Kulkarni, Rajlaxmi Patil, Raviraj Joshi
[ABSTRACT]
The rapid progress in question-answering (QA) systems has predominantly
benefited high-resource languages, leaving Indic languages largely
underrepresented despite their vast native speaker base. In this paper, we
present IndicSQuAD, a comprehensive multi-lingual extractive QA dataset
covering nine major Indic languages, systematically derived from the SQuAD
dataset. Building on previous work with MahaSQuAD for Marathi, our approach
adapts and extends translation techniques to maintain high linguistic fidelity
and accurate answer-span alignment across diverse languages. IndicSQuAD
comprises extensive training, validation, and test sets for each language,
providing a robust foundation for model development. We evaluate baseline
performances using language-specific monolingual BERT models and the
multilingual MuRIL-BERT. The results indicate some challenges inherent in
low-resource settings. Moreover, our experiments suggest potential directions
for future work, including expanding to additional languages, developing
domain-specific datasets, and incorporating multimodal data. The dataset and
models are publicly shared at https://github.com/l3cube-pune/indic-nlp
[LINK]
http://arxiv.org/abs/2505.03688v2
[DATE]
2025-05-13 19:11:55+08:00
[CATEGORIES]
cs.CL
cs.LG
Query-driven Document-level Scientific Evidence Extraction from Biomedical Studies
[AUTHORS]
Massimiliano Pronesti, Joao Bettencourt-Silva, Paul Flanagan, Alessandra Pascale, Oisin Redmond, Anya Belz, Yufang Hou
[ABSTRACT]
Extracting scientific evidence from biomedical studies for clinical research
questions (e.g., Does stem cell transplantation improve quality of life in
patients with medically refractory Crohn’s disease compared to placebo?) is a
crucial step in synthesising biomedical evidence. In this paper, we focus on
the task of document-level scientific evidence extraction for clinical
questions with conflicting evidence. To support this task, we create a dataset
called CochraneForest, leveraging forest plots from Cochrane systematic
reviews. It comprises 202 annotated forest plots, associated clinical research
questions, full texts of studies, and study-specific conclusions. Building on
CochraneForest, we propose URCA (Uniform Retrieval Clustered Augmentation), a
retrieval-augmented generation framework designed to tackle the unique
challenges of evidence extraction. Our experiments show that URCA outperforms
the best existing methods by up to 10.3% in F1 score on this task. However, the
results also underscore the complexity of CochraneForest, establishing it as a
challenging testbed for advancing automated evidence synthesis systems.
[LINK]
http://arxiv.org/abs/2505.06186v2
[DATE]
2025-05-13 18:50:45+08:00
[CATEGORIES]
cs.CL
TUMS: Enhancing Tool-use Abilities of LLMs with Multi-structure Handlers
[AUTHORS]
Aiyao He, Sijia Cui, Shuai Xu, Yanna Wang, Bo Xu
[ABSTRACT]
Recently, large language models (LLMs) have played an increasingly important
role in solving a wide range of NLP tasks, leveraging their capabilities in
natural language understanding and generation. Integration with external tools
further enhances LLMs’ effectiveness, providing more precise, timely, and
specialized responses. However, LLMs still encounter difficulties with
non-executable actions and improper actions, which are primarily attributed to
incorrect parameters. The process of generating parameters by LLMs is confined
to the tool level, employing the coarse-grained strategy without considering
the different difficulties of various tools. To address this issue, we propose
TUMS, a novel framework designed to enhance the tool-use capabilities of LLMs
by transforming tool-level processing into parameter-level processing.
Specifically, our framework consists of four key components: (1) an intent
recognizer that identifies the user’s intent to help LLMs better understand the
task; (2) a task decomposer that breaks down complex tasks into simpler
subtasks, each involving a tool call; (3) a subtask processor equipped with
multi-structure handlers to generate accurate parameters; and (4) an executor.
Our empirical studies demonstrate the effectiveness and efficiency of the
TUMS framework, with average improvements of 19.6% and 50.6% on the easy and
hard benchmarks of ToolQA, respectively. We also demonstrate the key
contribution of each part with ablation experiments, offering more insights and
stimulating future research on tool-augmented LLMs.
[COMMENTS]
Accepted to ICONIP 2024
[LINK]
http://arxiv.org/abs/2505.08402v1
[DATE]
2025-05-13 17:57:28+08:00
[CATEGORIES]
cs.CL
Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping
[AUTHORS]
Ren Zhuang, Ben Wang, Shuifa Sun
[ABSTRACT]
Large Language Models leverage Chain-of-Thought (CoT) prompting for complex
tasks, but their reasoning traces are often excessively verbose and
inefficient, leading to significant computational costs and latency. Current
CoT compression techniques typically rely on generic importance metrics and
static compression rates, which may inadvertently remove functionally critical
tokens or fail to adapt to varying reasoning complexity. To overcome these
limitations, we propose Adaptive GoGI-Skip, a novel framework learning dynamic
CoT compression via supervised fine-tuning. This approach introduces two
synergistic innovations: (1) Goal-Gradient Importance (GoGI), a novel metric
accurately identifying functionally relevant tokens by measuring the gradient
influence of their intermediate representations on the final answer loss, and
(2) Adaptive Dynamic Skipping (ADS), a mechanism dynamically regulating the
compression rate based on runtime model uncertainty while ensuring local
coherence through an adaptive N-token constraint. To our knowledge, this is the
first work unifying a goal-oriented, gradient-based importance metric with
dynamic, uncertainty-aware skipping for CoT compression. Trained on compressed
MATH data, Adaptive GoGI-Skip demonstrates strong cross-domain generalization
across diverse reasoning benchmarks including AIME, GPQA, and GSM8K. It
achieves substantial efficiency gains - reducing CoT token counts by over 45%
on average and delivering 1.6-2.0 times inference speedups - while maintaining
high reasoning accuracy. Notably, it significantly outperforms existing
baselines by preserving accuracy even at high effective compression rates,
advancing the state of the art in the CoT reasoning efficiency-accuracy
trade-off.
[LINK]
http://arxiv.org/abs/2505.08392v1
[DATE]
2025-05-13 17:39:18+08:00
[CATEGORIES]
cs.CL
On the Geometry of Semantics in Next-token Prediction
[AUTHORS]
Yize Zhao, Christos Thrampoulidis
[ABSTRACT]
Modern language models demonstrate a remarkable ability to capture linguistic
meaning despite being trained solely through next-token prediction (NTP). We
investigate how this conceptually simple training objective leads models to
extract and encode latent semantic and grammatical concepts. Our analysis
reveals that NTP optimization implicitly guides models to encode concepts via
singular value decomposition (SVD) factors of a centered data-sparsity matrix
that captures next-word co-occurrence patterns. While the model never
explicitly constructs this matrix, learned word and context embeddings
effectively factor it to capture linguistic structure. We find that the most
important SVD factors are learned first during training, motivating the use of
spectral clustering of embeddings to identify human-interpretable semantics,
including both classical k-means and a new orthant-based method directly
motivated by our interpretation of concepts. Overall, our work bridges
distributional semantics, neural collapse geometry, and neural network training
dynamics, providing insights into how NTP’s implicit biases shape the emergence
of meaning representations in language models.
[LINK]
http://arxiv.org/abs/2505.08348v1
[DATE]
2025-05-13 16:46:04+08:00
[CATEGORIES]
cs.CL
Evaluating the Effectiveness of Black-Box Prompt Optimization as the Scale of LLMs Continues to Grow
[AUTHORS]
Ziyu Zhou, Yihang Wu, Jingyuan Yang, Zhan Xiao, Rongjun Li
[ABSTRACT]
Black-Box prompt optimization methods have emerged as a promising strategy
for refining input prompts to better align large language models (LLMs),
thereby enhancing their task performance. Although these methods have
demonstrated encouraging results, most studies and experiments have primarily
focused on smaller-scale models (e.g., 7B, 14B) or earlier versions (e.g.,
GPT-3.5) of LLMs. As the scale of LLMs continues to increase, such as with
DeepSeek V3 (671B), it remains an open question whether these black-box
optimization techniques will continue to yield significant performance
improvements for models of such scale. In response to this, we select three
well-known black-box optimization methods and evaluate them on large-scale LLMs
(DeepSeek V3 and Gemini 2.0 Flash) across four NLU and NLG datasets. The
results show that these black-box prompt optimization methods offer only
limited improvements on these large-scale LLMs. Furthermore, we hypothesize
that the scale of the model is the primary factor contributing to the limited
benefits observed. To explore this hypothesis, we conducted experiments on LLMs
of varying sizes (Qwen 2.5 series, ranging from 7B to 72B) and observed an
inverse scaling law, wherein the effectiveness of black-box optimization
methods diminished as the model size increased.
[LINK]
http://arxiv.org/abs/2505.08303v1
[DATE]
2025-05-13 15:26:56+08:00
[CATEGORIES]
cs.CL
Ask, Fail, Repeat: Meeseeks, an Iterative Feedback Benchmark for LLMs’ Multi-turn Instruction-following Ability
[AUTHORS]
Jiaming Wang, Yunke Zhao, Peng Ding, Jun Kuang, Zongyu Wang, Xuezhi Cao, Xunliang Cai
[ABSTRACT]
The ability to follow instructions accurately is fundamental for Large
Language Models (LLMs) to serve as reliable agents in real-world applications.
For complex instructions, LLMs often struggle to fulfill all requirements in a
single attempt. In practice, users typically provide iterative feedback until
the LLM generates a response that meets all requirements. However, existing
instruction-following benchmarks are either single-turn or introduce new
requirements in each turn without allowing self-correction. To address this
gap, we propose Meeseeks (named after Mr. Meeseeks from Rick and Morty, an
American adult animated science fiction sitcom created by Justin Roiland and
Dan Harmon for Cartoon Network’s nighttime programming block Adult Swim).
Meeseeks simulates realistic
human-LLM interactions through an iterative feedback framework, which enables
models to self-correct based on specific requirement failures in each turn,
better reflecting real-world user-end usage patterns. Meanwhile, the benchmark
implements a comprehensive evaluation system with 38 capability tags organized
across three dimensions: Intent Recognition, Granular Content Validation, and
Output Structure Validation. Through rigorous evaluation across LLMs, Meeseeks
provides valuable insights into LLMs’ instruction-following capabilities in
multi-turn scenarios.
[LINK]
http://arxiv.org/abs/2504.21625v2
[DATE]
2025-05-13 15:14:03+08:00
[CATEGORIES]
cs.CL
LLMSR@XLLM25: Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation
[AUTHORS]
Jiahao Yuan, Xingzhe Sun, Xing Yu, Jingwen Wang, Dehui Du, Zhiqing Cui, Zixiang Di
[ABSTRACT]
The LLMSR@XLLM25 formulates a low-resource structural reasoning task that
challenges LLMs to generate interpretable, step-by-step rationales with minimal
labeled data. We present Less is More, the third-place winning approach in the
LLMSR@XLLM25, which focuses on structured reasoning from only 24 labeled
examples. Our approach leverages a multi-agent framework with reverse-prompt
induction, retrieval-augmented reasoning synthesis via GPT-4o, and dual-stage
reward-guided filtering to distill high-quality supervision across three
subtasks: question parsing, CoT parsing, and step-level verification. All
modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+
setup. By combining structure validation with reward filtering across few-shot
and zero-shot prompts, our pipeline consistently improves structure reasoning
quality. These results underscore the value of controllable data distillation
in enhancing structured inference under low-resource constraints. Our code is
available at https://github.com/JhCircle/Less-is-More.
[COMMENTS]
XLLM @ ACL 2025 Shared Task-III: LLM for Structural Reasoning
(LLM-SR)
[LINK]
http://arxiv.org/abs/2504.16408v2
[DATE]
2025-05-13 15:12:49+08:00
[CATEGORIES]
cs.CL
Vision-Language Models Do Not Understand Negation
[AUTHORS]
Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi
[COMMENTS]
CVPR 2025; project page: https://negbench.github.io
[LINK]
http://arxiv.org/abs/2501.09425v2
[DATE]
2025-05-13 14:30:11+08:00
[CATEGORIES]
cs.CL
MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation
[AUTHORS]
Zichen Zhu, Hao Tang, Yansi Li, Dingye Liu, Hongshen Xu, Kunyao Lan, Danyang Zhang, Yixuan Jiang, Hao Zhou, Chenrun Wang, Situo Zhang, Liangtai Sun, Yixiao Wang, Yuheng Sun, Lu Chen, Kai Yu
[ABSTRACT]
Existing Multimodal Large Language Model (MLLM)-based agents face significant
challenges in handling complex GUI (Graphical User Interface) interactions on
devices. These challenges arise from the dynamic and structured nature of GUI
environments, which integrate text, images, and spatial relationships, as well
as the variability in action spaces across different pages and tasks. To
address these limitations, we propose MobA, a novel MLLM-based mobile assistant
system. MobA introduces an adaptive planning module that incorporates a
reflection mechanism for error recovery and dynamically adjusts plans to align
with the real environment contexts and action module’s execution capacity.
Additionally, a multifaceted memory module provides comprehensive memory
support to enhance adaptability and efficiency. We also present MobBench, a
dataset designed for complex mobile interactions. Experimental results on
MobBench and AndroidArena demonstrate MobA’s ability to handle dynamic GUI
environments and perform complex mobile tasks.
[COMMENTS]
NAACL 2025 Demo Track [code] https://github.com/OpenDFM/MobA
[dataset] https://huggingface.co/datasets/OpenDFM/MobA-MobBench
[LINK]
http://arxiv.org/abs/2410.13757v3
[DATE]
2025-05-13 14:25:09+08:00
[CATEGORIES]
cs.CL
Enhancing Cache-Augmented Generation (CAG) with Adaptive Contextual Compression for Scalable Knowledge Integration
[AUTHORS]
Rishabh Agrawal, Himanshu Kumar
[ABSTRACT]
The rapid progress in large language models (LLMs) has paved the way for
novel approaches in knowledge-intensive tasks. Among these, Cache-Augmented
Generation (CAG) has emerged as a promising alternative to Retrieval-Augmented
Generation (RAG). CAG minimizes retrieval latency and simplifies system design
by preloading knowledge into the model’s context. However, challenges persist
in scaling CAG to accommodate large and dynamic knowledge bases effectively.
This paper introduces Adaptive Contextual Compression (ACC), an innovative
technique designed to dynamically compress and manage context inputs, enabling
efficient utilization of the extended memory capabilities of modern LLMs. To
further address the limitations of standalone CAG, we propose a Hybrid CAG-RAG
Framework, which integrates selective retrieval to augment preloaded contexts
in scenarios requiring additional information. Comprehensive evaluations on
diverse datasets highlight the proposed methods’ ability to enhance
scalability, optimize efficiency, and improve multi-hop reasoning performance,
offering practical solutions for real-world knowledge integration challenges.
[LINK]
http://arxiv.org/abs/2505.08261v1
[DATE]
2025-05-13 14:24:48+08:00
[CATEGORIES]
cs.CL
CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
[AUTHORS]
Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian Rohr, Ekin Dogus Cubuk, Muratahan Aykol, Amil Merchant, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqthar Mohammad, Ean Phing VanLee, Chenfei Jiang, Elizabeth Dorfman, Eun-Ah Kim, Michael P Brenner, Viren Jain, Sameera Ponda, Subhashini Venugopalan
[COMMENTS]
Accepted at ICLR 2025 main conference
[LINK]
http://arxiv.org/abs/2503.13517v2
[DATE]
2025-05-13 14:16:23+08:00
[CATEGORIES]
cs.CL
Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement
[AUTHORS]
Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, Guojie Song
[ABSTRACT]
The rapid advancement of large language models (LLMs) has outpaced
traditional evaluation methodologies. It presents novel challenges, such as
measuring human-like psychological constructs, navigating beyond static and
task-specific benchmarks, and establishing human-centered evaluation. These
challenges intersect with Psychometrics, the science of quantifying the
intangible aspects of human psychology, such as personality, values, and
intelligence. This survey introduces and synthesizes an emerging
interdisciplinary field of LLM Psychometrics, which leverages psychometric
instruments, theories, and principles to evaluate, understand, and enhance
LLMs. We systematically explore the role of Psychometrics in shaping
benchmarking principles, broadening evaluation scopes, refining methodologies,
validating results, and advancing LLM capabilities. This paper integrates
diverse perspectives to provide a structured framework for researchers across
disciplines, enabling a more comprehensive understanding of this nascent field.
Ultimately, we aim to provide actionable insights for developing future
evaluation paradigms that align with human-level AI and promote the advancement
of human-centered AI systems for societal benefit. A curated repository of LLM
psychometric resources is available at
https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.
[COMMENTS]
63 pages, 482 references
[LINK]
http://arxiv.org/abs/2505.08245v1
[DATE]
2025-05-13 13:47:51+08:00
[CATEGORIES]
cs.CL
Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs
[AUTHORS]
Chetan Pathade
[ABSTRACT]
Large Language Models (LLMs) are increasingly integrated into consumer and
enterprise applications. Despite their capabilities, they remain susceptible to
adversarial attacks such as prompt injection and jailbreaks that override
alignment safeguards. This paper provides a systematic investigation of
jailbreak strategies against various state-of-the-art LLMs. We categorize over
1,400 adversarial prompts, analyze their success against GPT-4, Claude 2,
Mistral 7B, and Vicuna, and examine their generalizability and construction
logic. We further propose layered mitigation strategies and recommend a hybrid
red-teaming and sandboxing approach for robust LLM security.
[COMMENTS]
7 Pages, 6 Figures
[LINK]
http://arxiv.org/abs/2505.04806v2
[DATE]
2025-05-13 13:36:34+08:00
[CATEGORIES]
cs.CL
Cite Before You Speak: Enhancing Context-Response Grounding in E-commerce Conversational LLM-Agents
[AUTHORS]
Jingying Zeng, Hui Liu, Zhenwei Dai, Xianfeng Tang, Chen Luo, Samarth Varshney, Zhen Li, Qi He
[ABSTRACT]
With the advancement of conversational large language models (LLMs), several
LLM-based Conversational Shopping Agents (CSA) have been developed to make
online shopping smoother for customers. The primary objective in building an
engaging and trustworthy CSA is to ensure the agent’s responses about product
factoids are accurate and factually grounded. However, two challenges remain.
First, LLMs produce hallucinated or unsupported claims. Such inaccuracies risk
spreading misinformation and diminishing customer trust. Second, without
knowledge source attribution in CSA responses, customers struggle to
verify LLM-generated information. To address both challenges, we present an
easily productionized solution that enables a “citation experience” for our
customers. We build auto-evaluation metrics to holistically evaluate an LLM’s
grounding and attribution capabilities; the results suggest that the citation
generation paradigm substantially improves grounding performance by 13.83%. To
deploy this capability at scale, we introduce the Multi-UX-Inference system,
which appends source citations to LLM outputs while preserving existing user
experience features and supporting scalable inference. Large-scale online A/B
tests show that grounded CSA responses improve customer engagement by 3% to
10%, depending on UX variations.
[LINK]
http://arxiv.org/abs/2503.04830v3
[DATE]
2025-05-13 13:02:11+08:00
[CATEGORIES]
cs.CL
Evaluating the Symbol Binding Ability of Large Language Models for Multiple-Choice Questions in Vietnamese General Education
[AUTHORS]
Duc-Vu Nguyen, Quoc-Nam Nguyen
[ABSTRACT]
In this paper, we evaluate the ability of large language models (LLMs) to
perform multiple choice symbol binding (MCSB) for multiple choice question
answering (MCQA) tasks in zero-shot, one-shot, and few-shot settings. We focus
on Vietnamese, which has fewer challenging MCQA datasets than English. The two
existing datasets, ViMMRC 1.0 and ViMMRC 2.0, focus on literature. Recent
research in Vietnamese natural language processing (NLP) has focused on the
Vietnamese National High School Graduation Examination (VNHSGE) from 2019 to
2023 to evaluate ChatGPT. However, these studies have mainly focused on how
ChatGPT solves the VNHSGE step by step. We aim to create a novel and
high-quality dataset by providing structured guidelines for typing LaTeX
formulas for mathematics, physics, chemistry, and biology. This dataset can be
used to evaluate the MCSB ability of LLMs and smaller language models (LMs)
because it is typed in a strict LaTeX style. We focus on predicting the
character (A, B, C, or D) that is the most likely answer to a question, given
the context of the question. Our evaluation of six well-known LLMs, namely
BLOOMZ-7.1B-MT, LLaMA-2-7B, LLaMA-2-70B, GPT-3, GPT-3.5, and GPT-4.0, on the
ViMMRC 1.0 and ViMMRC 2.0 benchmarks and our proposed dataset shows promising
results on the MCSB ability of LLMs for Vietnamese. The dataset is available
for research purposes only.
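The prediction step described above (choosing the single character A, B, C, or D that is most likely) can be sketched as an argmax over per-option scores. The input dictionary of log-probabilities is a hypothetical stand-in for whatever scores a given model exposes; this is not the authors' evaluation code.

```python
from typing import Dict

OPTIONS = ("A", "B", "C", "D")


def pick_option(option_logprobs: Dict[str, float]) -> str:
    """Return the option letter with the highest score.

    `option_logprobs` maps option letters to model log-probabilities
    (a hypothetical input format). Ties resolve to the earliest letter.
    """
    present = [opt for opt in OPTIONS if opt in option_logprobs]
    if not present:
        raise ValueError("no scores for any of A, B, C, D")
    return max(present, key=lambda opt: option_logprobs[opt])
```

Restricting the comparison to the four option letters means tokens outside the answer set are ignored, even if the model assigns them higher probability.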
[COMMENTS]
Accepted at SoICT 2023
[LINK]
http://arxiv.org/abs/2310.12059v5
[DATE]
2025-05-13 12:23:12+08:00
[CATEGORIES]
cs.CL
Bridging LLMs and KGs without Fine-Tuning: Intermediate Probing Meets Subgraph-Aware Entity Descriptions
[AUTHORS]
Bo Xue, Yi Xu, Yunchong Song, Yiming Pang, Yuyang Ren, Jiaxin Ding, Luoyi Fu, Xinbing Wang
[ABSTRACT]
Traditional knowledge graph completion (KGC) methods rely solely on
structural information, struggling with the inherent sparsity of knowledge
graphs (KGs). Large Language Models (LLMs) learn extensive knowledge from large
corpora with powerful context modeling, making them promising for mitigating
the limitations of previous methods. Directly fine-tuning LLMs offers great
capability but comes at the cost of huge time and memory consumption, while
utilizing frozen LLMs yields suboptimal results. In this work, we aim to
leverage LLMs for KGC effectively and efficiently. We capture the context-aware
hidden states of knowledge triples by employing prompts to stimulate the
intermediate layers of LLMs. We then train a data-efficient classifier on these
hidden states to harness the inherent capabilities of frozen LLMs in KGC.
Additionally, to reduce ambiguity and enrich knowledge representation, we
generate detailed entity descriptions through subgraph sampling on KGs.
Extensive experiments on standard benchmarks demonstrate the efficiency and
effectiveness of our approach. We outperform traditional KGC methods across
most datasets and, notably, achieve classification performance comparable to
fine-tuned LLMs while enhancing GPU memory efficiency by $188\times$ and
accelerating training and inference by $13.48\times$.
[LINK]
http://arxiv.org/abs/2408.06787v3
[DATE]
2025-05-13 12:09:08+08:00
[CATEGORIES]
cs.CL
Not that Groove: Zero-Shot Symbolic Music Editing
[AUTHORS]
Li Zhang
[ABSTRACT]
Most work in AI music generation has focused on audio, which has seen limited
use in the music production industry due to its rigidity. To maximize
flexibility while assuming only textual instructions from producers, we are
among the first to tackle symbolic music editing. We circumvent the known
challenge of scarce labeled data by demonstrating that LLMs with zero-shot
prompting can effectively edit drum grooves. The recipe for success is a
creatively designed format that interfaces LLMs and music. We facilitate
evaluation by providing an evaluation dataset with annotated unit tests that
align closely with musicians’ judgment.
[LINK]
http://arxiv.org/abs/2505.08203v1
[DATE]
2025-05-13 11:33:36+08:00
[CATEGORIES]
cs.CL
A Head to Predict and a Head to Question: Pre-trained Uncertainty Quantification Heads for Hallucination Detection in LLM Outputs
[AUTHORS]
Artem Shelmanov, Ekaterina Fadeeva, Akim Tsvigun, Ivan Tsvigun, Zhuohan Xie, Igor Kiselev, Nico Daheim, Caiqi Zhang, Artem Vazhentsev, Mrinmaya Sachan, Preslav Nakov, Timothy Baldwin
[ABSTRACT]
Large Language Models (LLMs) have the tendency to hallucinate, i.e., to
sporadically generate false or fabricated information. This presents a major
challenge, as hallucinations often appear highly convincing and users generally
lack the tools to detect them. Uncertainty quantification (UQ) provides a
framework for assessing the reliability of model outputs, aiding in the
identification of potential hallucinations. In this work, we introduce
pre-trained UQ heads: supervised auxiliary modules for LLMs that substantially
enhance their ability to capture uncertainty compared to unsupervised UQ
methods. Their strong performance stems from the powerful Transformer
architecture in their design and informative features derived from LLM
attention maps. Experimental evaluation shows that these heads are highly
robust and achieve state-of-the-art performance in claim-level hallucination
detection across both in-domain and out-of-domain prompts. Moreover, these
modules demonstrate strong generalization to languages they were not explicitly
trained on. We pre-train a collection of UQ heads for popular LLM series,
including Mistral, Llama, and Gemma 2. We publicly release both the code and
the pre-trained heads.
[LINK]
http://arxiv.org/abs/2505.08200v1
[DATE]
2025-05-13 11:30:26+08:00
[CATEGORIES]
cs.CL
Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation
[AUTHORS]
Xiaoyu Wang, Ningyuan Xi, Teng Chen, Qingqing Gu, Yue Zhao, Xiaokai Chen, Zhonglin Jiang, Yong Chen, Luo Ji
[ABSTRACT]
Large Language Models (LLMs) are usually fine-tuned to participate in dyadic
or two-party dialogues and therefore do not adapt well to multi-party dialogues
(MPD), which hinders their application in scenarios such as multi-person
meetings, discussions, and daily communication. Previous LLM-based research
mainly focuses on multi-agent frameworks, while the base LLMs are still
fine-tuned on pairwise dialogues. In this work, we design a multi-party
fine-tuning framework (MuPaS) for LLMs on multi-party dialogue datasets, and
show that such a straightforward framework lets the LLM align with the
multi-party conversation style efficiently and effectively. We also design
two training strategies that convert MuPaS into an MPD simulator.
Substantial experiments show that MuPaS achieves state-of-the-art multi-party
responses, higher accuracy in next-speaker prediction, and higher utterance
quality under both human and automatic evaluation, and can even generate
reasonable responses for out-of-distribution scene, topic, and role
descriptions. The MuPaS framework bridges LLM training with more complicated
multi-party applications, such as conversation generation, virtual rehearsal, or
meta-universe.
[COMMENTS]
Accepted by IJCNN 2025
[LINK]
http://arxiv.org/abs/2412.05342v4
[DATE]
2025-05-13 11:10:40+08:00
[CATEGORIES]
cs.CL
No Preference Left Behind: Group Distributional Preference Optimization
[AUTHORS]
Binwei Yao, Zefan Cai, Yun-Shiuan Chuang, Shanglin Yang, Ming Jiang, Diyi Yang, Junjie Hu
[ABSTRACT]
Preferences within a group of people are not uniform but follow a
distribution. While existing alignment methods like Direct Preference
Optimization (DPO) attempt to steer models to reflect human preferences, they
struggle to capture the distributional pluralistic preferences within a group.
These methods often skew toward dominant preferences, overlooking the diversity
of opinions, especially when conflicting preferences arise. To address this
issue, we propose Group Distributional Preference Optimization (GDPO), a novel
framework that aligns language models with the distribution of preferences
within a group by incorporating the concept of beliefs that shape individual
preferences. GDPO calibrates a language model using statistical estimation of
the group’s belief distribution and aligns the model with belief-conditioned
preferences, offering a more inclusive alignment framework than traditional
methods. In experiments using both synthetic controllable opinion generation
and real-world movie review datasets, we show that DPO fails to align with the
targeted belief distributions, while GDPO consistently reduces this alignment
gap during training. Moreover, our evaluation metrics demonstrate that GDPO
outperforms existing approaches in aligning with group distributional
preferences, marking a significant advance in pluralistic alignment.
[LINK]
http://arxiv.org/abs/2412.20299v2
[DATE]
2025-05-13 11:06:47+08:00
[CATEGORIES]
cs.CL
OnPrem.LLM: A Privacy-Conscious Document Intelligence Toolkit
[AUTHORS]
Arun S. Maiya
[ABSTRACT]
We present OnPrem.LLM, a Python-based toolkit for applying large language
models (LLMs) to sensitive, non-public data in offline or restricted
environments. The system is designed for privacy-preserving use cases and
provides prebuilt pipelines for document processing and storage,
retrieval-augmented generation (RAG), information extraction, summarization,
classification, and prompt/output processing with minimal configuration.
OnPrem.LLM supports multiple LLM backends – including llama.cpp, Ollama,
vLLM, and Hugging Face Transformers – with quantized model support, GPU
acceleration, and seamless backend switching. Although designed for fully local
execution, OnPrem.LLM also supports integration with a wide range of cloud
LLM providers when permitted, enabling hybrid deployments that balance
performance with data control. A no-code web interface extends accessibility to
non-technical users.
[COMMENTS]
6 pages
[LINK]
http://arxiv.org/abs/2505.07672v2
[DATE]
2025-05-13 10:43:26+08:00
[CATEGORIES]
cs.CL
cs.LG
Codifying Character Logic in Role-Playing
[AUTHORS]
Letian Peng, Jingbo Shang
[ABSTRACT]
This paper introduces Codified Profiles for role-playing, a novel approach
that represents character logic as structured, executable functions for
behavioral decision-making. Each profile defines a set of functions
parse_by_scene(scene) that outputs a list of logic-grounded assertions
triggered_statements, using both explicit control structures (e.g.,
if-then-else) and condition checks like check_condition(scene, question), where
each question is a semantically meaningful prompt about the scene (e.g., “Is
the character in danger?”) discriminated by the role-playing LLM as true,
false, or unknown. This explicit representation offers three key advantages
over traditional prompt-based profiles, which append character descriptions
directly into text prompts: (1) Persistence, by enforcing complete and
consistent execution of character logic, rather than relying on the model’s
implicit reasoning; (2) Updatability, through systematic inspection and
revision of behavioral logic, which is difficult to track or debug in
prompt-only approaches; (3) Controllable Randomness, by supporting stochastic
behavior directly within the logic, enabling fine-grained variability that
prompting alone struggles to achieve. To validate these advantages, we
introduce a new benchmark constructed from 83 characters and 5,141 scenes
curated from Fandom, using NLI-based scoring to compare character responses
against ground-truth actions. Our experiments demonstrate the significant
benefits of codified profiles in improving persistence, updatability, and
behavioral diversity. Notably, by offloading a significant portion of reasoning
to preprocessing, codified profiles enable even 1B-parameter models to perform
high-quality role-playing, providing a scalable and efficient foundation for
local deployment of role-play agents.
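As a rough illustration of the codified-profile idea described above, the following Python sketch shows what a parse_by_scene function might look like. The character logic, the two questions, and the keyword-based check_condition stub are all assumptions made for this example; in the paper, each question is judged by the role-playing LLM as true, false, or unknown.

```python
import random
from typing import List, Optional


def check_condition(scene: str, question: str) -> Optional[bool]:
    """Stand-in for the LLM judgment: True, False, or None (unknown).

    Assumption: a keyword lookup replaces the role-playing LLM here.
    """
    keywords = {
        "Is the character in danger?": ["attack", "fire", "ambush"],
        "Is a friend present?": ["friend", "ally"],
    }
    if question not in keywords:
        return None  # unrecognized question: answer is unknown
    text = scene.lower()
    return any(word in text for word in keywords[question])


def parse_by_scene(scene: str, rng: random.Random) -> List[str]:
    """Return the logic-grounded assertions triggered by this scene."""
    triggered_statements = []
    in_danger = check_condition(scene, "Is the character in danger?")
    if in_danger is True:
        if check_condition(scene, "Is a friend present?"):
            triggered_statements.append("Protects the friend first.")
        else:
            # Controllable randomness: the stochastic choice lives in the logic.
            triggered_statements.append(
                rng.choice(["Fights back.", "Retreats to regroup."])
            )
    elif in_danger is None:
        triggered_statements.append("Stays cautious and observes.")
    else:
        triggered_statements.append("Behaves calmly.")
    return triggered_statements
```

Because the branching is explicit, the same scene always passes through the same checks, and a behavior can be revised by editing one branch rather than rewording a prompt.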
[LINK]
http://arxiv.org/abs/2505.07705v2
[DATE]
2025-05-13 10:16:35+08:00
[CATEGORIES]
cs.CL
Efficient Shapley Value-based Non-Uniform Pruning of Large Language Models
[AUTHORS]
Chuan Sun, Han Yu, Lizhen Cui, Xiaoxiao Li
[ABSTRACT]
Pruning large language models (LLMs) is a promising solution for reducing
model sizes and computational complexity while preserving performance.
Traditional layer-wise pruning methods often adopt a uniform sparsity approach
across all layers, which leads to suboptimal performance because it ignores the
varying significance of individual transformer layers within the model.
To this end, we propose the Shapley Value-based Non-Uniform
Pruning (SV-NUP) method for LLMs. This approach quantifies the contribution of
each transformer layer to the overall model performance, enabling the
assignment of tailored pruning budgets to different layers to retain critical
parameters. To further improve efficiency, we design the Sliding Window-based
Shapley Value approximation method. It substantially reduces computational
overhead compared to exact SV calculation methods. Extensive experiments on
various LLMs including LLaMA-v1, LLaMA-v2 and OPT demonstrate the effectiveness
of the proposed approach. The results reveal that non-uniform pruning
significantly enhances the performance of pruned models. Notably, SV-NUP
achieves a reduction in perplexity (PPL) of 18.01% and 19.55% on LLaMA-7B and
LLaMA-13B, respectively, compared to SparseGPT at 70% sparsity.
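To make the notion of a per-layer contribution concrete, here is a minimal sketch of exact Shapley values on a toy three-layer "model". The toy_value function and its numbers are hypothetical, and this is the textbook exact computation over all orderings, not the paper's Sliding Window-based approximation.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal
    contribution over all orderings of the players."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}


def toy_value(kept_layers):
    """Hypothetical performance score when only these layers are kept."""
    score = 0.0
    if 0 in kept_layers:
        score += 0.5
    if 1 in kept_layers:
        score += 0.2
    if 0 in kept_layers and 2 in kept_layers:
        score += 0.3  # layer 2 only helps together with layer 0
    return score


phi = shapley_values([0, 1, 2], toy_value)
```

The exact computation enumerates every ordering, so its cost grows factorially with the number of layers, which is why an approximation is needed for real LLMs.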
[LINK]
http://arxiv.org/abs/2505.01731v2
[DATE]
2025-05-13 10:13:57+08:00
[CATEGORIES]
cs.CL
A Large-Scale Empirical Analysis of Custom GPTs’ Vulnerabilities in the OpenAI Ecosystem
[AUTHORS]
Sunday Oyinlola Ogundoyin, Muhammad Ikram, Hassan Jameel Asghar, Benjamin Zi Hao Zhao, Dali Kaafar
[ABSTRACT]
Millions of users leverage generative pretrained transformer (GPT)-based
language models developed by leading model providers for a wide range of tasks.
To support enhanced user interaction and customization, many platforms, such as
OpenAI, now enable developers to create and publish tailored model instances,
known as custom GPTs, via dedicated repositories or application stores. These
custom GPTs empower users to browse and interact with specialized applications
designed to meet specific needs. However, as custom GPTs see growing adoption,
concerns regarding their security vulnerabilities have intensified. Existing
research on these vulnerabilities remains largely theoretical, often lacking
empirical, large-scale, and statistically rigorous assessments of associated
risks.
In this study, we analyze 14,904 custom GPTs to assess their susceptibility
to seven exploitable threats, such as roleplay-based attacks, system prompt
leakage, phishing content generation, and malicious code synthesis, across
various categories and popularity tiers within the OpenAI marketplace. We
introduce a multi-metric ranking system to examine the relationship between a
custom GPT’s popularity and its associated security risks.
Our findings reveal that over 95% of custom GPTs lack adequate security
protections. The most prevalent vulnerabilities include roleplay-based
vulnerabilities (96.51%), system prompt leakage (92.20%), and phishing
(91.22%). Furthermore, we demonstrate that OpenAI’s foundational models exhibit
inherent security weaknesses, which are often inherited or amplified in custom
GPTs. These results highlight the urgent need for enhanced security measures
and stricter content moderation to ensure the safe deployment of GPT-based
applications.
[LINK]
http://arxiv.org/abs/2505.08148v1
[DATE]
2025-05-13 08:51:07+08:00
[CATEGORIES]
cs.CL
cs.LG
Human-AI Collaboration or Academic Misconduct? Measuring AI Use in Student Writing Through Stylometric Evidence
[AUTHORS]
Eduardo Araujo Oliveira, Madhavi Mohoni, Sonsoles López-Pernas, Mohammed Saqr
[ABSTRACT]
As human-AI collaboration becomes increasingly prevalent in educational
contexts, understanding and measuring the extent and nature of such
interactions pose significant challenges. This research investigates the use of
authorship verification (AV) techniques not as a punitive measure, but as a
means to quantify AI assistance in academic writing, with a focus on promoting
transparency, interpretability, and student development. Building on prior
work, we structured our investigation into three stages: dataset selection and
expansion, AV method development, and systematic evaluation. Using three
datasets - including a public dataset (PAN-14) and two from University of
Melbourne students from various courses - we expanded the data to include
LLM-generated texts, totalling 1,889 documents and 540 authorship problems from
506 students. We developed an adapted Feature Vector Difference AV methodology
to construct robust academic writing profiles for students, designed to capture
meaningful, individual characteristics of their writing. The method’s
effectiveness was evaluated across multiple scenarios, including distinguishing
between student-authored and LLM-generated texts and testing resilience against
LLMs’ attempts to mimic student writing styles. Results demonstrate the
enhanced AV classifier’s ability to identify stylometric discrepancies and
measure human-AI collaboration at word and sentence levels while providing
educators with a transparent tool to support academic integrity investigations.
This work advances AV technology, offering actionable insights into the
dynamics of academic writing in an AI-driven era.
[COMMENTS]
19 pages, 10 figures, 11 tables
[LINK]
http://arxiv.org/abs/2505.08828v1
[DATE]
2025-05-13 08:36:36+08:00
[CATEGORIES]
cs.CL
Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?
[AUTHORS]
Yutong Yin, Zhaoran Wang
[COMMENTS]
Accepted by ICLR 2025
[LINK]
http://arxiv.org/abs/2501.15857v5
[DATE]
2025-05-13 08:04:47+08:00
[CATEGORIES]
cs.CL
cs.LG
ALOHA: Empowering Multilingual Agent for University Orientation with Hierarchical Retrieval
[AUTHORS]
Mingxu Tao, Bowen Tang, Mingxuan Ma, Yining Zhang, Hourun Li, Feifan Wen, Hao Ma, Jia Yang
[COMMENTS]
To appear in NAACL 2025 Demo Track
[LINK]
http://arxiv.org/abs/2505.08130v1
[DATE]
2025-05-13 08:01:03+08:00
[CATEGORIES]
cs.CL
Are LLMs complicated ethical dilemma analyzers?
[AUTHORS]
Jiashen Du, Jesse Yao, Allen Liu, Zhekai Zhang
[ABSTRACT]
One open question in the study of Large Language Models (LLMs) is whether
they can emulate human ethical reasoning and act as believable proxies for
human judgment. To investigate this, we introduce a benchmark dataset
comprising 196 real-world ethical dilemmas and expert opinions, each segmented
into five structured components: Introduction, Key Factors, Historical
Theoretical Perspectives, Resolution Strategies, and Key Takeaways. We also
collect non-expert human responses for comparison, limited to the Key Factors
section due to their brevity. We evaluate multiple frontier LLMs (GPT-4o-mini,
Claude-3.5-Sonnet, Deepseek-V3, Gemini-1.5-Flash) using a composite metric
framework based on BLEU, Damerau-Levenshtein distance, TF-IDF cosine
similarity, and Universal Sentence Encoder similarity. Metric weights are
computed through an inversion-based ranking alignment and pairwise AHP
analysis, enabling fine-grained comparison of model outputs to expert
responses. Our results show that LLMs generally outperform non-expert humans in
lexical and structural alignment, with GPT-4o-mini performing most consistently
across all sections. However, all models struggle with historical grounding and
proposing nuanced resolution strategies, which require contextual abstraction.
Human responses, while less structured, occasionally achieve comparable
semantic similarity, suggesting intuitive moral reasoning. These findings
highlight both the strengths and current limitations of LLMs in ethical
decision-making.
[COMMENTS]
CS194-280 Advanced LLM Agents project. Project page:
https://github.com/ALT-JS/ethicaLLM
[LINK]
http://arxiv.org/abs/2505.08106v1
[DATE]
2025-05-13 06:35:07+08:00
[CATEGORIES]
cs.CL
Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data
[AUTHORS]
Siqi Guo, Ilgee Hong, Vicente Balmaseda, Changlong Yu, Liang Qiu, Xin Liu, Haoming Jiang, Tuo Zhao, Tianbao Yang
[ABSTRACT]
Supervised fine-tuning (SFT) has become a crucial step for aligning
pretrained large language models (LLMs) using supervised datasets of
input-output pairs. However, despite being supervised, SFT is inherently
limited by its generative training objective. To address its limitations, the
existing common strategy is to follow SFT with a separate phase of preference
optimization (PO), which relies on either human-labeled preference data or a
strong reward model to guide the learning process. In this paper, we address
the limitations of SFT by exploring one of the most successful techniques in
conventional supervised learning: discriminative learning. We introduce
Discriminative Fine-Tuning (DFT), an improved variant of SFT, which mitigates
the burden of collecting human-labeled preference data or training strong
reward models. Unlike SFT, which employs a generative approach and overlooks
negative data, DFT adopts a discriminative paradigm that increases the
probability of positive answers while suppressing potentially negative ones,
aiming for data prediction instead of token prediction. Our contributions
include: (i) a discriminative probabilistic framework for fine-tuning LLMs by
explicitly modeling the discriminative likelihood of an answer among all
possible outputs given an input; (ii) efficient algorithms to optimize this
discriminative likelihood; and (iii) extensive experiments demonstrating DFT’s
effectiveness, achieving performance better than SFT and comparable to if not
better than SFT$\rightarrow$PO. The code can be found at
https://github.com/Optimization-AI/DFT.
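One plausible way to write the discriminative likelihood described in contribution (i) is as a softmax over all possible outputs. The scoring function $s_\theta$, the temperature $\tau$, and the output space $\mathcal{Y}$ are notation assumed here for illustration; the paper's exact parameterization may differ.

```latex
P_\theta(y \mid x) =
  \frac{\exp\big(s_\theta(y, x)/\tau\big)}
       {\sum_{y' \in \mathcal{Y}} \exp\big(s_\theta(y', x)/\tau\big)}
```

Maximizing this quantity for a positive answer $y$ raises its score while lowering the scores of competing outputs $y'$ in the denominator, which is the contrast with a purely generative token-level objective.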
[COMMENTS]
18 pages, 7 figures
[LINK]
http://arxiv.org/abs/2502.18679v2
[DATE]
2025-05-13 06:10:13+08:00
[CATEGORIES]
cs.CL
Adaptive Integrated Layered Attention (AILA)
[AUTHORS]
William Claster, Suhas KM, Dhairya Gundechia
[ABSTRACT]
We propose Adaptive Integrated Layered Attention (AILA), a neural network
architecture that combines dense skip connections with different mechanisms for
adaptive feature reuse across network layers. We evaluate AILA on three
challenging tasks: price forecasting for various commodities and indices (S&P
500, Gold, US dollar Futures, Coffee, Wheat), image recognition using the
CIFAR-10 dataset, and sentiment analysis on the IMDB movie review dataset. In
all cases, AILA matches strong deep learning baselines (LSTMs, Transformers,
and ResNets) while requiring only a fraction of the training and inference time.
Notably, we implement and test two versions of the model - AILA-Architecture 1,
which uses simple linear layers as the connection mechanism between layers, and
AILA-Architecture 2, which implements an attention mechanism to selectively
focus on outputs from previous layers. Both architectures are applied in a
single-task learning setting, with each model trained separately for individual
tasks. Results confirm that AILA’s adaptive inter-layer connections yield
robust gains by flexibly reusing pertinent features at multiple network depths.
The AILA approach thus presents an extension to existing architectures,
improving long-range sequence modeling, image recognition with optimised
computational speed, and SOTA classification performance in practice.
[LINK]
http://arxiv.org/abs/2503.22742v2
[DATE]
2025-05-13 05:58:10+08:00
[CATEGORIES]
cs.LG
cs.CL
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders
[AUTHORS]
Dong Shu, Xuansheng Wu, Haiyan Zhao, Mengnan Du, Ninghao Liu
[ABSTRACT]
Sparse Autoencoders (SAEs) have recently emerged as powerful tools for
interpreting and steering the internal representations of large language models
(LLMs). However, conventional approaches to analyzing SAEs typically rely
solely on input-side activations, without considering the causal influence
between each latent feature and the model’s output. This work is built on two
key hypotheses: (1) activated latents do not contribute equally to the
construction of the model’s output, and (2) only latents with high causal
influence are effective for model steering. To validate these hypotheses, we
propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method
that identifies the most influential latents by incorporating output-side
gradient information.
[COMMENTS]
10 pages, 3 figures
[LINK]
http://arxiv.org/abs/2505.08080v1
[DATE]
2025-05-13 05:29:12+08:00
[CATEGORIES]
cs.LG
cs.CL
An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits
[AUTHORS]
Cody Steinmetz, Gavin Childress, Aaron Herbst, Gavin Jones, Jasdeep Singh, Eli Vang, Keagan Weinstock
[ABSTRACT]
Large language models (LLMs) have transformed natural-language processing,
yet their scale makes real-world deployment costly. Post-training quantization
reduces memory and computation but often degrades accuracy, while
quantization-aware training can recover performance at the cost of extra
training. Pushing quantization to the ternary (2-bit) regime yields even larger
savings but is notoriously unstable. Building on recent work showing that a
bias-free, RMS-normalized Transformer with straight-through estimation can
reach 1.58-bit precision, we demonstrate that simply inserting RMS
normalization before every linear projection and applying a gradual, layer-wise
quantization schedule stably fine-tunes full-precision checkpoints into ternary
LLMs. Our approach matches or surpasses more elaborate knowledge-distillation
pipelines on standard language-modeling benchmarks without adding model
complexity. These results indicate that careful normalization alone can close
much of the accuracy gap between ternary and full-precision LLMs, making
ultra-low-bit inference practical.
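The "1.58 bits" figure comes from storing one of three states per weight, since log2(3) is about 1.58. The sketch below illustrates that arithmetic together with an RMS normalization and an absmean ternary quantizer. The quantizer follows the absmean scheme from prior 1.58-bit work as an assumption; the abstract does not spell out the exact quantizer or schedule, and the learned gain of RMSNorm is omitted.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def ternary_quantize(weights, eps=1e-8):
    """Absmean ternarization: divide by mean |w|, round, clip to {-1, 0, 1}.

    Assumption: this mirrors the absmean scheme of earlier 1.58-bit work.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


# Three states per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)

codes, scale = ternary_quantize([0.9, -0.05, -1.3, 0.4])
normed = rms_norm([3.0, -4.0])
```

Dequantized weights are recovered as `code * scale`, so only the ternary codes and a single scale need to be stored per weight group.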
[LINK]
http://arxiv.org/abs/2505.08823v1
[DATE]
2025-05-13 05:14:29+08:00
[CATEGORIES]
cs.LG
cs.CL
Multi-Modal Language Models as Text-to-Image Model Evaluators
[AUTHORS]
Jiahui Chen, Candace Ross, Reyhane Askari-Hemmat, Koustuv Sinha, Melissa Hall, Michal Drozdzal, Adriana Romero-Soriano
[ABSTRACT]
The steady improvements of text-to-image (T2I) generative models lead to slow
deprecation of automatic evaluation benchmarks that rely on static datasets,
motivating researchers to seek alternative ways to evaluate the T2I progress.
In this paper, we explore the potential of multi-modal large language models
(MLLMs) as evaluator agents that interact with a T2I model, with the objective
of assessing prompt-generation consistency and image aesthetics. We present
Multimodal Text-to-Image Eval (MT2IE), an evaluation framework that iteratively
generates prompts for evaluation, scores generated images and matches T2I
evaluation of existing benchmarks with a fraction of the prompts used in
existing static benchmarks. Moreover, we show that MT2IE’s prompt-generation
consistency scores have higher correlation with human judgment than scores
previously introduced in the literature. MT2IE generates prompts that are
efficient at probing T2I model performance, producing the same relative T2I
model rankings as existing benchmarks while using only 1/80th the number of
prompts for evaluation.
[LINK]
http://arxiv.org/abs/2505.00759v2
[DATE]
2025-05-13 04:46:35+08:00
[CATEGORIES]
cs.CL
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning
[AUTHORS]
Zhehao Zhang, Weijie Xu, Fanyou Wu, Chandan K. Reddy
[ABSTRACT]
Safety alignment approaches in large language models (LLMs) often lead to the
over-refusal of benign queries, significantly diminishing their utility in
sensitive scenarios. To address this challenge, we introduce FalseReject, a
comprehensive resource containing 16k seemingly toxic queries accompanied by
structured responses across 44 safety-related categories. We propose a
graph-informed adversarial multi-agent interaction framework to generate
diverse and complex prompts, while structuring responses with explicit
reasoning to aid models in accurately distinguishing safe from unsafe contexts.
FalseReject includes training datasets tailored for both standard
instruction-tuned models and reasoning-oriented models, as well as a
human-annotated benchmark test set. Our extensive benchmarking on 29
state-of-the-art (SOTA) LLMs reveals persistent over-refusal challenges.
Empirical results demonstrate that supervised finetuning with FalseReject
substantially reduces unnecessary refusals without compromising overall safety
or general language capabilities.
[LINK]
http://arxiv.org/abs/2505.08054v1
[DATE]
2025-05-13 04:45:25+08:00
[CATEGORIES]
cs.CL
NAZM: Network Analysis of Zonal Metrics in Persian Poetic Tradition
[AUTHORS]
Kourosh Shahnazari, Seyed Moein Ayyoubzadeh
[ABSTRACT]
This study formalizes a computational model to simulate classical Persian
poets’ dynamics of influence through constructing a multi-dimensional
similarity network. Using a rigorously curated dataset based on Ganjoor’s
corpus, we draw upon semantic, lexical, stylistic, thematic, and metrical
features to demarcate each poet’s corpus. Each is contained within weighted
similarity matrices, which are then appended to generate an aggregate graph
showing poet-to-poet influence. Further network investigation is carried out to
identify key poets, style hubs, and bridging poets by calculating degree,
closeness, betweenness, eigenvector, and Katz centrality measures. Further, for
typological insight, we use the Louvain community detection algorithm to
demarcate clusters of poets sharing both style and theme coherence, which
correspond closely to acknowledged schools of literature like Sabk-e Hindi,
Sabk-e Khorasani, and the Bazgasht-e Adabi phenomenon. Our findings provide a
new data-driven view of Persian literature that distinguishes between canonical
significance and intertextual influence, thus highlighting relatively
lesser-known figures who hold great structural significance. Combining
computational linguistics with literary study, this paper produces an
interpretable and scalable model for poetic tradition, enabling retrospective
reflection as well as forward-looking research within digital humanities.
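The aggregation step described above can be sketched as a weighted sum of per-feature similarity matrices followed by a simple centrality score. The two 3x3 matrices, the equal weights, and the use of weighted degree as the centrality measure are all hypothetical choices for this example, not values or settings from the paper.

```python
def aggregate_similarity(matrices, weights):
    """Weighted sum of per-feature poet-by-poet similarity matrices."""
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) for j in range(n)]
        for i in range(n)
    ]


def weighted_degree(adjacency):
    """Sum of edge weights per node, ignoring self-similarity."""
    return [
        sum(w for j, w in enumerate(row) if j != i)
        for i, row in enumerate(adjacency)
    ]


# Hypothetical similarity scores among three unnamed poets.
semantic = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.4], [0.2, 0.4, 1.0]]
metrical = [[1.0, 0.6, 0.0], [0.6, 1.0, 0.2], [0.0, 0.2, 1.0]]

graph = aggregate_similarity([semantic, metrical], [0.5, 0.5])
strength = weighted_degree(graph)
```

In this toy graph the middle poet has the largest weighted degree, which is the kind of signal the centrality analysis uses to flag hubs and bridging figures.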
[LINK]
http://arxiv.org/abs/2505.08052v1
[DATE]
2025-05-13 04:39:53+08:00
[CATEGORIES]
cs.CL
cs.LG
BLAB: Brutally Long Audio Bench
[AUTHORS]
Orevaoghene Ahia, Martijn Bartelds, Kabir Ahuja, Hila Gonen, Valentin Hofmann, Siddhant Arora, Shuyue Stella Li, Vishal Puttagunta, Mofetoluwa Adeyemi, Charishma Buchireddy, Ben Walls, Noah Bennett, Shinji Watanabe, Noah A. Smith, Yulia Tsvetkov, Sachin Kumar
[LINK]
http://arxiv.org/abs/2505.03054v2
[DATE]
2025-05-13 03:49:55+08:00
[CATEGORIES]
cs.CL
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
[AUTHORS]
Andreas Stephan, Dawei Zhu, Matthias Aßenmacher, Xiaoyu Shen, Benjamin Roth
[ABSTRACT]
To reduce the need for human annotations, large language models (LLMs) have
been proposed as judges of the quality of other candidate models. The
performance of LLM judges is typically evaluated by measuring the correlation
with human judgments on generative tasks such as summarization or machine
translation. In contrast, we study LLM judges on mathematical reasoning tasks.
These tasks require multi-step reasoning, and the correctness of their
solutions is verifiable, enabling a more objective evaluation. We perform a
detailed performance analysis and find that easy samples are easy to judge, and
difficult samples are difficult to judge. Our analysis uncovers a strong
correlation between judgment performance and the candidate model task
performance, indicating that judges tend to favor higher-quality models even if
their answer is incorrect. As a consequence, we test whether we can predict the
behavior of LLM judges using simple features such as part-of-speech tags and
find that we can correctly predict 70%-75% of judgments. We conclude this study
by analyzing practical use cases, showing that LLM judges consistently detect
the on-average better model but largely fail if we use them to improve task
performance.
[LINK]
http://arxiv.org/abs/2409.04168v2
[DATE]
2025-05-13 03:41:57+08:00
[CATEGORIES]
cs.CL
Large Language Models and Arabic Content: A Review
[AUTHORS]
Haneh Rhel, Dmitri Roussinov
[ABSTRACT]
Over the past three years, the rapid advancement of Large Language Models
(LLMs) has had a profound impact on multiple areas of Artificial Intelligence
(AI), particularly in Natural Language Processing (NLP) across diverse
languages, including Arabic. Although Arabic is considered one of the most
widely spoken languages across 27 countries in the Arab world and is used as a
second language in some non-Arab countries as well, there is still a
scarcity of Arabic resources, datasets, and tools. Arabic NLP tasks face
various challenges due to the complexities of the Arabic language, including
its rich morphology, intricate structure, and diverse writing standards, among
other factors. Researchers have been actively addressing these challenges,
demonstrating that pre-trained Large Language Models (LLMs) trained on
multilingual corpora achieve significant success in various Arabic NLP tasks.
This study provides an overview of using large language models (LLMs) for the
Arabic language, highlighting early pre-trained Arabic Language models across
various NLP applications and their ability to handle diverse Arabic content
tasks and dialects. It also provides an overview of how techniques like
finetuning and prompt engineering can enhance the performance of these models.
Additionally, the study summarizes common Arabic benchmarks and datasets while
presenting our observations on the persistent upward trend in the adoption of
LLMs.
[COMMENTS]
Original language: English. This paper has been submitted to the First
International Conference on Artificial Intelligence and Generative AI
(FICAILY 2025), and it has been accepted for presentation at FICAILY on
9-10 July 2025 and for publication by Springer Nature. Number of pages: 16.
Publication status: Accepted/In press, 7 Apr 2025.
https://www.gena-ai-libya2025.com/
[LINK]
http://arxiv.org/abs/2505.08004v1
[DATE]
2025-05-13 03:09:12+08:00
[CATEGORIES]
cs.CL
Task-Adaptive Semantic Communications with Controllable Diffusion-based Data Regeneration
[AUTHORS]
Fupei Guo, Achintha Wijesinghe, Songyang Zhang, Zhi Ding
[ABSTRACT]
Semantic communications represent a new paradigm of next-generation
networking that shifts bit-wise data delivery to conveying the semantic
meanings for bandwidth efficiency. To effectively accommodate various potential
downstream tasks at the receiver side, one should adaptively convey the most
critical semantic information. This work presents a novel task-adaptive
semantic communication framework based on diffusion models that is capable of
dynamically adjusting the semantic message delivery according to various
downstream tasks. Specifically, we initialize the transmission of a
deep-compressed general semantic representation from the transmitter to enable
diffusion-based coarse data reconstruction at the receiver. The receiver
identifies the task-specific demands and generates textual prompts as feedback.
Integrated with the attention mechanism, the transmitter updates the semantic
transmission with more details to better align with the objectives of the
intended receivers. Our test results demonstrate the efficacy of the proposed
method in adaptively preserving critical task-relevant information for semantic
communications while maintaining high compression efficiency.
[LINK]
http://arxiv.org/abs/2505.07980v1
[DATE]
2025-05-13 02:23:53+08:00
[CATEGORIES]
cs.CL
Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models
[AUTHORS]
Weiyi Wu, Xinwen Xu, Chongyang Gao, Xingjian Diao, Siting Li, Lucas A. Salas, Jiang Gui
[ABSTRACT]
Large Language Models (LLMs) have great potential in the field of health
care, yet they face great challenges in adapting to rapidly evolving medical
knowledge. This can lead to outdated or contradictory treatment suggestions.
This study investigated how LLMs respond to evolving clinical guidelines,
focusing on concept drift and internal inconsistencies. We developed the
DriftMedQA benchmark to simulate guideline evolution and assessed the temporal
reliability of various LLMs. Our evaluation of seven state-of-the-art models
across 4,290 scenarios showed that they struggle to reject outdated
recommendations and frequently endorse conflicting guidance. Additionally, we
explored two mitigation strategies: Retrieval-Augmented Generation and
preference fine-tuning via Direct Preference Optimization. While each method
improved model performance, their combination led to the most consistent and
reliable results. These findings underscore the need to improve LLM robustness
to temporal shifts to ensure more dependable applications in clinical practice.
[LINK]
http://arxiv.org/abs/2505.07968v1
[DATE]
2025-05-13 02:08:02+08:00
[CATEGORIES]
cs.CL
A Comparative Analysis of Static Word Embeddings for Hungarian
[AUTHORS]
Máté Gedeon
[ABSTRACT]
This paper presents a comprehensive analysis of various static word
embeddings for Hungarian, including traditional models such as Word2Vec and
FastText, as well as static embeddings derived from BERT-based models using
different extraction methods. We evaluate these embeddings on both intrinsic
and extrinsic tasks to provide a holistic view of their performance. For
intrinsic evaluation, we employ a word analogy task, which assesses the
embeddings’ ability to capture semantic and syntactic relationships. Our results
indicate that traditional static embeddings, particularly FastText, excel in
this task, achieving high accuracy and mean reciprocal rank (MRR) scores. Among
the BERT-based models, the X2Static method for extracting static embeddings
demonstrates superior performance compared to decontextualized and aggregate
methods, approaching the effectiveness of traditional static embeddings. For
extrinsic evaluation, we utilize a bidirectional LSTM model to perform Named
Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. The results
reveal that embeddings derived from dynamic models, especially those extracted
using the X2Static method, outperform purely static embeddings. Notably, ELMo
embeddings achieve the highest accuracy in both NER and POS tagging tasks,
underscoring the benefits of contextualized representations even when used in a
static form. Our findings highlight the continued relevance of static word
embeddings in NLP applications and the potential of advanced extraction methods
to enhance the utility of BERT-based models. This piece of research contributes
to the understanding of embedding performance in the Hungarian language and
provides valuable insights for future developments in the field. The training
scripts, evaluation codes, restricted vocabulary, and extracted embeddings will
be made publicly available to support further research and reproducibility.
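For readers unfamiliar with the two intrinsic metrics mentioned above, the following sketch computes top-1 accuracy and mean reciprocal rank (MRR) for an analogy task. The function name, the candidate lists, and the Hungarian example words are hypothetical and are not taken from the paper's evaluation code.

```python
def accuracy_and_mrr(ranked_predictions, gold_answers):
    """ranked_predictions[i] is a best-first candidate list for query i.

    Accuracy counts queries whose top candidate is the gold answer;
    MRR averages 1/rank of the gold answer (0 if it is not ranked).
    """
    hits = 0
    reciprocal_ranks = []
    for candidates, gold in zip(ranked_predictions, gold_answers):
        if candidates and candidates[0] == gold:
            hits += 1
        if gold in candidates:
            reciprocal_ranks.append(1.0 / (candidates.index(gold) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(gold_answers)
    return hits / n, sum(reciprocal_ranks) / n


# Hypothetical ranked outputs for three analogy queries.
predictions = [
    ["királynő", "király"],       # gold at rank 1
    ["nő", "lány", "asszony"],    # gold at rank 2
    ["kutya"],                    # gold not retrieved
]
gold = ["királynő", "lány", "macska"]

accuracy, mrr = accuracy_and_mrr(predictions, gold)
```

MRR rewards near misses that accuracy ignores: the second query contributes nothing to accuracy but 0.5 to the reciprocal-rank sum.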
[LINK]
http://arxiv.org/abs/2505.07809v1
[DATE]
2025-05-13 01:57:11+08:00
[CATEGORIES]
cs.CL
Learning Dynamics in Continual Pre-Training for Large Language Models
[AUTHORS]
Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng
[ABSTRACT]
Continual Pre-Training (CPT) has become a popular and effective method to
apply strong foundation models to specific downstream tasks. In this work, we
explore the learning dynamics throughout the CPT process for large language
models. We specifically focus on how general and downstream domain performance
evolves at each training step, with domain performance measured via validation
losses. We have observed that the CPT loss curve fundamentally characterizes
the transition from one curve to another hidden curve, and could be described
by decoupling the effects of distribution shift and learning rate annealing. We
derive a CPT scaling law that combines the two factors, enabling the prediction
of loss at any (continual) training steps and across learning rate schedules
(LRS) in CPT. Our formulation presents a comprehensive understanding of several
critical factors in CPT, including loss potential, peak learning rate, training
steps, replay ratio, etc. Moreover, our approach can be adapted to customize
training hyper-parameters to different CPT goals such as balancing general and
domain-specific performance. Extensive experiments demonstrate that our scaling
law holds across various CPT datasets and training hyper-parameters.
[COMMENTS]
Accepted to ICML2025 (spotlight)
[LINK]
http://arxiv.org/abs/2505.07796v1
[DATE]
2025-05-13 01:47:32+08:00
[CATEGORIES]
cs.CL
cs.LG
Domain Regeneration: How well do LLMs match syntactic properties of text domains?
[AUTHORS]
Da Ju, Hagen Blix, Adina Williams
[ABSTRACT]
Recent improvements in large language model performance have, in all
likelihood, been accompanied by improvement in how well they can approximate
the distribution of their training data. In this work, we explore the following
question: which properties of text domains do LLMs faithfully approximate, and
how well do they do so? Applying observational approaches familiar from corpus
linguistics, we prompt a commonly used, open-source LLM to regenerate text from
two domains of permissively licensed English text which are often contained in
LLM training data – Wikipedia and news text. This regeneration paradigm allows
us to investigate whether LLMs can faithfully match the original human text
domains in a fairly semantically-controlled setting. We investigate varying
levels of syntactic abstraction, from simpler properties like sentence
length and article readability, to more complex and higher-order properties
such as dependency tag distribution, parse depth, and parse complexity. We find
that the majority of the regenerated distributions show a shifted mean, a lower
standard deviation, and a reduction of the long tail, as compared to the human
originals.
[LINK]
http://arxiv.org/abs/2505.07784v1
[DATE]
2025-05-13 01:37:17+08:00
[CATEGORIES]
cs.CL
Must Read: A Systematic Survey of Computational Persuasion
[AUTHORS]
Nimet Beyza Bozdag, Shuhaib Mehri, Xiaocheng Yang, Hyeonjeong Ha, Zirui Cheng, Esin Durmus, Jiaxuan You, Heng Ji, Gokhan Tur, Dilek Hakkani-Tür
[ABSTRACT]
Persuasion is a fundamental aspect of communication, influencing
decision-making across diverse contexts, from everyday conversations to
high-stakes scenarios such as politics, marketing, and law. The rise of
conversational AI systems has significantly expanded the scope of persuasion,
introducing both opportunities and risks. AI-driven persuasion can be leveraged
for beneficial applications, but also poses threats through manipulation and
unethical influence. Moreover, AI systems are not only persuaders, but also
susceptible to persuasion, making them vulnerable to adversarial attacks and
bias reinforcement. Despite rapid advancements in AI-generated persuasive
content, our understanding of what makes persuasion effective remains limited
due to its inherently subjective and context-dependent nature. In this survey,
we provide a comprehensive overview of computational persuasion, structured
around three key perspectives: (1) AI as a Persuader, which explores
AI-generated persuasive content and its applications; (2) AI as a Persuadee,
which examines AI’s susceptibility to influence and manipulation; and (3) AI as
a Persuasion Judge, which analyzes AI’s role in evaluating persuasive
strategies, detecting manipulation, and ensuring ethical persuasion. We
introduce a taxonomy for computational persuasion research and discuss key
challenges, including evaluating persuasiveness, mitigating manipulative
persuasion, and developing responsible AI-driven persuasive systems. Our survey
outlines future research directions to enhance the safety, fairness, and
effectiveness of AI-powered persuasion while addressing the risks posed by
increasingly capable language models.
[LINK]
http://arxiv.org/abs/2505.07775v1
[DATE]
2025-05-13 01:26:31+08:00
[CATEGORIES]
cs.CL
Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding
[AUTHORS]
Yifeng Di, Tianyi Zhang
[ABSTRACT]
Large Language Models (LLMs) have demonstrated unprecedented capability in
code generation. However, LLM-generated code is still plagued with a wide range
of functional errors, especially for complex programming tasks that LLMs have
not seen before. Recent studies have shown that developers often struggle with
inspecting and fixing incorrect code generated by LLMs, diminishing their
productivity and trust in LLM-based code generation. Inspired by the mutual
grounding theory in communication, we propose an interactive approach that
leverages code comments as a medium for developers and LLMs to establish a
shared understanding. Our approach facilitates iterative grounding by
interleaving code generation, inline comment generation, and contextualized
user feedback through editable comments to align generated code with developer
intent. We evaluated our approach on two popular benchmarks and demonstrated
that our approach significantly improved multiple state-of-the-art LLMs, e.g.,
17.1% pass@1 improvement for code-davinci-002 on HumanEval. Furthermore, we
conducted a user study with 12 participants in comparison to two baselines: (1)
interacting with GitHub Copilot, and (2) interacting with a multi-step code
generation paradigm called Multi-Turn Program Synthesis. Participants completed
the given programming tasks 16.7% faster and with 10.5% improvement in task
success rate when using our approach. Both results show that interactively
refining code comments enables the collaborative establishment of mutual
grounding, leading to more accurate code generation and higher developer
confidence.
[COMMENTS]
Accepted to ICSE 2025
[LINK]
http://arxiv.org/abs/2505.07768v1
[DATE]
2025-05-13 01:20:30+08:00
[CATEGORIES]
cs.CL
Spoken Language Understanding on Unseen Tasks With In-Context Learning
[AUTHORS]
Neeraj Agrawal, Sriram Ganapathy
[ABSTRACT]
Spoken language understanding (SLU) tasks involve diverse skills that probe
the information extraction, classification and/or generation capabilities of
models. In this setting, task-specific training data may not always be
available. While traditional task-specific SLU models are unable to cater to
such requirements, speech-text large language models (LLMs) offer a
promising alternative with emergent abilities. However, our evaluations
indicate that the out-of-the-box zero/few-shot performance of prominent
open-source speech-text LLMs on SLU tasks is not up to the mark. In this
paper, we introduce a novel approach to robust task-agnostic fine-tuning using
randomized class labels. With this proposed fine-tuning, we illustrate that the
performance of the speech-text LLMs on an unseen task is significantly improved
over standard approaches. Critically, the proposed approach avoids the
requirement of task-specific data annotations for enabling new tasks in
speech-text LLMs.
[LINK]
http://arxiv.org/abs/2505.07731v1
[DATE]
2025-05-13 00:38:43+08:00
[CATEGORIES]
cs.CL
cs.LG
The Leaderboard Illusion
[AUTHORS]
Shivalika Singh, Yiyang Nan, Alex Wang, Daniel D’Souza, Sayash Kapoor, Ahmet Üstün, Sanmi Koyejo, Yuntian Deng, Shayne Longpre, Noah A. Smith, Beyza Ermis, Marzieh Fadaee, Sara Hooker
[ABSTRACT]
Measuring progress is fundamental to the advancement of any scientific field.
As benchmarks play an increasingly central role, they also grow more
susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard
for ranking the most capable AI systems. Yet, in this work we identify
systematic issues that have resulted in a distorted playing field. We find that
undisclosed private testing practices benefit a handful of providers who are
able to test multiple variants before public release and retract scores if
desired. We establish that the ability of these providers to choose the best
score leads to biased Arena scores due to selective disclosure of performance
results. At an extreme, we identify 27 private LLM variants tested by Meta in
the lead-up to the Llama-4 release. We also establish that proprietary closed
models are sampled at higher rates (number of battles) and have fewer models
removed from the arena than open-weight and open-source alternatives. Both
these policies lead to large data access asymmetries over time. Providers like
Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the
arena, respectively. In contrast, a combined 83 open-weight models have only
received an estimated 29.7% of the total data. We show that access to Chatbot
Arena data yields substantial benefits; even limited additional data can result
in relative performance gains of up to 112% on the arena distribution, based on
our conservative estimates. Together, these dynamics result in overfitting to
Arena-specific dynamics rather than general model quality. The Arena builds on
the substantial efforts of both the organizers and an open community that
maintains this valuable evaluation platform. We offer actionable
recommendations to reform the Chatbot Arena’s evaluation framework and promote
fairer, more transparent benchmarking for the field.
[COMMENTS]
68 pages, 18 figures, 9 tables
[LINK]
http://arxiv.org/abs/2504.20879v2
[DATE]
2025-05-13 00:33:58+08:00
[CATEGORIES]
cs.CL
cs.LG
Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images
[AUTHORS]
Elisei Rykov, Kseniia Petrushina, Kseniia Titova, Anton Razzhigaev, Alexander Panchenko, Vasily Konovalov
[ABSTRACT]
Measuring how real images look is a complex task in artificial intelligence
research. For example, an image of a boy with a vacuum cleaner in a desert
violates common sense. We introduce a novel method, which we call Through the
Looking Glass (TLG), to assess image common sense consistency using Large
Vision-Language Models (LVLMs) and a Transformer-based encoder. By leveraging
LVLMs to extract atomic facts from these images, we obtain a mix of accurate
facts. We proceed by fine-tuning a compact attention-pooling classifier over
encoded atomic facts. Our TLG has achieved a new state-of-the-art performance
on the WHOOPS! and WEIRD datasets while leveraging a compact fine-tuning
component.
[LINK]
http://arxiv.org/abs/2505.07704v1
[DATE]
2025-05-13 00:12:11+08:00
[CATEGORIES]
cs.CL
From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
[AUTHORS]
Thom Lake, Eunsol Choi, Greg Durrett
[COMMENTS]
NAACL 2025 (Main Conference)
[LINK]
http://arxiv.org/abs/2406.17692v2
[DATE]
2025-05-13 00:11:53+08:00
[CATEGORIES]
cs.CL
cs.LG
Re$^2$: A Consistency-ensured Dataset for Full-stage Peer Review and Multi-turn Rebuttal Discussions
[AUTHORS]
Daoze Zhang, Zhijian Bao, Sihang Du, Zhiyi Zhao, Kuangling Zhang, Dezheng Bao, Yang Yang
[ABSTRACT]
Peer review is a critical component of scientific progress in fields like
AI, but the rapid increase in submission volume has strained the reviewing
system, which inevitably leads to reviewer shortages and declining review
quality. Besides the growing research popularity, another key factor in this
overload is the repeated resubmission of substandard manuscripts, largely due
to the lack of effective tools for authors to self-evaluate their work before
submission. Large Language Models (LLMs) show great promise in assisting both
authors and reviewers, and their performance is fundamentally limited by the
quality of the peer review data. However, existing peer review datasets face
three major limitations: (1) limited data diversity, (2) inconsistent and
low-quality data due to the use of revised rather than initial submissions, and
(3) insufficient support for tasks involving rebuttal and reviewer-author
interactions. To address these challenges, we introduce the largest
consistency-ensured peer review and rebuttal dataset named Re^2, which
comprises 19,926 initial submissions, 70,668 review comments, and 53,818
rebuttals from 24 conferences and 21 workshops on OpenReview. Moreover, the
rebuttal and discussion stage is framed as a multi-turn conversation paradigm
to support both traditional static review tasks and dynamic interactive LLM
assistants, providing more practical guidance for authors to refine their
manuscripts and helping alleviate the growing review burden. Our data and code
are available at https://anonymous.4open.science/r/ReviewBench_anon/.
[COMMENTS]
2 figures, 5 tables
[LINK]
http://arxiv.org/abs/2505.07920v1
[DATE]
2025-05-13 00:02:52+08:00
[CATEGORIES]
cs.CL
cs.LG
Continuous Temporal Learning of Probability Distributions via Neural ODEs with Applications in Continuous Glucose Monitoring Data
[AUTHORS]
Antonio Álvarez-López, Marcos Matabuena
[ABSTRACT]
Modeling the continuous-time dynamics of probability distributions from
time-dependent data samples is a fundamental problem in many fields, including
digital health. The aim is to analyze how the distribution of a biomarker, such
as glucose, evolves over time and how these changes may reflect the progression
of chronic diseases such as diabetes. In this paper, we propose a novel
probabilistic model based on a mixture of Gaussian distributions to capture how
samples from a continuous-time stochastic process evolve over time. To
model potential distribution shifts over time, we introduce a time-dependent
function parameterized by a Neural Ordinary Differential Equation (Neural ODE)
and estimate it non-parametrically using the Maximum Mean Discrepancy (MMD).
The proposed model is highly interpretable, detects subtle temporal shifts, and
remains computationally efficient. Through simulation studies, we show that it
performs competitively in terms of estimation accuracy against
state-of-the-art, less interpretable methods such as normalized gradient flows
and non-parametric kernel density estimators. Finally, we demonstrate the
utility of our method on digital clinical-trial data, showing how the
interventions alter the time-dependent distribution of glucose levels and
enabling a rigorous comparison of control and treatment groups from novel
mathematical and clinical perspectives.
[LINK]
http://arxiv.org/abs/2505.08698v1
[DATE]
2025-05-13 23:57:06+08:00
[CATEGORIES]
cs.LG
From S4 to Mamba: A Comprehensive Survey on Structured State Space Models
[AUTHORS]
Shriyank Somvanshi, Md Monzurul Islam, Mahmuda Sultana Mimi, Sazzad Bin Bashar Polock, Gaurab Chhetri, Subasish Das
[ABSTRACT]
Recent advancements in sequence modeling have led to the emergence of
Structured State Space Models (SSMs) as an efficient alternative to Recurrent
Neural Networks (RNNs) and Transformers, addressing challenges in long-range
dependency modeling and computational efficiency. While RNNs suffer from
vanishing gradients and sequential inefficiencies, and Transformers face
quadratic complexity, SSMs leverage structured recurrence and state-space
representations to achieve superior long-sequence processing with linear or
near-linear complexity. This survey provides a comprehensive review of SSMs,
tracing their evolution from the foundational S4 model to its successors like
Mamba, Simplified Structured State Space Sequence Model (S5), and Jamba,
highlighting their improvements in computational efficiency, memory
optimization, and inference speed. By comparing SSMs with traditional sequence
models across domains such as natural language processing (NLP), speech
recognition, vision, and time-series forecasting, we demonstrate their
advantages in handling long-range dependencies while reducing computational
overhead. Despite their potential, challenges remain in areas such as training
optimization, hybrid modeling, and interpretability. This survey serves as a
structured guide for researchers and practitioners, detailing the advancements,
trade-offs, and future directions of SSM-based architectures in AI and deep
learning.
[COMMENTS]
30 pages, 8 figures, 3 tables
[LINK]
http://arxiv.org/abs/2503.18970v2
[DATE]
2025-05-13 23:46:33+08:00
[CATEGORIES]
cs.LG
AC-PKAN: Attention-Enhanced and Chebyshev Polynomial-Based Physics-Informed Kolmogorov-Arnold Networks
[AUTHORS]
Hangwei Zhang, Zhimu Huang, Yan Wang
[ABSTRACT]
Kolmogorov-Arnold Networks (KANs) have recently shown promise for solving
partial differential equations (PDEs). Yet their original formulation is
computationally and memory intensive, motivating the introduction of Chebyshev
Type-I-based KANs (Chebyshev1KANs). Although Chebyshev1KANs have outperformed
the vanilla KAN architecture, our rigorous theoretical analysis reveals that
they still suffer from rank collapse, ultimately limiting their expressive
capacity. To overcome these limitations, we enhance Chebyshev1KANs by
integrating wavelet-activated MLPs with learnable parameters and an internal
attention mechanism. We prove that this design preserves a full-rank Jacobian
and is capable of approximating solutions to PDEs of arbitrary order.
Furthermore, to alleviate the loss instability and imbalance introduced by the
Chebyshev polynomial basis, we externally incorporate a Residual Gradient
Attention (RGA) mechanism that dynamically re-weights individual loss terms
according to their gradient norms and residual magnitudes. By jointly
leveraging internal and external attention, we present AC-PKAN, a novel
architecture that constitutes an enhancement to weakly supervised
Physics-Informed Neural Networks (PINNs) and extends the expressive power of
KANs. Experimental results from nine benchmark tasks across three domains show
that AC-PKAN consistently outperforms or matches state-of-the-art models such
as PINNsFormer, establishing it as a highly effective tool for solving complex
real-world engineering problems in zero-data or data-sparse regimes. The code
will be made publicly available upon acceptance.
[LINK]
http://arxiv.org/abs/2505.08687v1
[DATE]
2025-05-13 23:46:10+08:00
[CATEGORIES]
cs.LG
CAD-Coder: Text-Guided CAD Files Code Generation
[AUTHORS]
Changqi He, Shuhan Zhang, Liguo Zhang, Jiajun Miao
[ABSTRACT]
Computer-aided design (CAD) is a way to digitally create 2D drawings and 3D
models of real-world products. Traditional CAD typically relies on hand-drawing
by experts or modifications of existing library files, which doesn’t allow for
rapid personalization. With the emergence of generative artificial
intelligence, convenient and efficient personalized CAD generation has become
possible. However, existing generative methods typically produce outputs that
lack interactive editability and geometric annotations, limiting their
practical applications in manufacturing. To enable interactive generative CAD,
we propose CAD-Coder, a framework that transforms natural language instructions
into CAD script codes, which can be executed in Python environments to generate
human-editable CAD files (.Dxf). To facilitate the generation of editable CAD
sketches with annotation information, we construct a comprehensive dataset
comprising 29,130 Dxf files with their corresponding script codes, where each
sketch preserves both editability and geometric annotations. We evaluate
CAD-Coder on various 2D/3D CAD generation tasks against existing methods,
demonstrating superior interactive capabilities while uniquely providing
editable sketches with geometric annotations.
[LINK]
http://arxiv.org/abs/2505.08686v1
[DATE]
2025-05-13 23:45:46+08:00
[CATEGORIES]
cs.LG
Uncertainty-Aware Surrogate-based Amortized Bayesian Inference for Computationally Expensive Models
[AUTHORS]
Stefania Scheurer, Philipp Reiser, Tim Brünnette, Wolfgang Nowak, Anneli Guthke, Paul-Christian Bürkner
[ABSTRACT]
Bayesian inference typically relies on a large number of model evaluations to
estimate posterior distributions. Established methods like Markov Chain Monte
Carlo (MCMC) and Amortized Bayesian Inference (ABI) can become computationally
challenging. While ABI enables fast inference after training, generating
sufficient training data still requires thousands of model simulations, which
is infeasible for expensive models. Surrogate models offer a solution by
providing approximate simulations at a lower computational cost, allowing the
generation of large data sets for training. However, the introduced
approximation errors and uncertainties can lead to overconfident posterior
estimates. To address this, we propose Uncertainty-Aware Surrogate-based
Amortized Bayesian Inference (UA-SABI), a framework that combines surrogate
modeling and ABI while explicitly quantifying and propagating surrogate
uncertainties through the inference pipeline. Our experiments show that this
approach enables reliable, fast, and repeated Bayesian inference for
computationally expensive models, even under tight time constraints.
[COMMENTS]
16 pages, 7 figures
[LINK]
http://arxiv.org/abs/2505.08683v1
[DATE]
2025-05-13 23:44:10+08:00
[CATEGORIES]
cs.LG
On the Impact of Uncertainty and Calibration on Likelihood-Ratio Membership Inference Attacks
[AUTHORS]
Meiyi Zhu, Caili Guo, Chunyan Feng, Osvaldo Simeone
[ABSTRACT]
In a membership inference attack (MIA), an attacker exploits the
overconfidence exhibited by typical machine learning models to determine
whether a specific data point was used to train a target model. In this paper,
we analyze the performance of the likelihood ratio attack (LiRA) within an
information-theoretical framework that allows the investigation of the impact
of the aleatoric uncertainty in the true data generation process, of the
epistemic uncertainty caused by a limited training data set, and of the
calibration level of the target model. We compare three different settings, in
which the attacker receives decreasingly informative feedback from the target
model: confidence vector (CV) disclosure, in which the output probability
vector is released; true label confidence (TLC) disclosure, in which only the
probability assigned to the true label is made available by the model; and
decision set (DS) disclosure, in which an adaptive prediction set is produced
as in conformal prediction. We derive bounds on the advantage of an MIA
adversary with the aim of offering insights into the impact of uncertainty and
calibration on the effectiveness of MIAs. Simulation results demonstrate that
the derived analytical bounds predict the effectiveness of MIAs well.
[COMMENTS]
16 pages, 23 figures
[LINK]
http://arxiv.org/abs/2402.10686v4
[DATE]
2025-05-13 23:38:09+08:00
[CATEGORIES]
cs.LG
EMPERROR: A Flexible Generative Perception Error Model for Probing Self-Driving Planners
[AUTHORS]
Niklas Hanselmann, Simon Doll, Marius Cordts, Hendrik P. A. Lensch, Andreas Geiger
[ABSTRACT]
To handle the complexities of real-world traffic, learning planners for
self-driving from data is a promising direction. While recent approaches have
shown great progress, they typically assume a setting in which the ground-truth
world state is available as input. However, when deployed, planning needs to be
robust to the long tail of errors incurred by a noisy perception system, which
is often neglected in evaluation. To address this, previous work has proposed
drawing adversarial samples from a perception error model (PEM) mimicking the
noise characteristics of a target object detector. However, these methods use
simple PEMs that fail to accurately capture all failure modes of detection. In
this paper, we present EMPERROR, a novel transformer-based generative PEM,
apply it to stress-test an imitation learning (IL)-based planner and show that
it imitates modern detectors more faithfully than previous work. Furthermore,
it is able to produce realistic noisy inputs that increase the planner’s
collision rate by up to 85%, demonstrating its utility as a valuable tool for a
more complete evaluation of self-driving planners.
[COMMENTS]
Project page: https://lasnik.github.io/emperror/
[LINK]
http://arxiv.org/abs/2411.07719v2
[DATE]
2025-05-13 23:30:04+08:00
[CATEGORIES]
cs.LG
Sample-Efficient Reinforcement Learning of Koopman eNMPC
[AUTHORS]
Daniel Mayfrank, Mehmet Velioglu, Alexander Mitsos, Manuel Dahmen
[ABSTRACT]
Reinforcement learning (RL) can be used to tune data-driven (economic)
nonlinear model predictive controllers ((e)NMPCs) for optimal performance in a
specific control task by optimizing the dynamic model or parameters in the
policy’s objective function or constraints, such as state bounds. However, the
sample efficiency of RL is crucial, and to improve it, we combine a model-based
RL algorithm with our published method that turns Koopman (e)NMPCs into
automatically differentiable policies. We apply our approach to an eNMPC case
study of a continuous stirred-tank reactor (CSTR) model from the literature.
The approach outperforms benchmark methods, i.e., data-driven eNMPCs using
models based on system identification without further RL tuning of the
resulting policy, and neural network controllers trained with model-based RL,
by achieving superior control performance and higher sample efficiency.
Furthermore, utilizing partial prior knowledge about the system dynamics via
physics-informed learning further increases sample efficiency.
[COMMENTS]
25 pages, 9 figures, 2 tables
[LINK]
http://arxiv.org/abs/2503.18787v2
[DATE]
2025-05-13 23:16:06+08:00
[CATEGORIES]
cs.LG
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
[AUTHORS]
Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Hari Subramoni, Dhabaleswar K. Panda
[ABSTRACT]
Large Language Models (LLMs) with long context capabilities are integral to
complex tasks in natural language processing and computational biology, such as
text generation and protein sequence analysis. However, training LLMs directly
on extremely long contexts demands considerable GPU resources and increased
memory, leading to higher costs and greater complexity. Alternative approaches
that introduce long context capabilities via downstream finetuning or
adaptations impose significant design limitations. In this paper, we propose
Fully Pipelined Distributed Transformer (FPDT) for efficiently training
long-context LLMs with extreme hardware efficiency. For GPT and Llama models,
we achieve a 16x increase in sequence length that can be trained on the same
hardware compared to current state-of-the-art solutions. With our dedicated
sequence chunk pipeline design, we can now train an 8B LLM with a sequence
length of 2 million on only 4 GPUs, while also maintaining over 55% MFU. Our proposed
FPDT is agnostic to existing training techniques and is proven to work
efficiently across different LLM models.
[COMMENTS]
The Eighth Annual Conference on Machine Learning and Systems (MLSys’25)
[LINK]
http://arxiv.org/abs/2408.16978v2
[DATE]
2025-05-13 23:07:26+08:00
[CATEGORIES]
cs.LG
Modular Federated Learning: A Meta-Framework Perspective
[AUTHORS]
Frederico Vicente, Cláudia Soares, Dušan Jakovetić
[ABSTRACT]
Federated Learning (FL) enables distributed machine learning training while
preserving privacy, representing a paradigm shift for data-sensitive and
decentralized environments. Despite its rapid advancements, FL remains a
complex and multifaceted field, requiring a structured understanding of its
methodologies, challenges, and applications. In this survey, we introduce a
meta-framework perspective, conceptualising FL as a composition of modular
components that systematically address core aspects such as communication,
optimisation, security, and privacy. We provide a historical contextualisation
of FL, tracing its evolution from distributed optimisation to modern
distributed learning paradigms. Additionally, we propose a novel taxonomy
distinguishing Aggregation from Alignment, introducing the concept of alignment
as a fundamental operator alongside aggregation. To bridge theory with
practice, we explore available FL frameworks in Python, facilitating real-world
implementation. Finally, we systematise key challenges across FL sub-fields,
providing insights into open research questions throughout the meta-framework
modules. By structuring FL within a meta-framework of modular components and
emphasising the dual role of Aggregation and Alignment, this survey provides a
holistic and adaptable foundation for understanding and advancing FL research
and deployment.
[LINK]
http://arxiv.org/abs/2505.08646v1
[DATE]
2025-05-13 23:04:55+08:00
[CATEGORIES]
cs.LG
WixQA: A Multi-Dataset Benchmark for Enterprise Retrieval-Augmented Generation
[AUTHORS]
Dvir Cohen, Lin Burg, Sviatoslav Pykhnivskyi, Hagit Gur, Stanislav Kovynov, Olga Atzmon, Gilad Barkan
[ABSTRACT]
Retrieval-Augmented Generation (RAG) is a cornerstone of modern question
answering (QA) systems, enabling grounded answers based on external knowledge.
Although recent progress has been driven by open-domain datasets, enterprise QA
systems need datasets that mirror the concrete, domain-specific issues users
raise in day-to-day support scenarios. Critically, evaluating end-to-end RAG
systems requires benchmarks comprising not only question–answer pairs but also
the specific knowledge base (KB) snapshot from which answers were derived. To
address this need, we introduce WixQA, a benchmark suite featuring QA datasets
precisely grounded in the released KB corpus, enabling holistic evaluation of
retrieval and generation components. WixQA includes three distinct QA datasets
derived from Wix.com customer support interactions and grounded in a snapshot
of the public Wix Help Center KB: (i) WixQA-ExpertWritten, 200 real user
queries with expert-authored, multi-step answers; (ii) WixQA-Simulated, 200
expert-validated QA pairs distilled from user dialogues; and (iii)
WixQA-Synthetic, 6,222 LLM-generated QA pairs, with one pair systematically
derived from each article in the knowledge base. We release the KB snapshot
alongside the datasets under the MIT license and provide comprehensive baseline
results, forming a unique benchmark for evaluating enterprise RAG systems in
realistic enterprise environments.
[LINK]
http://arxiv.org/abs/2505.08643v1
[DATE]
2025-05-13 23:02:54+08:00
[CATEGORIES]
cs.LG
Credit Assignment and Efficient Exploration based on Influence Scope in Multi-agent Reinforcement Learning
[AUTHORS]
Shuai Han, Mehdi Dastani, Shihan Wang
[ABSTRACT]
Training cooperative agents in sparse-reward scenarios poses significant
challenges for multi-agent reinforcement learning (MARL). Without clear
feedback on actions at each step in sparse-reward settings, previous methods
struggle with precise credit assignment among agents and effective exploration.
In this paper, we introduce a novel method to deal with both credit assignment
and exploration problems in reward-sparse domains. Accordingly, we propose an
algorithm that calculates the Influence Scope of Agents (ISA) on states by
taking specific values of the dimensions/attributes of states that can be
influenced by individual agents. The mutual dependence between agents’ actions
and state attributes is then used to calculate the credit assignment and to
delimit the exploration space for each individual agent. We then evaluate ISA
in a variety of sparse-reward multi-agent scenarios. The results show that our
method significantly outperforms the state-of-the-art baselines.
[LINK]
http://arxiv.org/abs/2505.08630v1
[DATE]
2025-05-13 22:49:26+08:00
[CATEGORIES]
cs.LG
DrivAer Transformer: A high-precision and fast prediction method for vehicle aerodynamic drag coefficient based on the DrivAerNet++ dataset
[AUTHORS]
Jiaqi He, Xiangwen Luo, Yiping Wang
[ABSTRACT]
At the current stage, deep learning-based methods have demonstrated excellent
capabilities in evaluating aerodynamic performance, significantly reducing the
time and cost required for traditional computational fluid dynamics (CFD)
simulations. However, when faced with the task of processing extremely complex
three-dimensional (3D) vehicle models, the lack of large-scale datasets and
training resources, coupled with the inherent diversity and complexity of the
geometry of different vehicle models, means that the prediction accuracy and
versatility of these networks are still not up to the level required for
current production. In view of the remarkable success of Transformer models in
the field of natural language processing and their strong potential in the
field of image processing, this study innovatively proposes a point cloud
learning framework called DrivAer Transformer (DAT). The DAT structure uses the
DrivAerNet++ dataset, which contains high-fidelity CFD data of
industrial-standard 3D vehicle shapes, enabling accurate estimation of air drag
directly from 3D meshes, thus avoiding the limitations of traditional methods
such as 2D image rendering or signed distance fields (SDF). DAT enables fast
and accurate drag prediction, driving the evolution of the aerodynamic
evaluation process and laying the critical foundation for introducing a
data-driven approach to automotive design. The framework is expected to
accelerate the vehicle design process and improve development efficiency.
[COMMENTS]
14 pages
[LINK]
http://arxiv.org/abs/2504.08217v4
[DATE]
2025-05-13 22:43:14+08:00
[CATEGORIES]
cs.LG
Cost Function Estimation Using Inverse Reinforcement Learning with Minimal Observations
[AUTHORS]
Sarmad Mehrdad, Avadesh Meduri, Ludovic Righetti
[ABSTRACT]
We present an iterative inverse reinforcement learning algorithm to infer
optimal cost functions in continuous spaces. Based on a popular maximum entropy
criterion, our approach iteratively finds a weight improvement step and proposes
a method to find an appropriate step size that ensures learned cost function
features remain similar to the demonstrated trajectory features. In contrast to
similar approaches, our algorithm can individually tune the effectiveness of
each observation for the partition function and does not need a large sample
set, enabling faster learning. We generate sample trajectories by solving an
optimal control problem instead of random sampling, leading to more informative
trajectories. The performance of our method is compared to two state-of-the-art
algorithms to demonstrate its benefits in several simulated environments.
[LINK]
http://arxiv.org/abs/2505.08619v1
[DATE]
2025-05-13 22:38:25+08:00
[CATEGORIES]
cs.LG
Automated Model-Free Sorting of Single-Molecule Fluorescence Events Using a Deep Learning Based Hidden-State Model
[AUTHORS]
Wenqi Zeng, Shuqi Zhou, Yuan Yao, Chunlai Chen
[ABSTRACT]
Single-molecule fluorescence assays enable high-resolution analysis of
biomolecular dynamics, but traditional analysis pipelines are labor-intensive
and rely on users’ experience, limiting scalability and reproducibility. Recent
deep learning models have automated aspects of data processing, yet many still
require manual thresholds, complex architectures, or extensive labeled data.
Therefore, we present DASH, a fully streamlined architecture for trace
classification, state assignment, and automatic sorting that requires no user
input. DASH demonstrates robust performance across users and experimental
conditions both in equilibrium and non-equilibrium systems such as
Cas12a-mediated DNA cleavage. This paper proposes a novel strategy for the
automatic and detailed sorting of single-molecule fluorescence events. The
dynamic cleavage process of Cas12a is used as an example to provide a
comprehensive analysis. This approach is crucial for studying biokinetic
structural changes at the single-molecule level.
[LINK]
http://arxiv.org/abs/2505.08608v1
[DATE]
2025-05-13 22:26:33+08:00
[CATEGORIES]
cs.LG
Joint Metric Space Embedding by Unbalanced OT with Gromov-Wasserstein Marginal Penalization
[AUTHORS]
Florian Beier, Moritz Piening, Robert Beinert, Gabriele Steidl
[ABSTRACT]
We propose a new approach for unsupervised alignment of heterogeneous
datasets, which maps data from two different domains without any known
correspondences to a common metric space. Our method is based on an unbalanced
optimal transport problem with Gromov-Wasserstein marginal penalization. It can
be seen as a counterpart to the recently introduced joint multidimensional
scaling method. We prove that there exists a minimizer of our functional and
that for penalization parameters going to infinity, the corresponding sequence
of minimizers converges to a minimizer of the so-called embedded Wasserstein
distance. Our model can be reformulated as a quadratic, multi-marginal,
unbalanced optimal transport problem, for which a bi-convex relaxation admits a
numerical solver via block-coordinate descent. We provide numerical examples
for joint embeddings in Euclidean as well as non-Euclidean spaces.
[LINK]
http://arxiv.org/abs/2502.07510v2
[DATE]
2025-05-13 22:24:45+08:00
[CATEGORIES]
cs.LG
Metric Similarity and Manifold Learning of Circular Dichroism Spectra of Proteins
[AUTHORS]
Gionni Marchetti
[ABSTRACT]
We present a machine learning analysis of circular dichroism spectra of
globular proteins from the SP175 database, using the optimal transport-based
$1$-Wasserstein distance $\mathcal{W}_1$ (with order $p=1$) and the manifold
learning algorithm $t$-SNE. Our results demonstrate that $\mathcal{W}_1$ is
consistent with both Euclidean and Manhattan metrics while exhibiting
robustness to noise. On the other hand, $t$-SNE uncovers meaningful structure
in the high-dimensional data. The clustering in the $t$-SNE embedding is
primarily determined by proteins with distinct secondary structure
compositions: one cluster predominantly contains $\beta$-rich proteins, while
the other consists mainly of proteins with mixed $\alpha/\beta$ and
$\alpha$-helical content.
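As a rough illustration of the distance named above: for two one-dimensional
distributions on a shared, evenly spaced grid, $\mathcal{W}_1$ reduces to the
area between their cumulative distribution functions. The sketch below assumes
non-negative inputs that can be normalized to unit mass. How the preprint
preprocesses CD spectra (which can take negative values) is not stated in the
abstract, and the function name and arguments are illustrative only.

```python
from itertools import accumulate


def wasserstein_1(p, q, spacing=1.0):
    """W1 between two histograms defined on the same evenly spaced grid.

    Assumption: p and q are non-negative and have positive total mass;
    both are normalized to sum to 1 before comparison.
    """
    if len(p) != len(q):
        raise ValueError("p and q must live on the same grid")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("inputs must have positive total mass")
    # Cumulative distribution functions of the normalized histograms.
    cdf_p = list(accumulate(x / total_p for x in p))
    cdf_q = list(accumulate(x / total_q for x in q))
    # In 1D, W1 is the integral of |F_p - F_q| over the grid.
    return spacing * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))
```

For example, moving all mass two bins to the right on a unit-spaced grid gives
a distance of 2, and identical histograms give 0.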
[COMMENTS]
Some parts of this preprint have been incorporated in the following
preprint arXiv:2505.06466 and its Supplementary Information
[LINK]
http://arxiv.org/abs/2504.19355v2
[DATE]
2025-05-13 22:15:55+08:00
[CATEGORIES]
cs.LG
MINIMALIST: switched-capacitor circuits for efficient in-memory computation of gated recurrent units
[AUTHORS]
Sebastian Billaudelle, Laura Kriener, Filippo Moro, Tristan Torchet, Melika Payvand
[ABSTRACT]
Recurrent neural networks (RNNs) have been a long-standing candidate for
processing of temporal sequence data, especially in memory-constrained systems
that one may find in embedded edge computing environments. Recent advances in
training paradigms have now inspired new generations of efficient RNNs. We
introduce a streamlined and hardware-compatible architecture based on minimal
gated recurrent units (GRUs), and an accompanying efficient mixed-signal
hardware implementation of the model. The proposed design leverages
switched-capacitor circuits not only for in-memory computation (IMC), but also
for the gated state updates. The mixed-signal cores rely solely on commodity
circuits consisting of metal capacitors, transmission gates, and a clocked
comparator, thus greatly facilitating scaling and transfer to other technology
nodes.
We benchmark the performance of our architecture on time series data,
introducing all constraints required for a direct mapping to the hardware
system. The direct compatibility is verified in mixed-signal simulations,
reproducing data recorded from the software-only network model.
[LINK]
http://arxiv.org/abs/2505.08599v1
[DATE]
2025-05-13 22:13:41+08:00
[CATEGORIES]
cs.LG
Clustering of Incomplete Data via a Bipartite Graph Structure
[AUTHORS]
Amirhossein Javaheri, Daniel P. Palomar
[ABSTRACT]
There are various approaches to graph learning for data clustering,
incorporating different spectral and structural constraints through diverse
graph structures. Some methods rely on bipartite graph models, where nodes are
divided into two classes: centers and members. These models typically require
access to data for the center nodes in addition to observations from the member
nodes. However, such additional data may not always be available in many
practical scenarios. Moreover, popular Gaussian models for graph learning have
demonstrated limited effectiveness in modeling data with heavy-tailed
distributions, which are common in financial markets. In this paper, we propose
a clustering method based on a bipartite graph model that addresses these
challenges. First, it can infer clusters from incomplete data without requiring
information about the center nodes. Second, it is designed to effectively
handle heavy-tailed data. Numerical experiments using real financial data
validate the efficiency of the proposed method for data clustering.
[LINK]
http://arxiv.org/abs/2505.08594v1
[DATE]
2025-05-13 22:06:13+08:00
[CATEGORIES]
cs.LG
Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws
[AUTHORS]
Xiyuan Wei, Ming Lin, Fanjiang Ye, Fengguang Song, Liangliang Cao, My T. Thai, Tianbao Yang
[ABSTRACT]
This paper formalizes an emerging learning paradigm that uses a trained model
as a reference to guide and enhance the training of a target model through
strategic data selection or weighting, named $\textbf{model steering}$. While
ad-hoc methods have been used in various contexts, including the training of
large foundation models, its underlying principles remain insufficiently
understood, leading to sub-optimal performance. In this work, we propose a
theory-driven framework for model steering called $\textbf{DRRho risk
minimization}$, which is rooted in Distributionally Robust Optimization (DRO).
Through a generalization analysis, we provide theoretical insights into why
this approach improves generalization and data efficiency compared to training
without a reference model. To the best of our knowledge, this is the first time
such theoretical insights are provided for the new learning paradigm, which
significantly enhance our understanding and practice of model steering.
Building on these insights and the connection between contrastive learning and
DRO, we introduce a novel method for Contrastive Language-Image Pretraining
(CLIP) with a reference model, termed DRRho-CLIP. Extensive experiments
validate the theoretical insights, reveal a superior scaling law compared to
CLIP without a reference model, and demonstrate its strength over existing
heuristic approaches.
[COMMENTS]
18 pages, 6 figures
[LINK]
http://arxiv.org/abs/2505.06699v2
[DATE]
2025-05-13 22:01:05+08:00
[CATEGORIES]
cs.LG
DFA-CON: A Contrastive Learning Approach for Detecting Copyright Infringement in DeepFake Art
[AUTHORS]
Haroon Wahab, Hassan Ugail, Irfan Mehmood
[ABSTRACT]
The recent proliferation of generative AI tools for visual content
creation, particularly in the context of visual artworks, has raised serious
concerns about copyright infringement and forgery. The large-scale datasets
used to train these models often contain a mixture of copyrighted and
non-copyrighted artworks. Given the tendency of generative models to memorize
training patterns, they are susceptible to varying degrees of copyright
violation. Building on the recently proposed DeepfakeArt Challenge benchmark,
this work introduces DFA-CON, a contrastive learning framework designed to
detect copyright-infringing or forged AI-generated art. DFA-CON learns a
discriminative representation space, posing affinity among original artworks
and their forged counterparts within a contrastive learning framework. The
model is trained across multiple attack types, including inpainting, style
transfer, adversarial perturbation, and cutmix. Evaluation results demonstrate
robust detection performance across most attack types, outperforming recent
pretrained foundation models. Code and model checkpoints will be released
publicly upon acceptance.
[LINK]
http://arxiv.org/abs/2505.08552v1
[DATE]
2025-05-13 21:23:52+08:00
[CATEGORIES]
cs.LG
Building-Block Aware Generative Modeling for 3D Crystals of Metal Organic Frameworks
[AUTHORS]
Chenru Duan, Aditya Nandy, Sizhan Liu, Yuanqi Du, Liu He, Yi Qu, Haojun Jia, Jin-Hu Dou
[ABSTRACT]
Metal-organic frameworks (MOFs) marry inorganic nodes, organic edges, and
topological nets into programmable porous crystals, yet their astronomical
design space defies brute-force synthesis. Generative modeling holds ultimate
promise, but existing models either recycle known building blocks or are
restricted to small unit cells. We introduce Building-Block-Aware MOF Diffusion
(BBA MOF Diffusion), an SE(3)-equivariant diffusion model that learns 3D
all-atom representations of individual building blocks, encoding
crystallographic topological nets explicitly. Trained on the CoRE-MOF database,
BBA MOF Diffusion readily samples MOFs with unit cells containing 1000 atoms
with great geometric validity, novelty, and diversity mirroring experimental
databases. Its native building-block representation produces unprecedented
metal nodes and organic edges, expanding accessible chemical space by orders of
magnitude. One high-scoring [Zn(1,4-TDC)(EtOH)2] MOF predicted by the model was
synthesized, where powder X-ray diffraction, thermogravimetric analysis, and N2
sorption confirm its structural fidelity. BBA-Diff thus furnishes a practical
pathway to synthesizable and high-performing MOFs.
[LINK]
http://arxiv.org/abs/2505.08531v1
[DATE]
2025-05-13 21:02:28+08:00
[CATEGORIES]
cs.LG
GradMix: Gradient-based Selective Mixup for Robust Data Augmentation in Class-Incremental Learning
[AUTHORS]
Minsu Kim, Seong-Hyeon Hwang, Steven Euijong Whang
[ABSTRACT]
In the context of continual learning, acquiring new knowledge while
maintaining previous knowledge presents a significant challenge. Existing
methods often use experience replay techniques that store a small portion of
previous task data for training. In experience replay approaches, data
augmentation has emerged as a promising strategy to further improve the model
performance by mixing limited previous task data with sufficient current task
data. However, we theoretically and empirically show that training with
mixed samples from random sample pairs may harm the knowledge of previous tasks
and cause greater catastrophic forgetting. We then propose GradMix, a robust
data augmentation method specifically designed for mitigating catastrophic
forgetting in class-incremental learning. GradMix performs gradient-based
selective mixup using a class-based criterion that mixes only samples from
helpful class pairs and not from detrimental class pairs for reducing
catastrophic forgetting. Our experiments on various real datasets show that
GradMix outperforms data augmentation baselines in accuracy by minimizing the
forgetting of previous knowledge.
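A minimal sketch of class-pair-selective mixup, under stated assumptions: the
abstract does not spell out the gradient-based criterion, so the set of
"helpful" class pairs is passed in here as a hypothetical precomputed input
(`helpful_pairs`). The mixing step itself is ordinary mixup, a convex
combination of two inputs and their one-hot labels.

```python
def mixup(x_a, y_a, x_b, y_b, lam, num_classes):
    """Standard mixup: blend two feature vectors and their one-hot labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [0.0] * num_classes
    y[y_a] += lam
    y[y_b] += 1.0 - lam
    return x, y


def selective_mixup(sample_a, sample_b, helpful_pairs, lam, num_classes):
    """Mix only if the (unordered) class pair is marked as helpful.

    helpful_pairs is a hypothetical set of frozensets of class indices;
    in GradMix it would come from a gradient-based criterion that is not
    reproduced here. Returns None when the pair is skipped.
    """
    (x_a, y_a), (x_b, y_b) = sample_a, sample_b
    if frozenset((y_a, y_b)) not in helpful_pairs:
        return None
    return mixup(x_a, y_a, x_b, y_b, lam, num_classes)
```

A caller would typically draw `lam` from a Beta distribution and fall back to
the unmixed samples whenever `selective_mixup` returns None.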
[LINK]
http://arxiv.org/abs/2505.08528v1
[DATE]
2025-05-13 21:01:38+08:00
[CATEGORIES]
cs.LG
Physics-informed neural networks viewpoint for solving the Dyson-Schwinger equations of quantum electrodynamics
[AUTHORS]
Rodrigo Carmo Terin
[ABSTRACT]
Physics-informed neural networks (PINNs) are employed to solve the
Dyson–Schwinger equations of quantum electrodynamics (QED) in Euclidean space,
with a focus on the non-perturbative generation of the fermion’s dynamical mass
function in the Landau gauge. By inserting the integral equation directly into
the loss function, our PINN framework enables a single neural network to learn
a continuous and differentiable representation of the mass function over a
spectrum of momenta. We also benchmark our approach against a traditional
numerical algorithm, highlighting the main differences between them. Our novel
strategy, which is expected to be extended to other quantum field theories, is
the first step towards forefront applications of machine learning in high-level
theoretical physics.
[COMMENTS]
18 pages, 2 figures, 2 tables. The requested changes from SciPost
Physics reviewers have been implemented
[LINK]
http://arxiv.org/abs/2411.02177v3
[DATE]
2025-05-13 21:01:29+08:00
[CATEGORIES]
cs.LG
SPP-SBL: Space-Power Prior Sparse Bayesian Learning for Block Sparse Recovery
[AUTHORS]
Yanhao Zhang, Zhihan Zhu, Yong Xia
[ABSTRACT]
The recovery of block-sparse signals with unknown structural patterns remains
a fundamental challenge in structured sparse signal reconstruction. By
proposing a variance transformation framework, this paper unifies existing
pattern-based block sparse Bayesian learning methods, and introduces a novel
space power prior based on undirected graph models to adaptively capture the
unknown patterns of block-sparse signals. By combining the EM algorithm with
high-order equation root-solving, we develop a new structured sparse Bayesian
learning method, SPP-SBL, which effectively addresses the open problem of space
coupling parameter estimation in pattern-based methods. We further demonstrate
that learning the relative values of space coupling parameters is key to
capturing unknown block-sparse patterns and improving recovery accuracy.
Experiments validate that SPP-SBL successfully recovers various challenging
structured sparse signals (e.g., chain-structured signals and multi-pattern
sparse signals) and real-world multi-modal structured sparse signals (images,
audio), showing significant advantages in recovery accuracy across multiple
metrics.
[COMMENTS]
12 pages, 6 figures, 4 tables
[LINK]
http://arxiv.org/abs/2505.08518v1
[DATE]
2025-05-13 20:49:25+08:00
[CATEGORIES]
cs.LG
Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain
[AUTHORS]
Hyowon Wi, Jeongwhan Choi, Noseong Park
[ABSTRACT]
Transformers have demonstrated remarkable performance across diverse domains.
The key component of Transformers is self-attention, which learns the
relationship between any two tokens in the input sequence. Recent studies have
revealed that the self-attention can be understood as a normalized adjacency
matrix of a graph. Notably, from the perspective of graph signal processing
(GSP), the self-attention can be equivalently defined as a simple graph filter,
applying GSP using the value vector as the signal. However, the self-attention
is a graph filter defined with only the first order of the polynomial matrix,
and acts as a low-pass filter preventing the effective leverage of various
frequency information. Consequently, existing self-attention mechanisms are
designed in a rather simplified manner. Therefore, we propose a novel method,
called Attentive Graph Filter (AGF), interpreting the self-attention as learning
the graph filter in the singular value domain from the perspective of graph
signal processing for directed graphs with the linear complexity w.r.t. the
input length $n$, i.e., $\mathcal{O}(nd^2)$. In our experiments, we demonstrate
that AGF achieves state-of-the-art performance on various tasks, including Long
Range Arena benchmark and time series classification.
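To make the graph-filter reading concrete, the sketch below builds a standard
softmax attention matrix and applies it to the value vectors. Each row is
non-negative and sums to 1, so the matrix acts like a normalized adjacency
matrix, and multiplying by it averages the values (a first-order, low-pass
filter). This is plain quadratic self-attention for illustration, not the
linear-complexity AGF method, and all function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(queries, keys):
    """Row-stochastic matrix softmax(Q K^T / sqrt(d))."""
    d = len(queries[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    return [softmax(row) for row in scores]


def apply_filter(adjacency, values):
    """First-order graph filter: the matrix product A V."""
    dim = len(values[0])
    return [
        [sum(a * v[j] for a, v in zip(row, values)) for j in range(dim)]
        for row in adjacency
    ]
```

Because every output row is a convex combination of the value vectors, the
filtered signal always lies between the smallest and largest input values.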
[COMMENTS]
IJCAI25 Accepted
[LINK]
http://arxiv.org/abs/2505.08516v1
[DATE]
2025-05-13 20:48:04+08:00
[CATEGORIES]
cs.LG
MARCO: A Multi-Agent System for Optimizing HPC Code Generation Using Large Language Models
[AUTHORS]
Asif Rahman, Veljko Cvetkovic, Kathleen Reece, Aidan Walters, Yasir Hassan, Aneesh Tummeti, Bryan Torres, Denise Cooney, Margaret Ellis, Dimitrios S. Nikolopoulos
[ABSTRACT]
Large language models (LLMs) have transformed software development through
code generation capabilities, yet their effectiveness for high-performance
computing (HPC) remains limited. HPC code requires specialized optimizations
for parallelism, memory efficiency, and architecture-specific considerations
that general-purpose LLMs often overlook. We present MARCO (Multi-Agent
Reactive Code Optimizer), a novel framework that enhances LLM-generated code
for HPC through a specialized multi-agent architecture. MARCO employs separate
agents for code generation and performance evaluation, connected by a feedback
loop that progressively refines optimizations. A key innovation is MARCO’s
web-search component that retrieves real-time optimization techniques from
recent conference proceedings and research publications, bridging the knowledge
gap in pre-trained LLMs. Our extensive evaluation on the LeetCode 75 problem
set demonstrates that MARCO achieves a 14.6% average runtime reduction compared
to Claude 3.5 Sonnet alone, while the integration of the web-search component
yields a 30.9% performance improvement over the base MARCO system. These
results highlight the potential of multi-agent systems to address the
specialized requirements of high-performance code generation, offering a
cost-effective alternative to domain-specific model fine-tuning.
[COMMENTS]
9 pages, 4 figures, 2 tables
[LINK]
http://arxiv.org/abs/2505.03906v2
[DATE]
2025-05-13 20:41:18+08:00
[CATEGORIES]
cs.LG
TrialMatchAI: An End-to-End AI-powered Clinical Trial Recommendation System to Streamline Patient-to-Trial Matching
[AUTHORS]
Majd Abdallah, Sigve Nakken, Mariska Bierkens, Johanna Galvis, Alexis Groppi, Slim Karkar, Lana Meiqari, Maria Alexandra Rujano, Steve Canham, Rodrigo Dienstmann, Remond Fijneman, Eivind Hovig, Gerrit Meijer, Macha Nikolski
[ABSTRACT]
Patient recruitment remains a major bottleneck in clinical trials, calling
for scalable and automated solutions. We present TrialMatchAI, an AI-powered
recommendation system that automates patient-to-trial matching by processing
heterogeneous clinical data, including structured records and unstructured
physician notes. Built on fine-tuned, open-source large language models (LLMs)
within a retrieval-augmented generation framework, TrialMatchAI ensures
transparency and reproducibility and maintains a lightweight deployment
footprint suitable for clinical environments. The system normalizes biomedical
entities, retrieves relevant trials using a hybrid search strategy combining
lexical and semantic similarity, re-ranks results, and performs criterion-level
eligibility assessments using medical Chain-of-Thought reasoning. This pipeline
delivers explainable outputs with traceable decision rationales. In real-world
validation, 92 percent of oncology patients had at least one relevant trial
retrieved within the top 20 recommendations. Evaluation across synthetic and
real clinical datasets confirmed state-of-the-art performance, with expert
assessment validating over 90 percent accuracy in criterion-level eligibility
classification, particularly excelling in biomarker-driven matches. Designed
for modularity and privacy, TrialMatchAI supports Phenopackets-standardized
data, enables secure local deployment, and allows seamless replacement of LLM
components as more advanced models emerge. By enhancing efficiency and
interpretability and offering lightweight, open-source deployment, TrialMatchAI
provides a scalable solution for AI-driven clinical trial matching in precision
medicine.
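The hybrid retrieval step described above can be sketched as a weighted sum of
a lexical score and a semantic score. Everything here is an assumption made
for illustration: the abstract does not name the lexical scorer, the embedding
model, or the fusion rule, so this sketch uses token-set Jaccard overlap,
cosine similarity over precomputed embeddings, and a hypothetical mixing
weight `alpha`.

```python
import math


def lexical_score(query, doc):
    """Jaccard overlap of lowercase whitespace tokens (stand-in scorer)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Rank docs, given as (text, embedding) pairs, by the combined score.

    Returns document indices sorted from best to worst match.
    """
    scores = [
        alpha * lexical_score(query, text) + (1.0 - alpha) * cosine(query_vec, emb)
        for text, emb in docs
    ]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In a full pipeline, the top results of such a ranking would then be passed to
a re-ranker and to the criterion-level eligibility step.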
[LINK]
http://arxiv.org/abs/2505.08508v1
[DATE]
2025-05-13 20:39:06+08:00
[CATEGORIES]
cs.LG
InfoPO: On Mutual Information Maximization for Large Language Model Alignment
[AUTHORS]
Teng Xiao, Zhen Ge, Sujay Sanghavi, Tian Wang, Julian Katz-Samuels, Marc Versage, Qingjun Cui, Trishul Chilimbi
[ABSTRACT]
We study the post-training of large language models (LLMs) with human
preference data. Recently, direct preference optimization and its variants have
shown considerable promise in aligning language models, eliminating the need
for reward models and online sampling. Despite these benefits, these methods
rely on explicit assumptions about the Bradley-Terry (BT) model, which makes
them prone to overfitting and results in suboptimal performance, particularly
on reasoning-heavy tasks. To address these challenges, we propose a principled
preference fine-tuning algorithm called InfoPO, which effectively and
efficiently aligns large language models using preference data. InfoPO
eliminates the reliance on the BT model and prevents the likelihood of the
chosen response from decreasing. Extensive experiments confirm that InfoPO
consistently outperforms established baselines on widely used open benchmarks,
particularly in reasoning tasks.
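For context, the Bradley-Terry-based objective that the abstract contrasts
InfoPO with can be sketched as the standard direct preference optimization
(DPO) loss for a single preference pair. This is the baseline, not InfoPO
itself, whose objective is not given in the abstract; argument names and the
default `beta` are illustrative.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy and under a
    frozen reference model. The loss is -log(sigmoid(margin)), i.e. the
    negative Bradley-Terry log-likelihood of preferring the chosen response.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Note that this loss depends only on the difference between the two log-ratios,
so it can decrease even while the likelihood of the chosen response falls,
which is the behavior the abstract says InfoPO is designed to prevent.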
[COMMENTS]
NAACL 2025
[LINK]
http://arxiv.org/abs/2505.08507v1
[DATE]
2025-05-13 20:37:48+08:00
[CATEGORIES]
cs.LG
A new methodology to decompose a parametric domain using reduced order data manifold in machine learning
[AUTHORS]
Chetra Mang, Axel TahmasebiMoradi, Mouadh Yagoubi
[ABSTRACT]
We propose a new methodology for parametric domain decomposition using
iterative principal component analysis. Starting with iterative principal
component analysis, the high-dimensional manifold is reduced to a
lower-dimensional manifold. Moreover, two approaches are developed to reconstruct the
inverse projector to project from the lower data component to the original one.
Afterward, we provide a detailed strategy to decompose the parametric domain
based on the low-dimensional manifold. Finally, numerical examples for a harmonic
transport problem are given to illustrate the efficiency and effectiveness of
the proposed method compared to classical meta-models such as neural
networks.
[LINK]
http://arxiv.org/abs/2505.08497v1
[DATE]
2025-05-13 20:25:16+08:00
[CATEGORIES]
cs.LG
Achieving Scalable Robot Autonomy via neurosymbolic planning using lightweight local LLM
[AUTHORS]
Nicholas Attolino, Alessio Capitanelli, Fulvio Mastrogiovanni
[ABSTRACT]
PDDL-based symbolic task planning remains pivotal for robot autonomy yet
struggles with dynamic human-robot collaboration due to scalability,
re-planning demands, and delayed plan availability. Although a few
neurosymbolic frameworks have previously leveraged LLMs such as GPT-3 to
address these challenges, reliance on closed-source, remote models with limited
context introduced critical constraints: third-party dependency, inconsistent
response times, restricted plan length and complexity, and multi-domain
scalability issues. We present Gideon, a novel framework that enables the
transition to modern, smaller, local LLMs with extended context length. Gideon
integrates a novel problem generator to systematically generate large-scale
datasets of realistic domain-problem-plan tuples for any domain, and adapts
neurosymbolic planning for local LLMs, enabling on-device execution and
extended context for multi-domain support. Preliminary experiments in
single-domain scenarios performed on Qwen-2.5 1.5B and trained on 8k-32k
samples, demonstrate a valid plan percentage of 66.1% (32k model) and show that
the figure can be further scaled through additional data. Multi-domain tests on
16k samples yield an even higher 70.6% planning validity rate, proving
extensibility across domains and signaling that data variety can have a
positive effect on learning efficiency. Although long-horizon planning and
reduced model size make Gideon training much less efficient than baseline
models based on larger LLMs, the results are still significant considering that
the trained model is about 120x smaller than baseline and that significant
advantages can be achieved in inference efficiency, scalability, and
multi-domain adaptability, all critical factors in human-robot collaboration.
Training inefficiency can be mitigated by Gideon’s streamlined data generation
pipeline.
[COMMENTS]
19 pages, 3 figures, 4 tables, accepted at IAS 2025
[LINK]
http://arxiv.org/abs/2505.08492v1
[DATE]
2025-05-13 20:22:38+08:00
[CATEGORIES]
cs.LG
Isolation Forest in Novelty Detection Scenario
[AUTHORS]
Adam Ulrich, Jan Krňávek, Roman Šenkeřík, Zuzana Komínková Oplatková, Radek Vala
[ABSTRACT]
Data mining offers a diverse toolbox for extracting meaningful structures
from complex datasets, with anomaly detection emerging as a critical subfield
particularly in the context of streaming or real-time data. Within anomaly
detection, novelty detection focuses on identifying previously unseen patterns
after training solely on regular data. While classic algorithms such as
One-Class SVM or Local Outlier Factor (LOF) have been widely applied, they
often lack interpretability and scalability. In this work, we explore the
Half-Space Tree (HST) algorithm, originally proposed for streaming anomaly
detection, and propose a novel theoretical modification to adapt it
specifically for novelty detection tasks. Our approach is grounded in the idea
that anomalies, i.e., novelties, tend to appear in the higher leaves of the tree,
which are less frequently visited by regular instances. We analytically
demonstrate the effectiveness of this approach using probabilistic analysis,
expected depth (EXD) calculations, and combinatorial reasoning. A comparative
analysis of expected depths between our modified HST and the original Isolation
Forest highlights that novelty points are significantly more isolated in our
approach. This supports the hypothesis that HSTs, with appropriate structural
adaptation, can serve as interpretable and efficient novelty detectors. The
paper contributes a theoretical foundation and supporting analysis for this
adaptation, setting the stage for further application and experimentation.
[LINK]
http://arxiv.org/abs/2505.08489v1
[DATE]
2025-05-13 20:21:53+08:00
[CATEGORIES]
cs.LG
An adaptive sampling algorithm for data-generation to build a data-manifold for physical problem surrogate modeling
[AUTHORS]
Chetra Mang, Axel TahmasebiMoradi, David Danan, Mouadh Yagoubi
[ABSTRACT]
Physical models classically involve partial differential equations (PDEs) and,
depending on their underlying complexity and the level of accuracy required,
are known to be computationally expensive to solve numerically. Thus, an
idea would be to create a surrogate model relying on data generated by such a
solver. However, training such a model on imbalanced data has been shown to
be a very difficult task. Indeed, if the distribution of input leads to a poor
response manifold representation, the model may not learn well and
consequently, it may not predict the outcome with acceptable accuracy. In this
work, we present an Adaptive Sampling Algorithm for Data Generation (ASADG)
involving a physical model. As the initial input data may not accurately
represent the response manifold in higher dimension, this algorithm iteratively
adds input data into it. At each step the barycenter of each simplicial
complex, that the manifold is discretized into, is added as new input data, if
a certain threshold is satisfied. We demonstrate the efficiency of the data
sampling algorithm in comparison with the LHS method for generating more
representative input data. To do so, we focus on the construction of a harmonic
transport problem metamodel by generating data through a classical solver. By
using such an algorithm, it is possible to generate the same number of input data
as LHS while providing a better representation of the response manifold.
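The barycenter refinement described above can be illustrated in one dimension,
where each simplex is an interval and its barycenter is the midpoint. The
acceptance rule below (refine an interval when the response changes by more
than `threshold` across it) is an assumption, since the abstract only says
that "a certain threshold" must be satisfied; the function name and the
iteration cap are likewise illustrative.

```python
def refine_1d(xs, f, threshold, max_iters=10):
    """Iteratively add interval midpoints where the response varies strongly.

    xs: initial input samples; f: the (expensive) solver, called as f(x).
    Assumed criterion: refine [a, b] when |f(b) - f(a)| > threshold.
    Returns the sorted, enriched list of input samples.
    """
    xs = sorted(xs)
    for _ in range(max_iters):
        new_points = [
            (a + b) / 2.0
            for a, b in zip(xs, xs[1:])
            if abs(f(b) - f(a)) > threshold
        ]
        if not new_points:
            break  # every interval already resolves the response well enough
        xs = sorted(xs + new_points)
    return xs
```

For the response f(x) = x^2 on the samples 0, 1, 2 with a threshold of 1.5,
the refinement adds points only on the steep side, at 1.5 and then 1.75. A
practical version would cache solver calls instead of re-evaluating `f`.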
[LINK]
http://arxiv.org/abs/2505.08487v1
[DATE]
2025-05-13 20:17:10+08:00
[CATEGORIES]
cs.LG
Unlocking Location Intelligence: A Survey from Deep Learning to The LLM Era
[AUTHORS]
Xixuan Hao, Yutian Jiang, Xingchen Zou, Jiabo Liu, Yifang Yin, Yuxuan Liang
[ABSTRACT]
Location Intelligence (LI), the science of transforming location-centric
geospatial data into actionable knowledge, has become a cornerstone of modern
spatial decision-making. The rapid evolution of Geospatial Representation
Learning is fundamentally reshaping LI development through two successive
technological revolutions: the deep learning breakthrough and the emerging
large language model (LLM) paradigm. While deep neural networks (DNNs) have
demonstrated remarkable success in automated feature extraction from structured
geospatial data (e.g., satellite imagery, GPS trajectories), the recent
integration of LLMs introduces transformative capabilities for cross-modal
geospatial reasoning and unstructured geo-textual data processing. This survey
presents a comprehensive review of geospatial representation learning across
both technological eras, organizing them into a structured taxonomy based on
the complete pipeline comprising: (1) data perspective, (2) methodological
perspective and (3) application perspective. We also highlight current
advancements, discuss existing limitations, and propose potential future
research directions in the LLM era. This work offers a thorough exploration of
the field and provides a roadmap for further innovation in LI. The summary of
the up-to-date paper list can be found in
https://github.com/CityMind-Lab/Awesome-Location-Intelligence and will undergo
continuous updates.
[LINK]
http://arxiv.org/abs/2505.09651v1
[DATE]
2025-05-13 20:16:26+08:00
[CATEGORIES]
cs.LG
TUM2TWIN: Introducing the Large-Scale Multimodal Urban Digital Twin Benchmark Dataset
[AUTHORS]
Olaf Wysocki, Benedikt Schwab, Manoj Kumar Biswanath, Michael Greza, Qilin Zhang, Jingwei Zhu, Thomas Froech, Medhini Heeramaglore, Ihab Hijazi, Khaoula Kanna, Mathias Pechinger, Zhaiyu Chen, Yao Sun, Alejandro Rueda Segura, Ziyang Xu, Omar AbdelGafar, Mansour Mehranfar, Chandan Yeshwanth, Yueh-Cheng Liu, Hadi Yazdi, Jiapan Wang, Stefan Auer, Katharina Anders, Klaus Bogenberger, Andre Borrmann, Angela Dai, Ludwig Hoegner, Christoph Holst, Thomas H. Kolbe, Ferdinand Ludwig, Matthias Nießner, Frank Petzold, Xiao Xiang Zhu, Boris Jutzi
[ABSTRACT]
Urban Digital Twins (UDTs) have become essential for managing cities and
integrating complex, heterogeneous data from diverse sources. Creating UDTs
involves challenges at multiple process stages, including acquiring accurate 3D
source data, reconstructing high-fidelity 3D models, maintaining models’
updates, and ensuring seamless interoperability to downstream tasks. Current
datasets are usually limited to one part of the processing chain, hampering
comprehensive UDT validation. To address these challenges, we introduce the
first comprehensive multimodal Urban Digital Twin benchmark dataset: TUM2TWIN.
This dataset includes georeferenced, semantically aligned 3D models and
networks along with various terrestrial, mobile, aerial, and satellite
observations boasting 32 data subsets over roughly 100,000 $m^2$ and currently
767 GB of data. By ensuring georeferenced indoor-outdoor acquisition, high
accuracy, and multimodal data integration, the benchmark supports robust
analysis of sensors and the development of advanced reconstruction methods.
Additionally, we explore downstream tasks demonstrating the potential of
TUM2TWIN, including novel view synthesis of NeRF and Gaussian Splatting, solar
potential analysis, point cloud semantic segmentation, and LoD3 building
reconstruction. We are convinced this contribution lays a foundation for
overcoming current limitations in UDT creation, fostering new research
directions and practical solutions for smarter, data-driven urban environments.
The project is available under: https://tum2t.win
[COMMENTS]
Submitted to the ISPRS Journal of Photogrammetry and Remote Sensing
[LINK]
http://arxiv.org/abs/2505.07396v2
[DATE]
2025-05-13 20:12:36+08:00
[CATEGORIES]
cs.LG
Transforming Hyperspectral Images Into Chemical Maps: An End-to-End Deep Learning Approach
[AUTHORS]
Ole-Christian Galbo Engstrøm, Michela Albano-Gaglio, Erik Schou Dreier, Yamine Bouzembrak, Maria Font-i-Furnols, Puneet Mishra, Kim Steenstrup Pedersen
[ABSTRACT]
Current approaches to chemical map generation from hyperspectral images are
based on models such as partial least squares (PLS) regression, generating
pixel-wise predictions that do not consider spatial context and suffer from a
high degree of noise. This study proposes an end-to-end deep learning approach
using a modified version of U-Net and a custom loss function to directly obtain
chemical maps from hyperspectral images, skipping all intermediate steps
required for traditional pixel-wise analysis. We compare the U-Net with the
traditional PLS regression on a real dataset of pork belly samples with
associated mean fat reference values. The U-Net obtains a test set root mean
squared error between 9% and 13% lower than that of PLS regression on the
task of mean fat prediction. At the same time, U-Net generates fine detail
chemical maps where 99.91% of the variance is spatially correlated. Conversely,
only 2.53% of the variance in the PLS-generated chemical maps is spatially
correlated, indicating that each pixel-wise prediction is largely independent
of neighboring pixels. Additionally, while the PLS-generated chemical maps
contain predictions far beyond the physically possible range of 0-100%, U-Net
learns to stay inside this range. Thus, the findings of this study indicate
that U-Net is superior to PLS for chemical map generation.
[LINK]
http://arxiv.org/abs/2504.14131v3
[DATE]
2025-05-13 20:06:22+08:00
[CATEGORIES]
cs.LG
Parameter Estimation using Reinforcement Learning Causal Curiosity: Limits and Challenges
[AUTHORS]
Miguel Arana-Catania, Weisi Guo
[ABSTRACT]
Causal understanding is important in many disciplines of science and
engineering, where we seek to understand how different factors in the system
causally affect an experiment or situation and pave a pathway towards creating
effective or optimising existing models. Examples of use cases are autonomous
exploration and modelling of unknown environments or assessing key variables in
optimising large complex systems. In this paper, we analyse a Reinforcement
Learning approach called Causal Curiosity, which aims to estimate as accurately
and efficiently as possible, without directly measuring them, the value of
factors that causally determine the dynamics of a system. Whilst the idea
presents a pathway forward, measurement accuracy is the foundation of
methodology effectiveness. Focusing on the current causal curiosity’s robotic
manipulator, we present for the first time a measurement accuracy analysis of
the future potentials and current limitations of this technique and an analysis
of its sensitivity and confounding factor disentanglement capability - crucial
for causal analysis. As a result of our work, we promote proposals for an
improved and efficient design of Causal Curiosity methods to be applied to
real-world complex scenarios.
[COMMENTS]
24 pages, 10 figures, 9 tables
[LINK]
http://arxiv.org/abs/2505.08453v1
[DATE]
2025-05-13 19:30:51+08:00
[CATEGORIES]
cs.LG
Trade-off between Gradient Measurement Efficiency and Expressivity in Deep Quantum Neural Networks
[AUTHORS]
Koki Chinzei, Shinichiro Yamano, Quoc Hoan Tran, Yasuhiro Endo, Hirotaka Oshima
[ABSTRACT]
Quantum neural networks (QNNs) require an efficient training algorithm to
achieve practical quantum advantages. A promising approach is gradient-based
optimization, where gradients are estimated by quantum measurements. However,
QNNs currently lack general quantum algorithms for efficiently measuring
gradients, which limits their scalability. To elucidate the fundamental limits
and potentials of efficient gradient estimation, we rigorously prove a
trade-off between gradient measurement efficiency (the mean number of
simultaneously measurable gradient components) and expressivity in deep QNNs.
This trade-off indicates that more expressive QNNs require higher measurement
costs per parameter for gradient estimation, while reducing QNN expressivity to
suit a given task can increase gradient measurement efficiency. We further
propose a general QNN ansatz called the stabilizer-logical product ansatz
(SLPA), which achieves the trade-off upper bound by exploiting the symmetric
structure of the quantum circuit. Numerical experiments show that the SLPA
drastically reduces the sample complexity needed for training while maintaining
accuracy and trainability compared to well-designed circuits based on the
parameter-shift method.
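
As a point of reference for the parameter-shift baseline mentioned above, the rule estimates one gradient component from two shifted circuit evaluations. The sketch below is illustrative only: it replaces the quantum circuit with a toy closed-form expectation (the value of Z after an RY rotation on the zero state, which is cos(theta)), and the function names are assumptions, not taken from the paper. It does not implement the SLPA.

```python
import math

def expectation(theta: float) -> float:
    """Toy stand-in for a circuit: <Z> after RY(theta) on |0> equals cos(theta)."""
    return math.cos(theta)

def parameter_shift_gradient(f, theta: float) -> float:
    """Gradient from two shifted evaluations (exact for Pauli-generated rotations)."""
    shift = math.pi / 2
    return 0.5 * (f(theta + shift) - f(theta - shift))

theta = 0.3
grad = parameter_shift_gradient(expectation, theta)  # analytically -sin(theta)
```

Each parameter needs its own pair of evaluations under this rule, which is the per-parameter measurement cost the trade-off in the abstract refers to.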
[COMMENTS]
32 pages, 11 figures
[LINK]
http://arxiv.org/abs/2406.18316v3
[DATE]
2025-05-13 19:02:13+08:00
[CATEGORIES]
cs.LG
A primal-dual perspective for distributed TD-learning
[AUTHORS]
Han-Dong Lim, Donghwan Lee
[ABSTRACT]
The goal of this paper is to investigate distributed temporal difference (TD)
learning for a networked multi-agent Markov decision process. The proposed
approach is based on distributed optimization algorithms, which can be
interpreted as primal-dual ordinary differential equation (ODE) dynamics
subject to null-space constraints. Based on the exponential convergence
behavior of the primal-dual ODE dynamics subject to null-space constraints, we
examine the behavior of the final iterate in various distributed TD-learning
scenarios, considering both constant and diminishing step-sizes and
incorporating both i.i.d. and Markovian observation models. Unlike existing
methods, the proposed algorithm does not require the assumption that the
underlying communication network structure is characterized by a doubly
stochastic matrix.
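
For orientation, the single-agent TD(0) update that distributed variants build on can be sketched as follows, assuming linear function approximation. This is the standard textbook update, not the primal-dual algorithm proposed in the paper, and all names are illustrative.

```python
def td0_update(theta, phi_s, phi_next, reward, gamma, step_size):
    """One TD(0) step for a linear value estimate V(s) = phi(s) . theta."""
    value = sum(p * w for p, w in zip(phi_s, theta))
    next_value = sum(p * w for p, w in zip(phi_next, theta))
    # Temporal-difference error: bootstrapped target minus current estimate.
    td_error = reward + gamma * next_value - value
    return [w + step_size * td_error * p for w, p in zip(theta, phi_s)]

theta = td0_update([0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   reward=1.0, gamma=0.9, step_size=0.5)
```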
[COMMENTS]
To appear in IJCAI2025
[LINK]
http://arxiv.org/abs/2310.00638v3
[DATE]
2025-05-13 18:50:51+08:00
[CATEGORIES]
cs.LG
Transfer Learning of Surrogate Models: Integrating Domain Warping and Affine Transformations
[AUTHORS]
Shuaiqun Pan, Diederick Vermetten, Manuel López-Ibáñez, Thomas Bäck, Hao Wang
[ABSTRACT]
Surrogate models provide efficient alternatives to computationally demanding
real world processes but often require large datasets for effective training. A
promising solution to this limitation is the transfer of pre-trained surrogate
models to new tasks. Previous studies have investigated the transfer of
differentiable and non-differentiable surrogate models, typically assuming an
affine transformation between the source and target functions. This paper
extends previous research by addressing a broader range of transformations,
including linear and nonlinear variations. Specifically, we consider the
combination of an unknown input warping, such as one modeled by the beta
cumulative distribution function, with an unspecified affine transformation.
Our approach achieves transfer learning by employing a limited number of data
points from the target task to optimize these transformations, minimizing
empirical loss on the transfer dataset. We validate the proposed method on the
widely used Black-Box Optimization Benchmark (BBOB) testbed and a real-world
transfer learning task from the automobile industry. The results underscore the
significant advantages of the approach, revealing that the transferred
surrogate significantly outperforms both the original surrogate and the one
built from scratch using the transfer dataset, particularly in data-scarce
scenarios.
[LINK]
http://arxiv.org/abs/2501.18344v2
[DATE]
2025-05-13 18:49:36+08:00
[CATEGORIES]
cs.LG
Genus expansion for non-linear random matrix ensembles with applications to neural networks
[AUTHORS]
Nicola Muca Cirone, Jad Hamdan, Cristopher Salvi
[ABSTRACT]
We present a unified approach to studying certain non-linear random matrix
ensembles and associated random neural networks at initialization. This begins
with a novel series expansion for neural networks which generalizes Faà di
Bruno’s formula to an arbitrary number of compositions. The role of monomials
is played by random multilinear maps indexed by directed graphs, whose edges
correspond to random matrices. Crucially, this expansion linearizes the effect
of the activation functions, allowing for the direct application of Wick’s
principle and the genus expansion technique. As an application, we prove
several results about neural networks with random weights. We first give a new
proof of the fact that they converge to Gaussian processes as their width tends
to infinity. Secondly, we quantify the rate of convergence of the Neural
Tangent Kernel to its deterministic limit in Frobenius norm. Finally, we
compute the moments of the limiting spectral distribution of the Jacobian (only
the first two of which were previously known), expressing them as sums over
non-crossing partitions. All of these results are then generalised to the case
of neural networks with sparse and non-Gaussian weights, under moment
assumptions.
[COMMENTS]
63 pages. v5: Previous versions contained non-trivial errors and had
overlooked important references. This version addresses this and includes
substantial changes in the exposition
[LINK]
http://arxiv.org/abs/2407.08459v5
[DATE]
2025-05-13 18:37:08+08:00
[CATEGORIES]
cs.LG
Understanding molecular ratios in the carbon and oxygen poor outer Milky Way with interpretable machine learning
[AUTHORS]
Gijs Vermariën, Serena Viti, Johannes Heyl, Francesco Fontani
[ABSTRACT]
Context. The outer Milky Way has a lower metallicity than our solar
neighbourhood, but still many molecules are detected in the region. Molecular
line ratios can serve as probes to better understand the chemistry and physics
in these regions. Aims. We use interpretable machine learning to study 9
different molecular ratios, helping us understand the forward connection
between the physics of these environments and the carbon and oxygen
chemistries. Methods. Using a large grid of astrochemical models generated
using UCLCHEM, we study the properties of molecular clouds of low oxygen and
carbon initial abundance. We first try to understand the line ratios using a
classical analysis. We then move on to using interpretable machine learning,
namely Shapley Additive Explanations (SHAP), to understand the higher order
dependencies of the ratios over the entire parameter grid. Lastly we use the
Uniform Manifold Approximation and Projection technique (UMAP) as a reduction
method to create intuitive groupings of models. Results. We find that the
parameter space is well covered by the line ratios, allowing us to investigate
all input parameters. SHAP analysis shows that the temperature and density are
the most important features, but the carbon and oxygen abundances are important
in parts of the parameter space. Lastly, we find that we can group different
types of ratios using UMAP. Conclusions. We show the chosen ratios are mostly
sensitive to changes in the carbon initial abundance, together with the
temperature and density. In particular, the CN/HCN and HNC/HCN ratios are shown to
be sensitive to the initial carbon abundance, making them excellent probes for
this parameter. Out of the ratios, only CS/SO shows a sensitivity to the oxygen
abundance.
[COMMENTS]
Accepted for publication in A&A Sect. 6. Interstellar and
circumstellar matter
[LINK]
http://arxiv.org/abs/2505.08410v1
[DATE]
2025-05-13 18:08:37+08:00
[CATEGORIES]
cs.LG
Calibrated and Efficient Sampling-Free Confidence Estimation for LiDAR Scene Semantic Segmentation
[AUTHORS]
Hanieh Shojaei Miandashti, Qianqian Zou, Claus Brenner
[ABSTRACT]
Reliable deep learning models require not only accurate predictions but also
well-calibrated confidence estimates to ensure dependable uncertainty
estimation. This is crucial in safety-critical applications like autonomous
driving, which depend on rapid and precise semantic segmentation of LiDAR point
clouds for real-time 3D scene understanding. In this work, we introduce a
sampling-free approach for estimating well-calibrated confidence values for
classification tasks, achieving alignment with true classification accuracy and
significantly reducing inference time compared to sampling-based methods. Our
evaluation using the Adaptive Calibration Error (ACE) metric for LiDAR semantic
segmentation shows that our approach maintains well-calibrated confidence
values while achieving increased processing speed compared to a sampling
baseline. Additionally, reliability diagrams reveal that our method produces
underconfident rather than overconfident predictions, an advantage for
safety-critical applications. Our sampling-free approach offers well-calibrated
and time-efficient predictions for LiDAR scene semantic segmentation.
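
To make the calibration metric concrete, here is a minimal sketch of a simplified, top-label variant of the Adaptive Calibration Error: predictions are sorted by confidence, split into equal-count bins, and the absolute gap between accuracy and mean confidence is averaged over bins. The function name and the toy inputs are assumptions for illustration; the paper's exact evaluation protocol may differ (for example, by averaging over classes).

```python
def adaptive_calibration_error(confidences, correct, num_bins=5):
    """Simplified top-label ACE: equal-count bins, mean |accuracy - confidence|."""
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    gaps = []
    for b in range(num_bins):
        # Equal-count bin boundaries over the confidence-sorted predictions.
        chunk = pairs[b * n // num_bins:(b + 1) * n // num_bins]
        if not chunk:
            continue
        mean_conf = sum(c for c, _ in chunk) / len(chunk)
        accuracy = sum(1 for _, ok in chunk if ok) / len(chunk)
        gaps.append(abs(accuracy - mean_conf))
    return sum(gaps) / len(gaps)

ace = adaptive_calibration_error([0.6, 0.6, 0.9, 0.9],
                                 [True, False, True, True], num_bins=2)
```

The sketch assumes a non-empty list of predictions.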
[LINK]
http://arxiv.org/abs/2411.11935v2
[DATE]
2025-05-13 18:07:04+08:00
[CATEGORIES]
cs.LG
ConDiSim: Conditional Diffusion Models for Simulation Based Inference
[AUTHORS]
Mayank Nautiyal, Andreas Hellander, Prashant Singh
[ABSTRACT]
We present a conditional diffusion model, ConDiSim, for simulation-based
inference of complex systems with intractable likelihoods. ConDiSim leverages
denoising diffusion probabilistic models to approximate posterior
distributions, consisting of a forward process that adds Gaussian noise to
parameters, and a reverse process learning to denoise, conditioned on observed
data. This approach effectively captures complex dependencies and
multi-modalities within posteriors. ConDiSim is evaluated across ten benchmark
problems and two real-world test problems, where it demonstrates effective
posterior approximation accuracy while maintaining computational efficiency and
stability in model training. ConDiSim offers a robust and extensible framework
for simulation-based inference, particularly suitable for parameter inference
workflows requiring fast inference methods.
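
The forward process described above can be sketched with the standard closed form of denoising diffusion probabilistic models, in which a noisy parameter vector at step t is a scaled copy of the clean one plus Gaussian noise. The linear noise schedule, step count, and names below are assumptions for illustration and are not taken from the ConDiSim paper; the learned reverse (denoising) network is omitted.

```python
import math
import random

def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) for s = 0..t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod

def forward_noise(theta0, betas, t, rng):
    """Sample theta_t from q(theta_t | theta_0) in closed form."""
    a_bar = alpha_bar(betas, t)
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
            for x in theta0]

# Hypothetical linear schedule with 1000 steps.
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
rng = random.Random(0)
theta_T = forward_noise([1.5, -0.7], betas, 999, rng)
```

At the final step the signal coefficient is close to zero, so the sample is nearly pure Gaussian noise, which is the starting point for the reverse process.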
[LINK]
http://arxiv.org/abs/2505.08403v1
[DATE]
2025-05-13 17:58:23+08:00
[CATEGORIES]
cs.LG
Quantum Support Vector Regression for Robust Anomaly Detection
[AUTHORS]
Kilian Tscharke, Maximilian Wendlinger, Sebastian Issel, Pascal Debus
[ABSTRACT]
Anomaly Detection (AD) is critical in data analysis, particularly within the
domain of IT security. In recent years, Machine Learning (ML) algorithms have
emerged as a powerful tool for AD in large-scale data. In this study, we
explore the potential of quantum ML approaches, specifically quantum kernel
methods, for the application to robust AD. We build upon previous work on
Quantum Support Vector Regression (QSVR) for semisupervised AD by conducting a
comprehensive benchmark on IBM quantum hardware using eleven datasets. Our
results demonstrate that QSVR achieves strong classification performance and
even outperforms the noiseless simulation on two of these datasets. Moreover,
we investigate the influence of quantum noise, inevitable in the NISQ era, on
the performance of the QSVR. Our findings reveal that the model exhibits
robustness to depolarizing, phase damping, phase flip, and bit flip noise,
while amplitude damping and miscalibration noise prove to be more disruptive.
Finally, we explore the domain of Quantum Adversarial Machine Learning and
demonstrate that QSVR is highly vulnerable to adversarial attacks and that
noise does not improve the adversarial robustness of the model.
[COMMENTS]
Submitted to IEEE International Conference on Quantum Computing and
Engineering (QCE) 2025
[LINK]
http://arxiv.org/abs/2505.01012v2
[DATE]
2025-05-13 17:54:41+08:00
[CATEGORIES]
cs.LG
Clinically inspired enhance Explainability and Interpretability of an AI-Tool for BCC diagnosis based on expert annotation
[AUTHORS]
Iván Matas, Carmen Serrano, Francisca Silva, Amalia Serrano, Tomás Toledo-Pastrana, Begoña Acha
[ABSTRACT]
An AI tool has been developed to provide interpretable support for the
diagnosis of BCC via teledermatology, thus speeding up referrals and optimizing
resource utilization. The interpretability is provided in two ways: first,
the main BCC dermoscopic patterns are found in the image to justify the
BCC/non-BCC classification. Second, based on the common visual XAI Grad-CAM,
a clinically inspired visual explanation is developed where the relevant
features for diagnosis are located. Since there is no established ground truth
for BCC dermoscopic features, a standard reference is inferred from the
diagnosis of four dermatologists using an Expectation Maximization (EM) based
algorithm. The results demonstrate significant improvements in classification
accuracy and interpretability, positioning this approach as a valuable tool for
early BCC detection and referral to dermatologists. The BCC/non-BCC
classification achieved an accuracy rate of 90%. For Clinically-inspired XAI
results, the detection of BCC patterns useful to clinicians reaches 99%
accuracy. As for the Clinically-inspired Visual XAI results, the mean of the
Grad-CAM normalized value within the manually segmented clinical features is
0.57, while outside this region it is 0.16. This indicates that the model's
attention concentrates on the regions of the BCC patterns. These results
prove the ability of the AI tool to provide a useful explanation.
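
The inside/outside comparison reported above amounts to averaging a normalized heatmap over a binary mask and over its complement. A minimal sketch, with a toy 2x2 heatmap and mask that are purely illustrative and not data from the paper:

```python
def mean_inside_outside(heatmap, mask):
    """Mean normalized heatmap value inside and outside a binary mask."""
    inside, outside = [], []
    for heat_row, mask_row in zip(heatmap, mask):
        for value, flag in zip(heat_row, mask_row):
            (inside if flag else outside).append(value)
    # Assumes both regions contain at least one pixel.
    return sum(inside) / len(inside), sum(outside) / len(outside)

heatmap = [[0.9, 0.7], [0.2, 0.0]]
mask = [[1, 1], [0, 0]]
mean_in, mean_out = mean_inside_outside(heatmap, mask)
```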
[COMMENTS]
8 pages, 4 figures, 4 tables, under review
[LINK]
http://arxiv.org/abs/2407.00104v2
[DATE]
2025-05-13 17:29:47+08:00
[CATEGORIES]
cs.LG
Continuous World Coverage Path Planning for Fixed-Wing UAVs using Deep Reinforcement Learning
[AUTHORS]
Mirco Theile, Andres R. Zapata Rodriguez, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli
[ABSTRACT]
Unmanned Aerial Vehicle (UAV) Coverage Path Planning (CPP) is critical for
applications such as precision agriculture and search and rescue. While
traditional methods rely on discrete grid-based representations, real-world UAV
operations require power-efficient continuous motion planning. We formulate the
UAV CPP problem in a continuous environment, minimizing power consumption while
ensuring complete coverage. Our approach models the environment with
variable-size axis-aligned rectangles and UAV motion with curvature-constrained
Bézier curves. We train a reinforcement learning agent using an
action-mapping-based Soft Actor-Critic (AM-SAC) algorithm employing a
self-adaptive curriculum. Experiments on both procedurally generated and
hand-crafted scenarios demonstrate the effectiveness of our method in learning
energy-efficient coverage strategies.
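
A curvature constraint on a Bézier path can be checked from the curve's first and second derivatives. The sketch below evaluates the signed-magnitude curvature of a planar cubic Bézier curve at a parameter value; the choice of a cubic curve, the function names, and the control points are assumptions for illustration, since the abstract does not specify the curve degree.

```python
def cubic_bezier_curvature(p0, p1, p2, p3, t):
    """Curvature of a planar cubic Bezier curve at parameter t in [0, 1]."""
    # First derivative of the cubic Bernstein form.
    d1 = [3 * (1 - t) ** 2 * (b - a) + 6 * (1 - t) * t * (c - b) + 3 * t ** 2 * (d - c)
          for a, b, c, d in zip(p0, p1, p2, p3)]
    # Second derivative.
    d2 = [6 * (1 - t) * (c - 2 * b + a) + 6 * t * (d - 2 * c + b)
          for a, b, c, d in zip(p0, p1, p2, p3)]
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    speed_sq = d1[0] ** 2 + d1[1] ** 2
    # Assumes a non-zero tangent at t.
    return abs(cross) / speed_sq ** 1.5

k = 0.5523  # common control-point offset for approximating a unit quarter circle
arc = cubic_bezier_curvature((1, 0), (1, k), (k, 1), (0, 1), 0.5)
line = cubic_bezier_curvature((0, 0), (1, 0), (2, 0), (3, 0), 0.5)
```

A planner could reject any segment whose maximum sampled curvature exceeds the inverse of the vehicle's minimum turn radius.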
[COMMENTS]
Submitted to IROS 2025
[LINK]
http://arxiv.org/abs/2505.08382v1
[DATE]
2025-05-13 17:29:16+08:00
[CATEGORIES]
cs.LG
Learning Treatment Allocations with Risk Control Under Partial Identifiability
[AUTHORS]
Sofia Ek, Dave Zachariah
[ABSTRACT]
Learning beneficial treatment allocations for a patient population is an
important problem in precision medicine. Many treatments come with adverse side
effects that are not commensurable with their potential benefits. Patients who
do not receive benefits after such treatments are thereby subjected to
unnecessary harm. This is a ‘treatment risk’ that we aim to control when
learning beneficial allocations. The constrained learning problem is challenged
by the fact that the treatment risk is not in general identifiable using either
randomized trial or observational data. We propose a certifiable learning
method that controls the treatment risk with finite samples in the partially
identified setting. The method is illustrated using both simulated and real
data.
[LINK]
http://arxiv.org/abs/2505.08378v1
[DATE]
2025-05-13 17:22:18+08:00
[CATEGORIES]
cs.LG
Adaptive Diffusion Policy Optimization for Robotic Manipulation
[AUTHORS]
Huiyun Jiang, Zhuang Yang
[ABSTRACT]
Recent studies have shown the great potential of diffusion models in
improving reinforcement learning (RL) by modeling complex policies, expressing
a high degree of multi-modality, and efficiently handling high-dimensional
continuous control tasks. However, there is currently limited research on how
to optimize diffusion-based policies (e.g., Diffusion Policy) quickly and stably.
In this paper, we propose Adam-based Diffusion Policy Optimization (ADPO), a
fast algorithmic framework containing best practices for fine-tuning
diffusion-based policies in robotic control tasks using the adaptive gradient
descent method in RL. Adaptive gradient methods are less studied in RL training,
let alone diffusion-based policies. We confirm that ADPO outperforms other
diffusion-based RL methods in terms of overall effectiveness for fine-tuning on
standard robotic tasks. Concretely, we conduct extensive experiments on
standard robotic control tasks to test ADPO, where, particularly, six popular
diffusion-based RL methods are provided as benchmark methods. Experimental
results show that ADPO achieves performance better than or comparable to the
baseline methods. Finally, we systematically analyze the sensitivity of
multiple hyperparameters in standard robotics tasks, providing guidance for
subsequent practical applications. Our video demonstrations are released in
https://github.com/Timeless-lab/ADPO.git.
[LINK]
http://arxiv.org/abs/2505.08376v1
[DATE]
2025-05-13 17:21:45+08:00
[CATEGORIES]
cs.LG
Semantic Shift Estimation via Dual-Projection and Classifier Reconstruction for Exemplar-Free Class-Incremental Learning
[AUTHORS]
Run He, Di Fang, Yicheng Xu, Yawen Cui, Ming Li, Cen Chen, Ziqian Zeng, Huiping Zhuang
[ABSTRACT]
Exemplar-Free Class-Incremental Learning (EFCIL) aims to sequentially learn
from distinct categories without retaining exemplars but easily suffers from
catastrophic forgetting of learned knowledge. While existing EFCIL methods
leverage knowledge distillation to alleviate forgetting, they still face two
critical challenges: semantic shift and decision bias. Specifically, the
embeddings of old tasks shift in the embedding space after learning new tasks,
and the classifier becomes biased towards new tasks due to training solely with
new data, hindering the balance between old and new knowledge. To address these
issues, we propose the Dual-Projection Shift Estimation and Classifier
Reconstruction (DPCR) approach for EFCIL. DPCR effectively estimates semantic
shift through a dual-projection, which combines a learnable transformation with
a row-space projection to capture both task-wise and category-wise shifts.
Furthermore, to mitigate decision bias, DPCR employs ridge regression to
reformulate a classifier reconstruction process. This reconstruction exploits
previous information encoded in the covariance and prototype of each class after calibration with
estimated shift, thereby reducing decision bias. Extensive experiments
demonstrate that, on various datasets, DPCR effectively balances old and new
tasks, outperforming state-of-the-art EFCIL methods. Our codes are available at
https://github.com/RHe502/ICML25-DPCR.
[COMMENTS]
Accepted by ICML 2025; Camera ready version
[LINK]
http://arxiv.org/abs/2503.05423v3
[DATE]
2025-05-13 17:19:56+08:00
[CATEGORIES]
cs.LG
Density Ratio-based Causal Discovery from Bivariate Continuous-Discrete Data
[AUTHORS]
Takashi Nicholas Maeda, Shohei Shimizu, Hidetoshi Matsui
[ABSTRACT]
This paper proposes a causal discovery method for mixed bivariate data
consisting of one continuous and one discrete variable. Existing
constraint-based approaches are ineffective in the bivariate setting, as they
rely on conditional independence tests that are not suited to bivariate data.
Score-based methods either impose strong distributional assumptions or face
challenges in fairly comparing causal directions between variables of different
types, due to differences in their information content. We introduce a novel
approach that determines causal direction by analyzing the monotonicity of the
conditional density ratio of the continuous variable, conditioned on different
values of the discrete variable. Our theoretical analysis shows that the
conditional density ratio exhibits monotonicity when the continuous variable
causes the discrete variable, but not in the reverse direction. This property
provides a principled basis for comparing causal directions between variables
of different types, free from strong distributional assumptions and bias
arising from differences in their information content. We demonstrate its
effectiveness through experiments on both synthetic and real-world datasets,
showing superior accuracy compared to existing methods.
[LINK]
http://arxiv.org/abs/2505.08371v1
[DATE]
2025-05-13 17:18:41+08:00
[CATEGORIES]
cs.LG
Streamlining Prediction in Bayesian Deep Learning
[AUTHORS]
Rui Li, Marcus Klasson, Arno Solin, Martin Trapp
[ABSTRACT]
The rising interest in Bayesian deep learning (BDL) has led to a plethora of
methods for estimating the posterior distribution. However, efficient
computation of inferences, such as predictions, has been largely overlooked
with Monte Carlo integration remaining the standard. In this work we examine
streamlining prediction in BDL through a single forward pass without sampling.
For this we use local linearisation on activation functions and local Gaussian
approximations at linear layers, allowing us to analytically compute an
approximation to the posterior predictive distribution. We showcase our
approach for both MLPs and transformers, such as ViT and GPT-2, and assess its
performance on regression and classification tasks.
Open-source library: https://github.com/AaltoML/SUQ
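
The idea of sampling-free propagation can be illustrated with a heavily simplified sketch: a Gaussian with diagonal covariance is pushed through a linear layer in closed form, then through a tanh activation using a first-order Taylor expansion around the mean. For brevity the weights here are deterministic and only the input is uncertain, whereas the paper also treats weight posteriors; the function names are assumptions and this is not the SUQ library's API.

```python
import math

def linear_layer(mean, var, weights, bias):
    """Propagate mean and variance through y = W x + b with deterministic W.

    Input dimensions are treated as independent (diagonal covariance).
    """
    out_mean = [sum(w * m for w, m in zip(row, mean)) + b
                for row, b in zip(weights, bias)]
    out_var = [sum(w * w * v for w, v in zip(row, var)) for row in weights]
    return out_mean, out_var

def tanh_linearised(mean, var):
    """First-order Taylor expansion of tanh around the input mean."""
    out_mean = [math.tanh(m) for m in mean]
    # Variance scales with the squared derivative, 1 - tanh(m)^2.
    out_var = [(1.0 - math.tanh(m) ** 2) ** 2 * v for m, v in zip(mean, var)]
    return out_mean, out_var

h_mean, h_var = linear_layer([0.5, -1.0], [0.1, 0.2],
                             [[1.0, 2.0], [0.0, -1.0]], [0.0, 0.5])
a_mean, a_var = tanh_linearised(h_mean, h_var)
```

Stacking these two steps layer by layer yields a predictive mean and variance in a single forward pass, without Monte Carlo samples.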
[LINK]
http://arxiv.org/abs/2411.18425v3
[DATE]
2025-05-13 17:16:34+08:00
[CATEGORIES]
cs.LG
Localization of Impacts on Thin-Walled Structures by Recurrent Neural Networks: End-to-end Learning from Real-World Data
[AUTHORS]
Alexander Humer, Lukas Grasboeck, Ayech Benjeddou
[ABSTRACT]
Today, machine learning is ubiquitous, and structural health monitoring (SHM)
is no exception. Specifically, we address the problem of impact localization on
shell-like structures, where knowledge of impact locations aids in assessing
structural integrity. Impacts on thin-walled structures excite Lamb waves,
which can be measured with piezoelectric sensors. Their dispersive
characteristics make it difficult to detect and localize impacts by
conventional methods. In the present contribution, we explore the localization
of impacts using neural networks. In particular, we propose to use recurrent
neural networks (RNNs) to estimate impact positions end-to-end, i.e., directly
from sequential sensor data. We deal with comparatively long sequences of
thousands of samples, since high sampling rates are needed to accurately capture
elastic waves. For this reason, the proposed approach builds upon Gated
Recurrent Units (GRUs), which are less prone to vanishing gradients as compared
to conventional RNNs. Quality and quantity of data are crucial when training
neural networks. Often, synthetic data is used, which inevitably introduces a
reality gap. Here, by contrast, we train our networks using physical data from
experiments, which requires automation to handle the large number of
experiments needed. For this purpose, a robot is used to drop steel balls
onto an aluminum plate equipped with piezoceramic sensors. Our results show
remarkable accuracy in estimating impact positions, even with a comparatively
small dataset.
[COMMENTS]
XI ECCOMAS Thematic Conference on Smart Structures and Materials
(SMART 2025)
[LINK]
http://arxiv.org/abs/2505.08362v1
[DATE]
2025-05-13 17:08:47+08:00
[CATEGORIES]
cs.LG
DPR: Diffusion Preference-based Reward for Offline Reinforcement Learning
[AUTHORS]
Teng Pang, Bingzheng Wang, Guoqiang Wu, Yilong Yin
[ABSTRACT]
Offline preference-based reinforcement learning (PbRL) mitigates the need for
reward definition, aligning with human preferences via preference-driven reward
feedback without interacting with the environment. However, the effectiveness
of preference-driven reward functions depends on the modeling ability of the
learning model, which current MLP-based and Transformer-based methods may fail
to adequately provide. To alleviate the failure of the reward function caused
by insufficient modeling, we propose a novel preference-based reward
acquisition method: Diffusion Preference-based Reward (DPR). Unlike previous
methods using Bradley-Terry models for trajectory preferences, we use diffusion
models to directly model preference distributions for state-action pairs,
allowing rewards to be discriminatively obtained from these distributions. In
addition, considering that preference data only reveal the
internal relationships of paired trajectories, we further propose Conditional
Diffusion Preference-based Reward (C-DPR), which leverages relative preference
information to enhance the construction of the diffusion model. We apply the
above methods to existing offline reinforcement learning algorithms, and a
series of experimental results demonstrates that the diffusion-based reward
acquisition approach outperforms previous MLP-based and Transformer-based
methods.
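
For context, the Bradley-Terry model used by the earlier methods turns the summed rewards of two trajectories into a preference probability through a logistic function of their difference. A minimal sketch, with illustrative names and toy reward lists that are not from the paper:

```python
import math

def bradley_terry_preference(rewards_a, rewards_b):
    """Probability that trajectory A is preferred over B under Bradley-Terry."""
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable logistic of the return difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

p_ab = bradley_terry_preference([1.0, 0.5], [0.2, 0.3])
p_ba = bradley_terry_preference([0.2, 0.3], [1.0, 0.5])
```

A reward model trained this way is fit by maximizing the likelihood of the observed preference labels; DPR instead models preference distributions over state-action pairs with a diffusion model.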
[LINK]
http://arxiv.org/abs/2503.01143v2
[DATE]
2025-05-13 17:05:27+08:00
[CATEGORIES]
cs.LG
SHAP-based Explanations are Sensitive to Feature Representation
[AUTHORS]
Hyunseung Hwang, Andrew Bell, Joao Fonseca, Venetia Pliatsika, Julia Stoyanovich, Steven Euijong Whang
[ABSTRACT]
Local feature-based explanations are a key component of the XAI toolkit.
These explanations compute feature importance values relative to an
“interpretable” feature representation. In tabular data, feature values
themselves are often considered interpretable. This paper examines the impact
of data engineering choices on local feature-based explanations. We demonstrate
that simple, common data engineering techniques, such as representing age with
a histogram or encoding race in a specific way, can manipulate feature
importance as determined by popular methods like SHAP. Notably, the sensitivity
of explanations to feature representation can be exploited by adversaries to
obscure issues like discrimination. While the intuition behind these results is
straightforward, their systematic exploration has been lacking. Previous work
has focused on adversarial attacks on feature-based explainers by biasing data
or manipulating models. To the best of our knowledge, this is the first study
demonstrating that explainers can be misled by standard, seemingly innocuous
data engineering techniques.
[COMMENTS]
Accepted to ACM FAccT 2025
[LINK]
http://arxiv.org/abs/2505.08345v1
[DATE]
2025-05-13 16:43:09+08:00
[CATEGORIES]
cs.LG
GraphSparseNet: a Novel Method for Large Scale Traffic Flow Prediction
[AUTHORS]
Weiyang Kong, Kaiqi Wu, Sen Zhang, Yubao Liu
[ABSTRACT]
Traffic flow forecasting is a critical spatio-temporal data mining task with
wide-ranging applications in intelligent route planning and dynamic traffic
management. Recent advancements in deep learning, particularly through Graph
Neural Networks (GNNs), have significantly enhanced the accuracy of these
forecasts by capturing complex spatio-temporal dynamics. However, the
scalability of GNNs remains a challenge due to their exponential growth in
model complexity with increasing nodes in the graph. Existing methods to
address this issue, including sparsification, decomposition, and kernel-based
approaches, either do not fully resolve the complexity issue or risk
compromising predictive accuracy. This paper introduces GraphSparseNet (GSNet),
a novel framework designed to improve both the scalability and accuracy of
GNN-based traffic forecasting models. GraphSparseNet is comprised of two core
modules: the Feature Extractor and the Relational Compressor. These modules
operate with linear time and space complexity, thereby reducing the overall
computational complexity of the model to a linear scale. Our extensive
experiments on multiple real-world datasets demonstrate that GraphSparseNet not
only significantly reduces training time by 3.51x compared to state-of-the-art
linear models but also maintains high predictive performance.
[COMMENTS]
Accepted by VLDB 2025
[LINK]
http://arxiv.org/abs/2502.19823v2
[DATE]
2025-05-13 16:38:27+08:00
[CATEGORIES]
cs.LG
Transformer representation learning is necessary for dynamic multi-modal physiological data on small-cohort patients
[AUTHORS]
Bingxu Wang, Min Ge, Kunzhi Cai, Yuqi Zhang, Zeyi Zhou, Wenjiao Li, Yachong Guo, Wei Wang, Qing Zhou
[ABSTRACT]
Postoperative delirium (POD), a severe neuropsychiatric complication
affecting nearly 50% of high-risk surgical patients, is defined as an acute
disorder of attention and cognition. It remains significantly underdiagnosed in
intensive care units (ICUs) due to subjective monitoring methods. Early and
accurate diagnosis of POD is critical and achievable. Here, we propose a POD
prediction framework comprising a Transformer representation model followed by
traditional machine learning algorithms. Our approach utilizes multi-modal
physiological data, including amplitude-integrated electroencephalography
(aEEG), vital signs, electrocardiographic monitor data as well as hemodynamic
parameters. We curated the first multi-modal POD dataset encompassing two
patient types and evaluated various Transformer architectures for
representation learning. Empirical results indicate consistent improvements
in sensitivity and Youden index in patient TYPE I using Transformer
representations, particularly our fusion adaptation of Pathformer. By enabling
effective delirium diagnosis from postoperative day 1 to 3, our extensive
experimental findings emphasize the potential of multi-modal physiological data
and highlight the necessity of representation learning via multi-modal
Transformer architecture in clinical diagnosis.
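The Youden index mentioned above is a standard summary of a binary diagnostic test, defined as sensitivity plus specificity minus one. A minimal sketch of how it could be computed from confusion-matrix counts is shown below; the function name and the example counts are illustrative assumptions, not values from the paper.

```python
def youden_index(tp: int, fn: int, tn: int, fp: int) -> float:
    """Return Youden's J = sensitivity + specificity - 1.

    tp/fn are counts over truly positive cases (e.g. delirium present),
    tn/fp are counts over truly negative cases.
    """
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / positives   # true positive rate
    specificity = tn / negatives   # true negative rate
    return sensitivity + specificity - 1.0


# Hypothetical counts, for illustration only.
example_j = youden_index(tp=8, fn=2, tn=9, fp=1)
```

A value of 0 corresponds to a test that is no better than chance, and 1 to a test with no false positives or false negatives.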
[LINK]
http://arxiv.org/abs/2504.04120v3
[DATE]
2025-05-13 16:22:50+08:00
[CATEGORIES]
cs.LG
UVTM: Universal Vehicle Trajectory Modeling with ST Feature Domain Generation
[AUTHORS]
Yan Lin, Jilin Hu, Shengnan Guo, Bin Yang, Christian S. Jensen, Youfang Lin, Huaiyu Wan
[ABSTRACT]
Vehicle movement is frequently captured in the form of GPS trajectories,
i.e., sequences of timestamped GPS locations. Such data is widely used for
various tasks such as travel-time estimation, trajectory recovery, and
trajectory prediction. A universal vehicle trajectory model could be applied to
different tasks, removing the need to maintain multiple specialized models,
thereby reducing computational and storage costs. However, creating such a
model is challenging when the integrity of trajectory features is compromised,
i.e., in scenarios where only partial features are available or the
trajectories are sparse.
To address these challenges, we propose the Universal Vehicle Trajectory
Model (UVTM), which can effectively adapt to different tasks without excessive
retraining. UVTM incorporates two specialized designs. First, it divides
trajectory features into three distinct domains. Each domain can be masked and
generated independently to accommodate tasks with only partially available
features. Second, UVTM is pre-trained by reconstructing dense, feature-complete
trajectories from sparse, feature-incomplete counterparts, enabling strong
performance even when the integrity of trajectory features is compromised.
Experiments involving four representative trajectory-related tasks on three
real-world vehicle trajectory datasets provide insight into the performance of
UVTM and offer evidence that it is capable of meeting its objectives.
[LINK]
http://arxiv.org/abs/2402.07232v4
[DATE]
2025-05-13 16:16:03+08:00
[CATEGORIES]
cs.LG
Structural-Temporal Coupling Anomaly Detection with Dynamic Graph Transformer
[AUTHORS]
Chang Zong, Yueting Zhuang, Jian Shao, Weiming Lu
[ABSTRACT]
Detecting anomalous edges in dynamic graphs is an important task in many
applications over evolving triple-based data, such as social networks,
transaction management, and epidemiology. A major challenge with this task is
the absence of structural-temporal coupling information, which decreases the
ability of the representation to distinguish anomalies from normal instances.
Existing methods focus on handling independent structural and temporal features
with embedding models, which ignore the deep interaction between these two
types of information. In this paper, we propose a structural-temporal coupling
anomaly detection architecture with a dynamic graph transformer model.
Specifically, we introduce structural and temporal features from two
integration levels to provide anomaly-aware graph evolutionary patterns. Then,
a dynamic graph transformer enhanced by two-dimensional positional encoding is
implemented to capture both discrimination and contextual consistency signals.
Extensive experiments on six datasets demonstrate that our method outperforms
current state-of-the-art models. Finally, a case study illustrates the strength
of our method when applied to a real-world task.
[COMMENTS]
20 pages, 6 figures
[LINK]
http://arxiv.org/abs/2505.08330v1
[DATE]
2025-05-13 16:10:41+08:00
[CATEGORIES]
cs.LG
Low-Complexity Inference in Continual Learning via Compressed Knowledge Transfer
[AUTHORS]
Zhenrong Liu, Janne M. J. Huttunen, Mikko Honkala
[ABSTRACT]
Continual learning (CL) aims to train models that can learn a sequence of
tasks without forgetting previously acquired knowledge. A core challenge in CL
is balancing stability – preserving performance on old tasks – and plasticity
– adapting to new ones. Recently, large pre-trained models have been widely
adopted in CL for their ability to support both, offering strong generalization
for new tasks and resilience against forgetting. However, their high
computational cost at inference time limits their practicality in real-world
applications, especially those requiring low latency or energy efficiency. To
address this issue, we explore model compression techniques, including pruning
and knowledge distillation (KD), and propose two efficient frameworks tailored
for class-incremental learning (CIL), a challenging CL setting where task
identities are unavailable during inference. The pruning-based framework
includes pre- and post-pruning strategies that apply compression at different
training stages. The KD-based framework adopts a teacher-student architecture,
where a large pre-trained teacher transfers downstream-relevant knowledge to a
compact student. Extensive experiments on multiple CIL benchmarks demonstrate
that the proposed frameworks achieve a better trade-off between accuracy and
inference complexity, consistently outperforming strong baselines. We further
analyze the trade-offs between the two frameworks in terms of accuracy and
efficiency, offering insights into their use across different scenarios.
[LINK]
http://arxiv.org/abs/2505.08327v1
[DATE]
2025-05-13 16:07:40+08:00
[CATEGORIES]
cs.LG
A Finite Sample Analysis of Distributional TD Learning with Linear Function Approximation
[AUTHORS]
Yang Peng, Kaicheng Jin, Liangyu Zhang, Zhihua Zhang
[ABSTRACT]
In this paper, we study the finite-sample statistical rates of distributional
temporal difference (TD) learning with linear function approximation. The aim
of distributional TD learning is to estimate the return distribution of a
discounted Markov decision process for a given policy $\pi$. Previous works on
statistical analysis of distributional TD learning mainly focus on the tabular
case. In contrast, we first consider the linear function approximation setting
and derive sharp finite-sample rates. Our theoretical results demonstrate that
the sample complexity of linear distributional TD learning matches that of
classic linear TD learning. This implies that, with linear function
approximation, learning the full distribution of the return from streaming data
is no more difficult than learning its expectation (value function). To derive
tight sample complexity bounds, we conduct a fine-grained analysis of the
linear-categorical Bellman equation and employ the exponential stability
arguments for products of random matrices. Our results provide new insights
into the statistical efficiency of distributional reinforcement learning
algorithms.
[LINK]
http://arxiv.org/abs/2502.14172v2
[DATE]
2025-05-13 16:03:07+08:00
[CATEGORIES]
cs.LG
The Odyssey of the Fittest: Can Agents Survive and Still Be Good?
[AUTHORS]
Dylan Waldner, Risto Miikkulainen
[ABSTRACT]
As AI models grow in power and generality, understanding how agents learn and
make decisions in complex environments is critical to promoting ethical
behavior. This study introduces the Odyssey, a lightweight, adaptive text-based
adventure game that provides a scalable framework for exploring AI ethics and
safety. The Odyssey examines the ethical implications of implementing
biological drives, specifically self-preservation, in three different
agents: a Bayesian agent optimized with NEAT, a Bayesian agent optimized with
stochastic variational inference, and a GPT-4o agent. The agents select actions
in each scenario to survive, adapting to increasingly challenging scenarios.
Post-simulation analysis evaluates the ethical scores of the agents' decisions,
uncovering the tradeoffs they navigate to survive. Specifically, the analysis finds
that when danger increases, agents' ethical behavior becomes unpredictable.
Surprisingly, the GPT-4o agent outperformed the Bayesian models in both
survival and ethical consistency, challenging assumptions about traditional
probabilistic methods and raising a new challenge to understand the mechanisms
of LLMs’ probabilistic reasoning.
[COMMENTS]
Accepted to CogSci 2025
[LINK]
http://arxiv.org/abs/2502.05442v2
[DATE]
2025-05-13 16:00:22+08:00
[CATEGORIES]
cs.LG
Early Detection of Forest Calamities in Homogeneous Stands – Deep Learning Applied to Bark-Beetle Outbreaks
[AUTHORS]
Maximilian Kirsch, Jakob Wernicke, Pawan Datta, Christine Preisach
[ABSTRACT]
Climate change has increased the vulnerability of forests to insect-related
damage, resulting in widespread forest loss in Central Europe and highlighting
the need for effective, continuous monitoring systems. Remote-sensing-based
forest health monitoring often relies on supervised machine learning
algorithms that require labeled training data. Monitoring temporal patterns
through time series analysis offers a potential alternative for earlier
detection of disturbance but requires substantial storage resources. This study
investigates the potential of a Deep Learning algorithm based on a Long Short
Term Memory (LSTM) Autoencoder for the detection of anomalies in forest health
(e.g. bark beetle outbreaks), utilizing Sentinel-2 time series data. This
approach is an alternative to supervised machine learning methods, avoiding the
necessity for labeled training data. Furthermore, it is more memory-efficient
than other time series analysis approaches, as a robust model can be created
using only a 26-week-long time series as input. In this study, we monitored
pure stands of spruce in Thuringia, Germany, over a 7-year period from 2018 to
the end of 2024. Our best model achieved a detection accuracy of 87% on test
data and was able to detect 61% of all anomalies at a very early stage (more
than a month before visible signs of forest degradation). Compared to BFAST
(Breaks For Additive Season and Trend), another widely used time series break
detection algorithm, our approach consistently detected a higher percentage of
anomalies at an earlier stage. These findings suggest that LSTM-based
Autoencoders could provide a promising, resource-efficient approach to forest
health monitoring, enabling more timely responses to emerging threats.
[COMMENTS]
24 pages, 18 figures, submitted to IEEE: Journal of Selected Topics
in Applied Earth Observations and Remote Sensing
[LINK]
http://arxiv.org/abs/2503.12883v2
[DATE]
2025-05-13 15:55:00+08:00
[CATEGORIES]
cs.LG
Graph Attention is Not Always Beneficial: A Theoretical Analysis of Graph Attention Mechanisms via Contextual Stochastic Block Models
[AUTHORS]
Zhongtian Ma, Qiaosheng Zhang, Bocheng Zhou, Yexin Zhang, Shuyue Hu, Zhen Wang
[ABSTRACT]
Despite the growing popularity of graph attention mechanisms, their
theoretical understanding remains limited. This paper aims to explore the
conditions under which these mechanisms are effective in node classification
tasks through the lens of Contextual Stochastic Block Models (CSBMs). Our
theoretical analysis reveals that incorporating graph attention mechanisms is
not universally beneficial. Specifically, by appropriately defining
structure noise and feature noise in graphs, we show that graph
attention mechanisms can enhance classification performance when structure
noise exceeds feature noise. Conversely, when feature noise predominates,
simpler graph convolution operations are more effective. Furthermore, we
examine the over-smoothing phenomenon and show that, in the high
signal-to-noise ratio (SNR) regime, graph convolutional networks suffer from
over-smoothing, whereas graph attention mechanisms can effectively resolve this
issue. Building on these insights, we propose a novel multi-layer Graph
Attention Network (GAT) architecture that significantly outperforms
single-layer GATs in achieving perfect node classification in CSBMs,
relaxing the SNR requirement from $\omega(\sqrt{\log n})$ to
$\omega(\sqrt{\log n} / \sqrt[3]{n})$. To our knowledge, this is the first
study to delineate the conditions for perfect node classification using
multi-layer GATs. Our theoretical contributions are corroborated by extensive
experiments on both synthetic and real-world datasets, highlighting the
practical implications of our findings.
[COMMENTS]
Accepted by ICML 2025
[LINK]
http://arxiv.org/abs/2412.15496v3
[DATE]
2025-05-13 15:37:55+08:00
[CATEGORIES]
cs.LG
Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization
[AUTHORS]
Shira Vansover-Hager, Tomer Koren, Roi Livni
[ABSTRACT]
We study the out-of-sample performance of multi-pass stochastic gradient
descent (SGD) in the fundamental stochastic convex optimization (SCO) model.
While one-pass SGD is known to achieve an optimal $\Theta(1/\sqrt{n})$ excess
population loss given a sample of size $n$, much less is understood about the
multi-pass version of the algorithm which is widely used in practice. Somewhat
surprisingly, we show that in the general non-smooth case of SCO, just a few
epochs of SGD can already hurt its out-of-sample performance significantly and
lead to overfitting. In particular, using a step size $\eta =
\Theta(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to
population loss as large as $\Omega(1)$ after just one additional pass. More
generally, we show that the population loss from the second pass onward is of
the order $\Theta(1/(\eta T) + \eta \sqrt{T})$, where $T$ is the total number
of steps. These results reveal a certain phase-transition in the out-of-sample
behavior of SGD after the first epoch, as well as a sharp separation between
the rates of overfitting in the smooth and non-smooth cases of SCO.
Additionally, we extend our results to with-replacement SGD, proving that the
same asymptotic bounds hold after $O(n \log n)$ steps. Finally, we also prove a
lower bound of $\Omega(\eta \sqrt{n})$ on the generalization gap of one-pass
SGD in dimension $d = \widetilde{O}(n)$, improving on recent results of
Koren et al. (2022) and Schliserman et al. (2024).
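As a quick sanity check on the rates quoted above, the arithmetic below plugs the one-pass-optimal step size into the multi-pass rate. It treats the $\Theta(\cdot)$ expression as exact up to constants and assumes exactly two passes ($T = 2n$) with $\eta = 1/\sqrt{n}$; both choices are illustrative and are not taken from the paper.

```latex
% Two passes with the one-pass-optimal step size:
\[
\left.\frac{1}{\eta T} + \eta\sqrt{T}\;\right|_{\eta = 1/\sqrt{n},\; T = 2n}
= \frac{\sqrt{n}}{2n} + \frac{\sqrt{2n}}{\sqrt{n}}
= \frac{1}{2\sqrt{n}} + \sqrt{2}
= \Theta(1).
\]
% Balancing the two terms of the same rate instead:
\[
\frac{1}{\eta T} = \eta\sqrt{T}
\iff \eta = T^{-3/4},
\qquad
\frac{1}{\eta T} + \eta\sqrt{T} = 2\,T^{-1/4}.
\]
```

The $\eta\sqrt{T}$ term dominates and does not shrink with $n$, which is consistent with the stated $\Omega(1)$ population loss after one additional pass. Under the same rate, a step size of order $T^{-3/4}$ would balance the two terms at order $T^{-1/4}$.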
[LINK]
http://arxiv.org/abs/2505.08306v1
[DATE]
2025-05-13 15:32:48+08:00
[CATEGORIES]
cs.LG
Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments
[AUTHORS]
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
[ABSTRACT]
State-space models (SSMs), particularly the Mamba architecture, have emerged
as powerful alternatives to Transformers for sequence modeling, offering
linear-time complexity and competitive performance across diverse tasks.
However, their large parameter counts pose significant challenges for
deployment in resource-constrained environments. We propose a novel
unstructured pruning framework tailored for Mamba models that achieves up to
70% parameter reduction while retaining over 95% of the original performance.
Our approach integrates three key innovations: (1) a gradient-aware magnitude
pruning technique that combines weight magnitude and gradient information to
identify less critical parameters, (2) an iterative pruning schedule that
gradually increases sparsity to maintain model stability, and (3) a global
pruning strategy that optimizes parameter allocation across the entire model.
Through extensive experiments on WikiText-103, Long Range Arena, and ETT
time-series benchmarks, we demonstrate significant efficiency gains with
minimal performance degradation. Our analysis of pruning effects on Mamba’s
components reveals critical insights into the architecture’s redundancy and
robustness, enabling practical deployment in resource-constrained settings
while broadening Mamba’s applicability.
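The three ingredients listed above (a gradient-aware importance score, a gradually increasing sparsity schedule, and a single global threshold) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the abstract does not give the exact score or schedule, so the score `|w * g|` and the cubic schedule below are stand-ins, and all function and parameter names are hypothetical.

```python
import numpy as np


def sparsity_schedule(step: int, total_steps: int, final_sparsity: float) -> float:
    """Cubic ramp from 0 to final_sparsity (an assumed schedule, not the paper's)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)


def global_prune_masks(weights: dict, grads: dict, sparsity: float) -> dict:
    """Return boolean keep-masks using one threshold shared by all tensors.

    Importance is assumed to be |weight * gradient|; the paper's exact
    combination of magnitude and gradient information may differ.
    """
    scores = {name: np.abs(weights[name] * grads[name]) for name in weights}
    flat = np.concatenate([s.ravel() for s in scores.values()])
    k = int(round(sparsity * flat.size))  # number of parameters to remove
    if k == 0:
        return {name: np.ones(w.shape, dtype=bool) for name, w in weights.items()}
    # k-th smallest score; entries tied with it are pruned as well.
    threshold = np.partition(flat, k - 1)[k - 1]
    return {name: scores[name] > threshold for name in weights}
```

In an iterative loop, one would call `sparsity_schedule` at each pruning step, recompute the masks from fresh gradients, zero out the masked weights, and continue training. Because the threshold is global, tensors with uniformly low scores can end up far sparser than others.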
[LINK]
http://arxiv.org/abs/2505.08299v1
[DATE]
2025-05-13 15:23:08+08:00
[CATEGORIES]
cs.LG
A Practical Introduction to Deep Reinforcement Learning
[AUTHORS]
Yinghan Sun, Hongxi Wang, Hua Chen, Wei Zhang
[ABSTRACT]
Deep reinforcement learning (DRL) has emerged as a powerful framework for
solving sequential decision-making problems, achieving remarkable success in a
wide range of applications, including game AI, autonomous driving, biomedicine,
and large language models. However, the diversity of algorithms and the
complexity of theoretical foundations often pose significant challenges for
beginners seeking to enter the field. This tutorial aims to provide a concise,
intuitive, and practical introduction to DRL, with a particular focus on the
Proximal Policy Optimization (PPO) algorithm, which is one of the most widely
used and effective DRL methods. To facilitate learning, we organize all
algorithms under the Generalized Policy Iteration (GPI) framework, offering
readers a unified and systematic perspective. Instead of lengthy theoretical
proofs, we emphasize intuitive explanations, illustrative examples, and
practical engineering techniques. This work serves as an efficient and
accessible guide, helping readers rapidly progress from basic concepts to the
implementation of advanced DRL algorithms.
[LINK]
http://arxiv.org/abs/2505.08295v1
[DATE]
2025-05-13 15:19:16+08:00
[CATEGORIES]
cs.LG
Prototype Augmented Hypernetworks for Continual Learning
[AUTHORS]
Neil De La Fuente, Maria Pilligua, Daniel Vidal, Albin Soutiff, Cecilia Curreli, Daniel Cremers, Andrey Barsky
[ABSTRACT]
Continual learning (CL) aims to learn a sequence of tasks without forgetting
prior knowledge, but gradient updates for a new task often overwrite the
weights learned earlier, causing catastrophic forgetting (CF). We propose
Prototype-Augmented Hypernetworks (PAH), a framework where a single
hypernetwork, conditioned on learnable task prototypes, dynamically generates
task-specific classifier heads on demand. To mitigate forgetting, PAH combines
cross-entropy with dual distillation losses, one to align logits and another to
align prototypes, ensuring stable feature representations across tasks.
Evaluations on Split-CIFAR100 and TinyImageNet demonstrate that PAH achieves
state-of-the-art performance, reaching 74.5% and 63.7% accuracy with only 1.7%
and 4.4% forgetting, respectively, surpassing prior methods without storing
samples or heads.
[COMMENTS]
CVPR (LatinX in CV)
[LINK]
http://arxiv.org/abs/2505.07450v2
[DATE]
2025-05-13 15:08:25+08:00
[CATEGORIES]
cs.LG
Equilibrium Propagation for Learning in Lagrangian Dynamical Systems
[AUTHORS]
Serge Massar
[ABSTRACT]
We propose a method for training dynamical systems governed by Lagrangian
mechanics using Equilibrium Propagation. Our approach extends Equilibrium
Propagation – initially developed for energy-based models – to dynamical
trajectories by leveraging the principle of action extremization. Training is
achieved by gently nudging trajectories toward desired targets and measuring
how the variables conjugate to the parameters to be trained respond. This
method is particularly suited to systems with periodic boundary conditions or
fixed initial and final states, enabling efficient parameter updates without
requiring explicit backpropagation through time. In the case of periodic
boundary conditions, this approach yields the semiclassical limit of Quantum
Equilibrium Propagation. Applications to systems with dissipation are also
discussed.
[COMMENTS]
8 pages, 1 figure
[LINK]
http://arxiv.org/abs/2505.07363v2
[DATE]
2025-05-13 15:06:52+08:00
[CATEGORIES]
cs.LG
UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation
[AUTHORS]
Oleg Sautenkov, Yasheerah Yaqoot, Artem Lykov, Muhammad Ahsan Mustafa, Grik Tadevosyan, Aibek Akhmetkazy, Miguel Altamirano Cabrera, Mikhail Martynov, Sausar Karaf, Dzmitry Tsetserukou
[ABSTRACT]
The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate
communication with aerial robots. By integrating satellite imagery processing
with the Visual Language Model (VLM) and the powerful capabilities of GPT,
UAV-VLA enables users to generate flight path-and-action plans through
simple text requests. This system leverages the rich contextual information
provided by satellite images, allowing for enhanced decision-making and mission
planning. The combination of visual analysis by VLM and natural language
processing by GPT can provide the user with the path-and-action set, making
aerial operations more efficient and accessible. The newly developed method
showed a 22% difference in the length of the created trajectory and a mean
error of 34.22 m (Euclidean distance) in locating objects of interest on a map
using the K-Nearest Neighbors (KNN) approach.
[COMMENTS]
HRI 2025
[LINK]
http://arxiv.org/abs/2501.05014v2
[DATE]
2025-05-13 14:54:45+08:00
[CATEGORIES]
cs.LG
Decoupled Multimodal Prototypes for Visual Recognition with Missing Modalities
[AUTHORS]
Jueqing Lu, Yuanyuan Qi, Xiaohao Yang, Shujie Zhou, Lan Du
[ABSTRACT]
Multimodal learning enhances deep learning models by enabling them to
perceive and understand information from multiple data modalities, such as
visual and textual inputs. However, most existing approaches assume the
availability of all modalities, an assumption that often fails in real-world
applications. Recent works have introduced learnable missing-case-aware prompts
to mitigate performance degradation caused by missing modalities while reducing
the need for extensive model fine-tuning. Building upon the effectiveness of
missing-case-aware handling for missing modalities, we propose a novel
decoupled prototype-based output head, which leverages missing-case-aware
class-wise prototypes tailored for each individual modality. This approach
dynamically adapts to different missing modality scenarios and can be
seamlessly integrated with existing prompt-based methods. Extensive experiments
demonstrate that our proposed output head significantly improves performance
across a wide range of missing-modality scenarios and varying missing rates.
[LINK]
http://arxiv.org/abs/2505.08283v1
[DATE]
2025-05-13 14:53:37+08:00
[CATEGORIES]
cs.LG
Iteratively reweighted kernel machines efficiently learn sparse functions
[AUTHORS]
Libin Zhu, Damek Davis, Dmitriy Drusvyatskiy, Maryam Fazel
[ABSTRACT]
The impressive practical performance of neural networks is often attributed
to their ability to learn low-dimensional data representations and hierarchical
structure directly from data. In this work, we argue that these two phenomena
are not unique to neural networks, and can be elicited from classical kernel
methods. Namely, we show that the derivative of the kernel predictor can detect
the influential coordinates with low sample complexity. Moreover, by
iteratively using the derivatives to reweight the data and retrain kernel
machines, one is able to efficiently learn hierarchical polynomials with finite
leap complexity. Numerical experiments illustrate the developed theory.
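The first claim above, that derivatives of a kernel predictor reveal the influential coordinates, can be illustrated with a small NumPy experiment. The sketch below fits Gaussian kernel ridge regression and scores each coordinate by the mean squared partial derivative of the fitted predictor. The kernel, bandwidth, ridge value, target function, and all names are assumptions chosen for illustration; the paper's actual estimator and reweighting scheme may differ, and the reweight-and-retrain loop is not shown.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """k(a_j, b_i) = exp(-||a_j - b_i||^2 / (2 sigma^2))."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def fit_krr(x, y, sigma, ridge):
    """Solve (K + ridge * I) alpha = y for the kernel ridge coefficients."""
    k = gaussian_kernel(x, x, sigma)
    return np.linalg.solve(k + ridge * np.eye(len(x)), y)


def gradient_scores(x, alpha, sigma):
    """Mean squared partial derivative of f(x) = sum_i alpha_i k(x, x_i).

    For the Gaussian kernel, grad f(x_j) = sum_i alpha_i k(x_j, x_i) (x_i - x_j) / sigma^2.
    """
    w = gaussian_kernel(x, x, sigma) * alpha[None, :]
    grads = (w @ x - w.sum(axis=1, keepdims=True) * x) / sigma**2
    return np.mean(grads**2, axis=0)


# Synthetic target that depends only on coordinates 0 and 1 out of 5.
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=(300, 5))
y_train = x_train[:, 0] + x_train[:, 0] * x_train[:, 1]

alpha = fit_krr(x_train, y_train, sigma=2.0, ridge=1e-3)
scores = gradient_scores(x_train, alpha, sigma=2.0)
top_two = set(np.argsort(scores)[-2:].tolist())
```

In a reweighting scheme of the kind the abstract describes, scores like these would be used to rescale the input coordinates before refitting, so that later rounds concentrate on the coordinates found so far.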
[LINK]
http://arxiv.org/abs/2505.08277v1
[DATE]
2025-05-13 14:41:39+08:00
[CATEGORIES]
cs.LG
Adaptive Security Policy Management in Cloud Environments Using Reinforcement Learning
[AUTHORS]
Muhammad Saqib, Dipkumar Mehta, Fnu Yashu, Shubham Malhotra
[ABSTRACT]
The security of cloud environments, such as Amazon Web Services (AWS), is
complex and dynamic. Static security policies have become inadequate as threats
evolve and cloud resources exhibit elasticity [1]. This paper addresses the
limitations of static policies by proposing a security policy management
framework that uses reinforcement learning (RL) to adapt dynamically.
Specifically, we employ deep reinforcement learning algorithms, including deep
Q Networks and proximal policy optimization, enabling the learning and
continuous adjustment of controls such as firewall rules and Identity and
Access Management (IAM) policies. The proposed RL-based solution leverages
cloud telemetry data (AWS CloudTrail logs, network traffic data, threat
intelligence feeds) to continuously refine security policies, maximizing threat
mitigation and compliance while minimizing resource impact. Experimental
results demonstrate that our adaptive RL-based framework significantly
outperforms static policies, achieving higher intrusion detection rates (92%
compared to 82% for static policies) and substantially reducing incident
detection and response times by 58%. In addition, it maintains high conformity
with security requirements and efficient resource usage. These findings
validate the effectiveness of adaptive reinforcement learning approaches in
improving cloud security policy management.
[COMMENTS]
10 pages, 6 figures, 1 table
[LINK]
http://arxiv.org/abs/2505.08837v1
[DATE]
2025-05-13 14:34:54+08:00
[CATEGORIES]
cs.LG
Functional Complexity-adaptive Temporal Tensor Decomposition
[AUTHORS]
Panqi Chen, Lei Cheng, Jianlong Li, Weichang Li, Weiqing Liu, Jiang Bian, Shikai Fang
[ABSTRACT]
Tensor decomposition is a fundamental tool for analyzing multi-dimensional
data by learning low-rank factors to represent high-order interactions. While
recent works on temporal tensor decomposition have made significant progress by
incorporating continuous timestamps in latent factors, they still struggle with
general tensor data with continuous indexes not only in the temporal mode but
also in other modes, such as spatial coordinates in climate data. Moreover, the
challenge of self-adapting model complexity is largely unexplored in functional
temporal tensor models, with existing methods being inapplicable in this
setting. To address these limitations, we propose functional
Complexity-Adaptive Temporal Tensor dEcomposition (Catte).
Our approach encodes continuous spatial indexes as learnable Fourier features
and employs neural ODEs in latent space to learn the temporal trajectories of
factors. To enable automatic adaptation of model complexity, we introduce a
sparsity-inducing prior over the factor trajectories.
We develop an efficient variational inference scheme with an analytical
evidence lower bound, enabling sampling-free optimization. Through extensive
experiments on both synthetic and real-world datasets, we demonstrate that
Catte not only reveals the underlying ranks of functional temporal
tensors but also significantly outperforms existing methods in prediction
performance and robustness against noise.
[LINK]
http://arxiv.org/abs/2502.06164v2
[DATE]
2025-05-13 14:33:49+08:00
[CATEGORIES]
cs.LG
GRID: Protecting Training Graph from Link Stealing Attacks on GNN Models
[AUTHORS]
Jiadong Lou, Xu Yuan, Rui Zhang, Xingliang Yuan, Neil Gong, Nian-Feng Tzeng
[ABSTRACT]
Graph neural networks (GNNs) have exhibited superior performance in various
classification tasks on graph-structured data. However, they are potentially
vulnerable to link stealing attacks, which can infer the presence of a link
between two nodes by measuring the similarity of the two nodes’ prediction
vectors produced by a GNN model. Such attacks pose
severe security and privacy threats to the training graph used in GNN models.
In this work, we propose a novel solution, called Graph Link Disguise (GRID),
to defend against link stealing attacks with the formal guarantee of GNN model
utility for retaining prediction accuracy. The key idea of GRID is to add
carefully crafted noise to the nodes’ prediction vectors to disguise
adjacent nodes as n-hop indirect neighboring nodes. We take into account the
graph topology and add noise only to a subset of nodes (called core nodes)
covering all links, which prevents the noise from being offset and has the
further advantages of reducing both the distortion loss and the computation
cost. Our crafted noise ensures that 1) the noisy prediction vectors of any two
adjacent nodes have a similarity level like that of two non-adjacent nodes
and 2) the model prediction is unchanged, ensuring zero utility loss. Extensive
experiments on five datasets are conducted to show the effectiveness of our
proposed GRID solution against different representative link-stealing attacks
under transductive settings and inductive settings respectively, as well as two
influence-based attacks. Meanwhile, it achieves a much better privacy-utility
trade-off than existing methods when extended to GNNs.
[LINK]
http://arxiv.org/abs/2501.10985v2
[DATE]
2025-05-13 14:32:32+08:00
[CATEGORIES]
cs.LG
Open the Eyes of MPNN: Vision Enhances MPNN in Link Prediction
[AUTHORS]
Yanbin Wei, Xuehao Wang, Zhan Zhuang, Yang Chen, Shuhao Chen, Yulong Zhang, Yu Zhang, James Kwok
[ABSTRACT]
Message-passing graph neural networks (MPNNs) and structural features (SFs)
are cornerstones for the link prediction task. However, as a common and
intuitive mode of understanding, the potential of visual perception has been
overlooked in the MPNN community. For the first time, we equip MPNNs with
vision structural awareness by proposing an effective framework called Graph
Vision Network (GVN), along with a more efficient variant (E-GVN). Extensive
empirical results demonstrate that with the proposed frameworks, GVN
consistently benefits from the vision enhancement across seven link prediction
datasets, including challenging large-scale graphs. Such improvements are
compatible with existing state-of-the-art (SOTA) methods and GVNs achieve new
SOTA results, thereby underscoring a promising novel direction for link
prediction.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2505.08266v1
[DATE]
2025-05-13 14:32:23+08:00
[CATEGORIES]
cs.LG
LLM Enhancers for GNNs: An Analysis from the Perspective of Causal Mechanism Identification
[AUTHORS]
Hang Gao, Wenxuan Huang, Fengge Wu, Junsuo Zhao, Changwen Zheng, Huaping Liu
[COMMENTS]
Accepted by ICML 2025
[LINK]
http://arxiv.org/abs/2505.08265v1
[DATE]
2025-05-13 14:29:25+08:00
[CATEGORIES]
cs.LG
Super-fast rates of convergence for Neural Networks Classifiers under the Hard Margin Condition
[AUTHORS]
Nathanael Tepakbong, Ding-Xuan Zhou, Xiang Zhou
[ABSTRACT]
We study the classical binary classification problem for hypothesis spaces of
Deep Neural Networks (DNNs) with ReLU activation under Tsybakov’s low-noise
condition with exponent $q>0$, and its limit-case $q\to\infty$ which we refer
to as the “hard-margin condition”. We show that DNNs which minimize the
empirical risk with square loss surrogate and $\ell_p$ penalty can achieve
finite-sample excess risk bounds of order $\mathcal{O}\left(n^{-\alpha}\right)$
for arbitrarily large $\alpha>0$ under the hard-margin condition, provided that
the regression function $\eta$ is sufficiently smooth. The proof relies on a
novel decomposition of the excess risk which might be of independent interest.
[COMMENTS]
31 pages
[LINK]
http://arxiv.org/abs/2505.08262v1
[DATE]
2025-05-13 14:26:04+08:00
[CATEGORIES]
cs.LG
Clustering-based Low-Rank Matrix Approximation: An Adaptive Theoretical Analysis with Application to Data Compression
[AUTHORS]
Sisipho Hamlomo, Marcellin Atemkeng
[ABSTRACT]
Low-rank matrix approximation (LoRMA) is a fundamental tool for compressing
high-resolution data matrices by extracting important features while
suppressing redundancy. Low-rank methods, such as global singular value
decomposition (SVD), apply uniform compression across the entire data matrix,
often ignoring important local variations and leading to the loss of fine
structural details. To address these limitations, we introduce an adaptive
LoRMA, which partitions the data matrix into overlapping patches, groups
structurally similar patches into several clusters using k-means, and performs
SVD within each cluster. We derive the overall compression factor accounting
for patch overlap and analyze how patch size influences compression efficiency
and computational cost. While the proposed adaptive LoRMA method is applicable
to any data exhibiting high local variation, we focus on medical imaging due to
its pronounced local variability. We evaluate and compare our adaptive LoRMA
against global SVD across four imaging modalities: MRI, ultrasound, CT scan,
and chest X-ray. Results demonstrate that adaptive LoRMA effectively preserves
structural integrity, edge details, and diagnostic relevance, as measured by
peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), mean
squared error (MSE), intersection over union (IoU), and edge preservation index
(EPI). Adaptive LoRMA significantly minimizes block artifacts and residual
errors, particularly in pathological regions, consistently outperforming global
SVD in terms of PSNR, SSIM, IoU, EPI, and achieving lower MSE. Adaptive LoRMA
prioritizes clinically salient regions while allowing aggressive compression in
non-critical regions, optimizing storage efficiency. Although adaptive LoRMA
requires higher processing time, its diagnostic fidelity justifies the overhead
for high-compression applications.
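
The patch, cluster, and per-cluster SVD pipeline described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`extract_patches`, `kmeans`, `truncated_svd`, `adaptive_lorma`), the choice to stack vectorized patches as rows of each cluster matrix, the plain Lloyd's k-means, and the averaging of overlapping reconstructions are all hypothetical choices made for the example.

```python
import numpy as np

def extract_patches(img, size, stride):
    """Collect overlapping size x size patches (flattened) and their corners."""
    h, w = img.shape
    corners = [(i, j) for i in range(0, h - size + 1, stride)
               for j in range(0, w - size + 1, stride)]
    patches = np.stack([img[i:i + size, j:j + size].ravel() for i, j in corners])
    return patches, corners

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumption); returns one label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean(axis=0)
    return labels

def truncated_svd(m, rank):
    """Best rank-`rank` approximation of m in the Frobenius norm."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    r = min(rank, len(s))
    return (u[:, :r] * s[:r]) @ vt[:r]

def adaptive_lorma(img, size=8, stride=4, k=3, rank=2):
    """Cluster patches, compress each cluster separately, then reassemble."""
    img = np.asarray(img, dtype=float)
    patches, corners = extract_patches(img, size, stride)
    labels = kmeans(patches, k)
    approx = np.empty_like(patches)
    for c in range(k):
        mask = labels == c
        if mask.any():
            approx[mask] = truncated_svd(patches[mask], rank)
    # Average overlapping reconstructions; pixels no patch covers stay zero.
    out = np.zeros_like(img)
    weight = np.zeros_like(img)
    for row, (i, j) in zip(approx, corners):
        out[i:i + size, j:j + size] += row.reshape(size, size)
        weight[i:i + size, j:j + size] += 1.0
    return out / np.maximum(weight, 1.0)

# Usage on a synthetic 32 x 32 matrix.
image = np.random.default_rng(1).random((32, 32))
compressed = adaptive_lorma(image, size=8, stride=4, k=3, rank=2)
print(np.linalg.norm(compressed - image))
```

Lowering `rank` trades reconstruction error for storage within each cluster, which is the knob a global SVD applies uniformly to the whole matrix.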
[LINK]
http://arxiv.org/abs/2505.08256v1
[DATE]
2025-05-13 14:10:05+08:00
[CATEGORIES]
cs.LG
Privacy-Preserving Analytics for Smart Meter (AMI) Data: A Hybrid Approach to Comply with CPUC Privacy Regulations
[AUTHORS]
Benjamin Westrich
[ABSTRACT]
Advanced Metering Infrastructure (AMI) data from smart electric and gas
meters enables valuable insights for utilities and consumers, but also raises
significant privacy concerns. In California, regulatory decisions (CPUC
D.11-07-056 and D.11-08-045) mandate strict privacy protections for customer
energy usage data, guided by the Fair Information Practice Principles (FIPPs).
We comprehensively explore solutions drawn from data anonymization,
privacy-preserving machine learning (differential privacy and federated
learning), synthetic data generation, and cryptographic techniques (secure
multiparty computation, homomorphic encryption). This allows advanced
analytics, including machine learning models, statistical and econometric
analysis on energy consumption data, to be performed without compromising
individual privacy.
We evaluate each technique’s theoretical foundations, effectiveness, and
trade-offs in the context of utility data analytics, and we propose an
integrated architecture that combines these methods to meet real-world needs.
The proposed hybrid architecture is designed to ensure compliance with
California’s privacy rules and FIPPs while enabling useful analytics, from
forecasting and personalized insights to academic research and econometrics,
while strictly protecting individual privacy. Mathematical definitions and
derivations are provided where appropriate to demonstrate privacy guarantees
and utility implications rigorously. We include comparative evaluations of the
techniques, an architecture diagram, and flowcharts to illustrate how they work
together in practice. The result is a blueprint for utility data scientists and
engineers to implement privacy-by-design in AMI data handling, supporting both
data-driven innovation and strict regulatory compliance.
[LINK]
http://arxiv.org/abs/2505.08237v1
[DATE]
2025-05-13 13:30:35+08:00
[CATEGORIES]
cs.LG
Enhanced Importance Sampling through Latent Space Exploration in Normalizing Flows
[AUTHORS]
Liam A. Kruse, Alexandros E. Tzikas, Harrison Delecki, Mansur M. Arief, Mykel J. Kochenderfer
[ABSTRACT]
Importance sampling is a rare event simulation technique used in Monte Carlo
simulations to bias the sampling distribution towards the rare event of
interest. By assigning appropriate weights to sampled points, importance
sampling allows for more efficient estimation of rare events or tails of
distributions. However, importance sampling can fail when the proposal
distribution does not effectively cover the target distribution. In this work,
we propose a method for more efficient sampling by updating the proposal
distribution in the latent space of a normalizing flow. Normalizing flows learn
an invertible mapping from a target distribution to a simpler latent
distribution. The latent space can be more easily explored during the search
for a proposal distribution, and samples from the proposal distribution are
recovered in the space of the target distribution via the invertible mapping.
We empirically validate our methodology on simulated robotics applications such
as autonomous racing and aircraft ground collision avoidance.
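
For readers unfamiliar with the baseline technique, the following is a minimal sketch of plain importance sampling for a rare event, not the normalizing-flow method proposed in the paper. The target (a standard normal), the event (exceeding a threshold of 4), the shifted Gaussian proposal, and the function name `rare_event_probability` are all assumptions chosen for illustration.

```python
import math
import numpy as np

def rare_event_probability(threshold=4.0, n=100_000, seed=0):
    """Estimate P(X > threshold) for X ~ N(0, 1) with a shifted proposal."""
    rng = np.random.default_rng(seed)
    # Proposal q = N(threshold, 1) places about half its mass on the rare event.
    x = rng.normal(loc=threshold, scale=1.0, size=n)
    # Log importance weight log p(x) - log q(x); normalizing constants cancel.
    log_w = -0.5 * x**2 + 0.5 * (x - threshold) ** 2
    return float(np.mean((x > threshold) * np.exp(log_w)))

# Closed-form Gaussian tail probability for comparison.
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
print(rare_event_probability(), exact)
```

If the proposal were left at the target itself, almost no samples would land in the event region, which is the coverage failure the abstract describes.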
[COMMENTS]
Accepted at AAAI 2025
[LINK]
http://arxiv.org/abs/2501.03394v2
[DATE]
2025-05-13 13:04:45+08:00
[CATEGORIES]
cs.LG
Generative AI for Urban Planning: Synthesizing Satellite Imagery via Diffusion Models
[AUTHORS]
Qingyi Wang, Yuebing Liang, Yunhan Zheng, Kaiyuan Xu, Jinhua Zhao, Shenhao Wang
[ABSTRACT]
Generative AI offers new opportunities for automating urban planning by
creating site-specific urban layouts and enabling flexible design exploration.
However, existing approaches often struggle to produce realistic and practical
designs at scale. Therefore, we adapt a state-of-the-art Stable Diffusion
model, extended with ControlNet, to generate high-fidelity satellite imagery
conditioned on land use descriptions, infrastructure, and natural environments.
To overcome data availability limitations, we spatially link satellite imagery
with structured land use and constraint information from OpenStreetMap. Using
data from three major U.S. cities, we demonstrate that the proposed diffusion
model generates realistic and diverse urban landscapes by varying land-use
configurations, road networks, and water bodies, facilitating cross-city
learning and design diversity. We also systematically evaluate the impacts of
varying language prompts and control imagery on the quality of satellite
imagery generation. Our model achieves high FID and KID scores and demonstrates
robustness across diverse urban contexts. Qualitative assessments from urban
planners and the general public show that generated images align closely with
design descriptions and constraints, and are often preferred over real images.
This work establishes a benchmark for controlled urban imagery generation and
highlights the potential of generative AI as a tool for enhancing planning
workflows and public engagement.
[LINK]
http://arxiv.org/abs/2505.08833v1
[DATE]
2025-05-13 12:55:38+08:00
[CATEGORIES]
cs.LG
Bellman Unbiasedness: Toward Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation
[AUTHORS]
Taehyun Cho, Seungyub Han, Seokhun Ju, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee
[ABSTRACT]
Distributional reinforcement learning improves performance by capturing
environmental stochasticity, but a comprehensive theoretical understanding of
its effectiveness remains elusive. In addition, the intractable element of the
infinite dimensionality of distributions has been overlooked. In this paper, we
present a regret analysis of distributional reinforcement learning with general
value function approximation in a finite episodic Markov decision process
setting. We first introduce a key notion of $\textit{Bellman unbiasedness}$
which is essential for exactly learnable and provably efficient distributional
updates in an online manner. Among all types of statistical functionals for
representing infinite-dimensional return distributions, our theoretical results
demonstrate that only moment functionals can exactly capture the statistical
information. Secondly, we propose a provably efficient algorithm,
$\texttt{SF-LSVI}$, that achieves a tight regret bound of $\tilde{O}(d_E
H^{\frac{3}{2}}\sqrt{K})$ where $H$ is the horizon, $K$ is the number of
episodes, and $d_E$ is the eluder dimension of a function class.
[LINK]
http://arxiv.org/abs/2407.21260v3
[DATE]
2025-05-13 12:53:31+08:00
[CATEGORIES]
cs.LG
Policy-labeled Preference Learning: Is Preference Enough for RLHF?
[AUTHORS]
Taehyun Cho, Seokhun Ju, Seungyub Han, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee
[ABSTRACT]
To design rewards that align with human goals, Reinforcement Learning from
Human Feedback (RLHF) has emerged as a prominent technique for learning reward
functions from human preferences and optimizing policies via reinforcement
learning algorithms. However, existing RLHF methods often misinterpret
trajectories as being generated by an optimal policy, causing inaccurate
likelihood estimation and suboptimal learning. Inspired by the Direct Preference
Optimization framework, which directly learns an optimal policy without an explicit
reward, we propose policy-labeled preference learning (PPL) to resolve
likelihood mismatch issues by modeling human preferences with regret, which
reflects behavior policy information. We also provide a contrastive KL
regularization, derived from regret-based principles, to enhance RLHF in
sequential decision making. Experiments in high-dimensional continuous control
tasks demonstrate PPL’s significant improvements in offline RLHF performance
and its effectiveness in online settings.
[LINK]
http://arxiv.org/abs/2505.06273v2
[DATE]
2025-05-13 12:50:08+08:00
[CATEGORIES]
cs.LG
Position: AI Scaling: From Up to Down and Out
[AUTHORS]
Yunke Wang, Yanxi Li, Chang Xu
[ABSTRACT]
AI Scaling has traditionally been synonymous with Scaling Up, which builds
larger and more powerful models. However, the growing demand for efficiency,
adaptability, and collaboration across diverse applications necessitates a
broader perspective. This position paper presents a holistic framework for AI
scaling, encompassing Scaling Up, Scaling Down, and Scaling Out. It argues that
while Scaling Up of models faces inherent bottlenecks, the future trajectory of
AI scaling lies in Scaling Down and Scaling Out. These paradigms address
critical technical and societal challenges, such as reducing carbon footprint,
ensuring equitable access, and enhancing cross-domain collaboration. We explore
transformative applications in healthcare, smart manufacturing, and content
creation, demonstrating how AI Scaling can enable breakthroughs in efficiency,
personalization, and global connectivity. Additionally, we highlight key
challenges, including balancing model complexity with interpretability,
managing resource constraints, and fostering ethical development. By
synthesizing these approaches, we propose a unified roadmap that redefines the
future of AI research and application, paving the way for advancements toward
Artificial General Intelligence (AGI).
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2502.01677v2
[DATE]
2025-05-13 12:47:13+08:00
[CATEGORIES]
cs.LG
Deep Probabilistic Modeling of User Behavior for Anomaly Detection via Mixture Density Networks
[AUTHORS]
Lu Dai, Wenxuan Zhu, Xuehui Quan, Renzi Meng, Sheng Cai, Yichen Wang
[ABSTRACT]
To improve the identification of potential anomaly patterns in complex user
behavior, this paper proposes an anomaly detection method based on a deep
mixture density network. The method constructs a Gaussian mixture model
parameterized by a neural network, enabling conditional probability modeling of
user behavior. It effectively captures the multimodal distribution
characteristics commonly present in behavioral data. Unlike traditional
classifiers that rely on fixed thresholds or a single decision boundary, this
approach defines an anomaly scoring function based on probability density using
negative log-likelihood. This significantly enhances the model’s ability to
detect rare and unstructured behaviors. Experiments are conducted on the
real-world network user dataset UNSW-NB15. A series of performance comparisons
and stability validation experiments are designed. These cover multiple
evaluation aspects, including Accuracy, F1-score, AUC, and loss fluctuation.
The results show that the proposed method outperforms several advanced neural
network architectures in both performance and training stability. This study
provides a more expressive and discriminative solution for user behavior
modeling and anomaly detection. It strongly promotes the application of deep
probabilistic modeling techniques in the fields of network security and
intelligent risk control.
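
The density-based anomaly score described above can be sketched as the negative log-likelihood of an observation under a Gaussian mixture. This is a minimal one-dimensional illustration, not the authors' model: in a mixture density network the weights, means, and standard deviations would be produced by a neural network conditioned on the input, whereas here they are fixed, hypothetical values, and `mixture_nll` is a name introduced for the example.

```python
import numpy as np

def mixture_nll(y, weights, means, stds):
    """Negative log-likelihood of each y under a 1-D Gaussian mixture."""
    y = np.asarray(y, dtype=float)[:, None]
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # Log of weight_k * N(y | mean_k, std_k) for every sample and component.
    log_comp = (np.log(weights) - np.log(stds) - 0.5 * np.log(2.0 * np.pi)
                - 0.5 * ((y - means) / stds) ** 2)
    # Log-sum-exp over components for numerical stability.
    m = log_comp.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return -log_p

# Two behavior modes at 0 and 5; the point at 20 lies far from both.
scores = mixture_nll([0.1, 5.0, 20.0], [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
print(scores)
```

A higher score means lower density under the mixture, so ranking by this score flags rare observations without fixing a single decision boundary in the input space.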
[LINK]
http://arxiv.org/abs/2505.08220v1
[DATE]
2025-05-13 12:32:21+08:00
[CATEGORIES]
cs.LG
An Effective Flow-based Method for Positive-Unlabeled Learning: 2-HNC
[AUTHORS]
Dorit Hochbaum, Torpong Nitayanont
[ABSTRACT]
In many scenarios of binary classification, only positive instances are
provided in the training data, leaving the rest of the data unlabeled. This
setup, known as positive-unlabeled (PU) learning, is addressed here with a
network flow-based method which utilizes pairwise similarities between samples.
The method we propose here, 2-HNC, leverages Hochbaum’s Normalized Cut (HNC)
and the set of solutions it provides by solving a parametric minimum cut
problem. The solutions, which are nested partitions of the samples into
two sets, correspond to varying tradeoff values between the two goals: high
intra-similarity inside the sets and low inter-similarity between the two sets.
This nested sequence is utilized here to deliver a ranking of unlabeled samples
by their likelihood of being negative. Building on this insight, our method,
2-HNC, proceeds in two stages. The first stage generates this ranking without
assuming any negative labels, using a problem formulation that is constrained
only on positive labeled samples. The second stage augments the positive set
with likely-negative samples and recomputes the classification. The final label
prediction selects, among all partitions generated in both stages, the one
whose positive class proportion is closest to a prior estimate of this
quantity, which is assumed to be given. Extensive experiments across synthetic
and real datasets show that 2-HNC yields strong performance and often surpasses
existing state-of-the-art algorithms.
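
The final selection rule described above can be sketched in a few lines. This covers only the last step, assuming the candidate partitions have already been produced by the parametric minimum cut solver; the function name `select_partition` and the 0/1 label encoding are hypothetical.

```python
import numpy as np

def select_partition(partitions, prior):
    """Return the index and labels of the candidate partition whose
    fraction of positive labels is closest to the prior estimate."""
    fractions = np.array([np.mean(p) for p in partitions])
    best = int(np.argmin(np.abs(fractions - prior)))
    return best, partitions[best]

# Three nested candidate labelings of four samples (1 = positive).
candidates = [
    np.array([1, 0, 0, 0]),
    np.array([1, 1, 0, 0]),
    np.array([1, 1, 1, 0]),
]
index, labels = select_partition(candidates, prior=0.45)
print(index, labels)
```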
[LINK]
http://arxiv.org/abs/2505.08212v1
[DATE]
2025-05-13 11:58:16+08:00
[CATEGORIES]
cs.LG
AI and Generative AI Transforming Disaster Management: A Survey of Damage Assessment and Response Techniques
[AUTHORS]
Aman Raj, Lakshit Arora, Sanjay Surendranath Girija, Shashank Kapoor, Dipen Pradhan, Ankit Shetgaonkar
[ABSTRACT]
Natural disasters, including earthquakes, wildfires and cyclones, pose a huge
risk to human lives as well as infrastructure assets. An effective response to
disaster depends on the ability to rapidly and efficiently assess the intensity
of damage. Artificial Intelligence (AI) and Generative Artificial Intelligence
(GenAI) present a breakthrough solution, capable of combining knowledge from
multiple types and sources of data, simulating realistic scenarios of disaster,
and identifying emerging trends at a speed previously unimaginable. In this
paper, we present a comprehensive review on the prospects of AI and GenAI in
damage assessment for various natural disasters, highlighting both its
strengths and limitations. We talk about its application to multimodal data
such as text, image, video, and audio, and also cover major issues of data
privacy, security, and ethical use of the technology during crises. The paper
also recognizes the threat of Generative AI misuse, in the form of
dissemination of misinformation and for adversarial attacks. Finally, we
outline avenues of future research, emphasizing the need for secure, reliable,
and ethical Generative AI systems for disaster management in general. We
believe that this work represents the first comprehensive survey of Gen-AI
techniques being used in the field of Disaster Assessment and Response.
[COMMENTS]
Accepted in IEEE Compsac 2025
[LINK]
http://arxiv.org/abs/2505.08202v1
[DATE]
2025-05-13 11:33:31+08:00
[CATEGORIES]
cs.LG
An Efficient On-Policy Deep Learning Framework for Stochastic Optimal Control
[AUTHORS]
Mengjian Hua, Mathieu Laurière, Eric Vanden-Eijnden
[ABSTRACT]
We present a novel on-policy algorithm for solving stochastic optimal control
(SOC) problems. By leveraging the Girsanov theorem, our method directly
computes on-policy gradients of the SOC objective without expensive
backpropagation through stochastic differential equations or adjoint problem
solutions. This approach significantly accelerates the optimization of neural
network control policies while scaling efficiently to high-dimensional problems
and long time horizons. We evaluate our method on classical SOC benchmarks as
well as applications to sampling from unnormalized distributions via
Schrödinger-Föllmer processes and fine-tuning pre-trained diffusion models.
Experimental results demonstrate substantial improvements in both computational
speed and memory efficiency compared to existing approaches.
[LINK]
http://arxiv.org/abs/2410.05163v3
[DATE]
2025-05-13 11:30:44+08:00
[CATEGORIES]
cs.LG
A Multi-scale Representation Learning Framework for Long-Term Time Series Forecasting
[AUTHORS]
Boshi Gao, Qingjian Ni, Fanbo Ju, Yu Chen, Ziqi Zhao
[ABSTRACT]
Long-term time series forecasting (LTSF) offers broad utility in practical
settings like energy consumption and weather prediction. Accurately predicting
long-term changes, however, is demanding due to the intricate temporal patterns
and inherent multi-scale variations within time series. This work confronts key
issues in LTSF, including the suboptimal use of multi-granularity information,
the neglect of channel-specific attributes, and the unique nature of trend and
seasonal components, by introducing a proficient MLP-based forecasting
framework. Our method adeptly disentangles complex temporal dynamics using
clear, concurrent predictions across various scales. These multi-scale
forecasts are then skillfully integrated through a system that dynamically
assigns importance to information from different granularities, sensitive to
individual channel characteristics. To manage the specific features of temporal
patterns, a two-pronged structure is utilized to model trend and seasonal
elements independently. Experimental results on eight LTSF benchmarks
demonstrate that our framework, MDMixer, improves average MAE performance by 4.64% compared to
the recent state-of-the-art MLP-based method (TimeMixer), while achieving an
effective balance between training efficiency and model interpretability.
[LINK]
http://arxiv.org/abs/2505.08199v1
[DATE]
2025-05-13 11:26:44+08:00
[CATEGORIES]
cs.LG
SIM-Shapley: A Stable and Computationally Efficient Approach to Shapley Value Approximation
[AUTHORS]
Wangxuan Fan, Siqi Li, Doudou Zhou, Yohei Okada, Chuan Hong, Molei Liu, Nan Liu
[ABSTRACT]
Explainable artificial intelligence (XAI) is essential for trustworthy
machine learning (ML), particularly in high-stakes domains such as healthcare
and finance. Shapley value (SV) methods provide a principled framework for
feature attribution in complex models but incur high computational costs,
limiting their scalability in high-dimensional settings. We propose Stochastic
Iterative Momentum for Shapley Value Approximation (SIM-Shapley), a stable and
efficient SV approximation method inspired by stochastic optimization. We
analyze variance theoretically, prove linear $Q$-convergence, and demonstrate
improved empirical stability and low bias in practice on real-world datasets.
In our numerical experiments, SIM-Shapley reduces computation time by up to 85%
relative to state-of-the-art baselines while maintaining comparable feature
attribution quality. Beyond feature attribution, our stochastic mini-batch
iterative framework extends naturally to a broader class of sample average
approximation problems, offering a new avenue for improving computational
efficiency with stability guarantees. Code is publicly available at
https://github.com/nliulab/SIM-Shapley.
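
To make the computational cost concrete, here is a minimal sketch of the standard Monte Carlo permutation estimator for Shapley values. It is a generic baseline, not SIM-Shapley itself; the function name `permutation_shapley` and the toy additive value function are assumptions introduced for illustration.

```python
import numpy as np

def permutation_shapley(value_fn, n_features, n_perms=200, seed=0):
    """Average each feature's marginal contribution over random orderings."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_features)
    for _ in range(n_perms):
        order = rng.permutation(n_features)
        coalition = []
        prev = value_fn(frozenset())
        for i in order:
            coalition.append(int(i))
            cur = value_fn(frozenset(coalition))
            phi[i] += cur - prev  # marginal gain from adding feature i
            prev = cur
    return phi / n_perms

# Toy additive game: a coalition's value is the sum of its members' weights,
# so each feature's Shapley value equals its own weight.
weights = [1.0, 2.0, -0.5]
phi = permutation_shapley(lambda s: sum(weights[i] for i in s), n_features=3)
print(phi)
```

Each permutation costs one model evaluation per feature, so the budget grows with both the number of features and the number of sampled orderings, which is why cheaper and more stable approximations matter in high-dimensional settings.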
[COMMENTS]
21 pages, 6 figures, 5 tables
[LINK]
http://arxiv.org/abs/2505.08198v1
[DATE]
2025-05-13 11:23:10+08:00
[CATEGORIES]
cs.LG
Nesterov acceleration in benignly non-convex landscapes
[AUTHORS]
Kanan Gupta, Stephan Wojtowytsch
[ABSTRACT]
While momentum-based optimization algorithms are commonly used in the
notoriously non-convex optimization problems of deep learning, their analysis
has historically been restricted to the convex and strongly convex setting. In
this article, we partially close this gap between theory and practice and
demonstrate that virtually identical guarantees can be obtained in optimization
problems with a ‘benign’ non-convexity. We show that these weaker geometric
assumptions are well justified in overparametrized deep learning, at least
locally. Variations of this result are obtained for a continuous time model of
Nesterov’s accelerated gradient descent algorithm (NAG), the classical discrete
time version of NAG, and versions of NAG with stochastic gradient estimates
with purely additive noise and with noise that exhibits both additive and
multiplicative scaling.
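
For reference, the classical discrete-time NAG update can be sketched as below. This is the textbook constant-momentum variant applied to a strongly convex quadratic, chosen only because its behavior is easy to check; it does not reproduce the benignly non-convex setting or the stochastic variants analyzed in the paper, and the function name `nag` and the test problem are assumptions.

```python
import numpy as np

def nag(grad, x0, step, momentum, iters):
    """Nesterov's accelerated gradient with constant momentum."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(iters):
        # Look-ahead point: extrapolate along the previous displacement.
        y = x + momentum * (x - x_prev)
        # Gradient step taken from the look-ahead point.
        x_prev, x = x, y - step * grad(y)
    return x

# Quadratic f(x) = 0.5 * x^T A x with curvatures between mu = 1 and L = 10.
A = np.diag([1.0, 10.0])
kappa = 10.0
beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
x_final = nag(lambda v: A @ v, x0=[5.0, -3.0], step=1.0 / 10.0,
              momentum=beta, iters=200)
print(np.linalg.norm(x_final))
```

The gradient is evaluated at the extrapolated point `y` rather than at the current iterate, which is what distinguishes NAG from heavy-ball momentum.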
[COMMENTS]
ICLR 2025 Spotlight
[LINK]
http://arxiv.org/abs/2410.08395v3
[DATE]
2025-05-13 11:12:54+08:00
[CATEGORIES]
cs.LG
Aitomia: Your Intelligent Assistant for AI-Driven Atomistic and Quantum Chemical Simulations
[AUTHORS]
Jinming Hu, Hassan Nawaz, Yuting Rui, Lijie Chi, Arif Ullah, Pavlo O. Dral
[ABSTRACT]
We have developed Aitomia - a platform powered by AI to assist in performing
AI-driven atomistic and quantum chemical (QC) simulations. This intelligent
assistant platform is equipped with chatbots and AI agents to help experts and
guide non-experts in setting up and running the atomistic simulations,
monitoring their computation status, analyzing the simulation results, and
summarizing them for the user in text and graphical forms. We achieve these
goals by exploiting fine-tuned open-source large language models (LLMs),
rule-based agents, and a retrieval-augmented generation (RAG) system. Aitomia
leverages the versatility of our MLatom ecosystem for AI-enhanced computational
chemistry. This intelligent assistant is going to be integrated into the
Aitomistic Hub and XACS online computing services, with some functionality
already publicly available as described at http://mlatom.com/aitomia. Aitomia
is expected to lower the barrier to performing atomistic simulations,
accelerating research and development in the relevant fields.
[LINK]
http://arxiv.org/abs/2505.08195v1
[DATE]
2025-05-13 11:11:41+08:00
[CATEGORIES]
cs.LG
Rethinking Latent Redundancy in Behavior Cloning: An Information Bottleneck Approach for Robot Manipulation
[AUTHORS]
Shuanghao Bai, Wanqi Zhou, Pengxiang Ding, Wei Zhao, Donglin Wang, Badong Chen
[ABSTRACT]
Behavior Cloning (BC) is a widely adopted visual imitation learning method in
robot manipulation. Current BC approaches often enhance generalization by
leveraging large datasets and incorporating additional visual and textual
modalities to capture more diverse information. However, these methods overlook
whether the learned representations contain redundant information and lack a
solid theoretical foundation to guide the learning process. To address these
limitations, we adopt an information-theoretic perspective and introduce mutual
information to quantify and mitigate redundancy in latent representations.
Building on this, we incorporate the Information Bottleneck (IB) principle into
BC, which extends the idea of reducing redundancy by providing a structured
framework for compressing irrelevant information while preserving task-relevant
features. This work presents the first comprehensive study on redundancy in
latent representations across various methods, backbones, and experimental
settings, while extending the generalizability of the IB to BC. Extensive
experiments and analyses on the CortexBench and LIBERO benchmarks demonstrate
significant performance improvements with IB, underscoring the importance of
reducing input data redundancy and highlighting its value for practical
applications. Project Page:
https://baishuanghao.github.io/BC-IB.github.io.
[COMMENTS]
Accepted by ICML 2025
[LINK]
http://arxiv.org/abs/2502.02853v5
[DATE]
2025-05-13 11:02:42+08:00
[CATEGORIES]
cs.LG
Unsupervised Raindrop Removal from a Single Image using Conditional Diffusion Models
[AUTHORS]
Lhuqita Fazry, Valentino Vito
[ABSTRACT]
Raindrop removal is a challenging task in image processing. Removing
raindrops while relying solely on a single image further increases the
difficulty of the task. Common approaches include the detection of raindrop
regions in the image, followed by performing a background restoration process
conditioned on those regions. While various methods can be applied for the
detection step, the most common architecture used for background restoration is
the Generative Adversarial Network (GAN). Recent advances in the use of
diffusion models have led to state-of-the-art image inpainting techniques. In
this paper, we introduce a novel technique for raindrop removal from a single
image using diffusion-based image inpainting.
[LINK]
http://arxiv.org/abs/2505.08190v1
[DATE]
2025-05-13 11:00:01+08:00
[CATEGORIES]
cs.LG
DSADF: Thinking Fast and Slow for Decision Making
[AUTHORS]
Alex Zhihao Dou, Dongfei Cui, Jun Yan, Weida Wang, Benteng Chen, Haoming Wang, Zeke Xie, Shufei Zhang
[ABSTRACT]
Although Reinforcement Learning (RL) agents are effective in well-defined
environments, they often struggle to generalize their learned policies to
dynamic settings due to their reliance on trial-and-error interactions. Recent
work has explored applying Large Language Models (LLMs) or Vision Language
Models (VLMs) to boost the generalization of RL agents through policy
optimization guidance or prior knowledge. However, these approaches often lack
seamless coordination between the RL agent and the foundation model, leading to
unreasonable decision-making in unfamiliar environments and efficiency
bottlenecks. Making full use of the inferential capabilities of foundation
models and the rapid response capabilities of RL agents and enhancing the
interaction between the two to form a dual system is still a lingering
scientific question. To address this problem, we draw inspiration from
Kahneman’s theory of fast thinking (System 1) and slow thinking (System 2),
demonstrating that balancing intuition and deep reasoning can achieve nimble
decision-making in a complex world. In this study, we propose a Dual-System
Adaptive Decision Framework (DSADF), integrating two complementary modules:
System 1, comprising an RL agent and a memory space for fast and intuitive
decision making, and System 2, driven by a VLM for deep and analytical
reasoning. DSADF facilitates efficient and adaptive decision-making by
combining the strengths of both systems. An empirical study in the video game
environments Crafter and Housekeep demonstrates the effectiveness of our
proposed method, showing significant improvements in decision abilities for
both unseen and known tasks.
[LINK]
http://arxiv.org/abs/2505.08189v1
[DATE]
2025-05-13 10:58:04+08:00
[CATEGORIES]
cs.LG
COMRECGC: Global Graph Counterfactual Explainer through Common Recourse
[AUTHORS]
Gregoire Fournier, Sourav Medya
[COMMENTS]
Accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2505.07081v2
[DATE]
2025-05-13 10:51:33+08:00
[CATEGORIES]
cs.LG
LLMs meet Federated Learning for Scalable and Secure IoT Management
[AUTHORS]
Yazan Otoum, Arghavan Asad, Amiya Nayak
[ABSTRACT]
The rapid expansion of IoT ecosystems introduces severe challenges in
scalability, security, and real-time decision-making. Traditional centralized
architectures struggle with latency, privacy concerns, and excessive resource
consumption, making them unsuitable for modern large-scale IoT deployments.
This paper presents a novel Federated Learning-driven Large Language Model
(FL-LLM) framework, designed to enhance IoT system intelligence while ensuring
data privacy and computational efficiency. The framework integrates Generative
IoT (GIoT) models with a Gradient Sensing Federated Strategy (GSFS),
dynamically optimizing model updates based on real-time network conditions. By
leveraging a hybrid edge-cloud processing architecture, our approach balances
intelligence, scalability, and security in distributed IoT environments.
Evaluations on the IoT-23 dataset demonstrate that our framework improves model
accuracy, reduces response latency, and enhances energy efficiency,
outperforming traditional FL techniques (i.e., FedAvg, FedOpt). These findings
highlight the potential of integrating LLM-powered federated learning into
large-scale IoT ecosystems, paving the way for more secure, scalable, and
adaptive IoT management solutions.
[COMMENTS]
This work has been submitted to the IEEE Global Communications
Conference (GLOBECOM) 2025 for possible publication
[LINK]
http://arxiv.org/abs/2504.16032v2
[DATE]
2025-05-13 10:49:49+08:00
[CATEGORIES]
cs.LG
Feasibility-Aware Pessimistic Estimation: Toward Long-Horizon Safety in Offline RL
[AUTHORS]
Zhikun Tao, Gang Xiong, He Fang, Zhen Shen, Yunjun Han, Qing-Shan Jia
[ABSTRACT]
Offline safe reinforcement learning (OSRL), which derives constraint-satisfying
policies from pre-collected datasets, offers a promising avenue for deploying
RL in safety-critical real-world domains such as robotics. However, the
majority of existing approaches emphasize only short-term safety, neglecting
long-horizon considerations. Consequently, they may violate safety constraints
and fail to ensure sustained protection during online deployment. Moreover, the
learned policies often struggle to handle states and actions that are not
present in, or are out-of-distribution (OOD) with respect to, the offline
dataset, and exhibit limited sample efficiency. To address these challenges,
we propose a novel framework, Feasibility-Aware offline Safe Reinforcement
Learning with CVAE-based
Pessimism (FASP). First, we employ Hamilton-Jacobi (H-J) reachability analysis
to generate reliable safety labels, which serve as supervisory signals for
training both a conditional variational autoencoder (CVAE) and a safety
classifier. This approach not only ensures high sampling efficiency but also
provides rigorous long-horizon safety guarantees. Furthermore, we utilize
pessimistic estimation methods to estimate the Q-value of reward and cost,
which mitigates the extrapolation errors induces by OOD actions, and penalize
unsafe actions to enabled the agent to proactively avoid high-risk behaviors.
Moreover, we theoretically prove the validity of this pessimistic estimation.
Extensive experiments on DSRL benchmarks demonstrate that the FASP algorithm
achieves competitive performance across multiple experimental tasks,
particularly outperforming state-of-the-art algorithms in terms of safety.
[LINK]
http://arxiv.org/abs/2505.08179v1
[DATE]
2025-05-13 10:32:49+08:00
[CATEGORIES]
cs.LG
Enhancing User Interest based on Stream Clustering and Memory Networks in Large-Scale Recommender Systems
[AUTHORS]
Peng Liu, Nian Wang, Cong Xu, Ming Zhao, Bin Wang, Yi Ren
[ABSTRACT]
Recommender Systems (RSs), which are widely used on various platforms, provide
personalized recommendation services based on user interest. However, many
users have sparse interest because they lack consumption behaviors,
which leads to poor recommendation results for them. This problem is widespread
in large-scale RSs and is particularly difficult to address. To solve this
challenging problem, we propose an innovative solution called User Interest
Enhancement (UIE). UIE enhances user interest including user profile and user
history behavior sequences by leveraging the enhancement vectors and
personalized enhancement vectors generated based on dynamic streaming
clustering of similar users and items from multiple perspectives, which are
stored and updated in memory networks. UIE not only remarkably improves model
performance for users with sparse interest, but also delivers notable gains for
other users. As an end-to-end solution, UIE is easy to implement on top of
existing ranking models. Furthermore, we extend our approach to long-tail items
using similar methods, which also yields excellent improvements. We conduct
extensive offline and online experiments in an industrial RS. The results
demonstrate that UIE substantially outperforms other existing approaches,
especially for users with sparse interest. UIE has been deployed in several
large-scale RSs at Tencent since 2022, which was made public on 21 May 2024. In
addition, UIE-based methods have also been successfully applied in candidate
generation, pre-ranking, and context-DNN stages. Multiple teams have developed
solutions based on UIE, focusing on updating clustering algorithms and
attention mechanisms. As far as we know, UIE has been applied in multiple RSs,
advertising systems and search engines. The core ideas of UIE, dynamic streaming
clustering and similarity enhancement, have inspired subsequent relevant works.
[LINK]
http://arxiv.org/abs/2405.13238v6
[DATE]
2025-05-13 10:16:29+08:00
[CATEGORIES]
cs.LG
Capability-Aware Shared Hypernetworks for Flexible Heterogeneous Multi-Robot Coordination
[AUTHORS]
Kevin Fu, Shalin Anand Jain, Pierce Howell, Harish Ravichandar
[ABSTRACT]
Recent advances have enabled heterogeneous multi-robot teams to learn complex
and effective coordination skills. However, existing neural architectures that
support heterogeneous teaming tend to force a trade-off between expressivity
and efficiency. Shared-parameter designs prioritize sample efficiency by
enabling a single network to be shared across all or a pre-specified subset of
robots (via input augmentations), but tend to limit behavioral diversity. In
contrast, recent designs employ a separate policy for each robot, enabling
greater diversity and expressivity at the cost of efficiency and
generalization. Our key insight is that such tradeoffs can be avoided by
viewing these design choices as ends of a broad spectrum. Inspired by recent
work in transfer and meta learning, and building on prior work in multi-robot
task allocation, we propose Capability-Aware Shared Hypernetworks (CASH), a
soft weight sharing architecture that uses hypernetworks to efficiently learn a
flexible shared policy that dynamically adapts to each robot post-training. By
explicitly encoding the impact of robot capabilities (e.g., speed and payload)
on collective behavior, CASH enables zero-shot generalization to unseen robots
or team compositions. Our experiments involve multiple heterogeneous tasks,
three learning paradigms (imitation learning, value-based, and policy-gradient
RL), and SOTA multi-robot simulation (JaxMARL) and hardware (Robotarium)
platforms. Across all conditions, we find that CASH generates
appropriately-diverse behaviors and consistently outperforms baseline
architectures in terms of performance and sample efficiency during both
training and zero-shot generalization, all with 60%-80% fewer learnable
parameters.
[COMMENTS]
22 pages, 8 figures, equal authorship between Kevin Fu and Shalin
Anand Jain
[LINK]
http://arxiv.org/abs/2501.06058v4
[DATE]
2025-05-13 10:02:30+08:00
[CATEGORIES]
cs.LG
Computing High-dimensional Confidence Sets for Arbitrary Distributions
[AUTHORS]
Chao Gao, Liren Shan, Vaidehi Srinivas, Aravindan Vijayaraghavan
[ABSTRACT]
We study the problem of learning a high-density region of an arbitrary
distribution over $\mathbb{R}^d$. Given a target coverage parameter $\delta$,
and sample access to an arbitrary distribution $D$, we want to output a
confidence set $S \subset \mathbb{R}^d$ such that $S$ achieves $\delta$
coverage of $D$, i.e., $\mathbb{P}_{y \sim D} \left[ y \in S \right] \ge
\delta$, and the volume of $S$ is as small as possible. This is a central
problem in high-dimensional statistics with applications in finding confidence
sets, uncertainty quantification, and support estimation.
In the most general setting, this problem is statistically intractable, so we
restrict our attention to competing with sets from a concept class $C$ with
bounded VC-dimension. An algorithm is competitive with class $C$ if, given
samples from an arbitrary distribution $D$, it outputs in polynomial time a set
that achieves $\delta$ coverage of $D$, and whose volume is competitive with
the smallest set in $C$ with the required coverage $\delta$. This problem is
computationally challenging even in the basic setting when $C$ is the set of
all Euclidean balls. Existing algorithms based on coresets find in polynomial
time a ball whose volume is $\exp(\tilde{O}( d/ \log d))$-factor competitive
with the volume of the best ball.
Our main result is an algorithm that finds a confidence set whose volume is
$\exp(\tilde{O}(d^{1/2}))$ factor competitive with the optimal ball having the
desired coverage. The algorithm is improper (it outputs an ellipsoid). Combined
with our computational intractability result for proper learning balls within
an $\exp(\tilde{O}(d^{1-o(1)}))$ approximation factor in volume, our results
provide an interesting separation between proper and (improper) learning of
confidence sets.
[COMMENTS]
Improves volume approximation factor from $\exp(\tilde{O}(d^{2/3}))$
to $\exp(\tilde{O}(d^{1/2}))$, along with other minor edits. To appear in
COLT 2025
[LINK]
http://arxiv.org/abs/2504.02723v2
[DATE]
2025-05-13 10:01:44+08:00
[CATEGORIES]
cs.LG
Enhancing the Efficiency of Complex Systems Crystal Structure Prediction by Active Learning Guided Machine Learning Potential
[AUTHORS]
Jiaxiang Li, Junwei Feng, Jie Luo, Bowen Jiang, Xiangyu Zheng, Jian Lv, Keith Butler, Hanyu Liu, Congwei Xie, Yu Xie, Yanming Ma
[ABSTRACT]
Understanding multicomponent complex material systems is essential for the design
of advanced materials for a wide range of technological applications. While
state-of-the-art crystal structure prediction (CSP) methods effectively
identify new structures and assess phase stability, they face fundamental
limitations when applied to complex systems. This challenge stems from the
combinatorial explosion of atomic configurations and the vast stoichiometric
space, both of which contribute to computational demands that rapidly exceed
practical feasibility. In this work, we propose a flexible and automated
workflow to build a highly generalizable and data-efficient machine learning
potential (MLP), effectively unlocking the full potential of CSP algorithms.
The workflow is validated on both Mg-Ca-H ternary and Be-P-N-O quaternary
systems, demonstrating substantial machine learning acceleration in
high-throughput structural optimization and enabling the efficient
identification of promising compounds. These results underscore the
effectiveness of our approach in exploring complex material systems and
accelerating the discovery of new multicomponent materials.
[LINK]
http://arxiv.org/abs/2505.08159v1
[DATE]
2025-05-13 09:34:34+08:00
[CATEGORIES]
cs.LG
Stable Derivative Free Gaussian Mixture Variational Inference for Bayesian Inverse Problems
[AUTHORS]
Baojun Che, Yifan Chen, Zhenghao Huan, Daniel Zhengyu Huang, Weijie Wang
[ABSTRACT]
This paper is concerned with the approximation of probability distributions
known up to normalization constants, with a focus on Bayesian inference for
large-scale inverse problems in scientific computing. In this context, key
challenges include costly repeated evaluations of forward models,
multimodality, and inaccessible gradients for the forward model. To address
them, we develop a variational inference framework that combines Fisher-Rao
natural gradient with specialized quadrature rules to enable derivative free
updates of Gaussian mixture variational families. The resulting method, termed
Derivative Free Gaussian Mixture Variational Inference (DF-GMVI), guarantees
covariance positivity and affine invariance, offering a stable and efficient
framework for approximating complex posterior distributions. The effectiveness
of DF-GMVI is demonstrated through numerical experiments on challenging
scenarios, including distributions with multiple modes, infinitely many modes,
and curved modes in spaces with up to 100 dimensions. The method’s practicality
is further demonstrated in a large-scale application, where it successfully
recovers the initial conditions of the Navier-Stokes equations from solution
data at positive times.
[COMMENTS]
26 pages, 11 figures
[LINK]
http://arxiv.org/abs/2501.04259v2
[DATE]
2025-05-13 09:08:58+08:00
[CATEGORIES]
cs.LG
Tensor Sketch: Fast and Scalable Polynomial Kernel Approximation
[AUTHORS]
Ninh Pham, Rasmus Pagh
[ABSTRACT]
Approximation of non-linear kernels using random feature maps has become a
powerful technique for scaling kernel methods to large datasets. We propose
Tensor Sketch, an efficient random feature map for approximating
polynomial kernels. Given $n$ training samples in $\mathbb{R}^d$, Tensor Sketch computes
low-dimensional embeddings in $\mathbb{R}^D$ in time $O(n(d + D \log D))$, making it
well-suited for high-dimensional and large-scale settings. We provide
theoretical guarantees on the approximation error, ensuring the fidelity of the
resulting kernel function estimates. We also discuss extensions and highlight
applications where Tensor Sketch serves as a central computational tool.
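To illustrate the kind of feature map the abstract describes, here is a minimal NumPy sketch. It assumes the standard recipe of one Count Sketch per polynomial degree, combined through the FFT so that the product of the transforms corresponds to a circular convolution. The function name `tensor_sketch`, its parameters, and the seeding scheme are illustrative assumptions, not the authors' code.

```python
import numpy as np

def tensor_sketch(X, degree, D, seed=0):
    """Map rows of X (n x d) to D-dimensional features whose inner products
    approximate the homogeneous polynomial kernel (x . y) ** degree."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    prod = np.ones((n, D), dtype=complex)
    for _ in range(degree):
        # One independent Count Sketch per factor: a bucket and a sign per input feature.
        h = rng.integers(0, D, size=d)
        s = rng.choice([-1.0, 1.0], size=d)
        cs = np.zeros((n, D))
        # np.add.at accumulates correctly when several features share a bucket.
        np.add.at(cs, (slice(None), h), X * s)
        # Multiplying in the Fourier domain equals circular convolution of the sketches.
        prod *= np.fft.fft(cs, axis=1)
    return np.real(np.fft.ifft(prod, axis=1))

# Usage: compare exact and approximate degree-2 kernel values on unit vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Z = tensor_sketch(X, degree=2, D=16384)
exact = (X @ X.T) ** 2
approx = Z @ Z.T
```

Each sketching pass touches every input coordinate once and runs one length-$D$ FFT per sample, which is where a per-sample cost of order $d + D \log D$ comes from for a fixed degree.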
[COMMENTS]
Extension of KDD 2013 and correcting the variance bound
[LINK]
http://arxiv.org/abs/2505.08146v1
[DATE]
2025-05-13 08:47:17+08:00
[CATEGORIES]
cs.LG
Multi-Layer Hierarchical Federated Learning with Quantization
[AUTHORS]
Seyed Mohammad Azimi-Abarghouyi, Carlo Fischione
[ABSTRACT]
Almost all existing hierarchical federated learning (FL) models are limited
to two aggregation layers, restricting scalability and flexibility in complex,
large-scale networks. In this work, we propose a Multi-Layer Hierarchical
Federated Learning framework (QMLHFL), which appears to be the first study that
generalizes hierarchical FL to arbitrary numbers of layers and network
architectures through nested aggregation, while employing a layer-specific
quantization scheme to meet communication constraints. We develop a
comprehensive convergence analysis for QMLHFL and derive a general convergence
condition and rate that reveal the effects of key factors, including
quantization parameters, hierarchical architecture, and intra-layer iteration
counts. Furthermore, we determine the optimal number of intra-layer iterations
to maximize the convergence rate while meeting a deadline constraint that
accounts for both communication and computation times. Our results show that
QMLHFL consistently achieves high learning accuracy, even under high data
heterogeneity, and delivers notably improved performance when optimized,
compared to using randomly selected values.
[LINK]
http://arxiv.org/abs/2505.08145v1
[DATE]
2025-05-13 08:47:13+08:00
[CATEGORIES]
cs.LG
Outlier-robust neural network training: variation regularization meets trimmed loss to prevent functional breakdown
[AUTHORS]
Akifumi Okuno, Shotaro Yagishita
[ABSTRACT]
In this study, we tackle the challenge of outlier-robust predictive modeling
using highly expressive neural networks. Our approach integrates two key
components: (1) a transformed trimmed loss (TTL), a computationally efficient
variant of the classical trimmed loss, and (2) higher-order variation
regularization (HOVR), which imposes smoothness constraints on the prediction
function. While traditional robust statistics typically assume low-complexity
models such as linear and kernel models, applying TTL alone to modern neural
networks may fail to ensure robustness, as their high expressive power allows
them to fit both inliers and outliers, even when a robust loss is used. To
address this, we revisit the traditional notion of breakdown point and adapt it
to the nonlinear function setting, introducing a regularization scheme via HOVR
that controls the model’s capacity and suppresses overfitting to outliers. We
theoretically establish that our training procedure retains a high functional
breakdown point, thereby ensuring robustness to outlier contamination. We
develop a stochastic optimization algorithm tailored to this framework and
provide a theoretical guarantee of its convergence.
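For readers unfamiliar with the classical trimmed loss that TTL builds on, the short Python sketch below sums only the `h` smallest per-sample losses, so the largest residuals (presumed outliers) do not contribute. It shows the classical objective only. The paper's transformed variant and the HOVR regularizer are not reproduced, and the function name and squared-error choice are assumptions made for illustration.

```python
def trimmed_loss(residuals, h):
    """Classical trimmed loss: sum of the h smallest squared residuals."""
    losses = sorted(r * r for r in residuals)
    return sum(losses[:h])

# Usage: one gross outlier dominates the plain loss but is ignored once trimmed.
residuals = [0.1, -0.2, 0.1, 10.0]
plain = sum(r * r for r in residuals)
trimmed = trimmed_loss(residuals, h=3)
```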
[COMMENTS]
27 pages, 54 figures
[LINK]
http://arxiv.org/abs/2308.02293v4
[DATE]
2025-05-13 08:35:11+08:00
[CATEGORIES]
cs.LG
Lost in Transmission: When and Why LLMs Fail to Reason Globally
[AUTHORS]
Tobias Schnabel, Kiran Tomlinson, Adith Swaminathan, Jennifer Neville
[ABSTRACT]
Despite their many successes, transformer-based large language models (LLMs)
continue to struggle with tasks that require complex reasoning over large parts
of their input. We argue that these failures arise due to capacity limits on
the accurate flow of information within LLMs. To formalize this issue, we
introduce the bounded attention prefix oracle (BAPO) model, a new computational
framework that models bandwidth constraints on attention heads, the mechanism
for internal communication in LLMs. We show that several important reasoning
problems like graph reachability require high communication bandwidth for BAPOs
to solve; we call these problems BAPO-hard. Our experiments corroborate our
theoretical predictions: GPT-4, Claude, and Gemini succeed on BAPO-easy tasks
and fail even on relatively small BAPO-hard tasks. BAPOs also reveal another
benefit of chain of thought (CoT): we prove that breaking down a task using CoT
can turn any BAPO-hard problem into a BAPO-easy one. Our results offer
principled explanations for key LLM failures and suggest directions for
architectures and inference methods that mitigate bandwidth limits.
[COMMENTS]
28 pages
[LINK]
http://arxiv.org/abs/2505.08140v1
[DATE]
2025-05-13 08:25:23+08:00
[CATEGORIES]
cs.LG
Mirror Mirror on the Wall, Have I Forgotten it All? A New Framework for Evaluating Machine Unlearning
[AUTHORS]
Brennon Brimhall, Philip Mathew, Neil Fendley, Yinzhi Cao, Matthew Green
[ABSTRACT]
Machine unlearning methods take a model trained on a dataset and a forget
set, then attempt to produce a model as if it had only been trained on the
examples not in the forget set. We empirically show that an adversary is able
to distinguish between a mirror model (a control model produced by retraining
without the data to forget) and a model produced by an unlearning method across
representative unlearning methods from the literature. We build distinguishing
algorithms based on evaluation scores in the literature (i.e. membership
inference scores) and Kullback-Leibler divergence.
We propose a strong formal definition for machine unlearning called
computational unlearning. Computational unlearning is defined as the inability
for an adversary to distinguish between a mirror model and a model produced by
an unlearning method. If the adversary cannot guess better than random (except
with negligible probability), then we say that an unlearning method achieves
computational unlearning.
Our computational unlearning definition provides theoretical structure to
prove unlearning feasibility results. For example, our computational unlearning
definition immediately implies that there are no deterministic computational
unlearning methods for entropic learning algorithms. We also explore the
relationship between differential privacy (DP)-based unlearning methods and
computational unlearning, showing that DP-based approaches can satisfy
computational unlearning at the cost of an extreme utility collapse. These
results demonstrate that current methodology in the literature fundamentally
falls short of achieving computational unlearning. We conclude by identifying
several open questions for future work.
[LINK]
http://arxiv.org/abs/2505.08138v1
[DATE]
2025-05-13 08:23:17+08:00
[CATEGORIES]
cs.LG
Learning Optimal Classification Trees Robust to Distribution Shifts
[AUTHORS]
Nathan Justin, Sina Aghaei, Andrés Gómez, Phebe Vayanos
[ABSTRACT]
We consider the problem of learning classification trees that are robust to
distribution shifts between training and testing/deployment data. This problem
arises frequently in high-stakes settings such as public health and social work
where data is often collected using self-reported surveys, which are highly
sensitive to, e.g., the framing of the questions, the time when and place where
the survey is conducted, and the level of comfort the interviewee has in
sharing information with the interviewer. We propose a method for learning
optimal robust classification trees based on mixed-integer robust optimization
technology. In particular, we demonstrate that the problem of learning an
optimal robust tree can be cast as a single-stage mixed-integer robust
optimization problem with a highly nonlinear and discontinuous objective. We
reformulate this problem equivalently as a two-stage linear robust optimization
problem for which we devise a tailored solution procedure based on constraint
generation. We evaluate the performance of our approach on numerous publicly
available datasets, and compare the performance to a regularized, non-robust
optimal tree. We show an increase of up to 12.48% in worst-case accuracy and of
up to 4.85% in average-case accuracy across several datasets and distribution
shifts from using our robust solution in comparison to the non-robust one.
[COMMENTS]
51 pages, 10 figures
[LINK]
http://arxiv.org/abs/2310.17772v2
[DATE]
2025-05-13 08:10:16+08:00
[CATEGORIES]
cs.LG
High-order Regularization for Machine Learning and Learning-based Control
[AUTHORS]
Xinghua Liu, Ming Cao
[ABSTRACT]
The paper proposes a novel regularization procedure for machine learning. The
proposed high-order regularization (HR) provides new insight into
regularization, which is widely used to train a neural network that can be
utilized to approximate the action-value function in general reinforcement
learning problems. The proposed HR method ensures the provable convergence of
the approximation algorithm, which makes the much-needed connection between
regularization and explainable learning using neural networks. The proposed HR
method theoretically demonstrates that regularization can be regarded as an
approximation in terms of inverse mapping with explicitly calculable
approximation error, and the $L_2$ regularization is a lower-order case of the
proposed method. We provide lower and upper bounds for the error of the
proposed HR solution, which helps build a reliable model. We also find that
regularization with the proposed HR can be regarded as a contraction. We prove
that the generalizability of neural networks can be maximized with a proper
regularization matrix, and the proposed HR is applicable for neural networks
with any mapping matrix. With the theoretical explanation of the extreme
learning machine for neural network training and the proposed high-order
regularization, one can better interpret the output of the neural network, thus
leading to explainable learning. We present a case study based on regularized
extreme learning neural networks to demonstrate the application of the proposed
HR and give the corresponding incremental HR solution. We verify the
performance of the proposed HR method by solving a classic control problem in
reinforcement learning. The result demonstrates the superior performance of the
method with significant enhancement in the generalizability of the neural
network.
[LINK]
http://arxiv.org/abs/2505.08129v1
[DATE]
2025-05-13 08:00:23+08:00
[CATEGORIES]
cs.LG
Beyond Basic A/B testing: Improving Statistical Efficiency for Business Growth
[AUTHORS]
Changshuai Wei, Phuc Nguyen, Benjamin Zelditch, Joyce Chen
[ABSTRACT]
The standard A/B testing approaches are mostly based on the t-test in large-scale
industry applications. These standard approaches, however, suffer from low
statistical power in business settings, due to small sample sizes,
non-Gaussian distributions, or return-on-investment (ROI) considerations. In this
paper, we propose several approaches to address these challenges: (i)
regression adjustment, generalized estimating equations, Mann-Whitney U, and
Zero-Trimmed U, which address each of these issues separately, and (ii) a novel
doubly robust generalized U that handles ROI consideration, distribution
robustness and small samples in one framework. We provide theoretical results
on asymptotic normality and efficiency bounds, together with insights on the
efficiency gain from theoretical analysis. We further conduct comprehensive
simulation studies and apply the methods to multiple real A/B tests.
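As a concrete reference for one of the rank-based alternatives named above, the following standard-library Python sketch computes the Mann-Whitney U statistic and a two-sided p-value from the usual normal approximation, without a tie correction. The function names are illustrative assumptions. The Zero-Trimmed U and doubly robust generalized U proposed in the paper are not shown.

```python
import math

def mann_whitney_u(a, b):
    """U statistic for sample a: pairs (x in a, y in b) with x > y, ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def normal_approx_p(u, n_a, n_b):
    """Two-sided p-value from the large-sample normal approximation (no tie correction)."""
    mean = n_a * n_b / 2.0
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))

# Usage: compare a control and a treatment sample of a skewed metric.
control = [0.0, 0.0, 1.2, 0.4, 0.0, 3.1]
treatment = [0.0, 2.5, 1.9, 0.7, 4.0, 5.2]
u = mann_whitney_u(treatment, control)
p = normal_approx_p(u, len(treatment), len(control))
```

Because the statistic depends only on the ordering of observations, it is insensitive to heavy tails, which is one reason rank-based tests are considered when the metric is far from Gaussian.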
[LINK]
http://arxiv.org/abs/2505.08128v1
[DATE]
2025-05-13 08:00:06+08:00
[CATEGORIES]
cs.LG
Self Rewarding Self Improving
[AUTHORS]
Toby Simonds, Kevin Lopez, Akira Yoshiyama, Dominique Garmier
[ABSTRACT]
We demonstrate that large language models can effectively self-improve
through self-judging without requiring reference solutions, leveraging the
inherent asymmetry between generating and verifying solutions. Our experiments
on Countdown puzzles and MIT Integration Bee problems show that models can
provide reliable reward signals without ground truth answers, enabling
reinforcement learning in domains previously not possible. By implementing
self-judging, we achieve significant performance gains while maintaining alignment
with formal verification. When combined with synthetic question generation, we
establish a complete self-improvement loop where models generate practice
problems, solve them, and evaluate their own performance, achieving an 8%
improvement with Qwen 2.5 7B over baseline and surpassing GPT-4o performance on
integration tasks. Our findings demonstrate that LLM judges can provide
effective reward signals for training models, unlocking many reinforcement
learning environments previously limited by the difficulty of creating
programmatic rewards. This suggests a potential paradigm shift toward AI
systems that continuously improve through self-directed learning rather than
human-guided training, potentially accelerating progress in domains with scarce
training data or complex evaluation requirements.
[LINK]
http://arxiv.org/abs/2505.08827v1
[DATE]
2025-05-13 07:51:04+08:00
[CATEGORIES]
cs.LG
Sharp Gaussian approximations for Decentralized Federated Learning
[AUTHORS]
Soham Bonnerjee, Sayar Karmakar, Wei Biao Wu
[ABSTRACT]
Federated Learning has gained traction in privacy-sensitive collaborative
environments, with local SGD emerging as a key optimization method in
decentralized settings. While its convergence properties are well-studied,
asymptotic statistical guarantees beyond convergence remain limited. In this
paper, we present two generalized Gaussian approximation results for local SGD
and explore their implications. First, we prove a Berry-Esseen theorem for the
final local SGD iterates, enabling valid multiplier bootstrap procedures.
Second, motivated by robustness considerations, we introduce two distinct
time-uniform Gaussian approximations for the entire trajectory of local SGD.
The time-uniform approximations support Gaussian bootstrap-based tests for
detecting adversarial attacks. Extensive simulations are provided to support
our theoretical results.
[LINK]
http://arxiv.org/abs/2505.08125v1
[DATE]
2025-05-13 07:40:13+08:00
[CATEGORIES]
cs.LG
Topology-Guided Knowledge Distillation for Efficient Point Cloud Processing
[AUTHORS]
Luu Tung Hai, Thinh D. Le, Zhicheng Ding, Qing Tian, Truong-Son Hy
[ABSTRACT]
Point cloud processing has gained significant attention due to its critical
role in applications such as autonomous driving and 3D object recognition.
However, deploying high-performance models like Point Transformer V3 in
resource-constrained environments remains challenging due to their high
computational and memory demands. This work introduces a novel distillation
framework that leverages topology-aware representations and gradient-guided
knowledge distillation to effectively transfer knowledge from a high-capacity
teacher to a lightweight student model. Our approach captures the underlying
geometric structures of point clouds while selectively guiding the student
model’s learning process through gradient-based feature alignment. Experimental
results on the NuScenes, SemanticKITTI, and Waymo datasets demonstrate that the
proposed method achieves competitive performance, with an approximately 16x
reduction in model size and a nearly 1.9x decrease in inference time compared
to its teacher model. Notably, on NuScenes, our method achieves
state-of-the-art performance among knowledge distillation techniques trained
solely on LiDAR data, surpassing prior knowledge distillation baselines in
segmentation performance. Our implementation is available publicly at:
https://github.com/HySonLab/PointDistill
[LINK]
http://arxiv.org/abs/2505.08101v1
[DATE]
2025-05-13 06:15:54+08:00
[CATEGORIES]
cs.LG
Fused3S: Fast Sparse Attention on Tensor Cores
[AUTHORS]
Zitong Li, Aparna Chandramowlishwaran
[ABSTRACT]
Sparse attention is a core building block in many leading neural network
models, from graph-structured learning to sparse sequence modeling. It can be
decomposed into a sequence of three sparse matrix operations (3S): sampled
dense-dense matrix multiplication (SDDMM), softmax normalization, and sparse
matrix multiplication (SpMM). Efficiently executing the 3S computational
pattern on modern GPUs remains challenging due to (a) the mismatch between
unstructured sparsity and tensor cores optimized for dense operations, and (b)
the high cost of data movement. Previous works have optimized these sparse
operations individually or addressed one of these challenges. This paper
introduces Fused3S, the first fused 3S algorithm that jointly maximizes tensor
core utilization and minimizes data movement. Across real-world graph datasets,
Fused3S achieves $1.6$-$16.3\times$ and $1.5$-$14\times$ speedup over
state-of-the-art on H100 and A30 GPUs. Furthermore, integrating Fused3S into
Graph Transformer inference accelerates end-to-end performance by
$1.05$-$5.36\times$, consistently outperforming all 3S baselines across diverse
datasets (single and batched graphs) and GPU architectures.
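The 3S pattern named in the abstract can be written down as a plain reference computation. The pure-Python sketch below walks a CSR sparsity pattern row by row and performs the sampled dot products (SDDMM), a softmax over each row's nonzeros, and the weighted sum of value rows (SpMM). It only shows what is being computed and says nothing about how the Fused3S GPU kernel is organized. The function name, the CSR arguments `indptr` and `indices`, and the absence of any score scaling are assumptions made for illustration.

```python
import math

def sparse_attention_3s(Q, K, V, indptr, indices):
    """Reference 3S computation over a CSR pattern (row i attends to indices[indptr[i]:indptr[i+1]])."""
    n, d_v = len(Q), len(V[0])
    out = [[0.0] * d_v for _ in range(n)]
    for i in range(n):
        cols = indices[indptr[i]:indptr[i + 1]]
        if not cols:
            continue  # a row with no nonzeros keeps a zero output
        # 1) SDDMM: dot products only at the sampled positions of row i.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) for j in cols]
        # 2) Softmax over the nonzeros of the row, shifted by the max for stability.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # 3) SpMM: weighted sum of the selected value rows.
        for j, e in zip(cols, exps):
            w = e / total
            for c in range(d_v):
                out[i][c] += w * V[j][c]
    return out

# Usage: row 0 attends to keys 0 and 1, row 1 attends to key 1 only.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = sparse_attention_3s(Q, K, V, indptr=[0, 2, 3], indices=[0, 1, 1])
```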
[LINK]
http://arxiv.org/abs/2505.08098v1
[DATE]
2025-05-13 06:09:05+08:00
[CATEGORIES]
cs.LG
Manifold Learning with Normalizing Flows: Towards Regularity, Expressivity and Iso-Riemannian Geometry
[AUTHORS]
Willem Diepeveen, Deanna Needell
[ABSTRACT]
Modern machine learning increasingly leverages the insight that
high-dimensional data often lie near low-dimensional, non-linear manifolds, an
idea known as the manifold hypothesis. By explicitly modeling the geometric
structure of data through learning Riemannian geometry, algorithms can achieve
improved performance and interpretability in tasks like clustering,
dimensionality reduction, and interpolation. In particular, learned pullback
geometry has recently undergone transformative developments that now make it
scalable to learn and scalable to evaluate, which further opens the door for
principled non-linear data analysis and interpretable machine learning.
However, there are still steps to be taken when considering real-world
multi-modal data. This work focuses on addressing distortions and modeling
errors that can arise in the multi-modal setting and proposes to alleviate both
challenges through isometrizing the learned Riemannian structure and balancing
regularity and expressivity of the diffeomorphism parametrization. We showcase
the effectiveness of the synergy of the proposed approaches in several
numerical experiments with both synthetic and real data.
[LINK]
http://arxiv.org/abs/2505.08087v1
[DATE]
2025-05-13 05:44:42+08:00
[CATEGORIES]
cs.LG
A Federated Random Forest Solution for Secure Distributed Machine Learning
[AUTHORS]
Alexandre Cotorobai, Jorge Miguel Silva, Jose Luis Oliveira
[ABSTRACT]
Privacy and regulatory barriers often hinder centralized machine learning
solutions, particularly in sectors like healthcare where data cannot be freely
shared. Federated learning has emerged as a powerful paradigm to address these
concerns; however, existing frameworks primarily support gradient-based models,
leaving a gap for more interpretable, tree-based approaches. This paper
introduces a federated learning framework for Random Forest classifiers that
preserves data privacy and provides robust performance in distributed settings.
By leveraging PySyft for secure, privacy-aware computation, our method enables
multiple institutions to collaboratively train Random Forest models on locally
stored data without exposing sensitive information. The framework supports
weighted model averaging to account for varying data distributions, incremental
learning to progressively refine models, and local evaluation to assess
performance across heterogeneous datasets. Experiments on two real-world
healthcare benchmarks demonstrate that the federated approach maintains
competitive predictive accuracy - within a maximum 9% margin of centralized
methods - while satisfying stringent privacy requirements. These findings
underscore the viability of tree-based federated learning for scenarios where
data cannot be centralized due to regulatory, competitive, or technical
constraints. The proposed solution addresses a notable gap in existing
federated learning libraries, offering an adaptable tool for secure distributed
machine learning tasks that demand both transparency and reliable performance.
The tool is available at https://github.com/ieeta-pt/fed_rf.
[LINK]
http://arxiv.org/abs/2505.08085v1
[DATE]
2025-05-13 05:40:35+08:00
[CATEGORIES]
cs.LG
Fréchet Power-Scenario Distance: A Metric for Evaluating Generative AI Models across Multiple Time-Scales in Smart Grids
[AUTHORS]
Yuting Cai, Shaohuai Liu, Chao Tian, Le Xie
[ABSTRACT]
Generative artificial intelligence (AI) models in smart grids have advanced
significantly in recent years due to their ability to generate large amounts of
synthetic data, which would otherwise be difficult to obtain in the real world
due to confidentiality constraints. A key challenge in utilizing such synthetic
data is how to assess the data quality produced from such generative models.
Traditional Euclidean distance-based metrics only reflect pair-wise relations
between two individual samples, and could fail in evaluating quality
differences between groups of synthetic datasets. In this work, we propose a
novel metric based on the Fréchet Distance (FD) estimated between two
datasets in a learned feature space. The proposed method evaluates the quality
of generation from a distributional perspective. Empirical results demonstrate
the superiority of the proposed metric across timescales and models, enhancing
the reliability of data-driven decision-making in smart grid operations.
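As a rough illustration of a Fréchet distance between two datasets in a feature space, the sketch below fits a Gaussian to each feature set and evaluates the closed-form distance between the two Gaussians. The Gaussian assumption, the function names, and the use of raw arrays as "features" are assumptions made for illustration; the abstract does not specify the feature extractor or the estimator used.

```python
import numpy as np


def _sqrtm_psd(mat):
    """Square root of a symmetric positive semi-definite matrix."""
    mat = (mat + mat.T) / 2.0  # remove tiny numerical asymmetry
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim).
    Assumes (hypothetically) that features are roughly Gaussian.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^{1/2}) via the symmetric form cov_a^{1/2} cov_b cov_a^{1/2}.
    sqrt_a = _sqrtm_psd(cov_a)
    cross = _sqrtm_psd(sqrt_a @ cov_b @ sqrt_a)
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * np.trace(cross))
```

Because the distance compares means and covariances of whole sets, it reflects group-level differences that pairwise Euclidean comparisons between individual samples do not capture.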
[LINK]
http://arxiv.org/abs/2505.08082v1
[DATE]
2025-05-13 05:32:23+08:00
[CATEGORIES]
cs.LG
LSHBloom: Memory-efficient, Extreme-scale Document Deduplication
[AUTHORS]
Arham Khan, Robert Underwood, Carlo Siebenschuh, Yadu Babuji, Aswathy Ajith, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Kyle Chard, Ian Foster
[ABSTRACT]
Deduplication, i.e., detecting and eliminating additional instances of the
same content in large collections of technical documents, is a major focus
when assembling and curating training datasets for large language models (LLMs).
Unrestrained, duplicates in the training dataset increase training costs and
lead to undesirable properties such as memorization in trained models or
cheating on evaluation. Contemporary approaches to document-level deduplication
are often extremely expensive in both runtime and memory. We propose LSHBloom,
an extension to MinhashLSH, which replaces the expensive LSHIndex with
lightweight Bloom filters. LSHBloom demonstrates the same deduplication
performance as MinhashLSH with only a marginal increase in false positives (as
low as 1e-5 in our experiments); demonstrates competitive runtime (270% faster
than MinhashLSH on peS2o); and, crucially, uses just 0.6% of the disk space
required by MinhashLSH to deduplicate peS2o. We demonstrate that this space
advantage scales with increased dataset size – at the extreme scale of several
billion documents, LSHBloom promises a 250% speedup and a 54$\times$ space
advantage over traditional MinHashLSH, scaling deduplication of text datasets to
many billions of documents.
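The idea of replacing an LSH index with Bloom filters can be sketched as follows: compute a MinHash signature, split it into bands, and keep one Bloom filter per band; a document is flagged as a duplicate if any of its band keys is already present. This is a minimal sketch under stated assumptions, not the authors' implementation: the class names, shingle size, band count, and filter size are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over byte strings (sizes chosen for illustration)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def shingles(text, k=5):
    """Set of character k-grams of the whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(shingle_set, num_perm=32):
    """MinHash signature: one minimum per seeded hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for s in shingle_set))
    return signature


class BloomLSHDeduplicator:
    """One Bloom filter per LSH band instead of a bucketed index."""

    def __init__(self, num_perm=32, bands=8):
        assert num_perm % bands == 0
        self.num_perm = num_perm
        self.rows = num_perm // bands
        self.filters = [BloomFilter() for _ in range(bands)]

    def check_and_add(self, text):
        sig = minhash(shingles(text), self.num_perm)
        keys = [repr(sig[b * self.rows:(b + 1) * self.rows]).encode()
                for b in range(len(self.filters))]
        duplicate = any(key in f for key, f in zip(keys, self.filters))
        if not duplicate:  # only unseen documents are recorded
            for key, f in zip(keys, self.filters):
                f.add(key)
        return duplicate
```

The filters store only bits rather than document identifiers, which is where the space saving would come from; the trade-off is a small false-positive rate and no ability to list which earlier document matched.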
[LINK]
http://arxiv.org/abs/2411.04257v2
[DATE]
2025-05-13 05:11:37+08:00
[CATEGORIES]
cs.LG
On Unbiased Low-Rank Approximation with Minimum Distortion
[AUTHORS]
Leighton Pate Barnes, Stephen Cameron, Benjamin Howard
[ABSTRACT]
We describe an algorithm for sampling a low-rank random matrix $Q$ that best
approximates a fixed target matrix $P\in\mathbb{C}^{n\times m}$ in the
following sense: $Q$ is unbiased, i.e., $\mathbb{E}[Q] = P$;
$\mathsf{rank}(Q)\leq r$; and $Q$ minimizes the expected Frobenius norm error
$\mathbb{E}\|P-Q\|_F^2$. Our algorithm mirrors the solution to the efficient
unbiased sparsification problem for vectors, except applied to the singular
components of the matrix $P$. Optimality is proven by showing that our
algorithm matches the error from an existing lower bound.
[LINK]
http://arxiv.org/abs/2505.09647v1
[DATE]
2025-05-13 04:52:28+08:00
[CATEGORIES]
cs.LG
The Geography of Transportation Cybersecurity: Visitor Flows, Industry Clusters, and Spatial Dynamics
[AUTHORS]
Yuhao Wang, Kailai Wang, Songhua Hu, Yunpeng Zhang, Gino Lim, Pengyu Zhu
[ABSTRACT]
The rapid evolution of the transportation cybersecurity ecosystem,
encompassing cybersecurity, automotive, and transportation and logistics
sectors, will lead to the formation of distinct spatial clusters and visitor
flow patterns across the US. This study examines the spatiotemporal dynamics of
visitor flows, analyzing how socioeconomic factors shape industry clustering
and workforce distribution within these evolving sectors. To model and predict
visitor flow patterns, we develop a BiTransGCN framework, integrating an
attention-based Transformer architecture with a Graph Convolutional Network
backbone. By integrating AI-enabled forecasting techniques with spatial
analysis, this study improves our ability to track, interpret, and anticipate
changes in industry clustering and mobility trends, thereby supporting
strategic planning for a secure and resilient transportation network. It offers
a data-driven foundation for economic planning, workforce development, and
targeted investments in the transportation cybersecurity ecosystem.
[LINK]
http://arxiv.org/abs/2505.08822v1
[DATE]
2025-05-13 04:44:02+08:00
[CATEGORIES]
cs.LG
LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities
[AUTHORS]
Kalyan Nakka, Jimmy Dani, Ausmit Mondal, Nitesh Saxena
[ABSTRACT]
The growing adoption of Large Language Models (LLMs) has influenced the
development of their lighter counterparts, Small Language Models (SLMs), to
enable on-device deployment across smartphones and edge devices. These SLMs
offer enhanced privacy, reduced latency, server-free functionality, and
improved user experience. However, due to the resource constraints of on-device
environments, SLMs undergo size optimization through compression techniques like
quantization, which can inadvertently introduce fairness, ethical and privacy
risks. Critically, quantized SLMs may respond to harmful queries directly,
without requiring adversarial manipulation, raising significant safety and
trust concerns.
To address this, we propose LiteLMGuard (LLMG), an on-device prompt guard
that provides real-time, prompt-level defense for quantized SLMs. Additionally,
our prompt guard is designed to be model-agnostic such that it can be
seamlessly integrated with any SLM, operating independently of underlying
architectures. Our LLMG formalizes prompt filtering as a deep learning
(DL)-based prompt answerability classification task, leveraging semantic
understanding to determine whether a query should be answered by any SLM. Using
our curated dataset, Answerable-or-Not, we trained and fine-tuned several DL
models and selected ELECTRA as the candidate, with 97.75% answerability
classification accuracy.
Our safety effectiveness evaluations demonstrate that LLMG defends against
over 87% of harmful prompts, including both direct instruction and jailbreak
attack strategies. We further showcase its ability to mitigate the Open
Knowledge Attacks, where compromised SLMs provide unsafe responses without
adversarial prompting. In terms of prompt filtering effectiveness, LLMG
achieves near state-of-the-art filtering accuracy of 94%, with an average
latency of 135 ms, incurring negligible overhead for users.
[COMMENTS]
14 pages, 18 figures, and 4 tables
[LINK]
http://arxiv.org/abs/2505.05619v2
[DATE]
2025-05-13 04:32:53+08:00
[CATEGORIES]
cs.LG
Mobile Jamming Mitigation in 5G Networks: A MUSIC-Based Adaptive Beamforming Approach
[AUTHORS]
Olivia Holguin, Rachel Donati, Seyed bagher Hashemi Natanzi, Bo Tang
[ABSTRACT]
Mobile jammers pose a critical threat to 5G networks, particularly in
military communications. We propose an intelligent anti-jamming framework that
integrates Multiple Signal Classification (MUSIC) for high-resolution
Direction-of-Arrival (DoA) estimation, Minimum Variance Distortionless Response
(MVDR) beamforming for adaptive interference suppression, and machine learning
(ML) to enhance DoA prediction for mobile jammers. Extensive simulations in a
realistic highway scenario demonstrate that our hybrid approach achieves an
average Signal-to-Noise Ratio (SNR) improvement of 9.58 dB (maximum 11.08 dB)
and up to 99.8% DoA estimation accuracy. The framework’s computational
efficiency and adaptability to dynamic jammer mobility patterns outperform
conventional anti-jamming techniques, making it a robust solution for securing
5G communications in contested environments.
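The MVDR beamforming step can be illustrated with the standard closed-form weights $w = R^{-1}a / (a^H R^{-1} a)$, which keep unit gain toward the desired direction while minimizing output power. The sketch below assumes a uniform linear array with half-wavelength spacing and a known covariance matrix; the array geometry, function names, and jammer power are illustrative assumptions, and the MUSIC and ML stages of the framework are not shown.

```python
import numpy as np


def steering_vector(theta_deg, num_antennas, spacing=0.5):
    """Steering vector of a uniform linear array (spacing in wavelengths)."""
    n = np.arange(num_antennas)
    phase = 2.0 * np.pi * spacing * n * np.sin(np.deg2rad(theta_deg))
    return np.exp(-1j * phase)


def mvdr_weights(cov, steer):
    """MVDR weights: minimize w^H R w subject to w^H a = 1."""
    r_inv_a = np.linalg.solve(cov, steer)  # R^{-1} a without explicit inverse
    return r_inv_a / (steer.conj() @ r_inv_a)


# Hypothetical scenario: desired user at 0 degrees, strong jammer at 40 degrees.
a_desired = steering_vector(0.0, 8)
a_jammer = steering_vector(40.0, 8)
cov = np.eye(8) + 1000.0 * np.outer(a_jammer, a_jammer.conj())
w = mvdr_weights(cov, a_desired)
gain_desired = abs(np.vdot(w, a_desired))  # magnitude of w^H a_desired
gain_jammer = abs(np.vdot(w, a_jammer))    # magnitude of w^H a_jammer
```

In a mobile-jammer setting the jammer direction changes over time, so the covariance (and hence the weights) would need to be re-estimated as new DoA estimates arrive.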
[LINK]
http://arxiv.org/abs/2505.08046v1
[DATE]
2025-05-13 04:31:31+08:00
[CATEGORIES]
cs.LG
Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge
[AUTHORS]
Nikolaos Dionelis, Nicolas Longépé, Alessandra Feliciotti, Mattia Marconcini, Devis Peressutti, Nika Oman Kadunc, JaeWan Park, Hagai Raja Sinulingga, Steve Andreas Immanuel, Ba Tran, Caroline Arnold
[ABSTRACT]
Estimating the construction year of buildings is of great importance for
sustainability. Sustainable buildings minimize energy consumption and are a key
part of responsible and sustainable urban planning and development to
effectively combat climate change. By using Artificial Intelligence (AI) and
recently proposed powerful Transformer models, we are able to estimate the
construction epoch of buildings from a multi-modal dataset. In this paper, we
introduce a new benchmark multi-modal dataset, i.e. the Map your City Dataset
(MyCD), containing top-view Very High Resolution (VHR) images, Earth
Observation (EO) multi-spectral data from the Copernicus Sentinel-2 satellite
constellation, and street-view images in many different cities in Europe that
are co-localized with respect to the building under study and labelled with the
construction epoch. We assess EO generalization performance on new, previously
unseen cities that have been held-out from training and appear only during
inference. In this work, we present the community-based data challenge we
organized based on MyCD. The AI4EO Challenge ESA MapYourCity was opened in 2024
for 4 months. In this paper, we present the Top-4 performing models of the
challenge, and the evaluation results. During inference, we examine the
performance of the models using i) all three input modalities, and ii) only the
two top-view modalities, i.e. without the street-view ground images. The
evaluation results show that the models are effective at estimating the
construction year of buildings and can achieve good performance on this
difficult and important real-world task, even when inference is on previously
unseen cities and even when using only the two top-view modalities (i.e. VHR
and Sentinel-2) during inference.
[COMMENTS]
13 pages, 22 figures, Submitted
[LINK]
http://arxiv.org/abs/2502.13818v2
[DATE]
2025-05-13 04:16:09+08:00
[CATEGORIES]
cs.LG
Demo: A Practical Testbed for Decentralized Federated Learning on Physical Edge Devices
[AUTHORS]
Chao Feng, Nicolas Huber, Alberto Huertas Celdran, Gerome Bovet, Burkhard Stiller
[ABSTRACT]
Federated Learning (FL) enables collaborative model training without sharing
raw data, preserving participant privacy. Decentralized FL (DFL) eliminates
reliance on a central server, mitigating the single point of failure inherent
in the traditional FL paradigm, while introducing deployment challenges on
resource-constrained devices. To evaluate real-world applicability, this work
designs and deploys a physical testbed using edge devices such as Raspberry Pi
and Jetson Nano. The testbed is built upon a DFL training platform, NEBULA, and
extends it with a power monitoring module to measure energy consumption during
training. Experiments across multiple datasets show that model performance is
influenced by the communication topology, with denser topologies leading to
better outcomes in DFL settings.
[LINK]
http://arxiv.org/abs/2505.08033v1
[DATE]
2025-05-13 04:00:45+08:00
[CATEGORIES]
cs.LG
Dynamical Low-Rank Compression of Neural Networks with Robustness under Adversarial Attacks
[AUTHORS]
Steffen Schotthöfer, H. Lexie Yang, Stefan Schnake
[ABSTRACT]
Deployment of neural networks on resource-constrained devices demands models
that are both compact and robust to adversarial inputs. However, compression
and adversarial robustness often conflict. In this work, we introduce a
dynamical low-rank training scheme enhanced with a novel spectral regularizer
that controls the condition number of the low-rank core in each layer. This
approach mitigates the sensitivity of compressed models to adversarial
perturbations without sacrificing clean accuracy. The method is model- and
data-agnostic, computationally efficient, and supports rank adaptivity to
automatically compress the network at hand. Extensive experiments across
standard architectures, datasets, and adversarial attacks show the regularized
networks can achieve over 94% compression while recovering or improving
adversarial accuracy relative to uncompressed baselines.
[LINK]
http://arxiv.org/abs/2505.08022v1
[DATE]
2025-05-13 03:46:29+08:00
[CATEGORIES]
cs.LG
Multi-Task Dynamic Pricing in Credit Market with Contextual Information
[AUTHORS]
Adel Javanmard, Jingwei Ji, Renyuan Xu
[ABSTRACT]
We study the dynamic pricing problem faced by a broker seeking to learn
prices for a large number of credit market securities, such as corporate bonds,
government bonds, loans, and other credit-related securities. A major challenge
in pricing these securities stems from their infrequent trading and the lack of
transparency in over-the-counter (OTC) markets, which leads to insufficient
data for individual pricing. Nevertheless, many securities share structural
similarities that can be exploited. Moreover, brokers often place small
“probing” orders to infer competitors’ pricing behavior. Building on these
insights, we propose a multi-task dynamic pricing framework that exploits the
shared structure across securities to enhance pricing accuracy.
In the OTC market, a broker wins a quote by offering a more competitive price
than rivals. The broker’s goal is to learn winning prices while minimizing
expected regret against a clairvoyant benchmark. We model each security using a
$d$-dimensional feature vector and assume a linear contextual model for the
competitor’s pricing of the yield, with parameters unknown a priori. We propose
the Two-Stage Multi-Task (TSMT) algorithm: first, an unregularized MLE over
pooled data to obtain a coarse parameter estimate; second, a regularized MLE on
individual securities to refine the parameters. We show that the TSMT achieves
a regret bounded by $\tilde{O} ( \delta_{\max} \sqrt{T M d} + M d ) $,
outperforming both fully individual and fully pooled baselines, where $M$ is
the number of securities and $\delta_{\max}$ quantifies their heterogeneity.
[LINK]
http://arxiv.org/abs/2410.14839v3
[DATE]
2025-05-13 03:45:00+08:00
[CATEGORIES]
cs.LG
Thoughts on Objectives of Sparse and Hierarchical Masked Image Model
[AUTHORS]
Asahi Miyazaki, Tsuyoshi Okita
[ABSTRACT]
Masked image modeling is one of the most popular training objectives.
Recently, the SparK model has been proposed with superior performance among
self-supervised learning models. This paper proposes a new mask pattern for
the SparK model, yielding what we call the Mesh Mask-ed SparK model. We report
the effect of the mask pattern used for image masking in pre-training on
performance.
[COMMENTS]
9 pages, 11 figures
[LINK]
http://arxiv.org/abs/2505.08819v1
[DATE]
2025-05-13 02:40:46+08:00
[CATEGORIES]
cs.LG
FLOWR: Flow Matching for Structure-Aware De Novo, Interaction- and Fragment-Based Ligand Generation
[AUTHORS]
Julian Cremer, Ross Irwin, Alessandro Tibo, Jon Paul Janet, Simon Olsson, Djork-Arné Clevert
[ABSTRACT]
We introduce FLOWR, a novel structure-based framework for the generation and
optimization of three-dimensional ligands. FLOWR integrates continuous and
categorical flow matching with equivariant optimal transport, enhanced by an
efficient protein pocket conditioning. Alongside FLOWR, we present SPINDR, a
thoroughly curated dataset comprising ligand-pocket co-crystal complexes
specifically designed to address existing data quality issues. Empirical
evaluations demonstrate that FLOWR surpasses current state-of-the-art
diffusion- and flow-based methods in terms of PoseBusters-validity, pose
accuracy, and interaction recovery, while offering a significant inference
speedup, achieving up to 70-fold faster performance. In addition, we introduce
FLOWR:multi, a highly accurate multi-purpose model allowing for the targeted
sampling of novel ligands that adhere to predefined interaction profiles and
chemical substructures for fragment-based design without the need for
re-training or any re-sampling strategies.
[LINK]
http://arxiv.org/abs/2504.10564v2
[DATE]
2025-05-13 02:36:32+08:00
[CATEGORIES]
cs.LG
ALinFiK: Learning to Approximate Linearized Future Influence Kernel for Scalable Third-Party LLM Data Valuation
[AUTHORS]
Yanzhou Pan, Huawei Lin, Yide Ran, Jiamin Chen, Xiaodong Yu, Weijie Zhao, Denghui Zhang, Zhaozhuo Xu
[COMMENTS]
Proceedings of the NAACL 2025. Keywords: Influence Function, Data
Valuation, Influence Estimation.
https://aclanthology.org/2025.naacl-long.589/
[LINK]
http://arxiv.org/abs/2503.01052v2
[DATE]
2025-05-13 02:28:48+08:00
[CATEGORIES]
cs.LG
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
[AUTHORS]
Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yangmin Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, Ion Stoica, Harry Xu, Ying Sheng
[ABSTRACT]
Serving large language models (LLMs) is expensive, especially for providers
hosting many models, making cost reduction essential. The unique workload
patterns of serving multiple LLMs (i.e., multi-LLM serving) create new
opportunities and challenges for this task. The long-tail popularity of models
and their long idle periods present opportunities to improve utilization
through GPU sharing. However, existing GPU sharing systems lack the ability to
adjust their resource allocation and sharing policies at runtime, making them
ineffective at meeting latency service-level objectives (SLOs) under rapidly
fluctuating workloads.
This paper presents Prism, a multi-LLM serving system that unleashes the full
potential of GPU sharing to achieve both cost efficiency and SLO attainment. At
its core, Prism tackles a key limitation of existing systems: the lack of
cross-model memory coordination,
which is essential for flexibly sharing GPU memory across models under dynamic
workloads. Prism achieves this with two key designs. First, it supports
on-demand memory allocation by dynamically mapping physical to virtual memory
pages, allowing flexible memory redistribution among models that space- and
time-share a GPU. Second, it improves memory efficiency through a two-level
scheduling policy that dynamically adjusts sharing strategies based on models’
runtime demands. Evaluations on real-world traces show that Prism achieves more
than $2\times$ cost savings and $3.3\times$ SLO attainment compared to
state-of-the-art systems.
[LINK]
http://arxiv.org/abs/2505.04021v2
[DATE]
2025-05-13 02:19:46+08:00
[CATEGORIES]
cs.LG
Wasserstein Distributionally Robust Nonparametric Regression
[AUTHORS]
Changyu Liu, Yuling Jiao, Junhui Wang, Jian Huang
[ABSTRACT]
Distributionally robust optimization has become a powerful tool for
prediction and decision-making under model uncertainty. By focusing on the
local worst-case risk, it enhances robustness by identifying the most
unfavorable distribution within a predefined ambiguity set. While extensive
research has been conducted in parametric settings, studies on nonparametric
frameworks remain limited. This paper studies the generalization properties of
Wasserstein distributionally robust nonparametric estimators, with particular
attention to the impact of model misspecification, where non-negligible
discrepancies between the estimation function space and target function can
impair generalization performance. We establish non-asymptotic error bounds for
the excess local worst-case risk by analyzing the regularization effects
induced by distributional perturbations and employing feedforward neural
networks with Lipschitz constraints. These bounds illustrate how uncertainty
levels and neural network structures influence generalization performance and
are applicable to both Lipschitz and quadratic loss functions. Furthermore, we
investigate the Lagrangian relaxation of the local worst-case risk and derive
corresponding non-asymptotic error bounds for these estimators. The robustness
of the proposed estimator is evaluated through simulation studies and
illustrated with an application to the MNIST dataset.
[COMMENTS]
50 pages
[LINK]
http://arxiv.org/abs/2505.07967v1
[DATE]
2025-05-13 02:07:37+08:00
[CATEGORIES]
cs.LG
WATCH: Adaptive Monitoring for AI Deployments via Weighted-Conformal Martingales
[AUTHORS]
Drew Prinster, Xing Han, Anqi Liu, Suchi Saria
[ABSTRACT]
Responsibly deploying artificial intelligence (AI) / machine learning (ML)
systems in high-stakes settings arguably requires not only proof of system
reliability, but moreover continual, post-deployment monitoring to quickly
detect and address any unsafe behavior. Statistical methods for nonparametric
change-point detection – especially the tools of conformal test martingales
(CTMs) and anytime-valid inference – offer promising approaches to this
monitoring task. However, existing methods are restricted to monitoring limited
hypothesis classes or “alarm criteria” (such as data shifts that violate
certain exchangeability assumptions), do not allow for online adaptation in
response to shifts, and/or do not enable root-cause analysis of any
degradation. In this paper, we expand the scope of these monitoring methods by
proposing a weighted generalization of conformal test martingales (WCTMs),
which lay a theoretical foundation for online monitoring for any unexpected
changepoints in the data distribution while controlling false-alarms. For
practical applications, we propose specific WCTM algorithms that adapt online
to mild covariate shifts (in the marginal input distribution) while quickly
detecting and diagnosing more severe shifts, such as concept shifts (in the
conditional label distribution) or extreme (out-of-support) covariate shifts
that cannot be easily adapted to. On real-world datasets, we demonstrate
improved performance relative to state-of-the-art baselines.
[COMMENTS]
To be published in The International Conference on Machine Learning
(ICML), 2025
[LINK]
http://arxiv.org/abs/2505.04608v2
[DATE]
2025-05-13 01:56:52+08:00
[CATEGORIES]
cs.LG
Improving Trajectory Stitching with Flow Models
[AUTHORS]
Reece O’Mahoney, Wanming Yu, Ioannis Havoutis
[LINK]
http://arxiv.org/abs/2505.07802v1
[DATE]
2025-05-13 01:50:10+08:00
[CATEGORIES]
cs.LG
Automatically Differentiable Model Updating (ADiMU): conventional, hybrid, and neural network material model discovery including history-dependency
[AUTHORS]
Bernardo P. Ferreira, Miguel A. Bessa
[ABSTRACT]
We introduce the first Automatically Differentiable Model Updating (ADiMU)
framework that finds any history-dependent material model from full-field
displacement and global force data (global, indirect discovery) or from
strain-stress data (local, direct discovery). We show that ADiMU can update
conventional (physics-based), neural network (data-driven), and hybrid material
models. Moreover, this framework requires no fine-tuning of hyperparameters or
additional quantities beyond those inherent to the user-selected material model
architecture and optimizer. The robustness and versatility of ADiMU are
extensively exemplified by updating different models spanning tens to millions
of parameters, in both local and global discovery settings. Relying on fully
differentiable code, the algorithmic implementation leverages vectorizing maps
that enable history-dependent automatic differentiation via efficient batched
execution of shared computation graphs. This contribution also aims to
facilitate the integration, evaluation and application of future material model
architectures by openly supporting the research community. Therefore, ADiMU is
released as an open-source computational tool, integrated into a carefully
designed and documented software named HookeAI.
[COMMENTS]
77 pages, 50 figures
[LINK]
http://arxiv.org/abs/2505.07801v1
[DATE]
2025-05-13 01:49:54+08:00
[CATEGORIES]
cs.LG
Overflow Prevention Enhances Long-Context Recurrent LLMs
[AUTHORS]
Assaf Ben-Kish, Itamar Zimerman, M. Jehanzeb Mirza, James Glass, Leonid Karlinsky, Raja Giryes
[ABSTRACT]
A recent trend in LLMs is developing recurrent sub-quadratic models that
improve long-context processing efficiency. We investigate leading large
long-context models, focusing on how their fixed-size recurrent memory affects
their performance. Our experiments reveal that, even when these models are
trained for extended contexts, long contexts remain underutilized.
Specifically, we demonstrate that a chunk-based inference
procedure, which identifies and processes only the most relevant portion of the
input, can mitigate recurrent memory failures and be effective for many
long-context tasks: On LongBench, our method improves the overall performance
of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%,
RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this
simple approach also leads to state-of-the-art results in the challenging
LongBench v2 benchmark, showing competitive performance with equivalent-size
Transformers. Furthermore, our findings raise questions about whether recurrent
models genuinely exploit long-range dependencies, as our single-chunk strategy
delivers stronger performance - even in tasks that presumably require
cross-context relations.
[LINK]
http://arxiv.org/abs/2505.07793v1
[DATE]
2025-05-13 01:45:05+08:00
[CATEGORIES]
cs.LG
Analytic theory of dropout regularization
[AUTHORS]
Francesco Mori, Francesca Mignacco
[ABSTRACT]
Dropout is a regularization technique widely used in training artificial
neural networks to mitigate overfitting. It consists of dynamically
deactivating subsets of the network during training to promote more robust
representations. Despite its widespread adoption, dropout probabilities are
often selected heuristically, and theoretical explanations of its success
remain sparse. Here, we analytically study dropout in two-layer neural networks
trained with online stochastic gradient descent. In the high-dimensional limit,
we derive a set of ordinary differential equations that fully characterize the
evolution of the network during training and capture the effects of dropout. We
obtain a number of exact results describing the generalization error and the
optimal dropout probability at short, intermediate, and long training times.
Our analysis shows that dropout reduces detrimental correlations between hidden
nodes, mitigates the impact of label noise, and that the optimal dropout
probability increases with the level of noise in the data. Our results are
validated by extensive numerical simulations.
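For readers unfamiliar with the mechanism, dropout on a layer's activations can be sketched as a random binary mask applied during training only. The version below uses the common "inverted" rescaling by the keep probability so that the expected activation is unchanged; that rescaling convention and the function name are assumptions for illustration, not details taken from the abstract.

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Randomly deactivate units with probability drop_prob (must be < 1).

    Surviving units are rescaled by 1 / (1 - drop_prob), the "inverted
    dropout" convention (an assumption here), so no rescaling is needed
    at evaluation time.
    """
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob  # True = unit kept
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones(100_000)
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```

With `drop_prob=0.5`, each entry of `dropped` is either 0 or 2, and the average stays close to 1, which is what lets the same weights be used with dropout switched off.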
[COMMENTS]
17 pages, 8 figures
[LINK]
http://arxiv.org/abs/2505.07792v1
[DATE]
2025-05-13 01:45:02+08:00
[CATEGORIES]
cs.LG
A Comparative Study on Dynamic Graph Embedding based on Mamba and Transformers
[AUTHORS]
Ashish Parmanand Pandey, Alan John Varghese, Sarang Patil, Mengjia Xu
[ABSTRACT]
Dynamic graph embedding has emerged as an important technique for modeling
complex time-evolving networks across diverse domains. While transformer-based
models have shown promise in capturing long-range dependencies in temporal
graph data, they face scalability challenges due to quadratic computational
complexity. This study presents a comparative analysis of dynamic graph
embedding approaches using transformers and the recently proposed Mamba
architecture, a state-space model with linear complexity. We introduce three
novel models: TransformerG2G augmented with graph convolutional networks,
$\mathcal{DG}$-Mamba, and $\mathcal{GDG}$-Mamba with graph isomorphism network edge
convolutions. Our experiments on multiple benchmark datasets demonstrate that
Mamba-based models achieve comparable or superior performance to
transformer-based approaches in link prediction tasks while offering
significant computational efficiency gains on longer sequences. Notably,
$\mathcal{DG}$-Mamba variants consistently outperform transformer-based models on
datasets with high temporal variability, such as UCI, Bitcoin, and Reality
Mining, while maintaining competitive performance on more stable graphs like
SBM. We provide insights into the learned temporal dependencies through
analysis of attention weights and state matrices, revealing the models’ ability
to capture complex temporal patterns. By effectively combining state-space
models with graph neural networks, our work addresses key limitations of
previous approaches and contributes to the growing body of research on
efficient temporal graph representation learning. These findings offer
promising directions for scaling dynamic graph embedding to larger, more
complex real-world networks, potentially enabling new applications in areas
such as social network analysis, financial modeling, and biological system
dynamics.
[COMMENTS]
18 pages, 6 figures
[LINK]
http://arxiv.org/abs/2412.11293v2
[DATE]
2025-05-13 01:41:35+08:00
[CATEGORIES]
cs.LG
Relative Overfitting and Accept-Reject Framework
[AUTHORS]
Yanxin Liu, Yunqi Zhang
[ABSTRACT]
Currently, the scaling law of Large Language Models (LLMs) faces challenges
and bottlenecks. This paper posits that noise effects, stemming from changes in
the signal-to-noise ratio under diminishing marginal returns, are the root
cause of these issues. To control this noise, we investigated the differences
between models with performance advantages and disadvantages, introducing the
concept of “relative overfitting.” Based on their complementary strengths, we
have proposed an application framework, Accept-Reject (AR). In Natural Language
Processing (NLP), we use LLMs and Small Language Models (SLMs) as the medium
for discussion. This framework enables SLMs to exert a universal positive
influence on LLM decision outputs, rather than the intuitively expected
negative influence. We validated our approach using self-built models based on
mainstream architectures and pre-trained mainstream models across multiple
datasets, including basic language modeling, long-context tasks, subject
examination, and question-answering (QA) benchmarks. The results demonstrate
that through our structure, compared to increasing the LLM’s parameters, we can
achieve better performance improvements with significantly lower parameter and
computational costs in many scenarios. These improvements are universal,
stable, and effective. Furthermore, we explore the potential of “relative
overfitting” and the AR framework in other machine learning domains, such as
computer vision (CV) and AI for science. We hope the proposed approach can help
scaling laws overcome existing bottlenecks.
[LINK]
http://arxiv.org/abs/2505.07783v1
[DATE]
2025-05-13 01:36:14+08:00
[CATEGORIES]
cs.LG
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
[AUTHORS]
Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar V K, Rongzhi Zhang, Changhao Li, Ian Shu-Hei Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai
[ABSTRACT]
We introduce MLE-Dojo, a Gym-style framework for systematically training with
reinforcement learning, evaluating, and improving autonomous large language
model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike
existing benchmarks that primarily rely on static datasets or single-attempt
evaluations, MLE-Dojo provides an interactive environment enabling agents to
iteratively experiment, debug, and refine solutions through structured feedback
loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse,
open-ended MLE tasks carefully curated to reflect realistic engineering
scenarios such as data processing, architecture search, hyperparameter tuning,
and code debugging. Its fully executable environment supports comprehensive
agent training via both supervised fine-tuning and reinforcement learning,
facilitating iterative experimentation, realistic data sampling, and real-time
outcome verification. Extensive evaluations of eight frontier LLMs reveal that
while current models achieve meaningful iterative improvements, they still
exhibit significant limitations in autonomously generating long-horizon
solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo’s
flexible and extensible architecture seamlessly integrates diverse data
sources, tools, and evaluation protocols, uniquely enabling model-based agent
tuning and promoting interoperability, scalability, and reproducibility. We
open-source our framework and benchmarks to foster community-driven innovation
towards next-generation MLE agents.
[LINK]
http://arxiv.org/abs/2505.07782v1
[DATE]
2025-05-13 01:35:43+08:00
[CATEGORIES]
cs.LG
Towards SFW sampling for diffusion models via external conditioning
[AUTHORS]
Camilo Carvajal Reyes, Joaquín Fontbona, Felipe Tobar
[ABSTRACT]
Score-based generative models (SBM), also known as diffusion models, are the
de facto state of the art for image synthesis. Despite their unparalleled
performance, SBMs have recently been in the spotlight for being tricked into
creating not-safe-for-work (NSFW) content, such as violent images and
non-consensual nudity. Current approaches that prevent unsafe generation are
based on the models’ own knowledge, and the majority of them require
fine-tuning. This article explores the use of external sources for ensuring
safe outputs in SBMs. Our safe-for-work (SFW) sampler implements a Conditional
Trajectory Correction step that guides the samples away from undesired regions
in the ambient space using multimodal models as the source of conditioning.
Furthermore, using Contrastive Language Image Pre-training (CLIP), our method
admits user-defined NSFW classes, which can vary in different settings. Our
experiments on the text-to-image SBM Stable Diffusion validate that the
proposed SFW sampler effectively reduces the generation of explicit content
while being competitive with other fine-tuning-based approaches, as assessed
via independent NSFW detectors. Moreover, we evaluate the impact of the SFW
sampler on image quality and show that the proposed correction scheme comes at
a minor cost with negligible effect on samples not needing correction. Our
study confirms the suitability of the SFW sampler towards aligned SBM models
and the potential of using model-agnostic conditioning for the prevention of
unwanted images.
[COMMENTS]
Accepted at IJCNN 2025
[LINK]
http://arxiv.org/abs/2505.08817v1
[DATE]
2025-05-13 01:27:40+08:00
[CATEGORIES]
cs.LG
Synthesizing Diverse Network Flow Datasets with Scalable Dynamic Multigraph Generation
[AUTHORS]
Arya Grayeli, Vipin Swarup, Steven E. Noel
[ABSTRACT]
Obtaining real-world network datasets is often challenging because of
privacy, security, and computational constraints. In the absence of such
datasets, graph generative models become essential tools for creating synthetic
datasets. In this paper, we introduce a novel machine learning model for
generating high-fidelity synthetic network flow datasets that are
representative of real-world networks. Our approach involves the generation of
dynamic multigraphs using a stochastic Kronecker graph generator for structure
generation and a tabular generative adversarial network for feature generation.
We further employ an XGBoost (eXtreme Gradient Boosting) model for graph
alignment, ensuring accurate overlay of features onto the generated graph
structure. We evaluate our model using new metrics that assess both the
accuracy and diversity of the synthetic graphs. Our results demonstrate
improvements in accuracy over previous large-scale graph generation methods
while maintaining similar efficiency. We also explore the trade-off between
accuracy and diversity in synthetic graph dataset creation, a topic not
extensively covered in related works. Our contributions include the synthesis
and evaluation of large real-world netflow datasets and the definition of new
metrics for evaluating synthetic graph generative models.
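As a rough illustration of the stochastic Kronecker idea named in this abstract (not the authors' generator, which produces dynamic multigraphs with features), the sketch below builds edge probabilities as the k-th Kronecker power of a small initiator matrix and samples one static directed graph from them. The initiator values, the function names, and the seed are assumptions made for this example only.

```python
import random


def kronecker_power(initiator, k):
    """Return the k-th Kronecker power of a square matrix given as nested lists."""
    result = [[1.0]]
    m = len(initiator)
    for _ in range(k):
        n = len(result)
        # Entry (i, j) of result (x) initiator is result[i//m][j//m] * initiator[i%m][j%m].
        result = [
            [result[i // m][j // m] * initiator[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)
        ]
    return result


def sample_graph(initiator, k, seed=0):
    """Sample directed edges independently, each with its Kronecker probability."""
    rng = random.Random(seed)
    probs = kronecker_power(initiator, k)
    return [
        (i, j)
        for i, row in enumerate(probs)
        for j, p in enumerate(row)
        if rng.random() < p
    ]


# Hypothetical 2x2 initiator; real generators fit these values to data.
initiator = [[0.9, 0.5], [0.5, 0.1]]
probs = kronecker_power(initiator, 3)   # 8 x 8 matrix of edge probabilities
edges = sample_graph(initiator, 3)      # list of (source, target) pairs
```

The expected number of edges equals the sum of the initiator entries raised to the power k, which is one reason the construction scales to large graphs with very few parameters.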
[LINK]
http://arxiv.org/abs/2505.07777v1
[DATE]
2025-05-13 01:26:48+08:00
[CATEGORIES]
cs.LG
Solving Nonlinear PDEs with Sparse Radial Basis Function Networks
[AUTHORS]
Zihan Shao, Konstantin Pieper, Xiaochuan Tian
[ABSTRACT]
We propose a novel framework for solving nonlinear PDEs using sparse radial
basis function (RBF) networks. Sparsity-promoting regularization is employed to
prevent over-parameterization and reduce redundant features. This work is
motivated by longstanding challenges in traditional RBF collocation methods,
along with the limitations of physics-informed neural networks (PINNs) and
Gaussian process (GP) approaches, aiming to blend their respective strengths in
a unified framework. The theoretical foundation of our approach lies in the
function space of Reproducing Kernel Banach Spaces (RKBS) induced by
one-hidden-layer neural networks of possibly infinite width. We prove a
representer theorem showing that the sparse optimization problem in the RKBS
admits a finite solution, and we establish error bounds that
offer a foundation for generalizing classical numerical analysis. The
algorithmic framework is based on a three-phase algorithm to maintain
computational efficiency through adaptive feature selection, second-order
optimization, and pruning of inactive neurons. Numerical experiments
demonstrate the effectiveness of our method and highlight cases where it offers
notable advantages over GP approaches. This work opens new directions for
adaptive PDE solvers grounded in rigorous analysis with efficient,
learning-inspired implementation.
[COMMENTS]
35 pages, 7 figures
[LINK]
http://arxiv.org/abs/2505.07765v1
[DATE]
2025-05-13 01:12:53+08:00
[CATEGORIES]
cs.LG
Emotion-Gradient Metacognitive RSI (Part I): Theoretical Foundations and Single-Agent Architecture
[AUTHORS]
Rintaro Ando
[ABSTRACT]
We present the Emotion-Gradient Metacognitive Recursive Self-Improvement
(EG-MRSI) framework, a novel architecture that integrates introspective
metacognition, emotion-based intrinsic motivation, and recursive
self-modification into a unified theoretical system. The framework is
explicitly capable of overwriting its own learning algorithm under formally
bounded risk. Building upon the Noise-to-Meaning RSI (N2M-RSI) foundation,
EG-MRSI introduces a differentiable intrinsic reward function driven by
confidence, error, novelty, and cumulative success. This signal regulates both
a metacognitive mapping and a self-modification operator constrained by
provable safety mechanisms. We formally define the initial agent configuration,
emotion-gradient dynamics, and RSI trigger conditions, and derive a
reinforcement-compatible optimization objective that guides the agent’s
development trajectory. Meaning Density and Meaning Conversion Efficiency are
introduced as quantifiable metrics of semantic learning, closing the gap
between internal structure and predictive informativeness. This Part I paper
establishes the single-agent theoretical foundations of EG-MRSI. Future parts
will extend this framework to include safety certificates and rollback
protocols (Part II), collective intelligence mechanisms (Part III), and
feasibility constraints including thermodynamic and computational limits (Part
IV). Together, the EG-MRSI series provides a rigorous, extensible foundation
for open-ended and safe AGI.
[COMMENTS]
21 pages, 3 figures. Part I of a four-part series (Parts II-IV
forthcoming)
[LINK]
http://arxiv.org/abs/2505.07757v1
[DATE]
2025-05-13 01:02:47+08:00
[CATEGORIES]
cs.LG
BNEM: A Boltzmann Sampler Based on Bootstrapped Noised Energy Matching
[AUTHORS]
RuiKang OuYang, Bo Qiang, José Miguel Hernández-Lobato
[ABSTRACT]
Developing an efficient sampler capable of generating independent and
identically distributed (IID) samples from a Boltzmann distribution is a
crucial challenge in scientific research, e.g. molecular dynamics. In this
work, we intend to learn neural samplers given energy functions instead of data
sampled from the Boltzmann distribution. By learning the energies of the noised
data, we propose a diffusion-based sampler, Noised Energy Matching, which
theoretically has lower variance and more complexity compared to related works.
Furthermore, a novel bootstrapping technique is applied to NEM to balance
between bias and variance. We evaluate NEM and BNEM on a 2-dimensional Gaussian
Mixture Model (GMM) with 40 modes and a 4-particle double-well potential (DW-4). The
experimental results demonstrate that BNEM can achieve state-of-the-art
performance while being more robust.
[COMMENTS]
38 pages, 10 figures, 10 tables
[LINK]
http://arxiv.org/abs/2409.09787v4
[DATE]
2025-05-13 00:54:02+08:00
[CATEGORIES]
cs.LG
BodyGPS: Anatomical Positioning System
[AUTHORS]
Halid Ziya Yerebakan, Kritika Iyer, Xueqi Guo, Yoshihisa Shinagawa, Gerardo Hermosillo Valadez
[ABSTRACT]
We introduce a new type of foundational model for parsing human anatomy in
medical images that works for different modalities. It supports supervised or
unsupervised training and can perform matching, registration, classification,
or segmentation with or without user interaction. We achieve this by training a
neural network estimator that maps query locations to atlas coordinates via
regression. Efficiency is improved by sparsely sampling the input, enabling
response times of less than 1 ms without additional accelerator hardware. We
demonstrate the utility of the algorithm in both CT and MRI modalities.
[LINK]
http://arxiv.org/abs/2505.07744v1
[DATE]
2025-05-13 00:53:41+08:00
[CATEGORIES]
cs.LG
Assessing the Chemical Intelligence of Large Language Models
[AUTHORS]
Nicholas T. Runcie, Charlotte M. Deane, Fergus Imrie
[ABSTRACT]
Large Language Models are versatile, general-purpose tools with a wide range
of applications. Recently, the advent of “reasoning models” has led to
substantial improvements in their abilities in advanced problem-solving domains
such as mathematics and software engineering. In this work, we assessed the
ability of reasoning models to directly perform chemistry tasks, without any
assistance from external tools. We created a novel benchmark, called ChemIQ,
which consists of 796 questions assessing core concepts in organic chemistry,
focused on molecular comprehension and chemical reasoning. Unlike previous
benchmarks, which primarily use multiple choice formats, our approach requires
models to construct short-answer responses, more closely reflecting real-world
applications. The reasoning models, exemplified by OpenAI’s o3-mini, correctly
answered 28%-59% of questions depending on the reasoning level used, with
higher reasoning levels significantly increasing performance on all tasks.
These models substantially outperformed the non-reasoning model, GPT-4o, which
achieved only 7% accuracy. We found that Large Language Models can now convert
SMILES strings to IUPAC names, a task earlier models were unable to perform.
Additionally, we show that the latest reasoning models can elucidate structures
from 1H and 13C NMR data, correctly generating SMILES strings for 74% of
molecules containing up to 10 heavy atoms, and in one case solving a structure
comprising 21 heavy atoms. For each task, we found evidence that the reasoning
process mirrors that of a human chemist. Our results demonstrate that the
latest reasoning models have the ability to perform advanced chemical
reasoning.
[LINK]
http://arxiv.org/abs/2505.07735v1
[DATE]
2025-05-13 00:44:38+08:00
[CATEGORIES]
cs.LG
Guiding Data Collection via Factored Scaling Curves
[AUTHORS]
Lihan Zha, Apurva Badithela, Michael Zhang, Justin Lidard, Jeremy Bao, Emily Zhou, David Snyder, Allen Z. Ren, Dhruv Shah, Anirudha Majumdar
[ABSTRACT]
Generalist imitation learning policies trained on large datasets show great
promise for solving diverse manipulation tasks. However, to ensure
generalization to different conditions, policies need to be trained with data
collected across a large set of environmental factor variations (e.g., camera
pose, table height, distractors), a prohibitively expensive undertaking if
done exhaustively. We introduce a principled method for deciding what data to
collect and how much to collect for each factor by constructing factored
scaling curves (FSC), which quantify how policy performance varies as data
scales along individual or paired factors. These curves enable targeted data
acquisition for the most influential factor combinations within a given budget.
We evaluate the proposed method through extensive simulated and real-world
experiments, across both training-from-scratch and fine-tuning settings, and
show that it boosts success rates in real-world tasks in new environments by up
to 26% over existing data-collection strategies. We further demonstrate how
factored scaling curves can effectively guide data collection using an offline
metric, without requiring real-world evaluation at scale.
[COMMENTS]
Project website: https://factored-data-scaling.github.io
[LINK]
http://arxiv.org/abs/2505.07728v1
[DATE]
2025-05-13 00:36:35+08:00
[CATEGORIES]
cs.LG
Training neural control variates using correlated configurations
[AUTHORS]
Hyunwoo Oh
[ABSTRACT]
Neural control variates (NCVs) have emerged as a powerful tool for variance
reduction in Monte Carlo (MC) simulations, particularly in high-dimensional
problems where traditional control variates are difficult to construct
analytically. By training neural networks to learn auxiliary functions
correlated with the target observable, NCVs can significantly reduce estimator
variance while preserving unbiasedness. However, a critical but often
overlooked aspect of NCV training is the role of autocorrelated samples
generated by Markov Chain Monte Carlo (MCMC). While such samples are typically
discarded for error estimation due to their statistical redundancy, they may
contain useful information about the structure of the underlying probability
distribution that can benefit the training process. In this work, we
systematically examine the effect of using correlated configurations in
training neural control variates. We demonstrate, both conceptually and
numerically, that training on correlated data can improve control variate
performance, especially in settings with limited computational resources. Our
analysis includes empirical results from $U(1)$ gauge theory and scalar field
theory, illustrating when and how autocorrelated samples enhance NCV
construction. These findings provide practical guidance for the efficient use
of MCMC data in training neural networks.
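To make the control-variate mechanism concrete, here is a minimal sketch of the classical (non-neural) version: a function g with known mean zero is subtracted from the observable f, scaled by a coefficient chosen to minimise variance. The integrand exp(x) on uniform samples, the linear control variate, and all names are assumptions chosen for illustration. In the paper's setting a neural network plays the role of g and the samples come from MCMC, neither of which is reproduced here.

```python
import math
import random
import statistics


def control_variate_demo(n=10_000, seed=0):
    """Estimate E[exp(X)] for X ~ Uniform(0, 1), with and without a control variate."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    f = [math.exp(x) for x in xs]   # target observable; exact mean is e - 1
    g = [x - 0.5 for x in xs]       # control variate with known mean zero

    # Variance-minimising coefficient c = Cov(f, g) / Var(g).
    # Estimating c from the same samples adds a small O(1/n) bias;
    # a separate pilot sample would avoid it.
    mean_f = statistics.fmean(f)
    cov_fg = sum((fi - mean_f) * gi for fi, gi in zip(f, g)) / (n - 1)
    c = cov_fg / statistics.variance(g)

    corrected = [fi - c * gi for fi, gi in zip(f, g)]
    return {
        "plain_mean": mean_f,
        "plain_var": statistics.variance(f),
        "cv_mean": statistics.fmean(corrected),
        "cv_var": statistics.variance(corrected),
    }


result = control_variate_demo()
```

Because g is strongly correlated with f in this toy problem, the per-sample variance of the corrected estimator is far smaller than that of the plain one while both target the same mean.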
[COMMENTS]
8 pages, 6 figures
[LINK]
http://arxiv.org/abs/2505.07719v1
[DATE]
2025-05-13 00:25:00+08:00
[CATEGORIES]
cs.LG
SmartUT: Receive Beamforming for Spectral Coexistence of NGSO Satellite Systems
[AUTHORS]
Almoatssimbillah Saifaldawla, Eva Lagunas, Flor Ortiz, Abuzar B. M. Adam, Symeon Chatzinotas
[ABSTRACT]
In this paper, we investigate downlink co-frequency interference (CFI)
mitigation in co-existing non-geostationary satellite orbit (NGSO) systems.
Traditional mitigation techniques, such as zero-forcing (ZF), produce nulls
towards the directions of arrival (DOAs) of the interfering signals, but they
suffer from high computational complexity due to matrix inversions and required
knowledge of the channel state information (CSI). Furthermore, adaptive
beamformers, such as sample matrix inversion (SMI)-based minimum variance,
provide poor performance when the available snapshots are limited. We propose a
Mamba-based beamformer (MambaBF) that leverages an unsupervised deep learning
(DL) approach and can be deployed on the user terminal (UT) antenna array, for
assisting downlink beamforming and CFI mitigation using only a limited number
of available array snapshots as input, and without CSI knowledge. Simulation
results demonstrate that MambaBF consistently outperforms conventional
beamforming techniques in mitigating interference and maximizing the
signal-to-interference-plus-noise ratio (SINR), particularly under challenging
conditions characterized by low SINR, limited snapshots, and imperfect CSI.
[LINK]
http://arxiv.org/abs/2505.07714v1
[DATE]
2025-05-13 00:19:06+08:00
[CATEGORIES]
cs.LG
ISAC: An Invertible and Stable Auditory Filter Bank with Customizable Kernels for ML Integration
[AUTHORS]
Daniel Haider, Felix Perfler, Peter Balazs, Clara Hollomey, Nicki Holighaus
[ABSTRACT]
This paper introduces ISAC, an invertible and stable, perceptually-motivated
filter bank that is specifically designed to be integrated into machine
learning paradigms. More precisely, the center frequencies and bandwidths of
the filters are chosen to follow a non-linear, auditory frequency scale, the
filter kernels have user-defined maximum temporal support and may serve as
learnable convolutional kernels, and there exists a corresponding filter bank
such that both form a perfect reconstruction pair. ISAC provides a powerful and
user-friendly audio front-end suitable for any application, including
analysis-synthesis schemes.
[COMMENTS]
Accepted at the IEEE International Conference on Sampling Theory and
Applications (SampTA) 2025
[LINK]
http://arxiv.org/abs/2505.07709v1
[DATE]
2025-05-13 00:15:59+08:00
[CATEGORIES]
cs.LG
4TaStiC: Time and trend traveling time series clustering for classifying long-term type 2 diabetes patients
[AUTHORS]
Onthada Preedasawakul, Nathakhun Wiroonsri
[ABSTRACT]
Diabetes is one of the most prevalent diseases worldwide, characterized by
persistently high blood sugar levels, capable of damaging various internal
organs and systems. Diabetes patients require routine check-ups, resulting in a
time series of laboratory records, such as hemoglobin A1c, which reflects each
patient’s health behavior over time and informs their doctor’s recommendations.
Clustering patients into groups based on their entire time series data assists
doctors in making recommendations and choosing treatments without the need to
review all records. However, time series clustering of this type of dataset
introduces some challenges; patients visit their doctors at different time
points, making it difficult to capture and match trends, peaks, and patterns.
Additionally, two aspects must be considered: differences in the levels of
laboratory results and differences in trends and patterns. To address these
challenges, we introduce a new clustering algorithm called Time and Trend
Traveling Time Series Clustering (4TaStiC), using a base dissimilarity measure
combined with Euclidean and Pearson correlation metrics. We evaluated this
algorithm on artificial datasets, comparing its performance with that of seven
existing methods. The results show that 4TaStiC outperformed the other methods
on the targeted datasets. Finally, we applied 4TaStiC to cluster a cohort of
1,989 type 2 diabetes patients at Siriraj Hospital. Each group of patients
exhibits clear characteristics that will benefit doctors in making efficient
clinical decisions. Furthermore, the proposed algorithm can be applied to
contexts outside the medical field.
[LINK]
http://arxiv.org/abs/2505.07702v1
[DATE]
2025-05-13 00:10:32+08:00
[CATEGORIES]
cs.LG
VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
[AUTHORS]
Jianshu Zhang, Dongyu Yao, Renjie Pi, Paul Pu Liang, Yi R. Fung
[ABSTRACT]
Visually linking matching cues is a crucial ability in daily life, such as
identifying the same person in multiple photos based on their cues, even
without knowing who they are. Despite the extensive knowledge that
vision-language models (VLMs) possess, it remains largely unexplored whether
they are capable of performing this fundamental task. To address this, we
introduce VLM2-Bench, a benchmark designed to assess whether VLMs can Visually
Link Matching cues, with 9 subtasks and over 3,000 test cases. Comprehensive
evaluation across eight open-source VLMs and GPT-4o, along with further
analysis of various language-side and vision-side prompting methods, leads to a
total of eight key findings. We identify critical challenges in models’ ability
to link visual cues, highlighting a significant performance gap where even
GPT-4o lags 34.80% behind humans. Based on these insights, we advocate for (i)
enhancing core visual capabilities to improve adaptability and reduce reliance
on prior knowledge, (ii) establishing clearer principles for integrating
language-based reasoning in vision-centric tasks to prevent unnecessary biases,
and (iii) shifting vision-text training paradigms toward fostering models’
ability to independently structure and infer relationships among visual cues.
[COMMENTS]
Project Page: https://vlm2-bench.github.io/
[LINK]
http://arxiv.org/abs/2502.12084v3
[DATE]
2025-05-12 23:55:55+08:00
[CATEGORIES]
cs.CL
Mapping Biomedical Ontology Terms to IDs: Effect of Domain Prevalence on Prediction Accuracy
[AUTHORS]
Thanh Son Do, Daniel B. Hier, Tayo Obafemi-Ajayi
[ABSTRACT]
This study evaluates the ability of large language models (LLMs) to map
biomedical ontology terms to their corresponding ontology IDs across the Human
Phenotype Ontology (HPO), Gene Ontology (GO), and UniProtKB terminologies.
Using counts of ontology IDs in the PubMed Central (PMC) dataset as a surrogate
for their prevalence in the biomedical literature, we examined the relationship
between ontology ID prevalence and mapping accuracy. Results indicate that
ontology ID prevalence strongly predicts accurate mapping of HPO terms to HPO
IDs, GO terms to GO IDs, and protein names to UniProtKB accession numbers.
Higher prevalence of ontology IDs in the biomedical literature correlated with
higher mapping accuracy. Predictive models based on receiver operating
characteristic (ROC) curves confirmed this relationship.
In contrast, this pattern did not apply to mapping protein names to Human
Genome Organisation’s (HUGO) gene symbols. GPT-4 achieved a high baseline
performance (95%) in mapping protein names to HUGO gene symbols, with mapping
accuracy unaffected by prevalence. We propose that the high prevalence of HUGO
gene symbols in the literature has caused these symbols to become lexicalized,
enabling GPT-4 to map protein names to HUGO gene symbols with high accuracy.
These findings highlight the limitations of LLMs in mapping ontology terms to
low-prevalence ontology IDs and underscore the importance of incorporating
ontology ID prevalence into the training and evaluation of LLMs for biomedical
applications.
[COMMENTS]
Presented at 2025 IEEE Conference on Artificial Intelligence (CAI).
Santa Clara, CA. May 5, 2025
[LINK]
http://arxiv.org/abs/2409.13746v2
[DATE]
2025-05-12 23:43:37+08:00
[CATEGORIES]
cs.CL
Benchmarking Retrieval-Augmented Generation for Chemistry
[AUTHORS]
Xianrui Zhong, Bowen Jin, Siru Ouyang, Yanzhen Shen, Qiao Jin, Yin Fang, Zhiyong Lu, Jiawei Han
[ABSTRACT]
Retrieval-augmented generation (RAG) has emerged as a powerful framework for
enhancing large language models (LLMs) with external knowledge, particularly in
scientific domains that demand specialized and dynamic information. Despite its
promise, the application of RAG in the chemistry domain remains underexplored,
primarily due to the lack of high-quality, domain-specific corpora and
well-curated evaluation benchmarks. In this work, we introduce ChemRAG-Bench, a
comprehensive benchmark designed to systematically assess the effectiveness of
RAG across a diverse set of chemistry-related tasks. The accompanying chemistry
corpus integrates heterogeneous knowledge sources, including scientific
literature, the PubChem database, PubMed abstracts, textbooks, and Wikipedia
entries. In addition, we present ChemRAG-Toolkit, a modular and extensible RAG
toolkit that supports five retrieval algorithms and eight LLMs. Using
ChemRAG-Toolkit, we demonstrate that RAG yields a substantial performance gain
– achieving an average relative improvement of 17.4% over direct inference
methods. We further conduct in-depth analyses on retriever architectures,
corpus selection, and the number of retrieved passages, culminating in
practical recommendations to guide future research and deployment of RAG
systems in the chemistry domain. The code and data are available at
https://chemrag.github.io.
[LINK]
http://arxiv.org/abs/2505.07671v1
[DATE]
2025-05-12 23:34:45+08:00
[CATEGORIES]
cs.CL
Using Information Theory to Characterize Prosodic Typology: The Case of Tone, Pitch-Accent and Stress-Accent
[AUTHORS]
Ethan Gotlieb Wilcox, Cui Ding, Giovanni Acampa, Tiago Pimentel, Alex Warstadt, Tamar I. Regev
[ABSTRACT]
This paper argues that the relationship between lexical identity and prosody
– one well-studied parameter of linguistic variation – can be characterized
using information theory. We predict that languages that use prosody to make
lexical distinctions should exhibit a higher mutual information between word
identity and prosody, compared to languages that don’t. We test this hypothesis
in the domain of pitch, which is used to make lexical distinctions in tonal
languages, like Cantonese. We use a dataset of speakers reading sentences aloud
in ten languages across five language families to estimate the mutual
information between the text and their pitch curves. We find that, across
languages, pitch curves display similar amounts of entropy. However, these
curves are easier to predict given their associated text in the tonal
languages, compared to pitch- and stress-accent languages, and thus the mutual
information is higher in these languages, supporting our hypothesis. Our
results support perspectives that view linguistic typology as gradient, rather
than categorical.
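The quantity at the centre of this abstract can be written as I(W; P) = H(P) - H(P | W): the entropy of the pitch signal minus the entropy that remains once the word is known. The sketch below computes it for a toy discrete case. The word labels, the two-way "high"/"low" pitch coding, and the data are invented for illustration, and the paper itself works with continuous pitch curves and learned estimators rather than counts.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mutual_information(pairs):
    """I(W; P) = H(P) - H(P | W) for a list of (word, pitch_label) pairs."""
    n = len(pairs)
    h_p = entropy(Counter(p for _, p in pairs))
    by_word = {}
    for w, p in pairs:
        by_word.setdefault(w, Counter())[p] += 1
    # H(P | W) is the entropy of P within each word, weighted by word frequency.
    h_p_given_w = sum(sum(c.values()) / n * entropy(c) for c in by_word.values())
    return h_p - h_p_given_w


# Toy "tonal-like" data: the word fully determines the pitch label.
tonal_like = [("word_a", "high"), ("word_b", "low")] * 50
# Toy "accent-like" data: pitch varies independently of the word.
accent_like = [
    ("word_a", "high"), ("word_a", "low"),
    ("word_b", "high"), ("word_b", "low"),
] * 25

mi_tonal = mutual_information(tonal_like)
mi_accent = mutual_information(accent_like)
```

Both toy datasets have the same pitch entropy (one bit), yet the mutual information is one bit in the first and zero in the second, mirroring the pattern the abstract reports of similar entropy but different predictability.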
[LINK]
http://arxiv.org/abs/2505.07659v1
[DATE]
2025-05-12 23:25:17+08:00
[CATEGORIES]
cs.CL
JobHop: A Large-Scale Dataset of Career Trajectories
[AUTHORS]
Iman Johary, Raphael Romero, Alexandru C. Mara, Tijl De Bie
[ABSTRACT]
Understanding labor market dynamics is essential for policymakers, employers,
and job seekers. However, comprehensive datasets that capture real-world career
trajectories are scarce. In this paper, we introduce JobHop, a large-scale
public dataset derived from anonymized resumes provided by VDAB, the public
employment service in Flanders, Belgium. Utilizing Large Language Models
(LLMs), we process unstructured resume data to extract structured career
information, which is then mapped to standardized ESCO occupation codes using a
multi-label classification model. This results in a rich dataset of over 2.3
million work experiences, extracted from and grouped into more than 391,000
user resumes and mapped to standardized ESCO occupation codes, offering
valuable insights into real-world occupational transitions. This dataset
enables diverse applications, such as analyzing labor market mobility, job
stability, and the effects of career breaks on occupational transitions. It
also supports career path prediction and other data-driven decision-making
processes. To illustrate its potential, we explore key dataset characteristics,
including job distributions, career breaks, and job transitions, demonstrating
its value for advancing labor market research.
[LINK]
http://arxiv.org/abs/2505.07653v1
[DATE]
2025-05-12 23:22:29+08:00
[CATEGORIES]
cs.CL
Chronocept: Instilling a Sense of Time in Machines
[AUTHORS]
Krish Goel, Sanskar Pandey, KS Mahadevan, Harsh Kumar, Vishesh Khadaria
[ABSTRACT]
Human cognition is deeply intertwined with a sense of time, known as
Chronoception. This sense allows us to judge how long facts remain valid and
when knowledge becomes outdated. Despite progress in vision, language, and
motor control, AI still struggles to reason about temporal validity. We
introduce Chronocept, the first benchmark to model temporal validity as a
continuous probability distribution over time. Using skew-normal curves fitted
along semantically decomposed temporal axes, Chronocept captures nuanced
patterns of emergence, decay, and peak relevance. It includes two datasets:
Benchmark I (atomic facts) and Benchmark II (multi-sentence passages).
Annotations show strong inter-annotator agreement (84% and 89%). Our baselines
predict curve parameters - location, scale, and skewness - enabling
interpretable, generalizable learning and outperforming classification-based
approaches. Chronocept fills a foundational gap in AI’s temporal reasoning,
supporting applications in knowledge grounding, fact-checking,
retrieval-augmented generation (RAG), and proactive agents. Code and data are
publicly available.
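For readers unfamiliar with the three curve parameters mentioned in this abstract, the following sketch evaluates the standard skew-normal density from its location, scale, and skewness. The parameter values and the interpretation of the x-axis as a time axis are assumptions made for this example. The benchmark's actual axes and fitting procedure are not shown.

```python
import math


def skew_normal_pdf(x, location, scale, skewness):
    """Skew-normal density: (2 / scale) * phi(z) * Phi(skewness * z), z = (x - location) / scale."""
    z = (x - location) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)         # standard normal pdf
    big_phi = 0.5 * (1.0 + math.erf(skewness * z / math.sqrt(2.0)))  # standard normal cdf
    return 2.0 / scale * phi * big_phi


# Hypothetical validity curve: sharp onset near t = 2, slow decay afterwards.
times = [i * 0.5 for i in range(0, 21)]
validity = [skew_normal_pdf(t, location=2.0, scale=1.5, skewness=3.0) for t in times]
```

With skewness set to zero the curve reduces to an ordinary normal density, and positive skewness shifts mass to the right of the location, which is one way to express fast emergence followed by gradual decay.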
[COMMENTS]
20 pages, 8 figures, 18 tables
[LINK]
http://arxiv.org/abs/2505.07637v1
[DATE]
2025-05-12 23:07:32+08:00
[CATEGORIES]
cs.CL
cs.LG
ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs
[AUTHORS]
Fahmida Liza Piya, Rahmatollah Beheshti
[ABSTRACT]
Unstructured clinical data can serve as a unique and rich source of
information that can meaningfully inform clinical practice. Extracting the most
pertinent context from such data is critical for exploiting its true potential
toward optimal and timely decision-making in patient care. While prior research
has explored various methods for clinical text summarization, most prior
studies either process all input tokens uniformly or rely on heuristic-based
filters, which can overlook nuanced clinical cues and fail to prioritize
information critical for decision-making. In this study, we propose ConTextual,
a novel framework that integrates a Context-Preserving Token Filtering method
with a Domain-Specific Knowledge Graph (KG) for contextual augmentation. By
preserving context-specific important tokens and enriching them with structured
knowledge, ConTextual improves both linguistic coherence and clinical fidelity.
Our extensive empirical evaluations on two public benchmark datasets
demonstrate that ConTextual consistently outperforms other baselines. Our
proposed approach highlights the complementary role of token-level filtering
and structured retrieval in enhancing both linguistic and clinical integrity,
as well as offering a scalable solution for improving precision in clinical
text generation.
[LINK]
http://arxiv.org/abs/2504.16394v2
[DATE]
2025-05-12 22:57:14+08:00
[CATEGORIES]
cs.CL
Concept-Level Explainability for Auditing & Steering LLM Responses
[AUTHORS]
Kenza Amara, Rita Sevastjanova, Mennatallah El-Assady
[ABSTRACT]
As large language models (LLMs) become widely deployed, concerns about their
safety and alignment grow. An approach to steer LLM behavior, such as
mitigating biases or defending against jailbreaks, is to identify which parts
of a prompt influence specific aspects of the model’s output. Token-level
attribution methods offer a promising solution, but still struggle in text
generation, explaining the presence of each token in the output separately,
rather than the underlying semantics of the entire LLM response. We introduce
ConceptX, a model-agnostic, concept-level explainability method that identifies
the concepts, i.e., semantically rich tokens in the prompt, and assigns them
importance based on the outputs’ semantic similarity. Unlike current
token-level methods, ConceptX also offers to preserve context integrity through
in-place token replacements and supports flexible explanation goals, e.g.,
gender bias. ConceptX enables both auditing, by uncovering sources of bias, and
steering, by modifying prompts to shift the sentiment or reduce the harmfulness
of LLM responses, without requiring retraining. Across three LLMs, ConceptX
outperforms token-level methods like TokenSHAP in both faithfulness and human
alignment. Steering tasks boost sentiment shift by 0.252 versus 0.131 for
random edits and lower attack success rates from 0.463 to 0.242, outperforming
attribution and paraphrasing baselines. While prompt engineering and
self-explaining methods sometimes yield safer responses, ConceptX offers a
transparent and faithful alternative for improving LLM safety and alignment,
demonstrating the practical value of attribution-based explainability in
guiding LLM behavior.
[COMMENTS]
9 pages, 7 figures, Submission to NeurIPS 2025
[LINK]
http://arxiv.org/abs/2505.07610v1
[DATE]
2025-05-12 22:31:51+08:00
[CATEGORIES]
cs.CL
MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining
[AUTHORS]
Xiaomi LLM-Core Team, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue
[ABSTRACT]
We present MiMo-7B, a large language model born for reasoning tasks, with
optimization across both pre-training and post-training stages. During
pre-training, we enhance the data preprocessing pipeline and employ a
three-stage data mixing strategy to strengthen the base model’s reasoning
potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional
Multi-Token Prediction objective for enhanced performance and accelerated
inference speed. During post-training, we curate a dataset of 130K verifiable
mathematics and programming problems for reinforcement learning, integrating a
test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and
employing strategic data resampling to stabilize training. Extensive
evaluations show that MiMo-7B-Base possesses exceptional reasoning potential,
outperforming even much larger 32B models. The final RL-tuned model,
MiMo-7B-RL, achieves superior performance on mathematics, code and general
reasoning tasks, surpassing the performance of OpenAI o1-mini. The model
checkpoints are available at https://github.com/xiaomimimo/MiMo.
[LINK]
http://arxiv.org/abs/2505.07608v1
[DATE]
2025-05-12 22:30:11+08:00
[CATEGORIES]
cs.CL
cs.LG
Characterizing the Investigative Methods of Fictional Detectives with Large Language Models
[AUTHORS]
Edirlei Soares de Lima, Marco A. Casanova, Bruno Feijó, Antonio L. Furtado
[ABSTRACT]
Detective fiction, a genre defined by its complex narrative structures and
character-driven storytelling, presents unique challenges for computational
narratology, a research field focused on integrating literary theory into
automated narrative generation. While traditional literary studies have offered
deep insights into the methods and archetypes of fictional detectives, these
analyses often focus on a limited number of characters and lack the scalability
needed for the extraction of unique traits that can be used to guide narrative
generation methods. In this paper, we present an AI-driven approach for
systematically characterizing the investigative methods of fictional
detectives. Our multi-phase workflow explores the capabilities of 15 Large
Language Models (LLMs) to extract, synthesize, and validate distinctive
investigative traits of fictional detectives. This approach was tested on a
diverse set of seven iconic detectives - Hercule Poirot, Sherlock Holmes,
William Murdoch, Columbo, Father Brown, Miss Marple, and Auguste Dupin -
capturing the distinctive investigative styles that define each character. The
identified traits were validated against existing literary analyses and further
tested in a reverse identification phase, achieving an overall accuracy of
91.43%, demonstrating the method’s effectiveness in capturing the distinctive
investigative approaches of each detective. This work contributes to the
broader field of computational narratology by providing a scalable framework
for character analysis, with potential applications in AI-driven interactive
storytelling and automated narrative generation.
[LINK]
http://arxiv.org/abs/2505.07601v1
[DATE]
2025-05-12 22:24:58+08:00
[CATEGORIES]
cs.CL
NCL-UoR at SemEval-2025 Task 3: Detecting Multilingual Hallucination and Related Observable Overgeneration Text Spans with Modified RefChecker and Modified SelfCheckGPT
[AUTHORS]
Jiaying Hong, Thanet Markchom, Jianfei Xu, Tong Wu, Huizhi Liang
[ABSTRACT]
SemEval-2025 Task 3 (Mu-SHROOM) focuses on detecting hallucinations in
content generated by various large language models (LLMs) across multiple
languages. This task involves not only identifying the presence of
hallucinations but also pinpointing their specific occurrences. To tackle this
challenge, this study introduces two methods: modified RefChecker and modified
SelfCheckGPT. The modified RefChecker integrates prompt-based factual
verification into References, structuring them as claim-based tests rather than
single external knowledge sources. The modified SelfCheckGPT incorporates
external knowledge to overcome its reliance on internal knowledge. In addition,
both methods’ original prompt designs are enhanced to identify hallucinated
words within LLM-generated texts. Experimental results demonstrate the
effectiveness of the approach, achieving a high ranking on the test dataset in
detecting hallucinations across various languages, with an average IoU of
0.5310 and an average COR of 0.5669.
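The IoU figure above measures overlap between predicted and annotated hallucination spans. A minimal sketch of a character-level IoU follows, assuming spans are half-open (start, end) character offsets. The function name span_iou and the convention that two empty span lists score 1.0 are assumptions made for illustration, not details taken from the task's official scorer.

```python
def span_iou(pred_spans, gold_spans):
    """Character-level intersection-over-union of two span lists.

    Each span is a half-open (start, end) pair of character offsets.
    """
    # Expand each span list into the set of character positions it covers.
    pred = {i for start, end in pred_spans for i in range(start, end)}
    gold = {i for start, end in gold_spans for i in range(start, end)}
    if not pred and not gold:
        # Assumed convention: nothing predicted and nothing annotated
        # counts as perfect agreement.
        return 1.0
    return len(pred & gold) / len(pred | gold)


# Hypothetical example: two predicted spans against one annotated span.
score = span_iou([(0, 5), (10, 14)], [(3, 12)])
```

Working over sets of character positions keeps the measure well defined when spans overlap or are listed out of order.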
[LINK]
http://arxiv.org/abs/2503.01921v2
[DATE]
2025-05-12 22:24:25+08:00
[CATEGORIES]
cs.CL
Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent
[AUTHORS]
Ziyang Huang, Xiaowei Yuan, Yiming Ju, Jun Zhao, Kang Liu
[ABSTRACT]
Retrieval-augmented generation (RAG) is a common strategy to reduce
hallucinations in Large Language Models (LLMs). While reinforcement learning
(RL) can enable LLMs to act as search agents by activating retrieval
capabilities, existing ones often underutilize their internal knowledge. This
can lead to redundant retrievals, potentially harmful knowledge conflicts, and
increased inference latency. To address these limitations, an efficient and
adaptive search agent capable of discerning optimal retrieval timing and
synergistically integrating parametric (internal) and retrieved (external)
knowledge is urgently needed. This paper introduces the Reinforced
Internal-External Knowledge Synergistic Reasoning Agent (IKEA), which can
identify its own knowledge boundary and prioritize the utilization of internal
knowledge, resorting to external search only when internal knowledge is deemed
insufficient. This is achieved using a novel knowledge-boundary aware reward
function and a knowledge-boundary aware training dataset. These are designed
for internal-external knowledge synergy oriented RL, incentivizing the model to
deliver accurate answers, minimize unnecessary retrievals, and encourage
appropriate external searches when its own knowledge is lacking. Evaluations
across multiple knowledge reasoning tasks demonstrate that IKEA significantly
outperforms baseline methods, reduces retrieval frequency significantly, and
exhibits robust generalization capabilities.
[LINK]
http://arxiv.org/abs/2505.07596v1
[DATE]
2025-05-12 22:21:57+08:00
[CATEGORIES]
cs.CL
A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
[AUTHORS]
Junjie Ye, Caishuang Huang, Zhuohan Chen, Wenjie Fu, Chenyuan Yang, Leyi Yang, Yilong Wu, Peng Wang, Meng Zhou, Xiaolong Yang, Tao Gui, Qi Zhang, Zhongchao Shi, Jianping Fan, Xuanjing Huang
[ABSTRACT]
Instruction following evaluates large language models (LLMs) on their ability
to generate outputs that adhere to user-defined constraints. However, existing
benchmarks often rely on templated constraint prompts, which lack the diversity
of real-world usage and limit fine-grained performance assessment. To fill this
gap, we propose a multi-dimensional constraint framework encompassing three
constraint patterns, four constraint categories, and four difficulty levels.
Building on this framework, we develop an automated instruction generation
pipeline that performs constraint expansion, conflict detection, and
instruction rewriting, yielding 1,200 code-verifiable instruction-following
test samples. We evaluate 19 LLMs across seven model families and uncover
substantial variation in performance across constraint forms. For instance,
average performance drops from 77.67% at Level I to 32.96% at Level IV.
Furthermore, we demonstrate the utility of our approach by using it to generate
data for reinforcement learning, achieving substantial gains in instruction
following without degrading general performance. In-depth analysis indicates
that these gains stem primarily from modifications in the model’s attention
module parameters, which enhance constraint recognition and adherence. Code
and data are available at https://github.com/Junjie-Ye/MulDimIF.
[LINK]
http://arxiv.org/abs/2505.07591v1
[DATE]
2025-05-12 22:16:55+08:00
[CATEGORIES]
cs.CL
SciCom Wiki: Fact-Checking and FAIR Knowledge Distribution for Scientific Videos and Podcasts
[AUTHORS]
Tim Wittenborg, Constantin Sebastian Tremel, Niklas Stehr, Oliver Karras, Markus Stocker, Sören Auer
[ABSTRACT]
Democratic societies need accessible, reliable information. Videos and
Podcasts have established themselves as the medium of choice for civic
dissemination, but also as carriers of misinformation. The emerging Science
Communication Knowledge Infrastructure (SciCom KI) curating non-textual media
is still fragmented and not adequately equipped to scale against the content
flood. Our work sets out to support the SciCom KI with a central, collaborative
platform, the SciCom Wiki, to facilitate FAIR (findable, accessible,
interoperable, reusable) media representation and the fact-checking of their
content, particularly for videos and podcasts. Building an open-source service
system centered around Wikibase, we survey requirements from 53 stakeholders,
refine these in 11 interviews, and evaluate our prototype based on these
requirements with another 14 participants. To address the most requested
feature, fact-checking, we developed a neurosymbolic computational
fact-checking approach, converting heterogeneous media into knowledge graphs.
This increases machine-readability and allows comparing statements against
equally represented ground-truth. Our computational fact-checking tool was
iteratively evaluated through 10 expert interviews, and a public user survey
with 43 participants verified the necessity and usability of our tool. Overall, our
findings identified several needs to systematically support the SciCom KI. The
SciCom Wiki, as a FAIR digital library complementing our neurosymbolic
computational fact-checking framework, was found suitable to address the raised
requirements. Further, we identified that the SciCom KI is severely
underdeveloped regarding FAIR knowledge and related systems facilitating its
collaborative creation and curation. Our system can provide a central knowledge
node, yet a collaborative effort is required to scale against the imminent
(mis-)information flood.
[COMMENTS]
18 pages, 10 figures, submitted to TPDL 2025
[LINK]
http://arxiv.org/abs/2505.07912v1
[DATE]
2025-05-12 21:38:20+08:00
[CATEGORIES]
cs.CL
Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models
[AUTHORS]
Rei Higuchi, Taiji Suzuki
[ABSTRACT]
Aligning large language models (LLMs) with human preferences is crucial for
safe deployment, yet existing methods assume specific preference models like the
Bradley-Terry model. This assumption leads to statistical inconsistency, where
more data doesn’t guarantee convergence to true human preferences. To address
this critical gap, we introduce a novel alignment method Direct Density Ratio
Optimization (DDRO). DDRO directly estimates the density ratio between
preferred and unpreferred output distributions, circumventing the need for
explicit human preference modeling. We theoretically prove that DDRO is
statistically consistent, ensuring convergence to the true preferred
distribution as the data size grows, regardless of the underlying preference
structure. Experiments demonstrate that DDRO achieves superior performance
compared to existing methods on many major benchmarks. DDRO unlocks the
potential for truly data-driven alignment, paving the way for more reliable and
human-aligned LLMs.
[LINK]
http://arxiv.org/abs/2505.07558v1
[DATE]
2025-05-12 21:36:25+08:00
[CATEGORIES]
cs.LG
cs.CL
MedualTime: A Dual-Adapter Language Model for Medical Time Series-Text Multimodal Learning
[AUTHORS]
Jiexia Ye, Weiqi Zhang, Ziyue Li, Jia Li, Meng Zhao, Fugee Tsung
[ABSTRACT]
The recent rapid advancements in language models (LMs) have garnered
attention in medical time series-text multimodal learning. However, existing
contrastive learning-based and prompt-based LM approaches tend to be biased,
often assigning a primary role to time series modality while treating text
modality as secondary. We classify these approaches under a temporal-primary
paradigm, which may overlook the unique and critical task-relevant information
embedded in text modality like clinical reports, thus failing to fully leverage
mutual benefits and complementarity of different modalities. To fill this gap,
we propose a novel textual-temporal multimodal learning paradigm that enables
either modality to serve as the primary while being enhanced by the other,
thereby effectively capturing modality-specific information and fostering
cross-modal interaction. Specifically, we design MedualTime, a language model
composed of dual adapters to implement temporal-primary and textual-primary
modeling simultaneously. Within each adapter, lightweight adaptation tokens are
injected into the top layers of the LM to encourage high-level modality fusion.
The LM pipeline shared by the dual adapters not only achieves adapter alignment
but also enables efficient fine-tuning, reducing computational resources.
Empirically, MedualTime demonstrates superior performance on medical data,
achieving notable improvements of 8% accuracy and 12% F1 in supervised
settings. Furthermore, MedualTime’s transferability is validated by few-shot
label transfer experiments from coarse-grained to fine-grained medical data.
https://github.com/start2020/MedualTime
[COMMENTS]
9 pages, 6 figures, 3 tables
[LINK]
http://arxiv.org/abs/2406.06620v3
[DATE]
2025-05-12 21:27:11+08:00
[CATEGORIES]
cs.LG
cs.CL
SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion
[AUTHORS]
Lei Wang
[ABSTRACT]
Retrieval-Augmented Generation (RAG) models frequently encounter
hallucination phenomena when integrating external information with internal
parametric knowledge. Empirical studies demonstrate that the disequilibrium
between external contextual information and internal parametric knowledge
constitutes a primary factor in hallucination generation. Existing
hallucination detection methodologies predominantly emphasize either the
external or internal mechanism in isolation, thereby overlooking their
synergistic effects. The recently proposed ReDeEP framework decouples these
dual mechanisms, identifying two critical contributors to hallucinations:
excessive reliance on parametric knowledge encoded in feed-forward networks
(FFN) and insufficient utilization of external information by attention
mechanisms (particularly copy heads). ReDeEP quantitatively assesses these
factors to detect hallucinations and dynamically modulates the contributions of
FFNs and copy heads to attenuate their occurrence. Nevertheless, ReDeEP and
numerous other hallucination detection approaches operate at the level of
logit-level uncertainty estimation or language-level self-consistency
evaluation and inadequately address the semantic dimensions of model responses,
resulting in inconsistent hallucination assessments in RAG implementations.
Building upon ReDeEP’s foundation, this paper introduces SEReDeEP, which
enhances computational processes through semantic entropy captured via trained
linear probes, thereby achieving hallucination assessments that more accurately
reflect ground truth evaluations.
[LINK]
http://arxiv.org/abs/2505.07528v1
[DATE]
2025-05-12 21:10:46+08:00
[CATEGORIES]
cs.CL
A Reproduction Study: The Kernel PCA Interpretation of Self-Attention Fails Under Scrutiny
[AUTHORS]
Karahan Sarıtaş, Çağatay Yıldız
[ABSTRACT]
In this reproduction study, we revisit recent claims that self-attention
implements kernel principal component analysis (KPCA) (Teo et al., 2024),
positing that (i) value vectors $V$ capture the eigenvectors of the Gram matrix
of the keys, and (ii) that self-attention projects queries onto the principal
component axes of the key matrix $K$ in a feature space. Our analysis reveals
three critical inconsistencies: (1) No alignment exists between learned
self-attention value vectors and what is proposed in the KPCA perspective, with
average similarity metrics (optimal cosine similarity $\leq 0.32$, linear CKA
(Centered Kernel Alignment) $\leq 0.11$, kernel CKA $\leq 0.32$) indicating
negligible correspondence; (2) Reported decreases in reconstruction loss
$J_\text{proj}$, arguably justifying the claim that the self-attention
minimizes the projection error of KPCA, are misinterpreted, as the quantities
involved differ by orders of magnitude ($\sim\!10^3$); (3) Gram matrix
eigenvalue statistics, introduced to justify that $V$ captures the eigenvectors
of the Gram matrix, are irreproducible without undocumented
implementation-specific adjustments. Across 10 transformer architectures, we
conclude that the KPCA interpretation of self-attention lacks empirical
support.
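For reference, linear CKA between two representation matrices with matching rows has a standard closed form: after centering each column, it is $\|Y^\top X\|_F^2 / (\|X^\top X\|_F \, \|Y^\top Y\|_F)$. The sketch below computes it with NumPy. The function name linear_cka and the random matrices are illustrative assumptions; this is not the study's evaluation code.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between X (n, d1) and Y (n, d2)."""
    # Center every feature column so the score ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical data: 50 samples with 8-dimensional and 5-dimensional features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Z = rng.normal(size=(50, 5))
# A random orthogonal matrix, used to rotate X without changing its geometry.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

same = linear_cka(X, X)         # identical representations
rotated = linear_cka(X, X @ Q)  # the same representation after a rotation
unrelated = linear_cka(X, Z)    # independently drawn features
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either input, which is why the rotated copy scores the same as the original.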
[LINK]
http://arxiv.org/abs/2505.07908v1
[DATE]
2025-05-12 20:38:46+08:00
[CATEGORIES]
cs.LG
cs.CL
Translating the Grievance Dictionary: a psychometric evaluation of Dutch, German, and Italian versions
[AUTHORS]
Isabelle van der Vegt, Bennett Kleinberg, Marilu Miotto, Jonas Festor
[ABSTRACT]
This paper introduces and evaluates three translations of the Grievance
Dictionary, a psycholinguistic dictionary for the analysis of violent,
threatening or grievance-fuelled texts. Considering the relevance of these
themes in languages beyond English, we translated the Grievance Dictionary to
Dutch, German, and Italian. We describe the process of automated translation
supplemented by human annotation. Psychometric analyses are performed,
including internal reliability of dictionary categories and correlations with
the LIWC dictionary. The Dutch and German translations perform similarly to the
original English version, whereas the Italian dictionary shows low reliability
for some categories. Finally, we make suggestions for further validation and
application of the dictionary, as well as for future dictionary translations
following a similar approach.
[LINK]
http://arxiv.org/abs/2505.07495v1
[DATE]
2025-05-12 20:27:38+08:00
[CATEGORIES]
cs.CL
A Survey on Collaborative Mechanisms Between Large and Small Language Models
[AUTHORS]
Yi Chen, JiaHao Zhao, HaoHao Han
[ABSTRACT]
Large Language Models (LLMs) deliver powerful AI capabilities but face
deployment challenges due to high resource costs and latency, whereas Small
Language Models (SLMs) offer efficiency and deployability at the cost of
reduced performance. Collaboration between LLMs and SLMs emerges as a crucial
paradigm to synergistically balance these trade-offs, enabling advanced AI
applications, especially on resource-constrained edge devices. This survey
provides a comprehensive overview of LLM-SLM collaboration, detailing various
interaction mechanisms (pipeline, routing, auxiliary, distillation, fusion),
key enabling technologies, and diverse application scenarios driven by
on-device needs like low latency, privacy, personalization, and offline
operation. While highlighting the significant potential for creating more
efficient, adaptable, and accessible AI, we also discuss persistent challenges
including system overhead, inter-model consistency, robust task allocation,
evaluation complexity, and security/privacy concerns. Future directions point
towards more intelligent adaptive frameworks, deeper model fusion, and
expansion into multimodal and embodied AI, positioning LLM-SLM collaboration as
a key driver for the next generation of practical and ubiquitous artificial
intelligence.
[LINK]
http://arxiv.org/abs/2505.07460v1
[DATE]
2025-05-12 19:48:42+08:00
[CATEGORIES]
cs.CL
Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems
[AUTHORS]
Guangjing Wang, Ce Zhou, Yuanda Wang, Bocheng Chen, Hanqing Guo, Qiben Yan
[ABSTRACT]
As Artificial Intelligence (AI) systems increasingly underpin critical
applications, from autonomous vehicles to biometric authentication, their
vulnerability to transferable attacks presents a growing concern. These
attacks, designed to generalize across instances, domains, models, tasks,
modalities, or even hardware platforms, pose severe risks to security, privacy,
and system integrity. This survey delivers the first comprehensive review of
transferable attacks across seven major categories, including evasion,
backdoor, data poisoning, model stealing, model inversion, membership
inference, and side-channel attacks. We introduce a unified six-dimensional
taxonomy: cross-instance, cross-domain, cross-modality, cross-model,
cross-task, and cross-hardware, which systematically captures the diverse
transfer pathways of adversarial strategies. Through this framework, we examine
both the underlying mechanics and practical implications of transferable
attacks on AI systems. Furthermore, we review cutting-edge methods for
enhancing attack transferability, organized around data augmentation and
optimization strategies. By consolidating fragmented research and identifying
critical future directions, this work provides a foundational roadmap for
understanding, evaluating, and defending against transferable threats in
real-world AI systems.
[LINK]
http://arxiv.org/abs/2311.11796v2
[DATE]
2025-05-12 19:25:11+08:00
[CATEGORIES]
cs.CL
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?
[AUTHORS]
Yuhang Liu, Dong Gong, Yichao Cai, Erdun Gao, Zhen Zhang, Biwei Huang, Mingming Gong, Anton van den Hengel, Javen Qinfeng Shi
[ABSTRACT]
The remarkable achievements of large language models (LLMs) have led many to
conclude that they exhibit a form of intelligence. This is as opposed to
explanations of their capabilities based on their ability to perform relatively
simple manipulations of vast volumes of data. To illuminate the distinction
between these explanations, we introduce a novel generative model that
generates tokens on the basis of human-interpretable concepts represented as
latent discrete variables. Under mild conditions, even when the mapping from
the latent space to the observed space is non-invertible, we establish an
identifiability result, i.e., the representations learned by LLMs through
next-token prediction can be approximately modeled as the logarithm of the
posterior probabilities of these latent discrete concepts given input context,
up to an invertible linear transformation. This theoretical finding not only
provides evidence that LLMs capture underlying generative factors, but also
provides a unified perspective for understanding the linear representation
hypothesis. Taking this a step further, our finding motivates a reliable
evaluation of sparse autoencoders by treating the performance of supervised
concept extractors as an upper bound. Pushing this idea even further, it
inspires a structural variant that enforces dependence among latent concepts in
addition to promoting sparsity. Empirically, we validate our theoretical
results through evaluations on both simulation data and the Pythia, Llama, and
DeepSeek model families, and demonstrate the effectiveness of our structured
sparse autoencoder.
[LINK]
http://arxiv.org/abs/2503.08980v3
[DATE]
2025-05-12 18:45:23+08:00
[CATEGORIES]
cs.LG
cs.CL
Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights
[AUTHORS]
Mostafa Mohaimen Akand Faisal, Rabeya Amin Jhuma
[ABSTRACT]
The emergence of global health crises, such as COVID-19 and Monkeypox (mpox),
has underscored the importance of understanding public sentiment to inform
effective public health strategies. This study conducts a comparative sentiment
analysis of public perceptions surrounding COVID-19 and mpox by leveraging
extensive datasets of 147,475 and 106,638 tweets, respectively. Advanced
machine learning models, including Logistic Regression, Naive Bayes, RoBERTa,
DistilRoBERTa and XLNet, were applied to perform sentiment classification, with
results indicating key trends in public emotion and discourse. The analysis
highlights significant differences in public sentiment driven by disease
characteristics, media representation, and pandemic fatigue. Through the lens
of sentiment polarity and thematic trends, this study offers valuable insights
into tailoring public health messaging, mitigating misinformation, and
fostering trust during concurrent health crises. The findings contribute to
advancing sentiment analysis applications in public health informatics, setting
the groundwork for enhanced real-time monitoring and multilingual analysis in
future research.
[LINK]
http://arxiv.org/abs/2505.07430v1
[DATE]
2025-05-12 18:37:33+08:00
[CATEGORIES]
cs.CL
cs.LG
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks
[AUTHORS]
Eva Sánchez Salido, Julio Gonzalo, Guillermo Marco
[ABSTRACT]
In LLM evaluations, reasoning is often distinguished from recall/memorization
by performing numerical variations to math-oriented questions. Here we
introduce a general variation method for multiple-choice questions that
completely dissociates the correct answer from previously seen tokens or
concepts, requiring LLMs to understand and reason (rather than memorizing) in
order to answer correctly. Using this method, we evaluate state-of-the-art
proprietary and open-source LLMs on two datasets available in English and
Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset.
Results show that all models experience remarkable accuracy drops under our
proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access
2024, ranging from 10% to 93% across models. Notably, the most accurate model
in our experimentation (OpenAI-o3-mini) is not the most robust
(DeepSeek-R1-70B), suggesting that the best models in standard evaluations may
not be the ones with better reasoning capabilities. Also, we see larger
accuracy drops in public (vs private) datasets and questions posed in their
original language (vs a manual translation), which are signs of contamination
and also point to a relevant role of recall/memorization in current LLMs’
answers.
[LINK]
http://arxiv.org/abs/2502.12896v3
[DATE]
2025-05-12 18:30:51+08:00
[CATEGORIES]
cs.CL
SEM: Reinforcement Learning for Search-Efficient Large Language Models
[AUTHORS]
Zeyang Sha, Shiwen Cui, Weiqiang Wang
[ABSTRACT]
Recent advancements in Large Language Models (LLMs) have demonstrated their
capabilities not only in reasoning but also in invoking external tools,
particularly search engines. However, teaching models to discern when to invoke
search and when to rely on their internal knowledge remains a significant
challenge. Existing reinforcement learning approaches often lead to redundant
search behaviors, resulting in inefficiencies and excess cost. In this paper, we
propose SEM, a novel post-training reinforcement learning framework that
explicitly trains LLMs to optimize search usage. By constructing a balanced
dataset combining MuSiQue and MMLU, we create scenarios where the model must
learn to distinguish between questions it can answer directly and those
requiring external retrieval. We design a structured reasoning template and
employ Group Relative Policy Optimization (GRPO) to post-train the model’s
search behaviors. Our reward function encourages accurate answering without
unnecessary search while promoting effective retrieval when needed.
Experimental results demonstrate that our method significantly reduces
redundant search operations while maintaining or improving answer accuracy
across multiple challenging benchmarks. This framework advances the model’s
reasoning efficiency and extends its capability to judiciously leverage
external knowledge.
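The abstract describes the reward only qualitatively: correct answers are rewarded, unnecessary search is discouraged, and retrieval is encouraged when it is needed. One way such a reward could be shaped is sketched below. Every name and value here (search_aware_reward, the search_needed flag, the 0.5 penalty) is a hypothetical illustration and not the paper's actual reward function.

```python
def search_aware_reward(correct, used_search, search_needed, search_penalty=0.5):
    """Hypothetical reward that discourages redundant search calls.

    correct        -- whether the final answer matched the reference
    used_search    -- whether the model invoked the search tool
    search_needed  -- assumed per-example flag marking retrieval as necessary
    search_penalty -- assumed deduction for a search that was not needed
    """
    if not correct:
        # Wrong answers earn nothing, with or without search.
        return 0.0
    if used_search and not search_needed:
        # Correct, but the search call was redundant.
        return 1.0 - search_penalty
    # Correct with no wasted search, or correct with a needed search.
    return 1.0


# Hypothetical rollouts on a question the model could answer directly.
direct = search_aware_reward(correct=True, used_search=False, search_needed=False)
redundant = search_aware_reward(correct=True, used_search=True, search_needed=False)
```

Under this shaping, a correct answer always scores above an incorrect one, and among correct answers the rollout that skipped an unneeded search scores highest.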
[LINK]
http://arxiv.org/abs/2505.07903v1
[DATE]
2025-05-12 17:45:40+08:00
[CATEGORIES]
cs.CL
HREB-CRF: Hierarchical Reduced-bias EMA for Chinese Named Entity Recognition
[AUTHORS]
Sijin Sun, Ming Deng, Xinrui Yu, Liangbin Zhao
[ABSTRACT]
Incorrect boundary division, complex semantic representation, and differences
in pronunciation and meaning often lead to errors in Chinese Named Entity
Recognition (CNER). To address these issues, this paper proposes the HREB-CRF
framework: Hierarchical Reduced-bias EMA with CRF. The proposed method
amplifies word boundaries and pools long text gradients through an exponentially
fixed-bias weighted average of local and global hierarchical attention.
Experimental results on the MSRA, Resume, and Weibo datasets show excellent F1
scores, outperforming the baseline model by 1.1%, 1.6%, and 9.8%, respectively.
The significant improvement in F1 provides evidence of the approach’s strong
effectiveness and robustness in CNER tasks.
[COMMENTS]
8 pages, 6 figures; Accepted for publication at the 2025
International Joint Conference on Neural Networks (IJCNN 2025), Rome, Italy,
30 June - 5 July
[LINK]
http://arxiv.org/abs/2503.01217v2
[DATE]
2025-05-12 17:24:06+08:00
[CATEGORIES]
cs.CL
cs.LG
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
[AUTHORS]
Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen
[ABSTRACT]
The applications of LLM Agents are becoming increasingly complex and diverse,
leading to a high demand for structured outputs that can be parsed into code,
structured function calls, and embodied agent commands. These developments
bring significant demands for structured generation in LLM inference.
Context-free grammar is a flexible approach to enable structured generation via
constrained decoding. However, executing context-free grammar requires going
through several stack states over all tokens in vocabulary during runtime,
bringing non-negligible overhead for structured generation. In this paper, we
propose XGrammar, a flexible and efficient structure generation engine for
large language models. XGrammar accelerates context-free grammar execution by
dividing the vocabulary into context-independent tokens that can be prechecked
and context-dependent tokens that need to be interpreted during runtime. We
further build transformations to expand the grammar context and reduce the
number of context-independent tokens. Additionally, we build an efficient
persistent stack to accelerate the context-dependent token checks. Finally, we
co-design the grammar engine with LLM inference engine to overlap grammar
computation with GPU executions. Evaluation results show that XGrammar can
achieve up to 100x speedup over existing solutions. Combined with an LLM
inference engine, it can achieve near-zero overhead structure generation in
end-to-end low-LLM serving.
[COMMENTS]
MLSys ’25
[LINK]
http://arxiv.org/abs/2411.15100v3
[DATE]
2025-05-12 16:20:08+08:00
[CATEGORIES]
cs.CL
The Devil Is in the Details: Tackling Unimodal Spurious Correlations for Generalizable Multimodal Reward Models
[AUTHORS]
Zichao Li, Xueru Wen, Jie Lou, Yuqiu Ji, Yaojie Lu, Xianpei Han, Debing Zhang, Le Sun
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2503.03122v3
[DATE]
2025-05-12 16:19:25+08:00
[CATEGORIES]
cs.CL
Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study
[AUTHORS]
Baixuan Xu, Chunyang Li, Weiqi Wang, Wei Fan, Tianshi Zheng, Haochen Shi, Tao Fan, Yangqiu Song, Qiang Yang
[ABSTRACT]
Designing effective collaboration structures for multi-agent LLM systems to
enhance collective reasoning is crucial yet remains under-explored. In this
paper, we systematically investigate how collaborative reasoning performance is
affected by three key design dimensions: (1) Expertise-Domain Alignment, (2)
Collaboration Paradigm (structured workflow vs. diversity-driven integration),
and (3) System Scale. Our findings reveal that expertise alignment benefits are
highly domain-contingent, proving most effective for contextual reasoning
tasks. Furthermore, collaboration focused on integrating diverse knowledge
consistently outperforms rigid task decomposition. Finally, we empirically
explore the impact of scaling the multi-agent system with expertise
specialization and study the computational trade-off, highlighting the need for
more efficient communication protocol design. This work provides concrete
guidelines for configuring specialized multi-agent systems and identifies
critical architectural trade-offs and bottlenecks for scalable multi-agent
reasoning. The code will be made available upon acceptance.
[COMMENTS]
18 pages
[LINK]
http://arxiv.org/abs/2505.07313v1
[DATE]
2025-05-12 15:59:13+08:00
[CATEGORIES]
cs.CL
A Syntax-Injected Approach for Faster and More Accurate Sentiment Analysis
[AUTHORS]
Muhammad Imran, Olga Kellert, Carlos Gómez-Rodríguez
[ABSTRACT]
Sentiment Analysis (SA) is a crucial aspect of Natural Language Processing
(NLP), addressing subjective assessments in textual content. Syntactic parsing
is useful in SA because explicit syntactic information can improve accuracy
while providing explainability, but it tends to be a computational bottleneck
in practice due to the slowness of parsing algorithms. This paper addresses
said bottleneck by using a SEquence Labeling Syntactic Parser (SELSP) to inject
syntax into SA. By treating dependency parsing as a sequence labeling problem,
we greatly enhance the speed of syntax-based SA. SELSP is trained and evaluated
on a ternary polarity classification task, demonstrating its faster performance
and better accuracy in polarity prediction tasks compared to conventional
parsers like Stanza and to heuristic approaches that use shallow syntactic
rules for SA like VADER. This increased speed and improved accuracy make SELSP
particularly appealing to SA practitioners in both research and industry. In
addition, we test several sentiment dictionaries on our SELSP to see which one
improves the performance in polarity prediction tasks. Moreover, we compare the
SELSP with Transformer-based models trained on a 5-label classification task.
The results show that dictionaries that capture polarity judgment variation
provide better results than dictionaries that ignore polarity judgment
variation. Moreover, we show that SELSP is considerably faster than
Transformer-based models in polarity prediction tasks.
[LINK]
http://arxiv.org/abs/2406.15163v2
[DATE]
2025-05-12 15:42:18+08:00
[CATEGORIES]
cs.CL
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
[AUTHORS]
Kai Hua, Steven Wu, Ge Zhang, Ke Shen
[ABSTRACT]
Recently, there has been growing interest in collecting reasoning-intensive
pretraining data to improve LLMs’ complex reasoning ability. Prior approaches
typically rely on supervised classifiers to identify such data, which requires
labeling by humans or LLMs, often introducing domain-specific biases. Because
attention heads are crucial to in-context reasoning, we propose
AttentionInfluence, a simple yet effective, training-free method that requires
no supervision signal. Our approach enables a small pretrained language model to
act as a strong data selector through a simple attention head masking
operation. Specifically, we identify retrieval heads and compute the loss
difference when masking these heads. We apply AttentionInfluence to a
1.3B-parameter dense model to conduct data selection on the SmolLM corpus of
241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B
tokens to pretrain a 7B-parameter dense model using 1T training tokens and WSD
learning rate scheduling. Our experimental results demonstrate substantial
improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive
and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and
HumanEval). This demonstrates an effective weak-to-strong scaling property,
with small models improving the final performance of larger models, offering a
promising and scalable path for reasoning-centric data selection.
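The scoring step described above (compare the loss with and without the selected heads, then keep the most affected samples) can be sketched roughly as follows. This is an illustration, not the authors' implementation: it assumes per-sample losses from the unmodified small model and from the same model with the retrieval heads masked are already available, and it assumes a relative loss difference as the score. The names `influence_scores`, `select_top`, and `keep_fraction` are hypothetical.

```python
from typing import List, Sequence


def influence_scores(base_losses: Sequence[float],
                     masked_losses: Sequence[float]) -> List[float]:
    """Relative loss increase per sample when the selected heads are masked.

    Assumption: score = (masked - base) / base. The abstract only mentions a
    "loss difference"; the normalisation is an illustrative choice.
    """
    if len(base_losses) != len(masked_losses):
        raise ValueError("loss lists must have the same length")
    return [(m - b) / b for b, m in zip(base_losses, masked_losses)]


def select_top(scores: Sequence[float], keep_fraction: float) -> List[int]:
    """Return indices of the highest-scoring samples, best first."""
    k = max(1, int(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy losses for four samples (made-up numbers, for illustration only).
base = [2.0, 1.5, 3.0, 2.5]
masked = [2.2, 3.0, 3.1, 2.5]
scores = influence_scores(base, masked)
selected = select_top(scores, keep_fraction=0.5)
```

Samples whose loss rises most under masking are treated here as the ones that depend most on the masked heads, which is the intuition the abstract gives for using them as reasoning-intensive data.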
[COMMENTS]
28 pages, 19 figures
[LINK]
http://arxiv.org/abs/2505.07293v1
[DATE]
2025-05-12 15:25:51+08:00
[CATEGORIES]
cs.CL
Semantic Retention and Extreme Compression in LLMs: Can We Have Both?
[AUTHORS]
Stanislas Laborde, Martin Cousseau, Antoun Yaacoub, Lionel Prevost
[ABSTRACT]
The exponential growth in Large Language Model (LLM) deployment has
intensified the need for efficient model compression techniques to reduce
computational and memory costs. While pruning and quantization have shown
promise, their combined potential remains largely unexplored. In this paper, we
examine joint compression and how strategically combining pruning and
quantization could yield superior performance-to-compression ratios compared to
single-method approaches. Recognizing the challenges in accurately assessing
LLM performance, we address key limitations of previous evaluation frameworks
and introduce the Semantic Retention Compression Rate (SrCr), a novel metric
that quantifies the trade-off between model compression and semantic
preservation, facilitating the optimization of pruning-quantization
configurations. Experiments demonstrate that our recommended combination
achieves, on average, a 20% performance increase compared to an equivalent
quantization-only model at the same theoretical compression rate.
[COMMENTS]
Accepted for publication in the Proceedings of the 2025 International
Joint Conference on Neural Networks (IJCNN); this arXiv version includes an
appendix with 6 result tables; 10 pages, 15 figures, 7 tables
[LINK]
http://arxiv.org/abs/2505.07289v1
[DATE]
2025-05-12 15:23:19+08:00
[CATEGORIES]
cs.CL
cs.LG
DeltaEdit: Enhancing Sequential Editing in Large Language Models by Controlling Superimposed Noise
[AUTHORS]
Ding Cao, Yuchen Cai, Rongxi Guo, Xuesong He, Guiquan Liu
[ABSTRACT]
Sequential knowledge editing techniques aim to continuously update the
knowledge in large language models at a low cost, preventing the models from
generating outdated or incorrect information. However, existing sequential
editing methods suffer from a significant decline in editing success rates
after long-term editing. Through theoretical analysis and experiments, we
identify that as the number of edits increases, the model’s output increasingly
deviates from the desired target, leading to a drop in editing success rates.
We refer to this issue as the accumulation of superimposed noise problem. To
address this, we identify the factors contributing to this deviation and
propose DeltaEdit, a novel method that optimizes update parameters through a
dynamic orthogonal constraints strategy, effectively reducing interference
between edits to mitigate deviation. Experimental results demonstrate that
DeltaEdit significantly outperforms existing methods in edit success rates and
the retention of generalization capabilities, ensuring stable and reliable
model performance even under extensive sequential editing.
[LINK]
http://arxiv.org/abs/2505.07899v1
[DATE]
2025-05-12 15:11:26+08:00
[CATEGORIES]
cs.CL
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
[AUTHORS]
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
[COMMENTS]
40 pages, 38 figures. An earlier revision of this paper was accepted
at ICML 2025. Since then, it has been updated to include new results on
training dynamics (4.7) and base models (4.8)
[LINK]
http://arxiv.org/abs/2502.17424v6
[DATE]
2025-05-12 14:51:03+08:00
[CATEGORIES]
cs.CL
cs.LG
On the Robustness of Reward Models for Language Model Alignment
[AUTHORS]
Jiwoo Hong, Noah Lee, Eunki Kim, Guijin Son, Woojin Chung, Aman Gupta, Shao Tang, James Thorne
[ABSTRACT]
The Bradley-Terry (BT) model is widely used in reward modeling for
reinforcement learning with human feedback (RLHF). Despite its effectiveness,
reward models (RMs) trained with BT model loss are prone to over-optimization,
losing generalizability to unseen input distributions. In this paper, we study
the cause of over-optimization in RM training and its downstream effects on the
RLHF procedure, accentuating the importance of distributional robustness of RMs
in unseen data. First, we show that the excessive dispersion of hidden state
norms is the main source of over-optimization. Then, we propose batch-wise
sum-to-zero regularization (BSR) to enforce zero-centered reward sum per batch,
constraining the rewards with extreme magnitudes. We assess the impact of BSR
in improving robustness in RMs through four scenarios of over-optimization,
where BSR consistently manifests better robustness. Subsequently, we compare
the plain BT model and BSR on RLHF training and empirically show that robust
RMs better align the policy to the gold preference model. Finally, we apply BSR
to high-quality data and models, which surpasses state-of-the-art RMs in the 8B
scale by more than 5% in complex preference prediction tasks. RLOO training
with the 8B RMs reduces generation length on AlpacaEval 2.0 by 40% while
increasing win rate by 7%, further highlighting that
robustness in RMs induces robustness in RLHF training. We release the code,
data, and models: https://github.com/LinkedIn-XFACT/RM-Robustness.
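As a rough illustration of the idea above, the sketch below adds a sum-to-zero penalty to the standard Bradley-Terry loss. It is not taken from the released code: the squared-sum form of the penalty, the weight `lam`, and all function names are assumptions made for this example.

```python
import math
from typing import Sequence


def bt_loss(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    # log1p(exp(-d)) is equal to -log(sigmoid(d)).
    return sum(math.log1p(math.exp(-(c - r)))
               for c, r in zip(chosen, rejected)) / len(chosen)


def bsr_penalty(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Squared sum of all rewards in the batch (assumed form of the penalty)."""
    total = sum(chosen) + sum(rejected)
    return total ** 2


def bt_loss_with_bsr(chosen: Sequence[float], rejected: Sequence[float],
                     lam: float = 0.01) -> float:
    """BT loss plus a penalty pushing the batch reward sum toward zero."""
    return bt_loss(chosen, rejected) + lam * bsr_penalty(chosen, rejected)


# Two batches with identical reward margins but different offsets.
centered = bt_loss_with_bsr([1.0, 2.0], [-1.0, -2.0])
shifted = bt_loss_with_bsr([11.0, 12.0], [9.0, 8.0])
```

The plain BT loss only sees reward differences, so both toy batches get the same BT term; the added penalty is what separates the zero-centered batch from the shifted one.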
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2505.07271v1
[DATE]
2025-05-12 14:48:26+08:00
[CATEGORIES]
cs.CL
cs.LG
No Query, No Access
[AUTHORS]
Wenqiang Wang, Siyuan Liang, Yangshijie Zhang, Xiaojun Jia, Hao Lin, Xiaochun Cao
[ABSTRACT]
Textual adversarial attacks mislead NLP models, including Large Language
Models (LLMs), by subtly modifying text. While effective, existing attacks
often require knowledge of the victim model, extensive queries, or access to
training data, limiting real-world feasibility. To overcome these constraints,
we introduce the Victim Data-based Adversarial Attack (VDBA), which
operates using only victim texts. To avoid accessing the victim model, we
create a shadow dataset with publicly available pre-trained models and
clustering methods as a foundation for developing substitute models. To address
the low attack success rate (ASR) due to insufficient information feedback, we
propose the hierarchical substitution model design, generating substitute
models to mitigate the failure of a single substitute model at the decision
boundary.
Concurrently, we use diverse adversarial example generation, employing
various attack methods to generate and select the adversarial example with
better similarity and attack effectiveness. Experiments on the Emotion and SST5
datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR
improvement of 52.08% while significantly reducing attack queries to 0. More
importantly, we discover that VDBA poses a significant threat to LLMs such as
Qwen2 and the GPT family, and achieves the highest ASR of 45.99% even without
access to the API, confirming that advanced NLP models still face serious
security risks. Our codes can be found at
https://anonymous.4open.science/r/VDBA-Victim-Data-based-Adversarial-Attack-36EC/
[LINK]
http://arxiv.org/abs/2505.07258v1
[DATE]
2025-05-12 14:19:59+08:00
[CATEGORIES]
cs.CL
DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation
[AUTHORS]
Jiashuo Sun, Xianrui Zhong, Sizhe Zhou, Jiawei Han
[ABSTRACT]
Retrieval-augmented generation (RAG) systems combine large language models
(LLMs) with external knowledge retrieval, making them highly effective for
knowledge-intensive tasks. A crucial but often under-explored component of
these systems is the reranker, which refines retrieved documents to enhance
generation quality and explainability. The challenge of selecting the optimal
number of documents (k) remains unsolved: too few may omit critical
information, while too many introduce noise and inefficiencies. Although recent
studies have explored LLM-based rerankers, they primarily leverage internal
model knowledge and overlook the rich supervisory signals that LLMs can
provide, such as using response quality as feedback for optimizing reranking
decisions. In this paper, we propose DynamicRAG, a novel RAG framework where
the reranker dynamically adjusts both the order and number of retrieved
documents based on the query. We model the reranker as an agent optimized
through reinforcement learning (RL), using rewards derived from LLM output
quality. Across seven knowledge-intensive datasets, DynamicRAG demonstrates
superior performance, achieving state-of-the-art results. The model, data and
code are available at https://github.com/GasolSun36/DynamicRAG
[COMMENTS]
24 pages, 6 figures, 15 tables
[LINK]
http://arxiv.org/abs/2505.07233v1
[DATE]
2025-05-12 13:19:01+08:00
[CATEGORIES]
cs.CL
Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models
[AUTHORS]
Bo Gao, Michael W. Spratling
[ABSTRACT]
Large language models have achieved remarkable success in recent years,
primarily due to the implementation of self-attention mechanisms. However,
traditional Softmax attention suffers from numerical instability and reduced
performance as the length of inference tokens increases. This paper addresses
these issues by decomposing the Softmax operation into a non-linear
transformation and the $l_1$-norm. We identify the latter as essential for
maintaining model performance. By replacing the non-linear transformation with
the Softplus activation function and introducing a dynamic scale factor for
different token lengths based on invariance entropy, we create a novel
attention mechanism with performance better than conventional Softmax attention
across various inference lengths. To further improve the length extrapolation
ability of the proposed attention mechanism, we introduce a novel re-weighting
mechanism that amplifies significant attention weights while diminishing weaker
ones, enabling the model to concentrate more effectively on relevant tokens.
When combined with our proposed attention mechanism, this approach maintains
nearly constant validation loss even at 16$\times$ the training token length,
ensures numerical stability, and achieves superior results on downstream
benchmarks.
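The decomposition mentioned above can be made concrete with a small sketch: Softmax is an element-wise exponential followed by $l_1$ normalization, and swapping the exponential for Softplus keeps the normalization step unchanged. The sketch is illustrative only; it omits the paper's dynamic scale factor and re-weighting mechanism, and all function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def l1_normalize(values: Sequence[float]) -> List[float]:
    """Divide each value by the l1-norm of the vector."""
    total = sum(abs(v) for v in values)
    return [v / total for v in values]


def exp_transform(scores: Sequence[float]) -> List[float]:
    """Non-linear transformation used by Softmax."""
    return [math.exp(s) for s in scores]


def softplus_transform(scores: Sequence[float]) -> List[float]:
    """Softplus, log(1 + exp(s)), written in a numerically stable form."""
    return [max(s, 0.0) + math.log1p(math.exp(-abs(s))) for s in scores]


def attention_weights(scores: Sequence[float],
                      transform: Callable[[Sequence[float]], List[float]]) -> List[float]:
    """Non-linear transformation followed by l1 normalization."""
    return l1_normalize(transform(scores))


scores = [2.0, 0.5, -1.0]
softmax_weights = attention_weights(scores, exp_transform)
softplus_weights = attention_weights(scores, softplus_transform)
```

Both variants produce non-negative weights that sum to one; only the shape of the non-linearity differs.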
[COMMENTS]
14 pages and 2 figures
[LINK]
http://arxiv.org/abs/2501.13428v3
[DATE]
2025-05-12 11:16:04+08:00
[CATEGORIES]
cs.CL
cs.LG
From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs
[AUTHORS]
Jiliang Ni, Jiachen Pu, Zhongyi Yang, Kun Zhou, Hui Wang, Xiaoliang Xiao, Dakui Wang, Xin Li, Jingfeng Luo, Conggang Hu
[ABSTRACT]
Large Language Models (LLMs) have significantly advanced artificial
intelligence by optimizing traditional Natural Language Processing (NLP)
workflows, facilitating their integration into various systems. Many such NLP
systems, including ours, directly incorporate LLMs. However, this approach
either incurs high costs or yields suboptimal performance after
fine-tuning. In this paper, we introduce a three-stage cost-efficient
end-to-end LLM deployment pipeline, comprising prototyping, knowledge transfer,
and model compression, to effectively tackle the cost-performance dilemma in
LLM-based frameworks. Its high cost-efficiency is manifested not only in
simplifying system complexity and producing super-tiny online models with
enhanced performance and reduced costs in the results, but also in addressing
development cycle constraints, the lack of extensive high-quality data, and
limited computational resources during the project development process. In the
first stage, we construct an optimal performance prototype system by
transforming complex tasks into a function call-based LLM-driven pipeline,
which serves as a teacher model to generate high-quality data. In the second
stage, we combine techniques like rejection sampling fine-tuning, reinforcement
learning, and knowledge distillation to transfer knowledge to 0.5B student
models, delivering effective performance at minimal cost. In the final stage,
we further compress models to 0.4B via quantization and pruning, achieving
ultra-low latency and cost. Extensive experimental results and the framework’s
modular design suggest cross-domain capabilities and potential applicability in
other NLP areas.
[LINK]
http://arxiv.org/abs/2504.13471v3
[DATE]
2025-05-12 10:27:53+08:00
[CATEGORIES]
cs.CL
Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs
[AUTHORS]
Yifan Wei, Xiaoyan Yu, Tengfei Pan, Angsheng Li, Li Du
[ABSTRACT]
Large language models (LLMs) have achieved unprecedented performance by
leveraging vast pretraining corpora, yet their performance remains suboptimal
in knowledge-intensive domains such as medicine and scientific research, where
high factual precision is required. While synthetic data provides a promising
avenue for augmenting domain knowledge, existing methods frequently generate
redundant samples that do not align with the model’s true knowledge gaps. To
overcome this limitation, we propose a novel Structural Entropy-guided
Knowledge Navigator (SENATOR) framework that addresses the intrinsic knowledge
deficiencies of LLMs. Our approach employs the Structure Entropy (SE) metric to
quantify uncertainty along knowledge graph paths and leverages Monte Carlo Tree
Search (MCTS) to selectively explore regions where the model lacks
domain-specific knowledge. Guided by these insights, the framework generates
targeted synthetic data for supervised fine-tuning, enabling continuous
self-improvement. Experimental results on LLaMA-3 and Qwen2 across multiple
domain-specific benchmarks show that SENATOR effectively detects and repairs
knowledge deficiencies, achieving notable performance improvements. The code
and data for our methods and experiments are available at
https://github.com/weiyifan1023/senator.
[LINK]
http://arxiv.org/abs/2505.07184v1
[DATE]
2025-05-12 10:21:36+08:00
[CATEGORIES]
cs.CL
Training and Evaluating with Human Label Variation: An Empirical Study
[AUTHORS]
Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau
[ABSTRACT]
Human label variation (HLV) challenges the standard assumption that a
labelled instance has a single ground truth, instead embracing the natural
variation in human annotation to train and evaluate models. While various
training methods and metrics for HLV have been proposed, it is still unclear
which methods and metrics perform best in what settings. We propose new
evaluation metrics for HLV leveraging fuzzy set theory. Since these newly
proposed metrics are differentiable, we also experiment with employing
them as training objectives. We conduct an extensive study over 6 HLV
datasets testing 14 training methods and 6 evaluation metrics. We find that
training on either disaggregated annotations or soft labels performs best
across metrics, outperforming training using the proposed training objectives
with differentiable metrics. We also show that our proposed soft metric is more
interpretable and correlates best with human preference.
[COMMENTS]
25 pages, 7 figures. Fixed PO-JSD values on the MFRC dataset
[LINK]
http://arxiv.org/abs/2502.01891v3
[DATE]
2025-05-12 09:35:26+08:00
[CATEGORIES]
cs.LG
cs.CL
Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models
[AUTHORS]
Zikai Xie
[COMMENTS]
8 pages, submitted to ACL ARR
[LINK]
http://arxiv.org/abs/2408.05093v4
[DATE]
2025-05-12 09:30:34+08:00
[CATEGORIES]
cs.CL
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
[AUTHORS]
Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
[ABSTRACT]
Large Language Models (LLMs) have been extensively used across diverse
domains, including virtual assistants, automated code generation, and
scientific research. However, they remain vulnerable to jailbreak attacks,
which manipulate the models into generating harmful responses despite safety
alignment. Recent studies have shown that current safety-aligned LLMs often
exhibit shallow safety alignment, where the first few tokens largely
determine whether the response will be harmful. Through comprehensive
observations, we find that safety-aligned LLMs and various defense strategies
generate highly similar initial tokens in their refusal responses, which we
define as safety trigger tokens. Building on this insight, we propose
D-STT, a simple yet effective defense algorithm that identifies and
explicitly decodes safety trigger tokens of the given safety-aligned LLM to
trigger the model’s learned safety patterns. In this process, the safety
trigger is constrained to a single token, which effectively preserves model
usability by introducing minimum intervention in the decoding process.
Extensive experiments across diverse jailbreak attacks and benign prompts
demonstrate that D-STT significantly reduces output harmfulness while
preserving model usability and incurring negligible response time overhead,
outperforming ten baseline methods.
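A toy sketch of the two steps described above (find the token that refusals most often start with, then decode it explicitly as the first output token) might look like this. It is not the authors' algorithm: whitespace tokenization, the stand-in `toy_model`, and every function name are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def find_trigger_token(refusals: Sequence[str]) -> str:
    """Most frequent first token across refusal responses.

    Assumption: whitespace tokenization stands in for the model tokenizer.
    """
    first_tokens = [r.split()[0] for r in refusals if r.strip()]
    if not first_tokens:
        raise ValueError("need at least one non-empty refusal response")
    return Counter(first_tokens).most_common(1)[0][0]


def decode_with_trigger(continue_from: Callable[[str, str], str],
                        prompt: str, trigger: str) -> str:
    """Fix the first output token to the trigger; the model writes the rest."""
    return trigger + continue_from(prompt, trigger)


def toy_model(prompt: str, prefix: str) -> str:
    """Hypothetical stand-in for an aligned LLM continuing after the prefix."""
    if "harmful" in prompt:
        return " cannot help with that."
    return " can help: here is an answer."


refusals = ["I cannot assist with that.",
            "I am sorry, but no.",
            "Sorry, I cannot do that."]
trigger = find_trigger_token(refusals)
benign = decode_with_trigger(toy_model, "benign question", trigger)
refused = decode_with_trigger(toy_model, "harmful request", trigger)
```

Because only one token is fixed, the continuation is left to the model, which is how the abstract motivates the limited impact on benign prompts.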
[LINK]
http://arxiv.org/abs/2505.07167v1
[DATE]
2025-05-12 09:26:50+08:00
[CATEGORIES]
cs.CL
KDH-MLTC: Knowledge Distillation for Healthcare Multi-Label Text Classification
[AUTHORS]
Hajar Sakai, Sarah S. Lam
[ABSTRACT]
The increasing volume of healthcare textual data requires computationally
efficient, yet highly accurate classification approaches able to handle the
nuanced and complex nature of medical terminology. This research presents
Knowledge Distillation for Healthcare Multi-Label Text Classification
(KDH-MLTC), a framework leveraging model compression and Large Language Models
(LLMs). The proposed approach addresses conventional healthcare Multi-Label
Text Classification (MLTC) challenges by integrating knowledge distillation and
sequential fine-tuning, subsequently optimized through Particle Swarm
Optimization (PSO) for hyperparameter tuning. KDH-MLTC transfers knowledge from
a more complex teacher LLM (i.e., BERT) to a lighter student LLM (i.e.,
DistilBERT) through sequential training adapted to MLTC that preserves the
teacher’s learned information while significantly reducing computational
requirements. As a result, classification can be conducted locally, which
suits sensitive healthcare textual data and therefore supports HIPAA
compliance. The experiments
conducted on three medical literature datasets of different sizes, sampled from
the Hallmark of Cancer (HoC) dataset, demonstrate that KDH-MLTC achieves
superior performance compared to existing approaches, particularly for the
largest dataset, reaching an F1 score of 82.70%. Additionally, statistical
validation and an ablation study are carried out, proving the robustness of
KDH-MLTC. Furthermore, the PSO-based hyperparameter optimization process
allowed the identification of optimal configurations. The proposed approach
contributes to healthcare text classification research, balancing efficiency
requirements in resource-constrained healthcare settings with satisfactory
accuracy demands.
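For readers unfamiliar with distillation in the multi-label setting, a generic loss of this kind can be sketched as below. This is a textbook-style formulation, not the KDH-MLTC training objective: the per-label sigmoid with temperature, the mixing weight `alpha`, and the function names are all assumptions.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce(pred: float, target: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one label; target may be a soft probability."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def distillation_loss(student_logits: Sequence[float],
                      teacher_logits: Sequence[float],
                      labels: Sequence[float],
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> float:
    """Weighted sum of a soft (teacher) term and a hard (label) term."""
    n = len(labels)
    soft = sum(bce(sigmoid(s / temperature), sigmoid(t / temperature))
               for s, t in zip(student_logits, teacher_logits)) / n
    hard = sum(bce(sigmoid(s), y)
               for s, y in zip(student_logits, labels)) / n
    return alpha * soft + (1.0 - alpha) * hard


# Toy logits for one document with three candidate labels.
teacher = [3.0, -2.0, 0.5]
labels = [1.0, 0.0, 1.0]
good_loss = distillation_loss([2.5, -1.5, 0.4], teacher, labels)
bad_loss = distillation_loss([-2.5, 1.5, -0.4], teacher, labels)
```

A student that agrees with both the teacher and the gold labels receives the lower loss, which is the signal that lets a lighter model inherit the teacher's behaviour.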
[LINK]
http://arxiv.org/abs/2505.07162v1
[DATE]
2025-05-12 08:58:25+08:00
[CATEGORIES]
cs.CL
HAMLET: Healthcare-focused Adaptive Multilingual Learning Embedding-based Topic Modeling
[AUTHORS]
Hajar Sakai, Sarah S. Lam
[ABSTRACT]
Traditional topic models often struggle with contextual nuances and fail to
adequately handle polysemy and rare words. This limitation typically results in
topics that lack coherence and quality. Large Language Models (LLMs) can
mitigate this issue by generating an initial set of topics. However, these raw
topics frequently lack refinement and representativeness, which leads to
redundancy without lexical similarity and reduced interpretability. This paper
introduces HAMLET, a graph-driven architecture for cross-lingual healthcare
topic modeling that uses LLMs. The proposed approach leverages neural-enhanced
semantic fusion to refine the embeddings of topics generated by the LLM.
Instead of relying solely on statistical co-occurrence or human interpretation
to extract topics from a document corpus, this method introduces a topic
embedding refinement that uses Bidirectional Encoder Representations from
Transformers (BERT) and Graph Neural Networks (GNN). After topic generation, a
hybrid technique that involves BERT and Sentence-BERT (SBERT) is employed for
embedding. The topic representations are further refined using a GNN, which
establishes connections between documents, topics, words, similar topics, and
similar words. A novel method is introduced to compute similarities.
Consequently, the topic embeddings are refined, and the top k topics are
extracted. Experiments were conducted using two healthcare datasets, one in
English and one in French, from which six sets were derived. The results
demonstrate the effectiveness of HAMLET.
[LINK]
http://arxiv.org/abs/2505.07157v1
[DATE]
2025-05-12 08:31:36+08:00
[CATEGORIES]
cs.CL
Reassessing Large Language Model Boolean Query Generation for Systematic Reviews
[AUTHORS]
Shuai Wang, Harrisen Scells, Bevan Koopman, Guido Zuccon
[ABSTRACT]
Systematic reviews are comprehensive literature reviews that address highly
focused research questions and represent the highest form of evidence in
medicine. A critical step in this process is the development of complex Boolean
queries to retrieve relevant literature. Given the difficulty of manually
constructing these queries, recent efforts have explored Large Language Models
(LLMs) to assist in their formulation. One of the first studies, Wang et al.,
investigated ChatGPT for this task, followed by Staudinger et al., which
evaluated multiple LLMs in a reproducibility study. However, the latter
overlooked several key aspects of the original work, including (i) validation
of generated queries, (ii) output formatting constraints, and (iii) selection
of examples for chain-of-thought (Guided) prompting. As a result, its findings
diverged significantly from the original study. In this work, we systematically
reproduce both studies while addressing these overlooked factors. Our results
show that query effectiveness varies significantly across models and prompt
designs, with guided query formulation benefiting from well-chosen seed
studies. Overall, prompt design and model selection are key drivers of
successful query formulation. Our findings provide a clearer understanding of
LLMs’ potential in Boolean query generation and highlight the importance of
model- and prompt-specific optimisations. The complex nature of systematic
reviews adds to challenges in both developing and reproducing methods but also
highlights the importance of reproducibility studies in this domain.
[COMMENTS]
Accepted in SIGIR-2025
[LINK]
http://arxiv.org/abs/2505.07155v1
[DATE]
2025-05-12 08:15:02+08:00
[CATEGORIES]
cs.CL
SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered, LLM-Guided Distillation
[AUTHORS]
Quang P. M. Pham, Khoi T. N. Nguyen, Nhi H. Doan, Cuong A. Pham, Kentaro Inui, Dezhen Song
[ABSTRACT]
Efficient path planning in robotics, particularly within large-scale, dynamic
environments, remains a significant hurdle. While Large Language Models (LLMs)
offer strong reasoning capabilities, their high computational cost and limited
adaptability in dynamic scenarios hinder real-time deployment on edge devices.
We present SmallPlan, a novel framework leveraging LLMs as teacher models to
train lightweight Small Language Models (SLMs) for high-level path planning
tasks. In SmallPlan, the SLMs provide optimal action sequences to navigate
across scene graphs that compactly represent full-scale 3D scenes. The SLMs
are trained in a simulation-powered, interleaved manner with LLM-guided
supervised fine-tuning (SFT) and reinforcement learning (RL). This strategy not
only enables SLMs to successfully complete navigation tasks but also makes them
aware of important factors like travel distance and number of trials. Through
experiments, we demonstrate that the fine-tuned SLMs perform competitively with
larger models like GPT-4o on sequential path planning, without suffering from
hallucination and overfitting. SmallPlan is resource-efficient, making it
well-suited for edge-device deployment and advancing practical autonomous
robotics. Our source code is available here:
https://github.com/quangpham2006/SmallPlan
[COMMENTS]
Paper is under review
[LINK]
http://arxiv.org/abs/2505.00831v4
[DATE]
2025-05-12 04:14:14+08:00
[CATEGORIES]
cs.CL
Understanding Learner-LLM Chatbot Interactions and the Impact of Prompting Guidelines
[AUTHORS]
Cansu Koyuturk, Emily Theophilou, Sabrina Patania, Gregor Donabauer, Andrea Martinenghi, Chiara Antico, Alessia Telari, Alessia Testa, Sathya Bursic, Franca Garzotto, Davinia Hernandez-Leo, Udo Kruschwitz, Davide Taibi, Simona Amenta, Martin Ruskov, Dimitri Ognibene
[ABSTRACT]
Large Language Models (LLMs) have transformed human-computer interaction by
enabling natural language-based communication with AI-powered chatbots. These
models are designed to be intuitive and user-friendly, allowing users to
articulate requests with minimal effort. However, despite their accessibility,
studies reveal that users often struggle with effective prompting, resulting in
inefficient responses. Existing research has highlighted both the limitations
of LLMs in interpreting vague or poorly structured prompts and the difficulties
users face in crafting precise queries. This study investigates learner-AI
interactions through an educational experiment in which participants receive
structured guidance on effective prompting. We introduce and compare three
types of prompting guidelines: a task-specific framework developed through a
structured methodology and two baseline approaches. To assess user behavior and
prompting efficacy, we analyze a dataset of 642 interactions from 107 users.
Using Von NeuMidas, an extended pragmatic annotation schema for LLM interaction
analysis, we categorize common prompting errors and identify recurring
behavioral patterns. We then evaluate the impact of different guidelines by
examining changes in user behavior, adherence to prompting strategies, and the
overall quality of AI-generated responses. Our findings provide a deeper
understanding of how users engage with LLMs and the role of structured
prompting guidance in enhancing AI-assisted communication. By comparing
different instructional frameworks, we offer insights into more effective
approaches for improving user competency in AI interactions, with implications
for AI literacy, chatbot usability, and the design of more responsive AI
systems.
[COMMENTS]
Accepted for AIED 2025, the 26th International Conference on
Artificial Intelligence in Education, July 22 - 26, 2025, Palermo, Italy
[LINK]
http://arxiv.org/abs/2504.07840v2
[DATE]
2025-05-12 03:14:59+08:00
[CATEGORIES]
cs.CL
A-MEM: Agentic Memory for LLM Agents
[AUTHORS]
Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, Yongfeng Zhang
[ABSTRACT]
While large language model (LLM) agents can effectively use external tools
for complex real-world tasks, they require memory systems to leverage
historical experiences. Current memory systems enable basic storage and
retrieval but lack sophisticated memory organization, despite recent attempts
to incorporate graph databases. Moreover, these systems’ fixed operations and
structures limit their adaptability across diverse tasks. To address this
limitation, this paper proposes a novel agentic memory system for LLM agents
that can dynamically organize memories in an agentic way. Following the basic
principles of the Zettelkasten method, we designed our memory system to create
interconnected knowledge networks through dynamic indexing and linking. When a
new memory is added, we generate a comprehensive note containing multiple
structured attributes, including contextual descriptions, keywords, and tags.
The system then analyzes historical memories to identify relevant connections,
establishing links where meaningful similarities exist. Additionally, this
process enables memory evolution - as new memories are integrated, they can
trigger updates to the contextual representations and attributes of existing
historical memories, allowing the memory network to continuously refine its
understanding. Our approach combines the structured organization principles of
Zettelkasten with the flexibility of agent-driven decision making, allowing for
more adaptive and context-aware memory management. Empirical experiments on six
foundation models show superior improvement against existing SOTA baselines.
The source code for evaluating performance is available at
https://github.com/WujiangXu/AgenticMemory, while the source code of agentic
memory system is available at https://github.com/agiresearch/A-mem.
[LINK]
http://arxiv.org/abs/2502.12110v6
[DATE]
2025-05-12 02:10:25+08:00
[CATEGORIES]
cs.CL
GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
[AUTHORS]
Yu Zhang, Changhao Pan, Wenxiang Guo, Ruiqi Li, Zhiyuan Zhu, Jialei Wang, Wenhao Xu, Jingyu Lu, Zhiqing Hong, Chuxin Wang, LiChao Zhang, Jinzheng He, Ziyue Jiang, Yuxin Chen, Chen Yang, Jiecheng Zhou, Xinyu Cheng, Zhou Zhao
[COMMENTS]
Accepted by NeurIPS 2024 (Spotlight)
[LINK]
http://arxiv.org/abs/2409.13832v6
[DATE]
2025-05-12 01:54:10+08:00
[CATEGORIES]
cs.CL
TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking
[AUTHORS]
Ching Nam Hang, Pei-Duo Yu, Chee Wei Tan
[ABSTRACT]
In the age of social media, the rapid spread of misinformation and rumors has
led to the emergence of infodemics, where false information poses a significant
threat to society. To combat this issue, we introduce TrumorGPT, a novel
generative artificial intelligence solution designed for fact-checking in the
health domain. TrumorGPT aims to distinguish “trumors”, which are
health-related rumors that turn out to be true, providing a crucial tool in
differentiating between mere speculation and verified facts. This framework
leverages a large language model (LLM) with few-shot learning for semantic
health knowledge graph construction and semantic reasoning. TrumorGPT
incorporates graph-based retrieval-augmented generation (GraphRAG) to address
the hallucination issue common in LLMs and the limitations of static training
data. GraphRAG involves accessing and utilizing information from regularly
updated semantic health knowledge graphs that consist of the latest medical
news and health information, ensuring that fact-checking by TrumorGPT is based
on the most recent data. Evaluating with extensive healthcare datasets,
TrumorGPT demonstrates superior performance in fact-checking for public health
claims. Its ability to effectively conduct fact-checking across various
platforms marks a critical step forward in the fight against health-related
misinformation, enhancing trust and accuracy in the digital information age.
[LINK]
http://arxiv.org/abs/2505.07891v1
[DATE]
2025-05-12 01:00:21+08:00
[CATEGORIES]
cs.CL
Unboxing Engagement in YouTube Influencer Videos: An Attention-Based Approach
[AUTHORS]
Prashant Rajaram, Puneet Manchanda
[ABSTRACT]
Influencer marketing has become a widely used strategy for reaching
customers. Despite growing interest among influencers and brand partners in
predicting engagement with influencer videos, there has been little research on
the relative importance of different video data modalities in predicting
engagement. We analyze unstructured data from long-form YouTube influencer
videos - spanning text, audio, and video images - using an interpretable deep
learning framework that leverages model attention to video elements. This
framework enables strong out-of-sample prediction, followed by ex-post
interpretation using a novel approach that prunes spurious associations. Our
prediction-based results reveal that “what is said” through words (text) is
more important than “how it is said” through imagery (video images) or
acoustics (audio) in predicting video engagement. Interpretation-based findings
show that during the critical onset period of a video (first 30 seconds),
auditory stimuli (e.g., brand mentions and music) are associated with sentiment
expressed in verbal engagement (comments), while visual stimuli (e.g., video
images of humans and packaged goods) are linked with sentiment expressed
through non-verbal engagement (the thumbs-up/down ratio). We validate our
approach through multiple methods, connect our findings to relevant theory, and
discuss implications for influencers, brands and agencies.
[COMMENTS]
50 pages, Online Appendix
[LINK]
http://arxiv.org/abs/2012.12311v6
[DATE]
2025-05-12 00:59:53+08:00
[CATEGORIES]
cs.LG
cs.CL
Heterogeneous Data Game: Characterizing the Model Competition Across Multiple Data Sources
[AUTHORS]
Renzhe Xu, Kang Wang, Bo Li
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2505.07688v1
[DATE]
2025-05-12 23:51:31+08:00
[CATEGORIES]
cs.LG
S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models
[AUTHORS]
Muzhi Dai, Chenxu Yang, Qingyi Si
[ABSTRACT]
As Test-Time Scaling emerges as an active research focus in the large
language model community, advanced post-training methods increasingly emphasize
extending chain-of-thought (CoT) generation length, thereby enhancing reasoning
capabilities to approach Deepseek R1-like reasoning models. However, recent
studies reveal that reasoning models (even Qwen3) consistently exhibit
excessive thought redundancy in CoT generation. This overthinking problem stems
from conventional outcome-reward reinforcement learning’s systematic neglect in
regulating intermediate reasoning steps. This paper proposes Serial-Group
Decaying-Reward Policy Optimization (namely S-GRPO), a novel reinforcement
learning method that empowers models with the capability to determine the
sufficiency of reasoning steps, subsequently triggering early exit of CoT
generation. Specifically, unlike GRPO, which samples multiple possible
completions (parallel group) in parallel, we select multiple temporal positions
in the generation of one CoT to allow the model to exit thinking and instead
generate answers (serial group), respectively. For the correct answers in a
serial group, we assign rewards that decay according to positions, with lower
rewards towards the later ones, thereby reinforcing the model’s behavior to
generate higher-quality answers at earlier phases with earlier exits of
thinking. Empirical evaluations demonstrate compatibility with state-of-the-art
reasoning models, including Qwen3 and Deepseek-distill models, achieving 35.4%
~ 61.1% sequence length reduction with 0.72% ~ 6.08% accuracy improvements
across GSM8K, AIME 2024, AMC 2023, MATH-500, and GPQA Diamond benchmarks.
[LINK]
http://arxiv.org/abs/2505.07686v1
[DATE]
2025-05-12 23:50:44+08:00
[CATEGORIES]
cs.LG
Semi-supervised Node Importance Estimation with Informative Distribution Modeling for Uncertainty Regularization
[AUTHORS]
Yankai Chen, Taotao Wang, Yixiang Fang, Yunyu Xiao
[ABSTRACT]
Node importance estimation, a classical problem in network analysis,
underpins various web applications. Previous methods either exploit intrinsic
topological characteristics, e.g., graph centrality, or leverage additional
information, e.g., data heterogeneity, for node feature enhancement. However,
these methods follow the supervised learning setting, overlooking the fact that
ground-truth node-importance data are usually partially labeled in practice. In
this work, we propose the first semi-supervised node importance estimation
framework, i.e., EASING, to improve learning quality for unlabeled data in
heterogeneous graphs. Different from previous approaches, EASING explicitly
captures uncertainty to reflect the confidence of model predictions. To jointly
estimate the importance values and uncertainties, EASING incorporates DJE, a
deep encoder-decoder neural architecture. DJE introduces distribution modeling
for graph nodes, where the distribution representations derive both importance
and uncertainty estimates. Additionally, DJE facilitates effective pseudo-label
generation for the unlabeled data to enrich the training samples. Based on
labeled and pseudo-labeled data, EASING develops effective semi-supervised
heteroscedastic learning with varying node uncertainty regularization.
Extensive experiments on three real-world datasets highlight the superior
performance of EASING compared to competing methods. Codes are available via
https://github.com/yankai-chen/EASING.
[COMMENTS]
Accepted by WWW’25. A few typos corrected
[LINK]
http://arxiv.org/abs/2503.20697v2
[DATE]
2025-05-12 23:48:40+08:00
[CATEGORIES]
cs.LG
SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
[AUTHORS]
Hang Wu, Jianian Zhu, Yinghui Li, Haojie Wang, Biao Hou, Jidong Zhai
[ABSTRACT]
Large Language Models (LLMs) present a critical trade-off between inference
quality and computational cost: larger models offer superior capabilities but
incur significant latency, while smaller models are faster but less powerful.
Existing serving strategies often employ fixed model scales or static two-stage
speculative decoding, failing to dynamically adapt to the varying complexities
of user requests or fluctuations in system performance. This paper introduces
SpecRouter, a novel framework that reimagines LLM inference as an adaptive
routing problem solved through multi-level speculative decoding. SpecRouter
dynamically constructs and optimizes inference “paths” (chains of models) based
on real-time feedback, addressing the limitations of static approaches. Our
contributions are threefold: (1) An adaptive model chain scheduling
mechanism that leverages performance profiling (execution times) and predictive
similarity metrics (derived from token distribution divergence) to continuously
select the optimal sequence of draft and verifier models, minimizing predicted
latency per generated token. (2) A multi-level collaborative
verification framework where intermediate models within the selected chain can
validate speculative tokens, reducing the verification burden on the final,
most powerful target model. (3) A synchronized state management system
providing efficient, consistent KV cache handling across heterogeneous models
in the chain, including precise, low-overhead rollbacks tailored for
asynchronous batch processing inherent in multi-level speculation. Preliminary
experiments demonstrate the validity of our method.
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2505.07680v1
[DATE]
2025-05-12 23:46:28+08:00
[CATEGORIES]
cs.LG
Geospatial Mechanistic Interpretability of Large Language Models
[AUTHORS]
Stef De Sabbata, Stefano Mizzaro, Kevin Roitero
[ABSTRACT]
Large Language Models (LLMs) have demonstrated unprecedented capabilities
across various natural language processing tasks. Their ability to process and
generate viable text and code has made them ubiquitous in many fields, while
their deployment as knowledge bases and “reasoning” tools remains an area of
ongoing research. In geography, a growing body of literature has been focusing
on evaluating LLMs’ geographical knowledge and their ability to perform spatial
reasoning. However, very little is still known about the internal functioning
of these models, especially about how they process geographical information.
In this chapter, we establish a novel framework for the study of geospatial
mechanistic interpretability - using spatial analysis to reverse engineer how
LLMs handle geographical information. Our aim is to advance our understanding
of the internal representations that these complex models generate while
processing geographical information - what one might call “how LLMs think about
geographic information” if such phrasing was not an undue anthropomorphism.
We first outline the use of probing in revealing internal structures within
LLMs. We then introduce the field of mechanistic interpretability, discussing
the superposition hypothesis and the role of sparse autoencoders in
disentangling polysemantic internal representations of LLMs into more
interpretable, monosemantic features. In our experiments, we use spatial
autocorrelation to show how features obtained for placenames display spatial
patterns related to their geographic location and can thus be interpreted
geospatially, providing insights into how these models process geographical
information. We conclude by discussing how our framework can help shape the
study and use of foundation models in geography.
[COMMENTS]
Figures 2 and 3: fixed issue with min boundary in colorbar
[LINK]
http://arxiv.org/abs/2505.03368v2
[DATE]
2025-05-12 23:44:44+08:00
[CATEGORIES]
cs.LG
Transfer Learning Across Fixed-Income Product Classes
[AUTHORS]
Nicolas Camenzind, Damir Filipovic
[ABSTRACT]
We propose a framework for transfer learning of discount curves across
different fixed-income product classes. Motivated by challenges in estimating
discount curves from sparse or noisy data, we extend kernel ridge regression
(KR) to a vector-valued setting, formulating a convex optimization problem in a
vector-valued reproducing kernel Hilbert space (RKHS). Each component of the
solution corresponds to the discount curve implied by a specific product class.
We introduce an additional regularization term motivated by economic
principles, promoting smoothness of spread curves between product classes, and
show that it leads to a valid separable kernel structure. A main theoretical
contribution is a decomposition of the vector-valued RKHS norm induced by
separable kernels. We further provide a Gaussian process interpretation of
vector-valued KR, enabling quantification of estimation uncertainty.
Illustrative examples demonstrate that transfer learning significantly improves
extrapolation performance and tightens confidence intervals compared to
single-curve estimation.
[LINK]
http://arxiv.org/abs/2505.07676v1
[DATE]
2025-05-12 23:43:29+08:00
[CATEGORIES]
cs.LG
On Kernel-based Variational Autoencoder
[AUTHORS]
Tian Qin, Wei-Min Huang
[ABSTRACT]
In this paper, we bridge Variational Autoencoders (VAEs) and kernel density
estimations (KDEs) by approximating the posterior by KDEs and deriving an upper
bound of the Kullback-Leibler (KL) divergence in the evidence lower bound
(ELBO). The flexibility of KDEs makes the optimization of posteriors in VAEs
possible, which not only addresses the limitations of Gaussian latent space in
vanilla VAE but also provides a new perspective of estimating the KL-divergence
in ELBO. Under appropriate conditions, we show that the Epanechnikov kernel is
the optimal choice in minimizing the derived upper bound of KL-divergence
asymptotically. Compared with the Gaussian kernel, the Epanechnikov kernel has
compact support, which should make the generated samples less noisy and blurry.
The implementation of the Epanechnikov kernel in the ELBO is straightforward as it lies in
the “location-scale” family of distributions where the reparametrization tricks
can be directly employed. A series of experiments on benchmark datasets such as
MNIST, Fashion-MNIST, CIFAR-10 and CelebA further demonstrate the superiority
of Epanechnikov Variational Autoencoder (EVAE) over vanilla VAE in the quality
of reconstructed images, as measured by the FID score and Sharpness.
[LINK]
http://arxiv.org/abs/2405.12783v2
[DATE]
2025-05-12 23:43:11+08:00
[CATEGORIES]
cs.LG
Joint Graph Convolution and Sequential Modeling for Scalable Network Traffic Estimation
[AUTHORS]
Nan Jiang, Wenxuan Zhu, Xu Han, Weiqiang Huang, Yumeng Sun
[ABSTRACT]
This study focuses on the challenge of predicting network traffic within
complex topological environments. It introduces a spatiotemporal modeling
approach that integrates Graph Convolutional Networks (GCN) with Gated
Recurrent Units (GRU). The GCN component captures spatial dependencies among
network nodes, while the GRU component models the temporal evolution of traffic
data. This combination allows for precise forecasting of future traffic
patterns. The effectiveness of the proposed model is validated through
comprehensive experiments on the real-world Abilene network traffic dataset.
The model is benchmarked against several popular deep learning methods.
Furthermore, a set of ablation experiments is conducted to examine the
influence of various components on performance, including changes in the number
of graph convolution layers, different temporal modeling strategies, and
methods for constructing the adjacency matrix. Results indicate that the
proposed approach achieves superior performance across multiple metrics,
demonstrating robust stability and strong generalization capabilities in
complex network traffic forecasting scenarios.
[LINK]
http://arxiv.org/abs/2505.07674v1
[DATE]
2025-05-12 23:38:19+08:00
[CATEGORIES]
cs.LG
The Pump Scheduling Problem: A Real-World Scenario for Reinforcement Learning
[AUTHORS]
Henrique Donâncio, Laurent Vercouter, Harald Roclawski
[ABSTRACT]
Deep Reinforcement Learning (DRL) has demonstrated impressive results in
domains such as games and robotics, where task formulations are well-defined.
However, few DRL benchmarks are grounded in complex, real-world environments,
where safety constraints, partial observability, and the need for
hand-engineered task representations pose significant challenges. To help
bridge this gap, we introduce a testbed based on the pump scheduling problem in
a real-world water distribution facility. The task involves controlling pumps
to ensure a reliable water supply while minimizing energy consumption and
respecting the constraints of the system. Our testbed includes a realistic
simulator, three years of high-resolution (1-minute) operational data from
human-led control, and a baseline RL task formulation. This testbed supports a
wide range of research directions, including offline RL, safe exploration,
inverse RL, and multi-objective optimization.
[LINK]
http://arxiv.org/abs/2210.11111v2
[DATE]
2025-05-12 23:37:29+08:00
[CATEGORIES]
cs.LG
Convergence of Time-Averaged Mean Field Gradient Descent Dynamics for Continuous Multi-Player Zero-Sum Games
[AUTHORS]
Yulong Lu, Pierre Monmarché
[ABSTRACT]
The approximation of mixed Nash equilibria (MNE) for zero-sum games with
mean-field interacting players has recently raised much interest in machine
learning. In this paper we propose a mean-field gradient descent dynamics for
finding the MNE of zero-sum games involving $K$ players with $K\geq 2$. The
evolution of the players’ strategy distributions follows coupled mean-field
gradient descent flows with momentum, incorporating an exponentially discounted
time-averaging of gradients. First, in the case of a fixed entropic
regularization, we prove an exponential convergence rate for the mean-field
dynamics to the mixed Nash equilibrium with respect to the total variation
metric. This improves a previous polynomial convergence rate for a similar
time-averaged dynamics with different averaging factors. Moreover, unlike
previous two-scale approaches for finding the MNE, our approach treats all
player types on the same time scale. We also show that with a suitable choice
of decreasing temperature, a simulated annealing version of the mean-field
dynamics converges to an MNE of the initial unregularized problem.
[COMMENTS]
21 pages
[LINK]
http://arxiv.org/abs/2505.07642v1
[DATE]
2025-05-12 23:12:27+08:00
[CATEGORIES]
cs.LG
Certified Data Removal Under High-dimensional Settings
[AUTHORS]
Haolin Zou, Arnab Auddy, Yongchan Kwon, Kamiar Rahnama Rad, Arian Maleki
[ABSTRACT]
Machine unlearning focuses on the computationally efficient removal of
specific training data from trained models, ensuring that the influence of
forgotten data is effectively eliminated without the need for full retraining.
Despite advances in low-dimensional settings, where the number of parameters
$p$ is much smaller than the sample size $n$, extending similar
theoretical guarantees to high-dimensional regimes remains challenging. We
propose an unlearning algorithm that starts from the original model parameters
and performs a theory-guided sequence of Newton steps ($T \in \{1,2\}$).
After this update, carefully scaled isotropic Laplacian noise is added to the
estimate to ensure that any (potential) residual influence of forget data is
completely removed. We show that when both $n, p \to \infty$ with a fixed
ratio $n/p$, significant theoretical and computational obstacles arise due
to the interplay between the complexity of the model and the finite
signal-to-noise ratio. Finally, we show that, unlike in low-dimensional
settings, a single Newton step is insufficient for effective unlearning in
high-dimensional problems – however, two steps are enough to achieve the
desired certifiability. We provide numerical experiments to support the
certifiability and accuracy claims of this approach.
[COMMENTS]
46 pages, 4 figures
[LINK]
http://arxiv.org/abs/2505.07640v1
[DATE]
2025-05-12 23:11:13+08:00
[CATEGORIES]
cs.LG
Generating Skyline Explanations for Graph Neural Networks
[AUTHORS]
Dazhuo Qiu, Haolai Che, Arijit Khan, Yinghui Wu
[ABSTRACT]
This paper proposes a novel approach to generate subgraph explanations for
graph neural networks (GNNs) that simultaneously optimize multiple measures for
explainability. Existing GNN explanation methods often compute subgraphs
(called “explanatory subgraphs”) that optimize a pre-defined, single
explainability measure, such as fidelity or conciseness. This can lead to
biased explanations that cannot provide a comprehensive explanation to clarify
the output of GNN models. We introduce skyline explanation, a GNN explanation
paradigm that aims to identify k explanatory subgraphs by simultaneously
optimizing multiple explainability measures. (1) We formulate skyline
explanation generation as a multi-objective optimization problem, and pursue
explanations that approximate a skyline set of explanatory subgraphs. We show
the hardness for skyline explanation generation. (2) We design efficient
algorithms with an onion-peeling approach that strategically removes edges from
neighbors of nodes of interest, and incrementally improves explanations as it
explores an interpretation domain, with provable quality guarantees. (3) We
further develop an algorithm to diversify explanations to provide more
comprehensive perspectives. Using real-world graphs, we empirically verify the
effectiveness, efficiency, and scalability of our algorithms.
[LINK]
http://arxiv.org/abs/2505.07635v1
[DATE]
2025-05-12 23:05:46+08:00
[CATEGORIES]
cs.LG
Efficient and Reproducible Biomedical Question Answering using Retrieval Augmented Generation
[AUTHORS]
Linus Stuhlmann, Michael Alexander Saxer, Jonathan Fürst
[ABSTRACT]
Biomedical question-answering (QA) systems require effective retrieval and
generation components to ensure accuracy, efficiency, and scalability. This
study systematically examines a Retrieval-Augmented Generation (RAG) system for
biomedical QA, evaluating retrieval strategies and response time trade-offs. We
first assess state-of-the-art retrieval methods, including BM25, BioBERT,
MedCPT, and a hybrid approach, alongside common data stores such as
Elasticsearch, MongoDB, and FAISS, on a ~10% subset of PubMed (2.4M documents)
to measure indexing efficiency, retrieval latency, and retriever performance in
the end-to-end RAG system. Based on these insights, we deploy the final RAG
system on the full 24M PubMed corpus, comparing different retrievers’ impact on
overall performance. Evaluations of the retrieval depth show that retrieving 50
documents with BM25 before reranking with MedCPT optimally balances accuracy
(0.90), recall (0.90), and response time (1.91s). BM25 retrieval time remains
stable (82ms), while MedCPT incurs the main computational cost. These results
highlight previously little-known trade-offs in retrieval depth, efficiency,
and scalability for biomedical QA. With open-source code, the system is fully
reproducible and extensible.
[COMMENTS]
Accepted at SDS25
[LINK]
http://arxiv.org/abs/2505.07917v1
[DATE]
2025-05-12 22:51:47+08:00
[CATEGORIES]
cs.LG
Higher-Order Convolution Improves Neural Predictivity in the Retina
[AUTHORS]
Simone Azeglio, Victor Calbiague Garcia, Guilhem Glaziou, Peter Neri, Olivier Marre, Ulisse Ferrari
[ABSTRACT]
We present a novel approach to neural response prediction that incorporates
higher-order operations directly within convolutional neural networks (CNNs).
Our model extends traditional 3D CNNs by embedding higher-order operations
within the convolutional operator itself, enabling direct modeling of
multiplicative interactions between neighboring pixels across space and time.
Our model increases the representational power of CNNs without increasing their
depth, therefore addressing the architectural disparity between deep artificial
networks and the relatively shallow processing hierarchy of biological visual
systems. We evaluate our approach on two distinct datasets: salamander retinal
ganglion cell (RGC) responses to natural scenes, and a new dataset of mouse RGC
responses to controlled geometric transformations. Our higher-order CNN (HoCNN)
achieves superior performance while requiring only half the training data
compared to standard architectures, demonstrating correlation coefficients up
to 0.75 with neural responses (against 0.80$\pm$0.02 retinal reliability). When
integrated into state-of-the-art architectures, our approach consistently
improves performance across different species and stimulus conditions. Analysis
of the learned representations reveals that our network naturally encodes
fundamental geometric transformations, particularly scaling parameters that
characterize object expansion and contraction. This capability is especially
relevant for specific cell types, such as transient OFF-alpha and transient ON
cells, which are known to detect looming objects and object motion
respectively, and where our model shows marked improvement in response
prediction. The correlation coefficients for scaling parameters are more than
twice as high in HoCNN (0.72) compared to baseline models (0.32).
[LINK]
http://arxiv.org/abs/2505.07620v1
[DATE]
2025-05-12 22:43:32+08:00
[CATEGORIES]
cs.LG
Neuronal correlations shape the scaling behavior of memory capacity and nonlinear computational capability of recurrent neural networks
[AUTHORS]
Shotaro Takasu, Toshio Aoyagi
[ABSTRACT]
Reservoir computing is a powerful framework for real-time information
processing, characterized by its high computational ability and quick learning,
with applications ranging from machine learning to biological systems. In this
paper, we demonstrate that the memory capacity of a reservoir recurrent neural
network scales sublinearly with the number of readout neurons. To elucidate
this phenomenon, we develop a theoretical framework for analytically deriving
memory capacity, attributing the decaying growth of memory capacity to neuronal
correlations. In addition, numerical simulations reveal that once memory
capacity becomes sublinear, increasing the number of readout neurons
successively enables nonlinear processing at progressively higher polynomial
orders. Furthermore, our theoretical framework suggests that neuronal
correlations govern not only memory capacity but also the sequential growth of
nonlinear computational capabilities. Our findings establish a foundation for
designing scalable and cost-effective reservoir computing, providing novel
insights into the interplay among neuronal correlations, linear memory, and
nonlinear processing.
[COMMENTS]
20 pages, 8 figures
[LINK]
http://arxiv.org/abs/2504.19657v2
[DATE]
2025-05-12 22:36:47+08:00
[CATEGORIES]
cs.LG
Diffused Responsibility: Analyzing the Energy Consumption of Generative Text-to-Audio Diffusion Models
[AUTHORS]
Riccardo Passoni, Francesca Ronchini, Luca Comanducci, Romain Serizel, Fabio Antonacci
[ABSTRACT]
Text-to-audio models have recently emerged as a powerful technology for
generating sound from textual descriptions. However, their high computational
demands raise concerns about energy consumption and environmental impact. In
this paper, we conduct an analysis of the energy usage of 7 state-of-the-art
text-to-audio diffusion-based generative models, evaluating to what extent
variations in generation parameters affect energy consumption at inference
time. We also aim to identify an optimal balance between audio quality and
energy consumption by considering Pareto-optimal solutions across all selected
models. Our findings provide insights into the trade-offs between performance
and environmental impact, contributing to the development of more efficient
generative audio models.
[LINK]
http://arxiv.org/abs/2505.07615v1
[DATE]
2025-05-12 22:36:47+08:00
[CATEGORIES]
cs.LG
Trial and Trust: Addressing Byzantine Attacks with Comprehensive Defense Strategy
[AUTHORS]
Gleb Molodtsov, Daniil Medyakov, Sergey Skorik, Nikolas Khachaturov, Shahane Tigranyan, Vladimir Aletov, Aram Avetisyan, Martin Takáč, Aleksandr Beznosikov
[ABSTRACT]
Recent advancements in machine learning have improved performance while also
increasing computational demands. While federated and distributed setups
address these issues, their structure is vulnerable to malicious influences. In
this paper, we address a specific threat, Byzantine attacks, where compromised
clients inject adversarial updates to derail global convergence. We combine the
trust scores concept with trial function methodology to dynamically filter
outliers. Our methods address the critical limitations of previous approaches,
allowing functionality even when Byzantine nodes are in the majority. Moreover,
our algorithms adapt to widely used scaled methods like Adam and RMSProp, as
well as practical scenarios, including local training and partial
participation. We validate the robustness of our methods by conducting
extensive experiments on both synthetic and real ECG data collected from
medical institutions. Furthermore, we provide a broad theoretical analysis of
our algorithms and their extensions to aforementioned practical setups. The
convergence guarantees of our methods are comparable to those of classical
algorithms developed without Byzantine interference.
[LINK]
http://arxiv.org/abs/2505.07614v1
[DATE]
2025-05-12 22:36:45+08:00
[CATEGORIES]
cs.LG
Nonlinear functional regression by functional deep neural network with kernel embedding
[AUTHORS]
Zhongjie Shi, Jun Fan, Linhao Song, Ding-Xuan Zhou, Johan A. K. Suykens
[ABSTRACT]
Recently, deep learning has been widely applied in functional data analysis
(FDA) with notable empirical success. However, the infinite dimensionality of
functional data necessitates an effective dimension reduction approach for
functional learning tasks, particularly in nonlinear functional regression. In
this paper, we introduce a functional deep neural network with an adaptive and
discretization-invariant dimension reduction method. Our functional network
architecture consists of three parts: first, a kernel embedding step that
features an integral transformation with an adaptive smooth kernel; next, a
projection step that utilizes eigenfunction bases based on a projection Mercer
kernel for the dimension reduction; and finally, a deep ReLU neural network is
employed for the prediction. Explicit rates of approximating nonlinear smooth
functionals across various input function spaces by our proposed functional
network are derived. Additionally, we conduct a generalization analysis for the
empirical risk minimization (ERM) algorithm applied to our functional net, by
employing a novel two-stage oracle inequality and the established functional
approximation results. Ultimately, we conduct numerical experiments on both
simulated and real datasets to demonstrate the effectiveness and benefits of
our functional net.
[LINK]
http://arxiv.org/abs/2401.02890v2
[DATE]
2025-05-12 22:30:59+08:00
[CATEGORIES]
cs.LG
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
[AUTHORS]
Paul Primus, Florian Schmid, Gerhard Widmer
[ABSTRACT]
Learning to associate audio with textual descriptions is valuable for a range
of tasks, including pretraining, zero-shot classification, audio retrieval,
audio captioning, and text-conditioned audio generation. Existing contrastive
language-audio pretrained models are typically trained using global, clip-level
descriptions, which provide only weak temporal supervision. We hypothesize that
CLAP-like language-audio models - particularly, if they are expected to produce
frame-level embeddings - can benefit from a stronger temporal supervision. To
confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio
recordings from Freesound, each annotated with single-sentence free-text
descriptions linked to a specific temporal segment in an audio recording. We
use large language models to clean these annotations by removing references to
non-audible events, transcribed speech, typos, and annotator language bias. We
further propose a frame-wise contrastive training strategy that learns to align
text descriptions with temporal regions in an audio recording and demonstrate
that our model has better temporal text-audio alignment abilities compared to
models trained only on global captions when evaluated on the AudioSet Strong
benchmark. The dataset and our source code are available on Zenodo and GitHub,
respectively.
[COMMENTS]
submitted to the IEEE Workshop on Applications of Signal Processing
to Audio and Acoustics (WASPAA), 2025. Dataset (Zenodo):
https://zenodo.org/records/15379789, Implementation (GitHub):
https://github.com/OptimusPrimus/tacos
[LINK]
http://arxiv.org/abs/2505.07609v1
[DATE]
2025-05-12 22:30:39+08:00
[CATEGORIES]
cs.LG
Multi-Objective Reinforcement Learning for Energy-Efficient Industrial Control
[AUTHORS]
Georg Schäfer, Raphael Seliger, Jakob Rehrl, Stefan Huber, Simon Hirlaender
[ABSTRACT]
Industrial automation increasingly demands energy-efficient control
strategies to balance performance with environmental and cost constraints. In
this work, we present a multi-objective reinforcement learning (MORL) framework
for energy-efficient control of the Quanser Aero 2 testbed in its
one-degree-of-freedom configuration. We design a composite reward function that
simultaneously penalizes tracking error and electrical power consumption.
Preliminary experiments explore the influence of varying the energy penalty
weight, alpha, on the trade-off between pitch tracking and energy savings. Our
results reveal a marked performance shift for alpha values between 0.0 and
0.25, with non-Pareto optimal solutions emerging at lower alpha values, on both
the simulation and the real system. We hypothesize that these effects may be
attributed to artifacts introduced by the adaptive behavior of the Adam
optimizer, which could bias the learning process and favor bang-bang control
strategies. Future work will focus on automating alpha selection through
Gaussian Process-based Pareto front modeling and transitioning the approach
from simulation to real-world deployment.
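The composite reward described above can be sketched as follows. The abstract does not state the exact functional form, so the convex weighting by alpha, the function name composite_reward, and the sample values are assumptions made purely for illustration.

```python
def composite_reward(pitch_error, power, alpha):
    """Hypothetical reward: weighted penalty on tracking error and power.

    alpha = 0.0 ignores energy entirely; alpha = 1.0 ignores tracking.
    The convex combination is an assumption, not the paper's stated formula.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    tracking_penalty = abs(pitch_error)   # deviation from the pitch reference
    energy_penalty = max(power, 0.0)      # electrical power drawn, clipped at 0
    return -((1.0 - alpha) * tracking_penalty + alpha * energy_penalty)


# Sweep the energy weight over the range the abstract highlights (0.0 to 0.25).
for alpha in (0.0, 0.1, 0.25):
    print(alpha, composite_reward(pitch_error=0.2, power=1.5, alpha=alpha))
```

Raising alpha makes the same power draw cost more reward, which is the trade-off the sweep is meant to expose.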
[COMMENTS]
Accepted at DEXA 2025 (AI4IP)
[LINK]
http://arxiv.org/abs/2505.07607v1
[DATE]
2025-05-12 22:28:42+08:00
[CATEGORIES]
cs.LG
Finite-Sample-Based Reachability for Safe Control with Gaussian Process Dynamics
[AUTHORS]
Manish Prajapat, Johannes Köhler, Amon Lahr, Andreas Krause, Melanie N. Zeilinger
[ABSTRACT]
Gaussian Process (GP) regression is shown to be effective for learning
unknown dynamics, enabling efficient and safety-aware control strategies across
diverse applications. However, existing GP-based model predictive control
(GP-MPC) methods either rely on approximations, thus lacking guarantees, or are
overly conservative, which limits their practical utility. To close this gap,
we present a sampling-based framework that efficiently propagates the model’s
epistemic uncertainty while avoiding conservatism. We establish a novel sample
complexity result that enables the construction of a reachable set using a
finite number of dynamics functions sampled from the GP posterior. Building on
this, we design a sampling-based GP-MPC scheme that is recursively feasible and
guarantees closed-loop safety and stability with high probability. Finally, we
showcase the effectiveness of our method on two numerical examples,
highlighting accurate reachable set over-approximation and safe closed-loop
performance.
[LINK]
http://arxiv.org/abs/2505.07594v1
[DATE]
2025-05-12 22:20:20+08:00
[CATEGORIES]
cs.LG
On-Device Crack Segmentation for Edge Structural Health Monitoring
[AUTHORS]
Yuxuan Zhang, Ye Xu, Luciano Sebastian Martinez-Rau, Quynh Nguyen Phuong Vu, Bengt Oelmann, Sebastian Bader
[ABSTRACT]
Crack segmentation can play a critical role in Structural Health Monitoring
(SHM) by enabling accurate identification of crack size and location, which
allows structural damage to be monitored over time. However, deploying deep
learning models for crack segmentation on resource-constrained microcontrollers
presents significant challenges due to limited memory, computational power, and
energy resources. To address these challenges, this study explores lightweight
U-Net architectures tailored for TinyML applications, focusing on three
optimization strategies: filter number reduction, network depth reduction, and
the use of Depthwise Separable Convolutions (DWConv2D). Our results demonstrate
that reducing convolution kernels and network depth significantly reduces RAM
and Flash requirements as well as inference times, albeit with some accuracy
trade-offs. Specifically, by reducing the filter number to 25%, the network
depth to four blocks, and utilizing depthwise convolutions, a good compromise
between segmentation performance and resource consumption is achieved. This
makes the network particularly suitable for low-power TinyML applications. This
study not only advances TinyML-based crack segmentation but also provides the
possibility for energy-autonomous edge SHM systems.
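To see why filter reduction and depthwise separable convolutions shrink the model, the parameter counts of one layer can be compared directly. This is a back-of-the-envelope sketch: the channel widths (64, and 16 as the 25% case) and the presence of bias terms are illustrative assumptions, not figures from the paper.

```python
def conv2d_params(c_in, c_out, k):
    """Weights plus biases of a standard k x k convolution."""
    return k * k * c_in * c_out + c_out


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel) plus a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in + c_in
    pointwise = c_in * c_out + c_out
    return depthwise + pointwise


# Hypothetical layer widths: full width versus filters reduced to 25%.
for width in (64, 16):
    standard = conv2d_params(width, width, 3)
    separable = depthwise_separable_params(width, width, 3)
    print(width, standard, separable)
```

Because the standard count grows with the product of input and output channels, cutting the filter number to a quarter reduces it by roughly a factor of sixteen, and the separable form reduces it further.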
[COMMENTS]
This paper has been accepted for the 2025 IEEE Sensors Applications
Symposium (SAS)
[LINK]
http://arxiv.org/abs/2505.07915v1
[DATE]
2025-05-12 22:17:59+08:00
[CATEGORIES]
cs.LG
LLMs Outperform Experts on Challenging Biology Benchmarks
[AUTHORS]
Lennart Justen
[ABSTRACT]
This study systematically evaluates 27 frontier Large Language Models on
eight biology benchmarks spanning molecular biology, genetics, cloning,
virology, and biosecurity. Models from major AI developers released between
November 2022 and April 2025 were assessed through ten independent runs per
benchmark. The findings reveal dramatic improvements in biological
capabilities. Top model performance increased more than 4-fold on the
challenging text-only subset of the Virology Capabilities Test over the study
period, with OpenAI’s o3 now performing twice as well as expert virologists.
Several models now match or exceed expert-level performance on other
challenging benchmarks, including the biology subsets of GPQA and WMDP and
LAB-Bench CloningScenarios. Contrary to expectations, chain-of-thought did not
substantially improve performance over zero-shot evaluation, while extended
reasoning features in o3-mini and Claude 3.7 Sonnet typically improved
performance as predicted by inference scaling. Benchmarks such as PubMedQA and
the MMLU and WMDP biology subsets exhibited performance plateaus well below
100%, suggesting benchmark saturation and errors in the underlying benchmark
data. The analysis highlights the need for more sophisticated evaluation
methodologies as AI systems continue to advance.
[LINK]
http://arxiv.org/abs/2505.06108v2
[DATE]
2025-05-12 22:17:41+08:00
[CATEGORIES]
cs.LG
Predicting solvation free energies with an implicit solvent machine learning potential
[AUTHORS]
Sebastien Röcken, Anton F. Burnet, Julija Zavadlav
[ABSTRACT]
Machine learning (ML) potentials are a powerful tool in molecular modeling,
enabling ab initio accuracy for comparably small computational costs.
Nevertheless, all-atom simulations employing best-performing graph neural
network architectures are still too expensive for applications requiring
extensive sampling, such as free energy computations. Implicit solvent models
could provide the necessary speed-up due to reduced degrees of freedom and
faster dynamics. Here, we introduce a Solvation Free Energy Path Reweighting
(ReSolv) framework to parametrize an implicit solvent ML potential for small
organic molecules that accurately predicts the hydration free energy, an
essential parameter in drug design and pollutant modeling. With a combination
of top-down (experimental hydration free energy data) and bottom-up (ab initio
data of molecules in a vacuum) learning, ReSolv bypasses the need for
intractable ab initio data of molecules in explicit bulk solvent and does not
have to resort to less accurate data-generating models. On the FreeSolv
dataset, ReSolv achieves a mean absolute error close to average experimental
uncertainty, significantly outperforming standard explicit solvent force
fields. Compared to the explicit solvent ML potential, ReSolv offers a
computational speedup of four orders of magnitude and attains closer agreement
with experiments. The presented framework paves the way toward deep molecular
models that are more accurate yet computationally cheaper than classical
atomistic models.
[LINK]
http://arxiv.org/abs/2406.00183v3
[DATE]
2025-05-12 21:56:27+08:00
[CATEGORIES]
cs.LG
Self-Supervised Transformer-based Contrastive Learning for Intrusion Detection Systems
[AUTHORS]
Ippokratis Koukoulis, Ilias Syrigos, Thanasis Korakis
[ABSTRACT]
As the digital landscape becomes more interconnected, the frequency and
severity of zero-day attacks have significantly increased, leading to an
urgent need for innovative Intrusion Detection Systems (IDS). Machine
Learning-based IDS that learn from the network traffic characteristics and can
discern attack patterns from benign traffic offer an advanced solution to
traditional signature-based IDS. However, they heavily rely on labeled
datasets, and their ability to generalize when encountering unseen traffic
patterns remains a challenge. This paper proposes a novel self-supervised
contrastive learning approach based on transformer encoders, specifically
tailored for generalizable intrusion detection on raw packet sequences. Our
proposed learning scheme employs a packet-level data augmentation strategy
combined with a transformer-based architecture to extract and generate
meaningful representations of traffic flows. Unlike traditional methods reliant
on handcrafted statistical features (NetFlow), our approach automatically
learns comprehensive packet sequence representations, significantly enhancing
performance in anomaly identification tasks and supervised learning for
intrusion detection. Our transformer-based framework exhibits better
performance in comparison to existing NetFlow self-supervised methods.
Specifically, we achieve up to a 3% higher AUC in anomaly detection for
intra-dataset evaluation and up to 20% higher AUC scores in inter-dataset
evaluation. Moreover, our model provides a strong baseline for supervised
intrusion detection with limited labeled data, exhibiting an improvement over
self-supervised NetFlow models of up to 1.5% AUC when pretrained and evaluated
on the same dataset. Additionally, we show the adaptability of our pretrained
model when fine-tuned across different datasets, demonstrating strong
performance even when lacking benign data from the target domain.
[COMMENTS]
Accepted at IFIP Networking 2025. Code available at
https://github.com/koukipp/contrastive_transformers_ids
[LINK]
http://arxiv.org/abs/2505.08816v1
[DATE]
2025-05-12 21:42:00+08:00
[CATEGORIES]
cs.LG
Combining Bayesian Inference and Reinforcement Learning for Agent Decision Making: A Review
[AUTHORS]
Chengmin Zhou, Ville Kyrki, Pasi Fränti, Laura Ruotsalainen
[ABSTRACT]
Bayesian inference has many advantages in agent decision making (e.g.
robotics/simulative agents) over a regular data-driven black-box neural network:
data-efficiency, generalization, interpretability, and safety, where these
advantages benefit directly or indirectly from the uncertainty quantification of
Bayesian inference. However, there are few comprehensive reviews to summarize
the progress of Bayesian inference on reinforcement learning (RL) for decision
making to give researchers a systematic understanding. This paper focuses on
combining Bayesian inference with RL that nowadays is an important approach in
agent decision making. To be exact, this paper discusses the following five
topics: 1) Bayesian methods that have potential for agent decision making.
First, basic Bayesian methods and models (Bayesian rule, Bayesian learning, and
Bayesian conjugate models) are discussed, followed by variational inference,
Bayesian optimization, Bayesian deep learning, Bayesian active learning,
Bayesian generative models, Bayesian meta-learning, and lifelong Bayesian
learning. 2) Classical combinations of Bayesian methods with model-based RL
(with approximation methods), model-free RL, and inverse RL. 3) Latest
combinations of potential Bayesian methods with RL. 4) Analytical comparisons
of methods that combine Bayesian methods with RL with respect to
data-efficiency, generalization, interpretability, and safety. 5) In-depth
discussions in six complex problem variants of RL, including unknown reward,
partial-observability, multi-agent, multi-task, non-linear non-Gaussian, and
hierarchical RL problems and the summary of how Bayesian methods work in the
data collection, data processing and policy learning stages of RL to pave the
way for better agent decision-making strategies.
[LINK]
http://arxiv.org/abs/2505.07911v1
[DATE]
2025-05-12 21:34:50+08:00
[CATEGORIES]
cs.LG
Gaussian entropic optimal transport: Schrödinger bridges and the Sinkhorn algorithm
[AUTHORS]
O. Deniz Akyildiz, Pierre Del Moral, Joaquín Miguez
[ABSTRACT]
Entropic optimal transport problems are regularized versions of optimal
transport problems. These models play an increasingly important role in machine
learning and generative modelling. For finite spaces, these problems are
commonly solved using the Sinkhorn algorithm (a.k.a. iterative proportional fitting
procedure). However, in more general settings the Sinkhorn iterations are based
on nonlinear conditional/conjugate transformations and exact finite-dimensional
solutions cannot be computed.
This article presents a finite-dimensional recursive formulation of the
iterative proportional fitting procedure for general Gaussian multivariate
models. As expected, this recursive formulation is closely related to the
celebrated Kalman filter and related Riccati matrix difference equations, and
it yields algorithms that can be implemented in practical settings without
further approximations. We extend this filtering methodology to develop a
refined and self-contained convergence analysis of Gaussian Sinkhorn
algorithms, including closed form expressions of entropic transport maps and
Schrödinger bridges.
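For the finite-space case mentioned at the start of the abstract, the Sinkhorn iterations reduce to alternately rescaling the rows and columns of a Gibbs kernel. The sketch below shows that textbook discrete version only, not the Gaussian recursion the article develops; the function name, the toy cost matrix, and the regularization value eps are assumptions chosen for illustration.

```python
import math


def sinkhorn(a, b, cost, eps=0.5, n_iters=200):
    """Entropic OT between discrete marginals a and b via iterative proportional fitting."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-cost_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so the plan matches marginal a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the plan matches marginal b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j.
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem: two source points, two target points, unit cost off the diagonal.
plan = sinkhorn([0.5, 0.5], [0.25, 0.75], [[0.0, 1.0], [1.0, 0.0]])
for row in plan:
    print(row)
```

Each half-step enforces one marginal exactly, and repeating the two half-steps drives the plan toward satisfying both.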
[COMMENTS]
80 pages
[LINK]
http://arxiv.org/abs/2412.18432v4
[DATE]
2025-05-12 21:33:20+08:00
[CATEGORIES]
cs.LG
Injecting Knowledge Graphs into Large Language Models
[AUTHORS]
Erica Coppolillo
[ABSTRACT]
Integrating structured knowledge from Knowledge Graphs (KGs) into Large
Language Models (LLMs) remains a key challenge for symbolic reasoning. Existing
methods mainly rely on prompt engineering or fine-tuning, which lose structural
fidelity or incur high computational costs. Building on recent encoding
techniques which integrate graph embeddings within the LLM input as tokens, we
extend this paradigm to the KG domain by leveraging Knowledge Graph Embedding
(KGE) models, thus enabling graph-aware reasoning. Our approach is
model-agnostic, resource-efficient, and compatible with any LLMs. Extensive
experimentation on synthetic and real-world datasets shows that our method
improves reasoning performance over established baselines, further achieving
the best trade-off in terms of accuracy and efficiency against state-of-the-art
LLMs.
[LINK]
http://arxiv.org/abs/2505.07554v1
[DATE]
2025-05-12 21:31:26+08:00
[CATEGORIES]
cs.LG
Noise Optimized Conditional Diffusion for Domain Adaptation
[AUTHORS]
Lingkun Luo, Shiqiang Hu, Liming Chen
[ABSTRACT]
Pseudo-labeling is a cornerstone of Unsupervised Domain Adaptation (UDA), yet
the scarcity of High-Confidence Pseudo-Labeled Target Domain Samples
(hcpl-tds) often leads to inaccurate cross-domain statistical
alignment, causing DA failures. To address this challenge, we propose
Noise Optimized Conditional Diffusion for
Domain Adaptation (NOCDDA), which seamlessly
integrates the generative capabilities of conditional diffusion models with the
decision-making requirements of DA to achieve task-coupled optimization for
efficient adaptation. For robust cross-domain consistency, we modify the DA
classifier to align with the conditional diffusion classifier within a unified
optimization framework, enabling forward training on noise-varying cross-domain
samples. Furthermore, we argue that the conventional $\mathcal{N}(\mathbf{0},
\mathbf{I})$ initialization in diffusion models often generates
class-confused hcpl-tds, compromising discriminative DA. To resolve this, we
introduce a class-aware noise optimization strategy that refines sampling
regions for reverse class-specific hcpl-tds generation, effectively enhancing
cross-domain alignment. Extensive experiments across 5 benchmark datasets and
29 DA tasks demonstrate significant performance gains of NOCDDA over
31 state-of-the-art methods, validating its robustness and effectiveness.
[COMMENTS]
9 pages, 4 figures. This work has been accepted by the International
Joint Conference on Artificial Intelligence (IJCAI 2025)
[LINK]
http://arxiv.org/abs/2505.07548v1
[DATE]
2025-05-12 21:28:31+08:00
[CATEGORIES]
cs.LG
Keep your distance: learning dispersed embeddings on $\mathbb{S}_m$
[AUTHORS]
Evgeniia Tokarchuk, Hua Chang Bakker, Vlad Niculae
[ABSTRACT]
Learning well-separated features in high-dimensional spaces, such as text or
image embeddings, is crucial for many machine learning applications. Achieving
such separation can be effectively accomplished through the dispersion of
embeddings, where unrelated vectors are pushed apart as much as possible. By
constraining features to be on a hypersphere, we can connect dispersion to
well-studied problems in mathematics and physics, where optimal solutions are
known for limited low-dimensional cases. However, in representation learning we
typically deal with a large number of features in high-dimensional space, and
moreover, dispersion is usually traded off with some other task-oriented
training objective, making existing theoretical and numerical solutions
inapplicable. Therefore, it is common to rely on gradient-based methods to
encourage dispersion, usually by minimizing some function of the pairwise
distances. In this work, we first give an overview of existing methods from
disconnected literature, making new connections and highlighting similarities.
Next, we introduce some new angles. We propose to reinterpret pairwise
dispersion using a maximum mean discrepancy (MMD) motivation. We then propose
an online variant of the celebrated Lloyd’s algorithm, of K-Means fame, as an
effective alternative regularizer for dispersion on generic domains. Finally,
we derive a novel dispersion method that directly exploits properties of the
hypersphere. Our experiments show the importance of dispersion in image
classification and natural language processing tasks, and how algorithms
exhibit different trade-offs in different regimes.
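The gradient-based baseline mentioned above, minimizing a function of the pairwise distances on a hypersphere, can be sketched in a few lines. This illustrates only that generic idea, not the MMD, Lloyd, or sphere-specific methods the paper proposes; the Gaussian potential, the temperature t, the step size, and all function names are assumptions.

```python
import math


def normalize(x):
    """Project a vector back onto the unit hypersphere."""
    norm = math.sqrt(sum(c * c for c in x))
    return [c / norm for c in x]


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dispersion_energy(points, t=1.0):
    """Mean Gaussian potential exp(-t * ||xi - xj||^2) over all pairs; lower is more dispersed."""
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(math.exp(-t * sq_dist(points[i], points[j])) for i, j in pairs) / len(pairs)


def disperse_step(points, lr=0.1, t=1.0):
    """One projected gradient step on the summed pairwise energy."""
    new_points = []
    for i, xi in enumerate(points):
        push = [0.0] * len(xi)
        for j, xj in enumerate(points):
            if i == j:
                continue
            w = math.exp(-t * sq_dist(xi, xj))
            for d in range(len(xi)):
                # Negative gradient of exp(-t * ||xi - xj||^2) with respect to xi.
                push[d] += 2.0 * t * w * (xi[d] - xj[d])
        new_points.append(normalize([c + lr * p for c, p in zip(xi, push)]))
    return new_points


# Four clustered points on the unit circle, then 200 dispersion steps.
points = [[math.cos(a), math.sin(a)] for a in (0.0, 0.1, 0.2, 0.3)]
before = dispersion_energy(points)
for _ in range(200):
    points = disperse_step(points)
print(before, dispersion_energy(points))
```

In representation learning such a term would be added as a regularizer next to the task loss, which is the trade-off the abstract points to.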
[LINK]
http://arxiv.org/abs/2502.08231v3
[DATE]
2025-05-12 21:26:51+08:00
[CATEGORIES]
cs.LG
MetaMolGen: A Neural Graph Motif Generation Model for De Novo Molecular Design
[AUTHORS]
Zimo Yan, Jie Zhang, Zheng Xie, Chang Liu, Yizhen Liu, Yiping Song
[ABSTRACT]
Molecular generation plays an important role in drug discovery and materials
science, especially in data-scarce scenarios where traditional generative
models often struggle to achieve satisfactory conditional generalization. To
address this challenge, we propose MetaMolGen, a first-order
meta-learning-based molecular generator designed for few-shot and
property-conditioned molecular generation. MetaMolGen standardizes the
distribution of graph motifs by mapping them to a normalized latent space, and
employs a lightweight autoregressive sequence model to generate SMILES
sequences that faithfully reflect the underlying molecular structure. In
addition, it supports conditional generation of molecules with target
properties through a learnable property projector integrated into the
generative process. Experimental results demonstrate that MetaMolGen
consistently generates valid and diverse SMILES sequences under low-data
regimes, outperforming conventional baselines. This highlights its advantage in
fast adaptation and efficient conditional generation for practical molecular
design.
[LINK]
http://arxiv.org/abs/2504.15587v2
[DATE]
2025-05-12 21:18:44+08:00
[CATEGORIES]
cs.LG
The Human-Data-Model Interaction Canvas for Visual Analytics
[AUTHORS]
Jürgen Bernard
[ABSTRACT]
Visual Analytics (VA) integrates humans, data, and models as key actors in
insight generation and data-driven decision-making. This position paper values
and reflects on 16 VA process models and frameworks and makes nine high-level
observations that motivate a fresh perspective on VA. The contribution is the
HDMI Canvas, a perspective to VA that complements the strengths of existing VA
process models and frameworks. It systematically characterizes diverse roles of
humans, data, and models, and how these actors benefit from and contribute to
VA processes. The descriptive power of the HDMI Canvas eases the
differentiation between a series of VA building blocks, rather than describing
general VA principles only. The canvas includes modern human-centered
methodologies, including human knowledge externalization and forms of feedback
loops, while interpretable and explainable AI highlight model contributions
beyond their conventional outputs. The HDMI Canvas has generative power,
guiding the design of new VA processes and is optimized for external
stakeholders, improving VA outreach, interdisciplinary collaboration, and
user-centered design. The utility of the HDMI Canvas is demonstrated through
two preliminary case studies.
[COMMENTS]
7 pages, 5 figures, LaTeX; to appear at the 16th International
EuroVis Workshop on Visual Analytics (EuroVA’25) as a position paper
[LINK]
http://arxiv.org/abs/2505.07534v1
[DATE]
2025-05-12 21:15:31+08:00
[CATEGORIES]
cs.LG
Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning
[AUTHORS]
Hu Wang, Congbo Ma, Ian Reid, Mohammad Yaqub
[ABSTRACT]
Reward baseline is important for Reinforcement Learning (RL) algorithms to
reduce variance in policy gradient estimates. Recently, for language modeling,
Group Relative Policy Optimization (GRPO) is proposed to compute the advantage
for each output by subtracting the mean reward, as the baseline, for all
outputs in the group. However, it can lead to inaccurate advantage estimates in
environments with highly noisy rewards, potentially introducing bias. In this
work, we propose a model, called Kalman Filter Enhanced Group Relative Policy
Optimization (KRPO), by using lightweight Kalman filtering to dynamically
estimate the latent reward mean and variance. This filtering technique replaces
the naive batch mean baseline, enabling more adaptive advantage normalization.
Our method does not require additional learned parameters over GRPO. This
approach offers a simple yet effective way to incorporate multiple outputs of
GRPO into advantage estimation, improving policy optimization in settings where
highly dynamic reward signals are difficult to model for language models.
Through experiments and analyses, we show that using a more adaptive advantage
estimation model, KRPO can improve the stability and performance of GRPO. The
code is available at https://github.com/billhhh/KRPO_LLMs_RL
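The contrast between a group-mean baseline and a filtered baseline can be sketched with a scalar Kalman filter. This is an illustrative reading of the abstract, not the released KRPO code: the random-walk model for the latent reward mean, the noise variances, the normalization by the predictive standard deviation, and all names are assumptions.

```python
import math


def group_mean_advantages(rewards):
    """Baseline described in the abstract: subtract the group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


class ScalarKalman:
    """Tracks a latent reward mean modeled as a slow random walk (assumed model)."""

    def __init__(self, mean=0.0, var=1.0, process_var=1e-2, obs_var=1.0):
        self.mean = mean
        self.var = var
        self.process_var = process_var  # how fast the latent mean may drift
        self.obs_var = obs_var          # noise level of a single reward

    def update(self, reward):
        # Predict: the latent mean may have drifted, so uncertainty grows.
        self.var += self.process_var
        # Correct: blend the prediction and the observed reward via the Kalman gain.
        gain = self.var / (self.var + self.obs_var)
        self.mean += gain * (reward - self.mean)
        self.var *= 1.0 - gain
        return self.mean, self.var


def kalman_advantages(rewards, kf):
    """Advantage = reward minus filtered mean, scaled by the predictive std (assumption)."""
    advantages = []
    for r in rewards:
        mean, var = kf.update(r)
        advantages.append((r - mean) / math.sqrt(var + kf.obs_var))
    return advantages


rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
print(group_mean_advantages(rewards))
print(kalman_advantages(rewards, ScalarKalman()))
```

The filter keeps only a running mean and variance, which is consistent with the abstract's remark that no additional learned parameters are needed.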
[LINK]
http://arxiv.org/abs/2505.07527v1
[DATE]
2025-05-12 21:09:49+08:00
[CATEGORIES]
cs.LG
MD-NOMAD: Mixture density nonlinear manifold decoder for emulating stochastic differential equations and uncertainty propagation
[AUTHORS]
Akshay Thakur, Souvik Chakraborty
[ABSTRACT]
We propose a neural operator framework, termed mixture density nonlinear
manifold decoder (MD-NOMAD), for stochastic simulators. Our approach leverages
an amalgamation of the pointwise operator learning neural architecture
nonlinear manifold decoder (NOMAD) with mixture density-based methods to
estimate conditional probability distributions for stochastic output functions.
MD-NOMAD harnesses the ability of probabilistic mixture models to estimate
complex probability and the high-dimensional scalability of pointwise neural
operator NOMAD. We conduct empirical assessments on a wide array of stochastic
ordinary and partial differential equations and present the corresponding
results, which highlight the performance of the proposed framework.
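Mixture density methods are typically trained by minimizing the negative log-likelihood of the observed output under the predicted mixture. A minimal sketch of that loss for a one-dimensional Gaussian mixture follows; the choice of Gaussian components and the function name are assumptions, since the abstract does not specify the component family.

```python
import math


def gaussian_mixture_nll(y, weights, means, stds):
    """Negative log-likelihood of y under a 1-D Gaussian mixture.

    In a mixture density network, weights, means, and stds would be the
    network outputs at a given input; here they are plain lists.
    """
    log_terms = []
    for w, mu, sigma in zip(weights, means, stds):
        log_pdf = (-0.5 * math.log(2.0 * math.pi)
                   - math.log(sigma)
                   - 0.5 * ((y - mu) / sigma) ** 2)
        log_terms.append(math.log(w) + log_pdf)
    # Log-sum-exp for numerical stability.
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))


# A bimodal predictive distribution scores an observation near either mode well.
print(gaussian_mixture_nll(1.0, [0.5, 0.5], [-1.0, 1.0], [0.3, 0.3]))
print(gaussian_mixture_nll(0.0, [0.5, 0.5], [-1.0, 1.0], [0.3, 0.3]))
```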
[LINK]
http://arxiv.org/abs/2404.15731v2
[DATE]
2025-05-12 21:09:30+08:00
[CATEGORIES]
cs.LG
Adaptive Latent-Space Constraints in Personalized FL
[AUTHORS]
Sana Ayromlou, D. B. Emerson
[COMMENTS]
14 Pages, 1 Algorithm, 3 Figures, 3 Tables
[LINK]
http://arxiv.org/abs/2505.07525v1
[DATE]
2025-05-12 21:08:54+08:00
[CATEGORIES]
cs.LG
Topological Schrödinger Bridge Matching
[AUTHORS]
Maosheng Yang
[ABSTRACT]
Given two boundary distributions, the Schrödinger Bridge (SB) problem seeks
the most likely random evolution between them with respect to a reference
process. It has revealed rich connections to recent machine learning methods
for generative modeling and distribution matching. While these methods perform
well in Euclidean domains, they are not directly applicable to topological
domains such as graphs and simplicial complexes, which are crucial for data
defined over network entities, such as node signals and edge flows. In this
work, we propose the Topological Schrödinger Bridge problem (TSBP) for
matching signal distributions on a topological domain. We set the reference
process to follow some linear tractable topology-aware stochastic dynamics such
as topological heat diffusion. For the case of Gaussian boundary distributions,
we derive a closed-form topological SB (TSB) in terms of its time-marginal and
stochastic differential. In the general case, leveraging the well-known result,
we show that the optimal process follows the forward-backward topological
dynamics governed by some unknowns. Building on these results, we develop
TSB-based models for matching topological signals by parameterizing the
unknowns in the optimal process as (topological) neural networks and learning
them through likelihood training. We validate the theoretical results and
demonstrate the practical applications of TSB-based models on both synthetic
and real-world networks, emphasizing the role of topology. Additionally, we
discuss the connections of TSB-based models to other emerging models, and
outline future directions for topological signal matching.
[COMMENTS]
ICLR 2025 Spotlight, 42 pages
[LINK]
http://arxiv.org/abs/2504.04799v2
[DATE]
2025-05-12 21:03:17+08:00
[CATEGORIES]
cs.LG
MAIS: Memory-Attention for Interactive Segmentation
[AUTHORS]
Mauricio Orbes-Arteaga, Oeslle Lucena, Sabastien Ourselin, M. Jorge Cardoso
[ABSTRACT]
Interactive medical segmentation reduces annotation effort by refining
predictions through user feedback. Vision Transformer (ViT)-based models, such
as the Segment Anything Model (SAM), achieve state-of-the-art performance using
user clicks and prior masks as prompts. However, existing methods treat
interactions as independent events, leading to redundant corrections and
limited refinement gains. We address this by introducing MAIS, a
Memory-Attention mechanism for Interactive Segmentation that stores past user
inputs and segmentation states, enabling temporal context integration. Our
approach enhances ViT-based segmentation across diverse imaging modalities,
achieving more efficient and accurate refinements.
[LINK]
http://arxiv.org/abs/2505.07511v1
[DATE]
2025-05-12 20:48:27+08:00
[CATEGORIES]
cs.LG
EAGLE: Contrastive Learning for Efficient Graph Anomaly Detection
[AUTHORS]
Jing Ren, Mingliang Hou, Zhixuan Liu, Xiaomei Bai
[ABSTRACT]
Graph anomaly detection is a popular and vital task in various real-world
scenarios, which has been studied for several decades. Recently, many studies
extending deep learning-based methods have shown preferable performance on
graph anomaly detection. However, existing methods lack the efficiency that
is necessary for embedded devices. To this end, we propose an
Efficient Anomaly detection model on heterogeneous Graphs via contrastive
LEarning (EAGLE) by contrasting abnormal nodes with normal ones in terms of
their distances to the local context. The proposed method first samples
instance pairs on meta path-level for contrastive learning. Then, a graph
autoencoder-based model is applied to learn informative node embeddings in an
unsupervised way, which will be further combined with the discriminator to
predict the anomaly scores of nodes. Experimental results show that EAGLE
outperforms the state-of-the-art methods on three heterogeneous network
datasets.
[LINK]
http://arxiv.org/abs/2505.07508v1
[DATE]
2025-05-12 20:45:07+08:00
[CATEGORIES]
cs.LG
Identifying Causal Direction via Variational Bayesian Compression
[AUTHORS]
Quang-Duy Tran, Bao Duong, Phuoc Nguyen, Thin Nguyen
[ABSTRACT]
Telling apart the cause and effect between two random variables with purely
observational data is a challenging problem that finds applications in various
scientific disciplines. A key principle utilized in this task is the
algorithmic Markov condition, which postulates that the joint distribution,
when factorized according to the causal direction, yields a more succinct
codelength compared to the anti-causal direction. Previous approaches
approximate these codelengths by relying on simple functions or Gaussian
processes (GPs) with easily evaluable complexity, compromising between model
fitness and computational complexity. To overcome these limitations, we propose
leveraging the variational Bayesian learning of neural networks as an
interpretation of the codelengths. Consequently, we can enhance the model
fitness while promoting the succinctness of the codelengths, while avoiding the
significant computational complexity of the GP-based approaches. Extensive
experiments on both synthetic and real-world benchmarks in cause-effect
identification demonstrate the effectiveness of our proposed method, surpassing
the overall performance of related complexity-based and structural causal model
regression-based approaches.
[COMMENTS]
Accepted at the 42nd International Conference on Machine Learning
(ICML2025)
[LINK]
http://arxiv.org/abs/2505.07503v1
[DATE]
2025-05-12 20:40:15+08:00
[CATEGORIES]
cs.LG
Linux Kernel Configurations at Scale: A Dataset for Performance and Evolution Analysis
[AUTHORS]
Heraldo Borges, Juliana Alves Pereira, Djamel Eddine Khelladi, Mathieu Acher
[ABSTRACT]
Configuring the Linux kernel to meet specific requirements, such as binary
size, is highly challenging due to its immense complexity, with over 15,000
interdependent options evolving rapidly across different versions. Although
several studies have explored sampling strategies and machine learning methods
to understand and predict the impact of configuration options, the literature
still lacks a comprehensive and large-scale dataset encompassing multiple
kernel versions along with detailed quantitative measurements. To bridge this
gap, we introduce LinuxData, an accessible collection of kernel configurations
spanning several kernel releases, specifically from versions 4.13 to 5.8. This
dataset, gathered through automated tools and build processes, comprises over
240,000 kernel configurations systematically labeled with compilation outcomes
and binary sizes. By providing detailed records of configuration evolution and
capturing the intricate interplay among kernel options, our dataset enables
innovative research in feature subset selection, prediction models based on
machine learning, and transfer learning across kernel versions. Throughout this
paper, we describe how the dataset has been made easily accessible via OpenML
and illustrate how it can be leveraged using only a few lines of Python code to
evaluate AI-based techniques, such as supervised machine learning. We
anticipate that this dataset will significantly enhance reproducibility and
foster new insights into configuration-space analysis at a scale that presents
unique opportunities and inherent challenges, thereby advancing our
understanding of the Linux kernel’s configurability and evolution.
[LINK]
http://arxiv.org/abs/2505.07487v1
[DATE]
2025-05-12 20:19:46+08:00
[CATEGORIES]
cs.LG
You Only Look One Step: Accelerating Backpropagation in Diffusion Sampling with Gradient Shortcuts
[AUTHORS]
Hongkun Dou, Zeyu Li, Xingyu Jiang, Hongjue Li, Lijun Yang, Wen Yao, Yue Deng
[ABSTRACT]
Diffusion models (DMs) have recently demonstrated remarkable success in
modeling large-scale data distributions. However, many downstream tasks require
guiding the generated content based on specific differentiable metrics,
typically necessitating backpropagation during the generation process. This
approach is computationally expensive, as generating with DMs often demands
tens to hundreds of recursive network calls, resulting in high memory usage and
significant time consumption. In this paper, we propose a more efficient
alternative that approaches the problem from the perspective of parallel
denoising. We show that full backpropagation throughout the entire generation
process is unnecessary. The downstream metrics can be optimized by retaining
the computational graph of only one step during generation, thus providing a
shortcut for gradient propagation. The resulting method, which we call Shortcut
Diffusion Optimization (SDO), is generic, high-performance, and computationally
lightweight, capable of optimizing all parameter types in diffusion sampling.
We demonstrate the effectiveness of SDO on several real-world tasks, including
controlling generation by optimizing latent and aligning the DMs by fine-tuning
network parameters. Compared to full backpropagation, our approach reduces
computational costs by $\sim 90\%$ while maintaining superior performance. Code
is available at https://github.com/deng-ai-lab/SDO.
[LINK]
http://arxiv.org/abs/2505.07477v1
[DATE]
2025-05-12 20:09:11+08:00
[CATEGORIES]
cs.LG
Privacy of SGD under Gaussian or Heavy-Tailed Noise: Guarantees without Gradient Clipping
[AUTHORS]
Umut Şimşekli, Mert Gürbüzbalaban, Sinan Yıldırım, Lingjiong Zhu
[ABSTRACT]
The injection of heavy-tailed noise into the iterates of stochastic gradient
descent (SGD) has garnered growing interest in recent years due to its
theoretical and empirical benefits for optimization and generalization.
However, its implications for privacy preservation remain largely unexplored.
Aiming to bridge this gap, we provide differential privacy (DP) guarantees for
noisy SGD, when the injected noise follows an $\alpha$-stable distribution,
which includes a spectrum of heavy-tailed distributions (with infinite
variance) as well as the light-tailed Gaussian distribution. Considering the
$(\epsilon, \delta)$-DP framework, we show that SGD with heavy-tailed
perturbations achieves $(0, O(1/n))$-DP for a broad class of loss functions
which can be non-convex, where $n$ is the number of data points. As a
remarkable byproduct, contrary to prior work that necessitates bounded
sensitivity for the gradients or clipping the iterates, our theory can handle
unbounded gradients without clipping, and reveals that under mild assumptions,
such a projection step is not actually necessary. Our results suggest that,
given other benefits of heavy-tails in optimization, heavy-tailed noising
schemes can be a viable alternative to their light-tailed counterparts.
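To make the noise model above concrete, the following is a minimal sketch of one SGD step perturbed by symmetric $\alpha$-stable noise, with no gradient clipping. The sampler uses the standard Chambers-Mallows-Stuck construction; the function names, the step-size and noise-scale conventions, and the per-coordinate noise are assumptions for illustration and are not taken from the paper.

```python
import math
import random


def sample_symmetric_alpha_stable(alpha: float, rng: random.Random) -> float:
    """Draw one symmetric alpha-stable sample (Chambers-Mallows-Stuck method).

    alpha = 2 gives a Gaussian with variance 2; alpha = 1 gives a Cauchy.
    """
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must lie in (0, 2]")
    u = rng.uniform(-math.pi / 2, math.pi / 2)  # uniform angle
    w = rng.expovariate(1.0)                    # unit-rate exponential
    if alpha == 1.0:
        return math.tan(u)
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))


def noisy_sgd_step(theta, grad, lr, sigma, alpha, rng):
    """One SGD step with alpha-stable noise added to every coordinate.

    Hypothetical update rule: theta - lr * grad + sigma * noise (no clipping).
    """
    return [t - lr * g + sigma * sample_symmetric_alpha_stable(alpha, rng)
            for t, g in zip(theta, grad)]


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [1.0, -1.0]
    for _ in range(100):
        grad = list(theta)  # gradient of 0.5 * ||theta||^2
        theta = noisy_sgd_step(theta, grad, lr=0.1, sigma=0.01, alpha=1.8, rng=rng)
    print(theta)
```

Setting `alpha=2.0` recovers Gaussian noise, so the same routine covers both the light-tailed and heavy-tailed regimes named in the abstract.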
[LINK]
http://arxiv.org/abs/2403.02051v2
[DATE]
2025-05-12 19:51:18+08:00
[CATEGORIES]
cs.LG
Image-Guided Microstructure Optimization using Diffusion Models: Validated with Li-Mn-rich Cathode Precursors
[AUTHORS]
Geunho Choi, Changhwan Lee, Jieun Kim, Insoo Ye, Keeyoung Jung, Inchul Park
[ABSTRACT]
Microstructure often dictates materials performance, yet it is rarely treated
as an explicit design variable because microstructure is hard to quantify,
predict, and optimize. Here, we introduce an image-centric, closed-loop
framework that makes microstructural morphology into a controllable objective
and demonstrate its use case with Li- and Mn-rich layered oxide cathode
precursors. This work presents an integrated, AI-driven framework for the
predictive design and optimization of lithium-ion battery cathode precursor
synthesis. This framework integrates a diffusion-based image generation model,
a quantitative image analysis pipeline, and a particle swarm optimization (PSO)
algorithm. By extracting key morphological descriptors such as texture,
sphericity, and median particle size (D50) from SEM images, the platform
accurately predicts SEM-like morphologies resulting from specific
coprecipitation conditions, including reaction time-, solution concentration-,
and pH-dependent structural changes. Optimization then pinpoints synthesis
parameters that yield user-defined target morphologies, as experimentally
validated by the close agreement between predicted and synthesized structures.
This framework offers a practical strategy for data-driven materials design,
enabling both forward prediction and inverse design of synthesis conditions and
paving the way toward autonomous, image-guided microstructure engineering.
[COMMENTS]
37 pages, 10 figures
[LINK]
http://arxiv.org/abs/2505.07906v1
[DATE]
2025-05-12 19:42:04+08:00
[CATEGORIES]
cs.LG
Transfer Learning with Foundational Models for Time Series Forecasting using Low-Rank Adaptations
[AUTHORS]
M. Germán-Morales, A. J. Rivera-Rivas, M. J. del Jesus Díaz, C. J. Carmona
[ABSTRACT]
Foundational Models are an emerging and widely used technique in GenAI. These
models are distinguished by their scalability and the ease with which they can
be adapted through the exploitation of Transfer Learning. The availability of
high computational power and large datasets has supported their development,
achieving a high generalization capacity due to the enormous and heterogeneous
amounts of data used in their initial training. These characteristics
contribute to a solid base that can be adapted or adjusted to a wide range of
tasks, increasing their applicability. This study proposes the methodology
LLIAM, a straightforward adaptation of a kind of FM, Large Language Models, for
the Time Series Forecasting task. An adequate time-series prompting schema and
Low-Rank Adaptations are used to enhance the knowledge of the model with
diverse time series datasets, known as the fine-tuning phase. A two-stage
study was performed to evaluate the effectiveness of the
proposed methodology. Initially, a comparison was made between the performance
of LLIAM and different state-of-the-art DL algorithms, including Recurrent
Neural Networks and Temporal Convolutional Networks, as well as an LLM-based
method, TimeLLM. Following this, a zero-shot study is presented in order to
evaluate the generalization capacity of the proposed methodology with time
series datasets from unknown domains not considered in the model training. The
outcomes of this investigation demonstrate the efficacy of LLIAM, highlighting
that this straightforward and general approach can attain competent results
without the necessity for applying complex modifications. This work also
encourages the use of available resources (such as these pre-trained models)
and efficient fine-tuning techniques to avoid unnecessary and costly training,
narrowing the gap between the goals of traditional AI and Green AI.
[LINK]
http://arxiv.org/abs/2410.11539v3
[DATE]
2025-05-12 19:26:58+08:00
[CATEGORIES]
cs.LG
Identifying Drivers of Predictive Aleatoric Uncertainty
[AUTHORS]
Pascal Iversen, Simon Witzke, Katharina Baum, Bernhard Y. Renard
[ABSTRACT]
Explainability and uncertainty quantification are key to trustable artificial
intelligence. However, the reasoning behind uncertainty estimates is generally
left unexplained. Identifying the drivers of uncertainty complements
explanations of point predictions in recognizing model limitations and
enhancing transparent decision-making. So far, explanations of uncertainties
have been rarely studied. The few exceptions rely on Bayesian neural networks
or technically intricate approaches, such as auxiliary generative models,
thereby hindering their broad adoption. We propose a straightforward approach
to explain predictive aleatoric uncertainties. We estimate uncertainty in
regression as predictive variance by adapting a neural network with a Gaussian
output distribution. Subsequently, we apply out-of-the-box explainers to the
model’s variance output. This approach can explain uncertainty influences more
reliably than complex published approaches, which we demonstrate in a synthetic
setting with a known data-generating process. We substantiate our findings with
a nuanced, quantitative benchmark including synthetic and real, tabular and
image datasets. For this, we adapt metrics from conventional XAI research to
uncertainty explanations. Overall, the proposed method explains uncertainty
estimates with few modifications to the model architecture and outperforms
more intricate methods in most settings.
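A network with a Gaussian output distribution, as described above, is usually trained by minimizing the Gaussian negative log-likelihood of the target under the predicted mean and variance. The sketch below shows that loss for a single prediction. Parameterizing the variance through its logarithm is an assumption made here for numerical convenience and is not stated in the abstract.

```python
import math


def gaussian_nll(y: float, mean: float, log_var: float) -> float:
    """Negative log-likelihood of y under N(mean, exp(log_var))."""
    var = math.exp(log_var)
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (y - mean) ** 2 / var)


if __name__ == "__main__":
    # A larger predicted variance lowers the penalty for a large error.
    print(gaussian_nll(y=3.0, mean=0.0, log_var=0.0))
    print(gaussian_nll(y=3.0, mean=0.0, log_var=2.0))
```

Once such a model is trained, an off-the-shelf explainer can be pointed at the variance output instead of the mean output, which is the step the abstract refers to.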
[COMMENTS]
Simon Witzke and Pascal Iversen contributed equally
[LINK]
http://arxiv.org/abs/2312.07252v3
[DATE]
2025-05-12 19:25:36+08:00
[CATEGORIES]
cs.LG
The Energy Cost of Artificial Intelligence Lifecycle in Communication Networks
[AUTHORS]
Shih-Kai Chou, Jernej Hribar, Vid Hanžel, Mihael Mohorčič, Carolina Fortuna
[ABSTRACT]
Artificial Intelligence (AI) is being incorporated in several optimization,
scheduling, orchestration as well as in native communication network functions.
While this paradigm shift results in increased energy consumption, quantifying
the end-to-end energy consumption of adding intelligence to such systems is
particularly challenging. Conventional metrics focus on either communication,
computation infrastructure, or model development. To address this, we propose a
new metric, the Energy Cost of AI Lifecycle (eCAL) of one AI model in a system.
eCAL captures the energy consumption throughout the development and deployment
of an AI-model providing intelligence in a wireless communication network by
analyzing the complexity of data collection and manipulation in individual
components and deriving overall and per-bit energy consumption. We show that
the better a model is and the more it is used, the more energy efficient an
inference is. For a simple case study, eCAL for making 100 inferences is 2.73
times higher than for 1000 inferences. Additionally, we have developed a
modular and extendable open-source simulation tool to enable researchers,
practitioners, and engineers to calculate the end-to-end energy cost with
various configurations and across various systems, ensuring adaptability to
diverse use cases.
[COMMENTS]
13 pages, 9 figures
[LINK]
http://arxiv.org/abs/2408.00540v3
[DATE]
2025-05-12 19:18:06+08:00
[CATEGORIES]
cs.LG
Unified Continuous Generative Models
[AUTHORS]
Peng Sun, Yi Jiang, Tao Lin
[ABSTRACT]
Recent advances in continuous generative models, including multi-step
approaches like diffusion and flow-matching (typically requiring 8-1000
sampling steps) and few-step methods such as consistency models (typically 1-8
steps), have demonstrated impressive generative performance. However, existing
work often treats these approaches as distinct paradigms, resulting in separate
training and sampling methodologies. We introduce a unified framework for
training, sampling, and analyzing these models. Our implementation, the Unified
Continuous Generative Models Trainer and Sampler (UCGM-{T,S}), achieves
state-of-the-art (SOTA) performance. For example, on ImageNet 256x256 using a
675M diffusion transformer, UCGM-T trains a multi-step model achieving 1.30 FID
in 20 steps and a few-step model reaching 1.42 FID in just 2 steps.
Additionally, applying UCGM-S to a pre-trained model (previously 1.26 FID at
250 steps) improves performance to 1.06 FID in only 40 steps. Code is available
at: https://github.com/LINs-lab/UCGM.
[COMMENTS]
https://github.com/LINs-lab/UCGM
[LINK]
http://arxiv.org/abs/2505.07447v1
[DATE]
2025-05-12 19:15:39+08:00
[CATEGORIES]
cs.LG
LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning
[AUTHORS]
Xiaotian Lin, Yanlin Qi, Yizhang Zhu, Themis Palpanas, Chengliang Chai, Nan Tang, Yuyu Luo
[ABSTRACT]
Instruction tuning has emerged as a critical paradigm for improving the
capabilities and alignment of large language models (LLMs). However, existing
iterative model-aware data selection methods incur significant computational
overhead, as they rely on repeatedly performing full-dataset model inference to
estimate sample utility for subsequent training iterations, creating a
fundamental efficiency bottleneck. In this paper, we propose LEAD, an efficient
iterative data selection framework that accurately estimates sample utility
entirely within the standard training loop, eliminating the need for costly
additional model inference. At its core, LEAD introduces Instance-Level Dynamic
Uncertainty (IDU), a theoretically grounded utility function combining
instantaneous training loss, gradient-based approximation of loss changes, and
exponential smoothing of historical loss signals. To further scale efficiently
to large datasets, LEAD employs a two-stage, coarse-to-fine selection strategy,
adaptively prioritizing informative clusters through a multi-armed bandit
mechanism, followed by precise fine-grained selection of high-utility samples
using IDU. Extensive experiments across four diverse benchmarks show that LEAD
significantly outperforms state-of-the-art methods, improving average model
performance by 6.1%-10.8% while using only 2.5% of the training data and
reducing overall training time by 5-10x.
[LINK]
http://arxiv.org/abs/2505.07437v1
[DATE]
2025-05-12 18:57:51+08:00
[CATEGORIES]
cs.LG
Data Integration with Fusion Searchlight: Classifying Brain States from Resting-state fMRI
[AUTHORS]
Simon Wein, Marco Riebel, Lisa-Marie Brunner, Caroline Nothdurfter, Rainer Rupprecht, Jens V. Schwarzbach
[ABSTRACT]
Resting-state fMRI captures spontaneous neural activity characterized by
complex spatiotemporal dynamics. Various metrics, such as local and global
brain connectivity and low-frequency amplitude fluctuations, quantify distinct
aspects of these dynamics. However, these measures are typically analyzed
independently, overlooking their interrelations and potentially limiting
analytical sensitivity. Here, we introduce the Fusion Searchlight (FuSL)
framework, which integrates complementary information from multiple
resting-state fMRI metrics. We demonstrate that combining these metrics
enhances the accuracy of pharmacological treatment prediction from rs-fMRI
data, enabling the identification of additional brain regions affected by
sedation with alprazolam. Furthermore, we leverage explainable AI to delineate
the differential contributions of each metric, which additionally improves
spatial specificity of the searchlight analysis. Moreover, this framework can
be adapted to combine information across imaging modalities or experimental
conditions, providing a versatile and interpretable tool for data fusion in
neuroimaging.
[LINK]
http://arxiv.org/abs/2412.10161v2
[DATE]
2025-05-12 18:55:31+08:00
[CATEGORIES]
cs.LG
Beyond DAGs: A Latent Partial Causal Model for Multimodal Learning
[AUTHORS]
Yuhang Liu, Zhen Zhang, Dong Gong, Erdun Gao, Biwei Huang, Mingming Gong, Anton van den Hengel, Kun Zhang, Javen Qinfeng Shi
[ABSTRACT]
Directed acyclic graphs (DAGs) are fundamental graph structures in causal
modeling, but identifying the desired DAG from observational data often
requires strong assumptions that may not hold in real-world scenarios,
especially for latent causal models and complex multimodal data. This raises
the question of whether we can relax or bypass the DAG assumption while
maintaining practical utility. In this work, we propose a novel latent partial
causal model for multimodal data, featuring two latent coupled variables,
connected by an undirected edge, to represent the transfer of knowledge across
modalities. Under specific statistical assumptions, we establish an
identifiability result, demonstrating that representations learned by
multimodal contrastive learning correspond to the latent coupled variables up
to a trivial transformation. This result deepens our understanding of why
multimodal contrastive learning works, highlights its potential for
disentanglement, and expands the utility of pre-trained models like CLIP.
Synthetic experiments confirm the robustness of our findings, even when the
assumptions are partially violated. Most importantly, experiments show that a
pre-trained CLIP model embodies disentangled representations, enabling few-shot
learning and improving domain generalization across diverse real-world
datasets. Together, these contributions push the boundaries of multimodal
contrastive learning, both theoretically and, crucially, in practical
applications.
[LINK]
http://arxiv.org/abs/2402.06223v2
[DATE]
2025-05-12 18:29:12+08:00
[CATEGORIES]
cs.LG
Neural timescales from a computational perspective
[AUTHORS]
Roxana Zeraati, Anna Levina, Jakob H. Macke, Richard Gao
[ABSTRACT]
Neural activity fluctuates over a wide range of timescales within and across
brain areas. Experimental observations suggest that diverse neural timescales
reflect information in dynamic environments. However, how timescales are
defined and measured from brain recordings varies across the literature.
Moreover, these observations do not specify the mechanisms underlying timescale
variations, nor whether specific timescales are necessary for neural
computation and brain function. Here, we synthesize three directions where
computational approaches can distill the broad set of empirical observations
into quantitative and testable theories: We review (i) how different data
analysis methods quantify timescales across distinct behavioral states and
recording modalities, (ii) how biophysical models provide mechanistic
explanations for the emergence of diverse timescales, and (iii) how
task-performing networks and machine learning models uncover the functional
relevance of neural timescales. This integrative computational perspective thus
complements experimental investigations, providing a holistic view on how
neural timescales reflect the relationship between brain structure, dynamics,
and behavior.
[COMMENTS]
21 pages, 5 figures, 3 boxes, 1 table
[LINK]
http://arxiv.org/abs/2409.02684v2
[DATE]
2025-05-12 18:25:06+08:00
[CATEGORIES]
cs.LG
Learning Penalty for Optimal Partitioning via Automatic Feature Extraction
[AUTHORS]
Tung L Nguyen, Toby Hocking
[ABSTRACT]
Changepoint detection identifies significant shifts in data sequences, making
it important in areas like finance, genetics, and healthcare. The Optimal
Partitioning algorithms efficiently detect these changes, using a penalty
parameter to limit the number of changepoints. Determining the appropriate value
for this penalty can be challenging. Traditionally, this process involved
manually extracting statistical features, such as sequence length or variance,
to make the prediction. This study proposes a novel approach that uses
recurrent neural networks to learn this penalty directly from raw sequences by
automatically extracting features. Experiments conducted on 20 benchmark
genomic datasets show that this novel method surpasses traditional methods in
partitioning accuracy in most cases.
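To show what the penalty parameter controls, here is a minimal sketch of the classic Optimal Partitioning dynamic program with a squared-error segment cost. It is the textbook quadratic-time recursion, not the learned-penalty method proposed in the paper; the function name and the choice of cost are assumptions for illustration.

```python
def optimal_partitioning(data, penalty):
    """Return changepoint indices minimizing squared-error cost + penalty per change."""
    n = len(data)
    # Prefix sums give the squared-error cost of any segment in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(data):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(start, end):
        total = s1[end] - s1[start]
        return (s2[end] - s2[start]) - total * total / (end - start)

    best = [0.0] * (n + 1)   # best[t]: optimal penalized cost of data[:t]
    best[0] = -penalty       # so that the first segment carries no penalty
    last = [0] * (n + 1)     # last[t]: start index of the final segment
    for end in range(1, n + 1):
        best[end], last[end] = min(
            (best[start] + cost(start, end) + penalty, start)
            for start in range(end)
        )

    # Walk back through the stored segment starts to recover the changepoints.
    changepoints = []
    end = n
    while last[end] > 0:
        end = last[end]
        changepoints.append(end)
    return sorted(changepoints)


if __name__ == "__main__":
    signal = [0, 0, 0, 0, 10, 10, 10, 10]
    print(optimal_partitioning(signal, penalty=1.0))     # small penalty
    print(optimal_partitioning(signal, penalty=1000.0))  # large penalty
```

A small penalty admits the change at index 4, while a very large penalty suppresses every changepoint, which is why choosing the penalty well matters.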
[COMMENTS]
9 Figures
[LINK]
http://arxiv.org/abs/2505.07413v1
[DATE]
2025-05-12 18:07:55+08:00
[CATEGORIES]
cs.LG
Self-Adaptive Gamma Context-Aware SSM-based Model for Metal Defect Detection
[AUTHORS]
Sijin Sun, Ming Deng, Xingrui Yu, Xingyu Xi, Liangbin Zhao
[ABSTRACT]
Metal defect detection is critical in industrial quality assurance, yet
existing methods struggle with grayscale variations and complex defect states,
limiting their robustness. To address these challenges, this paper proposes a
Self-Adaptive Gamma Context-Aware SSM-based model (GCM-DET). This detection
framework integrates a Dynamic Gamma Correction (GC) module to
enhance grayscale representation and optimize feature extraction for precise
defect reconstruction. A State-Space Search Management (SSM) architecture
captures robust multi-scale features, effectively handling defects of varying
shapes and scales. Focal Loss is employed to mitigate class imbalance and
refine detection accuracy. Additionally, the CD5-DET dataset is introduced,
specifically designed for port container maintenance, featuring significant
grayscale variations and intricate defect patterns. Experimental results
demonstrate that the proposed model achieves substantial improvements, with
mAP@0.5 gains of 27.6\%, 6.6\%, and 2.6\% on the CD5-DET, NEU-DET, and GC10-DET
datasets.
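Two of the ingredients named above have simple closed forms, sketched below: power-law gamma correction of a normalized pixel intensity, and the focal loss for a single example. The fixed `gamma` in `gamma_correct` stands in for the paper's self-adaptive module, whose details are not given in the abstract, and the focal-loss defaults are common values from the wider literature and are not taken from this paper.

```python
import math


def gamma_correct(pixel: float, gamma: float) -> float:
    """Apply power-law (gamma) correction to an intensity in [0, 1]."""
    return pixel ** gamma


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one example, given the probability assigned to the true class.

    The factor (1 - p_true) ** gamma down-weights examples that are already easy.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


if __name__ == "__main__":
    print(gamma_correct(0.25, 0.5))         # gamma < 1 brightens dark pixels
    print(focal_loss(0.9), focal_loss(0.1))  # easy vs. hard example
```

With `gamma=0` and `alpha=1` the focal loss reduces to ordinary cross-entropy, which makes the effect of the down-weighting factor easy to check.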
[COMMENTS]
8 pages, 5 figures; Accepted for publication at the 2025
International Joint Conference on Neural Networks (IJCNN 2025), Rome, Italy,
30 June - 5 July
[LINK]
http://arxiv.org/abs/2503.01234v3
[DATE]
2025-05-12 17:40:09+08:00
[CATEGORIES]
cs.LG
Latent Behavior Diffusion for Sequential Reaction Generation in Dyadic Setting
[AUTHORS]
Minh-Duc Nguyen, Hyung-Jeong Yang, Soo-Hyung Kim, Ji-Eun Shin, Seung-Won Kim
[ABSTRACT]
The dyadic reaction generation task involves synthesizing responsive facial
reactions that align closely with the behaviors of a conversational partner,
enhancing the naturalness and effectiveness of human-like interaction
simulations. This paper introduces a novel approach, the Latent Behavior
Diffusion Model, comprising a context-aware autoencoder and a diffusion-based
conditional generator that addresses the challenge of generating diverse and
contextually relevant facial reactions from input speaker behaviors. The
autoencoder compresses high-dimensional input features, capturing dynamic
patterns in listener reactions while condensing complex input data into a
concise latent representation, facilitating more expressive and contextually
appropriate reaction synthesis. The diffusion-based conditional generator
operates on the latent space generated by the autoencoder to predict realistic
facial reactions in a non-autoregressive manner. This approach allows for
generating diverse facial reactions that reflect subtle variations in
conversational cues and emotional states. Experimental results demonstrate the
effectiveness of our approach in achieving superior performance in dyadic
reaction synthesis tasks compared to existing methods.
[LINK]
http://arxiv.org/abs/2505.07901v1
[DATE]
2025-05-12 17:22:27+08:00
[CATEGORIES]
cs.LG
Amortized Safe Active Learning for Real-Time Data Acquisition: Pretrained Neural Policies from Simulated Nonparametric Functions
[AUTHORS]
Cen-You Li, Marc Toussaint, Barbara Rakitsch, Christoph Zimmer
[ABSTRACT]
Safe active learning (AL) is a sequential scheme for learning unknown systems
while respecting safety constraints during data acquisition. Existing methods
often rely on Gaussian processes (GPs) to model the task and safety
constraints, requiring repeated GP updates and constrained acquisition
optimization, which incur significant computation that is challenging for
real-time decision-making. We propose an amortized safe AL framework that
replaces expensive online computations with a pretrained neural policy.
Inspired by recent advances in amortized Bayesian experimental design, we turn
GPs into a pretraining simulator. We train our policy prior to the AL
deployment on simulated nonparametric functions, using Fourier feature-based GP
sampling and a differentiable, safety-aware acquisition objective. At
deployment, our policy selects safe and informative queries via a single
forward pass, eliminating the need for GP inference or constrained
optimization. This leads to substantial speed improvements while preserving
safety and learning quality. Our framework is modular and can be adapted to
unconstrained, time-sensitive AL tasks by omitting the safety requirement.
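The Fourier feature-based GP sampling mentioned above can be illustrated with random Fourier features, which turn an approximate draw from a zero-mean GP with an RBF kernel into an ordinary function that is cheap to evaluate. The sketch below handles one-dimensional inputs only; the function name, the RBF kernel choice, and the feature count are assumptions for illustration and do not reproduce the authors' simulator.

```python
import math
import random


def sample_rff_function(num_features: int, lengthscale: float, rng: random.Random):
    """Return a random function approximately drawn from a zero-mean RBF-kernel GP."""
    # Frequencies follow the spectral density of the RBF kernel: N(0, 1 / lengthscale^2).
    omegas = [rng.gauss(0.0, 1.0 / lengthscale) for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    weights = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def f(x: float) -> float:
        return scale * sum(w * math.cos(om * x + b)
                           for w, om, b in zip(weights, omegas, phases))

    return f


if __name__ == "__main__":
    f = sample_rff_function(num_features=200, lengthscale=1.0, rng=random.Random(0))
    print([round(f(x / 4.0), 3) for x in range(8)])
```

Because each sampled function is a closed-form sum of cosines, many such functions can be generated up front for pretraining without running GP inference.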
[COMMENTS]
Part of the content published earlier at arXiv:2407.17992
[LINK]
http://arxiv.org/abs/2501.15458v2
[DATE]
2025-05-12 17:21:39+08:00
[CATEGORIES]
cs.LG
AIS Data-Driven Maritime Monitoring Based on Transformer: A Comprehensive Review
[AUTHORS]
Zhiye Xie, Enmei Tu, Xianping Fu, Guoliang Yuan, Yi Han
[ABSTRACT]
With the increasing demands for safety, efficiency, and sustainability in
global shipping, Automatic Identification System (AIS) data plays an
increasingly important role in maritime monitoring. AIS data contains
spatial-temporal variation patterns of vessels that hold significant research
value in the marine domain. However, due to its massive scale, the full
potential of AIS data has long remained untapped. With its powerful sequence
modeling capabilities, particularly its ability to capture long-range
dependencies and complex temporal dynamics, the Transformer model has emerged
as an effective tool for processing AIS data. Therefore, this paper reviews the
research on Transformer-based AIS data-driven maritime monitoring, providing a
comprehensive overview of the current applications of Transformer models in the
marine field. The focus is on Transformer-based trajectory prediction methods,
behavior detection, and prediction techniques. Additionally, this paper
collects and organizes publicly available AIS datasets from the reviewed
papers, performing data filtering, cleaning, and statistical analysis. The
statistical results reveal the operational characteristics of different vessel
types, providing data support for further research on maritime monitoring
tasks. Finally, we offer valuable suggestions for future research, identifying
two promising research directions. Datasets are available at
https://github.com/eyesofworld/Maritime-Monitoring.
[LINK]
http://arxiv.org/abs/2505.07374v1
[DATE]
2025-05-12 17:17:43+08:00
[CATEGORIES]
cs.LG
Generalization Bounds and Stopping Rules for Learning with Self-Selected Data
[AUTHORS]
Julian Rodemann, James Bailie
[ABSTRACT]
Many learning paradigms self-select training data in light of previously
learned parameters. Examples include active learning, semi-supervised learning,
bandits, or boosting. Rodemann et al. (2024) unify them under the framework of
“reciprocal learning”. In this article, we address the question of how well
these methods can generalize from their self-selected samples. In particular,
we prove universal generalization bounds for reciprocal learning using covering
numbers and Wasserstein ambiguity sets. Our results require no assumptions on
the distribution of self-selected data, only verifiable conditions on the
algorithms. We prove results for both convergent and finite iteration
solutions. The latter are anytime valid, thereby giving rise to stopping rules
for a practitioner seeking to guarantee the out-of-sample performance of their
reciprocal learning algorithm. Finally, we illustrate our bounds and stopping
rules for reciprocal learning’s special case of semi-supervised learning.
[COMMENTS]
38 pages, 4 figures
[LINK]
http://arxiv.org/abs/2505.07367v1
[DATE]
2025-05-12 17:06:39+08:00
[CATEGORIES]
cs.LG
Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering
[AUTHORS]
Joris Postmus, Steven Abreu
[ABSTRACT]
Large language models have transformed AI, yet reliably controlling their
outputs remains a challenge. This paper explores activation engineering, where
outputs of pre-trained LLMs are controlled by manipulating their activations at
inference time. Unlike traditional methods using a single steering vector, we
introduce conceptors - mathematical constructs that represent sets of
activation vectors as ellipsoidal regions. Conceptors act as soft projection
matrices and offer more precise control over complex activation patterns. Our
experiments demonstrate that conceptors outperform traditional methods across
multiple steering tasks. We further use Boolean operations on conceptors for
combined steering goals that empirically outperform additively combining
steering vectors on a set of tasks. These results highlight conceptors as a
promising tool for more effective steering of LLMs. Our code is available on
github.com/jorispos/conceptorsteering.
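For readers unfamiliar with conceptors, the standard definition from the conceptor literature is given below. Here $R$ is the correlation matrix of the collected activation vectors $h$ and $\alpha$ is the aperture parameter. The abstract does not state the formula, so this is background only, and how the paper applies $C$ during steering is not specified here.

```latex
R = \mathbb{E}\left[ h h^{\top} \right], \qquad
C(R, \alpha) = R \left( R + \alpha^{-2} I \right)^{-1}
```

If $R$ has eigenvalues $s_i \ge 0$, then $C$ shares its eigenvectors and has eigenvalues $s_i / (s_i + \alpha^{-2}) \in [0, 1)$, which is the sense in which a conceptor acts as a soft projection onto the directions where the activations vary most.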
[COMMENTS]
Presented at the MINT workshop at NeurIPS 2024. v4: fix sign in
equation 10
[LINK]
http://arxiv.org/abs/2410.16314v4
[DATE]
2025-05-12 16:59:12+08:00
[CATEGORIES]
cs.LG
Inverse Covariance and Partial Correlation Matrix Estimation via Joint Partial Regression
[AUTHORS]
Samuel Erickson, Tobias Rydén
[ABSTRACT]
We present a method for estimating sparse high-dimensional inverse covariance
and partial correlation matrices, which exploits the connection between the
inverse covariance matrix and linear regression. The method is a two-stage
estimation method wherein each individual feature is regressed on all other
features while positive semi-definiteness is enforced simultaneously. We derive
non-asymptotic estimation rates for both inverse covariance and partial
correlation matrix estimation. An efficient proximal splitting algorithm for
numerically computing the estimate is also derived. The effectiveness of the
proposed method is demonstrated on both synthetic and real-world data.
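The connection between the inverse covariance matrix and linear regression that the method exploits is a standard identity, stated below for reference. The notation is an assumption of this note: $\Theta = \Sigma^{-1}$ is the inverse covariance of a zero-mean random vector $X$, and $\beta_{jk}$ are the population coefficients from regressing $X_j$ on all other features.

```latex
X_j = \sum_{k \neq j} \beta_{jk} X_k + \varepsilon_j, \qquad
\beta_{jk} = -\frac{\Theta_{jk}}{\Theta_{jj}}, \qquad
\operatorname{Var}(\varepsilon_j) = \frac{1}{\Theta_{jj}}, \qquad
\rho_{jk} = -\frac{\Theta_{jk}}{\sqrt{\Theta_{jj}\,\Theta_{kk}}}
```

Here $\rho_{jk}$ is the partial correlation between $X_j$ and $X_k$ given the remaining features, so a zero entry in $\Theta$ corresponds to both a zero regression coefficient and a zero partial correlation.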
[LINK]
http://arxiv.org/abs/2502.08414v2
[DATE]
2025-05-12 16:57:55+08:00
[CATEGORIES]
cs.LG
Decentralized Adversarial Training over Graphs
[AUTHORS]
Ying Cao, Elsa Rizk, Stefan Vlaski, Ali H. Sayed
[ABSTRACT]
The vulnerability of machine learning models to adversarial attacks has been
attracting considerable attention in recent years. Most existing studies focus
on the behavior of stand-alone single-agent learners. In comparison, this work
studies adversarial training over graphs, where individual agents are subjected
to perturbations of varied strength levels across space. It is expected that
interactions by linked agents, and the heterogeneity of the attack models that
are possible over the graph, can help enhance robustness in view of the
coordination power of the group. Using a min-max formulation of distributed
learning, we develop a decentralized adversarial training framework for
multi-agent systems. Specifically, we devise two decentralized adversarial
training algorithms by relying on two popular decentralized learning
strategies: diffusion and consensus. We analyze the convergence properties of
the proposed framework for strongly-convex, convex, and non-convex
environments, and illustrate the enhanced robustness to adversarial attacks.
[COMMENTS]
arXiv admin note: text overlap with arXiv:2303.01936
[LINK]
http://arxiv.org/abs/2303.13326v3
[DATE]
2025-05-12 16:53:31+08:00
[CATEGORIES]
cs.LG
From Search To Sampling: Generative Models For Robust Algorithmic Recourse
[AUTHORS]
Prateek Garg, Lokesh Nagalapatti, Sunita Sarawagi
[ABSTRACT]
Algorithmic Recourse provides recommendations to individuals who are
adversely impacted by automated model decisions, on how to alter their profiles
to achieve a favorable outcome. Effective recourse methods must balance three
conflicting goals: proximity to the original profile to minimize cost,
plausibility for realistic recourse, and validity to ensure the desired
outcome. We show that existing methods train for these objectives separately
and then search for recourse through a joint optimization over the recourse
goals during inference, leading to poor recourse recommendations. We introduce
GenRe, a generative recourse model designed to train the three recourse
objectives jointly. Training such generative models is non-trivial due to the lack
of direct recourse supervision. We propose efficient ways to synthesize such
supervision and further show that GenRe’s training leads to a consistent
estimator. Unlike most prior methods, which employ non-robust
gradient-descent-based search during inference, GenRe simply performs forward sampling over
the generative model to produce minimum cost recourse, leading to superior
performance across multiple metrics. We also demonstrate GenRe provides the
best trade-off between cost, plausibility and validity, compared to
state-of-the-art baselines. Our code is available at:
https://github.com/prateekgargx/genre.
[LINK]
http://arxiv.org/abs/2505.07351v1
[DATE]
2025-05-12 16:44:28+08:00
[CATEGORIES]
cs.LG
UniMoMo: Unified Generative Modeling of 3D Molecules for De Novo Binder Design
[AUTHORS]
Xiangzhe Kong, Zishen Zhang, Ziting Zhang, Rui Jiao, Jianzhu Ma, Wenbing Huang, Kai Liu, Yang Liu
[ABSTRACT]
The design of target-specific molecules such as small molecules, peptides,
and antibodies is vital for biological research and drug discovery. Existing
generative methods are restricted to single-domain molecules, failing to
address versatile therapeutic needs or utilize cross-domain transferability to
enhance model performance. In this paper, we introduce Unified generative
Modeling of 3D Molecules (UniMoMo), the first framework capable of designing
binders of multiple molecular domains using a single model. In particular,
UniMoMo unifies the representations of different molecules as graphs of blocks,
where each block corresponds to either a standard amino acid or a molecular
fragment. Subsequently, UniMoMo utilizes a geometric latent diffusion model for
3D molecular generation, featuring an iterative full-atom autoencoder to
compress blocks into latent space points, followed by an E(3)-equivariant
diffusion process. Extensive benchmarks across peptides, antibodies, and small
molecules demonstrate the superiority of our unified framework over existing
domain-specific models, highlighting the benefits of multi-domain training.
[COMMENTS]
Accepted to ICML 2025
[LINK]
http://arxiv.org/abs/2503.19300v2
[DATE]
2025-05-12 16:35:47+08:00
[CATEGORIES]
cs.LG
Stochastic Variational Propagation: Local, Scalable and Efficient Alternative to Backpropagation
[AUTHORS]
Bojian Yin, Federico Corradi
[ABSTRACT]
Backpropagation (BP) is the cornerstone of deep learning, but its reliance on
global gradient synchronization limits scalability and imposes significant
memory overhead. We propose Stochastic Variational Propagation (SVP), a
scalable alternative that reframes training as hierarchical variational
inference. SVP treats layer activations as latent variables and optimizes local
Evidence Lower Bounds (ELBOs), enabling independent, local updates while
preserving global coherence. However, directly applying KL divergence in
layer-wise ELBOs risks inter-layer representation collapse due to excessive
compression. To prevent this, SVP projects activations into low-dimensional
spaces via fixed random matrices, ensuring information preservation and
representational diversity. Combined with a feature alignment loss for
inter-layer consistency, SVP achieves competitive accuracy with BP across
diverse architectures (MLPs, CNNs, Transformers) and datasets (MNIST to
ImageNet), reduces memory usage by up to 4x, and significantly improves
scalability. More broadly, SVP introduces a probabilistic perspective to deep
representation learning, opening pathways toward more modular and interpretable
neural network design.
[COMMENTS]
14 pages, 5 figures
[LINK]
http://arxiv.org/abs/2505.05181v2
[DATE]
2025-05-12 16:27:14+08:00
[CATEGORIES]
cs.LG
Towards Understanding Deep Learning Model in Image Recognition via Coverage Test
[AUTHORS]
Wenkai Li, Xiaoqi Li, Yingjie Mao, Yishun Wang
[ABSTRACT]
Deep neural networks (DNNs) play a crucial role in the field of artificial
intelligence, and their security-related testing has been a prominent research
focus. By inputting test cases, the behavior of models is examined for
anomalies, and coverage metrics are utilized to determine the extent of neurons
covered by these test cases. With the widespread application and advancement of
DNNs, different types of neural behaviors have garnered attention, leading to
the emergence of various coverage metrics for neural networks. However, there
is currently a lack of empirical research on these coverage metrics,
specifically in analyzing the relationships and patterns between model depth,
configuration information, and neural network coverage. This paper aims to
investigate the relationships and patterns of four coverage metrics: primary
functionality, boundary, hierarchy, and structural coverage. A series of
empirical experiments were conducted, selecting LeNet, VGG, and ResNet as
different DNN architectures, along with 10 models of varying depths ranging
from 5 to 54 layers, to compare and study the relationships between different
depths, configuration information, and various neural network coverage metrics.
Additionally, an investigation was carried out on the relationships between
modified decision/condition coverage and dataset size. Finally, three potential
future directions are proposed to further contribute to the security testing of
DNN models.
[LINK]
http://arxiv.org/abs/2505.08814v1
[DATE]
2025-05-12 16:25:55+08:00
[CATEGORIES]
cs.LG
Private LoRA Fine-tuning of Open-Source LLMs with Homomorphic Encryption
[AUTHORS]
Jordan Frery, Roman Bredehoft, Jakub Klemsa, Arthur Meyre, Andrei Stoian
[ABSTRACT]
Preserving data confidentiality during the fine-tuning of open-source Large
Language Models (LLMs) is crucial for sensitive applications. This work
introduces an interactive protocol adapting the Low-Rank Adaptation (LoRA)
technique for private fine-tuning. Homomorphic Encryption (HE) protects the
confidentiality of training data and gradients handled by remote worker nodes
performing the bulk of computations involving the base model weights. The data
owner orchestrates training, requiring minimal local computing power and
memory, thus alleviating the need for expensive client-side GPUs. We
demonstrate feasibility by fine-tuning a Llama-3.2-1B model, presenting
convergence results using HE-compatible quantization and performance benchmarks
for HE computations on GPU hardware. This approach enables applications such as
confidential knowledge base question answering, private codebase fine-tuning
for AI code assistants, AI agents for drafting emails based on a company’s
email archive, and adapting models to analyze sensitive legal or healthcare
documents.
[LINK]
http://arxiv.org/abs/2505.07329v1
[DATE]
2025-05-12 16:14:33+08:00
[CATEGORIES]
cs.LG
Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator
[AUTHORS]
Kaiwen Zheng, Yongxin Chen, Huayu Chen, Guande He, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
[ABSTRACT]
While likelihood-based generative models, particularly diffusion and
autoregressive models, have achieved remarkable fidelity in visual generation,
the maximum likelihood estimation (MLE) objective, which minimizes the forward
KL divergence, inherently suffers from a mode-covering tendency that limits the
generation quality under limited model capacity. In this work, we propose
Direct Discriminative Optimization (DDO) as a unified framework that integrates
likelihood-based generative training and GAN-type discrimination to bypass this
fundamental constraint by exploiting reverse KL and self-generated negative
signals. Our key insight is to parameterize a discriminator implicitly using
the likelihood ratio between a learnable target model and a fixed reference
model, drawing parallels with the philosophy of Direct Preference Optimization
(DPO). Unlike GANs, this parameterization eliminates the need for joint
training of generator and discriminator networks, allowing for direct,
efficient, and effective finetuning of a well-trained model to its full
potential beyond the limits of MLE. DDO can be performed iteratively in a
self-play manner for progressive model refinement, with each round requiring
less than 1% of pretraining epochs. Our experiments demonstrate the
effectiveness of DDO by significantly advancing the previous SOTA diffusion
model EDM, reducing FID scores from 1.79/1.58/1.96 to new records of
1.30/0.97/1.26 on CIFAR-10/ImageNet-64/ImageNet 512x512 datasets without any
guidance mechanisms, and by consistently improving both guidance-free and
CFG-enhanced FIDs of visual autoregressive models on ImageNet 256x256.
[COMMENTS]
ICML 2025 Spotlight. Project page:
https://research.nvidia.com/labs/dir/ddo/ Code: https://github.com/NVlabs/DDO
[LINK]
http://arxiv.org/abs/2503.01103v2
[DATE]
2025-05-12 16:12:46+08:00
[CATEGORIES]
cs.LG
Learning to Fuse Temporal Proximity Networks: A Case Study in Chimpanzee Social Interactions
[AUTHORS]
Yixuan He, Aaron Sandel, David Wipf, Mihai Cucuringu, John Mitani, Gesine Reinert
[ABSTRACT]
How can we identify groups of primate individuals which could be conjectured
to drive social structure? To address this question, one of us has collected a
time series of data for social interactions between chimpanzees. Here we use a
network representation, leading to the task of combining these data into a time
series of a single weighted network per time stamp, where different proximities
should be given different weights reflecting their relative importance. We
optimize these proximity-type weights in a principled way, using an innovative
loss function which rewards structural consistency across time. The approach is
empirically validated by carefully designed synthetic data. Using statistical
tests, we provide a way of identifying groups of individuals that stay related
for a significant length of time. Applying the approach to the chimpanzee data
set, we detect cliques in the animal social network time series, which can be
validated by real-world intuition from prior research and qualitative
observations by chimpanzee experts.
[LINK]
http://arxiv.org/abs/2502.00302v2
[DATE]
2025-05-12 16:07:11+08:00
[CATEGORIES]
cs.LG
Dynamical Label Augmentation and Calibration for Noisy Electronic Health Records
[AUTHORS]
Yuhao Li, Ling Luo, Uwe Aickelin
[ABSTRACT]
Medical research, particularly in predicting patient outcomes, heavily relies
on medical time series data extracted from Electronic Health Records (EHR),
which provide extensive information on patient histories. Despite rigorous
examination, labeling errors are inevitable and can significantly impede
accurate predictions of patient outcome. To address this challenge, we propose
an Attention-based Learning Framework with Dynamic
Calibration and Augmentation for Time series Noisy
Label Learning (ACTLL). This framework leverages a
two-component Beta mixture model to identify the certain and uncertain sets of
instances based on the fitness distribution of each class, and it captures
global temporal dynamics while dynamically calibrating labels from the
uncertain set or augmenting confident instances from the certain set.
Experimental results on large-scale EHR datasets eICU and MIMIC-IV-ED, and
several benchmark datasets from the UCR and UEA repositories, demonstrate that
our model ACTLL has achieved state-of-the-art performance, especially under
high noise levels.
[LINK]
http://arxiv.org/abs/2505.07320v1
[DATE]
2025-05-12 16:06:16+08:00
[CATEGORIES]
cs.LG
FedIFL: A federated cross-domain diagnostic framework for motor-driven systems with inconsistent fault modes
[AUTHORS]
Zexiao Wang, Yankai Wang, Xiaoqiang Liao, Xinguo Ming, Weiming Shen
[ABSTRACT]
Due to the scarcity of industrial data, individual equipment users,
particularly start-ups, struggle to independently train a comprehensive fault
diagnosis model; federated learning enables collaborative training while
ensuring data privacy, making it an ideal solution. However, the diversity of
working conditions leads to variations in fault modes, resulting in
inconsistent label spaces across different clients. In federated diagnostic
scenarios, label space inconsistency leads local models to focus on
client-specific fault modes and causes local models from different clients to
map different failure modes to similar feature representations, which weakens
the aggregated global model’s generalization. To tackle this issue, this
article proposes a federated cross-domain diagnostic framework termed Federated
Invariant Features Learning (FedIFL). In intra-client training, prototype
contrastive learning mitigates intra-client domain shifts; subsequently,
feature generation ensures local models can access distributions of other
clients in a privacy-friendly manner. In cross-client training, a
feature disentanglement mechanism is introduced to mitigate cross-client domain
shifts. Specifically, an instance-level federated instance consistency loss is
designed to ensure the instance-level consistency of invariant features between
different clients. Furthermore, a federated instance personalization loss and
an orthogonal loss are constructed to distinguish specific features from
the invariant features. Eventually, the aggregated model achieves promising
generalization among global label spaces, enabling accurate fault diagnosis for
target clients’ Motor Driven Systems (MDSs) with inconsistent label spaces.
Experiments on real-world MDSs validate the effectiveness and superiority of
FedIFL in federated cross-domain diagnosis with inconsistent fault modes.
[LINK]
http://arxiv.org/abs/2505.07315v1
[DATE]
2025-05-12 16:00:49+08:00
[CATEGORIES]
cs.LG
From Prompting to Alignment: A Generative Framework for Query Recommendation
[AUTHORS]
Erxue Min, Hsiu-Yuan Huang, Min Yang, Xihong Yang, Xin Jia, Yunfang Wu, Hengyi Cai, Junfeng Wang, Shuaiqiang Wang, Dawei Yin
[ABSTRACT]
In modern search systems, search engines often suggest relevant queries to
users through various panels or components, helping refine their information
needs. Traditionally, these recommendations heavily rely on historical search
logs to build models, which suffer from cold-start or long-tail issues.
Furthermore, tasks such as query suggestion, completion or clarification are
studied separately by specific design, which lacks generalizability and hinders
adaptation to novel applications. Despite recent attempts to explore the use of
LLMs for query recommendation, these methods mainly rely on the inherent
knowledge of LLMs or external sources like few-shot examples, retrieved
documents, or knowledge bases, neglecting the importance of the calibration and
alignment with user feedback, thus limiting their practical utility. To address
these challenges, we first propose a general Generative Query Recommendation
(GQR) framework that aligns LLM-based query generation with user preference.
Specifically, we unify diverse query recommendation tasks by a universal prompt
framework, leveraging the instruction-following capability of LLMs for effective
generation. Second, we align LLMs with user feedback by presenting a
CTR-alignment framework, which involves training a query-wise CTR predictor as
a process reward model and employing list-wise preference alignment to maximize
the click probability of the generated query list. Furthermore, recognizing the
inconsistency between LLM knowledge and proactive search intents arising from
the separation of user-initiated queries from models, we align LLMs with user
initiative via retrieving co-occurrence queries as side information when
historical logs are available.
[LINK]
http://arxiv.org/abs/2504.10208v2
[DATE]
2025-05-12 15:58:20+08:00
[CATEGORIES]
cs.LG
Online Episodic Convex Reinforcement Learning
[AUTHORS]
Bianca Marin Moreno, Khaled Eldowa, Pierre Gaillard, Margaux Brégère, Nadia Oudjane
[ABSTRACT]
We study online learning in episodic finite-horizon Markov decision processes
(MDPs) with convex objective functions, known as the concave utility
reinforcement learning (CURL) problem. This setting generalizes RL from linear
to convex losses on the state-action distribution induced by the agent’s
policy. The non-linearity of CURL invalidates classical Bellman equations and
requires new algorithmic approaches. We introduce the first algorithm achieving
near-optimal regret bounds for online CURL without any prior knowledge on the
transition function. To achieve this, we use an online mirror descent algorithm
with varying constraint sets and a carefully designed exploration bonus. We
then address for the first time a bandit version of CURL, where the only
feedback is the value of the objective function on the state-action
distribution induced by the agent’s policy. We achieve a sub-linear regret
bound for this more challenging problem by adapting techniques from bandit
convex optimization to the MDP setting.
[LINK]
http://arxiv.org/abs/2505.07303v1
[DATE]
2025-05-12 15:47:49+08:00
[CATEGORIES]
cs.LG
AttackBench: Evaluating Gradient-based Attacks for Adversarial Examples
[AUTHORS]
Antonio Emanuele Cinà, Jérôme Rony, Maura Pintor, Luca Demetrio, Ambra Demontis, Battista Biggio, Ismail Ben Ayed, Fabio Roli
[ABSTRACT]
Adversarial examples are typically optimized with gradient-based attacks.
While novel attacks are continuously proposed, each is shown to outperform its
predecessors using different experimental setups, hyperparameter settings, and
number of forward and backward calls to the target models. This provides
overly optimistic and even biased evaluations that may unfairly favor one
particular attack over the others. In this work, we aim to overcome these
limitations by proposing AttackBench, i.e., the first evaluation framework that
enables a fair comparison among different attacks. To this end, we first
propose a categorization of gradient-based attacks, identifying their main
components and differences. We then introduce our framework, which evaluates
their effectiveness and efficiency. We measure these characteristics by (i)
defining an optimality metric that quantifies how close an attack is to the
optimal solution, and (ii) limiting the number of forward and backward queries
to the model, such that all attacks are compared within a given maximum query
budget. Our extensive experimental analysis compares more than 100 attack
implementations with a total of over 800 different configurations against
CIFAR-10 and ImageNet models, highlighting that only very few attacks
outperform all the competing approaches. Within this analysis, we shed light on
several implementation issues that prevent many attacks from finding better
solutions or running at all. We release AttackBench as a publicly available
benchmark, aiming to continuously update it to include and evaluate novel
gradient-based attacks for optimizing adversarial examples.
[COMMENTS]
Paper accepted at AAAI 2025. Project page and leaderboard:
https://attackbench.github.io
[LINK]
http://arxiv.org/abs/2404.19460v3
[DATE]
2025-05-12 15:37:51+08:00
[CATEGORIES]
cs.LG
INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning
[AUTHORS]
Prime Intellect Team, Sami Jaghouar, Justus Mattern, Jack Min Ong, Jannik Straube, Manveer Basra, Aaron Pazdera, Kushal Thaman, Matthew Di Ferrante, Felix Gabriel, Fares Obeid, Kemal Erdem, Michael Keiblinger, Johannes Hagemann
[ABSTRACT]
We introduce INTELLECT-2, the first globally distributed reinforcement
learning (RL) training run of a 32 billion parameter language model. Unlike
traditional centralized training efforts, INTELLECT-2 trains a reasoning model
using fully asynchronous RL across a dynamic, heterogeneous swarm of
permissionless compute contributors.
To enable a training run with this unique infrastructure, we built various
components from scratch: we introduce PRIME-RL, our training framework
purpose-built for distributed asynchronous reinforcement learning, built on top
of novel components such as TOPLOC, which verifies rollouts from untrusted
inference workers, and SHARDCAST, which efficiently broadcasts policy weights
from training nodes to inference workers.
Beyond infrastructure components, we propose modifications to the standard
GRPO training recipe and data filtering techniques that were crucial to achieve
training stability and ensure that our model successfully learned its training
objective, thus improving upon QwQ-32B, the state-of-the-art reasoning model in
the 32B parameter range.
We open-source INTELLECT-2 along with all of our code and data, hoping to
encourage and enable more open research in the field of decentralized training.
[COMMENTS]
26 pages, 12 figures
[LINK]
http://arxiv.org/abs/2505.07291v1
[DATE]
2025-05-12 15:24:33+08:00
[CATEGORIES]
cs.LG
Enhancing Sample Selection Against Label Noise by Cutting Mislabeled Easy Examples
[AUTHORS]
Suqin Yuan, Lei Feng, Bo Han, Tongliang Liu
[ABSTRACT]
Sample selection is a prevalent approach in learning with noisy labels,
aiming to identify confident samples for training. Although existing sample
selection methods have achieved decent results by reducing the noise rate of
the selected subset, they often overlook that not all mislabeled examples harm
the model’s performance equally. In this paper, we demonstrate that mislabeled
examples correctly predicted by the model early in the training process are
particularly harmful to model performance. We refer to these examples as
Mislabeled Easy Examples (MEEs). To address this, we propose Early Cutting,
which introduces a recalibration step that employs the model’s later training
state to re-select the confident subset identified early in training, thereby
avoiding misleading confidence from early learning and effectively filtering
out MEEs. Experiments on the CIFAR, WebVision, and full ImageNet-1k datasets
demonstrate that our method effectively improves sample selection and model
performance by reducing MEEs.
[LINK]
http://arxiv.org/abs/2502.08227v2
[DATE]
2025-05-12 15:22:24+08:00
[CATEGORIES]
cs.LG
Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule
[AUTHORS]
Keyue Qiu, Yuxuan Song, Zhehuan Fan, Peidong Liu, Zhe Zhang, Mingyue Zheng, Hao Zhou, Wei-Ying Ma
[ABSTRACT]
Structure-Based Drug Design (SBDD) is crucial for identifying bioactive
molecules. Recent deep generative models are faced with challenges in geometric
structure modeling. A major bottleneck lies in the twisted probability path of
multi-modalities – continuous 3D positions and discrete 2D topologies – which
jointly determine molecular geometries. By establishing the fact that noise
schedules decide the Variational Lower Bound (VLB) for the twisted probability
path, we propose the VLB-Optimal Scheduling (VOS) strategy in this under-explored
area, which optimizes VLB as a path integral for SBDD. Our model effectively
enhances molecular geometries and interaction modeling, achieving a
state-of-the-art PoseBusters passing rate of 95.9% on CrossDock, more than 10%
improvement upon strong baselines, while maintaining high affinities and robust
intramolecular validity evaluated on a held-out test set.
[COMMENTS]
Accepted to ICML 2025
[LINK]
http://arxiv.org/abs/2505.07286v1
[DATE]
2025-05-12 15:18:09+08:00
[CATEGORIES]
cs.LG
Cache-Efficient Posterior Sampling for Reinforcement Learning with LLM-Derived Priors Across Discrete and Continuous Domains
[AUTHORS]
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
[ABSTRACT]
Integrating large language models (LLMs) as priors in reinforcement learning
(RL) offers significant advantages but comes with substantial computational
costs. We present a principled cache-efficient framework for posterior sampling
with LLM-derived priors that dramatically reduces these costs while maintaining
high performance. At the core of our approach is an adaptive caching mechanism,
where cache parameters are meta-optimized using surrogate gradients derived
from policy performance. This design enables efficient inference across both
discrete text environments (e.g., TextWorld, ALFWorld) and continuous control
domains (e.g., MuJoCo), achieving a 3.8–4.7× reduction in LLM queries
and 4.0–12.0× lower median latencies (85–93 ms on a consumer GPU)
while retaining 96–98% of uncached performance. Our theoretical analysis
provides KL divergence bounds on approximation quality, validated empirically.
The framework extends to offline RL, where our CQL-Prior variant improves
performance by 14–29% and reduces training time by 38–40%. Extensive
evaluations across a diverse suite of eight tasks demonstrate the
generalizability and practical viability of LLM-guided RL in
resource-constrained settings.
[LINK]
http://arxiv.org/abs/2505.07274v1
[DATE]
2025-05-12 14:53:24+08:00
[CATEGORIES]
cs.LG
ALPCAH: Subspace Learning for Sample-wise Heteroscedastic Data
[AUTHORS]
Javier Salazar Cavazos, Jeffrey A. Fessler, Laura Balzano
[ABSTRACT]
Principal component analysis (PCA) is a key tool in the field of data
dimensionality reduction. However, some applications involve heterogeneous data
that vary in quality due to noise characteristics associated with each data
sample. Heteroscedastic methods aim to deal with such mixed data quality. This
paper develops a subspace learning method, named ALPCAH, that can estimate the
sample-wise noise variances and use this information to improve the estimate of
the subspace basis associated with the low-rank structure of the data. Our
method makes no distributional assumptions about the low-rank component and does
not assume that the noise variances are known. Further, this method uses a soft
rank constraint that does not require the subspace dimension to be known.
Additionally, this paper develops a matrix factorized version of ALPCAH, named
LR-ALPCAH, that is much faster and more memory efficient at the cost of
requiring the subspace dimension to be known or estimated. Simulations and real
data experiments show the effectiveness of accounting for data
heteroscedasticity compared to existing algorithms. Code available at
https://github.com/javiersc1/ALPCAH.
[LINK]
http://arxiv.org/abs/2505.07272v1
[DATE]
2025-05-12 14:49:47+08:00
[CATEGORIES]
cs.LG
Adaptive, Robust and Scalable Bayesian Filtering for Online Learning
[AUTHORS]
Gerardo Duran-Martin
[ABSTRACT]
In this thesis, we introduce Bayesian filtering as a principled framework for
tackling diverse sequential machine learning problems, including online
(continual) learning, prequential (one-step-ahead) forecasting, and contextual
bandits. To this end, this thesis addresses key challenges in applying Bayesian
filtering to these problems: adaptivity to non-stationary environments,
robustness to model misspecification and outliers, and scalability to the
high-dimensional parameter space of deep neural networks. We develop novel
tools within the Bayesian filtering framework to address each of these
challenges, including: (i) a modular framework that enables the development of
adaptive approaches for online learning; (ii) a novel, provably robust filter
with similar computational cost to standard filters, that employs Generalised
Bayes; and (iii) a set of tools for sequentially updating model parameters
using approximate second-order optimisation methods that exploit the
overparametrisation of high-dimensional parametric models such as neural
networks. Theoretical analysis and empirical results demonstrate the improved
performance of our methods in dynamic, high-dimensional, and misspecified
models.
[COMMENTS]
PhD thesis
[LINK]
http://arxiv.org/abs/2505.07267v1
[DATE]
2025-05-12 14:40:29+08:00
[CATEGORIES]
cs.LG
UMoE: Unifying Attention and FFN with Shared Experts
[AUTHORS]
Yuanhang Yang, Chaozheng Wang, Jing Li
[ABSTRACT]
Sparse Mixture of Experts (MoE) architectures have emerged as a promising
approach for scaling Transformer models. While initial works primarily
incorporated MoE into feed-forward network (FFN) layers, recent studies have
explored extending the MoE paradigm to attention layers to enhance model
performance. However, existing attention-based MoE layers require specialized
implementations and demonstrate suboptimal performance compared to their
FFN-based counterparts. In this paper, we aim to unify the MoE designs in
attention and FFN layers by introducing a novel reformulation of the attention
mechanism, revealing an underlying FFN-like structure within attention modules.
Our proposed architecture, UMoE, achieves superior performance through
attention-based MoE layers while enabling efficient parameter sharing between
FFN and attention components.
[LINK]
http://arxiv.org/abs/2505.07260v1
[DATE]
2025-05-12 14:21:44+08:00
[CATEGORIES]
cs.LG
Causal Post-Processing of Predictive Models
[AUTHORS]
Carlos Fernández-Loría, Yanfang Hou, Foster Provost, Jennifer Hill
[ABSTRACT]
Decision makers across various domains rely on predictive models to guide
individual-level intervention decisions. However, these models are typically
trained to predict outcomes rather than causal effects, leading to
misalignments when they are used for causal decision making. Experimental data
to train effective causal effect models is often limited. To address this
issue, we propose causal post-processing (CPP), a family of techniques for
refining predictive scores to better align with causal effects using limited
experimental data. Rather than training separate causal models for each
intervention, causal post-processing can adapt existing predictive scores to
support different decision-making requirements, such as estimating effect
sizes, ranking individuals by expected effects, or classifying individuals
based on an intervention threshold. We introduce three main CPP approaches –
monotonic post-processing, correction post-processing, and model-based
post-processing – each balancing statistical efficiency and flexibility
differently. Through simulations and an empirical application in advertising,
we demonstrate that causal post-processing improves intervention decisions,
particularly in settings where experimental data is expensive or difficult to
obtain at scale. Our findings highlight the advantages of integrating
non-causal predictive models with experimental data, rather than treating them
as competing alternatives, which provides a scalable and data-efficient
approach to causal inference for decision making.
[LINK]
http://arxiv.org/abs/2406.09567v2
[DATE]
2025-05-12 13:54:27+08:00
[CATEGORIES]
cs.LG
The Influence of the Memory Capacity of Neural DDEs on the Universal Approximation Property
[AUTHORS]
Christian Kuehn, Sara-Viola Kuntz
[ABSTRACT]
Neural Ordinary Differential Equations (Neural ODEs), which are the
continuous-time analog of Residual Neural Networks (ResNets), have gained
significant attention in recent years. Similarly, Neural Delay Differential
Equations (Neural DDEs) can be interpreted as an infinite depth limit of
Densely Connected Residual Neural Networks (DenseResNets). In contrast to
traditional ResNet architectures, DenseResNets are feed-forward networks that
allow for shortcut connections across all layers. These additional connections
introduce memory in the network architecture, as is typical in many modern
architectures. In this work, we explore how the memory capacity in neural DDEs
influences the universal approximation property. The key parameter for studying
the memory capacity is the product $K \tau$ of the Lipschitz constant and the
delay of the DDE. In the case of non-augmented architectures, where the network
width is not larger than the input and output dimensions, neural ODEs and
classical feed-forward neural networks cannot have the universal approximation
property. We show that if the memory capacity $K\tau$ is sufficiently small,
the dynamics of the neural DDE can be approximated by a neural ODE.
Consequently, non-augmented neural DDEs with a small memory capacity also lack
the universal approximation property. In contrast, if the memory capacity
$K\tau$ is sufficiently large, we can establish the universal approximation
property of neural DDEs for continuous functions. If the neural DDE
architecture is augmented, we can expand the parameter regions in which
universal approximation is possible. Overall, our results show that the
infinite-dimensional phase space of DDEs with positive delay $\tau>0$ is not by
itself sufficient to guarantee a direct jump transition to universal
approximation; universal approximation holds only once the memory capacity
$K\tau$ exceeds a certain threshold.
[LINK]
http://arxiv.org/abs/2505.07244v1
[DATE]
2025-05-12 13:36:39+08:00
[CATEGORIES]
cs.LG
Rethinking Graph Contrastive Learning through Relative Similarity Preservation
[AUTHORS]
Zhiyuan Ning, Pengfei Wang, Ziyue Qiao, Pengyang Wang, Yuanchun Zhou
[ABSTRACT]
Graph contrastive learning (GCL) has achieved remarkable success by following
the computer vision paradigm of preserving absolute similarity between
augmented views. However, this approach faces fundamental challenges in graphs
due to their discrete, non-Euclidean nature – view generation often breaks
semantic validity and similarity verification becomes unreliable. Through
analyzing 11 real-world graphs, we discover a universal pattern transcending
the homophily-heterophily dichotomy: label consistency systematically
diminishes as structural distance increases, manifesting as smooth decay in
homophily graphs and oscillatory decay in heterophily graphs. We establish
theoretical guarantees for this pattern through random walk theory, proving
label distribution convergence and characterizing the mechanisms behind
different decay behaviors. This discovery reveals that graphs naturally encode
relative similarity patterns, where structurally closer nodes exhibit
collectively stronger semantic relationships. Leveraging this insight, we
propose RELGCL, a novel GCL framework with complementary pairwise and listwise
implementations that preserve these inherent patterns through collective
similarity objectives. Extensive experiments demonstrate that our method
consistently outperforms 20 existing approaches across both homophily and
heterophily graphs, validating the effectiveness of leveraging natural relative
similarity over artificial absolute similarity.
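The distance-dependent label consistency described above can be illustrated with a minimal sketch. It is not the RELGCL implementation: the function name, the toy path graph, and the use of exact shortest-path distance as "structural distance" are all assumptions made for illustration.

```python
from collections import deque

def label_consistency_by_distance(adjacency, labels, max_hops):
    """For each hop distance d, return the fraction of ordered node pairs at
    exactly shortest-path distance d that share a label (hypothetical helper)."""
    same = [0] * (max_hops + 1)
    total = [0] * (max_hops + 1)
    for source in adjacency:
        # Breadth-first search from `source`, truncated at max_hops.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if dist[node] == max_hops:
                continue
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        for node, d in dist.items():
            if d == 0:
                continue
            total[d] += 1
            same[d] += labels[node] == labels[source]
    return {d: same[d] / total[d] for d in range(1, max_hops + 1) if total[d]}

# Toy homophilous path graph 0-1-2-3-4-5 with two label blocks (assumed data).
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
consistency = label_consistency_by_distance(adjacency, labels, max_hops=3)
print(consistency)  # {1: 0.8, 2: 0.5, 3: 0.0}
```

On this toy graph the consistency falls as hop distance grows, which is the kind of smooth decay the abstract attributes to homophily graphs.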
[COMMENTS]
Accepted by IJCAI2025; full version including appendix
[LINK]
http://arxiv.org/abs/2505.05533v2
[DATE]
2025-05-12 13:13:49+08:00
[CATEGORIES]
cs.LG
Differentiable Folding for Nearest Neighbor Model Optimization
[AUTHORS]
Ryan K. Krueger, Sharon Aviran, David H. Mathews, Jeffrey Zuber, Max Ward
[ABSTRACT]
The Nearest Neighbor model is the de facto thermodynamic model of
RNA secondary structure formation and is a cornerstone of RNA structure
prediction and sequence design. The current functional form (Turner 2004)
contains $\approx13,000$ underlying thermodynamic parameters, and fitting these
to both experimental and structural data is computationally challenging. Here,
we leverage recent advances in differentiable folding, a method for
directly computing gradients of the RNA folding algorithms, to devise an
efficient, scalable, and flexible means of parameter optimization that uses
known RNA structures and thermodynamic experiments. Our method yields a
significantly improved parameter set that outperforms existing baselines on all
metrics, including an increase in the average predicted probability of
ground-truth sequence-structure pairs for a single RNA family by over 23 orders
of magnitude. Our framework provides a path towards drastically improved RNA
models, enabling the flexible incorporation of new experimental data,
definition of novel loss terms, large training sets, and even treatment as a
module in larger deep learning pipelines. We make available a new database,
RNAometer, with experimentally-determined stabilities for small RNA model
systems.
[LINK]
http://arxiv.org/abs/2503.09085v2
[DATE]
2025-05-12 12:58:33+08:00
[CATEGORIES]
cs.LG
Jointly spatial-temporal representation learning for individual trajectories
[AUTHORS]
Fei Huang, Jianrong Lv, Yang Yue
[ABSTRACT]
Individual trajectories, rich in human-environment interaction information
across space and time, serve as vital inputs for geospatial foundation models
(GeoFMs). However, existing attempts at learning trajectory representations
have overlooked the implicit spatial-temporal dependency within trajectories,
failing to encode such dependency in a deep learning-friendly format. This
poses a challenge in obtaining general-purpose trajectory representations.
Therefore, this paper proposes a spatial-temporal joint representation learning
method (ST-GraphRL) to formalize learnable spatial-temporal dependencies into
trajectory representations. The proposed ST-GraphRL consists of three
components: (i) a weighted directed spatial-temporal graph to explicitly
construct mobility interactions in both space and time dimensions; (ii) a
two-stage joint encoder (i.e., decoupling and fusion) to learn entangled
spatial-temporal dependencies by independently decomposing and jointly
aggregating space and time information; (iii) a decoder that guides ST-GraphRL to
learn explicit mobility regularities by simulating the spatial-temporal
distributions of trajectories. Tested on three real-world human mobility
datasets, the proposed ST-GraphRL outperformed all the baseline models in
predicting movement spatial-temporal distributions and preserving trajectory
similarity with high spatial-temporal correlations. Analyzing spatial-temporal
features presented in latent space validates that ST-GraphRL understands
spatial-temporal patterns. This study may also benefit representation learning
for other geospatial data to achieve general-purpose data representations and
advance GeoFM development.
[COMMENTS]
27 pages, 3 tables, 7 figures
[LINK]
http://arxiv.org/abs/2312.04055v3
[DATE]
2025-05-12 12:32:21+08:00
[CATEGORIES]
cs.LG
Compression, Regularity, Randomness and Emergent Structure: Rethinking Physical Complexity in the Data-Driven Era
[AUTHORS]
Nima Dehghani
[ABSTRACT]
Complexity science offers a wide range of measures for quantifying
unpredictability, structure, and information. Yet, a systematic conceptual
organization of these measures is still missing.
We present a unified framework that locates statistical, algorithmic, and
dynamical measures along three axes (regularity, randomness, and complexity)
and situates them in a common conceptual space. We map statistical,
algorithmic, and dynamical measures into this conceptual space, discussing
their computational accessibility and approximability.
This taxonomy reveals the deep challenges posed by uncomputability and
highlights the emergence of modern data-driven methods (including autoencoders,
latent dynamical models, symbolic regression, and physics-informed neural
networks) as pragmatic approximations to classical complexity ideals. Latent
spaces emerge as operational arenas where regularity extraction, noise
management, and structured compression converge, bridging theoretical
foundations with practical modeling in high-dimensional systems.
We close by outlining implications for physics-informed AI and AI-guided
discovery in complex physical systems, arguing that classical questions of
complexity remain central to next-generation scientific modeling.
[LINK]
http://arxiv.org/abs/2505.07222v1
[DATE]
2025-05-12 12:30:42+08:00
[CATEGORIES]
cs.LG
LLMEasyQuant: Scalable Quantization for Parallel and Distributed LLM Inference
[AUTHORS]
Dong Liu, Yanxuan Yu
[ABSTRACT]
As large language models (LLMs) grow in size and deployment scale,
quantization has become an essential technique for reducing memory footprint
and improving inference efficiency. However, existing quantization toolkits
often lack transparency, flexibility, and system-level scalability across GPUs
and distributed environments. We present LLMEasyQuant, a modular,
system-aware quantization framework designed for efficient, low-bit inference
of LLMs on single-node multi-GPU, multi-node, and edge hardware. LLMEasyQuant
supports a wide range of quantization methods – including Symmetric
Quantization, ZeroQuant, SmoothQuant, and SimQuant – with unified interfaces
for per-layer calibration, bitwidth assignment, and runtime adaptation. It
integrates fused CUDA kernels with NCCL-based distributed synchronization and
supports both static and online quantization. Empirical results show that
LLMEasyQuant achieves substantial speedups in GEMM execution and HBM load time,
as well as near-linear multi-GPU scaling. Ablation studies further validate its
ability to balance latency, memory, and accuracy under diverse deployment
conditions. LLMEasyQuant offers a practical quantization serving system for
scalable, hardware-optimized LLM inference.
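As a rough illustration of the simplest method named above, symmetric quantization, here is a minimal pure-Python sketch. It is not LLMEasyQuant's interface: the function names and the per-tensor max-abs scale are assumptions chosen for clarity.

```python
def symmetric_quantize(values, num_bits=8):
    """Map floats to signed integers with one shared scale and zero-point 0.
    (Hypothetical helper; per-tensor max-abs calibration is an assumption.)"""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to [-qmax, qmax].
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integer levels."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = symmetric_quantize(weights)
restored = dequantize(q, scale)
print(q)  # [50, -127, 0, 100]
```

Because the range is symmetric around zero, no zero-point has to be stored, and the reconstruction error per value is at most half a quantization step.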
[LINK]
http://arxiv.org/abs/2406.19657v4
[DATE]
2025-05-12 12:21:38+08:00
[CATEGORIES]
cs.LG
GradStop: Exploring Training Dynamics in Unsupervised Outlier Detection through Gradient
[AUTHORS]
Yuang Zhang, Liping Wang, Yihong Huang, Yuanxing Zheng, Fan Zhang, Xuemin Lin
[ABSTRACT]
Unsupervised Outlier Detection (UOD) is a critical task in data mining and
machine learning, aiming to identify instances that significantly deviate from
the majority. Without any labels, deep UOD methods struggle with the
misalignment between the model’s direct optimization goal and the final
performance goal of the Outlier Detection (OD) task. From the perspective of
training dynamics, this paper proposes an early stopping algorithm to optimize
the training of deep UOD models, ensuring they perform optimally in OD rather
than overfitting the entire contaminated dataset.
Inspired by the UOD mechanism and the inlier priority phenomenon, where
models intuitively fit inliers more quickly than outliers, we propose GradStop, a
sampling-based label-free algorithm to estimate the model’s real-time performance
during training. First, a sampling method generates two sets: one likely
containing more outliers and the other more inliers. Then, a metric based on
gradient cohesion is applied to probe the current training dynamics, which
reflect the model’s performance on the OD task.
Experimental results on 4 deep UOD algorithms and 47 real-world datasets and
theoretical proofs demonstrate the effectiveness of our proposed early stopping
algorithm in enhancing the performance of deep UOD models. An Auto Encoder (AE)
enhanced by GradStop outperforms the plain AE, other SOTA UOD
methods, and even ensemble AEs. Our method provides a robust and effective
solution to the problem of performance degradation during training, enabling
deep UOD models to achieve better potential in anomaly detection tasks.
[LINK]
http://arxiv.org/abs/2412.08501v2
[DATE]
2025-05-12 11:52:29+08:00
[CATEGORIES]
cs.LG
Looped Transformers for Length Generalization
[AUTHORS]
Ying Fan, Yilun Du, Kannan Ramchandran, Kangwook Lee
[ABSTRACT]
Recent work has shown that Transformers trained from scratch can successfully
solve various arithmetic and algorithmic tasks, such as adding numbers and
computing parity. While these Transformers generalize well on unseen inputs of
the same length, they struggle with length generalization, i.e., handling
inputs of unseen lengths. In this work, we demonstrate that looped Transformers
with an adaptive number of steps significantly improve length generalization.
We focus on tasks with a known iterative solution, involving multiple
iterations of a RASP-L operation - a length-generalizable operation that can be
expressed by a finite-sized Transformer. We train looped Transformers using our
proposed learning algorithm and observe that they learn highly
length-generalizable solutions for various tasks.
[COMMENTS]
ICLR 2025
[LINK]
http://arxiv.org/abs/2409.15647v5
[DATE]
2025-05-12 11:51:20+08:00
[CATEGORIES]
cs.LG
FloE: On-the-Fly MoE Inference on Memory-constrained GPU
[AUTHORS]
Yuxin Zhou, Zheng Li, Jun Zhang, Jue Wang, Yiping Wang, Zhongle Xie, Ke Chen, Lidan Shou
[ABSTRACT]
With the widespread adoption of Mixture-of-Experts (MoE) models, there is a
growing demand for efficient inference on memory-constrained devices. While
offloading expert parameters to CPU memory and loading activated experts on
demand has emerged as a potential solution, the large size of activated experts
overburdens the limited PCIe bandwidth, hindering its effectiveness in
latency-sensitive scenarios. To mitigate this, we propose FloE, an on-the-fly
MoE inference system on memory-constrained GPUs. FloE is built on the insight
that there exists substantial untapped redundancy within sparsely activated
experts. It employs various compression techniques on the expert’s internal
parameter matrices to reduce the data movement load, combined with low-cost
sparse prediction, achieving perceptible inference acceleration in wall-clock
time on resource-constrained devices. Empirically, FloE achieves a 9.3x
compression of parameters per expert in Mixtral-8x7B; enables deployment on a
GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5x; and
delivers a 48.7x inference speedup compared to DeepSpeed-MII on a single
GeForce RTX 3090, all with only a 4.4% to 7.6% average performance
degradation.
[COMMENTS]
Accepted by ICML 2025
[LINK]
http://arxiv.org/abs/2505.05950v2
[DATE]
2025-05-12 11:29:12+08:00
[CATEGORIES]
cs.LG
Collaborative Deterministic-Diffusion Model for Probabilistic Spatiotemporal Prediction
[AUTHORS]
Zhi Sheng, Yuan Yuan, Yudi Zhang, Depeng Jin, Yong Li
[ABSTRACT]
Accurate prediction of urban spatiotemporal dynamics is essential for
enhancing urban management and decision-making. Existing spatiotemporal
prediction models are predominantly deterministic, focusing on primary
spatiotemporal patterns. However, those dynamics are highly complex, exhibiting
multi-modal distributions that are challenging for deterministic models to
capture. In this paper, we highlight the critical role of probabilistic
prediction in capturing the uncertainties and complexities inherent in
spatiotemporal data. While mainstream probabilistic models can capture
uncertainty, they struggle with accurately learning primary patterns and often
suffer from computational inefficiency. To address these challenges, we propose
CoST, in which deterministic and probabilistic models collaborate to improve both
predictive accuracy and the ability to handle uncertainty. To achieve this, we
design a mean-residual decomposition framework, where the mean value is modeled
by a deterministic model, and the residual variations are learned by a
probabilistic model, specifically diffusion models. Moreover, we introduce a
scale-aware diffusion process, which better accounts for spatially
heterogeneous dynamics across different regions. Extensive experiments on eight
real-world datasets demonstrate that CoST significantly outperforms existing
methods in both deterministic and probabilistic metrics, achieving a 20%
improvement with low computational cost. CoST bridges the gap between
deterministic precision and probabilistic uncertainty, making a significant
advancement in the field of urban spatiotemporal prediction.
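The mean-residual decomposition can be sketched as follows. This is a toy stand-in, not the CoST architecture: a per-slot historical average plays the role of the deterministic model, and resampling empirical residuals plays the role of the diffusion model. All names and data are hypothetical.

```python
import random

def fit_mean(history):
    """Deterministic part: per-slot average over past days
    (stand-in for a learned deterministic predictor)."""
    num_slots = len(history[0])
    return [sum(day[s] for day in history) / len(history) for s in range(num_slots)]

def fit_residuals(history, mean):
    """Probabilistic part: keep observed residuals per slot. Resampling them
    stands in for a learned generative model such as a diffusion model."""
    return [[day[s] - mean[s] for day in history] for s in range(len(mean))]

def sample_forecast(mean, residuals, rng):
    """One probabilistic forecast = deterministic mean + sampled residual."""
    return [m + rng.choice(r) for m, r in zip(mean, residuals)]

# Three past days, three time slots each (assumed toy data).
history = [[10.0, 20.0, 30.0], [12.0, 18.0, 33.0], [8.0, 22.0, 27.0]]
mean = fit_mean(history)
residuals = fit_residuals(history, mean)
rng = random.Random(0)
samples = [sample_forecast(mean, residuals, rng) for _ in range(5)]
```

The split lets the deterministic component carry the primary pattern while the sampled component carries only the spread around it.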
[LINK]
http://arxiv.org/abs/2502.11013v3
[DATE]
2025-05-12 11:14:02+08:00
[CATEGORIES]
cs.LG
Representation Learning with Mutual Influence of Modalities for Node Classification in Multi-Modal Heterogeneous Networks
[AUTHORS]
Jiafan Li, Jiaqi Zhu, Liang Chang, Yilin Li, Miaomiao Li, Yang Wang, Hongan Wang
[ABSTRACT]
Nowadays, numerous online platforms can be described as multi-modal
heterogeneous networks (MMHNs), such as Douban’s movie networks and Amazon’s
product review networks. Accurately categorizing nodes within these networks is
crucial for analyzing the corresponding entities, which requires effective
representation learning on nodes. However, existing multi-modal fusion methods
often adopt either early fusion strategies, which may lose the unique
characteristics of individual modalities, or late fusion approaches, which overlook
the cross-modal guidance in GNN-based information propagation. In this paper,
we propose a novel model for node classification in MMHNs, named Heterogeneous
Graph Neural Network with Inter-Modal Attention (HGNN-IMA). It learns node
representations by capturing the mutual influence of multiple modalities during
the information propagation process, within the framework of a heterogeneous
graph transformer. Specifically, a nested inter-modal attention mechanism is
integrated into the inter-node attention to achieve adaptive multi-modal
fusion, and modality alignment is also taken into account to encourage the
propagation among nodes with consistent similarities across all modalities.
Moreover, an attention loss is added to mitigate the impact of missing
modalities. Extensive experiments validate the superiority of the model in the
node classification task, offering a new perspective on handling multi-modal
data, especially when accompanied by network structures.
[LINK]
http://arxiv.org/abs/2505.07895v1
[DATE]
2025-05-12 10:59:46+08:00
[CATEGORIES]
cs.LG
Causal View of Time Series Imputation: Some Identification Results on Missing Mechanism
[AUTHORS]
Ruichu Cai, Kaitao Zheng, Junxian Huang, Zijian Li, Zhengming Chen, Boyan Xu, Zhifeng Hao
[ABSTRACT]
Time series imputation is one of the most challenging problems and has broad
applications in various fields like health care and the Internet of Things.
Existing methods mainly aim to model the temporally latent dependencies and the
generation process from the observed time series data. In real-world scenarios,
different types of missing mechanisms, such as MAR (Missing At Random) and MNAR
(Missing Not At Random), can occur in time series data. However, existing
methods often overlook the differences among the aforementioned missing
mechanisms and use a single model for time series imputation, which can easily
lead to misleading results due to mechanism mismatching. In this paper, we
propose a framework for the time series imputation problem by exploring Different
Missing Mechanisms (DMM in short) and tailoring solutions accordingly.
Specifically, we first analyze the data generation processes with temporal
latent states and missing cause variables for different mechanisms.
Subsequently, we model these generation processes via variational inference and
estimate prior distributions of latent variables via a normalizing flow-based
neural architecture. Furthermore, we establish identifiability results under
the nonlinear independent component analysis framework to show that latent
variables are identifiable. Experimental results show that our method surpasses
existing time series imputation techniques across various datasets with
different missing mechanisms, demonstrating its effectiveness in real-world
applications.
[LINK]
http://arxiv.org/abs/2505.07180v1
[DATE]
2025-05-12 10:13:14+08:00
[CATEGORIES]
cs.LG
DiffGAN: A Test Generation Approach for Differential Testing of Deep Neural Networks
[AUTHORS]
Zohreh Aghababaeyan, Manel Abdellatif, Lionel Briand, Ramesh S
[ABSTRACT]
Deep Neural Networks (DNNs) are increasingly deployed across applications.
However, ensuring their reliability remains a challenge, and in many
situations, alternative models with similar functionality and accuracy are
available. Traditional accuracy-based evaluations often fail to capture
behavioral differences between models, especially with limited test datasets,
making it difficult to select or combine models effectively. Differential
testing addresses this by generating test inputs that expose discrepancies in
DNN model behavior. However, existing approaches face significant limitations:
many rely on model internals or are constrained by available seed inputs. To
address these challenges, we propose DiffGAN, a black-box test image generation
approach for differential testing of DNN models. DiffGAN leverages a Generative
Adversarial Network (GAN) and the Non-dominated Sorting Genetic Algorithm II to
generate diverse and valid triggering inputs that reveal behavioral
discrepancies between models. DiffGAN employs two custom fitness functions,
focusing on diversity and divergence, to guide the exploration of the GAN input
space and identify discrepancies between models’ outputs. By strategically
searching this space, DiffGAN generates inputs with specific features that
trigger differences in model behavior. DiffGAN is black-box, making it
applicable in more situations. We evaluate DiffGAN on eight DNN model pairs
trained on widely used image datasets. Our results show DiffGAN significantly
outperforms a SOTA baseline, generating four times more triggering inputs, with
greater diversity and validity, within the same budget. Additionally, the
generated inputs improve the accuracy of a machine learning-based model
selection mechanism, which selects the best-performing model based on input
characteristics and can serve as a smart output voting mechanism when using
alternative models.
[LINK]
http://arxiv.org/abs/2410.19794v2
[DATE]
2025-05-12 10:06:12+08:00
[CATEGORIES]
cs.LG
EnvCDiff: Joint Refinement of Environmental Information and Channel Fingerprints via Conditional Generative Diffusion Model
[AUTHORS]
Zhenzhou Jin, Li You, Xiang-Gen Xia, Xiqi Gao
[ABSTRACT]
The paradigm shift from environment-unaware communication to intelligent
environment-aware communication is expected to facilitate the acquisition of
channel state information for future wireless communications. Channel
Fingerprint (CF), as an emerging enabling technology for environment-aware
communication, provides channel-related knowledge for potential locations
within the target communication area. However, due to the limited availability
of practical devices for sensing environmental information and measuring
channel-related knowledge, most of the acquired environmental information and
CF are coarse-grained, insufficient to guide the design of wireless
transmissions. To address this, this paper proposes a deep conditional
generative learning approach, namely a customized conditional generative
diffusion model (CDiff). The proposed CDiff simultaneously refines
environmental information and CF, reconstructing a fine-grained CF that
incorporates environmental information, referred to as EnvCF, from its
coarse-grained counterpart. Experimental results show that the proposed
approach significantly improves the performance of EnvCF construction compared
to the baselines.
[COMMENTS]
6 pages, 2 figures
[LINK]
http://arxiv.org/abs/2505.07894v1
[DATE]
2025-05-12 09:36:18+08:00
[CATEGORIES]
cs.LG
Channel Fingerprint Construction for Massive MIMO: A Deep Conditional Generative Approach
[AUTHORS]
Zhenzhou Jin, Li You, Xudong Li, Zhen Gao, Yuanwei Liu, Xiang-Gen Xia, Xiqi Gao
[ABSTRACT]
Accurate channel state information (CSI) acquisition for massive
multiple-input multiple-output (MIMO) systems is essential for future mobile
communication networks. Channel fingerprint (CF), also referred to as channel
knowledge map, is a key enabler for intelligent environment-aware communication
and can facilitate CSI acquisition. However, due to the cost limitations of
practical sensing nodes and test vehicles, the resulting CF is typically
coarse-grained, making it insufficient for wireless transceiver design. In this
work, we introduce the concept of CF twins and design a conditional generative
diffusion model (CGDM) with strong implicit prior learning capabilities as the
computational core of the CF twin to establish the connection between coarse-
and fine-grained CFs. Specifically, we employ a variational inference technique
to derive the evidence lower bound (ELBO) for the log-marginal distribution of
the observed fine-grained CF conditioned on the coarse-grained CF, enabling the
CGDM to learn the complicated distribution of the target data. During the
denoising neural network optimization, the coarse-grained CF is introduced as
side information to accurately guide the conditioned generation of the CGDM. To
make the proposed CGDM lightweight, we further leverage the additivity of
network layers and introduce a one-shot pruning approach along with a
multi-objective knowledge distillation technique. Experimental results show
that the proposed approach exhibits significant improvement in reconstruction
performance compared to the baselines. Additionally, zero-shot testing on
reconstruction tasks with different magnification factors further demonstrates
the scalability and generalization ability of the proposed approach.
[COMMENTS]
15 pages, 7 figures
[LINK]
http://arxiv.org/abs/2505.07893v1
[DATE]
2025-05-12 09:36:06+08:00
[CATEGORIES]
cs.LG
Continuous Thought Machines
[AUTHORS]
Luke Darlow, Ciaran Regan, Sebastian Risi, Jeffrey Seely, Llion Jones
[ABSTRACT]
Biological brains demonstrate complex neural activity, where the timing and
interplay between neurons is critical to how brains process information. Most
deep learning architectures simplify neural activity by abstracting away
temporal dynamics. In this paper we challenge that paradigm. By incorporating
neuron-level processing and synchronization, we can effectively reintroduce
neural timing as a foundational element. We present the Continuous Thought
Machine (CTM), a model designed to leverage neural dynamics as its core
representation. The CTM has two core innovations: (1) neuron-level temporal
processing, where each neuron uses unique weight parameters to process a
history of incoming signals; and (2) neural synchronization employed as a
latent representation. The CTM aims to strike a balance between oversimplified
neuron abstractions that improve computational efficiency, and biological
realism. It operates at a level of abstraction that effectively captures
essential temporal dynamics while remaining computationally tractable for deep
learning. We demonstrate the CTM’s strong performance and versatility across a
range of challenging tasks, including ImageNet-1K classification, solving 2D
mazes, sorting, parity computation, question-answering, and RL tasks. Beyond
displaying rich internal representations and offering a natural avenue for
interpretation owing to its internal process, the CTM is able to perform tasks
that require complex sequential reasoning. The CTM can also leverage adaptive
compute, where it can stop earlier for simpler tasks, or keep computing when
faced with more challenging instances. The goal of this work is to share the
CTM and its associated innovations, rather than pushing for new
state-of-the-art results. To that end, we believe the CTM represents a
significant step toward developing more biologically plausible and powerful
artificial intelligence systems.
[COMMENTS]
Technical report accompanied by online project page:
https://pub.sakana.ai/ctm/
[LINK]
http://arxiv.org/abs/2505.05522v2
[DATE]
2025-05-12 09:35:39+08:00
[CATEGORIES]
cs.LG
Multi-Modal Molecular Representation Learning via Structure Awareness
[AUTHORS]
Rong Yin, Ruyue Liu, Xiaoshuai Hao, Xingrui Zhou, Yong Liu, Can Ma, Weiping Wang
[ABSTRACT]
Accurate extraction of molecular representations is a critical step in the
drug discovery process. In recent years, significant progress has been made in
molecular representation learning methods, among which multi-modal molecular
representation methods based on images and 2D/3D topologies have become
increasingly mainstream. However, these existing multi-modal approaches often
directly fuse information from different modalities, overlooking the potential
of intermodal interactions and failing to adequately capture the complex
higher-order relationships and invariant features between molecules. To
overcome these challenges, we propose a structure-awareness-based multi-modal
self-supervised molecular representation pre-training framework (MMSA) designed
to enhance molecular graph representations by leveraging invariant knowledge
between molecules. The framework consists of two main modules: the multi-modal
molecular representation learning module and the structure-awareness module.
The multi-modal molecular representation learning module collaboratively
processes information from different modalities of the same molecule to
overcome intermodal differences and generate a unified molecular embedding.
Subsequently, the structure-awareness module enhances the molecular
representation by constructing a hypergraph structure to model higher-order
correlations between molecules. This module also introduces a memory mechanism
for storing typical molecular representations, aligning them with memory
anchors in the memory bank to integrate invariant knowledge, thereby improving
the model’s generalization ability. Extensive experiments have demonstrated the
effectiveness of MMSA, which achieves state-of-the-art performance on the
MoleculeNet benchmark, with average ROC-AUC improvements ranging from 1.8% to
9.6% over baseline methods.
[COMMENTS]
Accepted by IEEE Transactions on Image Processing (TIP) 2025
[LINK]
http://arxiv.org/abs/2505.05877v2
[DATE]
2025-05-12 09:15:32+08:00
[CATEGORIES]
cs.LG
Using matrix-product states for time-series machine learning
[AUTHORS]
Joshua B. Moore, Hugo P. Stackhouse, Ben D. Fulcher, Sahand Mahmoodian
[ABSTRACT]
Matrix-product states (MPS) have proven to be a versatile ansatz for modeling
quantum many-body physics. For many applications, and particularly in
one dimension, they capture relevant quantum correlations in many-body
wavefunctions while remaining tractable to store and manipulate on a classical
computer. This has motivated researchers to also apply the MPS ansatz to
machine learning (ML) problems where capturing complex correlations in datasets
is also a key requirement. Here, we develop and apply an MPS-based algorithm,
MPSTime, for learning a joint probability distribution underlying an observed
time-series dataset, and show how it can be used to tackle important
time-series ML problems, including classification and imputation. MPSTime can
efficiently learn complicated time-series probability distributions directly
from data, requires only a moderate maximum MPS bond dimension $\chi_{\rm max}$,
with values for our applications ranging from $\chi_{\rm max} = 20$ to $160$, and
can be trained for both classification and imputation tasks under a single
logarithmic loss function. Using synthetic and publicly available real-world
datasets, spanning applications in medicine, energy, and astronomy, we
demonstrate performance competitive with state-of-the-art ML approaches, but
with the key advantage of encoding the full joint probability distribution
learned from the data, which is useful for analyzing and interpreting its
underlying structure. This manuscript is supplemented with the release of a
publicly available code package MPSTime that implements our approach. The
effectiveness of the MPS-based ansatz for capturing complex correlation
structures in time-series data makes it a powerful foundation for tackling
challenging time-series analysis problems across science, industry, and
medicine.
[COMMENTS]
31 pages, 14 figures
[LINK]
http://arxiv.org/abs/2412.15826v2
[DATE]
2025-05-12 09:12:12+08:00
[CATEGORIES]
cs.LG
DittoGym: Learning to Control Soft Shape-Shifting Robots
[AUTHORS]
Suning Huang, Boyuan Chen, Huazhe Xu, Vincent Sitzmann
[ABSTRACT]
Robot co-design, where the morphology of a robot is optimized jointly with a
learned policy to solve a specific task, is an emerging area of research. It
holds particular promise for soft robots, which are amenable to novel
manufacturing techniques that can realize learned morphologies and actuators.
Inspired by nature and recent novel robot designs, we propose to go a step
further and explore reconfigurable robots, defined as robots that can
change their morphology within their lifetime. We formalize control of
reconfigurable soft robots as a high-dimensional reinforcement learning (RL)
problem. We unify morphology change, locomotion, and environment interaction in
the same action space, and introduce an appropriate, coarse-to-fine curriculum
that enables us to discover policies that accomplish fine-grained control of
the resulting robots. We also introduce DittoGym, a comprehensive RL benchmark
for reconfigurable soft robots that require fine-grained morphology changes to
accomplish the tasks. Finally, we evaluate our proposed coarse-to-fine
algorithm on DittoGym and demonstrate robots that learn to change their
morphology several times within a sequence, uniquely enabled by our RL
algorithm. More results are available at
https://suninghuang19.github.io/dittogym_page/.
[LINK]
http://arxiv.org/abs/2401.13231v3
[DATE]
2025-05-12 09:12:08+08:00
[CATEGORIES]
cs.LG
Exact Spin Elimination in Ising Hamiltonians and Energy-Based Machine Learning
[AUTHORS]
Natalia G. Berloff
[ABSTRACT]
We present an exact spin-elimination technique that reduces the
dimensionality of both quadratic and k-local Ising Hamiltonians while
preserving their original ground-state configurations. By systematically
replacing each removed spin with an effective interaction among its neighbors,
our method lowers the total spin count without invoking approximations or
iterative recalculations. This capability is especially beneficial for
hardware-constrained platforms, classical or quantum, that can directly
implement multi-body interactions but have limited qubit or spin resources. We
demonstrate three key advances enabled by this technique. First, we handle
larger instances of benchmark problems such as Max-Cut on cubic graphs without
exceeding a 2-local interaction limit. Second, we reduce qubit requirements in
QAOA-based integer factorization on near-term quantum devices, thus extending
the feasible range of integers to be factorized. Third, we improve memory
capacity in Hopfield associative memories and enhance memory retrieval by
suppressing spurious attractors. Our
spin-elimination procedure trades local spin complexity for higher-order
couplings or higher node degrees in a single pass, opening new avenues for
scaling up combinatorial optimization and energy-based machine learning on
near-term hardware. Finally, these results underscore that the next-generation
physical spin machines will likely capitalize on k-local spin Hamiltonians to
offer an alternative to classical computations.
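
A rough illustration of why a spin can be removed exactly (a sketch of the
underlying identity only, not the paper's procedure): with the assumed
convention E(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i, minimizing over one
spin s_k in closed form leaves the term -|h_k + sum_j J_kj s_j|, which depends
only on the neighbors of k and is in general a multi-body interaction. The
function names, the sign convention, and the random instance below are
assumptions made for the example.

```python
import itertools
import numpy as np

def ising_energy(s, J, h):
    """E(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i (J symmetric, zero diagonal)."""
    s = np.asarray(s, dtype=float)
    return float(-0.5 * s @ J @ s - h @ s)

def reduced_energy(s_rest, J, h, k):
    """Energy after minimizing exactly over spin k; s_rest holds the other spins in order."""
    keep = [i for i in range(len(h)) if i != k]
    s_rest = np.asarray(s_rest, dtype=float)
    J_rest = J[np.ix_(keep, keep)]
    local_field = h[k] + J[k, keep] @ s_rest  # field felt by spin k
    # min over s_k in {-1, +1} of -s_k * local_field equals -|local_field|
    return float(-0.5 * s_rest @ J_rest @ s_rest - h[keep] @ s_rest - abs(local_field))

def ground_energy(energy_fn, n):
    """Brute-force minimum over all 2**n spin configurations."""
    return min(energy_fn(s) for s in itertools.product([-1, 1], repeat=n))

rng = np.random.default_rng(0)
n = 6
J = np.triu(rng.normal(size=(n, n)), 1)
J = J + J.T                      # symmetric couplings, zero diagonal
h = rng.normal(size=n)

full_min = ground_energy(lambda s: ising_energy(s, J, h), n)
reduced_min = ground_energy(lambda s: reduced_energy(s, J, h, k=0), n - 1)
```

The two minima agree because the minimum over all spins equals the minimum
over the remaining spins of the inner minimum over s_k; the cost is that the
absolute-value term couples all neighbors of the removed spin at once.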
[COMMENTS]
28 pages, 6 figures
[LINK]
http://arxiv.org/abs/2505.07163v1
[DATE]
2025-05-12 09:04:24+08:00
[CATEGORIES]
cs.LG
VoI-Driven Joint Optimization of Control and Communication in Vehicular Digital Twin Network
[AUTHORS]
Lei Lei, Kan Zheng, Jie Mei, Xuemin Shen
[ABSTRACT]
The vision of sixth-generation (6G) wireless networks paves the way for the
seamless integration of digital twins into vehicular networks, giving rise to a
Vehicular Digital Twin Network (VDTN). The large amount of computing resources
as well as the massive amount of spatial-temporal data in the Digital Twin (DT)
domain can be utilized to enhance the communication and control performance of
Internet of Vehicles (IoV) systems. In this article, we first propose the
architecture of VDTN, emphasizing key modules that center on functions related
to the joint optimization of control and communication. We then delve into the
intricacies of the multitimescale decision process inherent in joint
optimization in VDTN, specifically investigating the dynamic interplay between
control and communication. To facilitate the joint optimization, we define two
Value of Information (VoI) concepts rooted in control performance.
Subsequently, utilizing VoI as a bridge between control and communication, we
introduce a novel joint optimization framework, which involves iterative
processing of two Deep Reinforcement Learning (DRL) modules corresponding to
control and communication to derive the optimal policy. Finally, we conduct
simulations of the proposed framework applied to a platoon scenario to
demonstrate its effectiveness in ensu
[LINK]
http://arxiv.org/abs/2505.07892v1
[DATE]
2025-05-12 08:53:37+08:00
[CATEGORIES]
cs.LG
Audio Transformers
[AUTHORS]
Prateek Verma, Jonathan Berger
[ABSTRACT]
Over the past two decades, CNN architectures have produced compelling models
of sound perception and cognition, learning hierarchical organizations of
features. Analogous to successes in computer vision, audio feature
classification can be optimized for a particular task of interest, over a wide
variety of datasets and labels. In fact, similar architectures designed for
image understanding have proven effective for acoustic scene analysis. Here we
propose applying Transformer-based architectures without convolutional layers
to raw audio signals. On a standard dataset, Free Sound 50K, comprising 200
categories, our model outperforms convolutional models to produce
state-of-the-art results. This is significant because, unlike in natural
language processing and computer vision, we do not perform unsupervised
pre-training to outperform convolutional architectures. On the same training
set, we show a significant improvement with respect to mean average precision
benchmarks. We further improve the performance of Transformer architectures by
using techniques such as pooling, inspired by convolutional networks designed
in the past few years. In addition, we show how multi-rate signal processing
ideas inspired by wavelets can be applied to the Transformer embeddings to
improve the results. We also show how our model learns a non-linear,
non-constant-bandwidth filter bank, which shows an adaptable time-frequency
front-end representation for the task of audio understanding, different from
other tasks, e.g., pitch estimation.
[COMMENTS]
5 pages, 4 figures; Under review WASPAA 2021; Typo Fixes
[LINK]
http://arxiv.org/abs/2105.00335v2
[DATE]
2025-05-12 07:57:58+08:00
[CATEGORIES]
cs.LG
Generalized Compressed Sensing for Image Reconstruction with Diffusion Probabilistic Models
[AUTHORS]
Ling-Qi Zhang, Zahra Kadkhodaie, Eero P. Simoncelli, David H. Brainard
[ABSTRACT]
We examine the problem of selecting a small set of linear measurements for
reconstructing high-dimensional signals. Well-established methods for
optimizing such measurements include principal component analysis (PCA),
independent component analysis (ICA) and compressed sensing (CS) based on
random projections, all of which rely on axis- or subspace-aligned statistical
characterization of the signal source. However, many naturally occurring
signals, including photographic images, contain richer statistical structure.
To exploit such structure, we introduce a general method for obtaining an
optimized set of linear measurements for efficient image reconstruction, where
the signal statistics are expressed by the prior implicit in a neural network
trained to perform denoising (known as a “diffusion model”). We demonstrate
that the optimal measurements derived for two natural image datasets differ
from those of PCA, ICA, or CS, and result in substantially lower mean squared
reconstruction error. Interestingly, the marginal distributions of the
measurement values are asymmetrical (skewed), substantially more so than those
of previous methods. We also find that optimizing with respect to perceptual
loss, as quantified by structural similarity (SSIM), leads to measurements
different from those obtained when optimizing for MSE. Our results highlight
the importance of incorporating the specific statistical regularities of
natural signals when designing effective linear measurements.
[LINK]
http://arxiv.org/abs/2405.17456v3
[DATE]
2025-05-12 07:40:32+08:00
[CATEGORIES]
cs.LG
AugMixCloak: A Defense against Membership Inference Attacks via Image Transformation
[AUTHORS]
Heqing Ren, Chao Feng, Alberto Huertas, Burkhard Stiller
[ABSTRACT]
Traditional machine learning (ML) raises serious privacy concerns, while
federated learning (FL) mitigates the risk of data leakage by keeping data on
local devices. However, the training process of FL can still leak sensitive
information, which adversaries may exploit to infer private data. One of the
most prominent threats is the membership inference attack (MIA), where the
adversary aims to determine whether a particular data record was part of the
training set.
This paper addresses this problem through a two-stage defense called
AugMixCloak. The core idea is to apply data augmentation and principal
component analysis (PCA)-based information fusion to query images, which are
detected by perceptual hashing (pHash) as either identical to or highly similar
to images in the training set. Experimental results show that AugMixCloak
successfully defends against both binary classifier-based MIA and metric-based
MIA across five datasets and various decentralized FL (DFL) topologies.
Compared with regularization-based defenses, AugMixCloak demonstrates stronger
protection. Compared with confidence score masking, AugMixCloak exhibits better
generalization.
[LINK]
http://arxiv.org/abs/2505.07149v1
[DATE]
2025-05-12 07:38:44+08:00
[CATEGORIES]
cs.LG
Flow Matching with Gaussian Process Priors for Probabilistic Time Series Forecasting
[AUTHORS]
Marcel Kollovieh, Marten Lienen, David Lüdke, Leo Schwinn, Stephan Günnemann
[ABSTRACT]
Recent advancements in generative modeling, particularly diffusion models,
have opened new directions for time series modeling, achieving state-of-the-art
performance in forecasting and synthesis. However, the reliance of
diffusion-based models on a simple, fixed prior complicates the generative
process since the data and prior distributions differ significantly. We
introduce TSFlow, a conditional flow matching (CFM) model for time series
combining Gaussian processes, optimal transport paths, and data-dependent prior
distributions. By incorporating (conditional) Gaussian processes, TSFlow aligns
the prior distribution more closely with the temporal structure of the data,
enhancing both unconditional and conditional generation. Furthermore, we
propose conditional prior sampling to enable probabilistic forecasting with an
unconditionally trained model. In our experimental evaluation on eight
real-world datasets, we demonstrate the generative capabilities of TSFlow,
producing high-quality unconditional samples. Finally, we show that both
conditionally and unconditionally trained models achieve competitive results
across multiple forecasting benchmarks.
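
To make the idea of a data-aligned prior concrete, the following minimal
sketch builds one conditional flow matching training pair with a Gaussian
process prior and a straight-line path between prior sample and data. It is
an illustration under stated assumptions, not the TSFlow implementation: the
squared-exponential kernel, its lengthscale, the linear interpolation path,
and all function names are choices made for the example.

```python
import numpy as np

def rbf_kernel(length, lengthscale=3.0, jitter=1e-6):
    """Squared-exponential covariance over integer time steps (assumed kernel)."""
    t = np.arange(length, dtype=float)
    sq_dist = (t[:, None] - t[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale**2) + jitter * np.eye(length)

def sample_gp_prior(rng, batch, length):
    """Draw temporally correlated prior samples x0 ~ N(0, K)."""
    chol = np.linalg.cholesky(rbf_kernel(length))
    return rng.standard_normal((batch, length)) @ chol.T

def cfm_training_pair(rng, x1):
    """Return (t, x_t, target velocity) for a straight path from x0 to x1."""
    batch, length = x1.shape
    x0 = sample_gp_prior(rng, batch, length)
    t = rng.uniform(size=(batch, 1))
    x_t = (1.0 - t) * x0 + t * x1      # point on the interpolation path
    target_velocity = x1 - x0          # constant velocity along that path
    return t, x_t, target_velocity

rng = np.random.default_rng(0)
x1 = np.sin(np.linspace(0, 2 * np.pi, 24))[None, :].repeat(4, axis=0)
t, x_t, v = cfm_training_pair(rng, x1)
```

A network trained to regress `v` from `(t, x_t)` would then be integrated from
a prior sample at t = 0 to a data-like sample at t = 1; because the prior is
already smooth in time, the path it has to learn is shorter than from white
noise.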
[LINK]
http://arxiv.org/abs/2410.03024v2
[DATE]
2025-05-12 06:30:03+08:00
[CATEGORIES]
cs.LG
Triangulating PL functions and the existence of efficient ReLU DNNs
[AUTHORS]
Danny Calegari
[ABSTRACT]
We show that every piecewise linear function $f:\mathbb{R}^d \to \mathbb{R}$ with compact
support a polyhedron $P$ has a representation as a sum of so-called ‘simplex
functions’. Such representations arise from degree 1 triangulations of the
relative homology class (in $\mathbb{R}^{d+1}$) bounded by $P$ and the graph of $f$, and
give a short elementary proof of the existence of efficient universal ReLU
neural networks that simultaneously compute all such functions $f$ of bounded
complexity.
[COMMENTS]
4 pages
[LINK]
http://arxiv.org/abs/2505.07137v1
[DATE]
2025-05-12 06:20:16+08:00
[CATEGORIES]
cs.LG
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
[AUTHORS]
Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang
[ABSTRACT]
Preference alignment in Large Language Models (LLMs) has significantly
improved their ability to adhere to human instructions and intentions. However,
existing direct alignment algorithms primarily focus on relative preferences
and often overlook the qualitative aspects of responses, despite having access
to preference data that includes reward scores from judge models during AI
feedback. Striving to maximize the implicit reward gap between the chosen and
the slightly inferior rejected responses can cause overfitting and unnecessary
unlearning of the high-quality rejected responses. The unawareness of the
reward scores also drives the LLM to indiscriminately favor the low-quality
chosen responses and fail to generalize to optimal responses that are sparse in
data. To overcome these shortcomings, our study introduces reward-conditioned
LLM policies that discern and learn from the entire spectrum of response
quality within the dataset, helping extrapolate to more optimal regions. We
propose an effective yet simple data relabeling method that conditions the
preference pairs on quality scores to construct a reward-augmented dataset. The
experiments across various benchmarks and diverse models demonstrate that our
approach consistently boosts DPO by a considerable margin. Through
comprehensive ablation studies, we demonstrate that our method not only
maximizes the utility of preference data but also mitigates the issue of
unlearning, demonstrating its broad effectiveness beyond mere data expansion.
Our code is available at
https://github.com/shenao-zhang/reward-augmented-preference.
[COMMENTS]
Published at ICML 2025
[LINK]
http://arxiv.org/abs/2410.08067v6
[DATE]
2025-05-12 06:01:18+08:00
[CATEGORIES]
cs.LG
Learning from Samples: Inverse Problems over measures via Sharpened Fenchel-Young Losses
[AUTHORS]
Francisco Andrade, Gabriel Peyré, Clarice Poon
[ABSTRACT]
Estimating parameters from samples of an optimal probability distribution is
essential in applications ranging from socio-economic modeling to biological
system analysis. In these settings, the probability distribution arises as the
solution to an optimization problem that captures either static interactions
among agents or the dynamic evolution of a system over time. Our approach
relies on minimizing a new class of loss functions, called sharpened
Fenchel-Young losses, which measure the sub-optimality gap of the optimization
problem over the space of measures. We study the stability of this estimation
method when only a finite number of samples is available. The parameters to be
estimated typically correspond to a cost function in static problems and to a
potential function in dynamic problems. To analyze stability, we introduce a
general methodology that leverages the strong convexity of the loss function
together with the sample complexity of the forward optimization problem. Our
analysis emphasizes two specific settings in the context of optimal transport,
where our method provides explicit stability guarantees: The first is inverse
unbalanced optimal transport (iUOT) with entropic regularization, where the
parameters to estimate are cost functions that govern transport computations;
this method has applications such as link prediction in machine learning. The
second is inverse gradient flow (iJKO), where the objective is to recover a
potential function that drives the evolution of a probability distribution via
the Jordan-Kinderlehrer-Otto (JKO) time-discretization scheme; this is
particularly relevant for understanding cell population dynamics in single-cell
genomics. Finally, we validate our approach through numerical experiments on
Gaussian distributions, where closed-form solutions are available, to
demonstrate the practical performance of our methods.
[LINK]
http://arxiv.org/abs/2505.07124v1
[DATE]
2025-05-12 05:26:44+08:00
[CATEGORIES]
cs.LG
Leveraging State Space Models in Long Range Genomics
[AUTHORS]
Matvei Popov, Aymen Kallala, Anirudha Ramesh, Narimane Hennouni, Shivesh Khaitan, Rick Gentry, Alain-Sam Cohen
[ABSTRACT]
Long-range dependencies are critical for understanding genomic structure and
function, yet most conventional methods struggle with them. Widely adopted
transformer-based models, while excelling at short-context tasks, are limited
by the attention module’s quadratic computational complexity and inability to
extrapolate to sequences longer than those seen in training. In this work, we
explore State Space Models (SSMs) as a promising alternative by benchmarking
two SSM-inspired architectures, Caduceus and Hawk, on long-range genomics
modeling tasks under conditions parallel to a 50M parameter transformer
baseline. We discover that SSMs match transformer performance and exhibit
impressive zero-shot extrapolation across multiple tasks, handling contexts 10
to 100 times longer than those seen during training, indicating more
generalizable representations better suited for modeling the long and complex
human genome. Moreover, we demonstrate that these models can efficiently
process sequences of 1M tokens on a single GPU, allowing for modeling entire
genomic regions at once, even in labs with limited compute. Our findings
establish SSMs as efficient and scalable for long-context genomic analysis.
[COMMENTS]
Accepted at ICLR 2025 (Spotlight @ LMRL) - Project page:
https://anirudharamesh.github.io/iclr-long-range-genomics/
[LINK]
http://arxiv.org/abs/2504.06304v2
[DATE]
2025-05-12 04:33:43+08:00
[CATEGORIES]
cs.LG
Statistical Guarantees in Synthetic Data through Conformal Adversarial Generation
[AUTHORS]
Rahul Vishwakarma, Shrey Dharmendra Modi, Vishwanath Seshagiri
[ABSTRACT]
The generation of high-quality synthetic data presents significant challenges
in machine learning research, particularly regarding statistical fidelity and
uncertainty quantification. Existing generative models produce compelling
synthetic samples but lack rigorous statistical guarantees about their relation
to the underlying data distribution, limiting their applicability in critical
domains requiring robust error bounds. We address this fundamental limitation
by presenting a novel framework that incorporates conformal prediction
methodologies into Generative Adversarial Networks (GANs). By integrating
multiple conformal prediction paradigms including Inductive Conformal
Prediction (ICP), Mondrian Conformal Prediction, Cross-Conformal Prediction,
and Venn-Abers Predictors, we establish distribution-free uncertainty
quantification in generated samples. This approach, termed Conformalized GAN
(cGAN), demonstrates enhanced calibration properties while maintaining the
generative power of traditional GANs, producing synthetic data with provable
statistical guarantees. We provide rigorous mathematical proofs establishing
finite-sample validity guarantees and asymptotic efficiency properties,
enabling the reliable application of synthetic data in high-stakes domains
including healthcare, finance, and autonomous systems.
[COMMENTS]
6 pages, 1 figure
[LINK]
http://arxiv.org/abs/2504.17058v3
[DATE]
2025-05-12 04:31:29+08:00
[CATEGORIES]
cs.LG
Knowledge Distillation for Enhancing Walmart E-commerce Search Relevance Using Large Language Models
[AUTHORS]
Hongwei Shang, Nguyen Vo, Nitin Yadav, Tian Zhang, Ajit Puthenputhussery, Xunfan Cai, Shuyi Chen, Prijith Chandran, Changsung Kang
[ABSTRACT]
Ensuring the products displayed in e-commerce search results are relevant to
users’ queries is crucial for improving the user experience. With their advanced
semantic understanding, deep learning models have been widely used for
relevance matching in search tasks. While large language models (LLMs) offer
superior ranking capabilities, it is challenging to deploy LLMs in real-time
systems due to their high inference latency. To leverage the ranking power of
LLMs while meeting the low-latency demands of production systems, we propose a
novel framework that distills a high performing LLM into a more efficient,
low-latency student model. To help the student model learn more effectively
from the teacher model, we first train the teacher LLM as a classification
model with soft targets. Then, we train the student model to capture the
relevance margin between pairs of products for a given query using mean squared
error loss. Instead of using the same training data as the teacher model, we
significantly expand the student model dataset by generating unlabeled data and
labeling it with the teacher model predictions. Experimental results show that
the student model performance continues to improve as the size of the augmented
training data increases. In fact, with enough augmented data, the student model
can outperform the teacher model. The student model has been successfully
deployed in production at Walmart.com with significantly positive metrics.
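
The student objective described above, matching the teacher's relevance margin
between pairs of products with a mean squared error loss, can be sketched as
follows. This is a minimal illustration, not the production code: the function
name, the toy scores, and the choice to enumerate all pairs for one query are
assumptions.

```python
import numpy as np

def margin_mse_loss(student_scores, teacher_scores, pairs):
    """MSE between student and teacher score margins over product pairs of one query."""
    student_scores = np.asarray(student_scores, dtype=float)
    teacher_scores = np.asarray(teacher_scores, dtype=float)
    i, j = np.asarray(pairs).T
    student_margin = student_scores[i] - student_scores[j]
    teacher_margin = teacher_scores[i] - teacher_scores[j]
    return float(np.mean((student_margin - teacher_margin) ** 2))

# Toy relevance scores for three products returned for a single query.
teacher = [0.9, 0.4, 0.1]
student = [0.7, 0.5, 0.1]
pairs = [(0, 1), (0, 2), (1, 2)]
loss = margin_mse_loss(student, teacher, pairs)
```

Because only score differences enter the loss, the student is free to shift
all of its scores by a constant; it is penalized only for getting the relative
spacing between products wrong.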
[COMMENTS]
9 pages, published at WWW’25
[LINK]
http://arxiv.org/abs/2505.07105v1
[DATE]
2025-05-12 04:00:00+08:00
[CATEGORIES]
cs.LG
Tight Finite Time Bounds of Two-Time-Scale Linear Stochastic Approximation with Markovian Noise
[AUTHORS]
Shaan Ul Haque, Sajad Khodadadian, Siva Theja Maguluri
[ABSTRACT]
Stochastic approximation (SA) is an iterative algorithm for finding the fixed
point of an operator using noisy samples and is widely used in optimization and
Reinforcement Learning (RL). The noise in RL exhibits a Markovian structure,
and in some cases, such as gradient temporal difference (GTD) methods, SA is
employed in a two-time-scale framework. This combination introduces significant
theoretical challenges for analysis.
We derive an upper bound on the error for the iterations of linear
two-time-scale SA with Markovian noise. We demonstrate that the mean squared
error decreases as $\mathrm{trace}(\Sigma^y)/k + o(1/k)$ where $k$ is the number of
iterates, and $\Sigma^y$ is an appropriately defined covariance matrix. A key
feature of our bounds is that the leading term, $\Sigma^y$, exactly matches
the covariance in the Central Limit Theorem (CLT) for the two-time-scale
SA, and we call them tight finite-time bounds. We illustrate their use in RL by
establishing sample complexity for off-policy algorithms, TDC, GTD, and GTD2.
A special case of linear two-time-scale SA that is extensively studied is
linear SA with Polyak-Ruppert averaging. We present tight finite time bounds
corresponding to the covariance matrix of the CLT. Such bounds can be used to
study TD-learning with Polyak-Ruppert averaging.
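
The remark that Polyak-Ruppert averaging is a special case of two-time-scale
SA can be made concrete with a small sketch: the running average is itself an
iterate updated with step size 1/(k+1), slower than the step size of the main
iterate. The example below is a simplification under stated assumptions: it
uses i.i.d. Gaussian noise rather than Markovian noise, and the matrix, step
size schedule, and function name are chosen only for illustration.

```python
import numpy as np

def linear_sa_with_averaging(A, b, n_steps, rng, noise_std=0.1):
    """Solve A x = b from noisy residuals; return last iterate and its running average."""
    dim = len(b)
    x = np.zeros(dim)  # fast iterate
    y = np.zeros(dim)  # slow iterate: running average of x_0 .. x_{k-1}
    for k in range(n_steps):
        # Slow time scale: step size 1/(k+1) reproduces the exact running average.
        y = y + (x - y) / (k + 1)
        # Fast time scale: noisy linear update with a more slowly decaying step size.
        alpha = 0.5 / (k + 1) ** 0.75
        noise = noise_std * rng.standard_normal(dim)
        x = x + alpha * (b - A @ x + noise)
    return x, y

A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(A, b)

rng = np.random.default_rng(0)
x_last, y_avg = linear_sa_with_averaging(A, b, n_steps=20000, rng=rng)
```

In this setting the averaged iterate `y_avg` is the quantity whose mean
squared error the $1/k$ leading term describes; the last iterate `x_last`
fluctuates at the scale set by the current step size.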
[COMMENTS]
83 pages, 6 figures
[LINK]
http://arxiv.org/abs/2401.00364v2
[DATE]
2025-05-12 03:55:30+08:00
[CATEGORIES]
cs.LG
SHAP values via sparse Fourier representation
[AUTHORS]
Ali Gorji, Andisheh Amrollahi, Andreas Krause
[ABSTRACT]
SHAP (SHapley Additive exPlanations) values are a widely used method for
local feature attribution in interpretable and explainable AI. We propose an
efficient two-stage algorithm for computing SHAP values in both black-box
settings and tree-based models. Motivated by spectral bias in real-world
predictors, we first approximate models using compact Fourier representations,
exactly for trees and approximately for black-box models. In the second stage,
we introduce a closed-form formula for exactly computing SHAP values
using the Fourier representation, which “linearizes” the computation into a
simple summation and is amenable to parallelization. As the Fourier
approximation is computed only once, our method enables amortized SHAP value
computation, achieving significant speedups over existing methods and a tunable
trade-off between efficiency and precision.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2410.06300v2
[DATE]
2025-05-12 03:42:54+08:00
[CATEGORIES]
cs.LG
Moral Alignment for LLM Agents
[AUTHORS]
Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
[COMMENTS]
Published at the 13th International Conference on Learning
Representations (ICLR’25), Singapore, Apr 2025.
https://openreview.net/forum?id=MeGDmZjUXy
[LINK]
http://arxiv.org/abs/2410.01639v4
[DATE]
2025-05-12 03:14:09+08:00
[CATEGORIES]
cs.LG
Physics-informed Multiple-Input Operators for efficient dynamic response prediction of structures
[AUTHORS]
Bilal Ahmed, Yuqing Qiu, Diab W. Abueidda, Waleed El-Sekelly, Tarek Abdoun, Mostafa E. Mobasher
[ABSTRACT]
Finite element (FE) modeling is essential for structural analysis but remains
computationally intensive, especially under dynamic loading. While operator
learning models have shown promise in replicating static structural responses
at FEM-level accuracy, modeling dynamic behavior remains more challenging. This
work presents a Multiple Input Operator Network (MIONet) that incorporates a
second trunk network to explicitly encode temporal dynamics, enabling accurate
prediction of structural responses under moving loads. Traditional DeepONet
architectures using recurrent neural networks (RNNs) are limited by fixed time
discretization and struggle to capture continuous dynamics. In contrast, MIONet
predicts responses continuously over both space and time, removing the need for
stepwise modeling. It maps scalar inputs including load type, velocity,
spatial mesh, and time steps to full-field structural responses. To improve
efficiency and enforce physical consistency, we introduce a physics-informed
loss based on dynamic equilibrium using precomputed mass, damping, and
stiffness matrices, without solving the governing PDEs directly. Further, a
Schur complement formulation reduces the training domain, significantly cutting
computational costs while preserving global accuracy. The model is validated on
both a simple beam and the KW-51 bridge, achieving FEM-level accuracy within
seconds. Compared to GRU-based DeepONet, our model offers comparable accuracy
with improved temporal continuity and over 100 times faster inference, making
it well suited for real-time structural monitoring and digital twin
applications.
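
A loss based on dynamic equilibrium with precomputed matrices can be sketched
as the mean squared residual of M u'' + C u' + K u = f. The snippet below is a
minimal illustration, not the authors' implementation: it takes the time
derivatives as given arrays (in a trained operator network they would
typically come from automatic differentiation), and the function name and the
single-degree-of-freedom test case are assumptions.

```python
import numpy as np

def dynamic_equilibrium_loss(M, C, K, u, u_dot, u_ddot, f):
    """Mean squared residual of M u'' + C u' + K u = f.

    u, u_dot, u_ddot, f have shape (n_times, n_dof); M, C, K are (n_dof, n_dof).
    """
    residual = u_ddot @ M.T + u_dot @ C.T + u @ K.T - f
    return float(np.mean(residual ** 2))

# Undamped single-DOF oscillator: u(t) = cos(omega t) satisfies u'' + omega^2 u = 0.
omega = 2.0
t = np.linspace(0.0, 1.0, 50)[:, None]
M = np.array([[1.0]])
C = np.array([[0.0]])
K = np.array([[omega**2]])
u = np.cos(omega * t)
u_dot = -omega * np.sin(omega * t)
u_ddot = -omega**2 * np.cos(omega * t)
f = np.zeros_like(u)

loss_exact = dynamic_equilibrium_loss(M, C, K, u, u_dot, u_ddot, f)
loss_wrong = dynamic_equilibrium_loss(M, C, K, 1.1 * u, u_dot, u_ddot, f)
```

The exact solution gives a residual of zero, while a displacement field that
violates equilibrium is penalized, which is what lets such a term supervise
the network without solving the governing equations during training.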
[LINK]
http://arxiv.org/abs/2505.07090v1
[DATE]
2025-05-12 02:45:58+08:00
[CATEGORIES]
cs.LG
Neural empirical interpolation method for nonlinear model reduction
[AUTHORS]
Max Hirsch, Federico Pichi, Jan S. Hesthaven
[ABSTRACT]
In this paper, we introduce the neural empirical interpolation method (NEIM),
a neural network-based alternative to the discrete empirical interpolation
method for reducing the time complexity of computing the nonlinear term in a
reduced order model (ROM) for a parameterized nonlinear partial differential
equation. NEIM is a greedy algorithm which accomplishes this reduction by
approximating an affine decomposition of the nonlinear term of the ROM, where
the vector terms of the expansion are given by neural networks depending on the
ROM solution, and the coefficients are given by an interpolation of some
“optimal” coefficients. Because NEIM is based on a greedy strategy, we are able
to provide a basic error analysis to investigate its performance. NEIM has the
advantages of being easy to implement in models with automatic differentiation,
of being a nonlinear projection of the ROM nonlinearity, of being efficient for
both nonlocal and local nonlinearities, and of relying solely on data and not
the explicit form of the ROM nonlinearity. We demonstrate the effectiveness of
the methodology on solution-dependent and solution-independent nonlinearities,
a nonlinear elliptic problem, and a nonlinear parabolic model of liquid
crystals.
Code availability: https://github.com/maxhirsch/NEIM
[LINK]
http://arxiv.org/abs/2406.03562v2
[DATE]
2025-05-12 02:38:53+08:00
[CATEGORIES]
cs.LG
DEFT: Efficient Fine-Tuning of Diffusion Models by Learning the Generalised $h$-transform
[AUTHORS]
Alexander Denker, Francisco Vargas, Shreyas Padhy, Kieran Didi, Simon Mathis, Vincent Dutordoir, Riccardo Barbano, Emile Mathieu, Urszula Julia Komorowska, Pietro Lio
[ABSTRACT]
Generative modelling paradigms based on denoising diffusion processes have
emerged as a leading candidate for conditional sampling in inverse problems. In
many real-world applications, we often have access to large, expensively
trained unconditional diffusion models, which we aim to exploit for improving
conditional sampling. Most recent approaches are motivated heuristically and
lack a unifying framework, obscuring connections between them. Further, they
often suffer from issues such as being very sensitive to hyperparameters, being
expensive to train or needing access to weights hidden behind a closed API. In
this work, we unify conditional training and sampling using the mathematically
well-understood Doob’s h-transform. This new perspective allows us to unify
many existing methods under a common umbrella. Under this framework, we propose
DEFT (Doob’s h-transform Efficient FineTuning), a new approach for conditional
generation that simply fine-tunes a very small network to quickly learn the
conditional $h$-transform, while keeping the larger unconditional network
unchanged. DEFT is much faster than existing baselines while achieving
state-of-the-art performance across a variety of linear and non-linear
benchmarks. On image reconstruction tasks, we achieve speedups of up to
1.6$\times$, while having the best perceptual quality on natural images and
reconstruction performance on medical images. Further, we also provide initial
experiments on protein motif scaffolding and outperform reconstruction guidance
methods.
[COMMENTS]
arXiv admin note: text overlap with arXiv:2312.09236
[LINK]
http://arxiv.org/abs/2406.01781v4
[DATE]
2025-05-12 02:28:43+08:00
[CATEGORIES]
cs.LG
Discovering Concept Directions from Diffusion-based Counterfactuals via Latent Clustering
[AUTHORS]
Payal Varshney, Adriano Lucieri, Christoph Balada, Andreas Dengel, Sheraz Ahmed
[ABSTRACT]
Concept-based explanations have emerged as an effective approach within
Explainable Artificial Intelligence, enabling interpretable insights by
aligning model decisions with human-understandable concepts. However, existing
methods rely on computationally intensive procedures and struggle to
efficiently capture complex, semantic concepts. Recently, the Concept Discovery
through Latent Diffusion-based Counterfactual Trajectories (CDCT) framework,
introduced by Varshney et al. (2025), attempts to identify concepts via
dimension-wise traversal of the latent space of a Variational Autoencoder
trained on counterfactual trajectories. Extending the CDCT framework, this work
introduces Concept Directions via Latent Clustering (CDLC), which extracts
global, class-specific concept directions by clustering latent difference
vectors derived from factual and diffusion-generated counterfactual image
pairs. CDLC substantially reduces computational complexity by eliminating the
exhaustive latent dimension traversal required in CDCT and enables the
extraction of multidimensional semantic concepts encoded across the latent
dimensions. This approach is validated on a real-world skin lesion dataset,
demonstrating that the extracted concept directions align with clinically
recognized dermoscopic features and, in some cases, reveal dataset-specific
biases or unknown biomarkers. These results highlight that CDLC is
interpretable, scalable, and applicable across high-stakes domains and diverse
data modalities.
[LINK]
http://arxiv.org/abs/2505.07073v1
[DATE]
2025-05-12 01:53:02+08:00
[CATEGORIES]
cs.LG
Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures
[AUTHORS]
Francesco Cagnetta, Alessandro Favero, Antonio Sclocchi, Matthieu Wyart
[ABSTRACT]
How do neural language models acquire a language’s structure when trained for
next-token prediction? We address this question by deriving theoretical scaling
laws for neural network performance on synthetic datasets generated by the
Random Hierarchy Model (RHM) – an ensemble of probabilistic context-free
grammars designed to capture the hierarchical structure of natural language
while remaining analytically tractable. Previously, we developed a theory of
representation learning based on data correlations that explains how deep
learning models capture the hierarchical structure of the data sequentially,
one layer at a time. Here, we extend our theoretical framework to account for
architectural differences. In particular, we predict and empirically validate
that convolutional networks, whose structure aligns with that of the generative
process through locality and weight sharing, enjoy a faster scaling of
performance compared to transformer models, which rely on global self-attention
mechanisms. This finding clarifies the architectural biases underlying neural
scaling laws and highlights how representation learning is shaped by the
interaction between model architecture and the statistical properties of data.
[COMMENTS]
14 pages, 8 figures
[LINK]
http://arxiv.org/abs/2505.07070v1
[DATE]
2025-05-12 01:44:14+08:00
[CATEGORIES]
cs.LG
A Sparse Bayesian Learning Algorithm for Estimation of Interaction Kernels in Motsch-Tadmor Model
[AUTHORS]
Jinchao Feng, Sui Tang
[ABSTRACT]
In this paper, we investigate the data-driven identification of asymmetric
interaction kernels in the Motsch-Tadmor model based on observed trajectory
data. The model under consideration is governed by a class of semilinear
evolution equations, where the interaction kernel defines a normalized,
state-dependent Laplacian operator that governs collective dynamics. To address
the resulting nonlinear inverse problem, we propose a variational framework
that reformulates kernel identification using the implicit form of the
governing equations, reducing it to a subspace identification problem. We
establish an identifiability result that characterizes conditions under which
the interaction kernel can be uniquely recovered up to scale. To solve the
inverse problem robustly, we develop a sparse Bayesian learning algorithm that
incorporates informative priors for regularization, quantifies uncertainty, and
enables principled model selection. Extensive numerical experiments on
representative interacting particle systems demonstrate the accuracy,
robustness, and interpretability of the proposed framework across a range of
noise levels and data regimes.
[COMMENTS]
18 pages
[LINK]
http://arxiv.org/abs/2505.07068v1
[DATE]
2025-05-12 01:43:32+08:00
[CATEGORIES]
cs.LG
Learning curves theory for hierarchically compositional data with power-law distributed features
[AUTHORS]
Francesco Cagnetta, Hyunmo Kang, Matthieu Wyart
[ABSTRACT]
Recent theories suggest that Neural Scaling Laws arise whenever the task is
linearly decomposed into power-law distributed units. Alternatively, scaling
laws also emerge when data exhibit a hierarchically compositional structure, as
is thought to occur in language and images. To unify these views, we consider
classification and next-token prediction tasks based on probabilistic
context-free grammars – probabilistic models that generate data via a
hierarchy of production rules. For classification, we show that having
power-law distributed production rules results in a power-law learning curve
with an exponent depending on the rules’ distribution and a large
multiplicative constant that depends on the hierarchical structure. By
contrast, for next-token prediction, the distribution of production rules
controls the local details of the learning curve, but not the exponent
describing the large-scale behaviour.
[LINK]
http://arxiv.org/abs/2505.07067v1
[DATE]
2025-05-12 01:38:40+08:00
[CATEGORIES]
cs.LG
On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance
[AUTHORS]
Jaskirat Singh, Emad Fallahzadeh, Bram Adams, Ahmed E. Hassan
[ABSTRACT]
Deciding what combination of operators to use across the Edge AI tiers to
achieve specific latency and model performance requirements is an open question
for MLOps engineers. This study aims to empirically assess the accuracy vs
inference time trade-off of different black-box Edge AI deployment strategies,
i.e., combinations of deployment operators and deployment tiers. In this paper,
we conduct inference experiments involving 3 deployment operators (i.e.,
Partitioning, Quantization, Early Exit), 3 deployment tiers (i.e., Mobile,
Edge, Cloud) and their combinations on four widely used Computer-Vision models
to investigate the optimal strategies from the point of view of MLOps
developers. Our findings suggest that Edge deployment using the hybrid
Quantization + Early Exit operator could be preferred over non-hybrid operators
(Quantization/Early Exit on Edge, Partition on Mobile-Edge) when faster latency
is a concern at medium accuracy loss. However, when minimizing accuracy loss is
a concern, MLOps engineers should prefer using only a Quantization operator on
edge at a latency reduction or increase, respectively over the Early
Exit/Partition (on edge/mobile-edge) and Quantized Early Exit (on edge)
operators. In scenarios constrained by Mobile CPU/RAM resources, a preference
for Partitioning across mobile and edge tiers is observed over mobile
deployment. For models with smaller input data samples (such as FCN), a
network-constrained cloud deployment can also be a better alternative than
Mobile/Edge deployment and Partitioning strategies. For models with large input
data samples (ResNet, ResNext, DUC), an edge tier having higher
network/computational capabilities than Cloud/Mobile can be a more viable
option than Partitioning and Mobile/Cloud deployment strategies.
[LINK]
http://arxiv.org/abs/2403.17154v3
[DATE]
2025-05-12 01:15:02+08:00
[CATEGORIES]
cs.LG
YANNs: Y-wise Affine Neural Networks for Exact and Efficient Representations of Piecewise Linear Functions
[AUTHORS]
Austin Braniff, Yuhe Tian
[ABSTRACT]
This work formally introduces Y-wise Affine Neural Networks (YANNs), a
fully-explainable network architecture that continuously and efficiently
represents piecewise affine functions with polytopic subdomains. Following from
the proofs, it is shown that the development of YANNs requires no training to
achieve the functionally equivalent representation. YANNs thus maintain all
mathematical properties of the original formulations. Multi-parametric model
predictive control is utilized as an application showcase of YANNs, which
theoretically computes optimal control laws as a piecewise affine function of
states, outputs, setpoints, and disturbances. With the exact representation of
multi-parametric control laws, YANNs retain essential control-theoretic
guarantees such as recursive feasibility and stability. This sets YANNs apart
from the existing works which apply neural networks for approximating optimal
control laws instead of exactly representing them. By optimizing the inference
speed of the networks, YANNs can evaluate substantially faster in real-time
compared to traditional piecewise affine function calculations. Numerical case
studies are presented to demonstrate the algorithmic scalability with respect
to the input/output dimensions and the number of subdomains. YANNs represent a
significant advancement in control as the first neural network-based controller
that inherently ensures both feasibility and stability. Future applications can
leverage them as an efficient and interpretable starting point for data-driven
modeling/control.
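For context, the "traditional piecewise affine function calculation" that the abstract uses as a baseline can be pictured as a point-location search followed by an affine map. The sketch below illustrates only that baseline idea, not the YANN architecture; the names `Region` and `evaluate_pwa` and the one-dimensional saturation example are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    """One polytopic subdomain {x : A x <= b} with affine law u = F x + g.

    All names here are hypothetical and chosen for illustration only.
    """
    A: List[List[float]]
    b: List[float]
    F: List[List[float]]
    g: List[float]


def matvec(M: List[List[float]], x: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def evaluate_pwa(regions: List[Region], x: List[float],
                 tol: float = 1e-9) -> Optional[List[float]]:
    """Sequential search: find the first region containing x, apply its law.

    Returns None when x lies outside every region.
    """
    for r in regions:
        inside = all(lhs <= rhs + tol for lhs, rhs in zip(matvec(r.A, x), r.b))
        if inside:
            return [fx + gi for fx, gi in zip(matvec(r.F, x), r.g)]
    return None


# Toy 1-D law u = clip(-2x, -1, 1), written as three regions.
regions = [
    Region(A=[[1.0]], b=[-0.5], F=[[0.0]], g=[1.0]),                # x <= -0.5
    Region(A=[[-1.0], [1.0]], b=[0.5, 0.5], F=[[-2.0]], g=[0.0]),   # |x| <= 0.5
    Region(A=[[-1.0]], b=[-0.5], F=[[0.0]], g=[-1.0]),              # x >= 0.5
]
```

The search cost of this baseline grows with the number of subdomains, which is the quantity the abstract's scalability studies vary.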
[LINK]
http://arxiv.org/abs/2505.07054v1
[DATE]
2025-05-12 00:55:38+08:00
[CATEGORIES]
cs.LG
Streaming Krylov-Accelerated Stochastic Gradient Descent
[AUTHORS]
Stephen Thomas
[ABSTRACT]
We present SKA-SGD (Streaming Krylov-Accelerated Stochastic Gradient
Descent), a novel optimization approach that accelerates convergence for
ill-conditioned problems by projecting stochastic gradients onto a
low-dimensional Krylov subspace. Directly inspired by recent advances in s-step
Conjugate Gradient methods with streaming Gauss-Seidel Gram solvers
\cite{dambra2025sstep}, our method extends these techniques to the stochastic
optimization domain. Our approach combines three key innovations: (1)
projection coefficients computed via a single streaming Gauss-Seidel iteration,
which is mathematically equivalent to Modified Gram-Schmidt orthogonalization;
(2) a Chebyshev polynomial basis for constructing the Krylov subspace,
providing superior numerical stability; and (3) efficient implementation for
AMD GPUs using HIP. We prove that our streaming approach achieves a backward
error near machine precision with $O(s^2)$ complexity rather than $O(s^3)$,
where $s$ is the Krylov subspace dimension. Experimental results demonstrate
that SKA-SGD significantly outperforms standard SGD and Adam in convergence
rate and final error, particularly for problems with condition numbers
exceeding $10^3$. GPU performance analysis reveals a crossover point where
communication-avoiding benefits outweigh computational overhead, typically
occurring at moderate scale ($p \approx 64$ processors) for problem sizes $n
\geq 10^6$.
[LINK]
http://arxiv.org/abs/2505.07046v1
[DATE]
2025-05-12 00:36:20+08:00
[CATEGORIES]
cs.LG
Reinforcement Learning (RL) Meets Urban Climate Modeling: Investigating the Efficacy and Impacts of RL-Based HVAC Control
[AUTHORS]
Junjie Yu, John S. Schreck, David John Gagne, Keith W. Oleson, Jie Li, Yongtu Liang, Qi Liao, Mingfei Sun, David O. Topping, Zhonghua Zheng
[ABSTRACT]
Reinforcement learning (RL)-based heating, ventilation, and air conditioning
(HVAC) control has emerged as a promising technology for reducing building
energy consumption while maintaining indoor thermal comfort. However, the
efficacy of such strategies is influenced by the background climate and their
implementation may potentially alter both the indoor climate and local urban
climate. This study proposes an integrated framework combining RL with an urban
climate model that incorporates a building energy model, aiming to evaluate the
efficacy of RL-based HVAC control across different background climates, impacts
of RL strategies on indoor climate and local urban climate, and the
transferability of RL strategies across cities. Our findings reveal that the
reward (defined as a weighted combination of energy consumption and thermal
comfort) and the impacts of RL strategies on indoor climate and local urban
climate exhibit marked variability across cities with different background
climates. The sensitivity of reward weights and the transferability of RL
strategies are also strongly influenced by the background climate. Cities in
hot climates tend to achieve higher rewards across most reward weight
configurations that balance energy consumption and thermal comfort, and those
cities with more varying atmospheric temperatures demonstrate greater RL
strategy transferability. These findings underscore the importance of
thoroughly evaluating RL-based HVAC control strategies in diverse climatic
contexts. This study also provides a new insight that city-to-city learning
will potentially aid the deployment of RL-based HVAC control.
[LINK]
http://arxiv.org/abs/2505.07045v1
[DATE]
2025-05-12 00:33:42+08:00
[CATEGORIES]
cs.LG
Empirical Analysis of Asynchronous Federated Learning on Heterogeneous Devices: Efficiency, Fairness, and Privacy Trade-offs
[AUTHORS]
Samaneh Mohammadi, Iraklis Symeonidis, Ali Balador, Francesco Flammini
[ABSTRACT]
Device heterogeneity poses major challenges in Federated Learning (FL), where
resource-constrained clients slow down synchronous schemes that wait for all
updates before aggregation. Asynchronous FL addresses this by incorporating
updates as they arrive, substantially improving efficiency. While its
efficiency gains are well recognized, its privacy costs remain largely
unexplored, particularly for high-end devices that contribute updates more
frequently, increasing their cumulative privacy exposure. This paper presents
the first comprehensive analysis of the efficiency-fairness-privacy trade-off
in synchronous vs. asynchronous FL under realistic device heterogeneity. We
empirically compare FedAvg and staleness-aware FedAsync using a physical
testbed of five edge devices spanning diverse hardware tiers, integrating Local
Differential Privacy (LDP) and the Moments Accountant to quantify per-client
privacy loss. Using Speech Emotion Recognition (SER) as a privacy-critical
benchmark, we show that FedAsync achieves up to 10x faster convergence but
exacerbates fairness and privacy disparities: high-end devices contribute 6-10x
more updates and incur up to 5x higher privacy loss, while low-end devices
suffer amplified accuracy degradation due to infrequent, stale, and
noise-perturbed updates. These findings motivate the need for adaptive FL
protocols that jointly optimize aggregation and privacy mechanisms based on
client capacity and participation dynamics, moving beyond static,
one-size-fits-all solutions.
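To make "staleness-aware" aggregation concrete, the sketch below shows one common form of asynchronous mixing, in which the server blends each arriving client model into the global model with a weight that shrinks as the update gets older. The polynomial decay and the values of `alpha` and `a` are assumptions for illustration; the abstract does not state the exact weighting used in the study.

```python
from typing import List


def staleness_weight(staleness: int, alpha: float = 0.6, a: float = 0.5) -> float:
    """Mixing weight alpha * (staleness + 1)^(-a); assumed polynomial decay."""
    return alpha * (staleness + 1) ** (-a)


def async_update(global_w: List[float], client_w: List[float],
                 current_round: int, client_round: int,
                 alpha: float = 0.6, a: float = 0.5) -> List[float]:
    """Blend one client model into the global model as soon as it arrives.

    client_round is the global round at which the client fetched the model,
    so current_round - client_round measures how stale the update is.
    """
    mix = staleness_weight(current_round - client_round, alpha, a)
    return [(1.0 - mix) * g + mix * c for g, c in zip(global_w, client_w)]


# A fresh update moves the global model further than a stale one.
fresh = async_update([0.0], [1.0], current_round=5, client_round=5)
stale = async_update([0.0], [1.0], current_round=5, client_round=2)
```

Under this kind of rule, fast devices contribute many fresh, heavily weighted updates, which is consistent with the participation imbalance the abstract reports.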
[COMMENTS]
This paper was accepted to IJCNN 2025. This version is a preprint and
not the official published version
[LINK]
http://arxiv.org/abs/2505.07041v1
[DATE]
2025-05-12 00:25:06+08:00
[CATEGORIES]
cs.LG
Predicting Diabetes Using Machine Learning: A Comparative Study of Classifiers
[AUTHORS]
Mahade Hasan, Farhana Yasmin
[ABSTRACT]
Diabetes remains a significant health challenge globally, contributing to
severe complications like kidney disease, vision loss, and heart issues. The
application of machine learning (ML) in healthcare enables efficient and
accurate disease prediction, offering avenues for early intervention and
patient support. Our study introduces an innovative diabetes prediction
framework, leveraging both traditional ML techniques such as Logistic
Regression, SVM, Naïve Bayes, and Random Forest and advanced ensemble methods
like AdaBoost, Gradient Boosting, Extra Trees, and XGBoost. Central to our
approach is the development of a novel model, DNet, a hybrid architecture
combining Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM)
layers for effective feature extraction and sequential learning. The DNet model
comprises an initial convolutional block for capturing essential features,
followed by a residual block with skip connections to facilitate efficient
information flow. Batch Normalization and Dropout are employed for robust
regularization, and an LSTM layer captures temporal dependencies within the
data. Using a Kaggle-sourced real-world diabetes dataset, our model evaluation
spans cross-validation accuracy, precision, recall, F1 score, and ROC-AUC.
Among the models, DNet demonstrates the highest efficacy with an accuracy of
99.79% and an AUC-ROC of 99.98%, establishing its potential for superior
diabetes prediction. This robust hybrid architecture showcases the value of
combining CNN and LSTM layers, emphasizing its applicability in medical
diagnostics and disease prediction tasks.
[LINK]
http://arxiv.org/abs/2505.07036v1
[DATE]
2025-05-12 00:14:31+08:00
[CATEGORIES]
cs.LG
Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks
[AUTHORS]
Yixin Cheng, Hongcheng Guo, Yangming Li, Leonid Sigal
[ABSTRACT]
Text watermarking aims to subtly embed statistical signals into text by
controlling the Large Language Model (LLM)’s sampling process, enabling
watermark detectors to verify that the output was generated by the specified
model. The robustness of these watermarking algorithms has become a key factor
in evaluating their effectiveness. Current text watermarking algorithms embed
watermarks in high-entropy tokens to ensure text quality. In this paper, we
reveal that this seemingly benign design can be exploited by attackers, posing
a significant risk to the robustness of the watermark. We introduce a generic
efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA),
which leverages the vulnerability by calculating the self-information of each
token to identify potential pattern tokens and perform a targeted attack. Our
work exposes a widely prevalent vulnerability in current watermarking
algorithms. The experimental results show SIRA achieves nearly 100% attack
success rates on seven recent watermarking methods with only 0.88 USD per
million tokens cost. Our approach does not require any access to the watermark
algorithms or the watermarked LLM and can seamlessly transfer to any LLM as the
attack model, even mobile-level models. Our findings highlight the urgent need
for more robust watermarking.
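The self-information of a token is the standard quantity -log2 p(token | context), so rarer tokens carry more bits. The sketch below shows only that calculation and a simple thresholding step; the probabilities, the threshold, and the function names are made up for illustration, and in practice the probabilities would come from a language model. It is not the SIRA pipeline itself.

```python
import math
from typing import List


def self_information(prob: float) -> float:
    """Self-information in bits of an event with probability prob (0 < prob <= 1)."""
    return -math.log2(prob)


def flag_high_information(tokens: List[str], probs: List[float],
                          threshold_bits: float) -> List[str]:
    """Return the tokens whose self-information meets or exceeds the threshold."""
    return [tok for tok, p in zip(tokens, probs)
            if self_information(p) >= threshold_bits]


# Hypothetical per-token probabilities, as a language model might assign them.
tokens = ["the", "quantum", "cat"]
probs = [0.5, 0.125, 0.25]
candidates = flag_high_information(tokens, probs, threshold_bits=2.0)
```

Tokens flagged this way are the high-entropy positions where, according to the abstract, current schemes tend to place the watermark.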
[COMMENTS]
ICML 2025 Accepted
[LINK]
http://arxiv.org/abs/2505.05190v2
[DATE]
2025-05-11 22:24:22+08:00
[CATEGORIES]
cs.LG
cs.CL
Convert Language Model into a Value-based Strategic Planner
[AUTHORS]
Xiaoyu Wang, Yue Zhao, Qingqing Gu, Zhonglin Jiang, Xiaokai Chen, Yong Chen, Luo Ji
[ABSTRACT]
Emotional support conversation (ESC) aims to alleviate the emotional distress
of individuals through effective conversations. Although large language models
(LLMs) have obtained remarkable progress on ESC, most of these studies might
not define the diagram from the state model perspective, therefore providing a
suboptimal solution for long-term satisfaction. To address such an issue, we
leverage the Q-learning on LLMs, and propose a framework called straQ. Our
framework allows a plug-and-play LLM to bootstrap the planning during ESC,
determine the optimal strategy based on long-term returns, and finally guide
the LLM to respond. Substantial experiments on ESC datasets suggest that
straQ outperforms many baselines, including direct inference, self-refine,
chain of thought, finetuning, and finite state machines.
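As background for choosing a strategy "based on long-term returns", the textbook tabular Q-learning update is Q(s, a) <- Q(s, a) + lr * (r + gamma * max_a' Q(s', a') - Q(s, a)). The sketch below implements only that rule with a dictionary; the state labels, strategy names, and hyperparameters are hypothetical, and the abstract does not describe how straQ represents Q-values inside an LLM.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, List, Tuple

QTable = DefaultDict[Tuple[Hashable, str], float]


def q_update(q: QTable, state: Hashable, action: str, reward: float,
             next_state: Hashable, actions: List[str],
             lr: float = 0.5, gamma: float = 0.9) -> float:
    """Apply one temporal-difference update and return the new Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += lr * (td_target - q[(state, action)])
    return q[(state, action)]


def best_action(q: QTable, state: Hashable, actions: List[str]) -> str:
    """Greedy choice: the strategy with the highest estimated long-term return."""
    return max(actions, key=lambda a: q[(state, a)])


# Hypothetical support strategies and a single observed transition.
strategies = ["question", "reflect"]
q: QTable = defaultdict(float)
new_value = q_update(q, "turn_0", "reflect", reward=1.0,
                     next_state="turn_1", actions=strategies)
```

After enough such updates, `best_action` selects the strategy with the highest estimated return, which would then be passed to the response generator.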
[COMMENTS]
11 pages, 5 figures, Accepted by ACL 2025 Industry Track
[LINK]
http://arxiv.org/abs/2505.06987v1
[DATE]
2025-05-11 22:13:58+08:00
[CATEGORIES]
cs.CL
Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain
[AUTHORS]
Shintaro Ozaki, Yuta Kato, Siyuan Feng, Masayo Tomita, Kazuki Hayashi, Wataru Hashimoto, Ryoma Obara, Masafumi Oyamada, Katsuhiko Hayashi, Hidetaka Kamigaito, Taro Watanabe
[ABSTRACT]
Retrieval Augmented Generation (RAG) complements the knowledge of Large
Language Models (LLMs) by leveraging external information to enhance response
accuracy for queries. This approach is widely applied in several fields because
it can inject the most up-to-date information, and
researchers are focusing on understanding and improving this aspect to unlock
the full potential of RAG in such high-stakes applications. However, despite
the potential of RAG to address these needs, the mechanisms behind the
confidence levels of its outputs remain underexplored, although the confidence
of information is very critical in some domains, such as finance, healthcare,
and medicine. Our study focuses on the impact of RAG on confidence within the
medical domain under various configurations and models. We evaluate confidence
by treating the model’s predicted probability as its output and calculating
Expected Calibration Error (ECE) and Adaptive Calibration Error (ACE) scores
based on the probabilities and accuracy. In addition, we analyze whether the
order of retrieved documents within prompts calibrates the confidence. Our
findings reveal large variation in confidence and accuracy depending on the
model, settings, and the format of input prompts. These results underscore the
necessity of optimizing configurations based on the specific model and
conditions.
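Expected Calibration Error is commonly computed by grouping predictions into confidence bins and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below follows that common definition with equal-width bins; the bin count and the left-closed binning convention are assumptions, and the abstract does not specify the exact variant used.

```python
from typing import List, Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|.

    Uses equal-width bins [k/n_bins, (k+1)/n_bins); a confidence of exactly
    1.0 is placed in the last bin.
    """
    n = len(confidences)
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Overconfident toy model: 90% confidence but only 75% accuracy.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                 [True, True, True, False])
```

Adaptive Calibration Error differs mainly in choosing bins that hold equal numbers of predictions instead of equal-width intervals.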
[COMMENTS]
Accepted to BioNLP2025 (Workshop colocated with ACL2025)
[LINK]
http://arxiv.org/abs/2412.20309v2
[DATE]
2025-05-11 18:24:33+08:00
[CATEGORIES]
cs.CL
BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
[AUTHORS]
Yuyang Liu, Liuzhenghao Lv, Xiancheng Zhang, Li Yuan, Yonghong Tian
[ABSTRACT]
Biological protocols are fundamental to reproducible and safe life science
research. While LLMs excel on general tasks, their systematic evaluation on
these highly specialized, accuracy-critical, and inherently procedural texts
remains limited. In this work, we present BioProBench, the first large-scale,
integrated multi-task benchmark for biological protocol understanding and
reasoning. While limited benchmarks have touched upon specific aspects like
protocol QA, BioProBench provides a comprehensive suite of five core tasks:
Protocol Question Answering, Step Ordering, Error Correction, Protocol
Generation, and Protocol Reasoning, enabling a holistic evaluation of LLMs on
procedural biological texts. Built upon 27K original protocols, it yields
nearly 556K high-quality structured instances. We evaluate 12 mainstream
open/closed-source LLMs on BioProBench. Experimental results reveal that while
top models perform well on surface understanding tasks, they struggle significantly
with deep reasoning and structured generation tasks like ordering and
generation. Furthermore, model comparisons reveal diverse performance: certain
open-source models approach closed-source levels on some tasks, yet
bio-specific small models lag behind general LLMs, indicating limitations on
complex procedural content. Overall, our findings underscore that procedural
reasoning within biological protocols represents a significant challenge for
current LLMs. BioProBench serves as a standardized framework to diagnose these
specific limitations and guide the development of AI systems better equipped
for safely automating complex scientific procedures. The code and data are
available at: https://github.com/YuyangSunshine/bioprotocolbench and
https://huggingface.co/datasets/GreatCaptainNemo/BioProBench.
[LINK]
http://arxiv.org/abs/2505.07889v1
[DATE]
2025-05-11 17:42:24+08:00
[CATEGORIES]
cs.CL
The Distracting Effect: Understanding Irrelevant Passages in RAG
[AUTHORS]
Chen Amiraz, Florin Cuconasu, Simone Filice, Zohar Karnin
[ABSTRACT]
A well-known issue with Retrieval Augmented Generation (RAG) is that
retrieved passages that are irrelevant to the query sometimes distract the
answer-generating LLM, causing it to provide an incorrect response. In this
paper, we shed light on this core issue and formulate the distracting effect of
a passage w.r.t. a query (and an LLM). We provide a quantifiable measure of the
distracting effect of a passage and demonstrate its robustness across LLMs.
Our research introduces novel methods for identifying and using hard
distracting passages to improve RAG systems. By fine-tuning LLMs with these
carefully selected distracting passages, we achieve up to a 7.5% increase in
answering accuracy compared to counterparts fine-tuned on conventional RAG
datasets. Our contribution is two-fold: first, we move beyond the simple binary
classification of irrelevant passages as either completely unrelated vs.
distracting, and second, we develop and analyze multiple methods for finding
hard distracting passages. To our knowledge, no other research has provided
such a comprehensive framework for identifying and utilizing hard distracting
passages.
[LINK]
http://arxiv.org/abs/2505.06914v1
[DATE]
2025-05-11 17:25:05+08:00
[CATEGORIES]
cs.CL
Unleashing the potential of prompt engineering for large language models
[AUTHORS]
Banghao Chen, Zhaofeng Zhang, Nicolas Langrené, Shengxin Zhu
[ABSTRACT]
This comprehensive review delves into the pivotal role of prompt engineering
in unleashing the capabilities of Large Language Models (LLMs). The development
of Artificial Intelligence (AI), from its inception in the 1950s to the
emergence of advanced neural networks and deep learning architectures, has made
a breakthrough in LLMs, with models such as GPT-4o and Claude-3, and in
Vision-Language Models (VLMs), with models such as CLIP and ALIGN. Prompt
engineering is the process of structuring inputs, which has emerged as a
crucial technique to maximize the utility and accuracy of these models. This
paper explores both foundational and advanced methodologies of prompt
engineering, including techniques such as self-consistency, chain-of-thought,
and generated knowledge, which significantly enhance model performance.
Additionally, it examines the prompt method of VLMs through innovative
approaches such as Context Optimization (CoOp), Conditional Context
Optimization (CoCoOp), and Multimodal Prompt Learning (MaPLe). Critical to this
discussion is the aspect of AI security, particularly adversarial attacks that
exploit vulnerabilities in prompt engineering. Strategies to mitigate these
risks and enhance model robustness are thoroughly reviewed. The evaluation of
prompt methods is also addressed through both subjective and objective metrics,
ensuring a robust analysis of their efficacy. This review also reflects the
essential role of prompt engineering in advancing AI capabilities, providing a
structured framework for future research and application.
[COMMENTS]
v6 - Metadata updated (title, journal ref, DOI). PDF identical to v5
(original submission). Please cite the peer-reviewed Version of Record in
“Patterns” (DOI: 10.1016/j.patter.2025.101260)
[LINK]
http://arxiv.org/abs/2310.14735v6
[DATE]
2025-05-11 17:23:41+08:00
[CATEGORIES]
cs.CL
Explanatory Summarization with Discourse-Driven Planning
[AUTHORS]
Dongqi Liu, Xi Yu, Vera Demberg, Mirella Lapata
[ABSTRACT]
Lay summaries for scientific documents typically include explanations to help
readers grasp sophisticated concepts or arguments. However, current automatic
summarization methods do not explicitly model explanations, which makes it
difficult to align the proportion of explanatory content with human-written
summaries. In this paper, we present a plan-based approach that leverages
discourse frameworks to organize summary generation and guide explanatory
sentences by prompting responses to the plan. Specifically, we propose two
discourse-driven planning strategies, where the plan is conditioned as part of
the input or part of the output prefix, respectively. Empirical experiments on
three lay summarization datasets show that our approach outperforms existing
state-of-the-art methods in terms of summary quality, and it enhances model
robustness and controllability while mitigating hallucination.
[COMMENTS]
Accepted by the Transactions of the Association for Computational
Linguistics (TACL 2025)
[LINK]
http://arxiv.org/abs/2504.19339v2
[DATE]
2025-05-11 17:00:44+08:00
[CATEGORIES]
cs.CL
cs.LG
EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation
[AUTHORS]
Xinyi Mou, Chen Qian, Wei Liu, Xuanjing Huang, Zhongyu Wei
[ABSTRACT]
Large language models (LLMs) have demonstrated an impressive ability to
role-play humans and replicate complex social dynamics. While large-scale
social simulations are gaining increasing attention, they still face
significant challenges, particularly regarding high time and computation costs.
Existing solutions, such as distributed mechanisms or hybrid agent-based model
(ABM) integrations, either fail to address inference costs or compromise
accuracy and generalizability. To this end, we propose EcoLANG: Efficient and
Effective Agent Communication Language Induction for Social Simulation. EcoLANG
operates in two stages: (1) language evolution, where we filter synonymous
words and optimize sentence-level rules through natural selection, and (2)
language utilization, where agents in social simulations communicate using the
evolved language. Experimental results demonstrate that EcoLANG reduces token
consumption by over 20%, enhancing efficiency without sacrificing simulation
accuracy.
[LINK]
http://arxiv.org/abs/2505.06904v1
[DATE]
2025-05-11 16:51:56+08:00
[CATEGORIES]
cs.CL
Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration
[AUTHORS]
Honglong Yang, Shanshan Song, Yi Qin, Lehan Wang, Haonan Wang, Xinpeng Ding, Qixiang Zhang, Bodong Du, Xiaomeng Li
[ABSTRACT]
Generalist Medical AI (GMAI) systems have demonstrated expert-level
performance in biomedical perception tasks, yet their clinical utility remains
limited by inadequate multi-modal explainability and suboptimal prognostic
capabilities. Here, we present XMedGPT, a clinician-centric, multi-modal AI
assistant that integrates textual and visual interpretability to support
transparent and trustworthy medical decision-making. XMedGPT not only produces
accurate diagnostic and descriptive outputs, but also grounds referenced
anatomical sites within medical images, bridging critical gaps in
interpretability and enhancing clinician usability. To support real-world
deployment, we introduce a reliability indexing mechanism that quantifies
uncertainty through consistency-based assessment via interactive
question-answering. We validate XMedGPT across four pillars: multi-modal
interpretability, uncertainty quantification, prognostic modeling, and
rigorous benchmarking. The model achieves an IoU of 0.703 across 141 anatomical
regions, and a Kendall’s tau-b of 0.479, demonstrating strong alignment between
visual rationales and clinical outcomes. For uncertainty estimation, it attains
an AUC of 0.862 on visual question answering and 0.764 on radiology report
generation. In survival and recurrence prediction for lung and glioma cancers,
it surpasses prior leading models by 26.9%, and outperforms GPT-4o by 25.0%.
Rigorous benchmarking across 347 datasets covers 40 imaging modalities, and
external validation spans 4 anatomical systems, confirming exceptional
generalizability, with performance gains surpassing existing GMAI by 20.7% for
in-domain evaluation and 16.7% on 11,530 in-house data evaluation. Together,
XMedGPT represents a significant leap forward in clinician-centric AI
integration, offering trustworthy and scalable support for diverse healthcare
applications.
[LINK]
http://arxiv.org/abs/2505.06898v1
[DATE]
2025-05-11 16:32:01+08:00
[CATEGORIES]
cs.CL
IM-BERT: Enhancing Robustness of BERT through the Implicit Euler Method
[AUTHORS]
Mihyeon Kim, Juhyoung Park, Youngbin Kim
[ABSTRACT]
Pre-trained Language Models (PLMs) have achieved remarkable performance on
diverse NLP tasks through pre-training and fine-tuning. However, fine-tuning
the model with a large number of parameters on limited downstream datasets
often leads to vulnerability to adversarial attacks, causing overfitting of the
model on standard datasets.
To address these issues, we propose IM-BERT from the perspective of a dynamic
system by conceptualizing a layer of BERT as a solution of Ordinary
Differential Equations (ODEs). Under the situation of initial value
perturbation, we analyze the numerical stability of two main numerical ODE
solvers: the explicit and implicit Euler approaches.
Based on these analyses, we introduce a numerically robust IM-connection
incorporating BERT’s layers. This strategy enhances the robustness of PLMs
against adversarial attacks, even in low-resource scenarios, without
introducing additional parameters or adversarial training strategies.
Experimental results on the adversarial GLUE (AdvGLUE) dataset validate the
robustness of IM-BERT under various conditions. Compared to the original BERT,
IM-BERT exhibits a performance improvement of approximately 8.3%p on the
AdvGLUE dataset. Furthermore, in low-resource scenarios, IM-BERT outperforms
BERT by achieving 5.9%p higher accuracy.
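The stability contrast between the two solvers can be seen on the scalar test equation y' = -lam * y, a standard textbook example. The explicit step multiplies y by (1 - h * lam), which blows up once h * lam exceeds 2, while the implicit step divides by (1 + h * lam) and always decays. The sketch below is only this toy illustration; it is not the IM-connection used in IM-BERT, and the step size and rate are arbitrary choices.

```python
def explicit_euler(y0: float, lam: float, h: float, steps: int) -> float:
    """Explicit (forward) Euler for y' = -lam * y: y_next = y + h * f(y)."""
    y = y0
    for _ in range(steps):
        y = y + h * (-lam * y)
    return y


def implicit_euler(y0: float, lam: float, h: float, steps: int) -> float:
    """Implicit (backward) Euler for y' = -lam * y.

    Solving y_next = y + h * (-lam * y_next) gives y_next = y / (1 + h * lam).
    """
    y = y0
    for _ in range(steps):
        y = y / (1.0 + h * lam)
    return y


# With h * lam = 3 the explicit iterate grows in magnitude,
# while the implicit iterate decays toward the true solution's limit of 0.
diverging = explicit_euler(1.0, lam=10.0, h=0.3, steps=10)
decaying = implicit_euler(1.0, lam=10.0, h=0.3, steps=10)
```

The same reasoning applies to a perturbation of the initial value: the implicit update damps it for any step size on this test equation, while the explicit update can amplify it.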
[COMMENTS]
Accepted to EMNLP 2024 Main
[LINK]
http://arxiv.org/abs/2505.06889v1
[DATE]
2025-05-11 15:54:33+08:00
[CATEGORIES]
cs.CL
A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting
[AUTHORS]
Lhuqita Fazry
[ABSTRACT]
The $\texttt{BIGBIRD-PEGASUS}$ model achieves $\textit{state-of-the-art}$ results on
abstractive text summarization for long documents. However, its capacity is still
limited to a maximum of $4,096$ tokens, which causes performance degradation on
summarization of very long documents. A common way to deal with this issue is
to truncate the documents. In this research, we use a different approach:
we fine-tune the pretrained $\texttt{BIGBIRD-PEGASUS}$ model on a dataset from
another domain. First, we filter out all documents whose length is
less than $20,000$ tokens to focus on very long documents. To prevent domain
shift and overfitting during transfer learning due to the small dataset, we
augment the dataset by splitting each document-summary training pair into parts
so that each document part fits into $4,096$ tokens. Source code is available at
https://github.com/lhfazry/SPIN-summ.
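The splitting step can be pictured as cutting a token sequence into consecutive windows that each fit the model's input limit. The sketch below shows only that windowing plus the length filter, using hypothetical helper names and a plain list of tokens; how each document part is paired with a part of the summary is not described in the abstract and is left out.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def is_very_long(tokens: Sequence[T], min_tokens: int = 20_000) -> bool:
    """Keep only documents with at least min_tokens tokens."""
    return len(tokens) >= min_tokens


def split_into_chunks(tokens: Sequence[T], max_tokens: int = 4096) -> List[Sequence[T]]:
    """Cut tokens into consecutive chunks of at most max_tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# A stand-in "document" of 25,000 integer token ids.
document = list(range(25_000))
chunks = split_into_chunks(document) if is_very_long(document) else [document]
```

Every chunk except possibly the last has exactly `max_tokens` tokens, and concatenating the chunks reproduces the original sequence.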
[LINK]
http://arxiv.org/abs/2505.06862v1
[DATE]
2025-05-11 14:14:39+08:00
[CATEGORIES]
cs.CL
Implementing Long Text Style Transfer with LLMs through Dual-Layered Sentence and Paragraph Structure Extraction and Mapping
[AUTHORS]
Yusen Wu, Xiaotie Deng
[ABSTRACT]
This paper addresses the challenge in long-text style transfer using
zero-shot learning of large language models (LLMs), proposing a hierarchical
framework that combines sentence-level stylistic adaptation with
paragraph-level structural coherence. We argue that in the process of effective
paragraph-style transfer, to preserve the consistency of original syntactic and
semantic information, it is essential to perform style transfer not only at the
sentence level but also to incorporate paragraph-level semantic considerations,
while ensuring structural coherence across inter-sentential relationships. Our
proposed framework, ZeroStylus, operates through two systematic phases:
hierarchical template acquisition from reference texts and template-guided
generation with multi-granular matching. The framework dynamically constructs
sentence and paragraph template repositories, enabling context-aware
transformations while preserving inter-sentence logical relationships.
Experimental evaluations demonstrate significant improvements over baseline
methods, with structured rewriting achieving an average score of 6.90 compared
to 6.70 for direct prompting approaches on tri-axial metrics assessing style
consistency, content preservation, and expression quality. Ablation studies
validate the necessity of both template hierarchies during style transfer,
showing higher content preservation win rate against sentence-only approaches
through paragraph-level structural encoding, as well as direct prompting method
through sentence-level pattern extraction and matching. The results establish
new capabilities for coherent long-text style transfer without requiring
parallel corpora or LLM fine-tuning.
[LINK]
http://arxiv.org/abs/2505.07888v1
[DATE]
2025-05-11 13:53:33+08:00
[CATEGORIES]
cs.CL
Development of a WAZOBIA-Named Entity Recognition System
[AUTHORS]
S. E Emedem, I. E Onyenwe, E. G Onyedinma
[ABSTRACT]
Named Entity Recognition (NER) is crucial for various natural language
processing applications, including information extraction, machine translation,
and sentiment analysis. Despite the ever-increasing interest in African
languages within computational linguistics, existing NER systems focus mainly
on English, European, and a few other global languages, leaving a significant
gap for under-resourced languages. This research presents the development of a
WAZOBIA-NER system tailored for the three most prominent Nigerian languages:
Hausa, Yoruba, and Igbo. This research begins with a comprehensive compilation
of annotated datasets for each language, addressing data scarcity and
linguistic diversity challenges. Using Conditional Random Fields (CRF), a
state-of-the-art machine learning technique, and deep learning models such as
Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Encoder
Representations from Transformers (BERT) fine-tuned with a Recurrent Neural
Network (RNN), the study evaluates the effectiveness of these approaches in
recognizing three entities: persons, organizations, and locations. The system
utilizes optical character recognition (OCR) technology to convert textual
images into machine-readable text, thereby enabling the Wazobia system to
accept both input text and textual images for extraction purposes. The system
achieved a performance of 0.9511 in precision, 0.9400 in recall, 0.9564 in
F1-score, and 0.9301 in accuracy. The model’s evaluation was conducted across
three languages, with precision, recall, F1-score, and accuracy as key
assessment metrics. The Wazobia-NER system demonstrates that it is feasible to
build robust NER tools for under-resourced African languages using current NLP
frameworks and transfer learning.
[COMMENTS]
6 pages, 3 figures, 1 table
[LINK]
http://arxiv.org/abs/2505.07884v1
[DATE]
2025-05-11 06:59:24+08:00
[CATEGORIES]
cs.CL
cs.LG
Calibrating Translation Decoding with Quality Estimation on LLMs
[AUTHORS]
Di Wu, Yibin Lei, Christof Monz
[ABSTRACT]
Neural machine translation (NMT) systems typically employ maximum a
posteriori (MAP) decoding to select the highest-scoring translation from the
distribution mass. However, recent evidence highlights the inadequacy of MAP
decoding, often resulting in low-quality or even pathological hypotheses – the
decoding objective is not aligned with real-world translation quality. This
paper proposes calibrating hypothesis likelihoods with translation quality from
a distribution view by directly optimizing their Pearson correlation – thereby
enhancing the effectiveness of translation decoding. With our method,
translation on large language models (LLMs) improves substantially after
limited training (2K instances per direction). This improvement is orthogonal
to those achieved through supervised fine-tuning, leading to substantial gains
across a broad range of metrics and human evaluations – even when applied to
top-performing translation-specialized LLMs fine-tuned on high-quality
translation data, such as Tower, or when compared to recent preference
optimization methods, like CPO. Moreover, the calibrated translation likelihood
can directly serve as a strong proxy for translation quality, closely
approximating or even surpassing some state-of-the-art translation quality
estimation models, like CometKiwi. Lastly, our in-depth analysis demonstrates
that calibration enhances the effectiveness of MAP decoding, thereby enabling
greater efficiency in real-world deployment. The resulting state-of-the-art
translation model, which covers 10 languages, along with the accompanying code
and human evaluation data, has been released to the community:
https://github.com/moore3930/calibrating-llm-mt.
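The quantity at the center of this abstract, the Pearson correlation between hypothesis likelihoods and translation quality, can be computed as in the sketch below. It only illustrates the statistic itself, not the authors' training objective; the function name `pearson` and all numeric values are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers for four sampled translations of one source sentence.
log_likelihoods = [-4.0, -2.5, -3.0, -1.0]  # model scores for each hypothesis
quality_scores = [0.60, 0.70, 0.80, 0.90]   # e.g. from a quality-estimation metric

r = pearson(log_likelihoods, quality_scores)
print(f"correlation = {r:.3f}")
```

A correlation close to 1 would mean that ranking hypotheses by likelihood agrees with ranking them by quality, which is the property the calibration aims to strengthen.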
[LINK]
http://arxiv.org/abs/2504.19044v2
[DATE]
2025-05-11 05:53:02+08:00
[CATEGORIES]
cs.CL
MABR: Multilayer Adversarial Bias Removal Without Prior Bias Knowledge
[AUTHORS]
Maxwell J. Yin, Boyu Wang, Charles Ling
[ABSTRACT]
Models trained on real-world data often mirror and exacerbate existing social
biases. Traditional methods for mitigating these biases typically require prior
knowledge of the specific biases to be addressed, such as gender or racial
biases, and the social groups associated with each instance. In this paper, we
introduce a novel adversarial training strategy that operates independently of
prior bias-type knowledge and protected attribute labels. Our approach
proactively identifies biases during model training by utilizing auxiliary
models, which are trained concurrently by predicting the performance of the
main model without relying on task labels. Additionally, we implement these
auxiliary models at various levels of the feature maps of the main model,
enabling the detection of a broader and more nuanced range of bias features.
Through experiments on racial and gender biases in sentiment and occupation
classification tasks, our method effectively reduces social biases without the
need for demographic annotations. Moreover, our approach not only matches but
often surpasses the efficacy of methods that require detailed demographic
insights, marking a significant advancement in bias mitigation techniques.
[LINK]
http://arxiv.org/abs/2408.05497v3
[DATE]
2025-05-11 03:55:51+08:00
[CATEGORIES]
cs.CL
Endless Jailbreaks with Bijection Learning
[AUTHORS]
Brian R. Y. Huang, Maximilian Li, Leonard Tang
[ABSTRACT]
Despite extensive safety measures, LLMs are vulnerable to adversarial inputs,
or jailbreaks, which can elicit unsafe behaviors. In this work, we introduce
bijection learning, a powerful attack algorithm which automatically fuzzes LLMs
for safety vulnerabilities using randomly-generated encodings whose complexity
can be tightly controlled. We leverage in-context learning to teach models
bijective encodings, pass encoded queries to the model to bypass built-in
safety mechanisms, and finally decode responses back into English. Our attack
is extremely effective on a wide range of frontier language models. Moreover,
by controlling complexity parameters such as number of key-value mappings in
the encodings, we find a close relationship between the capability level of the
attacked LLM and the average complexity of the most effective bijection
attacks. Our work highlights that new vulnerabilities in frontier models can
emerge with scale: more capable models are more severely jailbroken by
bijection attacks.
[LINK]
http://arxiv.org/abs/2410.01294v3
[DATE]
2025-05-11 03:38:13+08:00
[CATEGORIES]
cs.CL
Fleet of Agents: Coordinated Problem Solving with Large Language Models
[AUTHORS]
Lars Klein, Nearchos Potamitis, Roland Aydin, Robert West, Caglar Gulcehre, Akhil Arora
[ABSTRACT]
While numerous frameworks have been developed to enhance the reasoning
abilities of large language models (LLMs), there is a scarcity of methods that
effectively balance the trade-off between cost and quality. In this paper, we
introduce Fleet of Agents (FoA), a novel and intuitive yet principled framework
utilizing LLMs as agents to navigate through dynamic tree searches, employing a
genetic-type particle filtering approach. FoA spawns a multitude of agents,
each exploring the search space autonomously, followed by a selection phase
where resampling based on a heuristic value function optimizes the balance
between exploration and exploitation. This mechanism enables dynamic branching,
adapting the exploration strategy based on discovered solutions. We conduct
extensive experiments on three benchmark tasks, “Game of 24”,
“Mini-Crosswords”, and “WebShop”, utilizing four different LLMs,
“GPT-3.5”, “GPT-4”, “LLaMA3.2-11B”, and “LLaMA3.2-90B”. On average
across all tasks and LLMs, FoA obtains a quality improvement of ~5% while
requiring only ~40% of the cost of previous SOTA methods. Notably, our analyses
reveal that (1) FoA achieves the best cost-quality trade-off among all
benchmarked methods and (2) FoA + LLaMA3.2-11B surpasses the LLaMA3.2-90B
model. FoA is publicly available at https://github.com/au-clan/FoA.
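The selection phase described above, resampling agents according to a heuristic value function, resembles the resampling step of a particle filter. The sketch below shows that generic step only; it is not taken from the FoA code base, and the function name `resample`, the state labels, and the heuristic values are hypothetical.

```python
import random


def resample(states, values, rng):
    """Draw a new fleet of the same size, with replacement, weighted by value."""
    return rng.choices(states, weights=values, k=len(states))


rng = random.Random(0)
fleet = ["state_a", "state_b", "state_c", "state_d"]
heuristic = [0.0, 1.0, 3.0, 0.0]  # zero-value states are never selected

new_fleet = resample(fleet, heuristic, rng)
print(new_fleet)
```

Agents in promising states are duplicated and agents in poor states are dropped, so the fleet size (and hence the cost per step) stays fixed while effort shifts toward better branches.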
[COMMENTS]
ICML 2025; 28 pages, 68 figures, 8 tables
[LINK]
http://arxiv.org/abs/2405.06691v3
[DATE]
2025-05-11 03:36:43+08:00
[CATEGORIES]
cs.CL
cs.LG
Recovering Event Probabilities from Large Language Model Embeddings via Axiomatic Constraints
[AUTHORS]
Jian-Qiao Zhu, Haijiang Yan, Thomas L. Griffiths
[ABSTRACT]
Rational decision-making under uncertainty requires coherent degrees of
belief in events. However, event probabilities generated by Large Language
Models (LLMs) have been shown to exhibit incoherence, violating the axioms of
probability theory. This raises the question of whether coherent event
probabilities can be recovered from the embeddings used by the models. If so,
those derived probabilities could be used as more accurate estimates in events
involving uncertainty. To explore this question, we propose enforcing axiomatic
constraints, such as the additive rule of probability theory, in the latent
space learned by an extended variational autoencoder (VAE) applied to LLM
embeddings. This approach enables event probabilities to naturally emerge in
the latent space as the VAE learns to both reconstruct the original embeddings
and predict the embeddings of semantically related events. We evaluate our
method on complementary events (i.e., event A and its complement, event not-A),
where the true probabilities of the two events must sum to 1. Experimental
results on open-weight language models demonstrate that probabilities recovered
from embeddings exhibit greater coherence than those directly reported by the
corresponding models and align closely with the true probabilities.
[LINK]
http://arxiv.org/abs/2505.07883v1
[DATE]
2025-05-11 03:04:56+08:00
[CATEGORIES]
cs.CL
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
[AUTHORS]
Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin
[ABSTRACT]
Gating mechanisms have been widely utilized, from early models like LSTMs and
Highway Networks to recent state space models, linear attention, and also
softmax attention. Yet, existing literature rarely examines the specific
effects of gating. In this work, we conduct comprehensive experiments to
systematically investigate gating-augmented softmax attention variants.
Specifically, we perform a comprehensive comparison over 30 variants of 15B
Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion
token dataset. Our central finding is that a simple modification, applying a
head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA),
consistently improves performance. This modification also enhances
training stability, tolerates larger learning rates, and improves scaling
properties. By comparing various gating positions and computational variants,
we attribute this effectiveness to two key factors: (1) introducing
non-linearity upon the low-rank mapping in the softmax attention, and (2)
applying query-dependent sparse gating scores to modulate the SDPA output.
Notably, we find this sparse gating mechanism mitigates ‘attention sink’ and
enhances long-context extrapolation performance. We also release related code
(https://github.com/qiuzh20/gated_attention) and models
(https://huggingface.co/QwQZh/gated_attention) to facilitate future research.
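The modification described above, a sigmoid gate applied to the SDPA output, can be illustrated for a single head in plain Python. This is a sketch under assumptions rather than the released implementation: it assumes the gate is an elementwise sigmoid of a linear projection of each token's input, omits causal masking, and uses hypothetical names (`gated_head`, `w_gate`) and toy weights.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, mat):
    """Multiply a row vector (length d_in) by a d_in x d_out matrix."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]


def sdpa(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))])
    return out


def gated_head(x, q, k, v, w_gate):
    """Scale each token's SDPA output by a sigmoid gate computed from its input x."""
    attn = sdpa(q, k, v)
    gated = []
    for x_t, a_t in zip(x, attn):
        gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(x_t, w_gate)]
        gated.append([g * a for g, a in zip(gate, a_t)])
    return gated


x = [[1.0, 0.0], [0.0, 1.0]]             # two tokens, model width 2
w_gate = [[-10.0, -10.0], [10.0, 10.0]]  # hypothetical gate weights
out = gated_head(x, x, x, x, w_gate)
print(out)
```

With these toy weights the gate for the first token is close to 0 and the gate for the second is close to 1, so the first token's attention output is almost entirely suppressed. This is the kind of query-dependent sparsity the abstract refers to.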
[LINK]
http://arxiv.org/abs/2505.06708v1
[DATE]
2025-05-11 01:15:49+08:00
[CATEGORIES]
cs.CL
From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback
[AUTHORS]
Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang
[ABSTRACT]
Automatic evaluation benchmarks such as MT-Bench, Arena-Hard, and Auto-Arena
are seeing growing adoption for the evaluation of Large Language Models (LLMs).
Existing research has primarily focused on approximating human-based model
rankings using limited data and LLM-as-a-Judge. However, the fundamental
premise of these studies, which attempts to replicate human rankings, is
flawed. Specifically, these benchmarks typically offer only overall scores,
limiting their utility to leaderboard rankings, rather than providing feedback
that can guide model optimization and support model profiling. Therefore, we
advocate for an evaluation paradigm shift from approximating human-based model
rankings to providing feedback with analytical value. To this end, we introduce
Feedbacker, an evaluation framework that provides comprehensive and
fine-grained results, thereby enabling thorough identification of a model’s
specific strengths and weaknesses. Such feedback not only supports the targeted
optimization of the model but also enhances the understanding of its behavior.
Feedbacker comprises three key components: an extensible tree-based query
taxonomy builder, an automated query synthesis scheme, and a suite of
visualization and analysis tools. Furthermore, we propose a novel
LLM-as-a-Judge method: PC2 (Pre-Comparison-derived Criteria) pointwise
evaluation. This method derives evaluation criteria by pre-comparing the
differences between several auxiliary responses, achieving the accuracy of
pairwise evaluation while maintaining the time complexity of pointwise
evaluation. Finally, leveraging the evaluation results of 17 mainstream LLMs,
we demonstrate the usage of Feedbacker and highlight its effectiveness and
potential. Our project homepage is available at
https://liudan193.github.io/Feedbacker.
[LINK]
http://arxiv.org/abs/2505.06698v1
[DATE]
2025-05-11 00:52:40+08:00
[CATEGORIES]
cs.CL
Enhancing BERTopic with Intermediate Layer Representations
[AUTHORS]
Dominik Koterwa, Maciej Świtała
[ABSTRACT]
BERTopic is a topic modeling algorithm that leverages transformer-based
embeddings to create dense clusters, enabling the estimation of topic
structures and the extraction of valuable insights from a corpus of documents.
This approach allows users to efficiently process large-scale text data and
gain meaningful insights into its structure. While BERTopic is a powerful tool,
embedding preparation can vary, including extracting representations from
intermediate model layers and applying transformations to these embeddings. In
this study, we evaluate 18 different embedding representations and present
findings based on experiments conducted on three diverse datasets. To assess
the algorithm’s performance, we report topic coherence and topic diversity
metrics across all experiments. Our results demonstrate that, for each dataset,
it is possible to find an embedding configuration that performs better than the
default setting of BERTopic. Additionally, we investigate the influence of stop
words on different embedding configurations.
[COMMENTS]
Repository with code for reproduction:
https://github.com/dkoterwa/optimizing_bertopic
[LINK]
http://arxiv.org/abs/2505.06696v1
[DATE]
2025-05-11 00:47:08+08:00
[CATEGORIES]
cs.CL
Enhancing stroke disease classification through machine learning models by feature selection techniques
[AUTHORS]
Mahade Hasan, Farhana Yasmin, Xue Yu
[ABSTRACT]
Heart disease remains a leading cause of mortality and morbidity worldwide,
necessitating the development of accurate and reliable predictive models to
facilitate early detection and intervention. While state-of-the-art work has
focused on various machine learning approaches for predicting heart disease,
these approaches have not achieved remarkable accuracy. In response to this
need, we applied nine machine learning algorithms, namely XGBoost, logistic
regression, decision tree, random forest, k-nearest neighbors (KNN), support
vector machine (SVM), Gaussian naive Bayes (NB Gaussian), adaptive boosting,
and linear regression, to predict heart disease based on a range of physiological
indicators. Our approach involved feature selection techniques to identify the
most relevant predictors, aimed at refining the models to enhance both
performance and interpretability. The models were trained, incorporating
processes such as grid search hyperparameter tuning, and cross-validation to
minimize overfitting. Additionally, we have developed a novel voting system
with feature selection techniques to advance heart disease classification.
Furthermore, we have evaluated the models using key performance metrics
including accuracy, precision, recall, F1-score, and the area under the
receiver operating characteristic curve (ROC AUC). Among the models, XGBoost
demonstrated exceptional performance, achieving 99% accuracy, precision, and
F1-score, 98% recall, and 100% ROC AUC. This study offers a promising approach
to early heart disease diagnosis and preventive healthcare.
[LINK]
http://arxiv.org/abs/2504.00485v2
[DATE]
2025-05-11 23:57:02+08:00
[CATEGORIES]
cs.LG
Efficient Fault Detection in WSN Based on PCA-Optimized Deep Neural Network Slicing Trained with GOA
[AUTHORS]
Mahmood Mohassel Feghhi, Raya Majid Alsharfa, Majid Hameed Majeed
[ABSTRACT]
Fault detection in Wireless Sensor Networks (WSNs) is crucial for reliable
data transmission and network longevity. Traditional fault detection methods
often struggle with optimizing deep neural networks (DNNs) for efficient
performance, especially in handling high-dimensional data and capturing
nonlinear relationships. Additionally, these methods typically suffer from slow
convergence and difficulty in finding optimal network architectures using
gradient-based optimization. This study proposes a novel hybrid method
combining Principal Component Analysis (PCA) with a DNN optimized by the
Grasshopper Optimization Algorithm (GOA) to address these limitations. Our
approach begins by computing eigenvalues from the original 12-dimensional
dataset and sorting them in descending order. The cumulative sum of these
values is calculated, retaining principal components until 99.5% variance is
achieved, effectively reducing dimensionality to 4 features while preserving
critical information. This compressed representation trains a six-layer DNN
where GOA optimizes the network architecture, overcoming backpropagation’s
limitations in discovering nonlinear relationships. This hybrid PCA-GOA-DNN
framework compresses the data and trains a six-layer DNN that is optimized by
GOA, enhancing both training efficiency and fault detection accuracy. The
dataset used in this study is a real-world WSNs dataset developed by the
University of North Carolina, which was used to evaluate the proposed method’s
performance. Extensive simulations demonstrate that our approach achieves a
remarkable 99.72% classification accuracy, with exceptional precision and
recall, outperforming conventional methods. The method is computationally
efficient, making it suitable for large-scale WSN deployments, and represents a
significant advancement in fault detection for resource-constrained WSNs.
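The component-selection rule described above (sort the eigenvalues in descending order, accumulate them, and stop once 99.5% of the variance is covered) can be sketched as follows. The function name `components_to_keep` and the eigenvalues are hypothetical; they are chosen only so that a 12-dimensional input reduces to 4 components, as in the abstract.

```python
def components_to_keep(eigenvalues, target=0.995):
    """Return the number of leading components needed to reach `target` variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return count
    return len(ordered)


# Hypothetical eigenvalues for a 12-dimensional dataset (not from the paper).
eigs = [0.01, 5.0, 0.002, 3.0, 0.003, 1.5, 0.001, 0.4, 0.004, 0.005, 0.002, 0.003]
print(components_to_keep(eigs))
```

With these made-up values the four largest eigenvalues account for about 99.7% of the total, so the function returns 4.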
[COMMENTS]
22 pages, 18 figures, Accepted for publication in International
Journal of Intelligent Engineering and Systems, May 2025
[LINK]
http://arxiv.org/abs/2505.07030v1
[DATE]
2025-05-11 23:51:56+08:00
[CATEGORIES]
cs.LG
Efficient Machine Unlearning by Model Splitting and Core Sample Selection
[AUTHORS]
Maximilian Egger, Rawad Bitar, Rüdiger Urbanke
[LINK]
http://arxiv.org/abs/2505.07026v1
[DATE]
2025-05-11 23:42:11+08:00
[CATEGORIES]
cs.LG
Incremental Uncertainty-aware Performance Monitoring with Active Labeling Intervention
[AUTHORS]
Alexander Koebler, Thomas Decker, Ingo Thon, Volker Tresp, Florian Buettner
[ABSTRACT]
We study the problem of monitoring machine learning models under gradual
distribution shifts, where circumstances change slowly over time, often leading
to unnoticed yet significant declines in accuracy. To address this, we propose
Incremental Uncertainty-aware Performance Monitoring (IUPM), a novel label-free
method that estimates performance changes by modeling gradual shifts using
optimal transport. In addition, IUPM quantifies the uncertainty in the
performance prediction and introduces an active labeling procedure to restore a
reliable estimate under a limited labeling budget. Our experiments show that
IUPM outperforms existing performance estimation baselines in various gradual
shift scenarios and that its uncertainty awareness guides label acquisition
more effectively compared to other strategies.
[LINK]
http://arxiv.org/abs/2505.07023v1
[DATE]
2025-05-11 23:35:55+08:00
[CATEGORIES]
cs.LG
Marginalization Consistent Probabilistic Forecasting of Irregular Time Series via Mixture of Separable flows
[AUTHORS]
Vijaya Krishna Yalavarthi, Randolf Scholz, Christian Kloetergens, Kiran Madhusudhanan, Stefan Born, Lars Schmidt-Thieme
[ABSTRACT]
Probabilistic forecasting models for joint distributions of targets in
irregular time series with missing values are a heavily under-researched area
in machine learning: to the best of our knowledge, only two models have
been researched so far, the Gaussian Process Regression model and ProFITi.
While ProFITi, thanks to using multivariate normalizing flows, is very
expressive, leading to better predictive performance, it suffers from
marginalization inconsistency: It does not guarantee that the marginal
distributions of a subset of variables in its predictive distributions coincide
with the directly predicted distributions of these variables. When asked to
directly predict marginal distributions, they are often vastly inaccurate. We
propose MOSES (Marginalization Consistent Mixture of Separable Flows), a model
that parametrizes a stochastic process through a mixture of several latent
multivariate Gaussian Processes combined with separable univariate Normalizing
Flows. In particular, MOSES can be analytically marginalized, allowing it to
directly answer a wider range of probabilistic queries than most competitors.
Experiments on four datasets show that MOSES achieves both accurate joint and
marginal predictions, surpassing all other marginalization consistent
baselines. It trails only slightly behind ProFITi in joint prediction and is
vastly superior when predicting marginal distributions.
[LINK]
http://arxiv.org/abs/2406.07246v2
[DATE]
2025-05-11 23:30:43+08:00
[CATEGORIES]
cs.LG
Diffusion Approximations for Thompson Sampling
[AUTHORS]
Lin Fan, Peter W. Glynn
[ABSTRACT]
We study the behavior of Thompson sampling from the perspective of weak
convergence. In the regime with small $\gamma > 0$, where the gaps between arm
means scale as $\sqrt{\gamma}$ and over time horizons that scale as $1/\gamma$,
we show that the dynamics of Thompson sampling evolve according to discrete
versions of SDEs and stochastic ODEs. As $\gamma \downarrow 0$, we show that
the dynamics converge weakly to solutions of the corresponding SDEs and
stochastic ODEs. Our weak convergence theory is developed from first
principles using the Continuous Mapping Theorem, and can be easily adapted to
analyze other sampling-based bandit algorithms. In this regime, we also show
that the weak limits of the dynamics of many sampling-based algorithms –
including Thompson sampling designed for single-parameter exponential family
rewards, and algorithms using bootstrap-based sampling to balance exploration
and exploitation – coincide with those of Gaussian Thompson sampling.
Moreover, in this regime, these algorithms are generally robust to model
mis-specification.
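The scaling regime studied above, with arm gaps of order $\sqrt{\gamma}$ and horizons of order $1/\gamma$, can be simulated for Gaussian Thompson sampling with a few lines of code. The sketch below is an illustration only: the two-armed setup, the standard normal prior, the unit-variance rewards, and the function name `gaussian_thompson` are assumptions made here, not details from the paper.

```python
import math
import random


def gaussian_thompson(gamma, rng):
    """Run two-armed Gaussian Thompson sampling; return the fraction of pulls of the better arm."""
    means = [0.0, math.sqrt(gamma)]    # arm gap scales as sqrt(gamma)
    horizon = int(round(1.0 / gamma))  # time horizon scales as 1/gamma
    counts = [0, 0]
    sums = [0.0, 0.0]
    for _ in range(horizon):
        # Posterior of each arm mean is N(sum/(n+1), 1/(n+1)) under a
        # standard normal prior and unit-variance Gaussian rewards.
        draws = [
            rng.gauss(sums[a] / (counts[a] + 1), 1.0 / math.sqrt(counts[a] + 1))
            for a in range(2)
        ]
        arm = 0 if draws[0] > draws[1] else 1
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts[1] / horizon


frac = gaussian_thompson(0.001, random.Random(0))
print(f"fraction of pulls on the better arm: {frac:.3f}")
```

Because the gap shrinks as the horizon grows, the algorithm never separates the arms with certainty in this regime, and the fraction of pulls on the better arm remains random across runs, which is why a diffusion limit is a natural description.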
[LINK]
http://arxiv.org/abs/2105.09232v4
[DATE]
2025-05-11 23:18:07+08:00
[CATEGORIES]
cs.LG
Adaptive Width Neural Networks
[AUTHORS]
Federico Errica, Henrik Christiansen, Viktor Zaverkin, Mathias Niepert, Francesco Alesiani
[ABSTRACT]
For almost 70 years, researchers have mostly relied on hyper-parameter tuning
to select the width of neural networks’ layers. This paper challenges the
status quo by introducing an easy-to-use technique to learn an unbounded width
of a neural network’s layer during training. The technique relies on neither
alternate optimization nor hand-crafted gradient heuristics; rather, it jointly
optimizes the width and the parameters of each layer via simple
backpropagation. We apply the technique to a broad range of data domains such
as tables, images, text, sequences, and graphs, showing how the width adapts to
the task’s difficulty. The method imposes a soft ordering of importance among
neurons, by which it also is possible to truncate the trained network at
virtually zero cost, achieving a smooth trade-off between performance and
compute resources in a structured way. Alternatively, one can dynamically
compress the network with no performance degradation. In light of recent
foundation models trained on large datasets, believed to require billions of
parameters and where hyper-parameter tuning is unfeasible due to humongous
training costs, our approach stands as a viable alternative for width learning.
[LINK]
http://arxiv.org/abs/2501.15889v3
[DATE]
2025-05-11 23:14:57+08:00
[CATEGORIES]
cs.LG
Adaptive Message Passing: A General Framework to Mitigate Oversmoothing, Oversquashing, and Underreaching
[AUTHORS]
Federico Errica, Henrik Christiansen, Viktor Zaverkin, Takashi Maruyama, Mathias Niepert, Francesco Alesiani
[ABSTRACT]
Long-range interactions are essential for the correct description of complex
systems in many scientific fields. The price to pay for including them in the
calculations, however, is a dramatic increase in the overall computational
costs. Recently, deep graph networks have been employed as efficient,
data-driven models for predicting properties of complex systems represented as
graphs. These models rely on a message passing strategy that should, in
principle, capture long-range information without explicitly modeling the
corresponding interactions. In practice, most deep graph networks cannot really
model long-range dependencies due to the intrinsic limitations of (synchronous)
message passing, namely oversmoothing, oversquashing, and underreaching. This
work proposes a general framework that learns to mitigate these limitations:
within a variational inference framework, we endow message passing
architectures with the ability to adapt their depth and filter messages along
the way. With theoretical and empirical arguments, we show that this strategy
better captures long-range interactions, by competing with the state of the art
on five node and graph prediction datasets.
[LINK]
http://arxiv.org/abs/2312.16560v3
[DATE]
2025-05-11 23:08:10+08:00
[CATEGORIES]
cs.LG
GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance
[AUTHORS]
Jinuk Kim, Marwa El Halabi, Wonpyo Park, Clemens JS Schaefer, Deokjae Lee, Yeonhong Park, Jae W. Lee, Hyun Oh Song
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2505.07004v1
[DATE]
2025-05-11 22:55:09+08:00
[CATEGORIES]
cs.LG
Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models
[AUTHORS]
Bidur Khanal, Sandesh Pokhrel, Sanjay Bhandari, Ramesh Rana, Nikesh Shrestha, Ram Bahadur Gurung, Cristian Linte, Angus Watson, Yash Raj Shrestha, Binod Bhattarai
[ABSTRACT]
Vision-Language Models (VLMs) are becoming increasingly popular in the
medical domain, bridging the gap between medical images and clinical language.
Existing VLMs demonstrate an impressive ability to comprehend medical images
and text queries to generate detailed, descriptive diagnostic medical reports.
However, hallucination–the tendency to generate descriptions that are
inconsistent with the visual content–remains a significant issue in VLMs, with
particularly severe implications in the medical field. To facilitate VLM
research on gastrointestinal (GI) image analysis and study hallucination, we
curate a multimodal image-text GI dataset: Gut-VLM. This dataset is created
using a two-stage pipeline: first, descriptive medical reports of Kvasir-v2
images are generated using ChatGPT, which introduces some hallucinated or
incorrect texts. In the second stage, medical experts systematically review
these reports, and identify and correct potential inaccuracies to ensure
high-quality, clinically reliable annotations. Unlike traditional datasets that
contain only descriptive texts, our dataset also features tags identifying
hallucinated sentences and their corresponding corrections. A common approach
to reducing hallucination in VLM is to finetune the model on a small-scale,
problem-specific dataset. However, we take a different strategy using our
dataset. Instead of finetuning the VLM solely for generating textual reports,
we finetune it to detect and correct hallucinations, an approach we call
hallucination-aware finetuning. Our results show that this approach is better
than simply finetuning for descriptive report generation. Additionally, we
conduct an extensive evaluation of state-of-the-art VLMs across several
metrics, establishing a benchmark. GitHub Repo:
https://github.com/bhattarailab/Hallucination-Aware-VLM.
[LINK]
http://arxiv.org/abs/2505.07001v1
[DATE]
2025-05-11 22:54:11+08:00
[CATEGORIES]
cs.LG
Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes
[AUTHORS]
Davide Barbieri, Matteo Bonforte, Peio Ibarrondo
[ABSTRACT]
In this paper we analyze the behaviour of the stochastic gradient descent
(SGD), a widely used method in supervised learning for optimizing neural
network weights via a minimization of non-convex loss functions. Since the
pioneering work of E, Li and Tai (2017), the underlying structure of such
processes can be understood via parabolic PDEs of Fokker-Planck type, which are
at the core of our analysis. Even if Fokker-Planck equations have a long
history and an extensive literature, almost nothing is known when the potential
is non-convex or when the diffusion matrix is degenerate, and this is the main
difficulty that we face in our analysis.
We identify two different regimes: in the initial phase of SGD, the loss
function drives the weights to concentrate around the nearest local minimum. We
refer to this phase as the drift regime and we provide quantitative estimates
on this concentration phenomenon. Next, we introduce the diffusion regime,
where stochastic fluctuations help the learning process to escape suboptimal
local minima. We analyze the Mean Exit Time (MET) and prove upper and lower
bounds of the MET. Finally, we address the asymptotic convergence of SGD for a
non-convex cost function and a degenerate diffusion matrix, a setting in which
standard approaches do not apply and new techniques are required. For this purpose,
we exploit two different methods: duality and entropy methods.
We provide new results about the dynamics and effectiveness of SGD, offering
a deep connection between stochastic optimization and PDE theory, and some
answers and insights to basic questions in the Machine Learning processes: How
long does SGD take to escape from a bad minimum? Do neural network parameters
converge using SGD? How do parameters evolve in the first stage of training
with SGD?
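
The two regimes described above can be illustrated with a toy simulation. The sketch below is not the authors' analysis: it assumes a hypothetical one-dimensional double-well loss V(x) = (x^2 - 1)^2 and models SGD as gradient descent plus Gaussian noise of scale `sigma` (the function names, step size, and noise model are all illustrative assumptions). Without noise the iterate settles in the nearest minimum (drift); with noise it eventually crosses the barrier at x = 0, and the average number of steps needed is an empirical stand-in for the mean exit time.

```python
import math
import random


def grad_potential(x: float) -> float:
    """Gradient of the toy double-well loss V(x) = (x^2 - 1)^2."""
    return 4.0 * x * (x * x - 1.0)


def noisy_descent(x0, lr, sigma, max_steps, rng, exit_at=None):
    """Iterate x <- x - lr * V'(x) + sqrt(lr) * sigma * N(0, 1).

    Returns (final_x, steps). If exit_at is given, the run stops the first
    time x rises above it and reports the step count at that moment.
    """
    x = x0
    for step in range(1, max_steps + 1):
        noise = math.sqrt(lr) * sigma * rng.gauss(0.0, 1.0)
        x = x - lr * grad_potential(x) + noise
        if exit_at is not None and x > exit_at:
            return x, step
    return x, max_steps


def mean_exit_time(sigma, runs=20, lr=0.01, max_steps=100_000, seed=0):
    """Average number of steps needed to leave the left well (cross x = 0)."""
    rng = random.Random(seed)
    times = [
        noisy_descent(-1.0, lr, sigma, max_steps, rng, exit_at=0.0)[1]
        for _ in range(runs)
    ]
    return sum(times) / runs
```

Calling `noisy_descent(-0.5, 0.01, 0.0, 5000, random.Random(1))` shows the drift regime, while `mean_exit_time(sigma)` for a few values of `sigma` gives a rough feel for how the escape time depends on the noise level.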
[LINK]
http://arxiv.org/abs/2501.08425v2
[DATE]
2025-05-11 22:54:09+08:00
[CATEGORIES]
cs.LG
LSR-IGRU: Stock Trend Prediction Based on Long Short-Term Relationships and Improved GRU
[AUTHORS]
Peng Zhu, Yuante Li, Yifan Hu, Qinyuan Liu, Dawei Cheng, Yuqi Liang
[ABSTRACT]
Stock price prediction is a challenging problem in the field of finance and
receives widespread attention. In recent years, with the rapid development of
technologies such as deep learning and graph neural networks, more research
methods have begun to focus on exploring the interrelationships between stocks.
However, existing methods mostly focus on the short-term dynamic relationships
of stocks and directly integrate relationship information with temporal
information. They often overlook the complex nonlinear dynamic characteristics
and potential higher-order interaction relationships among stocks in the stock
market. Therefore, we propose a stock price trend prediction model named
LSR-IGRU in this paper, which is based on long short-term stock relationships
and an improved GRU input. Firstly, we construct a long short-term relationship
matrix between stocks, where secondary industry information is employed for the
first time to capture long-term relationships of stocks, and overnight price
information is utilized to establish short-term relationships. Next, we improve
the inputs of the GRU model at each step, enabling the model to more
effectively integrate temporal information and long short-term relationship
information, thereby significantly improving the accuracy of predicting stock
trend changes. Finally, through extensive experiments on multiple datasets from
stock markets in China and the United States, we validate the superiority of
the proposed LSR-IGRU model over the current state-of-the-art baseline models.
We also apply the proposed model to the algorithmic trading system of a
financial company, achieving significantly higher cumulative portfolio returns
compared to other baseline methods. Our source code is released at
https://github.com/ZP1481616577/Baselines_LSR-IGRU.
[LINK]
http://arxiv.org/abs/2409.08282v3
[DATE]
2025-05-11 22:36:17+08:00
[CATEGORIES]
cs.LG
Branches: Efficiently Seeking Optimal Sparse Decision Trees with AO*
[AUTHORS]
Ayman Chaouki, Jesse Read, Albert Bifet
[ABSTRACT]
Decision Tree (DT) Learning is a fundamental problem in Interpretable Machine
Learning, yet it poses a formidable optimisation challenge. Practical
algorithms have recently emerged, primarily leveraging Dynamic Programming and
Branch & Bound. However, most of these approaches rely on a Depth-First-Search
strategy, which is inefficient when searching for DTs at high depths and
requires the definition of a maximum depth hyperparameter. Best-First-Search
was also employed by other methods to circumvent these issues. The downside of
this strategy is its higher memory consumption; as such, it has to be designed
in a fully efficient manner that takes full advantage of the problem’s
structure. We formulate the problem within an AND/OR graph search framework and
we solve it with a novel AO*-type algorithm called Branches. We prove both
optimality and complexity guarantees for Branches and we show that it is more
efficient than the state of the art theoretically and on a variety of
experiments. Furthermore, unlike the other methods, Branches supports
non-binary features, and we show that this property can induce even larger gains in
computational efficiency.
[COMMENTS]
Proceedings of the 42nd International Conference on Machine Learning,
Vancouver, Canada. PMLR 267, 2025
[LINK]
http://arxiv.org/abs/2406.02175v5
[DATE]
2025-05-11 22:13:55+08:00
[CATEGORIES]
cs.LG
Differentially Private Bilevel Optimization
[AUTHORS]
Guy Kornowski
[ABSTRACT]
We present differentially private (DP) algorithms for bilevel optimization, a
problem class that has recently received significant attention in various machine
learning applications. These are the first algorithms for such problems under
standard DP constraints, and are also the first to avoid Hessian computations
which are prohibitive in large-scale settings. Under the well-studied setting
in which the upper-level is not necessarily convex and the lower-level problem
is strongly-convex, our proposed gradient-based $(\epsilon,\delta)$-DP
algorithm returns a point with hypergradient norm at most
$\widetilde{\mathcal{O}}\left((\sqrt{d_\mathrm{up}}/\epsilon
n)^{1/2}+(\sqrt{d_\mathrm{low}}/\epsilon n)^{1/3}\right)$ where $n$ is the
dataset size, and $d_\mathrm{up}/d_\mathrm{low}$ are the upper/lower level
dimensions. Our analysis covers constrained and unconstrained problems alike,
accounts for mini-batch gradients, and applies to both empirical and population
losses. As an application, we specialize our analysis to derive a simple
private rule for tuning a regularization hyperparameter.
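
To get a feel for the stated rate, the sketch below evaluates the two terms of the bound numerically. It assumes the expression reads as $(\sqrt{d_\mathrm{up}}/(\epsilon n))^{1/2}+(\sqrt{d_\mathrm{low}}/(\epsilon n))^{1/3}$ and drops the constants and logarithmic factors hidden by the $\widetilde{\mathcal{O}}$; the function name is hypothetical and the numbers it returns are only indicative of scaling.

```python
import math


def hypergradient_bound(n: int, eps: float, d_up: int, d_low: int) -> float:
    """Evaluate (sqrt(d_up)/(eps*n))^(1/2) + (sqrt(d_low)/(eps*n))^(1/3).

    Constants and logarithmic factors hidden by the O-tilde are ignored,
    so only the trend in n, eps, d_up and d_low is meaningful.
    """
    upper_term = (math.sqrt(d_up) / (eps * n)) ** 0.5
    lower_term = (math.sqrt(d_low) / (eps * n)) ** (1.0 / 3.0)
    return upper_term + lower_term
```

Under this reading, the lower-level term decays more slowly in `n` (exponent 1/3 rather than 1/2), so it is the one that dominates for large datasets of comparable dimensions.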
[COMMENTS]
Major rewrite: Sections 3 & 7 are new; various improvements in
presentation
[LINK]
http://arxiv.org/abs/2409.19800v2
[DATE]
2025-05-11 22:13:26+08:00
[CATEGORIES]
cs.LG
Targeted Deep Learning System Boundary Testing
[AUTHORS]
Oliver Weißl, Amr Abdellatif, Xingcheng Chen, Giorgi Merabishvili, Vincenzo Riccio, Severin Kacianka, Andrea Stocco
[ABSTRACT]
Evaluating the behavioral boundaries of deep learning (DL) systems is crucial
for understanding their reliability across diverse, unseen inputs. Existing
solutions fall short as they rely on untargeted random, model- or latent-based
perturbations, due to difficulties in generating controlled input variations.
In this work, we introduce Mimicry, a novel black-box test generator for
fine-grained, targeted exploration of DL system boundaries. Mimicry performs
boundary testing by leveraging the probabilistic nature of DL outputs to
identify promising directions for exploration. It uses style-based GANs to
disentangle input representations into content and style components, enabling
controlled feature mixing to approximate the decision boundary. We evaluated
Mimicry’s effectiveness in generating boundary inputs for five widely used DL
image classification systems of increasing complexity, comparing it to two
baseline approaches. Our results show that Mimicry consistently identifies
inputs closer to the decision boundary. It generates semantically meaningful
boundary test cases that reveal new functional (mis)behaviors, while the
baselines produce mainly corrupted or invalid inputs. Thanks to its enhanced
control over latent space manipulations, Mimicry remains effective as dataset
complexity increases, maintaining competitive diversity and higher validity
rates, confirmed by human assessors.
[LINK]
http://arxiv.org/abs/2408.06258v2
[DATE]
2025-05-11 21:33:33+08:00
[CATEGORIES]
cs.LG
Learning Value of Information towards Joint Communication and Control in 6G V2X
[AUTHORS]
Lei Lei, Kan Zheng, Xuemin Shen
[ABSTRACT]
As Cellular Vehicle-to-Everything (C-V2X) evolves towards future
sixth-generation (6G) networks, Connected Autonomous Vehicles (CAVs) are
emerging to become a key application. Leveraging data-driven Machine Learning
(ML), especially Deep Reinforcement Learning (DRL), is expected to
significantly enhance CAV decision-making in both vehicle control and V2X
communication under uncertainty. These two decision-making processes are
closely intertwined, with the value of information (VoI) acting as a crucial
bridge between them. In this paper, we introduce Sequential Stochastic Decision
Process (SSDP) models to define and assess VoI, demonstrating their application
in optimizing communication systems for CAVs. Specifically, we formally define
the SSDP model and demonstrate that the MDP model is a special case of it. The
SSDP model offers a key advantage by explicitly representing the set of
information that can enhance decision-making when available. Furthermore, as
current research on VoI remains fragmented, we propose a systematic VoI
modeling framework grounded in the MDP, Reinforcement Learning (RL) and Optimal
Control theories. We define different categories of VoI and discuss their
corresponding estimation methods. Finally, we present a structured approach to
leverage the various VoI metrics for optimizing the “When”, “What”, and “How”
to communicate problems. For this purpose, SSDP models are formulated
with VoI-associated reward functions derived from VoI-based optimization
objectives. While we use a simple vehicle-following control problem to
illustrate the proposed methodology, it holds significant potential to
facilitate the joint optimization of stochastic, sequential control and
communication decisions in a wide range of networked control systems.
[LINK]
http://arxiv.org/abs/2505.06978v1
[DATE]
2025-05-11 21:30:35+08:00
[CATEGORIES]
cs.LG
Scaling Large Motion Models with Million-Level Human Motions
[AUTHORS]
Ye Wang, Sipeng Zheng, Bin Cao, Qianshan Wei, Weishuai Zeng, Qin Jin, Zongqing Lu
[ABSTRACT]
Inspired by the recent success of LLMs, the field of human motion
understanding has increasingly shifted toward developing large motion models.
Despite some progress, current efforts remain far from achieving truly
generalist models, primarily due to the lack of massive high-quality data. To
address this gap, we present MotionLib, the first million-level dataset for
motion generation, which is at least 15$\times$ larger than existing
counterparts and enriched with hierarchical text descriptions. Using MotionLib,
we train a large motion model named Being-M0, demonstrating robust performance
across a wide range of human activities, including unseen ones. Through
systematic investigation, for the first time, we highlight the importance of
scaling both data and model size for advancing motion generation, along with
key insights to achieve this goal. To better integrate the motion modality, we
propose Motionbook, an innovative motion encoding approach including (1) a
compact yet lossless feature to represent motions; (2) a novel 2D lookup-free
motion tokenizer that preserves fine-grained motion details while expanding
codebook capacity, significantly enhancing the representational power of motion
tokens. We believe this work lays the groundwork for developing more versatile
and powerful motion generation models in the future. For further details, visit
https://github.com/BeingBeyond/Being-M0.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2410.03311v2
[DATE]
2025-05-11 21:16:11+08:00
[CATEGORIES]
cs.LG
Clustering Properties of Self-Supervised Learning
[AUTHORS]
Xi Weng, Jianing An, Xudong Ma, Binhang Qi, Jie Luo, Xi Yang, Jin Song Dong, Lei Huang
[ABSTRACT]
Self-supervised learning (SSL) methods via joint embedding architectures have
proven remarkably effective at capturing semantically rich representations with
strong clustering properties, magically in the absence of label supervision.
Despite this, few of them have explored leveraging these untapped properties to
improve themselves. In this paper, we provide evidence through various
metrics that the encoder’s output $encoding$ exhibits superior and more stable
clustering properties compared to other components. Building on this insight,
we propose a novel positive-feedback SSL method, termed Representation
Self-Assignment (ReSA), which leverages the model’s clustering properties to
promote learning in a self-guided manner. Extensive experiments on standard SSL
benchmarks reveal that models pretrained with ReSA outperform other
state-of-the-art SSL methods by a significant margin. Finally, we analyze how
ReSA facilitates better clustering properties, demonstrating that it
effectively enhances clustering performance at both fine-grained and
coarse-grained levels, shaping representations that are inherently more
structured and semantically meaningful.
[COMMENTS]
Accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2501.18452v2
[DATE]
2025-05-11 20:46:57+08:00
[CATEGORIES]
cs.LG
A Formally Verified Robustness Certifier for Neural Networks (Extended Version)
[AUTHORS]
James Tobler, Hira Taqdees Syeda, Toby Murray
[ABSTRACT]
Neural networks are often susceptible to minor perturbations in input that
cause them to misclassify. A recent solution to this problem is the use of
globally-robust neural networks, which employ a function to certify that the
classification of an input cannot be altered by such a perturbation. Outputs
that pass this test are called certified robust. However, to the authors’
knowledge, these certification functions have not yet been verified at the
implementation level. We demonstrate how previous unverified implementations
are exploitably unsound in certain circumstances. Moreover, they often rely on
approximation-based algorithms, such as power iteration, that (perhaps
surprisingly) do not guarantee soundness. To provide assurance that a given
output is robust, we implemented and formally verified a certification function
for globally-robust neural networks in Dafny. We describe the program, its
specifications, and the important design decisions taken for its implementation
and verification, as well as our experience applying it in practice.
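
One standard reason power iteration alone cannot give a sound bound is that it approaches the spectral norm from below. The sketch below is an illustrative assumption, not the paper's Dafny code or any previous certifier's implementation: it runs power iteration on $A^\top A$ for a small hypothetical matrix and shows that an unlucky starting vector yields a value strictly below the true norm. All function names are made up for this example.

```python
import math


def matvec(a, v):
    """Multiply matrix a (list of rows) by vector v."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def power_iteration_spectral_norm(a, v0, iters):
    """Estimate ||A||_2 by power iteration on A^T A, starting from v0."""
    at = transpose(a)
    scale = norm(v0)
    v = [x / scale for x in v0]
    for _ in range(iters):
        w = matvec(at, matvec(a, v))
        scale = norm(w)
        v = [x / scale for x in w]
    # ||A v|| with ||v|| = 1 can never exceed the true spectral norm.
    return norm(matvec(a, v))


# Hypothetical example: the true spectral norm of this diagonal matrix is 3.0.
A = [[3.0, 0.0], [0.0, 2.9]]
unlucky = power_iteration_spectral_norm(A, [0.0, 1.0], iters=100)
generic = power_iteration_spectral_norm(A, [1.0, 1.0], iters=5)
```

Here `unlucky` stays at about 2.9 no matter how many iterations are run, and `generic` is still below 3.0 after five steps. A certifier that divides a margin by such an estimate would report a radius that is too large, which is the kind of gap a verified upper bound is meant to close.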
[LINK]
http://arxiv.org/abs/2505.06958v1
[DATE]
2025-05-11 20:05:14+08:00
[CATEGORIES]
cs.LG
Unsupervised Learning for Class Distribution Mismatch
[AUTHORS]
Pan Du, Wangbo Zhao, Xinai Lu, Nian Liu, Zhikai Li, Chaoyu Gong, Suyun Zhao, Hong Chen, Cuiping Li, Kai Wang, Yang You
[ABSTRACT]
Class distribution mismatch (CDM) refers to the discrepancy between class
distributions in training data and target tasks. Previous methods address this
by designing classifiers to categorize classes known during training, while
grouping unknown or new classes into an “other” category. However, they focus
on semi-supervised scenarios and heavily rely on labeled data, limiting their
applicability and performance. To address this, we propose Unsupervised
Learning for Class Distribution Mismatch (UCDM), which constructs
positive-negative pairs from unlabeled data for classifier training. Our
approach randomly samples images and uses a diffusion model to add or erase
semantic classes, synthesizing diverse training pairs. Additionally, we
introduce a confidence-based labeling mechanism that iteratively assigns
pseudo-labels to valuable real-world data and incorporates them into the
training process. Extensive experiments on three datasets demonstrate UCDM’s
superiority over previous semi-supervised methods. Specifically, with a 60%
mismatch proportion on the Tiny-ImageNet dataset, our approach, without relying on
labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%,
and 72.5% in classifying known, unknown, and new classes.
[COMMENTS]
Accepted by ICML 2025
[LINK]
http://arxiv.org/abs/2505.06948v1
[DATE]
2025-05-11 19:29:48+08:00
[CATEGORIES]
cs.LG
RobGC: Towards Robust Graph Condensation
[AUTHORS]
Xinyi Gao, Hongzhi Yin, Tong Chen, Guanhua Ye, Wentao Zhang, Bin Cui
[ABSTRACT]
Graph neural networks (GNNs) have attracted widespread attention for their
impressive capability of graph representation learning. However, the increasing
prevalence of large-scale graphs presents a significant challenge for GNN
training due to their computational demands, limiting the applicability of GNNs
in various scenarios. In response to this challenge, graph condensation (GC) is
proposed as a promising acceleration solution, focusing on generating an
informative compact graph that enables efficient training of GNNs while
retaining performance. Despite the potential to accelerate GNN training,
existing GC methods overlook the quality of large training graphs during both
the training and inference stages. They indiscriminately emulate the training
graph distributions, making the condensed graphs susceptible to noise within
the training graph and significantly impeding the application of GC in
intricate real-world scenarios. To address this issue, we propose robust graph
condensation (RobGC), a plug-and-play approach for GC to extend the robustness
and applicability of condensed graphs in noisy graph structure environments.
Specifically, RobGC leverages the condensed graph as a feedback signal to guide
the denoising process on the original training graph. A label propagation-based
alternating optimization strategy is in place for the condensation and
denoising processes, contributing to the mutual purification of the condensed
graph and training graph. Additionally, as a GC method designed for inductive
graph inference, RobGC facilitates test-time graph denoising by leveraging the
noise-free condensed graph to calibrate the structure of the test graph.
Extensive experiments show that RobGC is compatible with various GC methods,
significantly boosting their robustness under different types and levels of
graph structural noise.
[COMMENTS]
Accepted by TKDE 2025
[LINK]
http://arxiv.org/abs/2406.13200v2
[DATE]
2025-05-11 19:03:10+08:00
[CATEGORIES]
cs.LG
Mamba-Based Graph Convolutional Networks: Tackling Over-smoothing with Selective State Space
[AUTHORS]
Xin He, Yili Wang, Wenqi Fan, Xu Shen, Xin Juan, Rui Miao, Xin Wang
[ABSTRACT]
Graph Neural Networks (GNNs) have shown great success in various graph-based
learning tasks. However, they often face the issue of over-smoothing as the
model depth increases, which causes all node representations to converge to a
single value and become indistinguishable. This issue stems from the inherent
limitations of GNNs, which struggle to distinguish the importance of
information from different neighborhoods. In this paper, we introduce MbaGCN, a
novel graph convolutional architecture that draws inspiration from the Mamba
paradigm, originally designed for sequence modeling. MbaGCN presents a new
backbone for GNNs, consisting of three key components: the Message Aggregation
Layer, the Selective State Space Transition Layer, and the Node State
Prediction Layer. These components work in tandem to adaptively aggregate
neighborhood information, providing greater flexibility and scalability for
deep GNN models. While MbaGCN may not consistently outperform all existing
methods on each dataset, it provides a foundational framework that demonstrates
the effective integration of the Mamba paradigm into graph representation
learning. Through extensive experiments on benchmark datasets, we demonstrate
that MbaGCN paves the way for future advancements in graph neural network
research.
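
The over-smoothing effect mentioned above is easy to reproduce on a toy graph. The sketch below does not implement MbaGCN; it assumes a plain mean-aggregation layer with self-loops (no learned weights or nonlinearity) on a hypothetical 5-node path graph, and tracks how the spread of the node features shrinks as layers are stacked.

```python
def propagate(features, neighbors):
    """One mean-aggregation layer: each node averages itself and its neighbours."""
    out = []
    for node, nbrs in enumerate(neighbors):
        group = [node] + nbrs
        out.append(sum(features[j] for j in group) / len(group))
    return out


def feature_spread(features):
    """Gap between the largest and smallest node feature."""
    return max(features) - min(features)


# Hypothetical path graph 0-1-2-3-4 with scalar node features.
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
features = [1.0, 0.0, 0.0, 0.0, -1.0]

spreads = []
h = features
for depth in range(1, 201):
    h = propagate(h, neighbors)
    if depth in (1, 10, 200):
        spreads.append((depth, feature_spread(h)))
```

The recorded spreads decrease with depth and are numerically negligible after 200 layers, meaning every node ends up with essentially the same value. Adaptive or selective aggregation schemes are one way to counteract this collapse.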
[COMMENTS]
11 pages, 4 figures
[LINK]
http://arxiv.org/abs/2501.15461v2
[DATE]
2025-05-11 19:02:32+08:00
[CATEGORIES]
cs.LG
AI-Powered Inverse Design of Ku-Band SIW Resonant Structures by Iterative Residual Correction Network
[AUTHORS]
Mohammad Mashayekhi, Kamran Salehian
[ABSTRACT]
Inverse electromagnetic modeling has emerged as a powerful approach for
designing complex microwave structures with high accuracy and efficiency. In
this study, we propose an Iterative Residual Correction Network (IRC-Net) for
the inverse design of Ku-band Substrate Integrated Waveguide (SIW) components
based on multimode resonators. We use a multimode resonance structure to
demonstrate that it is possible to control the resonances of the structure.
Therefore, these structures can be used for resonant components and smart
filter design. The proposed deep learning architecture leverages residual
neural networks to overcome the limitations of traditional inverse design
techniques, such as the Feedforward Inverse Model (FIM), offering improved
generalization and prediction accuracy. The approach begins with a FIM to
generate initial design estimates, followed by an iterative correction strategy
inspired by the Hybrid Inverse-Forward Residual Refinement Network
(HiFR$^2$-Net), which we call IRC-Net. Experiments demonstrate
that the IRC-Net achieves substantial improvements in prediction accuracy
compared to traditional single-stage networks, validated through statistical
metrics, full-wave electromagnetic simulations, and measurements. To validate
the proposed framework, we first design and fabricate a three-resonance SIW
structure. Next, we apply the trained IRC-Net model to predict the geometry of
a four-resonance structure based on its desired frequency response. Both
designs are fabricated and tested, showing strong agreement between the
simulated, predicted, and measured results, confirming the effectiveness and
practicality of the proposed method.
[COMMENTS]
18 pages, 14 figures
[LINK]
http://arxiv.org/abs/2505.06936v1
[DATE]
2025-05-11 18:51:43+08:00
[CATEGORIES]
cs.LG
ItDPDM: Information-Theoretic Discrete Poisson Diffusion Model
[AUTHORS]
Sagnik Bhattacharya, Abhiram Gorle, Ahmed Mohsin, Ahsan Bilal, Connor Ding, Amit Kumar Singh Yadav, Tsachy Weissman
[ABSTRACT]
Existing methods for generative modeling of discrete data, such as symbolic
music tokens, face two primary challenges: (1) they either embed discrete
inputs into continuous state-spaces or (2) rely on variational losses that only
approximate the true negative log-likelihood. Previous efforts have
individually targeted these limitations. While information-theoretic Gaussian
diffusion models alleviate the suboptimality of variational losses, they still
perform modeling in continuous domains. In this work, we introduce the
Information-Theoretic Discrete Poisson Diffusion Model (ItDPDM), which
simultaneously addresses both limitations by directly operating in a discrete
state-space via a Poisson diffusion process inspired by photon arrival
processes in camera sensors. We introduce a novel Poisson Reconstruction Loss
(PRL) and derive an exact relationship between PRL and the true negative
log-likelihood, thereby eliminating the need for approximate evidence lower
bounds. Experiments conducted on the Lakh MIDI symbolic music dataset and the
CIFAR-10 image benchmark demonstrate that ItDPDM delivers significant
improvements, reducing test NLL by up to 80% compared to prior baselines, while
also achieving faster convergence.
[COMMENTS]
Pre-print
[LINK]
http://arxiv.org/abs/2505.05082v2
[DATE]
2025-05-11 18:49:46+08:00
[CATEGORIES]
cs.LG
Reward-free World Models for Online Imitation Learning
[AUTHORS]
Shangzhe Li, Zhiao Huang, Hao Su
[ABSTRACT]
Imitation learning (IL) enables agents to acquire skills directly from expert
demonstrations, providing a compelling alternative to reinforcement learning.
However, prior online IL approaches struggle with complex tasks characterized
by high-dimensional inputs and complex dynamics. In this work, we propose a
novel approach to online imitation learning that leverages reward-free world
models. Our method learns environmental dynamics entirely in latent spaces
without reconstruction, enabling efficient and accurate modeling. We adopt the
inverse soft-Q learning objective, reformulating the optimization process in
the Q-policy space to mitigate the instability associated with traditional
optimization in the reward-policy space. By employing a learned latent dynamics
model and planning for control, our approach consistently achieves stable,
expert-level performance in tasks with high-dimensional observation or action
spaces and intricate dynamics. We evaluate our method on a diverse set of
benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating
superior empirical performance compared to existing approaches.
[COMMENTS]
ICML 2025; Code available at: https://github.com/TobyLeelsz/iqmpc
[LINK]
http://arxiv.org/abs/2410.14081v5
[DATE]
2025-05-11 18:32:36+08:00
[CATEGORIES]
cs.LG
Unraveling Quantum Environments: Transformer-Assisted Learning in Lindblad Dynamics
[AUTHORS]
Chi-Sheng Chen, En-Jui Kuo
[ABSTRACT]
Understanding dissipation in open quantum systems is crucial for the
development of robust quantum technologies. In this work, we introduce a
Transformer-based machine learning framework to infer time-dependent
dissipation rates in quantum systems governed by the Lindblad master equation.
Our approach uses time series of observable quantities, such as expectation
values of single Pauli operators, as input to learn dissipation profiles
without requiring knowledge of the initial quantum state or even the system
Hamiltonian.
We demonstrate the effectiveness of our approach on a hierarchy of open
quantum models of increasing complexity, including single-qubit systems with
time-independent or time-dependent jump rates, two-qubit interacting systems
(e.g., Heisenberg and transverse Ising models), and the Jaynes–Cummings model
involving light–matter interaction and cavity loss with time-dependent decay
rates. Our method accurately reconstructs both fixed and time-dependent decay
rates from observable time series. To support this, we prove that under
reasonable assumptions, the jump rates in all these models are uniquely
determined by a finite set of observables, such as qubit and photon
measurements. In practice, we combine Transformer-based architectures with
lightweight feature extraction techniques to efficiently learn these dynamics.
Our results suggest that modern machine learning tools can serve as scalable
and data-driven alternatives for identifying unknown environments in open
quantum systems.
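
The claim that jump rates are determined by observable time series can be checked by hand in the simplest case. The sketch below swaps the Transformer for an ordinary least-squares fit and assumes a single qubit under pure amplitude damping with a constant rate `gamma`, with the ground state at $\langle\sigma_z\rangle = -1$, so that $\langle\sigma_z\rangle(t) + 1$ decays as $e^{-\gamma t}$. The function names and parameter values are hypothetical.

```python
import math


def sigma_z_series(gamma, z0, times):
    """<sigma_z>(t) for pure amplitude damping, ground state at <sigma_z> = -1."""
    return [-1.0 + (z0 + 1.0) * math.exp(-gamma * t) for t in times]


def fit_decay_rate(times, z_values):
    """Negated least-squares slope of log(<sigma_z> + 1) against t."""
    ys = [math.log(z + 1.0) for z in z_values]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(ys) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, ys))
    var = sum((t - t_mean) ** 2 for t in times)
    return -cov / var


times = [0.1 * k for k in range(50)]
series = sigma_z_series(gamma=0.7, z0=0.4, times=times)
gamma_hat = fit_decay_rate(times, series)
```

The fit never uses `z0`: the initial value only shifts the intercept of the log-linear relation, which mirrors the statement that the rates can be learned without knowing the initial state. Time-dependent rates and multi-qubit models do not reduce to a single slope, which is where a learned model becomes useful.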
[LINK]
http://arxiv.org/abs/2505.06928v1
[DATE]
2025-05-11 18:18:19+08:00
[CATEGORIES]
cs.LG
Optimal Cross-Validation for Sparse Linear Regression
[AUTHORS]
Ryan Cory-Wright, Andrés Gómez
[ABSTRACT]
Given a high-dimensional covariate matrix and a response vector,
ridge-regularized sparse linear regression selects a subset of features that
explains the relationship between covariates and the response in an
interpretable manner. To select the sparsity and robustness of linear
regressors, techniques like k-fold cross-validation are commonly used for
hyperparameter tuning. However, cross-validation substantially increases the
computational cost of sparse regression as it requires solving many
mixed-integer optimization problems (MIOs) for each hyperparameter combination.
To improve upon this state of affairs, we obtain computationally tractable
relaxations of k-fold cross-validation metrics, facilitating hyperparameter
selection after solving 50-80% fewer MIOs in practice. These relaxations result
in an efficient cyclic coordinate descent scheme, achieving 10%-30% lower
validation errors than traditional methods such as grid search with MCP or
GLMNet across a suite of 13 real-world datasets.
[COMMENTS]
Moved stability-adjustment content to a different paper, as it was a
separate idea to the main point of the paper
[LINK]
http://arxiv.org/abs/2306.14851v3
[DATE]
2025-05-11 18:11:36+08:00
[CATEGORIES]
cs.LG
Stability Regularized Cross-Validation
[AUTHORS]
Ryan Cory-Wright, Andrés Gómez
[ABSTRACT]
We revisit the problem of ensuring strong test-set performance via
cross-validation. Motivated by the generalization theory literature, we propose
a nested k-fold cross-validation scheme that selects hyperparameters by
minimizing a weighted sum of the usual cross-validation metric and an empirical
model-stability measure. The weight on the stability term is itself chosen via
a nested cross-validation procedure. This reduces the risk of strong validation
set performance and poor test set performance due to instability. We benchmark
our procedure on a suite of 13 real-world UCI datasets, and find that, compared
to k-fold cross-validation over the same hyperparameters, it improves the
out-of-sample MSE for sparse ridge regression and CART by 4% on average, but
has no impact on XGBoost. This suggests that for interpretable and unstable
models, such as sparse regression and CART, our approach is a viable and
computationally affordable method for improving test-set performance.
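
The selection rule can be sketched on a one-dimensional ridge problem. Everything below is an assumption made for illustration: the abstract does not specify the stability measure, so the variance of the fold-wise slopes stands in for it, and the weight `mu` is passed in directly rather than tuned by the nested procedure the paper describes. Data, grid, and function names are hypothetical.

```python
import random


def ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for y ~ w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_and_instability(xs, ys, lam, k):
    """Return (mean validation MSE, variance of the k fold-wise slopes)."""
    n = len(xs)
    slopes, errors = [], []
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        valid = [i for i in range(n) if i % k == fold]
        w = ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        slopes.append(w)
        errors.append(sum((ys[i] - w * xs[i]) ** 2 for i in valid) / len(valid))
    mean_w = sum(slopes) / k
    instability = sum((w - mean_w) ** 2 for w in slopes) / k
    return sum(errors) / k, instability


def select_lambda(xs, ys, grid, k=5, mu=0.0):
    """Pick the grid value minimising CV error + mu * instability."""
    def score(lam):
        cv, inst = cv_and_instability(xs, ys, lam, k)
        return cv + mu * inst
    return min(grid, key=score)


rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
grid = [0.0, 0.1, 1.0, 10.0]
plain_choice = select_lambda(xs, ys, grid, mu=0.0)
stable_choice = select_lambda(xs, ys, grid, mu=1000.0)
```

With `mu=0.0` the rule reduces to ordinary k-fold cross-validation; increasing `mu` shifts the choice toward hyperparameters whose fitted models vary less from fold to fold.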
[COMMENTS]
Some of this material previously appeared in 2306.14851v2, which we
have split into two papers (this one and 2306.14851v3), because it contained
two ideas that need separate papers
[LINK]
http://arxiv.org/abs/2505.06927v1
[DATE]
2025-05-11 18:06:59+08:00
[CATEGORIES]
cs.LG
DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers
[AUTHORS]
Xuanlei Zhao, Shenggan Cheng, Chang Chen, Zangwei Zheng, Ziming Liu, Zheming Yang, Yang You
[ABSTRACT]
Scaling multi-dimensional transformers to long sequences is indispensable
across various domains. However, the challenges of large memory requirements
and slow speeds of such sequences necessitate sequence parallelism. All
existing approaches fall under the category of embedded sequence parallelism,
which is limited to sharding along a single sequence dimension, thereby
introducing significant communication overhead. However, the nature of
multi-dimensional transformers involves independent calculations across
multiple sequence dimensions. To this end, we propose Dynamic Sequence
Parallelism (DSP) as a novel abstraction of sequence parallelism. DSP
dynamically switches the parallel dimension among all sequences according to
the computation stage with an efficient resharding strategy. DSP offers
significant reductions in communication costs, adaptability across modules, and
ease of implementation with minimal constraints. Experimental evaluations
demonstrate DSP’s superiority over state-of-the-art embedded sequence
parallelism methods by remarkable throughput improvements ranging from 32.2% to
10x, with less than 25% communication volume.
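
The idea of switching the sharded dimension between computation stages can be simulated without any real devices. The sketch below is a toy model, not the DSP implementation: it assumes a 2D tensor stored as a list of rows, two hypothetical stages (one that needs whole rows, one that needs whole columns), and a resharding step that stands in for an all-to-all exchange. Dimension sizes are assumed to be divisible by the number of simulated devices `p`.

```python
def split_rows(x, p):
    """Shard along dimension 0: device d gets a contiguous block of rows."""
    step = len(x) // p
    return [x[d * step:(d + 1) * step] for d in range(p)]


def reshard_rows_to_cols(row_shards, p):
    """Simulated all-to-all: switch the sharded dimension from rows to columns."""
    full = [row for shard in row_shards for row in shard]
    step = len(full[0]) // p
    return [[row[d * step:(d + 1) * step] for row in full] for d in range(p)]


def row_stage(block):
    """Needs every entry of a row: subtract each row's minimum."""
    return [[v - min(row) for v in row] for row in block]


def col_stage(block):
    """Needs every entry of a column: take each column's maximum."""
    return [max(col) for col in zip(*block)]


def run_sharded(x, p):
    row_shards = [row_stage(s) for s in split_rows(x, p)]  # local, no communication
    col_shards = reshard_rows_to_cols(row_shards, p)       # one resharding step
    return [v for s in col_shards for v in col_stage(s)]   # local again


def run_reference(x):
    """Same two stages on the unsharded tensor."""
    return col_stage(row_stage(x))


x = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(4)]
sharded_result = run_sharded(x, p=2)
reference_result = run_reference(x)
```

Each stage runs entirely on local data because the tensor is sharded along the dimension the stage does not reduce over; the only communication is the single reshard between stages.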
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2403.10266v5
[DATE]
2025-05-11 17:53:27+08:00
[CATEGORIES]
cs.LG
Uni-AIMS: AI-Powered Microscopy Image Analysis
[AUTHORS]
Yanhui Hong, Nan Wang, Zhiyi Xia, Haoyi Tao, Xi Fang, Yiming Li, Jiankun Wang, Peng Jin, Xiaochen Cai, Shengyu Li, Ziqi Chen, Zezhong Zhang, Guolin Ke, Linfeng Zhang
[ABSTRACT]
This paper presents a systematic solution for the intelligent recognition and
automatic analysis of microscopy images. We developed a data engine that
generates high-quality annotated datasets through a combination of the
collection of diverse microscopy images from experiments, synthetic data
generation and a human-in-the-loop annotation process. To address the unique
challenges of microscopy images, we propose a segmentation model capable of
robustly detecting both small and large objects. The model effectively
identifies and separates thousands of closely situated targets, even in
cluttered visual environments. Furthermore, our solution supports the precise
automatic recognition of image scale bars, an essential feature in quantitative
microscopic analysis. Building upon these components, we have constructed a
comprehensive intelligent analysis platform and validated its effectiveness and
practicality in real-world applications. This study not only advances automatic
recognition in microscopy imaging but also ensures scalability and
generalizability across multiple application domains, offering a powerful tool
for automated microscopic analysis in interdisciplinary research.
[LINK]
http://arxiv.org/abs/2505.06918v1
[DATE]
2025-05-11 17:35:53+08:00
[CATEGORIES]
cs.LG
MMiC: Mitigating Modality Incompleteness in Clustered Federated Learning
[AUTHORS]
Lishan Yang, Wei Zhang, Quan Z. Sheng, Weitong Chen, Lina Yao, Ali Shakeri
[ABSTRACT]
In the era of big data, data mining has become indispensable for uncovering
hidden patterns and insights from vast and complex datasets. The integration of
multimodal data sources further enhances its potential. Multimodal Federated
Learning (MFL) is a distributed approach that enhances the efficiency and
quality of multimodal learning, ensuring collaborative work and privacy
protection. However, missing modalities pose a significant challenge in MFL,
often due to data quality issues or privacy policies across the clients. In
this work, we present MMiC, a framework for Mitigating Modality incompleteness
in MFL within the Clusters. MMiC replaces partial parameters within client
models inside clusters to mitigate the impact of missing modalities.
Furthermore, it leverages the Banzhaf Power Index to optimize client selection
under these conditions. Finally, MMiC employs an innovative approach to
dynamically control global aggregation by utilizing Markowitz Portfolio
Optimization. Extensive experiments demonstrate that MMiC consistently
outperforms existing federated learning architectures in both global and
personalized performance on multimodal datasets with missing modalities,
confirming the effectiveness of our proposed solution.
[COMMENTS]
10 pages, 10 figures; under review at KDD 2025
[LINK]
http://arxiv.org/abs/2505.06911v1
[DATE]
2025-05-11 17:12:36+08:00
[CATEGORIES]
cs.LG
Near-Field Channel Estimation for XL-MIMO: A Deep Generative Model Guided by Side Information
[AUTHORS]
Zhenzhou Jin, Li You, Derrick Wing Kwan Ng, Xiang-Gen Xia, Xiqi Gao
[ABSTRACT]
This paper investigates the near-field (NF) channel estimation (CE) for
extremely large-scale multiple-input multiple-output (XL-MIMO) systems.
Considering the pronounced NF effects in XL-MIMO communications, we first
establish a joint angle-distance (AD) domain-based spherical-wavefront physical
channel model that captures the inherent sparsity of XL-MIMO channels.
Leveraging the channel’s sparsity in the joint AD domain, the CE is approached
as a task of reconstructing sparse signals. Anchored in this framework, we
first propose a compressed sensing algorithm to acquire a preliminary channel
estimate. Harnessing the powerful implicit prior learning capability of
generative artificial intelligence (GenAI), we further propose a GenAI-based
approach to refine the estimated channel. Specifically, we introduce the
preliminary estimated channel as side information, and derive the evidence
lower bound (ELBO) of the log-marginal distribution of the target NF channel
conditioned on the preliminary estimated channel, which serves as the
optimization objective for the proposed generative diffusion model (GDM).
Additionally, we introduce a more generalized version of the GDM, the
non-Markovian GDM (NM-GDM), to accelerate the sampling process, achieving an
approximately tenfold enhancement in sampling efficiency. Experimental results
indicate that the proposed approach is capable of offering substantial
performance gain in CE compared to existing benchmark schemes within NF XL-MIMO
systems. Furthermore, our approach exhibits enhanced generalization
capabilities in both the NF and far-field (FF) regions.
[COMMENTS]
15 pages, 11 figures, to appear in IEEE Transactions on Cognitive
Communications and Networking
[LINK]
http://arxiv.org/abs/2505.06900v1
[DATE]
2025-05-11 16:35:36+08:00
[CATEGORIES]
cs.LG
Learning Soft Sparse Shapes for Efficient Time-Series Classification
[AUTHORS]
Zhen Liu, Yicheng Luo, Boyuan Li, Emadeldeen Eldele, Min Wu, Qianli Ma
[ABSTRACT]
Shapelets are discriminative subsequences (or shapes) with high
interpretability in time series classification. Due to the time-intensive
nature of shapelet discovery, existing shapelet-based methods mainly focus on
selecting discriminative shapes while discarding others to achieve candidate
subsequence sparsification. However, this approach may exclude beneficial
shapes and overlook the varying contributions of shapelets to classification
performance. To this end, we propose a Soft sparse Shapes
(SoftShape) model for efficient time series classification. Our
approach mainly introduces soft shape sparsification and soft shape learning
blocks. The former transforms shapes into soft representations based on
classification contribution scores, merging lower-scored ones into a single
shape to retain and differentiate all subsequence information. The latter
facilitates intra- and inter-shape temporal pattern learning, improving model
efficiency by using sparsified soft shapes as inputs. Specifically, we employ a
learnable router to activate a subset of class-specific expert networks for
intra-shape pattern learning. Meanwhile, a shared expert network learns
inter-shape patterns by converting sparsified shapes into sequences. Extensive
experiments show that SoftShape outperforms state-of-the-art methods and
produces interpretable results.
[COMMENTS]
Accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2505.06892v1
[DATE]
2025-05-11 16:14:37+08:00
[CATEGORIES]
cs.LG
Formal Verification of Markov Processes with Learned Parameters
[AUTHORS]
Muhammad Maaz, Timothy C. Y. Chan
[ABSTRACT]
We introduce the problem of formally verifying properties of Markov processes
where the parameters are given by the output of machine learning models. For a
broad class of machine learning models, including linear models, tree-based
models, and neural networks, verifying properties of Markov chains like
reachability, hitting time, and total reward can be formulated as a bilinear
program. We develop a decomposition and bound propagation scheme for solving
the bilinear program and show through computational experiments that our method
solves the problem to global optimality up to 100x faster than state-of-the-art
solvers. To demonstrate the practical utility of our approach, we apply it to a
real-world healthcare case study. Along with the paper, we release markovml, an
open-source tool for building Markov processes, integrating pretrained machine
learning models, and verifying their properties, available at
https://github.com/mmaaz-git/markovml.
[COMMENTS]
9 pages (main manuscript), 3 figures, 1 table
[LINK]
http://arxiv.org/abs/2501.15767v2
[DATE]
2025-05-11 16:04:44+08:00
[CATEGORIES]
cs.LG
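As a rough illustration of one property named in the abstract above (hitting time), the sketch below computes expected hitting times for a Markov chain whose transition matrix is already fixed. It is not the paper's verification method and does not use the markovml tool; the function name `expected_hitting_times` and the example matrix are assumptions made for illustration only.

```python
import numpy as np

def expected_hitting_times(P, targets):
    """Expected number of steps to reach any state in `targets` from each state.

    Assumes P is a row-stochastic matrix and that the target set is reached
    with probability 1 from every state (otherwise the linear system is singular).
    """
    n = P.shape[0]
    target_set = set(targets)
    others = [i for i in range(n) if i not in target_set]
    # Restrict the chain to the non-target states.
    Q = P[np.ix_(others, others)]
    # Hitting times satisfy h = 1 + Q h on non-target states, i.e. (I - Q) h = 1.
    h_others = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
    h = np.zeros(n)  # target states have hitting time 0
    h[others] = h_others
    return h

# Hypothetical two-state chain: state 0 moves to the target state 1 with probability 0.5.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
print(expected_hitting_times(P, targets=[1]))
```

When the entries of P come from a learned model rather than being fixed, they become decision variables, which is consistent with the abstract's description of the verification problem as a bilinear program.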
Image Classification Using a Diffusion Model as a Pre-Training Model
[AUTHORS]
Kosuke Ukita, Ye Xiaolong, Tsuyoshi Okita
[ABSTRACT]
In this paper, we propose a diffusion model that integrates a
representation-conditioning mechanism, where the representations derived from a
Vision Transformer (ViT) are used to condition the internal process of a
Transformer-based diffusion model. This approach enables
representation-conditioned data generation, addressing the challenge of
requiring large-scale labeled datasets by leveraging self-supervised learning
on unlabeled data. We evaluate our method through a zero-shot classification
task for hematoma detection in brain imaging. Compared to the strong
contrastive learning baseline, DINOv2, our method achieves a notable
improvement of +6.15% in accuracy and +13.60% in F1-score, demonstrating its
effectiveness in image classification.
[COMMENTS]
10 pages, 9 figures
[LINK]
http://arxiv.org/abs/2505.06890v1
[DATE]
2025-05-11 16:03:18+08:00
[CATEGORIES]
cs.LG
Long Term Memory: The Foundation of AI Self-Evolution
[AUTHORS]
Xun Jiang, Feng Li, Han Zhao, Jiahao Qiu, Jiaying Wang, Jun Shao, Shihao Xu, Shu Zhang, Weiling Chen, Xavier Tang, Yize Chen, Mengyue Wu, Weizhi Ma, Mengdi Wang, Tianqiao Chen
[ABSTRACT]
Large language models (LLMs) like GPTs, trained on vast datasets, have
demonstrated impressive capabilities in language understanding, reasoning, and
planning, achieving human-level performance in various tasks. Most studies
focus on enhancing these models by training on ever-larger datasets to build
more powerful foundation models. While training stronger models is important,
enabling models to evolve during inference is equally crucial, a process we
refer to as AI self-evolution. Unlike large-scale training, self-evolution may
rely on limited data or interactions. Inspired by the columnar organization of
the human cerebral cortex, we hypothesize that AI models could develop
cognitive abilities and build internal representations through iterative
interactions with their environment. To achieve this, models need long-term
memory (LTM) to store and manage processed interaction data. LTM supports
self-evolution by representing diverse experiences across environments and
agents. In this report, we explore AI self-evolution and its potential to
enhance models during inference. We examine LTM’s role in lifelong learning,
allowing models to evolve based on accumulated interactions. We outline the
structure of LTM and the systems needed for effective data retention and
representation. We also classify approaches for building personalized models
with LTM data and show how these models achieve self-evolution through
interaction. Using LTM, our multi-agent framework OMNE achieved first place on
the GAIA benchmark, demonstrating LTM’s potential for AI self-evolution.
Finally, we present a roadmap for future research, emphasizing the importance
of LTM for advancing AI technology and its practical applications.
[COMMENTS]
56 pages, 13 figures
[LINK]
http://arxiv.org/abs/2410.15665v4
[DATE]
2025-05-11 15:56:18+08:00
[CATEGORIES]
cs.LG
NeuRN: Neuro-inspired Domain Generalization for Image Classification
[AUTHORS]
Hamd Jalil, Ahmed Qazi, Asim Iqbal
[ABSTRACT]
Domain generalization in image classification is a crucial challenge, with
models often failing to generalize well across unseen datasets. We address this
issue by introducing a neuro-inspired Neural Response Normalization (NeuRN)
layer, which draws inspiration from neurons in the mammalian visual cortex and
aims to enhance the performance of deep learning architectures on unseen
target domains by training deep learning models on a source domain. The
performance of these models is considered as a baseline and then compared
against models integrated with NeuRN on image classification tasks. We perform
experiments across a range of deep learning architectures, including ones
derived from Neural Architecture Search and Vision Transformer. Additionally,
in order to shortlist models for our experiment from amongst the vast range of
deep neural networks available which have shown promising results, we also
propose a novel method that uses the Needleman-Wunsch algorithm to compute
similarity between deep learning architectures. Our results demonstrate the
effectiveness of NeuRN by showing improvement against baseline in cross-domain
image classification tasks. Our framework attempts to establish a foundation
for future neuro-inspired deep learning models.
[COMMENTS]
14 pages, 7 figures, 1 table
[LINK]
http://arxiv.org/abs/2505.06881v1
[DATE]
2025-05-11 15:20:11+08:00
[CATEGORIES]
cs.LG
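The abstract above mentions using the Needleman-Wunsch algorithm to compare deep learning architectures. A minimal sketch of the standard global-alignment score follows. How the authors encode an architecture as a sequence is not stated in the abstract, so the layer-name tokens and the scoring values (match, mismatch, gap) used here are assumptions.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two token sequences (standard Needleman-Wunsch)."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning the first i tokens of a with the first j tokens of b
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

# Hypothetical layer sequences standing in for two architectures.
arch_a = ["conv", "relu", "pool"]
arch_b = ["conv", "pool"]
print(needleman_wunsch_score(arch_a, arch_b))
```

A higher score means the two sequences align with fewer mismatches and gaps, which is one plausible way to rank architectures by similarity.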
Neural Algorithmic Reasoning with Multiple Correct Solutions
[AUTHORS]
Zeno Kujawa, John Poole, Dobrik Georgiev, Danilo Numeroso, Henry Fleischmann, Pietro Liò
[ABSTRACT]
Neural Algorithmic Reasoning (NAR) extends classical algorithms to higher
dimensional data. However, canonical implementations of NAR train neural
networks to return only a single solution, even when there are multiple correct
solutions to a problem, such as single-source shortest paths. For some
applications, it is desirable to recover more than one correct solution. To
that end, we give the first method for NAR with multiple solutions. We
demonstrate our method on two classical algorithms: Bellman-Ford (BF) and
Depth-First Search (DFS), favouring deeper insight into two algorithms over a
broader survey of algorithms. This method involves generating appropriate
training data as well as sampling and validating solutions from model output.
Each step of our method, which can serve as a framework for neural algorithmic
reasoning beyond the tasks presented in this paper, might be of independent
interest to the field and our results represent the first attempt at this task
in the NAR literature.
[LINK]
http://arxiv.org/abs/2409.06953v4
[DATE]
2025-05-11 15:01:18+08:00
[CATEGORIES]
cs.LG
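To make the notion of "multiple correct solutions" in the abstract above concrete, the following sketch runs classical Bellman-Ford and then records, for every node, all predecessors that lie on some shortest path. This is only the classical algorithm, not the neural method from the paper; the function name and the example graph are assumptions, and exact equality on distances presumes integer weights and no negative cycles.

```python
import math

def bellman_ford_all_predecessors(n, edges, source):
    """Return shortest distances and, per node, every predecessor on a shortest path.

    `edges` is a list of (u, v, w) tuples. Assumes no negative cycles and
    integer weights, so that the equality test below is exact.
    """
    dist = [math.inf] * n
    dist[source] = 0
    for _ in range(n - 1):
        updated = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                updated = True
        if not updated:
            break
    # An edge (u, v) is "tight" if it achieves the shortest distance to v.
    preds = {v: set() for v in range(n)}
    for u, v, w in edges:
        if dist[u] != math.inf and dist[u] + w == dist[v]:
            preds[v].add(u)
    return dist, preds

# Hypothetical diamond graph: two equally short paths from node 0 to node 3.
edges = [(0, 1, 1), (0, 2, 1), (1, 3, 1), (2, 3, 1)]
dist, preds = bellman_ford_all_predecessors(4, edges, source=0)
print(dist, preds[3])
```

Any choice of one predecessor per node from `preds` yields a valid shortest-path tree, so a model trained to return a single tree sees only one of several correct answers.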
M2PDE: Compositional Generative Multiphysics and Multi-component PDE Simulation
[AUTHORS]
Tao Zhang, Zhenhai Liu, Feipeng Qi, Yongjun Jiao, Tailin Wu
[ABSTRACT]
Multiphysics simulation, which models the interactions between multiple
physical processes, and multi-component simulation of complex structures are
critical in fields like nuclear and aerospace engineering. Previous studies use
numerical solvers or ML-based surrogate models for these simulations. However,
multiphysics simulations typically require integrating multiple specialized
solvers (each for a specific physical process) into a coupled program, which
introduces significant development challenges. Furthermore, existing numerical
algorithms struggle with highly complex large-scale structures in
multi-component simulations. Here we propose compositional Multiphysics and
Multi-component PDE Simulation with Diffusion models (M2PDE) to overcome these
challenges. During diffusion-based training, M2PDE learns energy functions
modeling the conditional probability of one physical process/component
conditioned on other processes/components. In inference, M2PDE generates
coupled multiphysics and multi-component solutions by sampling from the joint
probability distribution. We evaluate M2PDE on two multiphysics
tasks, reaction-diffusion and nuclear thermal coupling, where it achieves more
accurate predictions than surrogate models in challenging scenarios. We then
apply it to a multi-component prismatic fuel element problem, demonstrating
that M2PDE scales from single-component training to a 64-component structure
and outperforms existing domain-decomposition and graph-based approaches. The
code is available at https://github.com/AI4Science-WestlakeU/M2PDE.
[COMMENTS]
29 pages, 14 figures
[LINK]
http://arxiv.org/abs/2412.04134v3
[DATE]
2025-05-11 14:50:10+08:00
[CATEGORIES]
cs.LG
TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis
[AUTHORS]
Longtian Wang, Xiaofei Xie, Tianlin Li, Yuhan Zhi, Chao Shen
[ABSTRACT]
Text-to-image (T2I) models have significantly advanced in producing
high-quality images. However, such models have the ability to generate images
containing not-safe-for-work (NSFW) content, such as pornography, violence,
political content, and discrimination. To mitigate the risk of generating NSFW
content, refusal mechanisms, i.e., safety checkers, have been developed to
check potential NSFW content. Adversarial prompting techniques have been
developed to evaluate the robustness of the refusal mechanisms. The key
challenge remains to subtly modify the prompt in a way that preserves its
sensitive nature while bypassing the refusal mechanisms. In this paper, we
introduce TokenProber, a method designed for sensitivity-aware differential
testing, aimed at evaluating the robustness of the refusal mechanisms in T2I
models by generating adversarial prompts. Our approach is based on the key
observation that adversarial prompts often succeed by exploiting discrepancies
in how T2I models and safety checkers interpret sensitive content. Thus, we
conduct a fine-grained analysis of the impact of specific words within prompts,
distinguishing between dirty words that are essential for NSFW content
generation and discrepant words that highlight the different sensitivity
assessments between T2I models and safety checkers. Through the
sensitivity-aware mutation, TokenProber generates adversarial prompts, striking
a balance between maintaining NSFW content generation and evading detection.
Our evaluation of TokenProber against 5 safety checkers on 3 popular T2I
models, using 324 NSFW prompts, demonstrates its superior effectiveness in
bypassing safety filters compared to existing methods (e.g., 54%+ increase on
average), highlighting TokenProber’s ability to uncover robustness issues in
the existing refusal mechanisms.
[COMMENTS]
13 pages, 5 figures
[LINK]
http://arxiv.org/abs/2505.08804v1
[DATE]
2025-05-11 14:32:33+08:00
[CATEGORIES]
cs.LG
NewsNet-SDF: Stochastic Discount Factor Estimation with Pretrained Language Model News Embeddings via Adversarial Networks
[AUTHORS]
Shunyao Wang, Ming Cheng, Christina Dan Wang
[ABSTRACT]
Stochastic Discount Factor (SDF) models provide a unified framework for asset
pricing and risk assessment, yet traditional formulations struggle to
incorporate unstructured textual information. We introduce NewsNet-SDF, a novel
deep learning framework that seamlessly integrates pretrained language model
embeddings with financial time series through adversarial networks. Our
multimodal architecture processes financial news using GTE-multilingual models,
extracts temporal patterns from macroeconomic data via LSTM networks, and
normalizes firm characteristics, fusing these heterogeneous information sources
through an innovative adversarial training mechanism. Our dataset encompasses
approximately 2.5 million news articles and 10,000 unique securities,
addressing the computational challenges of processing and aligning text data
with financial time series. Empirical evaluations on U.S. equity data
(1980-2022) demonstrate NewsNet-SDF substantially outperforms alternatives with
a Sharpe ratio of 2.80. The model shows a 471% improvement over CAPM, over 200%
improvement versus traditional SDF implementations, and a 74% reduction in
pricing errors compared to the Fama-French five-factor model. In comprehensive
comparisons, our deep learning approach consistently outperforms traditional,
modern, and other neural asset pricing models across all key metrics. Ablation
studies confirm that text embeddings contribute significantly more to model
performance than macroeconomic features, with news-derived principal components
ranking among the most influential determinants of SDF dynamics. These results
validate the effectiveness of our multimodal deep learning approach in
integrating unstructured text with traditional financial data for more accurate
asset pricing, providing new insights for digital intelligent decision-making
in financial technology.
[LINK]
http://arxiv.org/abs/2505.06864v1
[DATE]
2025-05-11 14:18:58+08:00
[CATEGORIES]
cs.LG
DP-TRAE: A Dual-Phase Merging Transferable Reversible Adversarial Example for Image Privacy Protection
[AUTHORS]
Xia Du, Jiajie Zhu, Jizhe Zhou, Chi-man Pun, Zheng Lin, Cong Wu, Zhe Chen, Jun Luo
[ABSTRACT]
In the field of digital security, Reversible Adversarial Examples (RAE)
combine adversarial attacks with reversible data hiding techniques to
effectively protect sensitive data and prevent unauthorized analysis by
malicious Deep Neural Networks (DNNs). However, existing RAE techniques
primarily focus on white-box attacks, lacking a comprehensive evaluation of
their effectiveness in black-box scenarios. This limitation impedes their
broader deployment in complex, dynamic environments. Furthermore, traditional
black-box attacks are often characterized by poor transferability and high
query costs, significantly limiting their practical applicability. To address
these challenges, we propose the Dual-Phase Merging Transferable Reversible
Attack method, which generates highly transferable initial adversarial
perturbations in a white-box model and employs a memory-augmented black-box
strategy to effectively mislead target models. Experimental results
demonstrate the superiority of our approach, achieving a 99.0% attack success
rate and 100% recovery rate in black-box scenarios, highlighting its robustness
in privacy protection. Moreover, we successfully implemented a black-box attack
on a commercial model, further substantiating the potential of this approach
for practical use.
[COMMENTS]
12 pages, 5 figures
[LINK]
http://arxiv.org/abs/2505.06860v1
[DATE]
2025-05-11 14:11:10+08:00
[CATEGORIES]
cs.LG
FreqMoE: Dynamic Frequency Enhancement for Neural PDE Solvers
[AUTHORS]
Tianyu Chen, Haoyi Zhou, Ying Li, Hao Wang, Zhenzhe Zhang, Tianchen Zhu, Shanghang Zhang, Jianxin Li
[ABSTRACT]
Fourier Neural Operators (FNO) have emerged as promising solutions for
efficiently solving partial differential equations (PDEs) by learning
infinite-dimensional function mappings through frequency domain
transformations. However, the sparsity of high-frequency signals limits
computational efficiency for high-dimensional inputs, and fixed-pattern
truncation often causes high-frequency signal loss, reducing performance in
scenarios such as high-resolution inputs or long-term predictions. To address
these challenges, we propose FreqMoE, an efficient and progressive training
framework that exploits the dependency of high-frequency signals on
low-frequency components. The model first learns low-frequency weights and then
applies a sparse upward-cycling strategy to construct a mixture of experts
(MoE) in the frequency domain, effectively extending the learned weights to
high-frequency regions. Experiments on both regular and irregular grid PDEs
demonstrate that FreqMoE achieves up to 16.6% accuracy improvement while using
merely 2.1% parameters (47.32x reduction) compared to dense FNO. Furthermore,
the approach demonstrates remarkable stability in long-term predictions and
generalizes seamlessly to various FNO variants and grid structures,
establishing a new “Low frequency Pretraining, High frequency Fine-tuning”
paradigm for solving PDEs.
[COMMENTS]
Accepted by IJCAI 2025
[LINK]
http://arxiv.org/abs/2505.06858v1
[DATE]
2025-05-11 14:06:32+08:00
[CATEGORIES]
cs.LG
Improving Random Forests by Smoothing
[AUTHORS]
Ziyi Liu, Phuc Luong, Mario Boley, Daniel F. Schmidt
[ABSTRACT]
Gaussian process regression is a popular model in the small data regime due
to its sound uncertainty quantification and the exploitation of the smoothness
of the regression function that is encountered in a wide range of practical
problems. However, Gaussian processes perform sub-optimally when the degree of
smoothness is non-homogeneous across the input domain. Random forest regression
partially addresses this issue by providing local basis functions of variable
support set sizes that are chosen in a data-driven way. However, they do so at
the expense of forgoing any degree of smoothness, which often results in poor
performance in the small data regime. Here, we aim to combine the advantages of
both models by applying a kernel-based smoothing mechanism to a learned random
forest or any other piecewise constant prediction function. As we demonstrate
empirically, the resulting model consistently improves the predictive
performance of the underlying random forests and, in almost all test cases,
also improves the log loss of the usual uncertainty quantification based on
inter-tree variance. The latter advantage can be attributed to the ability of
the smoothing model to take into account the uncertainty over the exact
tree-splitting locations.
[COMMENTS]
14 pages, 2 figures, 4 pages appendix, 3 figures in appendix
[LINK]
http://arxiv.org/abs/2505.06852v1
[DATE]
2025-05-11 13:39:08+08:00
[CATEGORIES]
cs.LG
Predictive Digital Twins for Thermal Management Using Machine Learning and Reduced-Order Models
[AUTHORS]
Tamilselvan Subramani, Sebastian Bartscher
[ABSTRACT]
Digital twins enable real-time simulation and prediction in engineering
systems. This paper presents a novel framework for predictive digital twins of
a headlamp heatsink, integrating physics-based reduced-order models (ROMs) from
computational fluid dynamics (CFD) with supervised machine learning. A
component-based ROM library, derived via proper orthogonal decomposition (POD),
captures thermal dynamics efficiently. Machine learning models, including
Decision Trees, k-Nearest Neighbors, Support Vector Regression (SVR), and
Neural Networks, predict optimal ROM configurations, enabling rapid digital
twin updates. The Neural Network achieves a mean absolute error (MAE) of
54.240, outperforming other models. Quantitative comparisons of predicted and
original values demonstrate high accuracy. This scalable, interpretable
framework advances thermal management in automotive systems, supporting robust
design and predictive maintenance.
[COMMENTS]
10 pages, 2 tables, from M.Tech. thesis accepted at BITS Pilani, 2022
[LINK]
http://arxiv.org/abs/2505.06849v1
[DATE]
2025-05-11 13:20:16+08:00
[CATEGORIES]
cs.LG
Optimizing Recommendations using Fine-Tuned LLMs
[AUTHORS]
Prabhdeep Cheema, Erhan Guven
[ABSTRACT]
As digital media platforms strive to meet evolving user expectations,
delivering highly personalized and intuitive movie and media recommendations
has become essential for attracting and retaining audiences. Traditional
systems often rely on keyword-based search and recommendation techniques, which
limit users to specific keywords and combinations of keywords. This paper
proposes an approach that generates synthetic datasets by modeling real-world
user interactions, creating complex chat-style data reflective of diverse
preferences. This allows users to express more information with complex
preferences, such as mood, plot details, and thematic elements, in addition to
conventional criteria like genre, title, and actor-based searches. In today’s
search space, users cannot write queries like “Looking for a fantasy movie
featuring dire wolves, ideally set in a harsh frozen world with themes of
loyalty and survival.”
Building on these contributions, we evaluate synthetic datasets for diversity
and effectiveness in training and benchmarking models, particularly in areas
often absent from traditional datasets. This approach enhances personalization
and accuracy by enabling expressive and natural user queries. It establishes a
foundation for the next generation of conversational AI-driven search and
recommendation systems in digital entertainment.
[COMMENTS]
Accepted and presented at IEEE CAI 2025. This version includes minor
clarifications and formatting updates
[LINK]
http://arxiv.org/abs/2505.06841v1
[DATE]
2025-05-11 12:53:34+08:00
[CATEGORIES]
cs.LG
Low-Rank Matrix Approximation for Neural Network Compression
[AUTHORS]
Kalyan Cherukuri, Aarav Lala
[ABSTRACT]
Deep Neural Networks (DNNs) have encountered an emerging deployment challenge
due to large and expensive memory and computation requirements. In this paper,
we present a new Adaptive-Rank Singular Value Decomposition (ARSVD) method that
approximates the optimal rank for compressing weight matrices in neural
networks using spectral entropy. Unlike conventional SVD-based methods that
apply a fixed-rank truncation across all layers, ARSVD uses an adaptive
selection of the rank per layer through the entropy distribution of its
singular values. This approach ensures that each layer will retain a certain
amount of its informational content, thereby reducing redundancy. Our method
enables efficient, layer-wise compression, yielding improved performance with
reduced space and time complexity compared to static-rank reduction techniques.
[LINK]
http://arxiv.org/abs/2504.20078v2
[DATE]
2025-05-11 12:52:45+08:00
[CATEGORIES]
cs.LG
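The abstract above describes choosing a per-layer rank from the entropy of the singular values. The exact ARSVD rule is not given there, so the sketch below uses one plausible reading as an assumption: keep the smallest rank whose cumulative spectral-entropy contribution reaches a fraction `keep` of the total. The names `entropy_rank`, `compress`, and `keep` are hypothetical.

```python
import numpy as np

def entropy_rank(singular_values, keep=0.9):
    """Smallest rank whose cumulative spectral entropy reaches `keep` of the total.

    Assumed rule for illustration; the paper's exact criterion may differ.
    """
    p = singular_values / singular_values.sum()   # normalized spectrum
    contrib = -p * np.log(p + 1e-12)              # per-singular-value entropy terms
    cum = np.cumsum(contrib)
    k = int(np.searchsorted(cum, keep * cum[-1]) + 1)
    return min(k, len(singular_values))

def compress(W, keep=0.9):
    """Factor W into A @ B with an entropy-selected rank k."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    k = entropy_rank(s, keep)
    A = U[:, :k] * s[:k]   # shape (rows, k), singular values folded into A
    B = Vt[:k, :]          # shape (k, cols)
    return A, B, k

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))   # hypothetical weight matrix of one layer
A, B, k = compress(W, keep=0.6)
print(k, A.shape, B.shape)
```

Storing A and B instead of W saves memory whenever k * (rows + cols) is smaller than rows * cols, and because the rank is chosen from each layer's own spectrum it can differ from layer to layer.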
The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts
[AUTHORS]
Enric Boix-Adsera, Philippe Rigollet
[ABSTRACT]
Mixture-of-Experts (MoE) layers are increasingly central to frontier model
architectures. By selectively activating parameters, they reduce computational
cost while scaling total parameter count. This paper investigates the impact of
the number of active experts, termed granularity, comparing architectures with
many (e.g., 8 per layer in DeepSeek) to those with fewer (e.g., 1 per layer in
Llama-4 models). We prove an exponential separation in network expressivity
based on this design parameter, suggesting that models benefit from higher
granularity. Experimental results corroborate our theoretical findings and
illustrate this separation.
[LINK]
http://arxiv.org/abs/2505.06839v1
[DATE]
2025-05-11 12:35:40+08:00
[CATEGORIES]
cs.LG
The Geometry of Self-Verification in a Task-Specific Reasoning Model
[AUTHORS]
Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, Martin Wattenberg
[ABSTRACT]
How do reasoning models verify their own answers? We study this question by
training a model using DeepSeek R1’s recipe on the CountDown task. We leverage
the fact that preference tuning leads to mode collapse, yielding a model that
always produces highly structured chain-of-thought sequences. With this setup,
we do top-down and bottom-up analyses to reverse-engineer how the model
verifies its outputs. Top-down, we find Gated Linear Unit (GLU) weights
encoding verification-related tokens, such as “success” or “incorrect”.
Bottom-up, we find that “previous-token heads” are mainly responsible for
self-verification in our setup. Our analyses meet in the middle: drawing
inspiration from inter-layer communication channels, we use the identified GLU
weights to localize as few as three attention heads that can disable
self-verification, pointing to a necessary component of a potentially larger
verification circuit. Finally, we verify that similar verification components
exist in our base model and a general reasoning DeepSeek-R1 model.
[LINK]
http://arxiv.org/abs/2504.14379v2
[DATE]
2025-05-11 12:15:06+08:00
[CATEGORIES]
cs.LG
Streaming Sliced Optimal Transport
[AUTHORS]
Khai Nguyen
[ABSTRACT]
Sliced optimal transport (SOT) or sliced Wasserstein (SW) distance is widely
recognized for its statistical and computational scalability. In this work, we
further enhance the computational scalability by proposing the first method for
computing SW from sample streams, called streaming sliced Wasserstein
(Stream-SW). To define Stream-SW, we first introduce the streaming computation
of the one-dimensional Wasserstein distance. Since the one-dimensional
Wasserstein (1DW) distance has a closed-form expression, given by the absolute
difference between the quantile functions of the compared distributions, we
leverage quantile approximation techniques for sample streams to define the
streaming 1DW distance. By applying streaming 1DW to all projections, we obtain
Stream-SW. The key advantage of Stream-SW is its low memory complexity while
providing theoretical guarantees on the approximation error. We demonstrate
that Stream-SW achieves a more accurate approximation of SW than random
subsampling, with lower memory consumption, in comparing Gaussian distributions
and mixtures of Gaussians from streaming samples. Additionally, we conduct
experiments on point cloud classification, point cloud gradient flows, and
streaming change point detection to further highlight the favorable performance
of Stream-SW.
[COMMENTS]
28 pages, 9 figures, 3 tables
[LINK]
http://arxiv.org/abs/2505.06835v1
[DATE]
2025-05-11 12:09:24+08:00
[CATEGORIES]
cs.LG
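The closed-form one-dimensional Wasserstein distance mentioned in the abstract above can be sketched as follows, together with the usual Monte Carlo sliced Wasserstein estimate built on it. This is the ordinary batch computation with all samples held in memory, not the streaming Stream-SW method. The quantile grid size, the number of projections, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def wasserstein_1d(x, y, p=1, n_quantiles=100):
    """Approximate 1D Wasserstein-p distance from the difference of quantile functions."""
    # Midpoint grid on (0, 1) approximates the integral over quantile levels.
    qs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    qx = np.quantile(x, qs)
    qy = np.quantile(y, qs)
    return np.mean(np.abs(qx - qy) ** p) ** (1.0 / p)

def sliced_wasserstein(X, Y, n_projections=64, p=1, seed=0):
    """Monte Carlo sliced Wasserstein-p distance between two point sets of shape (n, d)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Random directions drawn uniformly on the unit sphere.
    thetas = rng.normal(size=(n_projections, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    vals = [wasserstein_1d(X @ t, Y @ t, p=p) ** p for t in thetas]
    return np.mean(vals) ** (1.0 / p)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = rng.normal(loc=1.0, size=(200, 3))
print(sliced_wasserstein(X, Y))
```

A streaming variant would replace the `np.quantile` calls, which need every sample at once, with quantile summaries that are updated as samples arrive; that replacement is what the abstract describes.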
Deep Learning Models for Flood Predictions in South Florida
[AUTHORS]
Jimeng Shi, Zeda Yin, Rukmangadh Myana, Khandker Ishtiaq, Anupama John, Jayantha Obeysekera, Arturo Leon, Giri Narasimhan
[ABSTRACT]
Simulating and predicting the water level/stage in river systems is essential
for flood warnings, hydraulic operations, and flood mitigations. Physics-based
detailed hydrological and hydraulic computational tools, such as HEC-RAS, MIKE,
and SWMM, can be used to simulate a complete watershed and compute the water
stage at any point in the river system. However, these physics-based models are
computationally intensive, especially for large watersheds and for longer
simulations, since they use detailed grid representations of terrain elevation
maps of the entire watershed and solve complex partial differential equations
(PDEs) for each grid cell. To overcome this problem, we train several deep
learning (DL) models for use as surrogate models to rapidly predict the water
stage. A portion of the Miami River in South Florida was chosen as a case study
for this paper. Extensive experiments show that the performance of various DL
models (MLP, RNN, CNN, LSTM, and RCNN) is significantly better than that of the
physics-based model, HEC-RAS, even during extreme precipitation conditions
(i.e., tropical storms), and with speedups exceeding 500x. To predict the water
stages more accurately, our DL models use both measured variables of the river
system from the recent past and covariates for which predictions are typically
available for the near future.
[LINK]
http://arxiv.org/abs/2306.15907v5
[DATE]
2025-05-11 11:44:57+08:00
[CATEGORIES]
cs.LG
Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering
[AUTHORS]
Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, Qing Qu
[ABSTRACT]
Recent empirical studies have demonstrated that diffusion models can
effectively learn the image distribution and generate new samples. Remarkably,
these models can achieve this even with a small number of training samples
despite a large image dimension, circumventing the curse of dimensionality. In
this work, we provide theoretical insights into this phenomenon by leveraging
key empirical observations: (i) the low intrinsic dimensionality of image data,
(ii) a union of manifold structure of image data, and (iii) the low-rank
property of the denoising autoencoder in trained diffusion models. These
observations motivate us to assume the underlying data distribution of image
data as a mixture of low-rank Gaussians and to parameterize the denoising
autoencoder as a low-rank model according to the score function of the assumed
distribution. With these setups, we rigorously show that optimizing the
training loss of diffusion models is equivalent to solving the canonical
subspace clustering problem over the training samples. Based on this
equivalence, we further show that the minimal number of samples required to
learn the underlying distribution scales linearly with the intrinsic dimensions
under the above data and model assumptions. This insight sheds light on why
diffusion models can break the curse of dimensionality and exhibit the phase
transition in learning distributions. Moreover, we empirically establish a
correspondence between the subspaces and the semantic representations of image
data, facilitating image editing. We validate these results with experiments
on both simulated distributions and image datasets.
[COMMENTS]
39 pages, 8 figures, 2 tables
[LINK]
http://arxiv.org/abs/2409.02426v3
[DATE]
2025-05-11 11:27:54+08:00
[CATEGORIES]
cs.LG
Active Learning for Multi-class Image Classification
[AUTHORS]
Thien Nhan Vo
[ABSTRACT]
A principal bottleneck in image classification is the large number of
training examples needed to train a classifier. Using active learning, we can
reduce the number of training examples to teach a CNN classifier by
strategically selecting examples. Assigning values to image examples using
different uncertainty metrics allows the model to identify and select
high-value examples in a smaller training set size. We demonstrate results for
digit recognition and fruit classification on the MNIST and Fruits360 data
sets. We formally compare results for four different uncertainty metrics.
Finally, we observe active learning is also effective on simpler (binary)
classification tasks, but marked improvement from random sampling is more
evident on more difficult tasks. We show active learning is a viable algorithm
for image classification problems.
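
The abstract does not name its four uncertainty metrics, so the following is only a minimal sketch of uncertainty-based selection using three commonly used scores (least confidence, margin, and entropy). The function names, the toy probabilities, and the choice of metrics are illustrative assumptions, not the author's implementation.

```python
import numpy as np

def least_confidence(probs):
    # 1 - max class probability: larger means the model is less sure.
    return 1.0 - probs.max(axis=1)

def margin_score(probs):
    # 1 - (top-1 minus top-2 probability): larger means a closer call.
    ordered = np.sort(probs, axis=1)
    return 1.0 - (ordered[:, -1] - ordered[:, -2])

def entropy_score(probs):
    # Shannon entropy of the predicted class distribution.
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def select_most_uncertain(probs, k, score_fn=entropy_score):
    # Indices of the k unlabeled examples with the highest uncertainty score.
    scores = score_fn(probs)
    return np.argsort(-scores, kind="stable")[:k]

# Hypothetical softmax outputs of a classifier on three unlabeled images.
probs = np.array([
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
])
picked = select_most_uncertain(probs, k=1)
```

In an active learning loop, the selected examples would be labeled, added to the training set, and the classifier retrained before the next round of scoring.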
[LINK]
http://arxiv.org/abs/2505.06825v1
[DATE]
2025-05-11 11:25:09+08:00
[CATEGORIES]
cs.LG
Sparse Ellipsoidal Radial Basis Function Network for Point Cloud Surface Representation
[AUTHORS]
Bobo Lian, Dandan Wang, Chenjian Wu, Minxin Chen
[ABSTRACT]
Point cloud surface representation is a fundamental problem in computer
graphics and vision. This paper presents a machine learning approach for
approximating the signed distance function (SDF) of a point cloud using a
sparse ellipsoidal radial basis function network, enabling a compact and
accurate surface representation. Given the SDF values defined on the grid
points constructed from the point cloud, our method approximates the SDF
accurately with as few ellipsoidal radial basis functions (ERBFs) as possible,
i.e., represents the SDF of a point cloud by sparse ERBFs. To balance sparsity
and approximation precision, a dynamic multi-objective optimization strategy is
introduced, which adaptively adds the regularization terms and jointly
optimizes the weights, centers, shapes, and orientations of ERBFs. To improve
computational efficiency, a nearest-neighbor-based data structure is employed,
restricting function calculations to points near each Gaussian kernel center.
The computations for each kernel are further parallelized on CUDA, which
significantly improves the optimization speed. Additionally, a hierarchical
octree-based refinement strategy is designed for training. Specifically, the
initialization and optimization of network parameters are conducted using
coarse grid points in the octree lattice structure. Subsequently, fine lattice
points are progressively incorporated to accelerate model convergence and
enhance training efficiency. Extensive experiments on multiple benchmark
datasets demonstrate that our method outperforms previous sparse representation
approaches in terms of accuracy, robustness, and computational efficiency. The
corresponding executable program is publicly available at
https://github.com/lianbobo/SE-RBFNet.git.
[LINK]
http://arxiv.org/abs/2505.02350v2
[DATE]
2025-05-11 10:43:41+08:00
[CATEGORIES]
cs.LG
SymbolFit: Automatic Parametric Modeling with Symbolic Regression
[AUTHORS]
Ho Fung Tsoi, Dylan Rankin, Cecile Caillol, Miles Cranmer, Sridhara Dasu, Javier Duarte, Philip Harris, Elliot Lipeles, Vladimir Loncar
[ABSTRACT]
We introduce SymbolFit, a framework that automates parametric modeling by
using symbolic regression to perform a machine-search for functions that fit
the data while simultaneously providing uncertainty estimates in a single run.
Traditionally, constructing a parametric model to accurately describe binned
data has been a manual and iterative process, requiring an adequate functional
form to be determined before the fit can be performed. The main challenge
arises when the appropriate functional forms cannot be derived from first
principles, especially when there is no underlying true closed-form function
for the distribution. In this work, we develop a framework that automates and
streamlines the process by utilizing symbolic regression, a machine learning
technique that explores a vast space of candidate functions without requiring a
predefined functional form because the functional form itself is treated as a
trainable parameter, making the process far more efficient and effortless than
traditional regression methods. We demonstrate the framework in high-energy
physics experiments at the CERN Large Hadron Collider (LHC) using five real
proton-proton collision datasets from new physics searches, including
background modeling in resonance searches for high-mass dijet, trijet,
paired-dijet, diphoton, and dimuon events. We show that our framework can
flexibly and efficiently generate a wide range of candidate functions that fit
a nontrivial distribution well using a simple fit configuration that varies
only by random seed, and that the same fit configuration, which defines a vast
function space, can also be applied to distributions of different shapes,
whereas achieving a comparable result with traditional methods would have
required extensive manual effort.
[COMMENTS]
52 pages, 35 figures. Under review. The API can be used
out-of-the-box and is available at https://github.com/hftsoi/symbolfit
[LINK]
http://arxiv.org/abs/2411.09851v4
[DATE]
2025-05-11 10:19:58+08:00
[CATEGORIES]
cs.LG
Partial Answer of How Transformers Learn Automata
[AUTHORS]
Tiantian Zhang
[ABSTRACT]
We introduce a novel framework for simulating finite automata using
representation-theoretic semidirect products and Fourier modules, achieving
more efficient Transformer-based implementations.
[LINK]
http://arxiv.org/abs/2504.20395v2
[DATE]
2025-05-11 10:19:44+08:00
[CATEGORIES]
cs.LG
Transformers Handle Endogeneity in In-Context Linear Regression
[AUTHORS]
Haodong Liang, Krishnakumar Balasubramanian, Lifeng Lai
[ABSTRACT]
We explore the capability of transformers to address endogeneity in
in-context linear regression. Our main finding is that transformers inherently
possess a mechanism to handle endogeneity effectively using instrumental
variables (IV). First, we demonstrate that the transformer architecture can
emulate a gradient-based bi-level optimization procedure that converges to the
widely used two-stage least squares $(\textsf{2SLS})$ solution at an
exponential rate. Next, we propose an in-context pretraining scheme and provide
theoretical guarantees showing that the global minimizer of the pre-training
loss achieves a small excess loss. Our extensive experiments validate these
theoretical findings, showing that the trained transformer provides more robust
and reliable in-context predictions and coefficient estimates than the
$\textsf{2SLS}$ method, in the presence of endogeneity.
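
For readers unfamiliar with the baseline, here is a minimal NumPy sketch of classical two-stage least squares on synthetic data. It illustrates only the 2SLS estimator the abstract compares against, not the transformer construction; the data-generating process and all variable names are assumptions made for this example.

```python
import numpy as np

def two_stage_least_squares(Z, X, y):
    # Stage 1: regress the endogenous regressors X on the instruments Z.
    gamma, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ gamma
    # Stage 2: regress the outcome y on the fitted values from stage 1.
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 5000
Z = rng.normal(size=(n, 2))            # instruments
u = rng.normal(size=n)                 # unobserved confounder
# X depends on the confounder u, so it is endogenous.
X = Z @ np.array([[1.0], [0.5]]) + u[:, None] + 0.1 * rng.normal(size=(n, 1))
beta_true = np.array([2.0])
y = X @ beta_true + u + 0.1 * rng.normal(size=n)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # biased by endogeneity
beta_2sls = two_stage_least_squares(Z, X, y)       # uses the instruments
```

In this setup ordinary least squares absorbs part of the confounder's effect into its coefficient, while the instrumented estimate stays close to the true value.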
[COMMENTS]
37 pages, 8 figures
[LINK]
http://arxiv.org/abs/2410.01265v3
[DATE]
2025-05-11 10:08:45+08:00
[CATEGORIES]
cs.LG
Practical Efficiency of Muon for Pretraining
[AUTHORS]
Essential AI, Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani
[ABSTRACT]
We demonstrate that Muon, the simplest instantiation of a second-order
optimizer, explicitly expands the Pareto frontier over AdamW on the
compute-time tradeoff. We find that Muon is more effective than AdamW in
retaining data efficiency at large batch sizes, far beyond the so-called
critical batch size, while remaining computationally efficient, thus enabling
more economical training. We study the combination of Muon and the maximal
update parameterization (muP) for efficient hyperparameter transfer and present
a simple telescoping algorithm that accounts for all sources of error in muP
while introducing only a modest overhead in resources. We validate our findings
through extensive experiments with model sizes up to four billion parameters
and ablations on the data distribution and architecture.
[LINK]
http://arxiv.org/abs/2505.02222v3
[DATE]
2025-05-11 09:55:11+08:00
[CATEGORIES]
cs.LG
Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs
[AUTHORS]
Swetha Ganesh, Washim Uddin Mondal, Vaneet Aggarwal
[ABSTRACT]
We present two Policy Gradient-based algorithms with general parametrization
in the context of infinite-horizon average reward Markov Decision Process
(MDP). The first one employs Implicit Gradient Transport for variance
reduction, ensuring an expected regret of the order
$\tilde{\mathcal{O}}(T^{2/3})$. The second approach, rooted in Hessian-based
techniques, ensures an expected regret of the order
$\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve the
state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret and achieve the
theoretical lower bound. We also show that the average-reward function is
approximately $L$-smooth, a result that was previously assumed in earlier
works.
[COMMENTS]
In the Proceedings of the 28th International Conference on Artificial
Intelligence and Statistics (AISTATS), 2025
[LINK]
http://arxiv.org/abs/2404.02108v2
[DATE]
2025-05-11 09:27:57+08:00
[CATEGORIES]
cs.LG
A stochastic gradient method for trilevel optimization
[AUTHORS]
Tommaso Giovannelli, Griffin Dean Kent, Luis Nunes Vicente
[ABSTRACT]
With the success that the field of bilevel optimization has seen in recent
years, similar methodologies have started being applied to solving more
difficult applications that arise in trilevel optimization. At the helm of
these applications are new machine learning formulations that have been
proposed in the trilevel context and, as a result, efficient and theoretically
sound stochastic methods are required. In this work, we propose the first-ever
stochastic gradient descent method for solving unconstrained trilevel
optimization problems and provide a convergence theory that covers all forms of
inexactness of the trilevel adjoint gradient, such as the inexact solutions of
the middle-level and lower-level problems, inexact computation of the trilevel
adjoint formula, and noisy estimates of the gradients, Hessians, Jacobians, and
tensors of third-order derivatives involved. We also demonstrate the promise of
our approach by providing numerical results on both synthetic trilevel problems
and trilevel formulations for hyperparameter adversarial tuning.
[LINK]
http://arxiv.org/abs/2505.06805v1
[DATE]
2025-05-11 09:05:29+08:00
[CATEGORIES]
cs.LG
Topology Guidance: Controlling the Outputs of Generative Models via Vector Field Topology
[AUTHORS]
Xiaohan Wang, Matthew Berger
[ABSTRACT]
For domains that involve numerical simulation, it can be computationally
expensive to run an ensemble of simulations spanning a parameter space of
interest to a user. To this end, an attractive surrogate for simulation is the
generative modeling of fields produced by an ensemble, allowing one to
synthesize fields in a computationally cheap, yet accurate, manner. However,
for the purposes of visual analysis, a limitation of generative models is their
lack of control, as it is unclear what one should expect when sampling a field
from a model. In this paper we study how to make generative models of fields
more controllable, so that users can specify features of interest, in
particular topological features, that they wish to see in the output. We
propose topology guidance, a method for guiding the sampling process of a
generative model, specifically a diffusion model, such that a topological
description specified as input is satisfied in the generated output. Central to
our method, we couple a coordinate-based neural network used to represent
fields, with a diffusion model used for generation. We show how to use
topologically-relevant signals provided by the coordinate-based network to help
guide the denoising process of a diffusion model. This enables us to faithfully
represent a user’s specified topology, while ensuring that the output field
remains within the generative data distribution. Specifically, we study 2D
vector field topology, evaluating our method over an ensemble of fluid flows,
where we show that generated vector fields faithfully adhere to the location,
and type, of critical points over the spatial domain. We further show the
benefits of our method in aiding the comparison of ensembles, allowing one to
explore commonalities and differences in distributions along prescribed
topological features.
[LINK]
http://arxiv.org/abs/2505.06804v1
[DATE]
2025-05-11 09:02:01+08:00
[CATEGORIES]
cs.LG
Reverse-BSDE Monte Carlo
[AUTHORS]
Jairon H. N. Batista, Flávio B. Gonçalves, Yuri F. Saporito, Rodrigo S. Targino
[ABSTRACT]
Recently, there has been a growing interest in generative models based on
diffusions driven by the empirical robustness of these methods in generating
high-dimensional photorealistic images and the possibility of using the vast
existing toolbox of stochastic differential equations. This remarkable ability
may stem from their capacity to model and generate multimodal distributions. In
this work, we offer a novel perspective on the approach introduced in Song et
al. (2021), shifting the focus from a “learning” problem to a “sampling”
problem. To achieve this, we reformulate the equations governing
diffusion-based generative models as a Forward-Backward Stochastic Differential
Equation (FBSDE), which avoids the well-known issue of pre-estimating the
gradient of the log target density. The solution of this FBSDE is proved to be
unique using non-standard techniques. Additionally, we propose a numerical
solution to this problem, leveraging deep learning techniques. This
reformulation opens new pathways for sampling multidimensional distributions
with densities known up to a normalization constant, a problem frequently
encountered in Bayesian statistics.
[LINK]
http://arxiv.org/abs/2505.06800v1
[DATE]
2025-05-11 08:42:07+08:00
[CATEGORIES]
cs.LG
Discrete distributions are learnable from metastable samples
[AUTHORS]
Abhijith Jayakumar, Andrey Y. Lokhov, Sidhant Misra, Marc Vuffray
[ABSTRACT]
Physically motivated stochastic dynamics are often used to sample from
high-dimensional distributions. However, such dynamics often get stuck in
specific regions of their state space and mix very slowly to the desired
stationary state. This causes such systems to approximately sample from a
metastable distribution which is usually quite different from the desired,
stationary distribution of the dynamic. We rigorously show that, in the case of
multi-variable discrete distributions, the true model describing the stationary
distribution can be recovered from samples produced from a metastable
distribution under minimal assumptions about the system. This follows from a
fundamental observation that the single-variable conditionals of metastable
distributions that satisfy a strong metastability condition are on average
close to those of the stationary distribution. This holds even when the
metastable distribution differs considerably from the true model in terms of
global metrics like Kullback-Leibler divergence or total variation distance.
This property allows us to learn the true model using a conditional likelihood
based estimator, even when the samples come from a metastable distribution
concentrated in a small region of the state space. Explicit examples of such
metastable states can be constructed from regions that effectively bottleneck
the probability flow and cause poor mixing of the Markov chain. For specific
cases of binary pairwise undirected graphical models (i.e. Ising models), we
extend our results to further rigorously show that data coming from metastable
states can be used to learn the parameters of the energy function and recover
the structure of the model.
[COMMENTS]
Submitted version, 31 pages
[LINK]
http://arxiv.org/abs/2410.13800v3
[DATE]
2025-05-11 08:06:32+08:00
[CATEGORIES]
cs.LG
Effective Regularization Through Loss-Function Metalearning
[AUTHORS]
Santiago Gonzalez, Risto Miikkulainen
[ABSTRACT]
Evolutionary computation can be used to optimize several different aspects of
neural network architectures. For instance, the TaylorGLO method discovers
novel, customized loss functions, resulting in improved performance, faster
training, and improved data utilization. A likely reason is that such functions
discourage overfitting, leading to effective regularization. This paper
demonstrates theoretically that this is indeed the case for TaylorGLO. Learning
rule decomposition reveals that evolved loss functions balance two factors: the
pull toward zero error, and a push away from it to avoid overfitting. This is a
general principle that may be used to understand other regularization
techniques as well (as demonstrated in this paper for label smoothing). The
theoretical analysis leads to a constraint that can be utilized to find more
effective loss functions in practice; the mechanism also results in networks
that are more robust (as demonstrated in this paper with adversarial inputs).
The analysis in this paper thus constitutes a first step towards understanding
regularization, and demonstrates the power of evolutionary neural architecture
search in general.
[LINK]
http://arxiv.org/abs/2010.00788v3
[DATE]
2025-05-11 07:50:24+08:00
[CATEGORIES]
cs.LG
Quantum RNNs and LSTMs Through Entangling and Disentangling Power of Unitary Transformations
[AUTHORS]
Ammar Daskin
[ABSTRACT]
In this paper, we discuss how quantum recurrent neural networks (RNNs) and
their enhanced version, long short-term memory (LSTM) networks, can be modeled
using the core ideas presented in Ref.[1], where the entangling and
disentangling power of unitary transformations is investigated. In particular,
we interpret entangling and disentangling power as information retention and
forgetting mechanisms in LSTMs. Therefore, entanglement becomes a key component
of the optimization (training) process. We believe that, by leveraging prior
knowledge of the entangling power of unitaries, the proposed quantum-classical
framework can guide and help to design better-parameterized quantum circuits
for various real-world applications.
[COMMENTS]
The simulation code can be downloaded from
https://github.com/adaskin/quantum-lstm
[LINK]
http://arxiv.org/abs/2505.06774v1
[DATE]
2025-05-11 06:56:18+08:00
[CATEGORIES]
cs.LG
Multi-modal Synthetic Data Training and Model Collapse: Insights from VLMs and Diffusion Models
[AUTHORS]
Zizhao Hu, Mohammad Rostami, Jesse Thomason
[ABSTRACT]
Recent research has highlighted the risk of generative model collapse, where
performance progressively degrades when continually trained on self-generated
data. However, existing exploration of model collapse is limited to single,
unimodal models, limiting our understanding in more realistic scenarios, such
as diverse multi-modal AI agents interacting autonomously through synthetic
data and continually evolving. We expand the synthetic data training and model
collapse study to multi-modal vision-language generative systems, such as
vision-language models (VLMs) and text-to-image diffusion models, as well as
recursive generate-train loops with multiple models. We find that model
collapse, previously observed in single-modality generative models, exhibits
distinct characteristics in the multi-modal context, such as improved
vision-language alignment and increased variance in the VLM image-captioning task.
Additionally, we find that general approaches such as increased decoding
budgets, greater model diversity, and relabeling with frozen models can
effectively mitigate model collapse. Our findings provide initial insights and
practical guidelines for reducing the risk of model collapse in self-improving
multi-agent AI systems and curating robust multi-modal synthetic datasets.
[LINK]
http://arxiv.org/abs/2505.08803v1
[DATE]
2025-05-11 06:42:29+08:00
[CATEGORIES]
cs.LG
Investigating Robotaxi Crash Severity Using Geographical Random Forest
[AUTHORS]
Junfeng Jiao, Seung Gyu Baik, Seung Jun Choi, Yiming Xu
[ABSTRACT]
This paper quantitatively investigates the crash severity of Autonomous
Vehicles (AVs) with spatially localized machine learning and macroscopic
measures of the urban built environment. We address spatial heterogeneity and
spatial autocorrelation, while focusing on land use patterns and human
behavior. Our Geographical Random Forest (GRF) model, accompanied by a crash
severity risk map of San Francisco, presents three findings that are useful for
commercial operations of AVs and robotaxis. First, spatially localized machine
learning performed better than regular machine learning when predicting AV
crash severity. A bias-variance tradeoff was evident as we adjusted the
localization weight hyperparameter. Second, land use was the most important
built environment measure, compared to intersections, building footprints,
public transit stops, and Points of Interest (POIs). Third, it was predicted
that city center areas with greater diversity and commercial activities were
more likely to result in low-severity AV crashes than residential
neighborhoods. Residential land use may be associated with higher severity due
to human behavior and a less restrictive environment. This paper recommends
explicitly considering geographic locations and designing safety measures
specific to residential neighborhoods when robotaxi operators train their AV
systems.
[COMMENTS]
21 pages, 8 figures
[LINK]
http://arxiv.org/abs/2505.06762v1
[DATE]
2025-05-11 05:47:01+08:00
[CATEGORIES]
cs.LG
Privacy-aware Berrut Approximated Coded Computing applied to general distributed learning
[AUTHORS]
Xavier Martínez-Luaña, Manuel Fernández-Veiga, Rebeca P. Díaz-Redondo, Ana Fernández-Vilas
[ABSTRACT]
Coded computing is one of the techniques that can be used for privacy
protection in Federated Learning. However, most of the constructions used for
coded computing work only under the assumption that the computations involved
are exact, generally restricted to special classes of functions, and require
quantized inputs. This paper considers the use of Private Berrut Approximate
Coded Computing (PBACC) as a general solution to add strong but non-perfect
privacy to federated learning. We derive new adapted PBACC algorithms for
centralized aggregation, secure distributed training with centralized data, and
secure decentralized training with decentralized data, thus enlarging
significantly the applications of the method and the existing privacy
protection tools available for these paradigms. Particularly, PBACC can be used
robustly to attain privacy guarantees in decentralized federated learning for a
variety of models. Our numerical results show that the achievable quality of
different learning models (convolutional neural networks, variational
autoencoders, and Cox regression) is minimally altered by using these new
computing schemes, and that the privacy leakage can be bounded strictly to less
than a fraction of one bit per participant. Additionally, the computational
cost of the encoding and decoding processes depends only on the degree of
decentralization of the data.
[LINK]
http://arxiv.org/abs/2505.06759v1
[DATE]
2025-05-11 05:27:40+08:00
[CATEGORIES]
cs.LG
Towards Optimal Branching of Linear and Semidefinite Relaxations for Neural Network Robustness Certification
[AUTHORS]
Brendon G. Anderson, Ziye Ma, Jingqi Li, Somayeh Sojoudi
[ABSTRACT]
In this paper, we study certifying the robustness of ReLU neural networks
against adversarial input perturbations. To diminish the relaxation error
suffered by the popular linear programming (LP) and semidefinite programming
(SDP) certification methods, we take a branch-and-bound approach to propose
partitioning the input uncertainty set and solving the relaxations on each part
separately. We show that this approach reduces relaxation error, and that the
error is eliminated entirely upon performing an LP relaxation with a partition
intelligently designed to exploit the nature of the ReLU activations. To scale
this approach to large networks, we consider using a coarser partition whereby
the number of parts in the partition is reduced. We prove that computing such a
coarse partition that directly minimizes the LP relaxation error is NP-hard. By
instead minimizing the worst-case LP relaxation error, we develop a closed-form
branching scheme in the single-hidden layer case. We extend the analysis to the
SDP, where the feasible set geometry is exploited to design a branching scheme
that minimizes the worst-case SDP relaxation error. Experiments on MNIST,
CIFAR-10, and Wisconsin breast cancer diagnosis classifiers demonstrate
significant increases in the percentages of test samples certified. By
independently increasing the input size and the number of layers, we
empirically illustrate under which regimes the branched LP and branched SDP are
best applied. Finally, we extend our LP branching method into a multi-layer
branching heuristic, which attains comparable performance to prior
state-of-the-art heuristics on large-scale, deep neural network certification
benchmarks.
[COMMENTS]
Accepted for publication in the Journal of Machine Learning Research
(JMLR). This is an extension of our IEEE CDC 2020 conference paper
arXiv:2004.00570
[LINK]
http://arxiv.org/abs/2101.09306v4
[DATE]
2025-05-11 05:15:51+08:00
[CATEGORIES]
cs.LG
Out-of-Sample Embedding with Proximity Data: Projection versus Restricted Reconstruction
[AUTHORS]
Michael W. Trosset, Kaiyi Tan, Minh Tang, Carey E. Priebe
[ABSTRACT]
The problem of using proximity (similarity or dissimilarity) data for the
purpose of “adding a point to a vector diagram” was first studied by J.C. Gower
in 1968. Since then, a number of methods – mostly kernel methods – have been
proposed for solving what has come to be called the problem of out-of-sample
embedding. We survey the various kernel methods that we have encountered and
show that each can be derived from one or the other of two competing
strategies: projection or restricted reconstruction. Projection can be
analogized to a well-known formula for adding a point to a principal component
analysis. Restricted reconstruction poses a different challenge: how to best
approximate redoing the entire multivariate analysis while holding fixed the
vector diagram that was previously obtained. This strategy results in a
nonlinear optimization problem that can be simplified to a unidimensional
search. Various circumstances may warrant either projection or restricted
reconstruction.
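
The "well-known formula for adding a point to a principal component analysis" mentioned above can be sketched as follows: center the new point with the training mean and project it onto the retained principal directions. This covers only the plain PCA analogy, not the kernel methods for proximity data that the paper surveys; the function names and the random data are assumptions for illustration.

```python
import numpy as np

def fit_pca(X, d):
    # Principal directions from the SVD of the centered data matrix.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:d].T              # components has shape (features, d)

def project_new_point(x_new, mean, components):
    # Out-of-sample embedding by projection: the existing diagram is not refit.
    return (x_new - mean) @ components

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
mean, components = fit_pca(X, d=2)
scores = (X - mean) @ components       # in-sample embedding
embedded = project_new_point(X[0], mean, components)
```

Projecting a point that was already in the sample reproduces its original coordinates, which is a quick sanity check on the projection step.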
[COMMENTS]
19 pages, 2 figures
[LINK]
http://arxiv.org/abs/2505.06756v1
[DATE]
2025-05-11 05:11:30+08:00
[CATEGORIES]
cs.LG
Boltzmann Classifier: A Thermodynamic-Inspired Approach to Supervised Learning
[AUTHORS]
Muhamed Amin, Bernard R. Brooks
[ABSTRACT]
We propose a novel classification algorithm, the Boltzmann Classifier,
inspired by the thermodynamic principles underlying the Boltzmann distribution.
Our method computes a probabilistic estimate for each class based on an energy
function derived from feature-wise deviations between input samples and
class-specific centroids. The resulting probabilities are proportional to the
exponential negative energies, normalized across classes, analogous to the
Boltzmann distribution used in statistical mechanics. In addition, the KT
variable can be used to allow the high energy states to be more accessible,
which allows the tuning of their probabilities as needed. We evaluate the model
performance on several datasets from different applications. The model achieves
a high accuracy, which indicates that the Boltzmann Classifier is competitive
with standard models like logistic regression and k-nearest neighbors while
offering a thermodynamically motivated probabilistic interpretation. Our
classifier does not require iterative optimization or backpropagation and is
thus computationally efficient and easy to integrate into existing workflows.
This work demonstrates how ideas from physics can inform new directions in
machine learning, providing a foundation for interpretable, energy-based
decision-making systems.
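
As a rough illustration of the idea, the sketch below scores each class by a Boltzmann weight exp(-E / kT) normalized across classes. The abstract does not give the exact energy function, so the sum of absolute feature-wise deviations from each class centroid is an assumption here, as are the function names and the toy data.

```python
import numpy as np

def fit_centroids(X, y):
    # One centroid per class: the mean feature vector of its training samples.
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_proba(X, centroids, kT=1.0):
    # Assumed energy: sum of absolute feature-wise deviations from each centroid.
    energy = np.abs(X[:, None, :] - centroids[None, :, :]).sum(axis=2)
    logits = -energy / kT
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])
classes, centroids = fit_centroids(X_train, y_train)

query = np.array([[0.1, 0.0]])
probs = predict_proba(query, centroids, kT=1.0)
pred = classes[probs.argmax(axis=1)]
probs_hot = predict_proba(query, centroids, kT=10.0)  # higher kT flattens probabilities
```

No iterative optimization is involved: fitting reduces to computing class means, and raising kT makes the high-energy (more distant) classes more probable.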
[LINK]
http://arxiv.org/abs/2505.06753v1
[DATE]
2025-05-11 04:54:50+08:00
[CATEGORIES]
cs.LG
LineFlow: A Framework to Learn Active Control of Production Lines
[AUTHORS]
Kai Müller, Martin Wenzel, Tobias Windisch
[ABSTRACT]
Many production lines require active control mechanisms, such as adaptive
routing, worker reallocation, and rescheduling, to maintain optimal
performance. However, designing these control systems is challenging for
various reasons, and while reinforcement learning (RL) has shown promise in
addressing these challenges, a standardized and general framework is still
lacking. In this work, we introduce LineFlow, an extensible, open-source Python
framework for simulating production lines of arbitrary complexity and training
RL agents to control them. To demonstrate the capabilities and to validate the
underlying theoretical assumptions of LineFlow, we formulate core subproblems
of active line control in ways that facilitate mathematical analysis. For each
problem, we provide optimal solutions for comparison. We benchmark
state-of-the-art RL algorithms and show that the learned policies approach
optimal performance in well-understood scenarios. However, for more complex,
industrial-scale production lines, RL still faces significant challenges,
highlighting the need for further research in areas such as reward shaping,
curriculum learning, and hierarchical control.
[COMMENTS]
Accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2505.06744v1
[DATE]
2025-05-11 03:36:18+08:00
[CATEGORIES]
cs.LG
Towards One Model for Classical Dimensionality Reduction: A Probabilistic Perspective on UMAP and t-SNE
[AUTHORS]
Aditya Ravuri, Neil D. Lawrence
[ABSTRACT]
This paper shows that dimensionality reduction methods such as UMAP and
t-SNE can be approximately recast as MAP inference methods corresponding to a
model introduced in Ravuri et al. (2023), that describes the graph Laplacian
(an estimate of the data precision matrix) using a Wishart distribution, with a
mean given by a non-linear covariance function evaluated on the latents. This
interpretation offers deeper theoretical and semantic insights into such
algorithms, and forging a connection to Gaussian process latent variable models
by showing that well-known kernels can be used to describe covariances implied
by graph Laplacians. We also introduce tools with which similar dimensionality
reduction methods can be studied.
[COMMENTS]
Updated figures
[LINK]
http://arxiv.org/abs/2405.17412v5
[DATE]
2025-05-11 03:36:12+08:00
[CATEGORIES]
cs.LG
Deeply Explainable Artificial Neural Network
[AUTHORS]
David Zucker
[ABSTRACT]
While deep learning models have demonstrated remarkable success in numerous
domains, their black-box nature remains a significant limitation, especially in
critical fields such as medical image analysis and inference. Existing
explainability methods, such as SHAP, LIME, and Grad-CAM, are typically applied
post hoc, adding computational overhead and sometimes producing inconsistent or
ambiguous results. In this paper, we present the Deeply Explainable Artificial
Neural Network (DxANN), a novel deep learning architecture that embeds
explainability ante hoc, directly into the training process. Unlike
conventional models that require external interpretation methods, DxANN is
designed to produce per-sample, per-feature explanations as part of the forward
pass. Built on a flow-based framework, it enables both accurate predictions and
transparent decision-making, and is particularly well-suited for image-based
tasks. While our focus is on medical imaging, the DxANN architecture is readily
adaptable to other data modalities, including tabular and sequential data.
DxANN marks a step forward toward intrinsically interpretable deep learning,
offering a practical solution for applications where trust and accountability
are essential.
[LINK]
http://arxiv.org/abs/2505.06731v1
[DATE]
2025-05-11 02:45:38+08:00
[CATEGORIES]
cs.LG
Activity and Subject Detection for UCI HAR Dataset with & without missing Sensor Data
[AUTHORS]
Debashish Saha, Piyush Malik, Adrika Saha
[ABSTRACT]
Current studies in Human Activity Recognition (HAR) primarily focus on the
classification of activities through sensor data, while there is not much
emphasis placed on recognizing the individuals performing these activities.
This type of classification is very important for developing personalized and
context-sensitive applications. Additionally, the issue of missing sensor data,
which often occurs in practical situations due to hardware malfunctions, has
not been explored yet. This paper seeks to fill these voids by introducing a
lightweight LSTM-based model that can be used to classify both activities and
subjects. The proposed model was used to classify the HAR dataset by UCI [1],
achieving an accuracy of 93.89% in activity recognition (across six
activities), nearing the 96.67% benchmark, and an accuracy of 80.19% in subject
recognition (involving 30 subjects), thereby establishing a new baseline for
this area of research. We then simulate the absence of sensor data to mirror
real-world scenarios and incorporate imputation techniques, both with and
without Principal Component Analysis (PCA), to restore incomplete datasets. We
found that K-Nearest Neighbors (KNN) imputation without PCA performs best for
filling in the missing sensor data, as the use of PCA resulted in slightly
lower accuracy. These results demonstrate how well the framework handles
missing sensor data, which is a major step forward in using the Human Activity
Recognition dataset for reliable classification tasks.
[LINK]
http://arxiv.org/abs/2505.06730v1
[DATE]
2025-05-11 02:43:00+08:00
[CATEGORIES]
cs.LG
Beyond $\tilde{O}(\sqrt{T})$ Constraint Violation for Online Convex Optimization with Adversarial Constraints
[AUTHORS]
Abhishek Sinha, Rahul Vaze
[ABSTRACT]
We revisit the Online Convex Optimization problem with adversarial
constraints (COCO) where, in each round, a learner is presented with a convex
cost function and a convex constraint function, both of which may be chosen
adversarially. The learner selects actions from a convex decision set in an
online fashion, with the goal of minimizing both regret and the cumulative
constraint violation (CCV) over a horizon of $T$ rounds. The best-known policy
for this problem achieves $O(\sqrt{T})$ regret and $\tilde{O}(\sqrt{T})$ CCV.
In this paper, we present a surprising improvement that achieves a
significantly smaller CCV by trading it off with regret. Specifically, for any
bounded convex cost and constraint functions, we propose an online policy that
achieves $\tilde{O}(\sqrt{dT}+ T^\beta)$ regret and $\tilde{O}(dT^{1-\beta})$
CCV, where $d$ is the dimension of the decision set and $\beta \in [0,1]$ is a
tunable parameter. We achieve this result by first considering the special case
of the $\textsf{Constrained Expert}$ problem, where the decision set is a
probability simplex and the cost and constraint functions are linear.
Leveraging a new adaptive small-loss regret bound, we propose an efficient
policy for the $\textsf{Constrained Expert}$ problem that attains
$O(\sqrt{T\ln N}+T^{\beta})$ regret and $\tilde{O}(T^{1-\beta} \ln N)$ CCV,
where $N$ is the number of experts. The original problem is then reduced to the
$\textsf{Constrained Expert}$ problem via a covering argument. Finally, with an
additional smoothness assumption, we propose an efficient gradient-based policy
attaining $O(T^{\max(\frac{1}{2},\beta)})$ regret and $\tilde{O}(T^{1-\beta})$
CCV.
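As an illustrative reading of the rates quoted above (a sketch only; the definitions of regret and CCV below are the standard ones and are assumed here, since the abstract does not spell them out), plugging two values of the tunable parameter into the general bounds shows the trade-off:

```latex
% Assumed standard definitions, with \mathcal{X}^\star the set of actions
% that satisfy every constraint g_t(x) \le 0:
\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x^\star \in \mathcal{X}^\star} \sum_{t=1}^{T} f_t(x^\star),
\qquad
\mathrm{CCV}_T = \sum_{t=1}^{T} \max\bigl(0,\, g_t(x_t)\bigr).

% Substituting \beta into \tilde{O}(\sqrt{dT} + T^{\beta}) and \tilde{O}(d T^{1-\beta}):
\beta = \tfrac{1}{2}: \quad
\mathrm{Regret}_T = \tilde{O}\bigl(\sqrt{dT} + \sqrt{T}\bigr) = \tilde{O}\bigl(\sqrt{dT}\bigr),
\quad \mathrm{CCV}_T = \tilde{O}\bigl(d\sqrt{T}\bigr);

\beta = 1: \quad
\mathrm{Regret}_T = \tilde{O}\bigl(\sqrt{dT} + T\bigr),
\quad \mathrm{CCV}_T = \tilde{O}(d).
```

Raising $\beta$ therefore shrinks the violation bound at the price of a larger regret bound; at $\beta = 1$ the regret bound is already linear in $T$.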
[LINK]
http://arxiv.org/abs/2505.06709v1
[DATE]
2025-05-11 01:23:10+08:00
[CATEGORIES]
cs.LG
RuleGenie: SIEM Detection Rule Set Optimization
[AUTHORS]
Akansha Shukla, Parth Atulbhai Gandhi, Yuval Elovici, Asaf Shabtai
[ABSTRACT]
SIEM systems serve as a critical hub, employing rule-based logic to detect
and respond to threats. Redundant or overlapping rules in SIEM systems lead to
excessive false alerts, degrading analyst performance due to alert fatigue, and
increase computational overhead and response latency for actual threats. As a
result, optimizing SIEM rule sets is essential for efficient operations.
Despite the importance of such optimization, research in this area is limited,
with current practices relying on manual optimization methods that are both
time-consuming and error-prone due to the scale and complexity of
enterprise-level rule sets. To address this gap, we present RuleGenie, a novel
large language model (LLM) aided recommender system designed to optimize SIEM
rule sets. Our approach leverages transformer models’ multi-head attention
capabilities to generate SIEM rule embeddings, which are then analyzed using a
similarity matching algorithm to identify the top-k most similar rules. The LLM
then processes the rules identified, utilizing its information extraction,
language understanding, and reasoning capabilities to analyze rule similarity,
evaluate threat coverage and performance metrics, and deliver optimized
recommendations for refining the rule set. By automating the rule optimization
process, RuleGenie allows security teams to focus on more strategic tasks while
enhancing the efficiency of SIEM systems and strengthening organizations’
security posture. We evaluated RuleGenie on a comprehensive set of real-world
SIEM rule formats, including Splunk, Sigma, and AQL (Ariel query language),
demonstrating its platform-agnostic capabilities and adaptability across
diverse security infrastructures. Our experimental results show that RuleGenie
can effectively identify redundant rules, which in turn decreases false
positive rates and enhances overall rule efficiency.
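The top-k similarity step described above can be sketched as follows. This is a minimal illustration, not RuleGenie's implementation: cosine similarity is assumed as the similarity measure, and the rule names and three-dimensional embeddings are hypothetical toy values standing in for transformer-derived embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(query_id, embeddings, k):
    """Return the ids of the k rules most similar to query_id, most similar first."""
    query = embeddings[query_id]
    scored = [
        (cosine(query, vec), rule_id)
        for rule_id, vec in embeddings.items()
        if rule_id != query_id  # a rule is not a redundancy candidate for itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rule_id for _, rule_id in scored[:k]]


# Hypothetical toy embeddings; real ones would come from an embedding model.
rule_embeddings = {
    "rule_a": [1.0, 0.0, 0.0],
    "rule_b": [0.9, 0.1, 0.0],
    "rule_c": [0.0, 1.0, 0.0],
    "rule_d": [0.7, 0.7, 0.0],
}

nearest = top_k_similar("rule_a", rule_embeddings, k=2)
print(nearest)
```

In a pipeline of the kind the abstract describes, the rules returned here would then be passed to the LLM for the redundancy analysis.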
[LINK]
http://arxiv.org/abs/2505.06701v1
[DATE]
2025-05-11 00:56:17+08:00
[CATEGORIES]
cs.LG
E2E-FANet: A Highly Generalizable Framework for Waves prediction Behind Floating Breakwaters via Exogenous-to-Endogenous Variable Attention
[AUTHORS]
Jianxin Zhang, Lianzi Jiang, Xinyu Han, Xiangrong Wang, Weinan Huang
[ABSTRACT]
Accurate prediction of waves behind floating breakwaters (FB) is crucial for
optimizing coastal engineering structures, enhancing safety, and improving
design efficiency. Existing methods demonstrate limitations in capturing
nonlinear interactions between waves and structures, while exhibiting
insufficient capability in modeling the complex frequency-domain relationships
among elevations of different wave gauges. To address these challenges, this
study introduces the Exogenous-to-Endogenous Frequency-Aware Network
(E2E-FANet), a novel end-to-end neural network designed to model relationships
between waves and structures. The E2E-FANet architecture incorporates a
Dual-Basis Frequency Mapping (DBFM) module that leverages orthogonal cosine and
sine bases to extract wave features from the frequency domain while preserving
temporal information. Additionally, we introduce the Exogenous-to-Endogenous
Cross-Attention (E2ECA) module, which employs cross attention to model the
interactions between endogenous and exogenous variables. We incorporate a
Temporal-wise Attention (TA) mechanism that adaptively captures complex
dependencies in endogenous variables. These integrated modules function
synergistically, enabling E2E-FANet to achieve both comprehensive feature
perception in the time-frequency domain and precise modeling of wave-structure
interactions. To comprehensively evaluate the performance of E2E-FANet, we
constructed a multi-level validation framework comprising three distinct
testing scenarios: internal validation under identical wave conditions,
generalization testing across different wave conditions, and adaptability
testing with varying relative water density (RW) conditions. These
comprehensive tests demonstrate that E2E-FANet accurately predicts waves behind
FBs while generalizing successfully across diverse wave conditions.
[LINK]
http://arxiv.org/abs/2505.06690v1
[DATE]
2025-05-11 00:28:48+08:00
[CATEGORIES]
cs.LG
A Novel Framework for Significant Wave Height Prediction based on Adaptive Feature Extraction Time-Frequency Network
[AUTHORS]
Jianxin Zhang, Lianzi Jiang, Xinyu Han, Xiangrong Wang
[ABSTRACT]
Precise forecasting of significant wave height (Hs) is essential for the
development and utilization of wave energy. The challenges in predicting Hs
arise from its non-linear and non-stationary characteristics. The combination
of decomposition preprocessing and machine learning models has demonstrated
significant effectiveness in Hs prediction by extracting data features.
However, decomposing the unknown data in the test set can lead to data leakage
issues. To simultaneously achieve data feature extraction and prevent data
leakage, a novel Adaptive Feature Extraction Time-Frequency Network (AFE-TFNet)
is proposed to improve prediction accuracy and stability. It is an encoder-decoder
rolling framework. The encoder consists of two stages: feature extraction and
feature fusion. In the feature extraction stage, global and local frequency
domain features are extracted by combining Wavelet Transform (WT) and Fourier
Transform (FT), and multi-scale frequency analysis is performed using Inception
blocks. In the feature fusion stage, time-domain and frequency-domain features
are integrated through dominant harmonic sequence energy weighting (DHSEW). The
decoder employs an advanced long short-term memory (LSTM) model. Hourly
measured wind speed (Ws), dominant wave period (DPD), average wave period (APD),
and Hs from three stations are used as the dataset, and four metrics are
employed to evaluate the forecasting performance. Results show that AFE-TFNet
significantly outperforms benchmark methods in terms of prediction accuracy.
Feature extraction can significantly improve the prediction accuracy. DHSEW has
substantially increased the accuracy of medium-term to long-term forecasting.
The prediction accuracy of AFE-TFNet does not demonstrate significant
variability with changes in the rolling time window size. Overall, AFE-TFNet shows
strong potential for handling complex signal forecasting.
[LINK]
http://arxiv.org/abs/2505.06688v1
[DATE]
2025-05-11 00:25:31+08:00
[CATEGORIES]
cs.LG
Enhancing Trust Management System for Connected Autonomous Vehicles Using Machine Learning Methods: A Survey
[AUTHORS]
Qian Xu, Lei Zhang, Yixiao Liu
[ABSTRACT]
Connected Autonomous Vehicles (CAVs) operate in dynamic, open, and
multi-domain networks, rendering them vulnerable to various threats. Trust
Management Systems (TMS) systematically organize essential steps in the trust
mechanism, identifying malicious nodes against internal threats and external
threats, as well as ensuring reliable decision-making for more cooperative
tasks. Recent advances in machine learning (ML) offer significant potential to
enhance TMS, especially for the strict requirements of CAVs, such as CAV nodes
moving at varying speeds, and opportunistic and intermittent network behavior.
Those features distinguish ML-based TMS from social networks, static IoT, and
Social IoT. This survey proposes a novel three-layer ML-based TMS framework for
CAVs in the vehicle-road-cloud integration system, i.e., trust data layer,
trust calculation layer and trust incentive layer. A six-dimensional taxonomy
of objectives is proposed. Furthermore, the principles of ML methods for each
module in each layer are analyzed. Then, recent studies are categorized by
traffic scenario and examined against the proposed objectives. Finally, future
directions are suggested, addressing the open issues and meeting the research
trend. We maintain an active repository that contains up-to-date literature and
open-source projects at
https://github.com/octoberzzzzz/ML-based-TMS-CAV-Survey.
[COMMENTS]
31 pages, 9 figures
[LINK]
http://arxiv.org/abs/2505.07882v1
[DATE]
2025-05-11 00:13:36+08:00
[CATEGORIES]
cs.LG
A Survey on Data-Driven Modeling of Human Drivers’ Lane-Changing Decisions
[AUTHORS]
Linxuan Huang, Dong-Fan Xie, Li Li, Zhengbing He
[ABSTRACT]
Lane-changing (LC) behavior, a critical yet complex driving maneuver,
significantly influences driving safety and traffic dynamics. Traditional
analytical LC decision (LCD) models, while effective in specific environments,
often oversimplify behavioral heterogeneity and complex interactions, limiting
their capacity to capture real LCD. Data-driven approaches address these gaps
by leveraging rich empirical data and machine learning to decode latent
decision-making patterns, enabling adaptive LCD modeling in dynamic
environments. In light of the rapid development of artificial intelligence and
the demand for data-driven models oriented towards connected vehicles and
autonomous vehicles, this paper presents a comprehensive survey of data-driven
LCD models, with a particular focus on human drivers’ LC decision-making. It
systematically reviews the modeling framework, covering data sources and
preprocessing, model inputs and outputs, objectives, structures, and validation
methods. This survey further discusses the opportunities and challenges faced
by data-driven LCD models, including driving safety, uncertainty, as well as
the integration and improvement of technical frameworks.
[LINK]
http://arxiv.org/abs/2505.06680v1
[DATE]
2025-05-11 00:09:03+08:00
[CATEGORIES]
cs.LG
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
[AUTHORS]
Vithursan Thangarasa, Ganesh Venkatesh, Mike Lasby, Nish Sinnadurai, Sean Lie
[ABSTRACT]
Large language models have driven significant progress in natural language
processing, but their deployment requires substantial compute and memory
resources. As models scale, compression techniques become essential for
balancing model quality with computational efficiency. Structured pruning,
which removes less critical components of the model, is a promising strategy
for reducing complexity. However, one-shot pruning often results in significant
quality degradation, particularly in tasks requiring multi-step reasoning. To
recover lost quality, supervised fine-tuning (SFT) is commonly applied, but it
can lead to catastrophic forgetting by shifting the model’s learned data
distribution. Therefore, addressing the degradation from both pruning and SFT
is essential to preserve the original model’s quality. In this work, we utilize
self-data distilled fine-tuning to address these challenges. Our approach
leverages the original, unpruned model to generate a distilled dataset that
preserves semantic richness and mitigates catastrophic forgetting by
maintaining alignment with the base model’s knowledge. Empirically, we
demonstrate that self-data distillation consistently outperforms standard SFT,
improving average accuracy by up to 8% on the HuggingFace OpenLLM Leaderboard
v1. Specifically, when pruning six decoder blocks on Llama3.1-8B Instruct
(i.e., 32 to 26 layers, reducing the model size from 8.03B to 6.72B
parameters), our method retains 91.2% of the original model’s accuracy compared
to 81.7% with SFT, while reducing real-world FLOPs by 16.3%. Furthermore,
combining self-data distilled models through model merging yields enhanced
quality retention. Additionally, leveraging these pruned models in speculative
decoding increases token acceptance rates, thereby improving inference
efficiency in applied settings.
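The pruning figures quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers in the abstract; the per-block estimate assumes that all removed parameters belong to the six pruned decoder blocks.

```python
# Figures quoted in the abstract for Llama3.1-8B Instruct.
layers_before, layers_after = 32, 26
params_before_b, params_after_b = 8.03, 6.72  # billions of parameters

removed_layers = layers_before - layers_after
removed_params_b = params_before_b - params_after_b

layer_reduction_pct = 100 * removed_layers / layers_before
param_reduction_pct = 100 * removed_params_b / params_before_b
# Assumption: every removed parameter sits in one of the pruned decoder blocks.
params_per_block_b = removed_params_b / removed_layers

print(f"layers removed:     {layer_reduction_pct:.2f}%")
print(f"parameters removed: {param_reduction_pct:.1f}%")
print(f"per pruned block:   {params_per_block_b:.3f}B")
```

The parameter reduction (about 16.3%, the same figure quoted for FLOPs) is smaller than the share of layers removed (18.75%), which is what one would expect if the embedding and output layers are left unpruned.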
[COMMENTS]
Accepted to MLSys 2025. Main paper: 14 pp., 4 figs., 6 tabs.;
Supplementary: 5 pp
[LINK]
http://arxiv.org/abs/2410.09982v4
[DATE]
2025-05-10 23:39:41+08:00
[CATEGORIES]
cs.LG
cs.CL
EffiLearner: Enhancing Efficiency of Generated Code via Self-Optimization
[AUTHORS]
Dong Huang, Jianbo Dai, Han Weng, Puzhen Wu, Yuhao Qing, Heming Cui, Zhijiang Guo, Jie M. Zhang
[ABSTRACT]
Large language models (LLMs) have shown remarkable progress in code
generation, but their generated code often suffers from inefficiency, resulting
in longer execution times and higher memory consumption. To address this issue,
we propose EffiLearner, a self-optimization framework that utilizes
execution overhead profiles to improve the efficiency of LLM-generated code.
EffiLearner first generates code using an LLM, then executes it locally to
capture execution time and memory usage profiles. These profiles are fed back
to the LLM, which then revises the code to reduce overhead. To evaluate the
effectiveness of EffiLearner, we conduct extensive experiments on
EffiBench, HumanEval, and MBPP with 16 open-source and 6 closed-source models.
Our evaluation results demonstrate that through iterative self-optimization,
EffiLearner significantly enhances the efficiency of LLM-generated code. For
example, the execution time (ET) of StarCoder2-15B on EffiBench decreases
from 0.93 (s) to 0.12 (s), an 87.1% reduction in execution time compared with
the initial code. The total memory usage (TMU) of StarCoder2-15B also decreases
from 22.02 (Mbs) to 2.03 (Mbs), a 90.8% reduction in total memory consumption
during execution. The source code of EffiLearner is available at
https://github.com/huangd1999/EffiLearner
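The two percentages quoted for StarCoder2-15B follow directly from the before and after values. A minimal check, using only the figures given in the abstract:

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


et_reduction = percent_reduction(0.93, 0.12)    # execution time
tmu_reduction = percent_reduction(22.02, 2.03)  # total memory usage

print(f"ET reduction:  {et_reduction:.1f}%")
print(f"TMU reduction: {tmu_reduction:.1f}%")
```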
[COMMENTS]
Accepted by NeurIPS 2024
[LINK]
http://arxiv.org/abs/2405.15189v4
[DATE]
2025-05-10 23:18:22+08:00
[CATEGORIES]
cs.CL
EffiBench: Benchmarking the Efficiency of Automatically Generated Code
[AUTHORS]
Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang
[ABSTRACT]
Code generation models have increasingly become integral to aiding software
development. Although current research has thoroughly examined the correctness
of the code produced by code generation models, a vital aspect that plays a
pivotal role in green computing and sustainability efforts has often been
neglected. This paper presents EffiBench, a benchmark with 1,000
efficiency-critical coding problems to assess the efficiency of code generated
by code generation models. EffiBench contains a diverse set of LeetCode coding
problems. Each problem is paired with an executable human-written canonical
solution, which obtains the SOTA efficiency on the LeetCode solution
leaderboard. With EffiBench, we empirically examine the ability of 42 large
language models (35 open-source and 7 closed-source) to generate efficient
code. Our evaluation results demonstrate that the efficiency of the code
generated by LLMs is generally worse than the efficiency of human-written
canonical solutions. For example, GPT-4 generated code has an average
execution time 3.12 times that of the human-written canonical
solutions. In the most extreme cases, the execution time and total memory usage
of GPT-4 generated code are 13.89 and 43.92 times that of the
canonical solutions. The source code of EffiBench is released on
https://github.com/huangd1999/EffiBench. We also provide the LeaderBoard at
https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
[COMMENTS]
Camera Ready for NeurIPS 2024
[LINK]
http://arxiv.org/abs/2402.02037v6
[DATE]
2025-05-10 23:11:34+08:00
[CATEGORIES]
cs.CL
TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models
[AUTHORS]
Junyi Peng, Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Černocký
[ABSTRACT]
Self-supervised learning (SSL) models have significantly advanced speech
processing tasks, and several benchmarks have been proposed to validate their
effectiveness. However, previous benchmarks have primarily focused on
single-speaker scenarios, with less exploration of target-speaker tasks in
noisy, multi-talker conditions – a more challenging yet practical case. In
this paper, we introduce the Target-Speaker Speech Processing Universal
Performance Benchmark (TS-SUPERB), which includes four widely recognized
target-speaker processing tasks that require identifying the target speaker and
extracting information from the speech mixture. In our benchmark, the speaker
embedding extracted from enrollment speech is used as a clue to condition
downstream models. The benchmark result reveals the importance of evaluating
SSL models in target speaker scenarios, demonstrating that performance cannot
be easily inferred from related single-speaker tasks. Moreover, by using a
unified SSL-based target speech encoder, consisting of a speaker encoder and an
extractor module, we also investigate joint optimization across TS tasks to
leverage mutual information and demonstrate its effectiveness.
[COMMENTS]
Accepted at ICASSP 2025
[LINK]
http://arxiv.org/abs/2505.06660v1
[DATE]
2025-05-10 22:23:37+08:00
[CATEGORIES]
cs.CL
Evaluating Creative Short Story Generation in Humans and Large Language Models
[AUTHORS]
Mete Ismayilzada, Claire Stevenson, Lonneke van der Plas
[ABSTRACT]
Story-writing is a fundamental aspect of human imagination, relying heavily
on creativity to produce narratives that are novel, effective, and surprising.
While large language models (LLMs) have demonstrated the ability to generate
high-quality stories, their creative story-writing capabilities remain
under-explored. In this work, we conduct a systematic analysis of creativity in
short story generation across 60 LLMs and 60 people using a five-sentence
cue-word-based creative story-writing task. We use measures to automatically
evaluate model- and human-generated stories across several dimensions of
creativity, including novelty, surprise, diversity, and linguistic complexity.
We also collect creativity ratings and Turing Test classifications from
non-expert and expert human raters and LLMs. Automated metrics show that LLMs
generate stylistically complex stories, but tend to fall short in terms of
novelty, surprise and diversity when compared to average human writers. Expert
ratings generally coincide with automated metrics. However, LLMs and
non-experts rate LLM stories to be more creative than human-generated stories.
We discuss why and how these differences in ratings occur, and their
implications for both human and artificial creativity.
[COMMENTS]
Accepted to ICCC 2025
[LINK]
http://arxiv.org/abs/2411.02316v5
[DATE]
2025-05-10 22:20:14+08:00
[CATEGORIES]
cs.CL
Improving Block-Wise LLM Quantization by 4-bit Block-Wise Optimal Float (BOF4): Analysis and Variations
[AUTHORS]
Patrick Blumenberg, Thomas Graave, Tim Fingscheidt
[ABSTRACT]
Large language models (LLMs) demand extensive memory capacity during both
fine-tuning and inference. To enable memory-efficient fine-tuning, existing
methods apply block-wise quantization techniques, such as NF4 and AF4, to the
network weights. We show that these quantization techniques incur suboptimal
quantization errors. Therefore, as a first novelty, we propose an optimization
approach for block-wise quantization. Using this method, we design a family of
quantizers named 4-bit block-wise optimal float (BOF4), which consistently
reduces the quantization error compared to both baseline methods. We provide
both a theoretical and a data-driven solution for the optimization process and
prove their practical equivalence. Secondly, we propose a modification to the
employed normalization method based on the signed absolute block maximum
(BOF4-S), enabling further reduction of the quantization error and empirically
achieving less degradation in language modeling performance. Thirdly, we
explore additional variations of block-wise quantization methods applied to
LLMs through an experimental study on the importance of accurately representing
zero and large-amplitude weights on the one hand, and optimization towards
various error metrics on the other hand. Lastly, we introduce a mixed-precision
quantization strategy dubbed outlier-preserving quantization (OPQ) to address
the distributional mismatch induced by outlier weights in block-wise
quantization. By storing outlier weights in 16-bit precision (OPQ) while
applying BOF4-S, we achieve top performance among 4-bit block-wise quantization
techniques w.r.t. perplexity.
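For readers unfamiliar with block-wise quantization, the general scheme can be sketched as follows. This is a generic illustration under stated assumptions, not the BOF4 method: each block is normalized by its absolute maximum, and the 16-level codebook here is a hypothetical uniform grid, whereas quantizers such as NF4 use non-uniform levels.

```python
def quantize_blockwise(weights, codebook, block_size):
    """Return (indices, scales): one codebook index per weight, one absmax scale per block."""
    indices, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Fall back to 1.0 so an all-zero block does not divide by zero.
        scale = max(abs(w) for w in block) or 1.0
        scales.append(scale)
        for w in block:
            normalized = w / scale  # now in [-1, 1]
            nearest = min(range(len(codebook)), key=lambda i: abs(codebook[i] - normalized))
            indices.append(nearest)
    return indices, scales


def dequantize_blockwise(indices, scales, codebook, block_size):
    """Reconstruct approximate weights from indices and per-block scales."""
    return [codebook[idx] * scales[pos // block_size] for pos, idx in enumerate(indices)]


# Hypothetical uniform 4-bit codebook on [-1, 1] (an assumption for illustration).
codebook = [-1.0 + 2.0 * i / 15 for i in range(16)]
weights = [0.5, -0.25, 0.1, 0.0, 2.0, -1.0, 0.3, 0.05]

indices, scales = quantize_blockwise(weights, codebook, block_size=4)
restored = dequantize_blockwise(indices, scales, codebook, block_size=4)
mse = sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)
print(scales, mse)
```

The second block contains the large value 2.0, so its scale is larger and its small weights are represented more coarsely; this is the kind of outlier effect that motivates storing outlier weights separately.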
[LINK]
http://arxiv.org/abs/2505.06653v1
[DATE]
2025-05-10 22:00:15+08:00
[CATEGORIES]
cs.LG
cs.CL
Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models
[AUTHORS]
Isaac Gerber
[ABSTRACT]
Decoder-only transformer networks have become incredibly popular for language
modeling tasks. State-of-the-art models can have over a hundred transformer
blocks, containing billions of trainable parameters, and are trained on
trillions of tokens of text. Each transformer block typically consists of a
multi-head attention (MHA) mechanism and a two-layer fully connected
feedforward network (FFN). In this paper, we examine the importance of the FFN
during the model pre-training process through a series of experiments,
confirming that the FFN is important to model performance. Furthermore, we show
that models using a transformer block configuration with three-layer FFNs and
fewer such blocks outperform the standard two-layer configuration, delivering
lower training loss with fewer total parameters in less time.
[LINK]
http://arxiv.org/abs/2505.06633v1
[DATE]
2025-05-10 20:54:21+08:00
[CATEGORIES]
cs.CL
cs.LG
Dynamic Domain Information Modulation Algorithm for Multi-domain Sentiment Analysis
[AUTHORS]
Chunyi Yue, Ang Li
[ABSTRACT]
Multi-domain sentiment classification aims to mitigate the poor performance of
models caused by the scarcity of labeled data in a single domain by utilizing
labeled data from various domains. A series of models that jointly train domain
classifiers and sentiment classifiers have demonstrated their advantages,
because domain classification helps generate necessary information for
sentiment classification. Intuitively, the importance of sentiment
classification tasks is the same in all domains for multi-domain sentiment
classification; but domain classification tasks are different because the
impact of domain information on sentiment classification varies across
different fields; this can be controlled through adjustable weights or
hyperparameters. However, as the number of domains increases, existing
hyperparameter optimization algorithms may face the following challenges: (1)
tremendous demand for computing resources, (2) convergence problems, and (3)
high algorithm complexity. To efficiently generate the domain information
required for sentiment classification in each domain, we propose a dynamic
information modulation algorithm. Specifically, the model training process is
divided into two stages. In the first stage, a shared hyperparameter, which
would control the proportion of domain classification tasks across all fields,
is determined. In the second stage, we introduce a novel domain-aware
modulation algorithm to adjust the domain information contained in the input
text, which is then calculated based on a gradient-based and loss-based method.
In summary, experimental results on a public sentiment analysis dataset
containing 16 domains prove the superiority of the proposed method.
[COMMENTS]
17 pages, 5 figures, 3 tables
[LINK]
http://arxiv.org/abs/2505.06630v1
[DATE]
2025-05-10 20:36:00+08:00
[CATEGORIES]
cs.CL
The Efficiency of Pre-training with Objective Masking in Pseudo Labeling for Semi-Supervised Text Classification
[AUTHORS]
Arezoo Hatefi, Xuan-Son Vu, Monowar Bhuyan, Frank Drewes
[ABSTRACT]
We extend and study a semi-supervised model for text classification proposed
earlier by Hatefi et al. for classification tasks in which document classes are
described by a small number of gold-labeled examples, while the majority of
training examples is unlabeled. The model leverages the teacher-student
architecture of Meta Pseudo Labels in which a “teacher” generates labels for
originally unlabeled training data to train the “student” and updates its own
model iteratively based on the performance of the student on the gold-labeled
portion of the data. We extend the original model of Hatefi et al. by an
unsupervised pre-training phase based on objective masking, and conduct
in-depth performance evaluations of the original model, our extension, and
various independent baselines. Experiments are performed using three different
datasets in two different languages (English and Swedish).
[LINK]
http://arxiv.org/abs/2505.06624v1
[DATE]
2025-05-10 20:16:03+08:00
[CATEGORIES]
cs.CL
Boosting Neural Language Inference via Cascaded Interactive Reasoning
[AUTHORS]
Min Li, Chun Yuan
[ABSTRACT]
Natural Language Inference (NLI) focuses on ascertaining the logical
relationship (entailment, contradiction, or neutral) between a given premise
and hypothesis. This task presents significant challenges due to inherent
linguistic features such as diverse phrasing, semantic complexity, and
contextual nuances. While Pre-trained Language Models (PLMs) built upon the
Transformer architecture have yielded substantial advancements in NLI,
prevailing methods predominantly utilize representations from the terminal
layer. This reliance on final-layer outputs may overlook valuable information
encoded in intermediate layers, potentially limiting the capacity to model
intricate semantic interactions effectively. Addressing this gap, we introduce
the Cascaded Interactive Reasoning Network (CIRN), a novel architecture
designed for deeper semantic comprehension in NLI. CIRN implements a
hierarchical feature extraction strategy across multiple network depths,
operating within an interactive space where cross-sentence information is
continuously integrated. This mechanism aims to mimic a process of progressive
reasoning, transitioning from surface-level feature matching to uncovering more
profound logical and semantic connections between the premise and hypothesis.
By systematically mining latent semantic relationships at various
representational levels, CIRN facilitates a more thorough understanding of the
input pair. Comprehensive evaluations conducted on several standard NLI
benchmark datasets reveal consistent performance gains achieved by CIRN over
competitive baseline approaches, demonstrating the efficacy of leveraging
multi-level interactive features for complex relational reasoning.
[LINK]
http://arxiv.org/abs/2505.06607v1
[DATE]
2025-05-10 19:37:15+08:00
[CATEGORIES]
cs.CL
Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation
[AUTHORS]
Abbas Bertina, Shahab Beirami, Hossein Biniazian, Elham Esmaeilnia, Soheil Shahi, Mahdi Pirnia
[ABSTRACT]
Grapheme-to-phoneme (G2P) conversion for Persian presents unique challenges
due to its complex phonological features, particularly homographs and Ezafe,
which exist in formal and informal language contexts. This paper introduces an
intermediate language specifically designed for Persian language processing
that addresses these challenges through a multi-faceted approach. Our
methodology combines two key components: Large Language Model (LLM) prompting
techniques and a specialized sequence-to-sequence machine transliteration
architecture. We developed and implemented a systematic approach for
constructing a comprehensive lexical database for disambiguating homographs
with multiple pronunciations, often termed polyphones, utilizing formal concept
analysis for semantic differentiation. We train our model using two distinct
datasets: the LLM-generated dataset for formal and informal Persian and the
B-Plus podcasts for informal language variants. The experimental results
demonstrate superior performance compared to existing state-of-the-art
approaches, particularly in handling the complexities of Persian phoneme
conversion. Our model significantly improves Phoneme Error Rate (PER) metrics,
establishing a new benchmark for Persian G2P conversion accuracy. This work
contributes to the growing research in low-resource language processing,
provides a robust solution for Persian text-to-speech systems, and demonstrates
applicability beyond Persian. Specifically, the approach can extend to
languages with rich homographic phenomena such as Chinese and Arabic.
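The Phoneme Error Rate mentioned above is commonly computed as the edit distance between predicted and reference phoneme sequences, divided by the total reference length. The sketch below follows that common definition; the abstract does not spell out the exact variant the authors use, and the function names and the toy phoneme strings are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference phoneme
                curr[j - 1] + 1,     # insert a hypothesis phoneme
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Total edit errors divided by total reference length (assumed definition)."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    ref_len = sum(len(r) for r in references)
    return errors / ref_len


# Toy example with made-up phoneme symbols: one substitution in five phonemes.
reference = [["s", "a", "l", "a", "m"]]
hypothesis = [["s", "e", "l", "a", "m"]]
per = phoneme_error_rate(reference, hypothesis)
```

Lower values are better; a PER of 0 means every predicted sequence matches its reference exactly.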
[COMMENTS]
pdf, 8 pages, 4 figures, 4 tables
[LINK]
http://arxiv.org/abs/2505.06599v1
[DATE]
2025-05-10 19:10:48+08:00
[CATEGORIES]
cs.CL
Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
[AUTHORS]
Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen
[LINK]
http://arxiv.org/abs/2505.06594v1
[DATE]
2025-05-10 18:52:23+08:00
[CATEGORIES]
cs.CL
MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG
[AUTHORS]
Woosang Lim, Zekun Li, Gyuwan Kim, Sungyoung Ji, HyeonJung Kim, Kyuri Choi, Jin Hyuk Lim, Kyungpyo Park, William Yang Wang
[ABSTRACT]
Long-context (LC) Large Language Models (LLMs) combined with
Retrieval-Augmented Generation (RAG) hold strong potential for complex
multi-hop and large-document tasks. However, existing RAG systems often suffer
from imprecise retrieval, incomplete context coverage under constrained context
windows, and fragmented information caused by suboptimal context construction.
We introduce Multi-scale Adaptive Context RAG (MacRAG), a hierarchical
retrieval framework that compresses and partitions documents into
coarse-to-fine granularities, then adaptively merges relevant contexts through
chunk- and document-level expansions in real time. By starting from the
finest-level retrieval and progressively incorporating higher-level and broader
context, MacRAG constructs effective query-specific long contexts, optimizing
both precision and coverage. Evaluations on the challenging LongBench
expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm that MacRAG
consistently surpasses baseline RAG pipelines on single- and multi-step
generation with Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. Our results establish
MacRAG as an efficient, scalable solution for real-world long-context,
multi-hop reasoning. Our code is available at
https://github.com/Leezekun/MacRAG.
[LINK]
http://arxiv.org/abs/2505.06569v1
[DATE]
2025-05-10 16:50:44+08:00
[CATEGORIES]
cs.CL
cs.LG
Towards Understanding Sycophancy in Language Models
[AUTHORS]
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
[ABSTRACT]
Human feedback is commonly utilized to finetune AI assistants. But human
feedback may also encourage model responses that match user beliefs over
truthful ones, a behaviour known as sycophancy. We investigate the prevalence
of sycophancy in models whose finetuning procedure made use of human feedback,
and the potential role of human preference judgments in such behavior. We first
demonstrate that five state-of-the-art AI assistants consistently exhibit
sycophancy across four varied free-form text-generation tasks. To understand if
human preferences drive this broadly observed behavior, we analyze existing
human preference data. We find that when a response matches a user’s views, it
is more likely to be preferred. Moreover, both humans and preference models
(PMs) prefer convincingly-written sycophantic responses over correct ones a
non-negligible fraction of the time. Optimizing model outputs against PMs also
sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results
indicate that sycophancy is a general behavior of state-of-the-art AI
assistants, likely driven in part by human preference judgments favoring
sycophantic responses.
[COMMENTS]
32 pages, 20 figures
[LINK]
http://arxiv.org/abs/2310.13548v4
[DATE]
2025-05-10 15:10:46+08:00
[CATEGORIES]
cs.CL
cs.LG
A Hybrid Architecture with Efficient Fine Tuning for Abstractive Patent Document Summarization
[AUTHORS]
Nevidu Jayatilleke, Ruvan Weerasinghe
[ABSTRACT]
Automatic patent summarization approaches that help in the patent analysis
and comprehension procedure are in high demand due to the colossal growth of
innovations. The development of natural language processing (NLP), text mining,
and deep learning has notably amplified the efficacy of text summarization
models for abundant types of documents. Summarizing patent text remains a
pertinent challenge due to the labyrinthine writing style of these documents,
which includes technical and legal intricacies. Additionally, these patent
document contents are considerably lengthier than archetypal documents, which
complicates the process of extracting pertinent information for summarization.
Embodying extractive and abstractive text summarization methodologies into a
hybrid framework, this study proposes a system for efficiently creating
abstractive summaries of patent records. The procedure involves leveraging the
LexRank graph-based algorithm to retrieve the important sentences from input
patent texts, then utilizing a Bidirectional and Auto-Regressive Transformer
(BART) model that has been fine-tuned using Low-Rank Adaptation (LoRA) for
producing text summaries. This is accompanied by methodical testing and
evaluation strategies. Furthermore, the author employed certain meta-learning
techniques to achieve Domain Generalization (DG) of the abstractive component
across multiple patent fields.
[COMMENTS]
Accepted Paper in the 8th International Research Conference on Smart
Computing and Systems Engineering, University of Kelaniya, Sri Lanka.
(Pending Publication)
[LINK]
http://arxiv.org/abs/2503.10354v2
[DATE]
2025-05-10 14:44:09+08:00
[CATEGORIES]
cs.CL
xGen-small Technical Report
[AUTHORS]
Erik Nijkamp, Bo Pang, Egor Pakhomov, Akash Gokul, Jin Qu, Silvio Savarese, Yingbo Zhou, Caiming Xiong
[ABSTRACT]
We introduce xGen-small, a family of 4B and 9B Transformer decoder models
optimized for long-context applications. Our vertically integrated pipeline
unites domain-balanced, frequency-aware data curation; multi-stage pre-training
with quality annealing and length extension to 128k tokens; and targeted
post-training via supervised fine-tuning, preference learning, and online
reinforcement learning. xGen-small delivers strong performance across various
tasks, especially in math and coding domains, while excelling at long context
benchmarks.
[LINK]
http://arxiv.org/abs/2505.06496v1
[DATE]
2025-05-10 10:54:16+08:00
[CATEGORIES]
cs.CL
Fun-tuning: Characterizing the Vulnerability of Proprietary LLMs to Optimization-based Prompt Injection Attacks via the Fine-Tuning Interface
[AUTHORS]
Andrey Labunets, Nishit V. Pandya, Ashish Hooda, Xiaohan Fu, Earlence Fernandes
[ABSTRACT]
We surface a new threat to closed-weight Large Language Models (LLMs) that
enables an attacker to compute optimization-based prompt injections.
Specifically, we characterize how an attacker can leverage the loss-like
information returned from the remote fine-tuning interface to guide the search
for adversarial prompts. The fine-tuning interface is hosted by an LLM vendor
and allows developers to fine-tune LLMs for their tasks, thus providing
utility, but also exposes enough information for an attacker to compute
adversarial prompts. Through an experimental analysis, we characterize the
loss-like values returned by the Gemini fine-tuning API and demonstrate that
they provide a useful signal for discrete optimization of adversarial prompts
using a greedy search algorithm. Using the PurpleLlama prompt injection
benchmark, we demonstrate attack success rates between 65% and 82% on Google’s
Gemini family of LLMs. These attacks exploit the classic utility-security
tradeoff - the fine-tuning interface provides a useful feature for developers
but also exposes the LLMs to powerful attacks.
[LINK]
http://arxiv.org/abs/2501.09798v2
[DATE]
2025-05-10 10:36:13+08:00
[CATEGORIES]
cs.CL
Artificial Neural Networks on Graded Vector Spaces
[AUTHORS]
Tony Shaska
[ABSTRACT]
This paper presents a transformative framework for artificial neural networks
over graded vector spaces, tailored to model hierarchical and structured data
in fields like algebraic geometry and physics. By exploiting the algebraic
properties of graded vector spaces, where features carry distinct weights, we
extend classical neural networks with graded neurons, layers, and activation
functions that preserve structural integrity. Grounded in group actions,
representation theory, and graded algebra, our approach combines theoretical
rigor with practical utility.
We introduce graded neural architectures, loss functions prioritizing graded
components, and equivariant extensions adaptable to diverse gradings. Case
studies validate the framework’s effectiveness, outperforming standard neural
networks in tasks such as predicting invariants in weighted projective spaces
and modeling supersymmetric systems.
This work establishes a new frontier in machine learning, merging
mathematical sophistication with interdisciplinary applications. Future
challenges, including computational scalability and finite field extensions,
offer rich opportunities for advancing this paradigm.
[LINK]
http://arxiv.org/abs/2407.19031v2
[DATE]
2025-05-10 23:03:42+08:00
[CATEGORIES]
cs.LG
StableMotion: Repurposing Diffusion-Based Image Priors for Motion Estimation
[AUTHORS]
Ziyi Wang, Haipeng Li, Lin Sui, Tianhao Zhou, Hai Jiang, Lang Nie, Shuaicheng Liu
[ABSTRACT]
We present StableMotion, a novel framework that leverages knowledge (geometry
and content priors) from pretrained large-scale image diffusion models to
perform motion estimation, solving single-image-based image rectification tasks
such as Stitched Image Rectangling (SIR) and Rolling Shutter Correction (RSC).
Specifically, the StableMotion framework takes a text-to-image Stable Diffusion
(SD) model as its backbone and repurposes it into an image-to-motion estimator.
To mitigate inconsistent output produced by diffusion models, we propose an
Adaptive Ensemble Strategy (AES) that consolidates multiple outputs into a cohesive,
high-fidelity result. Additionally, we present the concept of Sampling Steps
Disaster (SSD), the counterintuitive scenario where increasing the number of
sampling steps can lead to poorer outcomes, which enables our framework to
achieve one-step inference. StableMotion is verified on two image rectification
tasks and delivers state-of-the-art performance in both, as well as showing
strong generalizability. Supported by SSD, StableMotion offers a speedup of 200
times compared to previous diffusion model-based methods.
[LINK]
http://arxiv.org/abs/2505.06668v1
[DATE]
2025-05-10 22:58:44+08:00
[CATEGORIES]
cs.LG
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
[AUTHORS]
Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Ohi, Masaki Kawamura, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, Naoaki Okazaki
[ABSTRACT]
The performance of large language models (LLMs) in program synthesis and
mathematical reasoning is fundamentally limited by the quality of their
pre-training corpora. We introduce two openly licensed datasets, released under
the Llama 3.3 Community License, that significantly enhance LLM performance by
systematically rewriting public data. SwallowCode (approximately 16.1 billion
tokens) refines Python snippets from The-Stack-v2 through a novel four-stage
pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM
rewriting process that enforces style conformity and transforms snippets into
self-contained, algorithmically efficient examples. Unlike prior methods that
rely on exclusionary filtering or limited transformations, our
transform-and-retain approach upgrades low-quality code, maximizing data
utility. SwallowMath (approximately 2.3 billion tokens) enhances Finemath-4+ by
removing boilerplate, restoring context, and reformatting solutions into
concise, step-by-step explanations. Within a fixed 50 billion token training
budget, continual pre-training of Llama-3.1-8B with SwallowCode boosts pass@1
by +17.0 on HumanEval and +17.7 on HumanEval+ compared to Stack-Edu, surpassing
the baseline model’s code generation capabilities. Similarly, substituting
SwallowMath yields +12.4 accuracy on GSM8K and +7.6 on MATH. Ablation studies
confirm that each pipeline stage contributes incrementally, with rewriting
delivering the largest gains. All datasets, prompts, and checkpoints are
publicly available, enabling reproducible research and advancing LLM
pre-training for specialized domains.
[LINK]
http://arxiv.org/abs/2505.02881v2
[DATE]
2025-05-10 22:45:30+08:00
[CATEGORIES]
cs.LG
Scaling up the Banded Matrix Factorization Mechanism for Differentially Private ML
[AUTHORS]
Ryan McKenna
[LINK]
http://arxiv.org/abs/2405.15913v4
[DATE]
2025-05-10 22:14:36+08:00
[CATEGORIES]
cs.LG
Dyn-D$^2$P: Dynamic Differentially Private Decentralized Learning with Provable Utility Guarantee
[AUTHORS]
Zehan Zhu, Yan Huang, Xin Wang, Shouling Ji, Jinming Xu
[ABSTRACT]
Most existing decentralized learning methods with differential privacy (DP)
guarantee rely on constant gradient clipping bounds and fixed-level DP Gaussian
noises for each node throughout the training process, leading to a significant
accuracy degradation compared to non-private counterparts. In this paper, we
propose a new Dynamic Differentially Private Decentralized learning approach
(termed Dyn-D$^2$P) tailored for general time-varying directed networks.
Leveraging the Gaussian DP (GDP) framework for privacy accounting, Dyn-D$^2$P
dynamically adjusts gradient clipping bounds and noise levels based on gradient
convergence. This proposed dynamic noise strategy enables us to enhance model
accuracy while preserving the total privacy budget. Extensive experiments on
benchmark datasets demonstrate the superiority of Dyn-D$^2$P over its
counterparts employing fixed-level noises, especially under strong privacy
guarantees. Furthermore, we provide a provable utility bound for Dyn-D$^2$P
that establishes an explicit dependency on network-related parameters, with a
scaling factor of $1/\sqrt{n}$ in terms of the number of nodes $n$ up to a bias
error term induced by gradient clipping. To our knowledge, this is the first
model utility analysis for differentially private decentralized non-convex
optimization with dynamic gradient clipping bounds and noise levels.
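The clip-then-add-noise step that underlies differentially private gradient methods can be sketched as follows. This is a minimal illustration of the standard Gaussian mechanism applied to one node's gradient, not the Dyn-D$^2$P algorithm itself: the geometric decay in `decayed_clip_bound` is a hypothetical stand-in for the paper's convergence-based adjustment rule, and all names and parameters are assumptions.

```python
import math
import random


def clip_gradient(grad, clip_bound):
    """Rescale grad so its L2 norm is at most clip_bound."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_bound / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def privatize(grad, clip_bound, noise_multiplier, rng):
    """Clip, then add Gaussian noise with std = noise_multiplier * clip_bound."""
    clipped = clip_gradient(grad, clip_bound)
    sigma = noise_multiplier * clip_bound
    return [g + rng.gauss(0.0, sigma) for g in clipped]


def decayed_clip_bound(initial_bound, decay, step):
    """Hypothetical schedule: shrink the bound geometrically as training proceeds."""
    return initial_bound * (decay ** step)


rng = random.Random(0)
grad = [3.0, 4.0]
for step in range(3):
    bound = decayed_clip_bound(1.0, 0.5, step)
    noisy = privatize(grad, bound, noise_multiplier=1.0, rng=rng)
```

Because the noise standard deviation is tied to the clipping bound, shrinking the bound as gradients get smaller also shrinks the injected noise, which is the intuition behind a dynamic schedule.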
[COMMENTS]
This paper has been accepted by the 34th International Joint
Conference on Artificial Intelligence (IJCAI 2025)
[LINK]
http://arxiv.org/abs/2505.06651v1
[DATE]
2025-05-10 21:57:57+08:00
[CATEGORIES]
cs.LG
Purity Law for Generalizable Neural TSP Solvers
[AUTHORS]
Wenzhao Liu, Haoran Li, Congying Han, Zicheng Zhang, Anqi Li, Tiande Guo
[ABSTRACT]
Achieving generalization in neural approaches across different scales and
distributions remains a significant challenge for the Traveling Salesman
Problem (TSP). A key obstacle is that neural networks often fail to learn
robust principles for identifying universal patterns and deriving optimal
solutions from diverse instances. In this paper, we first uncover Purity Law
(PuLa), a fundamental structural principle for optimal TSP solutions, defining
that edge prevalence grows exponentially with the sparsity of surrounding
vertices. Statistically validated across diverse instances, PuLa reveals a
consistent bias toward local sparsity in global optima. Building on this
insight, we propose Purity Policy Optimization (PUPO), a novel training
paradigm that explicitly aligns characteristics of neural solutions with PuLa
during the solution construction process to enhance generalization. Extensive
experiments demonstrate that PUPO can be seamlessly integrated with popular
neural solvers, significantly enhancing their generalization performance
without incurring additional computational overhead during inference.
[LINK]
http://arxiv.org/abs/2505.04558v2
[DATE]
2025-05-10 21:39:05+08:00
[CATEGORIES]
cs.LG
Robust Learning of Diverse Code Edits
[AUTHORS]
Tushar Aggarwal, Swayam Singh, Abhijeet Awasthi, Aditya Kanade, Nagarajan Natarajan
[ABSTRACT]
Software engineering activities frequently involve edits to existing code.
However, contemporary code language models (LMs) lack the ability to handle
diverse types of code-edit requirements. In this work, we attempt to overcome
this shortcoming through (1) a novel synthetic data generation pipeline and (2)
a robust model adaptation algorithm. Starting with seed code examples and
diverse editing criteria, our pipeline generates high-quality samples
comprising original and modified code, along with natural language instructions
in different styles and verbosity. Today’s code LMs come bundled with strong
abilities, such as code generation and instruction following, which should not
be lost due to fine-tuning. To ensure this, we propose a novel adaptation
algorithm, SeleKT, that (a) leverages a dense gradient-based step to identify
the weights that are most important for code editing, and (b) does a sparse
projection onto the base model to avoid overfitting. Using our approach, we
obtain a new series of models NextCoder (adapted from QwenCoder-2.5) that
achieves strong results on five code-editing benchmarks, outperforming
comparable size models and even several larger ones. We show the generality of
our approach on two model families (DeepSeekCoder and QwenCoder), compare
against other fine-tuning approaches, and demonstrate robustness by showing
retention of code generation and general problem-solving abilities post
adaptation. We open-source the models, synthetic dataset, and implementation at
https://aka.ms/nextcoder.
[COMMENTS]
To appear in ICML 2025 as ‘NextCoder: Robust Adaptation of Code LMs
to Diverse Code Edits’
[LINK]
http://arxiv.org/abs/2503.03656v2
[DATE]
2025-05-10 19:59:18+08:00
[CATEGORIES]
cs.LG
Learning Guarantee of Reward Modeling Using Deep Neural Networks
[AUTHORS]
Yuanhang Luo, Yeheng Ge, Ruijian Han, Guohao Shen
[ABSTRACT]
In this work, we study the learning theory of reward modeling with pairwise
comparison data using deep neural networks. We establish a novel non-asymptotic
regret bound for deep reward estimators in a non-parametric setting, which
depends explicitly on the network architecture. Furthermore, to underscore the
critical importance of clear human beliefs, we introduce a margin-type
condition that assumes the conditional winning probability of the optimal
action in pairwise comparisons is significantly distanced from 1/2. This
condition enables a sharper regret bound, which substantiates the empirical
efficiency of Reinforcement Learning from Human Feedback and highlights clear
human beliefs in its success. Notably, this improvement stems from high-quality
pairwise comparison data implied by the margin-type condition, is independent
of the specific estimators used, and thus applies to various learning
algorithms and models.
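To make the margin-type condition concrete, the sketch below uses the Bradley-Terry model, a common choice for pairwise comparison data in which the probability that one action wins is a sigmoid of the reward difference. The abstract does not name a specific comparison model, so the Bradley-Terry form and all function names here are assumptions.

```python
import math


def win_probability(reward_a, reward_b):
    """Bradley-Terry probability that action a is preferred over action b."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def pairwise_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood of one observed comparison."""
    return -math.log(win_probability(reward_preferred, reward_rejected))


def margin_from_half(reward_a, reward_b):
    """How far the winning probability sits from 1/2."""
    return abs(win_probability(reward_a, reward_b) - 0.5)


# A large reward gap gives a clear preference; a small gap gives a near coin flip.
clear = margin_from_half(2.0, 0.0)
ambiguous = margin_from_half(0.1, 0.0)
```

Under this model, comparisons with a larger margin carry a less noisy preference signal, which matches the abstract's emphasis on clear human beliefs.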
[LINK]
http://arxiv.org/abs/2505.06601v1
[DATE]
2025-05-10 19:21:29+08:00
[CATEGORIES]
cs.LG
Geometry of Learning – L2 Phase Transitions in Deep and Shallow Neural Networks
[AUTHORS]
Ibrahim Talha Ersoy, Karoline Wiesner
[ABSTRACT]
When neural networks (NNs) are subject to L2 regularization, increasing the
regularization strength beyond a certain threshold pushes the model into an
under-parameterization regime. This transition manifests as a first-order phase
transition in single-hidden-layer NNs and a second-order phase transition in
NNs with two or more hidden layers. This paper establishes a unified framework
for such transitions by integrating the Ricci curvature of the loss landscape
with regularizer-driven deep learning. First, we show that a curvature
change-point separates the model-accuracy regimes in the onset of learning and
that it is identical to the critical point of the phase transition driven by
regularization. Second, we show that for more complex data sets additional
phase transitions exist between model accuracies, and that they are again
identical to curvature change points in the error landscape. Third, by studying
the MNIST data set using a Variational Autoencoder, we demonstrate that the
curvature change points identify phase transitions in model accuracy outside
the L2 setting. Our framework also offers practical insights for optimizing
model performance across various architectures and datasets. By linking
geometric features of the error landscape to observable phase transitions, our
work paves the way for more informed regularization strategies and potentially
new methods for probing the intrinsic structure of neural networks beyond the
L2 context.
[LINK]
http://arxiv.org/abs/2505.06597v1
[DATE]
2025-05-10 19:02:30+08:00
[CATEGORIES]
cs.LG
Feature Representation Transferring to Lightweight Models via Perception Coherence
[AUTHORS]
Hai-Vy Nguyen, Fabrice Gamboa, Sixin Zhang, Reda Chhaibi, Serge Gratton, Thierry Giaccone
[ABSTRACT]
In this paper, we propose a method for transferring feature representation to
lightweight student models from larger teacher models. We mathematically define
a new notion called perception coherence. Based on this notion, we
propose a loss function, which takes into account the dissimilarities between
data points in feature space through their ranking. At a high level, by
minimizing this loss function, the student model learns to mimic how the
teacher model perceives inputs. More precisely, our method is
motivated by the fact that the representational capacity of the student model
is weaker than that of the teacher model. Hence, we aim to develop a new method
allowing for a better relaxation. This means that the student model does not
need to preserve the absolute geometry of the teacher one, while preserving
global coherence through dissimilarity ranking. Our theoretical insights
provide a probabilistic perspective on the process of feature representation
transfer. Our experimental results show that our method outperforms or achieves
on-par performance compared to strong baseline methods for representation
transferring.
[LINK]
http://arxiv.org/abs/2505.06595v1
[DATE]
2025-05-10 18:55:06+08:00
[CATEGORIES]
cs.LG
Simple Policy Optimization
[AUTHORS]
Zhengpeng Xie, Qiang Zhang, Fan Yang, Marco Hutter, Renjing Xu
[ABSTRACT]
Model-free reinforcement learning algorithms have seen remarkable progress,
but key challenges remain. Trust Region Policy Optimization (TRPO) is known for
ensuring monotonic policy improvement through conservative updates within a
trust region, backed by strong theoretical guarantees. However, its reliance on
complex second-order optimization limits its practical efficiency. Proximal
Policy Optimization (PPO) addresses this by simplifying TRPO’s approach using
ratio clipping, improving efficiency but sacrificing some theoretical
robustness. This raises a natural question: Can we combine the strengths of
both methods? In this paper, we introduce Simple Policy Optimization (SPO), a
novel unconstrained first-order algorithm. By slightly modifying the policy
loss used in PPO, SPO can achieve the best of both worlds. Our new objective
improves upon ratio clipping, offering stronger theoretical properties and
better constraining the probability ratio within the trust region. Empirical
results demonstrate that SPO outperforms PPO with a simple implementation,
particularly for training large, complex network architectures end-to-end.
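For reference, the ratio clipping that the abstract attributes to PPO can be sketched as below. This shows PPO's standard clipped surrogate objective only; it is not the SPO objective, whose exact form the abstract does not give. The function name and the default `epsilon` are illustrative assumptions.

```python
def ppo_clipped_objective(ratios, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples.

    ratios: new-policy probability divided by old-policy probability per sample.
    advantages: advantage estimate per sample.
    """
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Taking the minimum gives a pessimistic bound on the unclipped term.
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)


# With a positive advantage, a ratio above 1 + eps earns no extra objective.
capped = ppo_clipped_objective([1.5], [1.0])
# With a negative advantage, a ratio above 1 + eps is still fully penalized.
penalized = ppo_clipped_objective([1.5], [-1.0])
```

Once a sample's ratio leaves the clipping range in the favorable direction, its term becomes constant and contributes zero gradient, so clipping removes the incentive to move further but does not actively pull the ratio back into the trust region.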
[LINK]
http://arxiv.org/abs/2401.16025v8
[DATE]
2025-05-10 18:05:56+08:00
[CATEGORIES]
cs.LG
An \tilde{O}ptimal Differentially Private Learner for Concept Classes with VC Dimension 1
[AUTHORS]
Chao Yan
[ABSTRACT]
We present the first nearly optimal differentially private PAC learner for
any concept class with VC dimension 1 and Littlestone dimension $d$. Our
algorithm achieves the sample complexity of
$\tilde{O}_{\varepsilon,\delta,\alpha,\delta}(\log^* d)$, nearly matching the
lower bound of $\Omega(\log^* d)$ proved by Alon et al. [STOC19]. Prior to our
work, the best known upper bound is $\tilde{O}(VC\cdot d^5)$ for general VC
classes, as shown by Ghazi et al. [STOC21].
[LINK]
http://arxiv.org/abs/2505.06581v1
[DATE]
2025-05-10 17:51:25+08:00
[CATEGORIES]
cs.LG
Constraint-based causal discovery with tiered background knowledge and latent variables in single or overlapping datasets
[AUTHORS]
Christine W. Bang, Vanessa Didelez
[ABSTRACT]
In this paper we consider the use of tiered background knowledge within
constraint based causal discovery. Our focus is on settings relaxing causal
sufficiency, i.e. allowing for latent variables which may arise because
relevant information could not be measured at all, or not jointly, as in the
case of multiple overlapping datasets. We first present novel insights into the
properties of the ‘tiered FCI’ (tFCI) algorithm. Building on this, we introduce
a new extension of the IOD (integrating overlapping datasets) algorithm
incorporating tiered background knowledge, the ‘tiered IOD’ (tIOD) algorithm.
We show that under full usage of the tiered background knowledge tFCI and tIOD
are sound, while simple versions of the tIOD and tFCI are sound and complete.
We further show that the tIOD algorithm can often be expected to be
considerably more efficient and informative than the IOD algorithm even beyond
the obvious restriction of the Markov equivalence classes. We provide a formal
result on the conditions for this gain in efficiency and informativeness. Our
results are accompanied by a series of examples illustrating the exact role and
usefulness of tiered background knowledge.
[COMMENTS]
Accepted for the 4th Conference on Causal Learning and Reasoning
(CLeaR 2025). Version 2: Corrected numbering in Example 1
[LINK]
http://arxiv.org/abs/2503.21526v2
[DATE]
2025-05-10 17:17:58+08:00
[CATEGORIES]
cs.LG
Deep Fréchet Regression
[AUTHORS]
Su I Iao, Yidong Zhou, Hans-Georg Müller
[ABSTRACT]
Advancements in modern science have led to the increasing availability of
non-Euclidean data in metric spaces. This paper addresses the challenge of
modeling relationships between non-Euclidean responses and multivariate
Euclidean predictors. We propose a flexible regression model capable of
handling high-dimensional predictors without imposing parametric assumptions.
Two primary challenges are addressed: the curse of dimensionality in
nonparametric regression and the absence of linear structure in general metric
spaces. The former is tackled using deep neural networks, while for the latter
we demonstrate the feasibility of mapping the metric space where responses
reside to a low-dimensional Euclidean space using manifold learning. We
introduce a reverse mapping approach, employing local Fréchet regression, to
map the low-dimensional manifold representations back to objects in the
original metric space. We develop a theoretical framework, investigating the
convergence rate of deep neural networks under dependent sub-Gaussian noise
with bias. The convergence rate of the proposed regression model is then
obtained by expanding the scope of local Fréchet regression to accommodate
multivariate predictors in the presence of errors in predictors. Simulations
and case studies show that the proposed model outperforms existing methods for
non-Euclidean responses, focusing on the special cases of probability
distributions and networks.
[COMMENTS]
74 pages, 6 figures, 9 tables
[LINK]
http://arxiv.org/abs/2407.21407v2
[DATE]
2025-05-10 16:37:18+08:00
[CATEGORIES]
cs.LG
FreCT: Frequency-augmented Convolutional Transformer for Robust Time Series Anomaly Detection
[AUTHORS]
Wenxin Zhang, Ding Xu, Guangzhen Yao, Xiaojian Lin, Renxiang Guan, Chengze Du, Renda Han, Xi Xuan, Cuicui Luo
[ABSTRACT]
Time series anomaly detection is critical for system monitoring and risk
identification across various domains, such as finance and healthcare.
However, for most reconstruction-based approaches, detecting anomalies remains
a challenge due to the complexity of sequential patterns in time series data.
On the one hand, reconstruction-based techniques are susceptible to
computational deviation stemming from anomalies, which can lead to impure
representations of normal sequence patterns. On the other hand, they often
focus on the time-domain dependencies of time series, while ignoring the
alignment of frequency information beyond the time domain. To address these
challenges, we propose a novel Frequency-augmented Convolutional Transformer
(FreCT). FreCT utilizes patch operations to generate contrastive views and
employs an improved Transformer architecture integrated with a convolution
module to capture long-term dependencies while preserving local topology
information. The introduced frequency analysis based on Fourier transformation
could enhance the model’s ability to capture crucial characteristics beyond the
time domain. To protect the training quality from anomalies and improve the
robustness, FreCT deploys stop-gradient Kullback-Leibler (KL) divergence and
absolute error to optimize consistency information in both time and frequency
domains. Extensive experiments on four public datasets demonstrate that FreCT
outperforms existing methods in identifying anomalies.
[LINK]
http://arxiv.org/abs/2505.00941v2
[DATE]
2025-05-10 16:32:35+08:00
[CATEGORIES]
cs.LG
OptiGait-LGBM: An Efficient Approach of Gait-based Person Re-identification in Non-Overlapping Regions
[AUTHORS]
Md. Sakib Hassan Chowdhury, Md. Hafiz Ahamed, Bishowjit Paul, Sarafat Hussain Abhi, Abu Bakar Siddique, Md. Robius Sany
[ABSTRACT]
Gait recognition, known for its ability to identify individuals from a
distance, has gained significant attention in recent times due to its
non-intrusive verification. While video-based gait identification systems
perform well on large public datasets, their performance drops when applied to
real-world, unconstrained gait data due to various factors. Among these,
uncontrolled outdoor environments, non-overlapping camera views, varying
illumination, and computational efficiency are core challenges in gait-based
authentication. Currently, no dataset addresses all these challenges
simultaneously. In this paper, we propose an OptiGait-LGBM model capable of
performing person re-identification under these constraints using a skeletal
model approach, which helps mitigate inconsistencies in a person’s appearance.
The model constructs a dataset from landmark positions, minimizing memory usage
by using non-sequential data. A benchmark dataset, RUET-GAIT, is introduced to
represent uncontrolled gait sequences in complex outdoor environments. The
process involves extracting skeletal joint landmarks, generating numerical
datasets, and developing an OptiGait-LGBM gait classification model. Our aim is
to address the aforementioned challenges with minimal computational cost
compared to existing methods. A comparative analysis with ensemble techniques
such as Random Forest and CatBoost demonstrates that the proposed approach
outperforms them in terms of accuracy, memory usage, and training time. This
method provides a novel, low-cost, and memory-efficient video-based gait
recognition solution for real-world scenarios.
[COMMENTS]
12 pages, 17 figures
[LINK]
http://arxiv.org/abs/2505.08801v1
[DATE]
2025-05-10 16:28:57+08:00
[CATEGORIES]
cs.LG
A Computational Approach to Epilepsy Treatment: An AI-optimized Global Natural Product Prescription System
[AUTHORS]
Zhixuan Wang
[ABSTRACT]
Epilepsy is a prevalent neurological disease with millions of patients
worldwide. Many patients have turned to alternative medicine due to the limited
efficacy and side effects of conventional antiepileptic drugs. In this study,
we developed a computational approach to optimize herbal epilepsy treatment
through AI-driven analysis of global natural products and statistically
validated randomized controlled trials (RCTs). Our intelligent prescription
system combines machine learning (ML) algorithms for herb-efficacy
characterization, Bayesian optimization for personalized dosing, and
meta-analysis of RCTs for evidence-based recommendations. The system analyzed
1,872 natural compounds from traditional Chinese medicine (TCM), Ayurveda, and
ethnopharmacological databases, integrating their bioactive properties with
clinical outcomes from 48 RCTs covering 48 epilepsy conditions (n=5,216). Using
LASSO regression and SHAP value analysis, we identified 17 high-efficacy herbs
(e.g., Gastrodia elata, Withania
somnifera), showing significant seizure reduction (p$<$0.01, Cohen’s d=0.89)
with statistical significance confirmed by multiple testing (p$<$0.001). A
randomized double-blind validation trial (n=120) demonstrated 28.5\% greater
seizure frequency reduction with AI-optimized herbal prescriptions compared to
conventional protocols (95\% CI: 18.7-37.3\%, p=0.003).
[LINK]
http://arxiv.org/abs/2505.09643v1
[DATE]
2025-05-10 16:14:20+08:00
[CATEGORIES]
cs.LG
RAM: Replace Attention with MLP for Efficient Multivariate Time Series Forecasting
[AUTHORS]
Suhan Guo, Jiahong Deng, Yi Wei, Hui Dou, Furao Shen, Jian Zhao
[ABSTRACT]
Attention-based architectures have become ubiquitous in time series
forecasting tasks, including spatio-temporal (STF) and long-term time series
forecasting (LTSF). Yet, our understanding of the reasons for their
effectiveness remains limited. In this work, we propose a novel pruning
strategy, $\textbf{R}$eplace $\textbf{A}$ttention with $\textbf{M}$LP (RAM),
that approximates the attention mechanism using only feedforward layers,
residual connections, and layer normalization for temporal and/or spatial
modeling in multivariate time series forecasting. Specifically, the Q, K, and V
projections, the attention score calculation, the dot-product between the
attention score and the V, and the final projection can be removed from the
attention-based networks without significantly degrading the performance, so
that the given network remains top-tier compared to other SOTA methods. RAM
achieves a $62.579\%$ reduction in FLOPs for spatio-temporal models with less
than $2.5\%$ performance drop, and a $42.233\%$ FLOPs reduction for LTSF models
with less than $2\%$ performance drop.
[LINK]
http://arxiv.org/abs/2410.24023v2
[DATE]
2025-05-10 16:10:54+08:00
[CATEGORIES]
cs.LG
Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels
[AUTHORS]
Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Sepp Hochreiter
[ABSTRACT]
Linear RNNs with gating recently demonstrated competitive performance
compared to Transformers in language modeling. Although their linear compute
scaling in sequence length offers theoretical runtime advantages over
Transformers, realizing these benefits in practice requires optimized custom
kernels, as Transformers rely on the highly efficient Flash Attention kernels
(Dao, 2024). Leveraging the chunkwise-parallel formulation of linear RNNs,
Flash Linear Attention (FLA) (Yang & Zhang, 2024) shows that linear RNN kernels
are faster than Flash Attention, by parallelizing over chunks of the input
sequence. However, since the chunk size of FLA is limited, many intermediate
states must be materialized in GPU memory. This leads to low arithmetic
intensity and causes high memory consumption and IO cost, especially for
long-context pre-training. In this work, we present Tiled Flash Linear
Attention (TFLA), a novel kernel algorithm for linear RNNs, that enables
arbitrarily large chunk sizes and high arithmetic intensity by introducing an
additional level of sequence parallelization within each chunk. First, we apply
TFLA to the xLSTM with matrix memory, the mLSTM (Beck et al., 2024). Second, we
propose an mLSTM variant with sigmoid input gate and reduced computation for
even faster kernel runtimes at equal language modeling performance. In our
speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform
highly optimized Flash Attention, Linear Attention and Mamba kernels, setting a
new state of the art for efficient long-context sequence modeling primitives.
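
The chunkwise-parallel formulation mentioned above can be illustrated with a minimal pure-Python sketch. It covers plain, ungated causal linear attention only; it is not the TFLA kernel or the mLSTM. It assumes the recurrence S_t = S_{t-1} + k_t v_t^T with output o_t = q_t^T S_t, and all function names and sizes are illustrative.

```python
import random

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

def add_outer(S, k, v):
    """In-place rank-1 update S += k v^T."""
    for i, ki in enumerate(k):
        for j, vj in enumerate(v):
            S[i][j] += ki * vj

def read_state(q, S):
    """Return q^T S as a list of length d_v."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]

def recurrent_linear_attention(Q, K, V):
    """Step-by-step form: one state update per time step."""
    S = zeros(len(K[0]), len(V[0]))
    out = []
    for q, k, v in zip(Q, K, V):
        add_outer(S, k, v)
        out.append(read_state(q, S))
    return out

def chunkwise_linear_attention(Q, K, V, chunk_size):
    """Chunkwise form: the state is only updated at chunk boundaries."""
    S = zeros(len(K[0]), len(V[0]))
    out = []
    for start in range(0, len(Q), chunk_size):
        end = min(start + chunk_size, len(Q))
        for t in range(start, end):
            # Inter-chunk part: contribution of everything before this chunk.
            o = read_state(Q[t], S)
            # Intra-chunk part: causal scores against keys inside the chunk.
            for s in range(start, t + 1):
                score = sum(a * b for a, b in zip(Q[t], K[s]))
                for j in range(len(o)):
                    o[j] += score * V[s][j]
            out.append(o)
        # Fold the whole chunk into the state once.
        for s in range(start, end):
            add_outer(S, K[s], V[s])
    return out

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

rng = random.Random(0)
T, d_k, d_v = 10, 3, 2
Q = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
K = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
V = [[rng.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]

gap = max_abs_diff(recurrent_linear_attention(Q, K, V),
                   chunkwise_linear_attention(Q, K, V, chunk_size=4))
print(f"max difference between the two forms: {gap:.2e}")
```

In this toy version, a larger `chunk_size` means fewer boundary states but more intra-chunk work per step, which is the trade-off the abstract refers to when it discusses chunk size and materialized intermediate states.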
[COMMENTS]
Code available at: https://github.com/NX-AI/mlstm_kernels
[LINK]
http://arxiv.org/abs/2503.14376v2
[DATE]
2025-05-10 16:07:13+08:00
[CATEGORIES]
cs.LG
Good Things Come in Pairs: Paired Autoencoders for Inverse Problems
[AUTHORS]
Matthias Chung, Bas Peters, Michael Solomon
[ABSTRACT]
In this book chapter, we discuss recent advances in data-driven approaches
for inverse problems. In particular, we focus on the \emph{paired autoencoder}
framework, which has proven to be a powerful tool for solving inverse problems
in scientific computing. The paired autoencoder framework is a novel approach
that leverages the strengths of both data-driven and model-based methods by
projecting both the data and the quantity of interest into a latent space and
mapping these latent spaces to provide surrogate forward and inverse mappings.
We illustrate the advantages of this approach through numerical experiments,
including seismic imaging and classical inpainting: nonlinear and linear
inverse problems, respectively. Although the paired autoencoder framework is
likelihood-free, it generates multiple data- and model-based reconstruction
metrics that help assess whether examples are in or out of distribution. In
addition to direct model estimates from data, the paired autoencoder enables
latent-space refinement to fit the observed data accurately. Numerical
experiments show that this procedure, combined with the latent-space initial
guess, is essential for high-quality estimates, even when data noise exceeds
the training regime. We also introduce two novel variants that combine
variational and paired autoencoder ideas, maintaining the original benefits
while enabling sampling for uncertainty analysis.
[COMMENTS]
43 pages, 17 figures
[LINK]
http://arxiv.org/abs/2505.06549v1
[DATE]
2025-05-10 15:31:09+08:00
[CATEGORIES]
cs.LG
dcFCI: Robust Causal Discovery Under Latent Confounding, Unfaithfulness, and Mixed Data
[AUTHORS]
Adèle H. Ribeiro, Dominik Heider
[ABSTRACT]
Causal discovery is central to inferring causal relationships from
observational data. In the presence of latent confounding, algorithms such as
Fast Causal Inference (FCI) learn a Partial Ancestral Graph (PAG) representing
the true model’s Markov Equivalence Class. However, their correctness
critically depends on empirical faithfulness, the assumption that observed
(in)dependencies perfectly reflect those of the underlying causal model, which
often fails in practice due to limited sample sizes. To address this, we
introduce the first nonparametric score to assess a PAG’s compatibility with
observed data, even with mixed variable types. This score is both necessary and
sufficient to characterize structural uncertainty and distinguish between
distinct PAGs. We then propose data-compatible FCI (dcFCI), the first hybrid
causal discovery algorithm to jointly address latent confounding, empirical
unfaithfulness, and mixed data types. dcFCI integrates our score into an
(Anytime)FCI-guided search that systematically explores, ranks, and validates
candidate PAGs. Experiments on synthetic and real-world scenarios demonstrate
that dcFCI significantly outperforms state-of-the-art methods, often recovering
the true PAG even in small and heterogeneous datasets. Examining top-ranked
PAGs further provides valuable insights into structural uncertainty, supporting
more robust and informed causal reasoning and decision-making.
[COMMENTS]
31 pages. This work has been submitted to the IEEE for possible
publication
[LINK]
http://arxiv.org/abs/2505.06542v1
[DATE]
2025-05-10 15:05:19+08:00
[CATEGORIES]
cs.LG
Online Feedback Efficient Active Target Discovery in Partially Observable Environments
[AUTHORS]
Anindya Sarkar, Binglin Ji, Yevgeniy Vorobeychik
[ABSTRACT]
In various scientific and engineering domains, where data acquisition is
costly, such as in medical imaging, environmental monitoring, or remote
sensing, strategic sampling from unobserved regions, guided by prior
observations, is essential to maximize target discovery within a limited
sampling budget. In this work, we introduce Diffusion-guided Active Target
Discovery (DiffATD), a novel method that leverages diffusion dynamics for
active target discovery. DiffATD maintains a belief distribution over each
unobserved state in the environment, using this distribution to dynamically
balance exploration-exploitation. Exploration reduces uncertainty by sampling
regions with the highest expected entropy, while exploitation targets areas
with the highest likelihood of discovering the target, indicated by the belief
distribution and an incrementally trained reward model designed to learn the
characteristics of the target. DiffATD enables efficient target discovery in a
partially observable environment within a fixed sampling budget, all without
relying on any prior supervised training. Furthermore, DiffATD offers
interpretability, unlike existing black-box policies that require extensive
supervised training. Through extensive experiments and ablation studies across
diverse domains, including medical imaging and remote sensing, we show that
DiffATD performs significantly better than baselines and competitively with
supervised methods that operate under full environmental observability.
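
The exploration-exploitation balance described above can be sketched in a few lines of Python. This is a hypothetical illustration, not DiffATD itself: the belief is reduced to a per-cell target probability, the exploration term is the binary entropy of that probability, and the fixed mixing weight `explore_weight` is an assumption (the abstract does not state how the two terms are combined).

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) belief; zero when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def select_next_cell(belief, reward, explore_weight):
    """Pick the unobserved cell with the highest combined score.

    belief: dict mapping cell -> believed probability that it holds a target
    reward: dict mapping cell -> score from a (hypothetical) reward model
    explore_weight: 1.0 means pure exploration, 0.0 means pure exploitation
    """
    def score(cell):
        explore = binary_entropy(belief[cell])   # most uncertain cells
        exploit = belief[cell] * reward[cell]    # most likely targets
        return explore_weight * explore + (1.0 - explore_weight) * exploit
    return max(belief, key=score)

belief = {"a": 0.5, "b": 0.9, "c": 0.1}
reward = {"a": 1.0, "b": 1.0, "c": 1.0}
print(select_next_cell(belief, reward, explore_weight=1.0))
print(select_next_cell(belief, reward, explore_weight=0.0))
```

With pure exploration the most uncertain cell is chosen, and with pure exploitation the cell with the highest believed target probability is chosen.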
[COMMENTS]
30 pages, 28 figures, Pre-print
[LINK]
http://arxiv.org/abs/2505.06535v1
[DATE]
2025-05-10 14:50:01+08:00
[CATEGORIES]
cs.LG
GBDTSVM: Combined Support Vector Machine and Gradient Boosting Decision Tree Framework for efficient snoRNA-disease association prediction
[AUTHORS]
Ummay Maria Muna, Fahim Hafiz, Shanta Biswas, Riasat Azim
[ABSTRACT]
Small nucleolar RNAs (snoRNAs) are increasingly recognized for their critical
role in the pathogenesis and characterization of various human diseases.
Consequently, the precise identification of snoRNA-disease associations (SDAs)
is essential for the progression of diseases and the advancement of treatment
strategies. However, conventional biological experimental approaches are
costly, time-consuming, and resource-intensive; therefore, machine
learning-based computational methods offer a promising solution to mitigate
these limitations. This paper proposes a model called ‘GBDTSVM’, representing a
novel and efficient machine learning approach for predicting snoRNA-disease
associations by leveraging a Gradient Boosting Decision Tree (GBDT) and Support
Vector Machine (SVM). ‘GBDTSVM’ effectively extracts integrated snoRNA-disease
feature representations using GBDT, and an SVM is subsequently used to
classify and identify potential associations. Furthermore, the method enhances
the accuracy of these predictions by incorporating Gaussian kernel profile
similarity for both snoRNAs and diseases. Experimental evaluation of the
GBDTSVM model demonstrated superior performance compared to state-of-the-art
methods in the field, achieving an area under the receiver operating
characteristic curve (AUROC) of 0.96 and an area under the precision-recall curve
(AUPRC) of 0.95 on the MDRF dataset. Moreover, our model shows superior performance
on two more datasets named LSGT and PsnoD. Additionally, a case study on the
predicted snoRNA-disease associations verified the top 10 predicted snoRNAs
across nine prevalent diseases, further validating the efficacy of the GBDTSVM
approach. These results underscore the model’s potential as a robust tool for
advancing snoRNA-related disease research. Source code and datasets for our
proposed framework can be obtained from: https://github.com/mariamuna04/gbdtsvm
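
The Gaussian kernel profile similarity mentioned above can be sketched as follows. This is a minimal example that assumes the commonly used Gaussian interaction profile kernel, in which the bandwidth is normalised by the mean squared norm of the profiles; the abstract does not give the paper's exact variant, and the function name, `gamma_prime` parameter, and toy association matrix are illustrative.

```python
import math

def gaussian_profile_similarity(profiles, gamma_prime=1.0):
    """Pairwise similarity exp(-gamma * ||p_i - p_j||^2) between profiles.

    profiles: list of equal-length 0/1 rows, e.g. one row per snoRNA with
              one column per disease (1 = known association).
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all profiles are empty; bandwidth is undefined")
    gamma = gamma_prime / mean_sq_norm  # bandwidth scaled to the data
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim

# Toy association matrix: rows are snoRNAs, columns are diseases.
assoc = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
snorna_sim = gaussian_profile_similarity(assoc)
# Disease similarity uses the transposed matrix (one profile per disease).
disease_sim = gaussian_profile_similarity([list(col) for col in zip(*assoc)])
```

Identical profiles get similarity 1, and the similarity decays as the association patterns diverge.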
[COMMENTS]
30 pages, 3 figures
[LINK]
http://arxiv.org/abs/2505.06534v1
[DATE]
2025-05-10 14:46:29+08:00
[CATEGORIES]
cs.LG
High-Dimensional Importance-Weighted Information Criteria: Theory and Optimality
[AUTHORS]
Yong-Syun Cao, Shinpei Imori, Ching-Kang Ing
[ABSTRACT]
Imori and Ing (2025) proposed the importance-weighted orthogonal greedy
algorithm (IWOGA) for model selection in high-dimensional misspecified
regression models under covariate shift. To determine the number of IWOGA
iterations, they introduced the high-dimensional importance-weighted
information criterion (HDIWIC). They argued that the combined use of IWOGA and
HDIWIC, IWOGA + HDIWIC, achieves an optimal trade-off between variance and
squared bias, leading to optimal convergence rates in terms of conditional mean
squared prediction error. In this article, we provide a theoretical
justification for this claim by establishing the optimality of IWOGA + HDIWIC
under a set of reasonable assumptions.
[LINK]
http://arxiv.org/abs/2505.06531v1
[DATE]
2025-05-10 14:26:12+08:00
[CATEGORIES]
cs.LG
A Comprehensive Survey of Synthetic Tabular Data Generation
[AUTHORS]
Ruxue Shi, Yili Wang, Mengnan Du, Xu Shen, Xin Wang
[ABSTRACT]
Tabular data remains one of the most prevalent and critical data formats
across diverse real-world applications. However, its effective use in machine
learning (ML) is often constrained by challenges such as data scarcity, privacy
concerns, and class imbalance. Synthetic data generation has emerged as a
promising solution, leveraging generative models to learn the distribution of
real datasets and produce high-fidelity, privacy-preserving samples. Various
generative paradigms have been explored, including energy-based models (EBMs),
variational autoencoders (VAEs), generative adversarial networks (GANs), large
language models (LLMs), and diffusion models. While several surveys have
investigated synthetic tabular data generation, most focus on narrow subdomains
or specific generative methods, such as GANs, diffusion models, or
privacy-preserving techniques. This limited scope often results in fragmented
insights, lacking a comprehensive synthesis that bridges diverse approaches. In
particular, recent advances driven by LLMs and diffusion-based models remain
underexplored. This gap hinders a holistic understanding of the field’s
evolution, methodological interplay, and open challenges. To address this, our
survey provides a unified and systematic review of synthetic tabular data
generation. Our contributions are threefold: (1) we propose a comprehensive
taxonomy that organizes existing methods into traditional approaches,
diffusion-based methods, and LLM-based models, and provide an in-depth
comparative analysis; (2) we detail the complete pipeline for synthetic tabular
data generation, including data synthesis, post-processing, and evaluation; (3)
we identify major challenges, explore real-world applications, and outline open
research questions and future directions to guide future work in this rapidly
evolving area.
[LINK]
http://arxiv.org/abs/2504.16506v2
[DATE]
2025-05-10 14:10:06+08:00
[CATEGORIES]
cs.LG
Interpretable SHAP-bounded Bayesian Optimization for Underwater Acoustic Metamaterial Coating Design
[AUTHORS]
Hansani Weeratunge, Dominic Robe, Elnaz Hajizadeh
[ABSTRACT]
We developed an interpretability-informed Bayesian optimization framework to
optimize underwater acoustic coatings based on polyurethane elastomers with
embedded metamaterial features. A data-driven model was employed to analyze the
relationship between acoustic performance, specifically sound absorption, and
the corresponding design variables. By leveraging SHapley Additive exPlanations
(SHAP), a machine learning interpretability tool, we identified the key
parameters influencing the objective function and gained insights into how
these parameters affect sound absorption. The insights derived from the SHAP
analysis were subsequently used to refine the bounds of the
optimization problem automatically, enabling a more targeted and efficient
exploration of the design space.
The proposed approach was applied to two polyurethane materials with distinct
hardness levels, resulting in improved optimal solutions compared to those
obtained without SHAP-informed guidance. Notably, these enhancements were
achieved without increasing the number of simulation iterations. Our findings
demonstrate the potential of SHAP to streamline optimization processes by
uncovering hidden parameter relationships and guiding the search toward
promising regions of the design space. This work underscores the effectiveness
of combining interpretability techniques with Bayesian optimization for the
efficient and cost-effective design of underwater acoustic metamaterials under
strict computational constraints and can be generalized towards other materials
and engineering optimization problems.
[LINK]
http://arxiv.org/abs/2505.06519v1
[DATE]
2025-05-10 13:33:43+08:00
[CATEGORIES]
cs.LG
Text-to-CadQuery: A New Paradigm for CAD Generation with Scalable Large Model Capabilities
[AUTHORS]
Haoyang Xie, Feng Ju
[ABSTRACT]
Computer-aided design (CAD) is fundamental to modern engineering and
manufacturing, but creating CAD models still requires expert knowledge and
specialized software. Recent advances in large language models (LLMs) open up
the possibility of generative CAD, where natural language is directly
translated into parametric 3D models. However, most existing methods generate
task-specific command sequences that pretrained models cannot directly handle.
These sequences must be converted into CAD representations such as CAD vectors
before a 3D model can be produced, which requires training models from scratch
and adds unnecessary complexity. To tackle this issue, we propose generating
CadQuery code directly from text, leveraging the strengths of pretrained LLMs
to produce 3D models without intermediate representations, using this
Python-based scripting language. Since LLMs already excel at Python generation
and spatial reasoning, fine-tuning them on Text-to-CadQuery data proves highly
effective. Given that these capabilities typically improve with scale, we
hypothesize that larger models will perform better after fine-tuning. To enable
this, we augment the Text2CAD dataset with 170,000 CadQuery annotations. We
fine-tune six open-source LLMs of varying sizes and observe consistent
improvements. Our best model achieves a top-1 exact match of 69.3%, up from
58.8%, and reduces Chamfer Distance by 48.6%. Project page:
https://github.com/Text-to-CadQuery/Text-to-CadQuery.
[LINK]
http://arxiv.org/abs/2505.06507v1
[DATE]
2025-05-10 12:47:08+08:00
[CATEGORIES]
cs.LG
Dual Alignment Maximin Optimization for Offline Model-based RL
[AUTHORS]
Chi Zhou, Wang Luo, Haoran Li, Congying Han, Tiande Guo, Zicheng Zhang
[ABSTRACT]
Offline reinforcement learning agents face significant deployment challenges
due to the synthetic-to-real distribution mismatch. While most prior research
has focused on improving the fidelity of synthetic sampling and incorporating
off-policy mechanisms, the directly integrated paradigm often fails to ensure
consistent policy behavior in biased models and underlying environmental
dynamics, which inherently arise from discrepancies between behavior and
learning policies. In this paper, we first shift the focus from model
reliability to policy discrepancies while optimizing for expected returns, and
then self-consistently incorporate synthetic data, deriving a novel
actor-critic paradigm, Dual Alignment Maximin Optimization (DAMO). It is a
unified framework to ensure both model-environment policy consistency and
synthetic and offline data compatibility. The inner minimization performs dual
conservative value estimation, aligning policies and trajectories to avoid
out-of-distribution states and actions, while the outer maximization ensures
that policy improvements remain consistent with inner value estimates.
Empirical evaluations demonstrate that DAMO effectively ensures model and
policy alignments, achieving competitive performance across diverse benchmark
tasks.
[LINK]
http://arxiv.org/abs/2502.00850v2
[DATE]
2025-05-10 12:42:40+08:00
[CATEGORIES]
cs.LG
Guided Exploration for Efficient Relational Model Learning
[AUTHORS]
Annie Feng, Nishanth Kumar, Tomas Lozano-Perez, Leslie Pack-Kaelbling
[ABSTRACT]
Efficient exploration is critical for learning relational models in
large-scale environments with complex, long-horizon tasks. Random exploration
methods often collect redundant or irrelevant data, limiting their ability to
learn accurate relational models of the environment. Goal-literal babbling
(GLIB) improves upon random exploration by setting and planning to novel goals,
but its reliance on random actions and random novel goal selection limits its
scalability to larger domains. In this work, we identify the principles
underlying efficient exploration in relational domains: (1) operator
initialization with demonstrations that cover the distinct lifted effects
necessary for planning and (2) refining preconditions to collect maximally
informative transitions by selecting informative goal-action pairs and
executing plans to them. To demonstrate these principles, we introduce
Baking-Large, a challenging domain with extensive state-action spaces and
long-horizon tasks. We evaluate methods using oracle-driven demonstrations for
operator initialization and precondition-targeting guidance to efficiently
gather critical transitions. Experiments show that both the oracle
demonstrations and precondition-targeting oracle guidance significantly improve
sample efficiency and generalization, paving the way for future methods to use
these principles to efficiently learn accurate relational models in complex
domains.
[LINK]
http://arxiv.org/abs/2502.06146v2
[DATE]
2025-05-10 12:07:03+08:00
[CATEGORIES]
cs.LG
PC-SRGAN: Physically Consistent Super-Resolution Generative Adversarial Network for General Transient Simulations
[AUTHORS]
Md Rakibul Hasan, Pouria Behnoudfar, Dan MacKinlay, Thomas Poulet
[ABSTRACT]
Machine Learning, particularly Generative Adversarial Networks (GANs), has
revolutionised Super Resolution (SR). However, generated images often lack
physical meaningfulness, which is essential for scientific applications. Our
approach, PC-SRGAN, enhances image resolution while ensuring physical
consistency for interpretable simulations. PC-SRGAN significantly improves both
the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure
compared to conventional methods, even with limited training data (e.g., only
13% of training data required for SRGAN). Beyond SR, PC-SRGAN augments
physically meaningful machine learning, incorporating numerically justified
time integrators and advanced quality metrics. These advancements promise
reliable and causal machine-learning models in scientific domains. A
significant advantage of PC-SRGAN over conventional SR techniques is its
physical consistency, which makes it a viable surrogate model for
time-dependent problems. PC-SRGAN advances scientific machine learning,
offering improved accuracy and efficiency for image processing, enhanced
process understanding, and broader applications to scientific research. The
source codes and data will be made publicly available at
https://github.com/hasan-rakibul/PC-SRGAN upon acceptance of this paper.
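
For reference, the Peak Signal-to-Noise Ratio named above follows the standard definition PSNR = 10 log10(MAX^2 / MSE). The sketch below is a plain-Python illustration of that formula on small nested lists; it is not taken from the PC-SRGAN code base, and the function name and `max_value` default are assumptions.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    flat_ref = [x for row in reference for x in row]
    flat_est = [x for row in estimate for x in row]
    if len(flat_ref) != len(flat_est):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_est)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [[0.0, 0.5], [1.0, 0.25]]
# Every pixel is off by 0.1, so the MSE is 0.01 and the PSNR is about 20 dB.
estimate = [[x + 0.1 for x in row] for row in reference]
value_db = psnr(reference, estimate)
```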
[LINK]
http://arxiv.org/abs/2505.06502v1
[DATE]
2025-05-10 12:05:00+08:00
[CATEGORIES]
cs.LG
Statistical Error Bounds for GANs with Nonlinear Objective Functionals
[AUTHORS]
Jeremiah Birrell
[ABSTRACT]
Generative adversarial networks (GANs) are unsupervised learning methods for
training a generator distribution to produce samples that approximate those
drawn from a target distribution. Many such methods can be formulated as
minimization of a metric or divergence between probability distributions.
Recent works have derived statistical error bounds for GANs that are based on
integral probability metrics (IPMs), e.g., WGAN which is based on the
1-Wasserstein metric. In general, IPMs are defined by optimizing a linear
functional (difference of expectations) over a space of discriminators. A much
larger class of GANs, which we here call $(f,\Gamma)$-GANs, can be constructed
using $f$-divergences (e.g., Jensen-Shannon, KL, or $\alpha$-divergences)
together with a regularizing discriminator space $\Gamma$ (e.g., $1$-Lipschitz
functions). These GANs have nonlinear objective functions, depending on the
choice of $f$, and have been shown to exhibit improved performance in a number
of applications. In this work we derive statistical error bounds for
$(f,\Gamma)$-GANs for general classes of $f$ and $\Gamma$ in the form of
finite-sample concentration inequalities. These results prove the statistical
consistency of $(f,\Gamma)$-GANs and reduce to the known results for IPM-GANs
in the appropriate limit. Our results use novel Rademacher complexity bounds
which provide new insight into the performance of IPM-GANs for distributions
with unbounded support and have application to statistical learning tasks
beyond GANs.
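
For orientation, the contrast between a linear IPM objective and a nonlinear $f$-divergence objective can be written out as below. The first line is the standard IPM definition. The second is the classical convex-conjugate lower bound on an $f$-divergence with the discriminator restricted to $\Gamma$, shown only to illustrate where the nonlinearity in the discriminator enters. The exact $(f,\Gamma)$ objective analysed in the paper is not stated in the abstract and may differ from this form. Here $P$ and $Q$ are the two distributions being compared and $f^*$ is the convex conjugate of $f$.

```latex
\begin{align*}
d_\Gamma(P, Q) &= \sup_{\gamma \in \Gamma}
  \big\{ \mathbb{E}_P[\gamma] - \mathbb{E}_Q[\gamma] \big\}, \\
D_f(P \,\|\, Q) &\ge \sup_{\gamma \in \Gamma}
  \big\{ \mathbb{E}_P[\gamma] - \mathbb{E}_Q[f^*(\gamma)] \big\},
\qquad f^*(y) = \sup_{x} \{ x y - f(x) \}.
\end{align*}
```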
[COMMENTS]
29 pages
[LINK]
http://arxiv.org/abs/2406.16834v3
[DATE]
2025-05-10 11:13:26+08:00
[CATEGORIES]
cs.LG
Demystifying SGD with Doubly Stochastic Gradients
[AUTHORS]
Kyurae Kim, Joohwan Ko, Yi-An Ma, Jacob R. Gardner
[ABSTRACT]
Optimization objectives in the form of a sum of intractable expectations are
rising in importance (e.g., diffusion models, variational autoencoders, and
many more), a setting also known as “finite sum with infinite data.” For these
problems, a popular strategy is to employ SGD with doubly stochastic gradients
(doubly SGD): the expectations are estimated using the gradient estimator of
each component, while the sum is estimated by subsampling over these
estimators. Despite its popularity, little is known about the convergence
properties of doubly SGD, except under strong assumptions such as bounded
variance. In this work, we establish the convergence of doubly SGD with
independent minibatching and random reshuffling under general conditions, which
encompasses dependent component gradient estimators. In particular, for
dependent estimators, our analysis allows fine-grained analysis of the effect
of correlations. As a result, under a per-iteration computational budget of $b
\times m$, where $b$ is the minibatch size and $m$ is the number of Monte Carlo
samples, our analysis suggests where one should invest most of the budget in
general. Furthermore, we prove that random reshuffling (RR) improves the
complexity dependence on the subsampling noise.
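
The doubly stochastic gradient estimator described above can be sketched on a toy problem. The objective below, F(x) = (1/n) sum_i E[0.5 (x - a_i - eps)^2] with eps ~ N(0, 1), is a hypothetical example chosen because its exact gradient x - mean(a) is known; the function names and the values of `b` and `m` are illustrative, and the minibatch is drawn independently at each call (no random reshuffling).

```python
import random

def doubly_stochastic_grad(x, centers, b, m, rng):
    """One doubly stochastic gradient estimate at x.

    Outer randomness: a minibatch of b components, sampled without replacement.
    Inner randomness: m Monte Carlo samples per selected component.
    """
    batch = rng.sample(range(len(centers)), b)
    total = 0.0
    for i in batch:
        # Monte Carlo estimate of the gradient of E[0.5 * (x - a_i - eps)^2].
        mc = sum(x - centers[i] - rng.gauss(0.0, 1.0) for _ in range(m)) / m
        total += mc
    return total / b

def exact_grad(x, centers):
    """Closed-form gradient of the toy objective."""
    return x - sum(centers) / len(centers)

rng = random.Random(0)
centers = [0.0, 1.0, 2.0, 3.0]
x = 5.0
estimates = [doubly_stochastic_grad(x, centers, b=2, m=4, rng=rng)
             for _ in range(20000)]
mean_estimate = sum(estimates) / len(estimates)
print(f"exact gradient: {exact_grad(x, centers)}, mean estimate: {mean_estimate:.3f}")
```

Each call costs `b * m` gradient evaluations, which corresponds to the per-iteration budget of $b \times m$ discussed in the abstract; how that budget should be split between `b` and `m` is the question the paper analyses.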
[COMMENTS]
Accepted to ICML’24; v2: fixed typo in complexity statements
[LINK]
http://arxiv.org/abs/2406.00920v2
[DATE]
2025-05-10 10:44:13+08:00
[CATEGORIES]
cs.LG
High-Dimensional Gaussian Process Regression with Soft Kernel Interpolation
[AUTHORS]
Chris Camaño, Daniel Huang
[ABSTRACT]
We introduce Soft Kernel Interpolation (SoftKI), a method that combines
aspects of Structured Kernel Interpolation (SKI) and variational inducing point
methods, to achieve scalable Gaussian Process (GP) regression on
high-dimensional datasets. SoftKI approximates a kernel via softmax
interpolation from a smaller number of interpolation points learned by
optimizing a combination of the SoftKI marginal log-likelihood (MLL), and when
needed, an approximate MLL for improved numerical stability. Consequently, it
can overcome the dimensionality scaling challenges that SKI faces when
interpolating from a dense and static lattice while retaining the flexibility
of variational methods to adapt inducing points to the dataset. We demonstrate
the effectiveness of SoftKI across various examples and show that it is
competitive with other approximated GP methods when the data dimensionality is
modest (around 10).
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2410.21419v2
[DATE]
2025-05-10 09:41:15+08:00
[CATEGORIES]
cs.LG
QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration
[AUTHORS]
HamidReza Imani, Jiaxin Peng, Peiman Mohseni, Abdolah Amirany, Tarek El-Ghazawi
[ABSTRACT]
The deployment of mixture-of-experts (MoE) large language models (LLMs)
presents significant challenges due to their high memory demands. These
challenges become even more pronounced in multi-tenant environments, where
shared resources must accommodate multiple models, limiting the effectiveness
of conventional virtualization techniques. This paper addresses the problem of
efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a
serving system that employs \textit{similarity-based expert consolidation} to
reduce the overall memory footprint by sharing similar experts across models.
To ensure output quality, we introduce \textit{runtime partial
reconfiguration}, dynamically replacing non-expert layers when processing
requests from different models. As a result, our approach achieves
competitive output quality and throughput comparable to serving a
single model, while incurring a negligible increase in time-to-first-token
(TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using
Mixtral-8x7B models demonstrate an 85\% average reduction in turnaround time
compared to NVIDIA’s multi-instance GPU (MIG). Furthermore, experiments on
Google’s Switch Transformer Base-8 model with up to four variants demonstrate
the scalability and resilience of our approach in maintaining output quality
compared to other model merging baselines, highlighting its effectiveness.
[LINK]
http://arxiv.org/abs/2505.06481v1
[DATE]
2025-05-10 08:46:04+08:00
[CATEGORIES]
cs.LG
Probing In-Context Learning: Impact of Task Complexity and Model Architecture on Generalization and Efficiency
[AUTHORS]
Binwen Liu, Peiyu Xu, Quan Yuan, Yihong Chen
[ABSTRACT]
We investigate in-context learning (ICL) through a meticulous experimental
framework that systematically varies task complexity and model architecture.
Extending beyond the linear regression baseline, we introduce Gaussian kernel
regression and nonlinear dynamical system tasks, which emphasize temporal and
recursive reasoning. We evaluate four distinct models: a GPT2-style
Transformer, a Transformer with FlashAttention mechanism, a convolutional
Hyena-based model, and the Mamba state-space model. Each model is trained from
scratch on synthetic datasets and assessed for generalization during testing.
Our findings highlight that model architecture significantly shapes ICL
performance. The standard Transformer demonstrates robust performance across
diverse tasks, while Mamba excels in temporally structured dynamics. Hyena
effectively captures long-range dependencies but shows higher variance early in
training, and FlashAttention offers computational efficiency but is more
sensitive in low-data regimes. Further analysis uncovers locality-induced
shortcuts in Gaussian kernel tasks, enhanced nonlinear separability through
input range scaling, and the critical role of curriculum learning in mastering
high-dimensional tasks.
[LINK]
http://arxiv.org/abs/2505.06475v1
[DATE]
2025-05-10 08:22:40+08:00
[CATEGORIES]
cs.LG