Tokenisation is NP-Complete
[AUTHORS]
Philip Whittington, Gregor Bachmann, Tiago Pimentel
[ABSTRACT]
In this work, we prove the NP-completeness of two variants of tokenisation,
defined as the problem of compressing a dataset to at most $\delta$ symbols by
either finding a vocabulary directly (direct tokenisation), or selecting a
sequence of merge operations (bottom-up tokenisation).
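As an illustration of the bottom-up variant, here is a minimal sketch that checks whether a given merge sequence compresses a dataset to at most delta symbols. It assumes BPE-style, left-to-right merge application; the function names and the toy dataset are hypothetical, and the paper's formal definition may differ in its details.

```python
def apply_merges(text, merges):
    """Apply each merge (a, b) -> a + b in order, scanning left to right."""
    symbols = list(text)  # start from single characters
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            # Merge an adjacent (a, b) pair into one symbol.
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


def compresses_to(dataset, merges, delta):
    """Return True if the merges shrink the dataset to at most delta symbols."""
    total = sum(len(apply_merges(s, merges)) for s in dataset)
    return total <= delta


# Toy usage: two merges over a two-string dataset.
dataset = ["abab", "abc"]
merges = [("a", "b"), ("ab", "ab")]
print(compresses_to(dataset, merges, 3))
```

Checking a proposed merge sequence this way takes polynomial time. The hard part is finding a sequence that meets the bound.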
[LINK]
http://arxiv.org/abs/2412.15210v1
[DATE]
2024-12-20 02:59:46+08:00
[CATEGORIES]
cs.CL
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
[AUTHORS]
Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li
[ABSTRACT]
This paper introduces LongBench v2, a benchmark designed to assess the
ability of LLMs to handle long-context problems requiring deep understanding
and reasoning across real-world multitasks. LongBench v2 consists of 503
challenging multiple-choice questions, with contexts ranging from 8k to 2M
words, across six major task categories: single-document QA, multi-document QA,
long in-context learning, long-dialogue history understanding, code repository
understanding, and long structured data understanding. To ensure breadth
and practicality, we collect data from nearly 100 highly educated
individuals with diverse professional backgrounds. We employ both automated and
manual review processes to maintain high quality and difficulty, resulting in
human experts achieving only 53.7% accuracy under a 15-minute time constraint.
Our evaluation reveals that the best-performing model, when directly answering
the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model,
which includes longer reasoning, achieves 57.7%, surpassing the human baseline
by 4 percentage points. These results highlight the importance of enhanced reasoning ability and
scaling inference-time compute to tackle the long-context challenges in
LongBench v2. The project is available at https://longbench2.github.io.
[COMMENTS]
25 pages, 13 figures
[LINK]
http://arxiv.org/abs/2412.15204v1
[DATE]
2024-12-20 02:59:17+08:00
[CATEGORIES]
cs.CL
Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings
[AUTHORS]
Daniel Russo, Stefano Menini, Jacopo Staiano, Marco Guerini
[ABSTRACT]
Natural Language Processing and Generation systems have recently shown the
potential to complement and streamline the costly and time-consuming job of
professional fact-checkers. In this work, we lift several constraints of
current state-of-the-art pipelines for automated fact-checking based on the
Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under
more realistic scenarios, RAG-based methods for the generation of verdicts -
i.e., short texts discussing the veracity of a claim - evaluating them on
stylistically complex claims and heterogeneous, yet reliable, knowledge bases.
Our findings show a complex landscape, where, for example, LLM-based retrievers
outperform other retrieval techniques, though they still struggle with
heterogeneous knowledge bases; larger models excel in verdict faithfulness,
while smaller models provide better context adherence, with human evaluations
favouring zero-shot and one-shot approaches for informativeness, and fine-tuned
models for emotional alignment.
[COMMENTS]
Code and data at https://github.com/drusso98/face-the-facts
[LINK]
http://arxiv.org/abs/2412.15189v1
[DATE]
2024-12-20 02:57:11+08:00
[CATEGORIES]
cs.CL
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
[AUTHORS]
Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
[ABSTRACT]
We present LlamaFusion, a framework for empowering pretrained text-only large
language models (LLMs) with multimodal generative capabilities, enabling them
to understand and generate both text and images in arbitrary sequences.
LlamaFusion leverages the existing Llama-3 weights for processing text
autoregressively while introducing additional and parallel transformer modules
for processing images with diffusion. During training, the data from each
modality is routed to its dedicated modules: modality-specific feedforward
layers, query-key-value projections, and normalization layers process each
modality independently, while the shared self-attention layers allow
interactions across text and image features. By freezing the text-specific
modules and only training the image-specific modules, LlamaFusion preserves the
language capabilities of text-only LLMs while developing strong visual
understanding and generation abilities. Compared to methods that pretrain
multimodal generative models from scratch, our experiments demonstrate that
LlamaFusion improves image understanding by 20% and image generation by 3.6%
using only 50% of the FLOPs while maintaining Llama-3’s language capabilities.
We also demonstrate that this framework can adapt existing vision-language
models with multimodal generation ability. Overall, this framework not only
leverages existing computational investments in text-only LLMs but also enables
the parallel development of language and vision capabilities, presenting a
promising direction for efficient multimodal model development.
[LINK]
http://arxiv.org/abs/2412.15188v1
[DATE]
2024-12-20 02:56:24+08:00
[CATEGORIES]
cs.CL
cs.LG
CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
[AUTHORS]
Leitian Tao, Xiang Chen, Tong Yu, Tung Mai, Ryan Rossi, Yixuan Li, Saayan Mitra
[ABSTRACT]
Large Language Models (LLMs) have revolutionized code generation but require
significant resources and often over-generalize, limiting their task-specific
efficiency. Fine-tuning smaller, open-source LLMs provides a cost-effective
alternative. However, standard supervised approaches rely only on correct
examples, missing valuable insights from failures. We introduce CodeLutra, a
framework that leverages both correct and incorrect code attempts. Instead of
using only correct solutions, CodeLutra applies iterative preference-based
refinement, comparing successful and failed outputs to better approximate
desired results. This approach narrows the performance gap with
state-of-the-art larger models without requiring massive datasets or auxiliary
models. For instance, on a challenging data science coding task, using only 500
samples improved Llama-3-8B’s accuracy from 28.2% to 48.6%, approaching GPT-4’s
level. By learning from both successes and mistakes, CodeLutra provides a
scalable and efficient path to high-quality code generation, making smaller
open-source models more competitive with leading closed-source alternatives.
[COMMENTS]
16 pages, 7 figures
[LINK]
http://arxiv.org/abs/2411.05199v2
[DATE]
2024-12-20 02:46:21+08:00
[CATEGORIES]
cs.CL
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
[AUTHORS]
Yatai Ji, Jiacheng Zhang, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian GE, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, Weilin Huang, Ping Luo
[ABSTRACT]
Text-to-video models have made remarkable advancements through optimization
on high-quality text-video pairs, where the textual prompts play a pivotal role
in determining the quality of output videos. However, achieving the desired output
often entails multiple revisions and iterative inference to refine
user-provided prompts. Current automatic methods for refining prompts encounter
challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware
when applied to text-to-video diffusion models. To address these problems, we
introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video,
which excels in crafting Video-Centric, Labor-Free and Preference-Aligned
prompts tailored to a specific video diffusion model. Our approach involves a
meticulously crafted two-stage optimization and alignment system. Initially, we
conduct a reward-guided prompt evolution pipeline to automatically create
a pool of optimal prompts and leverage it for supervised fine-tuning (SFT) of the
LLM. Then multi-dimensional rewards are employed to generate pairwise data for
the SFT model, followed by the direct preference optimization (DPO) algorithm
to further facilitate preference alignment. Through extensive experimentation
and comparative analyses, we validate the effectiveness of Prompt-A-Video
across diverse generation models, highlighting its potential to push the
boundaries of video generation.
[LINK]
http://arxiv.org/abs/2412.15156v1
[DATE]
2024-12-20 02:32:21+08:00
[CATEGORIES]
cs.CL
Language Models as Continuous Self-Evolving Data Engineers
[AUTHORS]
Peidong Wang, Ming Wang, Zhiming Ma, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang
[ABSTRACT]
Large Language Models (LLMs) have demonstrated remarkable capabilities on
various tasks, while their further evolution is limited by the lack of
high-quality training data. In addition, traditional training approaches rely
too much on expert-labeled data, setting an upper limit on the performance of
LLMs. To address this issue, we propose a novel paradigm that enables LLMs to
train themselves by autonomously generating, cleaning, reviewing, and annotating
data with preference information, named LANCE. Our approach demonstrates that
LLMs can serve as continuous self-evolving data engineers, significantly
reducing the time and cost of the post-training data construction process.
Through iterative fine-tuning on different variants of the Qwen2, we validate
the effectiveness of LANCE across various tasks, showing that it can
continuously improve model performance and maintain high-quality data
generation. Across eight benchmark dimensions, LANCE resulted in an average
score enhancement of 3.36 for Qwen2-7B and 2.70 for Qwen2-7B-Instruct. This
training paradigm with autonomous data construction not only reduces the
reliance on human experts or external models but also ensures that the data
aligns with human values and preferences, paving the way for the development of
future superintelligent systems that can exceed human capabilities.
[LINK]
http://arxiv.org/abs/2412.15151v1
[DATE]
2024-12-20 02:28:41+08:00
[CATEGORIES]
cs.CL
Adaptive Pruning for Large Language Models with Structural Importance Awareness
[AUTHORS]
Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han
[ABSTRACT]
The recent advancements in large language models (LLMs) have significantly
improved language understanding and generation capabilities. However, it is
difficult to deploy LLMs on resource-constrained edge devices due to their high
computational and storage resource demands. To address this issue, we propose a
novel LLM model pruning method, namely structurally-aware adaptive pruning
(SAAP), to significantly reduce the computational and memory costs while
maintaining model performance. We first define an adaptive importance fusion
metric to evaluate the importance of all coupled structures in LLMs by
considering their homoscedastic uncertainty. Then, we rank the importance of
all modules to determine the specific layers that should be pruned to meet
particular performance requirements. Furthermore, we develop a new group
fine-tuning strategy to improve the inference efficiency of LLMs. Finally, we
evaluate the proposed SAAP method on multiple LLMs across two common tasks,
i.e., zero-shot classification and text generation. Experimental results show
that our SAAP method outperforms several state-of-the-art baseline methods,
achieving 2.17%, 2.37%, and 2.39% accuracy gains on LLaMA-7B, Vicuna-7B, and
LLaMA-13B. Additionally, SAAP improves the token generation speed by 5%,
showcasing its practical advantages in resource-constrained scenarios.
[COMMENTS]
12 pages, 6 figures, 12 tables
[LINK]
http://arxiv.org/abs/2412.15127v1
[DATE]
2024-12-20 02:08:04+08:00
[CATEGORIES]
cs.CL
cs.LG
Outcome-Refining Process Supervision for Code Generation
[AUTHORS]
Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang
[ABSTRACT]
Large Language Models have demonstrated remarkable capabilities in code
generation, yet they often struggle with complex programming tasks that require
deep algorithmic reasoning. While process supervision through learned reward
models shows promise in guiding reasoning steps, it requires expensive training
data and suffers from unreliable evaluation. We propose Outcome-Refining
Process Supervision, a novel paradigm that treats outcome refinement itself as
the process to be supervised. Our framework leverages concrete execution
signals to ground the supervision of reasoning steps, while using
tree-structured exploration to maintain multiple solution trajectories
simultaneously. Experiments demonstrate that our approach enables even smaller
models to achieve high accuracy and strong performance metrics on competitive
programming tasks, and creates more reliable verification than traditional reward
models without requiring the training of PRMs. Our approach achieves significant
improvements across 5 models and 3 datasets: an average of 26.9% increase in
correctness and 42.2% in efficiency. The results suggest that providing
structured reasoning space with concrete verification signals is crucial for
solving complex programming tasks. We open-source all our code and data at:
https://github.com/zhuohaoyu/ORPS
[COMMENTS]
18 pages, 5 figures, Code: https://github.com/zhuohaoyu/ORPS
[LINK]
http://arxiv.org/abs/2412.15118v1
[DATE]
2024-12-20 01:59:42+08:00
[CATEGORIES]
cs.CL
cs.LG
URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base
[AUTHORS]
Aditya Khan, Mason Shipton, David Anugraha, Kaiyao Duan, Phuong H. Hoang, Eric Khiu, A. Seza Doğruöz, En-Shiun Annie Lee
[COMMENTS]
Accepted to COLING 2025
[LINK]
http://arxiv.org/abs/2409.18472v2
[DATE]
2024-12-20 01:57:43+08:00
[CATEGORIES]
cs.CL
cs.LG
Qwen2.5 Technical Report
[AUTHORS]
Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu
[ABSTRACT]
In this report, we introduce Qwen2.5, a comprehensive series of large
language models (LLMs) designed to meet diverse needs. Compared to previous
iterations, Qwen2.5 has been significantly improved during both the
pre-training and post-training stages. In terms of pre-training, we have scaled
the high-quality pre-training datasets from the previous 7 trillion tokens to
18 trillion tokens. This provides a strong foundation for common sense, expert
knowledge, and reasoning capabilities. In terms of post-training, we implement
intricate supervised finetuning with over 1 million samples, as well as
multistage reinforcement learning. Post-training techniques enhance human
preference, and notably improve long text generation, structural data analysis,
and instruction following. To handle diverse and varied use cases effectively,
we present the Qwen2.5 LLM series in a wide range of sizes. Open-weight offerings include base
and instruction-tuned models, with quantized versions available. In addition,
for hosted solutions, the proprietary models currently include two
mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both
available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier
performance on a wide range of benchmarks evaluating language understanding,
reasoning, mathematics, coding, human preference alignment, etc. Specifically,
the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and
proprietary models and performs competitively with the
state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5
times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness
while performing competitively against GPT-4o-mini and GPT-4o respectively.
Additionally, as the foundation, Qwen2.5 models have been instrumental in
training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and
multimodal models.
[LINK]
http://arxiv.org/abs/2412.15115v1
[DATE]
2024-12-20 01:56:09+08:00
[CATEGORIES]
cs.CL
Associative memory inspires improvements for in-context learning using a novel attention residual stream architecture
[AUTHORS]
Thomas F Burns, Tomoki Fukai, Christopher J Earls
[ABSTRACT]
Large language models (LLMs) demonstrate an impressive ability to utilise
information within the context of their input sequences to appropriately
respond to data unseen by the LLM during its training procedure. This ability
is known as in-context learning (ICL). Humans and non-human animals demonstrate
similar abilities; however, their neural architectures differ substantially from
LLMs. Despite this, a critical component within LLMs, the attention mechanism,
resembles modern associative memory models, widely used in and influenced by
the computational neuroscience community to model biological memory systems.
Using this connection, we introduce an associative memory model capable of
performing ICL. We use this as inspiration for a novel residual stream
architecture which allows information to directly flow between attention heads.
We test this architecture during training within a two-layer Transformer and
show its ICL abilities manifest more quickly than without this modification. We
then apply our architecture in small language models with 8 million parameters,
focusing on attention head values, with results also indicating improved ICL
performance at this larger and more naturalistic scale.
[COMMENTS]
18 pages, 6 figures, 3 tables
[LINK]
http://arxiv.org/abs/2412.15113v1
[DATE]
2024-12-20 01:55:42+08:00
[CATEGORIES]
cs.CL
Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization
[AUTHORS]
Tian Qin, Naomi Saphra, David Alvarez-Melis
[ABSTRACT]
Language models (LMs), like other neural networks, often favor shortcut
heuristics based on surface-level patterns. Although LMs behave like n-gram
models early in training, they must eventually learn hierarchical syntactic
representations to correctly apply grammatical rules out-of-distribution (OOD).
In this work, we use case studies of English grammar to explore how complex,
diverse training data drives models to generalize OOD. We construct a framework
that unifies our understanding of random variation with training dynamics, rule
selection with memorization, and data diversity with complexity. We show that
these factors are nuanced, and that intermediate levels of diversity and
complexity lead to inconsistent behavior across random seeds and to unstable
training dynamics. Our findings emphasize the critical role of training data in
shaping generalization patterns and illuminate how competing model strategies
lead to inconsistent generalization outcomes across random seeds. Code is
available at https://github.com/sunnytqin/concept_comp.git.
[LINK]
http://arxiv.org/abs/2412.04619v3
[DATE]
2024-12-20 01:51:34+08:00
[CATEGORIES]
cs.LG
cs.CL
Review-Then-Refine: A Dynamic Framework for Multi-Hop Question Answering with Temporal Adaptability
[AUTHORS]
Xiangsen Chen, Xuming Hu, Nan Tang
[ABSTRACT]
Retrieval-augmented generation (RAG) frameworks have emerged as a promising
solution to multi-hop question answering (QA) tasks since they enable large
language models (LLMs) to incorporate external knowledge and mitigate their
inherent knowledge deficiencies. Despite this progress, existing RAG
frameworks, which usually follow the retrieve-then-read paradigm, often
struggle with multi-hop QA with temporal information since they have difficulty
retrieving and synthesizing accurate time-related information. To address the
challenge, this paper proposes a novel framework called review-then-refine,
which aims to enhance LLM performance in multi-hop QA scenarios with temporal
information. Our approach begins with a review phase, where decomposed
sub-queries are dynamically rewritten with temporal information, allowing for
a subsequent adaptive retrieval and reasoning process. In addition, we implement
an adaptive retrieval mechanism to minimize unnecessary retrievals, thus reducing
the potential for hallucinations. In the subsequent refine phase, the LLM
synthesizes the retrieved information from each sub-query along with its
internal knowledge to formulate a coherent answer. Extensive experimental
results across multiple datasets demonstrate the effectiveness of our proposed
framework, highlighting its potential to significantly improve multi-hop QA
capabilities in LLMs.
[COMMENTS]
20 pages, 2 figures
[LINK]
http://arxiv.org/abs/2412.15101v1
[DATE]
2024-12-20 01:48:23+08:00
[CATEGORIES]
cs.CL
Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models
[AUTHORS]
Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai
[ABSTRACT]
This paper introduces Typhoon 2, a series of text and multimodal large
language models optimized for the Thai language. The series includes models for
text, vision, and audio. Typhoon2-Text builds on state-of-the-art open models,
such as Llama 3 and Qwen2, and we perform continual pre-training on a mixture
of English and Thai data. We employ post-training techniques to enhance Thai
language performance while preserving the base models’ original capabilities.
We release text models across a range of sizes, from 1 to 70 billion
parameters, available in both base and instruction-tuned variants. To guardrail
text generation, we release Typhoon2-Safety, a classifier enhanced for Thai
cultures and language. Typhoon2-Vision improves Thai document understanding
while retaining general visual capabilities, such as image captioning.
Typhoon2-Audio introduces an end-to-end speech-to-speech model architecture
capable of processing audio, speech, and text inputs and generating both text
and speech outputs.
[COMMENTS]
technical report, 55 pages
[LINK]
http://arxiv.org/abs/2412.13702v2
[DATE]
2024-12-20 01:36:38+08:00
[CATEGORIES]
cs.CL
LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
[AUTHORS]
Felix Friedrich, Simone Tedeschi, Patrick Schramowski, Manuel Brack, Roberto Navigli, Huu Nguyen, Bo Li, Kristian Kersting
[LINK]
http://arxiv.org/abs/2412.15035v1
[DATE]
2024-12-20 00:46:54+08:00
[CATEGORIES]
cs.CL
Identifying Query-Relevant Neurons in Large Language Models for Long-Form Texts
[AUTHORS]
Lihu Chen, Adam Dejl, Francesca Toni
[ABSTRACT]
Large Language Models (LLMs) possess vast amounts of knowledge within their
parameters, prompting research into methods for locating and editing this
knowledge. Previous work has largely focused on locating entity-related (often
single-token) facts in smaller models. However, several key questions remain
unanswered: (1) How can we effectively locate query-relevant neurons in
decoder-only LLMs, such as Llama and Mistral? (2) How can we address the
challenge of long-form (or free-form) text generation? (3) Are there localized
knowledge regions in LLMs? In this study, we introduce Query-Relevant Neuron
Cluster Attribution (QRNCA), a novel architecture-agnostic framework capable of
identifying query-relevant neurons in LLMs. QRNCA allows for the examination of
long-form answers beyond triplet facts by employing the proxy task of
multi-choice question answering. To evaluate the effectiveness of our detected
neurons, we build two multi-choice QA datasets spanning diverse domains and
languages. Empirical evaluations demonstrate that our method outperforms
baseline methods significantly. Further, analysis of neuron distributions
reveals the presence of visible localized regions, particularly within
different domains. Finally, we show potential applications of our detected
neurons in knowledge editing and neuron-based prediction.
[COMMENTS]
AAAI 2025 Main Track
[LINK]
http://arxiv.org/abs/2406.10868v4
[DATE]
2024-12-20 00:22:07+08:00
[CATEGORIES]
cs.CL
Large Language Models and Code Security: A Systematic Literature Review
[AUTHORS]
Enna Basic, Alberto Giaretta
[ABSTRACT]
Large Language Models (LLMs) have emerged as powerful tools for automating
various programming tasks, including security-related ones, such as detecting
and fixing vulnerabilities. Despite their promising capabilities, when required
to produce or modify pre-existing code, LLMs could introduce vulnerabilities
unbeknown to the programmer. When analyzing code, they could miss clear
vulnerabilities or signal nonexistent ones. In this Systematic Literature
Review (SLR), we aim to investigate both the security benefits and potential
drawbacks of using LLMs for a variety of code-related tasks. In particular,
first we focus on the types of vulnerabilities that could be introduced by
LLMs, when used for producing code. Second, we analyze the capabilities of LLMs
to detect and fix vulnerabilities, in any given code, and how the prompting
strategy of choice impacts their performance in these two tasks. Last, we
provide an in-depth analysis on how data poisoning attacks on LLMs can impact
performance in the aforementioned tasks.
[LINK]
http://arxiv.org/abs/2412.15004v1
[DATE]
2024-12-20 00:20:22+08:00
[CATEGORIES]
cs.CL
Scaling 4D Representations
[AUTHORS]
João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polanía, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Thomas Kipf, Carl Doersch, Viorica Pătrăucean, Dima Damen, Pauline Luc, Mehdi S. M. Sajjadi, Andrew Zisserman
[ABSTRACT]
Scaling has not yet been convincingly demonstrated for pure self-supervised
learning from video. However, prior work has focused evaluations on
semantic-related tasks – action classification, ImageNet
classification, etc. In this paper we focus on evaluating self-supervised
learning on non-semantic vision tasks that are more spatial (3D) and temporal
(+1D = 4D), such as camera pose estimation, point and object tracking, and
depth estimation. We show that by learning from very large video datasets,
masked auto-encoding (MAE) with transformer video models actually scales,
consistently improving performance on these 4D tasks, as model size increases
from 20M all the way to by far the largest self-supervised video model reported
– 22B parameters. Rigorous apples-to-apples comparison with
many recent image and video models demonstrates the benefits of scaling 4D
representations.
[LINK]
http://arxiv.org/abs/2412.15212v1
[DATE]
2024-12-20 02:59:51+08:00
[CATEGORIES]
cs.LG
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving
[AUTHORS]
Shuo Xing, Hongyuan Hua, Xiangbo Gao, Shenzhe Zhu, Renjie Li, Kexin Tian, Xiaopeng Li, Heng Huang, Tianbao Yang, Zhangyang Wang, Yang Zhou, Huaxiu Yao, Zhengzhong Tu
[ABSTRACT]
Recent advancements in large vision language models (VLMs) tailored for
autonomous driving (AD) have shown strong scene understanding and reasoning
capabilities, making them undeniable candidates for end-to-end driving systems.
However, limited work exists on studying the trustworthiness of DriveVLMs – a
critical factor that directly impacts public transportation safety. In this
paper, we introduce AutoTrust, a comprehensive trustworthiness benchmark for
large vision-language models in autonomous driving (DriveVLMs), considering
diverse perspectives – including trustfulness, safety, robustness, privacy,
and fairness. We constructed the largest visual question-answering dataset for
investigating trustworthiness issues in driving scenarios, comprising over 10k
unique scenes and 18k queries. We evaluated six publicly available VLMs,
spanning from generalist to specialist, from open-source to commercial models.
Our exhaustive evaluations have unveiled previously undiscovered
vulnerabilities of DriveVLMs to trustworthiness threats. Specifically, we found
that general VLMs like LLaVA-v1.6 and GPT-4o-mini surprisingly outperform
specialized models fine-tuned for driving in terms of overall trustworthiness.
DriveVLMs like DriveLM-Agent are particularly vulnerable to disclosing
sensitive information. Additionally, both generalist and specialist VLMs remain
susceptible to adversarial attacks and struggle to ensure unbiased
decision-making across diverse environments and populations. Our findings call
for immediate and decisive action to address the trustworthiness of DriveVLMs
– an issue of critical importance to public safety and the welfare of all
citizens relying on autonomous transportation systems. Our benchmark is
publicly available at https://github.com/taco-group/AutoTrust, and the
leaderboard is released at https://taco-group.github.io/AutoTrust/.
[COMMENTS]
55 pages, 14 figures
[LINK]
http://arxiv.org/abs/2412.15206v1
[DATE]
2024-12-20 02:59:33+08:00
[CATEGORIES]
cs.LG
LiDAR-RT: Gaussian-based Ray Tracing for Dynamic LiDAR Re-simulation
[AUTHORS]
Chenxu Zhou, Lvchang Fu, Sida Peng, Yunzhi Yan, Zhanhua Zhang, Yong Chen, Jiazhi Xia, Xiaowei Zhou
[ABSTRACT]
This paper targets the challenge of real-time LiDAR re-simulation in dynamic
driving scenarios. Recent approaches utilize neural radiance fields combined
with the physical modeling of LiDAR sensors to achieve high-fidelity
re-simulation results. Unfortunately, these methods face limitations due to
high computational demands in large-scale scenes and cannot perform real-time
LiDAR rendering. To overcome these constraints, we propose LiDAR-RT, a novel
framework that supports real-time, physically accurate LiDAR re-simulation for
driving scenes. Our primary contribution is the development of an efficient and
effective rendering pipeline, which integrates Gaussian primitives and
hardware-accelerated ray tracing technology. Specifically, we model the
physical properties of LiDAR sensors using Gaussian primitives with learnable
parameters and incorporate scene graphs to handle scene dynamics. Building upon
this scene representation, our framework first constructs a bounding volume
hierarchy (BVH), then casts rays for each pixel and generates novel LiDAR views
through a differentiable rendering algorithm. Importantly, our framework
supports realistic rendering with flexible scene editing operations and various
sensor configurations. Extensive experiments across multiple public benchmarks
demonstrate that our method outperforms state-of-the-art methods in terms of
rendering quality and efficiency. Our project page is at
https://zju3dv.github.io/lidar-rt.
[COMMENTS]
Project page: https://zju3dv.github.io/lidar-rt
[LINK]
http://arxiv.org/abs/2412.15199v1
[DATE]
2024-12-20 02:58:36+08:00
[CATEGORIES]
cs.LG
Preventing Local Pitfalls in Vector Quantization via Optimal Transport
[AUTHORS]
Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
[ABSTRACT]
Vector-quantized networks (VQNs) have exhibited remarkable performance across
various tasks, yet they are prone to training instability, which complicates
the training process due to the necessity for techniques such as subtle
initialization and model distillation. In this study, we identify the local
minima issue as the primary cause of this instability. To address this, we
integrate an optimal transport method in place of the nearest neighbor search
to achieve a more globally informed assignment. We introduce OptVQ, a novel
vector quantization method that employs the Sinkhorn algorithm to optimize the
optimal transport problem, thereby enhancing the stability and efficiency of
the training process. To mitigate the influence of diverse data distributions
on the Sinkhorn algorithm, we implement a straightforward yet effective
normalization strategy. Our comprehensive experiments on image reconstruction
tasks demonstrate that OptVQ achieves 100% codebook utilization and surpasses
current state-of-the-art VQNs in reconstruction quality.
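The contrast between nearest-neighbor search and an optimal-transport assignment can be sketched on a toy one-dimensional example. This is not the OptVQ implementation: the function name `sinkhorn_assign`, the uniform marginals, the squared-distance cost, and the values of `eps` and `n_iters` are assumptions chosen only for illustration.

```python
import math

def sinkhorn_assign(cost, eps=0.1, n_iters=200):
    """Entropic optimal transport between n features and k codes.

    cost[i][j] is the distance between feature i and code j.
    Both marginals are assumed uniform. Returns the transport plan.
    """
    n, k = len(cost), len(cost[0])
    # Gibbs kernel: a smaller cost gives a larger affinity.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * k
    a = 1.0 / n  # mass carried by each feature
    b = 1.0 / k  # mass each code should receive
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a / sum(K[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [b / sum(K[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * K[i][j] * v[j] for j in range(k)] for i in range(n)]

# Toy data (hypothetical): four scalar features and two scalar codes.
features = [0.0, 0.1, 0.2, 0.3]
codes = [0.0, 1.0]
cost = [[(x - c) ** 2 for c in codes] for x in features]

nearest = [min(range(len(codes)), key=lambda j: row[j]) for row in cost]
plan = sinkhorn_assign(cost)
transport = [max(range(len(codes)), key=lambda j: row[j]) for row in plan]
print("nearest neighbor:", nearest)
print("optimal transport:", transport)
```

In this toy case every feature is closest to the first code, so nearest-neighbor search leaves the second code unused, whereas the transport plan is forced to send half of the total mass to each code.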
[COMMENTS]
Code is available at https://github.com/zbr17/OptVQ
[LINK]
http://arxiv.org/abs/2412.15195v1
[DATE]
2024-12-20 02:58:14+08:00
[CATEGORIES]
cs.LG
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
[AUTHORS]
Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
[ABSTRACT]
We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video
generation that leverages the activations of frozen video and audio diffusion
models for temporally-aligned cross-modal conditioning. The key to our
framework is a Fusion Block that enables bidirectional information exchange
between our backbone video and audio diffusion models through a
temporally-aligned self-attention operation. Unlike prior work that uses
feature extractors pretrained for other tasks for the conditioning signal,
AV-Link can directly leverage features obtained by the complementary modality
in a single framework, i.e., video features to generate audio, or audio features
to generate video. We extensively evaluate our design choices and demonstrate
the ability of our method to achieve synchronized and high-quality audiovisual
content, showcasing its potential for applications in immersive media
generation. Project Page: snap-research.github.io/AVLink/
[COMMENTS]
Project Page: snap-research.github.io/AVLink/
[LINK]
http://arxiv.org/abs/2412.15191v1
[DATE]
2024-12-20 02:57:21+08:00
[CATEGORIES]
cs.LG
STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning
[AUTHORS]
Marius Memmel, Jacob Berg, Bingqing Chen, Abhishek Gupta, Jonathan Francis
[ABSTRACT]
Robot learning is witnessing a significant increase in the size, diversity,
and complexity of pre-collected datasets, mirroring trends in domains such as
natural language processing and computer vision. Many robot learning methods
treat such datasets as multi-task expert data and learn a multi-task,
generalist policy by training broadly across them. Notably, while these
generalist policies can improve the average performance across many tasks, the
performance of generalist policies on any one task is often suboptimal due to
negative transfer between partitions of the data, compared to task-specific
specialist policies. In this work, we argue for the paradigm of training
policies during deployment given the scenarios they encounter: rather than
deploying pre-trained policies to unseen problems in a zero-shot manner, we
non-parametrically retrieve and train models directly on relevant data at test
time. Furthermore, we show that many robotics tasks share considerable amounts
of low-level behaviors and that retrieval at the “sub”-trajectory granularity
enables significantly improved data utilization, generalization, and robustness
in adapting policies to novel problems. In contrast, existing full-trajectory
retrieval methods tend to underutilize the data and miss out on shared
cross-task content. This work proposes STRAP, a technique for leveraging
pre-trained vision foundation models and dynamic time warping to retrieve
sub-sequences of trajectories from large training corpora in a robust fashion.
STRAP outperforms both prior retrieval algorithms and multi-task learning
methods in simulated and real experiments, showing the ability to scale to much
larger offline datasets in the real world as well as the ability to learn
robust control policies with just a handful of real-world demonstrations.
[COMMENTS]
Project website at https://weirdlabuw.github.io/strap/
[LINK]
http://arxiv.org/abs/2412.15182v1
[DATE]
2024-12-20 02:54:06+08:00
[CATEGORIES]
cs.LG
HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages
[AUTHORS]
Aman Chaturvedi, Daniel Nichols, Siddharth Singh, Abhinav Bhatele
[ABSTRACT]
Large Language Model (LLM) based coding tools have been tremendously
successful as software development assistants, yet they are often designed for
general purpose programming tasks and perform poorly for more specialized
domains such as high performance computing. Creating specialized models and
tools for these domains is crucial towards gaining the benefits of LLMs in
areas such as HPC. While previous work has explored HPC-specific models, LLMs
still struggle to generate parallel code and it is not at all clear what
hurdles are still holding back these LLMs and what must be done to overcome
them. In this work, we conduct an in-depth study along the many axes of
fine-tuning a specialized HPC LLM in order to better understand the challenges.
Based on our findings we fine-tune and evaluate a specialized HPC LLM that is
shown to be the best performing open-source code LLM for parallel code
generation to date.
[LINK]
http://arxiv.org/abs/2412.15178v1
[DATE]
2024-12-20 02:52:05+08:00
[CATEGORIES]
cs.LG
Rethinking Uncertainty Estimation in Natural Language Generation
[AUTHORS]
Lukas Aichberger, Kajetan Schweighofer, Sepp Hochreiter
[ABSTRACT]
Large Language Models (LLMs) are increasingly employed in real-world
applications, driving the need to evaluate the trustworthiness of their
generated text. To this end, reliable uncertainty estimation is essential.
Since current LLMs generate text autoregressively through a stochastic process,
the same prompt can lead to varying outputs. Consequently, leading uncertainty
estimation methods generate and analyze multiple output sequences to determine
the LLM’s uncertainty. However, generating output sequences is computationally
expensive, making these methods impractical at scale. In this work, we inspect
the theoretical foundations of the leading methods and explore new directions
to enhance their computational efficiency. Building on the framework of proper
scoring rules, we find that the negative log-likelihood of the most likely
output sequence constitutes a theoretically grounded uncertainty measure. To
approximate this alternative measure, we propose G-NLL, which has the advantage
of being obtained using only a single output sequence generated by greedy
decoding. This makes uncertainty estimation more efficient and straightforward,
while preserving theoretical rigor. Empirical results demonstrate that G-NLL
achieves state-of-the-art performance across various LLMs and tasks. Our work
lays the foundation for efficient and reliable uncertainty estimation in
natural language generation, challenging the necessity of more computationally
involved methods currently leading the field.
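The quantity described above, the negative log-likelihood of a single greedily decoded sequence, can be sketched in a few lines. The interface is an assumption: `next_token_probs` is a hypothetical callable that maps a token prefix to a dictionary of next-token probabilities, standing in for a real language model, and `toy_model` exists only to make the sketch runnable.

```python
import math

def greedy_nll(next_token_probs, eos, max_len=50):
    """Decode greedily and accumulate the sequence negative log-likelihood."""
    tokens, nll = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)  # greedy choice at this step
        nll -= math.log(probs[token])
        tokens.append(token)
        if token == eos:
            break
    return tokens, nll

def toy_model(prefix):
    # Hypothetical distributions that depend only on the prefix length.
    if len(prefix) == 0:
        return {"yes": 0.8, "no": 0.2}
    return {"<eos>": 0.5, "yes": 0.25, "no": 0.25}

tokens, score = greedy_nll(toy_model, eos="<eos>")
print(tokens, score)
```

A higher score means the model assigns lower probability to its own greedy output; only one decoding pass is needed, in contrast to methods that sample several sequences.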
[LINK]
http://arxiv.org/abs/2412.15176v1
[DATE]
2024-12-20 02:51:06+08:00
[CATEGORIES]
cs.LG
S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity
[AUTHORS]
Xinyu Yang, Jixuan Leng, Geyang Guo, Jiawei Zhao, Ryumei Nakada, Linjun Zhang, Huaxiu Yao, Beidi Chen
[ABSTRACT]
Current PEFT methods for LLMs can achieve either high quality, efficient
training, or scalable serving, but not all three simultaneously. To address
this limitation, we investigate sparse fine-tuning and observe a remarkable
improvement in generalization ability. Utilizing this key insight, we propose a
family of Structured Sparse Fine-Tuning (S$^{2}$FT) methods for LLMs, which
concurrently achieve state-of-the-art fine-tuning performance, training
efficiency, and inference scalability. S$^{2}$FT accomplishes this by
“selecting sparsely and computing densely”. It selects a few heads and channels
in the MHA and FFN modules for each Transformer block, respectively. Next, it
co-permutes weight matrices on both sides of the coupled structures in LLMs to
connect the selected components in each layer into a dense submatrix. Finally,
S$^{2}$FT performs in-place gradient updates on all submatrices. Through
theoretical analysis and empirical results, our method prevents forgetting
while simplifying optimization, delivers SOTA performance on both commonsense
and arithmetic reasoning with 4.6% and 1.3% average improvements compared to
LoRA, and surpasses full FT by 11.5% when generalizing to various domains after
instruction tuning. Using our partial backpropagation algorithm, S$^{2}$FT
saves training memory up to 3$\times$ and improves latency by 1.5-2.7$\times$
compared to full FT, while delivering an average 10% improvement over LoRA on
both metrics. We further demonstrate that the weight updates in S$^{2}$FT can
be decoupled into adapters, enabling effective fusion, fast switch, and
efficient parallelism for serving multiple fine-tuned models.
[LINK]
http://arxiv.org/abs/2412.06289v3
[DATE]
2024-12-20 02:47:54+08:00
[CATEGORIES]
cs.LG
Leveraging Color Channel Independence for Improved Unsupervised Object Detection
[AUTHORS]
Bastian Jäckl, Yannick Metz, Udo Schlegel, Daniel A. Keim, Maximilian T. Fischer
[ABSTRACT]
Object-centric architectures can learn to extract distinct object
representations from visual scenes, enabling downstream applications on the
object level. Similarly to autoencoder-based image models, object-centric
approaches have been trained on the unsupervised reconstruction loss of images
encoded by RGB color spaces. In our work, we challenge the common assumption
that RGB images are the optimal color space for unsupervised learning in
computer vision. We argue conceptually and show empirically that other color
spaces, such as HSV, bear essential characteristics for object-centric
representation learning, like robustness to lighting conditions. We further
show that models improve when requiring them to predict additional color
channels. Specifically, we propose to transform the predicted targets to the
RGB-S space, which extends RGB with HSV’s saturation component and leads to
markedly better reconstruction and disentanglement for five common evaluation
datasets. The use of composite color spaces can be implemented with basically
no computational overhead, is agnostic of the models’ architecture, and is
universally applicable across a wide range of visual computing tasks and
training types. The findings of our approach encourage additional
investigations in computer vision tasks beyond object-centric learning.
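As a rough illustration of the composite color space described above, the sketch below appends the HSV saturation to an RGB triple. The function name `rgb_to_rgbs` and the per-pixel interface are assumptions; channel values are assumed to lie in [0, 1], as required by Python's `colorsys` module.

```python
import colorsys

def rgb_to_rgbs(r, g, b):
    """Extend an RGB pixel with the saturation channel of its HSV form."""
    _, s, _ = colorsys.rgb_to_hsv(r, g, b)
    return (r, g, b, s)

# Pure red is fully saturated; a gray pixel has zero saturation.
print(rgb_to_rgbs(1.0, 0.0, 0.0))
print(rgb_to_rgbs(0.5, 0.5, 0.5))
```

Because the extra channel is a fixed function of the RGB values, it can be computed once when preparing reconstruction targets, which is consistent with the abstract's remark about negligible overhead.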
[COMMENTS]
38 pages incl. references, 16 figures
[LINK]
http://arxiv.org/abs/2412.15150v1
[DATE]
2024-12-20 02:28:37+08:00
[CATEGORIES]
cs.LG
Jet: A Modern Transformer-Based Normalizing Flow
[AUTHORS]
Alexander Kolesnikov, André Susano Pinto, Michael Tschannen
[ABSTRACT]
In the past, normalizing generative flows have emerged as a promising class
of generative models for natural images. This type of model has many modeling
advantages: the ability to efficiently compute log-likelihood of the input
data, fast generation and simple overall structure. Normalizing flows remained
a topic of active research but later fell out of favor, as visual quality of
the samples was not competitive with other model classes, such as GANs,
VQ-VAE-based approaches or diffusion models. In this paper we revisit the
design of the coupling-based normalizing flow models by carefully ablating
prior design choices and using computational blocks based on the Vision
Transformer architecture, not convolutional neural networks. As a result, we
achieve state-of-the-art quantitative and qualitative performance with a much
simpler architecture. While the overall visual quality is still behind the
current state-of-the-art models, we argue that strong normalizing flow models
can help advance the research frontier by serving as building components of more
powerful generative models.
[LINK]
http://arxiv.org/abs/2412.15129v1
[DATE]
2024-12-20 02:09:42+08:00
[CATEGORIES]
cs.LG
Tests for model misspecification in simulation-based inference: from local distortions to global model checks
[AUTHORS]
Noemi Anau Montel, James Alvey, Christoph Weniger
[ABSTRACT]
Model misspecification analysis strategies, such as anomaly detection, model
validation, and model comparison are a key component of scientific model
development. Over the last few years, there has been a rapid rise in the use of
simulation-based inference (SBI) techniques for Bayesian parameter estimation,
applied to increasingly complex forward models. To move towards fully
simulation-based analysis pipelines, however, there is an urgent need for a
comprehensive simulation-based framework for model misspecification analysis.
In this work, we provide a solid and flexible foundation for a wide range of
model discrepancy analysis tasks, using distortion-driven model
misspecification tests. From a theoretical perspective, we introduce the
statistical framework built around performing many hypothesis tests for
distortions of the simulation model. We also make explicit analytic connections
to classical techniques: anomaly detection, model validation, and
goodness-of-fit residual analysis. Furthermore, we introduce an efficient
self-calibrating training algorithm that is useful for practitioners. We
demonstrate the performance of the framework in multiple scenarios, making the
connection to classical results where they are valid. Finally, we show how to
conduct such a distortion-driven model misspecification test for real
gravitational wave data, specifically on the event GW150914.
[COMMENTS]
11 pages, 5 figures. Code available on github (NoemiAM/mist) at
https://github.com/NoemiAM/mist
[LINK]
http://arxiv.org/abs/2412.15100v1
[DATE]
2024-12-20 01:48:03+08:00
[CATEGORIES]
cs.LG
A Full Transformer-based Framework for Automatic Pain Estimation using Videos
[AUTHORS]
Stefanos Gkikas, Manolis Tsiknakis
[ABSTRACT]
The automatic estimation of pain is essential in designing an optimal pain
management system offering reliable assessment and reducing the suffering of
patients. In this study, we present a novel full transformer-based framework
consisting of a Transformer in Transformer (TNT) model and a Transformer
leveraging cross-attention and self-attention blocks. Elaborating on videos
from the BioVid database, we demonstrate state-of-the-art performances, showing
the efficacy, efficiency, and generalization capability across all the primary
pain estimation tasks.
[LINK]
http://arxiv.org/abs/2412.15095v1
[DATE]
2024-12-20 01:45:08+08:00
[CATEGORIES]
cs.LG
Learning Disentangled Equivariant Representation for Explicitly Controllable 3D Molecule Generation
[AUTHORS]
Haoran Liu, Youzhi Luo, Tianxiao Li, James Caverlee, Martin Renqiang Min
[ABSTRACT]
We consider the conditional generation of 3D drug-like molecules with
explicit control over molecular properties such as drug-like
properties (e.g., Quantitative Estimate of Druglikeness or Synthetic
Accessibility score) and effective binding to specific protein sites. To
tackle this problem, we propose an E(3)-equivariant Wasserstein autoencoder and
factorize the latent space of our generative model into two disentangled
aspects: molecular properties and the remaining structural context of 3D
molecules. Our model ensures explicit control over these molecular attributes
while maintaining equivariance of coordinate representation and invariance of
data likelihood. Furthermore, we introduce a novel alignment-based coordinate
loss to adapt equivariant networks for auto-regressive de-novo 3D molecule
generation from scratch. Extensive experiments validate our model’s
effectiveness on property-guided and context-guided molecule generation, both
for de-novo 3D molecule design and structure-based drug discovery against
protein targets.
[COMMENTS]
AAAI 2025
[LINK]
http://arxiv.org/abs/2412.15086v1
[DATE]
2024-12-20 01:33:56+08:00
[CATEGORIES]
cs.LG
Latent Ewald summation for machine learning of long-range interactions
[AUTHORS]
Bingqing Cheng
[ABSTRACT]
Machine learning interatomic potentials (MLIPs) often neglect long-range
interactions, such as electrostatic and dispersion forces. In this work, we
introduce a straightforward and efficient method to account for long-range
interactions by learning a latent variable from local atomic descriptors and
applying an Ewald summation to this variable. We demonstrate that in systems
including charged and polar molecular dimers, bulk water, and the water-vapor
interface, standard short-ranged MLIPs can lead to unphysical predictions even
when employing message passing. The long-range models effectively eliminate
these artifacts, with only about twice the computational cost of short-range
MLIPs.
[LINK]
http://arxiv.org/abs/2408.15165v2
[DATE]
2024-12-20 01:11:11+08:00
[CATEGORIES]
cs.LG
Revisiting Machine Unlearning with Dimensional Alignment
[AUTHORS]
Seonguk Seo, Dongwan Kim, Bohyung Han
[ABSTRACT]
Machine unlearning, an emerging research topic focusing on compliance with
data privacy regulations, enables trained models to remove the information
learned from specific data. While many existing methods indirectly address this
issue by intentionally injecting incorrect supervisions, they can drastically
and unpredictably alter the decision boundaries and feature spaces, leading to
training instability and undesired side effects. To fundamentally approach this
task, we first analyze the changes in latent feature spaces between original
and retrained models, and observe that the feature representations of samples
not involved in training are closely aligned with the feature manifolds of
previously seen samples in training. Based on these findings, we introduce a
novel evaluation metric for machine unlearning, coined dimensional alignment,
which measures the alignment between the eigenspaces of the forget and retain
set samples. We employ this metric as a regularizer loss to build a robust and
stable unlearning framework, which is further enhanced by integrating a
self-distillation loss and an alternating training scheme. Our framework
effectively eliminates information from the forget set and preserves knowledge
from the retain set. Lastly, we identify critical flaws in established
evaluation metrics for machine unlearning, and introduce new evaluation tools
that more accurately reflect the fundamental goals of machine unlearning.
[LINK]
http://arxiv.org/abs/2407.17710v2
[DATE]
2024-12-20 00:48:59+08:00
[CATEGORIES]
cs.LG
Metric Compatible Training for Online Backfilling in Large-Scale Retrieval
[AUTHORS]
Seonguk Seo, Mustafa Gokhan Uzunbas, Bohyung Han, Sara Cao, Ser-Nam Lim
[ABSTRACT]
Backfilling is the process of re-extracting all gallery embeddings from
upgraded models in image retrieval systems. It inevitably requires a
prohibitively large amount of computational cost and even entails the downtime
of the service. Although backward-compatible learning sidesteps this challenge
by tackling query-side representations, this leads to suboptimal solutions in
principle because gallery embeddings cannot benefit from model upgrades. We
address this dilemma by introducing an online backfilling algorithm, which
enables us to achieve a progressive performance improvement during the
backfilling process while not sacrificing the final performance of the new model
after the completion of backfilling. To this end, we first propose a simple
distance rank merge technique for online backfilling. Then, we incorporate a
reverse transformation module for more effective and efficient merging, which
is further enhanced by adopting a metric-compatible contrastive learning
approach. These two components help to make the distances of old and new models
compatible, resulting in desirable merge results during backfilling with no
extra computational overhead. Extensive experiments show the effectiveness of
our framework on four standard benchmarks in various settings.
[LINK]
http://arxiv.org/abs/2301.03767v2
[DATE]
2024-12-20 00:45:52+08:00
[CATEGORIES]
cs.LG
DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT Space
[AUTHORS]
Mang Ning, Mingxiao Li, Jianlin Su, Haozhe Jia, Lanmiao Liu, Martin Beneš, Albert Ali Salah, Itir Onal Ertugrul
[ABSTRACT]
This paper explores image modeling from the frequency space and introduces
DCTdiff, an end-to-end diffusion generative paradigm that efficiently models
images in the discrete cosine transform (DCT) space. We investigate the design
space of DCTdiff and reveal the key design factors. Experiments on different
frameworks (UViT, DiT), generation tasks, and various diffusion samplers
demonstrate that DCTdiff outperforms pixel-based diffusion models regarding
generative quality and training efficiency. Remarkably, DCTdiff can seamlessly
scale up to high-resolution generation without using the latent diffusion
paradigm. Finally, we illustrate several intriguing properties of DCT image
modeling. For example, we provide a theoretical proof of why ‘image diffusion
can be seen as spectral autoregression’, bridging the gap between diffusion and
autoregressive models. The effectiveness of DCTdiff and the introduced
properties suggest a promising direction for image modeling in the frequency
space. The code is at https://github.com/forever208/DCTdiff.
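For readers unfamiliar with the transform, the following is a minimal sketch of an unnormalized one-dimensional DCT-II. It only illustrates what moving a signal into DCT space means; it is not taken from the DCTdiff code, images would use a two-dimensional block transform, and the function name `dct2` is hypothetical.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a real sequence."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]

# A constant signal has all of its energy in the lowest-frequency coefficient.
coeffs = dct2([1.0, 1.0, 1.0, 1.0])
print([round(c, 6) for c in coeffs])
```

Smooth signals concentrate their energy in the first few coefficients, which is the property that makes a frequency-space representation compact.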
[COMMENTS]
23 pages
[LINK]
http://arxiv.org/abs/2412.15032v1
[DATE]
2024-12-20 00:44:01+08:00
[CATEGORIES]
cs.LG
Latent Variable Sequence Identification for Cognitive Models with Neural Network Estimators
[AUTHORS]
Ti-Fen Pan, Jing-Jing Li, Bill Thompson, Anne Collins
[ABSTRACT]
Extracting time-varying latent variables from computational cognitive models
is a key step in model-based neural analysis, which aims to understand the
neural correlates of cognitive processes. However, existing methods only allow
researchers to infer latent variables that explain subjects’ behavior in a
relatively small class of cognitive models. For example, a broad class of
relevant cognitive models with analytically intractable likelihood is currently
out of reach from standard techniques, based on Maximum a Posteriori parameter
estimation. Here, we present an approach that extends neural Bayes estimation
to learn a direct mapping between experimental data and the targeted latent
variable space using recurrent neural networks and simulated datasets. We show
that our approach achieves competitive performance in inferring latent variable
sequences in both tractable and intractable models. Furthermore, the approach
is generalizable across different computational models and is adaptable for
both continuous and discrete latent spaces. We then demonstrate its
applicability in real world datasets. Our work underscores that combining
recurrent neural networks and simulation-based inference to identify latent
variable sequences can enable researchers to access a wider class of cognitive
models for model-based neural analyses, and thus test a broader set of
theories.
[LINK]
http://arxiv.org/abs/2406.14742v2
[DATE]
2024-12-20 00:37:38+08:00
[CATEGORIES]
cs.LG
Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls
[AUTHORS]
Riccardo Fosco Gramaccioni, Christian Marinoni, Emilian Postolache, Marco Comunità, Luca Cosmo, Joshua D. Reiss, Danilo Comminiello
[ABSTRACT]
Sound designers and Foley artists usually sonorize a scene, such as from a
movie or video game, by manually annotating and sonorizing each action of
interest in the video. In our case, the intent is to leave full creative
control to sound designers with a tool that allows them to bypass the more
repetitive parts of their work, thus being able to focus on the creative
aspects of sound production. We achieve this by presenting Stable-V2A, a two-stage
model consisting of: an RMS-Mapper that estimates an envelope representative of
the audio characteristics associated with the input video; and Stable-Foley, a
diffusion model based on Stable Audio Open that generates audio semantically
and temporally aligned with the target video. Temporal alignment is guaranteed
by the use of the envelope as a ControlNet input, while semantic alignment is
achieved through the use of sound representations chosen by the designer as
cross-attention conditioning of the diffusion process. We train and test our
model on Greatest Hits, a dataset commonly used to evaluate V2A models. In
addition, to test our model on a case study of interest, we introduce Walking
The Maps, a dataset of videos extracted from video games depicting animated
characters walking in different locations. Samples and code available on our
demo page at https://ispamm.github.io/Stable-V2A.
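The envelope used as a temporal control signal can be pictured as a frame-wise root-mean-square curve. The sketch below computes such a curve directly from audio samples; this is an assumption made for illustration, since the abstract's RMS-Mapper estimates the envelope from video, and the function name `rms_envelope` and its `frame_size` and `hop` parameters are hypothetical.

```python
import math

def rms_envelope(samples, frame_size, hop):
    """Root-mean-square value of each frame of a sampled signal."""
    envelope = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        envelope.append(math.sqrt(sum(s * s for s in frame) / frame_size))
    return envelope

# A loud frame followed by silence.
print(rms_envelope([1, -1, 1, -1, 0, 0, 0, 0], frame_size=4, hop=4))
```

Peaks in the envelope mark when sound events occur, which is why such a curve is a natural carrier of timing information.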
[LINK]
http://arxiv.org/abs/2412.15023v1
[DATE]
2024-12-20 00:37:19+08:00
[CATEGORIES]
cs.LG
Learning from Linear Algebra: A Graph Neural Network Approach to Preconditioner Design for Conjugate Gradient Solvers
[AUTHORS]
Vladislav Trifonov, Alexander Rudikov, Oleg Iliev, Yuri M. Laevsky, Ivan Oseledets, Ekaterina Muravleva
[ABSTRACT]
Large linear systems are ubiquitous in modern computational science and
engineering. The main recipe for solving them is the use of Krylov subspace
iterative methods with well-designed preconditioners. Deep learning models can
be used as nonlinear preconditioners during the iteration of linear solvers
such as the conjugate gradient (CG) method. Neural network models require an
enormous number of parameters to approximate well in this setup. Another
approach is to take advantage of small graph neural networks (GNNs) to
construct preconditioners with predefined sparsity patterns. Recently, GNNs
have been shown to be a promising tool for designing preconditioners to reduce
the overall computational cost of iterative methods by constructing them more
efficiently than with classical linear algebra techniques. However,
preconditioners designed with these approaches cannot outperform those designed
with classical methods in terms of the number of iterations in CG. In our work,
we recall well-established preconditioners from linear algebra and use them as
a starting point for training the GNN to obtain preconditioners that reduce the
condition number of the system more significantly. Numerical experiments show
that our approach outperforms both classical and neural network-based methods
for an important class of parametric partial differential equations. We also
provide a heuristic justification for the loss function used and show that
preconditioners obtained by learning with this loss function reduce the
condition number in a more desirable way for CG.
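To show where a preconditioner enters the conjugate gradient iteration, here is a minimal sketch of preconditioned CG on a small dense system. The Jacobi (diagonal) preconditioner is a classical stand-in chosen for illustration and is not the learned GNN preconditioner from the abstract; the function name `pcg` and the callable `apply_precond` are hypothetical.

```python
def pcg(A, b, apply_precond, tol=1e-10, max_iters=100):
    """Preconditioned conjugate gradient for a symmetric positive definite A.

    apply_precond(r) should return an approximation of A^{-1} r.
    Returns the solution estimate and the number of iterations performed.
    """
    n = len(b)

    def matvec(x):
        return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

    def dot(x, y):
        return sum(xi * yi for xi, yi in zip(x, y))

    x = [0.0] * n
    r = list(b)            # residual b - A x for the zero initial guess
    z = apply_precond(r)   # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for it in range(max_iters):
        if dot(r, r) ** 0.5 < tol:
            return x, it
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = apply_precond(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iters

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
jacobi = lambda r: [ri / A[i][i] for i, ri in enumerate(r)]
solution, iterations = pcg(A, b, jacobi)
print(solution, iterations)
```

Any other preconditioner, learned or classical, would be swapped in through `apply_precond`; a better one lowers the effective condition number and therefore the iteration count.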
[LINK]
http://arxiv.org/abs/2405.15557v2
[DATE]
2024-12-20 00:32:43+08:00
[CATEGORIES]
cs.LG
Robust Federated Learning in the Face of Covariate Shift: A Magnitude Pruning with Hybrid Regularization Framework for Enhanced Model Aggregation
[AUTHORS]
Ozgu Goksu, Nicolas Pugeault
[ABSTRACT]
The development of highly sophisticated neural networks has allowed for fast
progress in every field of computer vision; however, applications where
annotated data is prohibited due to privacy or security concerns remain
challenging. Federated Learning (FL) offers a promising framework for
individuals aiming to collaboratively develop a shared model while preserving
data privacy. Nevertheless, our findings reveal that variations in data
distribution among clients can profoundly affect FL methodologies, primarily
due to instabilities in the aggregation process. We also propose a novel FL
framework to mitigate the adverse effects of covariate shifts among federated
clients by combining individual parameter pruning and regularization techniques
to improve the robustness of individual clients’ models to aggregate. Each
client’s model is optimized through magnitude-based pruning and the addition of
dropout and noise injection layers to build more resilient decision pathways in
the networks and improve the robustness of the model’s parameter aggregation
step. The proposed framework is capable of extracting robust representations
even in the presence of very large covariate shifts among client data
distributions and in the federation of a small number of clients. Empirical
findings substantiate the effectiveness of our proposed methodology across
common benchmark datasets, including CIFAR10, MNIST, SVHN, and Fashion MNIST.
Furthermore, we introduce the CelebA-Gender dataset, specifically designed to
evaluate performance on a more realistic domain. The proposed method is capable
of extracting robust representations even in the presence of both high and low
covariate shifts among client data distributions.
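Magnitude-based pruning, one ingredient of the framework described above, can be sketched on a flat list of weights. This is a generic illustration rather than the authors' procedure: the function name `magnitude_prune`, the `sparsity` parameter, and the choice to prune a single flat list instead of per-layer tensors are all assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

print(magnitude_prune([0.5, -0.1, 0.05, -2.0], sparsity=0.5))
```

In a federated setting, each client would apply a step like this to its local model before its parameters are sent for aggregation.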
[LINK]
http://arxiv.org/abs/2412.15010v1
[DATE]
2024-12-20 00:22:37+08:00
[CATEGORIES]
cs.LG
DisCo: Graph-Based Disentangled Contrastive Learning for Cold-Start Cross-Domain Recommendation
[AUTHORS]
Hourun Li, Yifan Wang, Zhiping Xiao, Jia Yang, Changling Zhou, Ming Zhang, Wei Ju
[ABSTRACT]
Recommender systems are widely used in various real-world applications, but
they often encounter the persistent challenge of the user cold-start problem.
Cross-domain recommendation (CDR), which leverages user interactions from one
domain to improve prediction performance in another, has emerged as a promising
solution. However, users with similar preferences in the source domain may
exhibit different interests in the target domain. Therefore, directly
transferring embeddings may introduce irrelevant source-domain collaborative
information. In this paper, we propose a novel graph-based disentangled
contrastive learning framework to capture fine-grained user intent and filter
out irrelevant collaborative information, thereby avoiding negative transfer.
Specifically, for each domain, we use a multi-channel graph encoder to capture
diverse user intents. We then construct the affinity graph in the embedding
space and perform multi-step random walks to capture high-order user similarity
relationships. Treating one domain as the target, we propose a disentangled
intent-wise contrastive learning approach, guided by user similarity, to refine
the bridging of user intents across domains. Extensive experiments on four
benchmark CDR datasets demonstrate that DisCo consistently outperforms existing
state-of-the-art baselines, thereby validating the effectiveness of both DisCo
and its components.
[LINK]
http://arxiv.org/abs/2412.15005v1
[DATE]
2024-12-20 00:20:42+08:00
[CATEGORIES]
cs.LG
TurboSVM-FL: Boosting Federated Learning through SVM Aggregation for Lazy Clients
[AUTHORS]
Mengdi Wang, Anna Bodonhelyi, Efe Bozkir, Enkelejda Kasneci
[ABSTRACT]
Federated learning is a distributed collaborative machine learning paradigm
that has gained strong momentum in recent years. In federated learning, a
central server periodically coordinates models with clients and aggregates the
models trained locally by clients without necessitating access to local data.
Despite its potential, the implementation of federated learning continues to
encounter several challenges, predominantly the slow convergence that is
largely due to data heterogeneity. The slow convergence becomes particularly
problematic in cross-device federated learning scenarios where clients may be
strongly limited by computing power and storage space, and hence counteracting
methods that induce additional computation or memory cost on the client side
such as auxiliary objective terms and larger training iterations can be
impractical. In this paper, we propose a novel federated aggregation strategy,
TurboSVM-FL, that poses no additional computation burden on the client side and
can significantly accelerate convergence for federated classification tasks,
especially when clients are “lazy” and train their models for only a few epochs
before the next global aggregation. TurboSVM-FL extensively utilizes support vector
machines to conduct selective aggregation and max-margin spread-out
regularization on class embeddings. We evaluate TurboSVM-FL on multiple
datasets including FEMNIST, CelebA, and Shakespeare using user-independent
validation with non-iid data distribution. Our results show that TurboSVM-FL
can significantly outperform existing popular algorithms on convergence rate
and reduce communication rounds while delivering better test metrics including
accuracy, F1 score, and MCC.
[COMMENTS]
Proceedings of the AAAI Conference on Artificial Intelligence 2024
(AAAI’24)
[LINK]
http://arxiv.org/abs/2401.12012v5
[DATE]
2024-12-20 00:15:27+08:00
[CATEGORIES]
cs.LG
Mitigating federated learning contribution allocation instability through randomized aggregation
[AUTHORS]
Arno Geimer, Beltran Fiz, Radu State
[LINK]
http://arxiv.org/abs/2405.08044v2
[DATE]
2024-12-20 00:08:31+08:00
[CATEGORIES]
cs.LG
Arbitrary Polynomial Separations in Trainable Quantum Machine Learning
[AUTHORS]
Eric R. Anschuetz, Xun Gao
[ABSTRACT]
Recent theoretical results in quantum machine learning have demonstrated a
general trade-off between the expressive power of quantum neural networks
(QNNs) and their trainability; as a corollary of these results, practical
exponential separations in expressive power over classical machine learning
models are believed to be infeasible as such QNNs take a time to train that is
exponential in the model size. We here circumvent these negative results by
constructing a hierarchy of efficiently trainable QNNs that exhibit
unconditionally provable, polynomial memory separations of arbitrary constant
degree over classical neural networks – including state-of-the-art models,
such as Transformers – in performing a classical sequence modeling task. This
construction is also computationally efficient, as each unit cell of the
introduced class of QNNs only has constant gate complexity. We show that
contextuality – informally, a quantitative notion of semantic ambiguity – is
the source of the expressivity separation, suggesting that other learning tasks
with this property may be a natural setting for the use of quantum learning
algorithms.
[COMMENTS]
24 pages, 3 figures, strengthened and simplified results and
presentation
[LINK]
http://arxiv.org/abs/2402.08606v2
[DATE]
2024-12-20 00:05:31+08:00
[CATEGORIES]
cs.LG
Chain-of-MetaWriting: Linguistic and Textual Analysis of How Small Language Models Write Young Students Texts
[AUTHORS]
Ioana Buhnila, Georgeta Cislaru, Amalia Todirascu
[COMMENTS]
Accepted at WRAICOGS 2025 (Writing Aids at the Crossroads of AI,
Cognitive Science, and NLP) co-located with COLING 2025
[LINK]
http://arxiv.org/abs/2412.14986v1
[DATE]
2024-12-19 23:58:53+08:00
[CATEGORIES]
cs.CL
Beyond Dataset Creation: Critical View of Annotation Variation and Bias Probing of a Dataset for Online Radical Content Detection
[AUTHORS]
Arij Riabi, Virginie Mouilleron, Menel Mahamdi, Wissam Antoun, Djamé Seddah
[COMMENTS]
Accepted to COLING 2025
[LINK]
http://arxiv.org/abs/2412.11745v2
[DATE]
2024-12-19 23:55:45+08:00
[CATEGORIES]
cs.CL
LLM-SEM: A Sentiment-Based Student Engagement Metric Using LLMS for E-Learning Platforms
[AUTHORS]
Ali Hamdi, Ahmed Abdelmoneim Mazrou, Mohamed Shaltout
[ABSTRACT]
Current methods for analyzing student engagement in e-learning platforms,
including automated systems, often struggle with challenges such as handling
fuzzy sentiment in text comments and relying on limited metadata. Traditional
approaches, such as surveys and questionnaires, also face issues like small
sample sizes and scalability. In this paper, we introduce LLM-SEM (Language
Model-Based Student Engagement Metric), a novel approach that leverages video
metadata and sentiment analysis of student comments to measure engagement. By
utilizing recent Large Language Models (LLMs), we generate high-quality
sentiment predictions to mitigate text fuzziness and normalize key features
such as views and likes. Our holistic method combines comprehensive metadata
with sentiment polarity scores to gauge engagement at both the course and
lesson levels. Extensive experiments were conducted to evaluate various LLM
models, demonstrating the effectiveness of LLM-SEM in providing a scalable and
accurate measure of student engagement. We fine-tuned TXLM-RoBERTa using
human-annotated sentiment datasets to enhance prediction accuracy and utilized
LLama 3B and Gemma 9B from Ollama.
[LINK]
http://arxiv.org/abs/2412.13765v2
[DATE]
2024-12-19 23:50:54+08:00
[CATEGORIES]
cs.CL
Movie2Story: A framework for understanding videos and telling stories in the form of novel text
[AUTHORS]
Kangning Li, Zheyang Jia, Anyu Ying
[ABSTRACT]
Multimodal video-to-text models have made considerable progress, primarily in
generating brief descriptions of video content. However, there is still a
deficiency in generating rich long-form text descriptions that integrate both
video and audio. In this paper, we introduce a framework called M2S, designed
to generate novel-length text by combining audio, video, and character
recognition. M2S includes modules for video long-form text description and
comprehension, audio-based analysis of emotion, speech rate, and character
alignment, and visual-based character recognition alignment. By integrating
multimodal information using the large language model GPT4o, M2S stands out in
the field of multimodal text generation. We demonstrate the effectiveness and
accuracy of M2S through comparative experiments and human evaluation.
Additionally, the model framework has good scalability and significant
potential for future research.
[LINK]
http://arxiv.org/abs/2412.14965v1
[DATE]
2024-12-19 23:44:04+08:00
[CATEGORIES]
cs.CL
Knowledge Injection via Prompt Distillation
[AUTHORS]
Kalle Kujanpää, Harri Valpola, Alexander Ilin
[ABSTRACT]
In many practical applications, large language models (LLMs) need to
incorporate new knowledge not present in their pre-training data. The primary
methods for this are fine-tuning and retrieval-augmented generation (RAG).
Although RAG has emerged as the industry standard for knowledge injection,
fine-tuning has not yet achieved comparable success. In this paper, we propose
a new fine-tuning technique for learning new knowledge and show that it can
reach the performance of RAG. The proposed method is based on the
self-distillation approach, which we call prompt distillation. First, we
generate question-answer pairs about the new knowledge. Then, we fine-tune a
student model on the question-answer pairs to imitate the output distributions
of a teacher model, which additionally receives the new knowledge in its
prompt. The student model is identical to the teacher, except it is equipped
with a LoRA adapter. This training procedure facilitates distilling the new
knowledge from the teacher’s prompt into the student’s weights.
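
A minimal sketch of the imitation objective described above, assuming the
student is trained to match the teacher's per-token output distribution with a
KL divergence. The abstract does not name the loss, so the choice of KL and the
helper names (softmax, kl_divergence, distillation_loss) are illustrative
assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over token positions.

    teacher_logits: logits from the model that sees the new knowledge in its prompt.
    student_logits: logits from the same model with an adapter and no knowledge prompt.
    Both are lists of per-position logit lists (hypothetical input layout).
    """
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)
```

The loss is zero when the student reproduces the teacher's distribution at
every position and grows as the two distributions diverge.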
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2412.14964v1
[DATE]
2024-12-19 23:44:01+08:00
[CATEGORIES]
cs.CL
cs.LG
To Word Senses and Beyond: Inducing Concepts with Contextualized Language Models
[AUTHORS]
Bastien Liétard, Pascal Denis, Mikaella Keller
[ABSTRACT]
Polysemy and synonymy are two crucial interrelated facets of lexical
ambiguity. While both phenomena are widely documented in lexical resources and
have been studied extensively in NLP, leading to dedicated systems, they are
often considered independently in practical problems. While many tasks
dealing with polysemy (e.g. Word Sense Disambiguation or Induction) highlight
the role of a word’s senses, the study of synonymy is rooted in the study of
concepts, i.e. meanings shared across the lexicon. In this paper, we introduce
Concept Induction, the unsupervised task of learning a soft clustering among
words that defines a set of concepts directly from data. This task generalizes
Word Sense Induction. We propose a bi-level approach to Concept Induction that
leverages both a local lemma-centric view and a global cross-lexicon view to
induce concepts. We evaluate the obtained clustering on SemCor’s annotated data
and obtain good performance (BCubed F1 above 0.60). We find that the local and
the global levels are mutually beneficial to induce concepts and also senses in
our setting. Finally, we create static embeddings representing our induced
concepts and use them on the Word-in-Context task, obtaining competitive
performance with the State-of-the-Art.
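
For reference, the BCubed F1 score mentioned above can be sketched as follows.
This is the standard hard-clustering form of the metric; the paper evaluates a
soft clustering, so its exact variant may differ, and the function name
bcubed_f1 and the label-list input format are assumptions made for
illustration.

```python
def bcubed_f1(pred, gold):
    """Return (precision, recall, F1) under the BCubed metric.

    pred: predicted cluster id for each item.
    gold: gold class id for each item (same order and length as pred).
    """
    if len(pred) != len(gold) or not pred:
        raise ValueError("pred and gold must be non-empty and of equal length")
    n = len(pred)
    precision = 0.0
    recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if pred[j] == pred[i]]
        same_class = [j for j in range(n) if gold[j] == gold[i]]
        # Items that share both the predicted cluster and the gold class with item i.
        correct = sum(1 for j in same_cluster if gold[j] == gold[i])
        precision += correct / len(same_cluster)
        recall += correct / len(same_class)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Each item counts itself, so per-item precision and recall are always positive
and the harmonic mean is well defined.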
[COMMENTS]
Published in EMNLP 2024 main conference proceedings
[LINK]
http://arxiv.org/abs/2406.20054v2
[DATE]
2024-12-19 23:30:40+08:00
[CATEGORIES]
cs.CL
Benchmarking Large Language Models for Math Reasoning Tasks
[AUTHORS]
Kathrin Seßler, Yao Rong, Emek Gözlüklü, Enkelejda Kasneci
[ABSTRACT]
The use of Large Language Models (LLMs) in mathematical reasoning has become
a cornerstone of related research, demonstrating the intelligence of these
models and enabling potential practical applications through their advanced
performance, such as in educational settings. Despite the variety of datasets
and in-context learning algorithms designed to improve the ability of LLMs to
automate mathematical problem solving, the lack of comprehensive benchmarking
across different datasets makes it complicated to select an appropriate model
for specific tasks. In this project, we present a benchmark that fairly
compares seven state-of-the-art in-context learning algorithms for mathematical
problem solving across five widely used mathematical datasets on four powerful
foundation models. Furthermore, we explore the trade-off between efficiency and
performance, highlighting the practical applications of LLMs for mathematical
reasoning. Our results indicate that larger foundation models like GPT-4o and
LLaMA 3-70B can solve mathematical reasoning independently from the concrete
prompting strategy, while for smaller models the in-context learning approach
significantly influences the performance. Moreover, the optimal prompt depends
on the chosen foundation model. We open-source our benchmark code to support
the integration of additional models in future research.
[COMMENTS]
This work has been submitted to the IEEE for possible publication
[LINK]
http://arxiv.org/abs/2408.10839v2
[DATE]
2024-12-19 23:25:41+08:00
[CATEGORIES]
cs.CL
cs.LG
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
[AUTHORS]
Xiangheng He, Junjie Chen, Zixing Zhang, Björn W. Schuller
[ABSTRACT]
Prosody contains rich information beyond the literal meaning of words, which
is crucial for the intelligibility of speech. Current models still fall short
in phrasing and intonation; they not only miss or misplace breaks when
synthesizing long sentences with complex structures but also produce unnatural
intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis
(TTS) model with a flow-matching (FM) backbone that aims to enhance the
phrasing and intonation aspects of prosody. ProsodyFM introduces two key
components: a Phrase Break Encoder to capture initial phrase break locations,
followed by a Duration Predictor for the flexible adjustment of break
durations; and a Terminal Intonation Encoder which learns a bank of intonation
shape tokens combined with a novel Pitch Processor for more robust modeling of
human-perceived intonation change. ProsodyFM is trained with no explicit
prosodic labels and yet can uncover a broad spectrum of break durations and
intonation patterns. Experimental results demonstrate that ProsodyFM can
effectively improve the phrasing and intonation aspects of prosody, thereby
enhancing the overall intelligibility compared to four state-of-the-art (SOTA)
models. Out-of-distribution experiments show that this prosody improvement can
further bring ProsodyFM superior generalizability for unseen complex sentences
and speakers. Our case study intuitively illustrates the powerful and
fine-grained controllability of ProsodyFM over phrasing and intonation.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11795v2
[DATE]
2024-12-19 23:21:44+08:00
[CATEGORIES]
cs.CL
From Bench to Bedside: A Review of Clinical Trials in Drug Discovery and Development
[AUTHORS]
Tianyang Wang, Ming Liu, Benji Peng, Xinyuan Song, Charles Zhang, Xintian Sun, Qian Niu, Junyu Liu, Silin Chen, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Yunze Wang, Yichao Zhang, Cheng Fei, Lawrence KQ Yan
[ABSTRACT]
Clinical trials are an indispensable part of the drug development process,
bridging the gap between basic research and clinical application. During the
development of new drugs, clinical trials are used not only to evaluate the
safety and efficacy of the drug but also to explore its dosage, treatment
regimens, and potential side effects. This review discusses the various stages
of clinical trials, including Phase I (safety assessment), Phase II
(preliminary efficacy evaluation), Phase III (large-scale validation), and
Phase IV (post-marketing surveillance), highlighting the characteristics of
each phase and their interrelationships. Additionally, the paper addresses the
major challenges encountered in clinical trials, such as ethical issues,
subject recruitment difficulties, and diversity and representativeness concerns,
and proposes strategies for overcoming these challenges. With the advancement
of technology, innovative technologies such as artificial intelligence, big
data, and digitalization are gradually transforming clinical trial design and
implementation, improving trial efficiency and data quality. The article also
looks forward to the future of clinical trials, particularly the impact of
emerging therapies such as gene therapy and immunotherapy on trial design, as
well as the importance of regulatory reforms and global collaboration. In
conclusion, the core role of clinical trials in drug development will continue
to drive the progress of innovative drug development and clinical treatment.
[COMMENTS]
11 pages
[LINK]
http://arxiv.org/abs/2412.09378v2
[DATE]
2024-12-19 23:14:18+08:00
[CATEGORIES]
cs.CL
ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models
[AUTHORS]
Yuzhe Gu, Ziwei Ji, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen
[ABSTRACT]
Large language models (LLMs) exhibit hallucinations in long-form
question-answering tasks across various domains and wide applications. Current
hallucination detection and mitigation datasets are limited in domains and
sizes, which struggle to scale due to prohibitive labor costs and insufficient
reliability of existing hallucination annotators. To facilitate the scalable
oversight of LLM hallucinations, this paper introduces an iterative
self-training framework that simultaneously and progressively scales up the
hallucination annotation dataset and improves the accuracy of the hallucination
annotator. Based on the Expectation Maximization (EM) algorithm, in each
iteration, the framework first applies a hallucination annotation pipeline to
annotate a scaled dataset and then trains a more accurate hallucination
annotator on the dataset. This new hallucination annotator is adopted in the
hallucination annotation pipeline used for the next iteration. Extensive
experimental results demonstrate that the finally obtained hallucination
annotator with only 7B parameters surpasses the performance of GPT-4 and
obtains new state-of-the-art hallucination detection results on HaluEval and
HalluQA by zero-shot inference. Such an annotator can not only evaluate the
hallucination levels of various LLMs on the large-scale dataset but also help
to mitigate the hallucination of LLM generations, with the Natural Language
Inference (NLI) metric increasing from 25% to 37% on HaluEval.
[COMMENTS]
Accepted by NeurIPS 2024. Dataset, code, and model are released at
https://github.com/open-compass/ANAH
[LINK]
http://arxiv.org/abs/2407.04693v2
[DATE]
2024-12-19 23:11:47+08:00
[CATEGORIES]
cs.CL
BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment
[AUTHORS]
Shaolei Zhang, Kehao Zhang, Qingkai Fang, Shoutao Guo, Yan Zhou, Xiaodong Liu, Yang Feng
[ABSTRACT]
Large language models (LLMs), with their powerful generative capabilities and
vast knowledge, empower various tasks in everyday life. However, these
abilities are primarily concentrated in high-resource languages, leaving
low-resource languages with weaker generative capabilities and relatively
limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore
crucial for serving over 100 linguistic communities worldwide. An intuitive
approach to enhance the multilingual capabilities would be to construct
instruction data for various languages, but constructing instruction data for
over 100 languages is prohibitively costly. In this paper, we introduce BayLing
2, which efficiently transfers generative capabilities and knowledge from
high-resource languages to low-resource languages through language alignment.
To achieve this, we constructed a dataset of 3.2 million instructions,
comprising high-resource language instructions (Chinese and English) and
cross-lingual instructions for 100+ languages and performed instruction tuning
based on the dataset to facilitate the capability transfer between languages.
Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B,
and BayLing-2-8B, and conducted a comprehensive evaluation of BayLing. For
multilingual translation across 100+ languages, BayLing shows superior
performance compared to open-source models of similar scale. For multilingual
knowledge and understanding benchmarks, BayLing achieves significant
improvements across over 20 low-resource languages, demonstrating its
capability of effective knowledge transfer from high-resource to low-resource
languages. Furthermore, results on English benchmarks indicate that BayLing
maintains high performance in high-resource languages while enhancing the
performance in low-resource languages. Demo, homepage, code and models of
BayLing are available.
[COMMENTS]
BayLing 2’s online demo: http://nlp.ict.ac.cn/bayling/demo. BayLing
2’s code and models: https://github.com/ictnlp/BayLing
[LINK]
http://arxiv.org/abs/2411.16300v3
[DATE]
2024-12-19 23:11:46+08:00
[CATEGORIES]
cs.CL
Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation
[AUTHORS]
Zexiong Ma, Shengnan An, Zeqi Lin, Yanzhen Zou, Jian-Guang Lou, Bing Xie
[ABSTRACT]
Large language models (LLMs) are susceptible to generating hallucinated
information, despite the integration of retrieval-augmented generation (RAG).
Parallel context extension (PCE) is a line of research attempting to
effectively integrate parallel (unordered) contexts, but it still suffers
from hallucinations when adapted to RAG scenarios. In this paper, we propose
DePaC (Dehallucinating Parallel Context Extension), which alleviates the
hallucination problem with context-aware negative training and
information-calibrated aggregation. DePaC is designed to alleviate two types of
in-context hallucination: fact fabrication (i.e., LLMs present claims that are
not supported by the contexts) and fact omission (i.e., LLMs fail to present
claims that can be supported by the contexts). Specifically, (1) for fact
fabrication, we apply the context-aware negative training that fine-tunes the
LLMs with negative supervisions, thus explicitly guiding the LLMs to refuse to
answer when contexts are not related to questions; (2) for fact omission, we
propose the information-calibrated aggregation which prioritizes context
windows with higher information increment from their contexts. The experimental
results on nine RAG tasks demonstrate that DePaC significantly alleviates the
two types of hallucination and consistently achieves better performances on
these tasks.
[LINK]
http://arxiv.org/abs/2412.14905v1
[DATE]
2024-12-19 22:37:11+08:00
[CATEGORIES]
cs.CL
Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling
[AUTHORS]
Junyi Li, Hwee Tou Ng
[ABSTRACT]
Despite their outstanding capabilities, large language models (LLMs) are
prone to hallucination and producing factually incorrect information. This
challenge has spurred efforts in attributed text generation, which prompts LLMs
to generate content with supporting evidence. In this paper, we propose a novel
framework, called Think&Cite, and formulate attributed text generation as a
multi-step reasoning problem integrated with search. Specifically, we propose
Self-Guided Monte Carlo Tree Search (SG-MCTS), which capitalizes on the
self-reflection capability of LLMs to reflect on the intermediate states of
MCTS for guiding the tree expansion process. To provide reliable and
comprehensive feedback, we introduce Progress Reward Models to measure the
progress of tree search from the root to the current state from two aspects,
i.e., generation and attribution progress. We conduct extensive experiments on
three datasets and the results show that our approach significantly outperforms
baseline approaches.
[LINK]
http://arxiv.org/abs/2412.14860v1
[DATE]
2024-12-19 21:55:48+08:00
[CATEGORIES]
cs.CL
DS$^2$-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis
[AUTHORS]
Hongling Xu, Yice Zhang, Qianlong Wang, Ruifeng Xu
[ABSTRACT]
Recently developed large language models (LLMs) have presented promising new
avenues to address data scarcity in low-resource scenarios. In few-shot
aspect-based sentiment analysis (ABSA), previous efforts have explored data
augmentation techniques, which prompt LLMs to generate new samples by modifying
existing ones. However, these methods fail to produce adequately diverse data,
impairing their effectiveness. Besides, some studies apply in-context learning
for ABSA by using specific instructions and a few selected examples as prompts.
Though promising, LLMs often yield labels that deviate from task requirements.
To overcome these limitations, we propose DS$^2$-ABSA, a dual-stream data
synthesis framework targeted for few-shot ABSA. It leverages LLMs to synthesize
data from two complementary perspectives: \textit{key-point-driven} and
\textit{instance-driven}, which effectively generate diverse and high-quality
ABSA samples in low-resource settings. Furthermore, a \textit{label refinement}
module is integrated to improve the synthetic labels. Extensive experiments
demonstrate that DS$^2$-ABSA significantly outperforms previous few-shot ABSA
solutions and other LLM-oriented data generation methods.
[LINK]
http://arxiv.org/abs/2412.14849v1
[DATE]
2024-12-19 21:39:47+08:00
[CATEGORIES]
cs.CL
A Survey of RWKV
[AUTHORS]
Zhiyuan Li, Tingyu Xia, Yi Chang, Yuan Wu
[ABSTRACT]
The Receptance Weighted Key Value (RWKV) model offers a novel alternative to
the Transformer architecture, merging the benefits of recurrent and
attention-based systems. Unlike conventional Transformers, which depend heavily
on self-attention, RWKV adeptly captures long-range dependencies with minimal
computational demands. By utilizing a recurrent framework, RWKV addresses some
computational inefficiencies found in Transformers, particularly in tasks with
long sequences. RWKV has recently drawn considerable attention for its robust
performance across multiple domains. Despite its growing popularity, no
systematic review of the RWKV model exists. This paper seeks to fill this gap
as the first comprehensive review of the RWKV architecture, its core
principles, and its varied applications, such as natural language generation,
natural language understanding, and computer vision. We assess how RWKV
compares to traditional Transformer models, highlighting its capability to
manage long sequences efficiently and lower computational costs. Furthermore,
we explore the challenges RWKV encounters and propose potential directions for
future research and advancement. We consistently maintain the related
open-source materials at: https://github.com/MLGroupJLU/RWKV-Survey.
[COMMENTS]
18 pages
[LINK]
http://arxiv.org/abs/2412.14847v1
[DATE]
2024-12-19 21:39:24+08:00
[CATEGORIES]
cs.CL
Progressive Multimodal Reasoning via Active Retrieval
[AUTHORS]
Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
[ABSTRACT]
Multi-step multimodal reasoning tasks pose significant challenges for
multimodal large language models (MLLMs), and finding effective ways to enhance
their performance in such scenarios remains an unresolved issue. In this paper,
we propose AR-MCTS, a universal framework designed to progressively improve the
reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo
Tree Search (MCTS). Our approach begins with the development of a unified
retrieval module that retrieves key supporting insights for solving complex
reasoning problems from a hybrid-modal retrieval corpus. To bridge the gap in
automated multimodal reasoning verification, we employ the MCTS algorithm
combined with an active retrieval mechanism, which enables the automatic
generation of step-wise annotations. This strategy dynamically retrieves key
insights for each reasoning step, moving beyond traditional beam search
sampling to improve the diversity and reliability of the reasoning space.
Additionally, we introduce a process reward model that aligns progressively to
support the automatic verification of multimodal reasoning tasks. Experimental
results across three complex multimodal reasoning benchmarks confirm the
effectiveness of the AR-MCTS framework in enhancing the performance of various
multimodal models. Further analysis demonstrates that AR-MCTS can optimize
sampling diversity and accuracy, yielding reliable multimodal reasoning.
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2412.14835v1
[DATE]
2024-12-19 21:25:39+08:00
[CATEGORIES]
cs.CL
Mention Attention for Pronoun Translation
[AUTHORS]
Gongbo Tang, Christian Hardmeier
[ABSTRACT]
Most pronouns are referring expressions; computers need to resolve what
the pronouns refer to, and there are divergences in pronoun usage across
languages. Thus, dealing with these divergences and translating pronouns is a
challenge in machine translation. Mentions are referring candidates of pronouns
and have closer relations with pronouns compared to general tokens. We assume
that extracting additional mention features can help pronoun translation.
Therefore, we introduce an additional mention attention module in the decoder
to pay extra attention to source mentions but not non-mention tokens. Our
mention attention module not only extracts features from source mentions, but
also considers target-side context which benefits pronoun translation. In
addition, we also introduce two mention classifiers to train models to
recognize mentions, whose outputs guide the mention attention. We conduct
experiments on the WMT17 English-German translation task, and evaluate our
models on general translation and pronoun translation, using BLEU, APT, and
contrastive evaluation metrics. Our proposed model outperforms the baseline
Transformer model in terms of APT and BLEU scores. This confirms our hypothesis
that we can improve pronoun translation by paying additional attention to
source mentions, and shows that our introduced additional modules do not have
a negative effect on the general translation quality.
[COMMENTS]
camera-ready version of the paper accepted by JCRAI-23 conference, in
ACL format
[LINK]
http://arxiv.org/abs/2412.14829v1
[DATE]
2024-12-19 21:19:19+08:00
[CATEGORIES]
cs.CL
MaFeRw: Query Rewriting with Multi-Aspect Feedbacks for Retrieval-Augmented Large Language Models
[AUTHORS]
Yujing Wang, Hainan Zhang, Liang Pang, Binghui Guo, Hongwei Zheng, Zhiming Zheng
[ABSTRACT]
In a real-world RAG system, the current query often involves spoken ellipses
and ambiguous references from dialogue contexts, necessitating query rewriting
to better describe the user’s information needs. However, traditional context-based
rewriting has minimal enhancement on downstream generation tasks due to the
lengthy process from query rewriting to response generation. Some researchers
try to utilize reinforcement learning with generation feedback to assist the
rewriter, but these sparse rewards provide little guidance in most cases,
leading to unstable training and generation results. We find that the user’s needs
are also reflected in the gold document, retrieved documents and ground truth.
Therefore, by feeding back these multi-aspect dense rewards to query rewriting,
more stable and satisfactory responses can be achieved. In this paper, we
propose a novel query rewriting method MaFeRw, which improves RAG performance
by integrating multi-aspect feedback from both the retrieval process and
generated results. Specifically, we first use manual data to train a T5 model
for the rewriter initialization. Next, we design three metrics as reinforcement
learning feedback: the similarity between the rewritten query and the gold
document, the ranking metrics, and ROUGE between the generation and the ground
truth. Inspired by RLAIF, we train three kinds of reward models for the above
metrics to achieve more efficient training. Finally, we combine the scores of
these reward models as feedback, and use the PPO algorithm to explore the optimal
query rewriting strategy. Experimental results on two conversational RAG
datasets demonstrate that MaFeRw achieves superior generation metrics and more
stable training compared to baselines.
[LINK]
http://arxiv.org/abs/2408.17072v2
[DATE]
2024-12-19 21:16:40+08:00
[CATEGORIES]
cs.CL
ResoFilter: Fine-grained Synthetic Data Filtering for Large Language Models through Data-Parameter Resonance Analysis
[AUTHORS]
Zeao Tu, Xiangdi Meng, Yu He, Zihan Yao, Tianyu Qi, Jun Liu, Ming Li
[ABSTRACT]
Large language models (LLMs) have shown remarkable effectiveness across
various domains, with data augmentation methods utilizing GPT for synthetic
data generation becoming prevalent. However, the quality and utility of
augmented data remain questionable, and current methods lack clear metrics for
evaluating data characteristics. To address these challenges, we propose
ResoFilter, a novel method that integrates models, data, and tasks to refine
datasets. ResoFilter leverages the fine-tuning process to obtain Data-Parameter
features for data selection, offering improved interpretability by representing
data characteristics through model weights. Our experiments demonstrate that
ResoFilter achieves comparable results to full-scale fine-tuning using only
half the data in mathematical tasks and exhibits strong generalization across
different models and domains. This method provides valuable insights for
constructing synthetic datasets and evaluating high-quality data, offering a
promising solution for enhancing data augmentation techniques and improving
training dataset quality for LLMs. For reproducibility, we will release our
code and data upon acceptance.
[COMMENTS]
under review
[LINK]
http://arxiv.org/abs/2412.14809v1
[DATE]
2024-12-19 20:57:47+08:00
[CATEGORIES]
cs.CL
Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning
[AUTHORS]
Ziang Ye, Zhenru Zhang, Yang Zhang, Jianxin Ma, Junyang Lin, Fuli Feng
[ABSTRACT]
When using agent-task datasets to enhance agent capabilities for Large
Language Models (LLMs), current methodologies often treat all tokens within a
sample equally. However, we argue that tokens serving different roles -
specifically, reasoning tokens versus boilerplate tokens (e.g., those governing
output format) - differ significantly in importance and learning complexity,
necessitating their disentanglement and distinct treatment. To address this, we
propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token
discrimination. SHAD classifies tokens by exploiting predictability differences
observed after shuffling input-output combinations across samples: boilerplate
tokens, due to their repetitive nature among samples, maintain predictability,
whereas reasoning tokens do not. Using SHAD, we propose the
Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes
reasoning tokens during fine-tuning, yielding notable performance gains over
common Supervised Fine-Tuning (SFT).
[LINK]
http://arxiv.org/abs/2412.14780v1
[DATE]
2024-12-19 20:06:24+08:00
[CATEGORIES]
cs.CL
ALKAFI-LLAMA3: Fine-Tuning LLMs for Precise Legal Understanding in Palestine
[AUTHORS]
Rabee Qasem, Mohannad Hendi, Banan Tantour
[ABSTRACT]
Large Language Models (LLMs) have demonstrated remarkable potential in
diverse domains, yet their application in the legal sector, particularly in
low-resource contexts, remains limited. This study addresses the challenges of
adapting LLMs to the Palestinian legal domain, where political instability,
fragmented legal frameworks, and limited AI resources hinder effective
machine-learning applications. We present a fine-tuned model based on a
quantized version of Llama-3.2-1B-Instruct, trained on a synthetic data set
derived from Palestinian legal texts. Using smaller-scale models and
strategically generated question-answer pairs, we achieve a cost-effective,
locally sustainable solution that provides accurate and contextually relevant
legal guidance. Our experiments demonstrate promising performance on various
query types, ranging from yes/no questions and narrative explanations to
complex legal differentiations, while highlighting areas for improvement, such
as handling calculation-based inquiries and structured list formatting. This
work provides a pathway for the deployment of AI-driven legal assistance tools
tailored to the needs of resource-constrained environments.
[LINK]
http://arxiv.org/abs/2412.14771v1
[DATE]
2024-12-19 19:55:51+08:00
[CATEGORIES]
cs.CL
cs.LG
PsyDraw: A Multi-Agent Multimodal System for Mental Health Screening in Left-Behind Children
[AUTHORS]
Yiqun Zhang, Xiaocui Yang, Xiaobai Li, Siyuan Yu, Yi Luan, Shi Feng, Daling Wang, Yifei Zhang
[ABSTRACT]
Left-behind children (LBCs), numbering over 66 million in China, face severe
mental health challenges due to parental migration for work. Early screening
and identification of at-risk LBCs is crucial, yet challenging due to the
severe shortage of mental health professionals, especially in rural areas.
While the House-Tree-Person (HTP) test shows higher child participation rates,
its requirement for expert interpretation limits its application in
resource-scarce regions. To address this challenge, we propose PsyDraw, a
multi-agent system based on Multimodal Large Language Models that assists
mental health professionals in analyzing HTP drawings. The system employs
specialized agents for feature extraction and psychological interpretation,
operating in two stages: comprehensive feature analysis and professional report
generation. Evaluation of HTP drawings from 290 primary school students reveals
that 71.03% of the analyses achieved High Consistency with professional
evaluations, 26.21% Moderate Consistency and only 2.41% Low Consistency. The
system identified 31.03% of cases requiring professional attention,
demonstrating its effectiveness as a preliminary screening tool. Currently
deployed in pilot schools, PsyDraw shows promise in supporting mental health
professionals, particularly in resource-limited areas, while maintaining high
professional standards in psychological assessment.
[COMMENTS]
preprint
[LINK]
http://arxiv.org/abs/2412.14769v1
[DATE]
2024-12-19 19:51:57+08:00
[CATEGORIES]
cs.CL
Query pipeline optimization for cancer patient question answering systems
[AUTHORS]
Maolin He, Rena Gao, Mike Conway, Brian E. Chapman
[ABSTRACT]
Retrieval-augmented generation (RAG) mitigates hallucination in Large
Language Models (LLMs) by using query pipelines to retrieve relevant external
information and grounding responses in retrieved knowledge. However, query
pipeline optimization for cancer patient question-answering (CPQA) systems
requires separately optimizing multiple components with domain-specific
considerations. We propose a novel three-aspect optimization approach for the
RAG query pipeline in CPQA systems, utilizing public biomedical databases like
PubMed and PubMed Central. Our optimization includes: (1) document retrieval,
utilizing a comparative analysis of NCBI resources and introducing Hybrid
Semantic Real-time Document Retrieval (HSRDR); (2) passage retrieval,
identifying optimal pairings of dense retrievers and rerankers; and (3)
semantic representation, introducing Semantic Enhanced Overlap Segmentation
(SEOS) for improved contextual understanding. On a custom-developed dataset
tailored for cancer-related inquiries, our optimized RAG approach improved the
answer accuracy of Claude-3-haiku by 5.24% over chain-of-thought prompting and
about 3% over a naive RAG setup. This study highlights the importance of
domain-specific query optimization in realizing the full potential of RAG and
provides a robust framework for building more accurate and reliable CPQA
systems, advancing the development of RAG-based biomedical systems.
[LINK]
http://arxiv.org/abs/2412.14751v1
[DATE]
2024-12-19 19:30:07+08:00
[CATEGORIES]
cs.CL
On Verbalized Confidence Scores for LLMs
[AUTHORS]
Daniel Yang, Yao-Hung Hubert Tsai, Makoto Yamada
[ABSTRACT]
The rise of large language models (LLMs) and their tight integration into our
daily life make it essential to dedicate efforts towards their trustworthiness.
Uncertainty quantification for LLMs can establish more human trust in their
responses, but also allows LLM agents to make more informed decisions based on
each other’s uncertainty. To estimate the uncertainty in a response, internal
token logits, task-specific proxy models, or sampling of multiple responses are
commonly used. This work focuses on asking the LLM itself to verbalize its
uncertainty with a confidence score as part of its output tokens, which is a
promising way for prompt- and model-agnostic uncertainty quantification with
low overhead. Using an extensive benchmark, we assess the reliability of
verbalized confidence scores with respect to different datasets, models, and
prompt methods. Our results reveal that the reliability of these scores
strongly depends on how the model is asked, but also that it is possible to
extract well-calibrated confidence scores with certain prompt methods. We argue
that verbalized confidence scores can become a simple but effective and
versatile uncertainty quantification method in the future. Our code is
available at https://github.com/danielyxyang/llm-verbalized-uq.
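One common way to quantify whether confidence scores are well calibrated is the expected calibration error (ECE), which bins answers by stated confidence and compares each bin's average confidence with its accuracy. The abstract does not say which metric the authors use, so the sketch below is only an assumed, generic ECE with equal-width bins; the function name and the default of 10 bins are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE for confidences in [0, 1].

    confidences: verbalized confidence per answer, e.g. 0.9 for "90%".
    correct:     booleans, True where the corresponding answer was right.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its gap, weighted by its share of the samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that always says 0.9 but is right three times out of four has a gap of about 0.15 under this measure; a lower value indicates better calibration.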
[LINK]
http://arxiv.org/abs/2412.14737v1
[DATE]
2024-12-19 19:10:36+08:00
[CATEGORIES]
cs.CL
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
[AUTHORS]
Anna Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, Esam Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, Alberto Testoni
[LINK]
http://arxiv.org/abs/2406.18403v2
[DATE]
2024-12-19 19:07:09+08:00
[CATEGORIES]
cs.CL
When Every Token Counts: Optimal Segmentation for Low-Resource Language Models
[AUTHORS]
Bharath Raj S, Garvit Suri, Vikrant Dewangan, Raghav Sonavane
[ABSTRACT]
Traditional greedy tokenization methods have been a critical step in Natural
Language Processing (NLP), influencing how text is converted into tokens and
directly impacting model performance. While subword tokenizers like Byte-Pair
Encoding (BPE) are widely used, questions remain about their optimality across
model scales and languages. In this work, we demonstrate through extensive
experiments that an optimal BPE configuration significantly reduces token count
compared to greedy segmentation, yielding improvements in token-saving
percentages and performance benefits, particularly for smaller models. We
evaluate tokenization performance across various intrinsic and extrinsic tasks,
including generation and classification. Our findings suggest that
compression-optimized tokenization strategies could provide substantial
advantages for multilingual and low-resource language applications,
highlighting a promising direction for further research and inclusive NLP.
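To make the contrast between greedy and optimal segmentation concrete, the sketch below compares a left-to-right longest-match segmenter with a dynamic program that finds the fewest tokens for a fixed vocabulary. The abstract does not describe the authors' algorithm, so both functions, their names, and the toy vocabulary are assumptions used purely for illustration.

```python
def greedy_segment(word, vocab):
    """Longest-match-first segmentation, scanning left to right."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return tokens


def optimal_segment(word, vocab):
    """Fewest-token segmentation via dynamic programming."""
    n = len(word)
    best = [0] + [None] * n  # best[i]: minimum tokens covering word[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last token in that cover
    for i in range(1, n + 1):
        for k in range(i):
            if best[k] is not None and word[k:i] in vocab:
                if best[i] is None or best[k] + 1 < best[i]:
                    best[i], back[i] = best[k] + 1, k
    if best[n] is None:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    tokens, i = [], n
    while i > 0:
        tokens.append(word[back[i]:i])
        i = back[i]
    return tokens[::-1]


vocab = {"a", "b", "c", "d", "e", "ab", "abc", "cde"}
print(greedy_segment("abcde", vocab))   # ['abc', 'd', 'e']
print(optimal_segment("abcde", vocab))  # ['ab', 'cde']
```

In this toy case the greedy choice of `abc` blocks the longer `cde` token, so greedy needs three tokens where two suffice.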
[COMMENTS]
LoResLM @ COLING 2025
[LINK]
http://arxiv.org/abs/2412.06926v3
[DATE]
2024-12-19 17:24:39+08:00
[CATEGORIES]
cs.CL
cs.LG
Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT
[AUTHORS]
Hassane Kissane, Achim Schilling, Patrick Krauss
[ABSTRACT]
This study investigates the internal representations of verb-particle
combinations within transformer-based large language models (LLMs),
specifically examining how these models capture lexical and syntactic nuances
at different neural network layers. Employing the BERT architecture, we analyse
the representational efficacy of its layers for various verb-particle
constructions such as ‘agree on’, ‘come back’, and ‘give up’. Our methodology
includes a detailed dataset preparation from the British National Corpus,
followed by extensive model training and output analysis through techniques
like multi-dimensional scaling (MDS) and generalized discrimination value (GDV)
calculations. Results show that BERT’s middle layers most effectively capture
syntactic structures, with significant variability in representational accuracy
across different verb categories. These findings challenge the conventional
uniformity assumed in neural network processing of linguistic elements and
suggest a complex interplay between network architecture and linguistic
representation. Our research contributes to a better understanding of how deep
learning models comprehend and process language, offering insights into the
potential and limitations of current neural approaches to linguistic analysis.
This study not only advances our knowledge in computational linguistics but
also prompts further research into optimizing neural architectures for enhanced
linguistic precision.
[LINK]
http://arxiv.org/abs/2412.14670v1
[DATE]
2024-12-19 17:21:39+08:00
[CATEGORIES]
cs.CL
Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models
[AUTHORS]
Zijun Chen, Wenbo Hu, Guande He, Zhijie Deng, Zheng Zhang, Richang Hong
[ABSTRACT]
Multimodal large language models (MLLMs) combine visual and textual data for
tasks such as image captioning and visual question answering. Proper
uncertainty calibration is crucial, yet challenging, for reliable use in areas
like healthcare and autonomous driving. This paper investigates representative
MLLMs, focusing on their calibration across various scenarios, including before
and after visual fine-tuning, as well as before and after multimodal training
of the base LLMs. We observed miscalibration in their performance, and at the
same time, no significant differences in calibration across these scenarios. We
also highlight how uncertainty differs between text and images and how their
integration affects overall uncertainty. To better understand MLLMs’
miscalibration and their ability to self-assess uncertainty, we construct the
IDK (I don’t know) dataset, which is key to evaluating how they handle
unknowns. Our findings reveal that MLLMs tend to give answers rather than admit
uncertainty, but this self-assessment improves with proper prompt adjustments.
Finally, to calibrate MLLMs and enhance model reliability, we propose
techniques such as temperature scaling and iterative prompt optimization. Our
results provide insights into improving MLLMs for effective and responsible
deployment in multimodal applications. Code and IDK dataset:
https://github.com/hfutml/Calibration-MLLM.
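Temperature scaling, one of the calibration techniques named above, divides the logits by a scalar before the softmax; a temperature above 1 flattens the distribution and lowers the top probability. The sketch below shows only this generic operation. How the temperature is fitted in the paper is not stated in the abstract, and the function name is an illustrative choice.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

In practice the temperature is usually chosen on held-out data so that the scaled probabilities match observed accuracy as closely as possible.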
[COMMENTS]
Accepted to COLING 2025
[LINK]
http://arxiv.org/abs/2412.14660v1
[DATE]
2024-12-19 17:10:07+08:00
[CATEGORIES]
cs.CL
cs.LG
Length Controlled Generation for Black-box LLMs
[AUTHORS]
Yuxuan Gu, Wenjie Wang, Xiaocheng Feng, Weihong Zhong, Kun Zhu, Lei Huang, Tat-Seng Chua, Bing Qin
[ABSTRACT]
Large language models (LLMs) have demonstrated impressive instruction
following capabilities, while still struggling to accurately manage the length
of the generated text, which is a fundamental requirement in many real-world
applications. Existing length control methods involve fine-tuning the
parameters of LLMs, which is inefficient and suboptimal for practical use. In
this paper, we propose a novel iterative sampling framework for text length
control, integrating the Metropolis-Hastings algorithm with an importance
sampling acceleration strategy. This framework efficiently and reliably
regulates LLMs to generate length-constrained text without modifying the
underlying parameters, thereby preserving the original capabilities of LLMs.
Experimental results demonstrate that our framework achieves almost 100%
success rates of length control on Llama3.1 for tasks such as length-controlled
abstractive summarization and length-constrained instruction following, with
minimal additional computational overhead. This also highlights the significant
potential of our method for precise length control across a broader range of
applications, without compromising the versatility of LLMs.
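For readers unfamiliar with Metropolis-Hastings, the sketch below shows the accept/reject step on a toy problem in which each state is just a text length. It assumes a symmetric proposal and a target density proportional to exp(-beta * |length - target_len|); both assumptions, along with every name and parameter here, are hypothetical. The paper's actual target distribution, its LLM-based proposals, and its importance sampling acceleration are not described in the abstract and are not reproduced.

```python
import math
import random


def acceptance_probability(current_len, proposed_len, target_len, beta=0.1):
    """MH acceptance probability for a symmetric proposal.

    Assumed target density: proportional to exp(-beta * |length - target_len|).
    """
    current_gap = abs(current_len - target_len)
    proposed_gap = abs(proposed_len - target_len)
    return min(1.0, math.exp(-beta * (proposed_gap - current_gap)))


def metropolis_hastings(initial_len, propose, target_len, steps=200, beta=0.1, seed=0):
    """Run the chain; `propose(state, rng)` stands in for regenerating a text."""
    rng = random.Random(seed)
    state = initial_len
    for _ in range(steps):
        candidate = propose(state, rng)
        # Always accept improvements; accept worse candidates only sometimes.
        if rng.random() < acceptance_probability(state, candidate, target_len, beta):
            state = candidate
    return state


# Toy symmetric proposal: move the length up or down by five words.
final_len = metropolis_hastings(
    initial_len=180,
    propose=lambda length, rng: length + rng.choice([-5, 5]),
    target_len=100,
)
```

A candidate closer to the target length is always accepted, while one that moves away is accepted with a probability that shrinks exponentially with the extra distance.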
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2412.14656v1
[DATE]
2024-12-19 17:07:38+08:00
[CATEGORIES]
cs.CL
TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation
[AUTHORS]
Jiatong Li, Junxian Li, Yunqing Liu, Dongzhan Zhou, Qing Li
[ABSTRACT]
In this paper, we propose Text-based Open Molecule Generation Benchmark
(TOMG-Bench), the first benchmark to evaluate the open-domain molecule
generation capability of LLMs. TOMG-Bench encompasses a dataset of three major
tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and
customized molecule generation (MolCustom). Each task further contains three
subtasks, with each subtask comprising 5,000 test samples. Given the inherent
complexity of open molecule generation, we have also developed an automated
evaluation system that helps measure both the quality and the accuracy of the
generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the
current limitations and potential areas for improvement in text-guided molecule
discovery. Furthermore, with the assistance of OpenMolIns, a specialized
instruction tuning dataset proposed for solving challenges raised by
TOMG-Bench, Llama3.1-8B could outperform all the open-source general LLMs, even
surpassing GPT-3.5-turbo by 46.5% on TOMG-Bench. Our codes and datasets are
available through https://github.com/phenixace/TOMG-Bench.
[COMMENTS]
A benchmark for text-based open molecule generation
[LINK]
http://arxiv.org/abs/2412.14642v1
[DATE]
2024-12-19 16:51:16+08:00
[CATEGORIES]
cs.CL
Learning to Generate Research Idea with Dynamic Control
[AUTHORS]
Ruochen Li, Liqiang Jing, Chi Han, Jiawei Zhou, Xinya Du
[ABSTRACT]
The rapid advancements in large language models (LLMs) have demonstrated
their potential to accelerate scientific discovery, particularly in automating
the process of research ideation. LLM-based systems have shown promise in
generating hypotheses and research ideas. However, current approaches
predominantly rely on prompting-based pre-trained models, limiting their
ability to optimize generated content effectively. Moreover, they also lack the
capability to deal with the complex interdependence and inherent restrictions
among novelty, feasibility, and effectiveness, which remains challenging due to
the inherent trade-offs among these dimensions, such as the
innovation-feasibility conflict. To address these limitations, we propose, for
the first time, fine-tuning LLMs to be better idea proposers and introduce a novel
framework that employs a two-stage approach combining Supervised Fine-Tuning
(SFT) and controllable Reinforcement Learning (RL). In the SFT stage, the model
learns foundational patterns from pairs of research papers and follow-up ideas.
In the RL stage, multi-dimensional reward modeling, guided by fine-grained
feedback, evaluates and optimizes the generated ideas across key metrics.
Dimensional controllers enable dynamic adjustment of generation, while a
sentence-level decoder ensures context-aware emphasis during inference. Our
framework provides a balanced approach to research ideation, achieving
high-quality outcomes by dynamically navigating the trade-offs among novelty,
feasibility, and effectiveness.
[LINK]
http://arxiv.org/abs/2412.14626v1
[DATE]
2024-12-19 16:28:18+08:00
[CATEGORIES]
cs.CL
KARRIEREWEGE: A Large Scale Career Path Prediction Dataset
[AUTHORS]
Elena Senger, Yuri Campbell, Rob van der Goot, Barbara Plank
[COMMENTS]
Accepted at COLING Industry Track
[LINK]
http://arxiv.org/abs/2412.14612v1
[DATE]
2024-12-19 16:02:08+08:00
[CATEGORIES]
cs.CL
LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining
[AUTHORS]
Huawen Shen, Gengluo Li, Jinwen Zhong, Yu Zhou
[ABSTRACT]
Visual Information Extraction (VIE) plays a crucial role in the comprehension
of semi-structured documents, and several pre-trained models have been
developed to enhance performance. However, most of these works are monolingual
(usually English). Due to the extremely unbalanced quantity and quality of
pre-training corpora between English and other languages, few works can extend
to non-English scenarios. In this paper, we conduct systematic experiments to
show that vision and layout modality hold invariance among images with
different languages. When language bias is decoupled from document images, a
vision-layout-based model can achieve impressive cross-lingual generalization.
Accordingly, we present a simple but effective multilingual training paradigm
LDP (Language Decoupled Pre-training) for better utilization of monolingual
pre-training data. Our proposed model LDM (Language Decoupled Model) is first
pre-trained on the language-independent data, where the language knowledge is
decoupled by a diffusion model, and then the LDM is fine-tuned on the
downstream languages. Extensive experiments show that the LDM outperforms all
SOTA multilingual pre-trained models, and also maintains competitiveness on
downstream monolingual/English benchmarks.
[COMMENTS]
Accepted by AAAI2025
[LINK]
http://arxiv.org/abs/2412.14596v1
[DATE]
2024-12-19 15:31:40+08:00
[CATEGORIES]
cs.CL
cs.LG
Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
[AUTHORS]
Raphael Merx, Adérito José Guterres Correia, Hanna Suominen, Ekaterina Vylomova
[ABSTRACT]
Low-resource machine translation (MT) presents a diversity of community needs
and application challenges that remain poorly understood. To complement surveys
and focus groups, which tend to rely on small samples of respondents, we
propose an observational study on actual usage patterns of a specialized MT
service for the Tetun language, which is the lingua franca in Timor-Leste. Our
analysis of 100,000 translation requests reveals patterns that challenge
assumptions based on existing corpora. We find that users, many of them
students on mobile devices, typically translate text from a high-resource
language into Tetun across diverse domains including science, healthcare, and
daily life. This contrasts sharply with available Tetun corpora, which are
dominated by news articles covering government and social issues. Our results
suggest that MT systems for minority languages like Tetun should prioritize
accuracy on domains relevant to educational contexts, in the high-resource to
low-resource direction. More broadly, this study demonstrates how observational
analysis can inform low-resource language technology development, by grounding
research in practical community needs.
[LINK]
http://arxiv.org/abs/2411.12262v2
[DATE]
2024-12-19 15:29:23+08:00
[CATEGORIES]
cs.CL
Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues
[AUTHORS]
Tao He, Lizi Liao, Yixin Cao, Yuanxing Liu, Yiheng Sun, Zerui Chen, Ming Liu, Bing Qin
[ABSTRACT]
Recent advancements in proactive dialogues have garnered significant
attention, particularly for more complex objectives (e.g. emotion support and
persuasion). Unlike traditional task-oriented dialogues, proactive dialogues
demand advanced policy planning and adaptability, requiring rich scenarios and
comprehensive policy repositories to develop such systems. However, existing
approaches tend to rely on Large Language Models (LLMs) for user simulation and
online learning, leading to biases that diverge from realistic scenarios and
result in suboptimal efficiency. Moreover, these methods depend on manually
defined, context-independent, coarse-grained policies, which not only incur
high expert costs but also raise concerns regarding their completeness. In our
work, we highlight the potential for automatically discovering policies
directly from raw, real-world dialogue records. To this end, we introduce a
novel dialogue policy planning framework, LDPP. It fully automates the process
from mining policies in dialogue records to learning policy planning.
Specifically, we employ a variant of the Variational Autoencoder to discover
fine-grained policies represented as latent vectors. After automatically
annotating the data with these latent policy labels, we propose an Offline
Hierarchical Reinforcement Learning (RL) algorithm in the latent space to
develop effective policy planning capabilities. Our experiments demonstrate
that LDPP outperforms existing methods on two proactive scenarios, even
surpassing ChatGPT with only a 1.8-billion-parameter LLM.
[COMMENTS]
24 pages, 5 figures, AAAI 2025
[LINK]
http://arxiv.org/abs/2412.14584v1
[DATE]
2024-12-19 15:06:01+08:00
[CATEGORIES]
cs.CL
CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation
[AUTHORS]
Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He
[ABSTRACT]
With the adoption of retrieval-augmented generation (RAG), large language
models (LLMs) are expected to ground their generation to the retrieved
contexts. Yet, this is hindered by position bias of LLMs, failing to evenly
attend to all contexts. Previous work has addressed this by synthesizing
contexts with perturbed positions of the gold segment, creating a
position-diversified train set. We extend this intuition to propose consistency
regularization with augmentation and distillation. First, we augment each
training instance with its position perturbation to encourage consistent
predictions, regardless of ordering. We also distill behaviors of this pair,
although it can be counterproductive in certain RAG scenarios where the given
order from the retriever is crucial for generation quality. We thus propose
CORD, balancing COnsistency and Rank Distillation. CORD adaptively samples
noise-controlled perturbations from an interpolation space, ensuring both
consistency and respect for the rank prior. Empirical results show this balance
enables CORD to outperform consistently in diverse RAG benchmarks.
[LINK]
http://arxiv.org/abs/2412.14581v1
[DATE]
2024-12-19 15:01:25+08:00
[CATEGORIES]
cs.CL
Piece of Table: A Divide-and-Conquer Approach for Selecting Sub-Tables in Table Question Answering
[AUTHORS]
Wonjin Lee, Kyumin Kim, Sungjae Lee, Jihun Lee, Kwang In Kim
[ABSTRACT]
Applying language models (LMs) to tables is challenging due to the inherent
structural differences between two-dimensional tables and one-dimensional text
for which the LMs were originally designed. Furthermore, when applying
linearized tables to LMs, the maximum token lengths often imposed in
self-attention calculations make it difficult to comprehensively understand the
context spread across large tables. To address these challenges, we present
PieTa (Piece of Table), a new framework for sub-table-based question answering
(QA). PieTa operates through an iterative process of dividing tables into
smaller windows, using LMs to select relevant cells within each window, and
merging these cells into a sub-table. This multi-resolution approach captures
dependencies across multiple rows and columns while avoiding the limitations
caused by long context inputs. Instantiated as a simple iterative sub-table
union algorithm, PieTa demonstrates improved performance over previous
sub-table-based QA approaches.
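The divide, select, and merge loop described above can be pictured with a single round over a small table. In the sketch below, `select_cells` is a hypothetical callable standing in for the language model, and forming the sub-table as the union of the rows and columns of the selected cells is an assumption; the abstract does not spell out either detail, nor the window shape.

```python
def split_into_windows(n_rows, n_cols, size):
    """Yield (row_indices, col_indices) for non-overlapping size x size windows."""
    for r in range(0, n_rows, size):
        for c in range(0, n_cols, size):
            yield (list(range(r, min(r + size, n_rows))),
                   list(range(c, min(c + size, n_cols))))


def select_sub_table(table, select_cells, size=2):
    """One round: query every window, then union the selected rows and columns.

    select_cells(table, rows, cols) must return (row, col) pairs that lie
    inside the given window; here it is a stand-in for an LM call.
    """
    n_rows, n_cols = len(table), len(table[0])
    keep_rows, keep_cols = set(), set()
    for rows, cols in split_into_windows(n_rows, n_cols, size):
        for r, c in select_cells(table, rows, cols):
            keep_rows.add(r)
            keep_cols.add(c)
    # Preserve the original row and column order in the merged sub-table.
    return [[table[r][c] for c in sorted(keep_cols)] for r in sorted(keep_rows)]
```

Applying the same function again to its own output with a larger window would give an iterative, multi-resolution variant of this round.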
[LINK]
http://arxiv.org/abs/2412.07629v2
[DATE]
2024-12-19 14:53:37+08:00
[CATEGORIES]
cs.CL
Sliding Windows Are Not the End: Exploring Full Ranking with Long-Context Large Language Models
[AUTHORS]
Wenhan Liu, Xinyu Ma, Yutao Zhu, Ziliang Zhao, Shuaiqiang Wang, Dawei Yin, Zhicheng Dou
[ABSTRACT]
Large Language Models (LLMs) have shown exciting performance in listwise
passage ranking. Due to the limited input length, existing methods often adopt
the sliding window strategy. Such a strategy, though effective, is inefficient
as it involves repetitive and serialized processing, which usually re-evaluates
relevant passages multiple times. As a result, it incurs redundant API costs,
which are proportional to the number of inference tokens. The development of
long-context LLMs enables the full ranking of all passages within a single
inference, avoiding redundant API costs. In this paper, we conduct a
comprehensive study of long-context LLMs for ranking tasks in terms of
efficiency and effectiveness. Surprisingly, our experiments reveal that full
ranking with long-context LLMs can deliver superior performance in the
supervised fine-tuning setting with a huge efficiency improvement. Furthermore,
we identify two limitations of fine-tuning the full ranking model based on
existing methods: (1) sliding window strategy fails to produce a full ranking
list as a training label, and (2) the language modeling loss cannot emphasize
top-ranked passage IDs in the label. To alleviate these issues, we propose a
new complete listwise label construction approach and a novel importance-aware
learning objective for full ranking. Experiments show the superior performance
of our method over baselines. Our codes are available at
https://github.com/8421BCD/fullrank.
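The redundancy of the sliding window strategy can be seen by counting how many passages the model reads in one pass. The sketch below assumes a window that moves from the bottom of the candidate list to the top with overlap; the window size of 20 and step of 10 are illustrative defaults, not values taken from the abstract, and the function names are hypothetical.

```python
def sliding_windows(n_passages, window=20, step=10):
    """Yield (start, end) index pairs, moving from the bottom of the list to the top."""
    end = n_passages
    while True:
        start = max(0, end - window)
        yield start, end
        if start == 0:
            break
        end -= step


def passages_read(n_passages, window=20, step=10):
    """Total passages fed to the model across all windows in one pass."""
    return sum(end - start for start, end in sliding_windows(n_passages, window, step))


print(passages_read(100))  # 180, versus 100 for a single full-ranking call
```

Under these assumed settings, ranking 100 passages takes nine sequential calls that together read 180 passages, so most passages are processed twice.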
[COMMENTS]
14 pages
[LINK]
http://arxiv.org/abs/2412.14574v1
[DATE]
2024-12-19 14:44:59+08:00
[CATEGORIES]
cs.CL
Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions
[AUTHORS]
Quan Liu, Zhenhong Zhou, Longzhu He, Yi Liu, Wei Zhang, Sen Su
[ABSTRACT]
Large language models are susceptible to jailbreak attacks, which can result
in the generation of harmful content. While prior defenses mitigate these risks
by perturbing or inspecting inputs, they ignore competing objectives, the
underlying cause of alignment failures. In this paper, we propose
Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive
decoding to address the root causes of jailbreak issues. We first define the
Competitive Index to quantify alignment failures and utilize feedback from
self-evaluation to compute post-alignment logits. Then, AED adaptively combines
AED and post-alignment logits with the original logits to obtain harmless and
helpful distributions. Consequently, our method enhances safety alignment while
maintaining helpfulness. We conduct experiments across five models and four
common jailbreaks, with the results validating the effectiveness of our
approach. Code is available at https://github.com/GIGABaozi/AED.git.
[COMMENTS]
Accepted by EMNLP 2024, 15 pages, 5 figures
[LINK]
http://arxiv.org/abs/2408.07663v2
[DATE]
2024-12-19 14:34:31+08:00
[CATEGORIES]
cs.CL
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
[AUTHORS]
Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli
[ABSTRACT]
Encoder-only transformer models such as BERT offer a great performance-size
tradeoff for retrieval and classification tasks with respect to larger
decoder-only models. Despite being the workhorse of numerous production
pipelines, there have been limited Pareto improvements to BERT since its
release. In this paper, we introduce ModernBERT, bringing modern model
optimizations to encoder-only models and representing a major Pareto
improvement over older encoders. Trained on 2 trillion tokens with a native
8192 sequence length, ModernBERT models exhibit state-of-the-art results on a
large pool of evaluations encompassing diverse classification tasks and both
single and multi-vector retrieval on different domains (including code). In
addition to strong downstream performance, ModernBERT is also the most speed
and memory efficient encoder and is designed for inference on common GPUs.
[LINK]
http://arxiv.org/abs/2412.13663v2
[DATE]
2024-12-19 14:32:26+08:00
[CATEGORIES]
cs.CL
Improving Retrieval Augmented Language Model with Self-Reasoning
[AUTHORS]
Yuan Xia, Jingbo Zhou, Zhenhui Shi, Jun Chen, Haifeng Huang
[ABSTRACT]
The Retrieval-Augmented Language Model (RALM) has shown remarkable
performance on knowledge-intensive tasks by incorporating external knowledge
during inference, which mitigates the factual hallucinations inherent in large
language models (LLMs). Despite these advancements, challenges persist in the
implementation of RALMs, particularly concerning their reliability and
traceability. Specifically, retrieving irrelevant documents may result in
unhelpful response generation or even deteriorate the performance of LLMs,
while the lack of proper citations in generated outputs complicates efforts to
verify the trustworthiness of the models. To this end, we propose a novel
self-reasoning framework aimed at improving the reliability and traceability of
RALMs, whose core idea is to leverage reasoning trajectories generated by the
LLM itself. The framework involves constructing self-reason trajectories with
three processes: a relevance-aware process, an evidence-aware selective
process, and a trajectory analysis process. We have evaluated our framework
across four public datasets (two short-form QA datasets, one long-form QA
dataset, and one fact verification dataset) to demonstrate the superiority of
our method, which can outperform existing state-of-the-art models and can
achieve comparable performance with GPT-4, while only using 2,000 training
samples.
[COMMENTS]
AAAI 2025 (main conference)
[LINK]
http://arxiv.org/abs/2407.19813v3
[DATE]
2024-12-19 14:27:44+08:00
[CATEGORIES]
cs.CL
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models
[AUTHORS]
Quang-Hung Le, Long Hoang Dang, Ngan Le, Truyen Tran, Thao Minh Le
[ABSTRACT]
Existing Large Vision-Language Models (LVLMs) excel at matching concepts
across multi-modal inputs but struggle with compositional concepts and
high-level relationships between entities. This paper introduces Progressive
multi-granular Vision-Language alignments (PromViL), a novel framework to
enhance LVLMs’ ability in performing grounded compositional visual reasoning
tasks. Our approach constructs a hierarchical structure of multi-modal
alignments, ranging from simple to complex concepts. By progressively aligning
textual descriptions with corresponding visual regions, our model learns to
leverage contextual information from lower levels to inform higher-level
reasoning. To facilitate this learning process, we introduce a data generation
process that creates a novel dataset derived from Visual Genome, providing a
wide range of nested compositional vision-language pairs. Experimental results
demonstrate that our PromViL framework significantly outperforms baselines on
various visual grounding and compositional question answering tasks. The code
is available at: https://github.com/lqh52/PromViL.
[LINK]
http://arxiv.org/abs/2412.08125v2
[DATE]
2024-12-19 13:46:29+08:00
[CATEGORIES]
cs.CL
cs.LG
Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models
[AUTHORS]
Wei Zhao, Zhe Li, Yige Li, Jun Sun
[ABSTRACT]
Despite significant ongoing efforts in safety alignment, large language
models (LLMs) such as GPT-4 and LLaMA 3 remain vulnerable to jailbreak attacks
that can induce harmful behaviors, including through the use of adversarial
suffixes. Building on prior research, we hypothesize that these adversarial
suffixes are not mere bugs but may represent features that can dominate the
LLM’s behavior. To evaluate this hypothesis, we conduct several experiments.
First, we demonstrate that benign features can be effectively made to function
as adversarial suffixes, i.e., we develop a feature extraction method to
extract sample-agnostic features from benign datasets in the form of suffixes
and show that these suffixes may effectively compromise safety alignment.
Second, we show that adversarial suffixes generated from jailbreak attacks may
contain meaningful features, i.e., appending the same suffix to different
prompts results in responses exhibiting specific characteristics. Third, we
show that such benign-yet-safety-compromising features can be easily introduced
through fine-tuning using only benign datasets. As a result, we are able to
completely eliminate GPT’s safety alignment in a blackbox setting through
finetuning with only benign data. Our code and data are available at
https://github.com/suffix-maybe-feature/adver-suffix-maybe-features.
[LINK]
http://arxiv.org/abs/2410.00451v3
[DATE]
2024-12-19 13:32:59+08:00
[CATEGORIES]
cs.CL
VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models
[AUTHORS]
Haowen Hou, Peigen Zeng, Fei Ma, Fei Richard Yu
[ABSTRACT]
Visual Language Models (VLMs) have rapidly progressed with the recent success
of large language models. However, there have been few attempts to incorporate
efficient linear Recurrent Neural Network (RNN) architectures into VLMs. In
this study, we introduce VisualRWKV, the first application of a linear RNN
model to multimodal learning tasks, leveraging the pre-trained RWKV language
model. We propose a data-dependent recurrence and sandwich prompts to enhance
our modeling capabilities, along with a 2D image scanning mechanism to enrich
the processing of visual sequences. Extensive experiments demonstrate that
VisualRWKV achieves competitive performance compared to Transformer-based
models like LLaVA-1.5 on various benchmarks. Compared to LLaVA-1.5, VisualRWKV
has a speed advantage of 3.98 times and can save 54% of GPU memory when
reaching an inference length of 24K tokens. To facilitate further research and
analysis, we have made the checkpoints and the associated code publicly
accessible at the following GitHub repository:
https://github.com/howard-hou/VisualRWKV.
[COMMENTS]
Accepted at COLING 2025 main conference
[LINK]
http://arxiv.org/abs/2406.13362v3
[DATE]
2024-12-19 13:26:14+08:00
[CATEGORIES]
cs.CL
cs.LG
Doubly-Universal Adversarial Perturbations: Deceiving Vision-Language Models Across Both Images and Text with a Single Perturbation
[AUTHORS]
Hee-Seon Kim, Minbeom Kim, Changick Kim
[ABSTRACT]
Large Vision-Language Models (VLMs) have demonstrated remarkable performance
across multimodal tasks by integrating vision encoders with large language
models (LLMs). However, these models remain vulnerable to adversarial attacks.
Among such attacks, Universal Adversarial Perturbations (UAPs) are especially
powerful, as a single optimized perturbation can mislead the model across
various input images. In this work, we introduce a novel UAP specifically
designed for VLMs: the Doubly-Universal Adversarial Perturbation (Doubly-UAP),
capable of universally deceiving VLMs across both image and text inputs. To
successfully disrupt the vision encoder’s fundamental process, we analyze the
core components of the attention mechanism. After identifying value vectors in
the middle-to-late layers as the most vulnerable, we optimize Doubly-UAP in a
label-free manner with a frozen model. Despite being developed as a black-box
to the LLM, Doubly-UAP achieves high attack success rates on VLMs, consistently
outperforming baseline methods across vision-language tasks. Extensive ablation
studies and analyses further demonstrate the robustness of Doubly-UAP and
provide insights into how it influences internal attention mechanisms.
[LINK]
http://arxiv.org/abs/2412.08108v2
[DATE]
2024-12-19 13:01:33+08:00
[CATEGORIES]
cs.CL
Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models
[AUTHORS]
Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li
[ABSTRACT]
Knowledge distillation (KD) has become a prevalent technique for compressing
large language models (LLMs). Existing KD methods are constrained by the need
for identical tokenizers (i.e., vocabularies) between teacher and student
models, limiting their versatility in handling LLMs of different architecture
families. In this paper, we introduce the Multi-Level Optimal Transport
(MultiLevelOT), a novel approach that advances the optimal transport for
universal cross-tokenizer knowledge distillation. Our method aligns the logit
distributions of the teacher and the student at both token and sequence levels
using diverse cost matrices, eliminating the need for dimensional or
token-by-token correspondence. At the token level, MultiLevelOT integrates both
global and local information by jointly optimizing all tokens within a sequence
to enhance robustness. At the sequence level, we efficiently capture complex
distribution structures of logits via the Sinkhorn distance, which approximates
the Wasserstein distance for divergence measures. Extensive experiments on
tasks such as extractive QA, generative QA, and summarization demonstrate that
the MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under
various settings. Our approach is robust to different student and teacher
models across model families, architectures, and parameter sizes.
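As a rough illustration of the Sinkhorn distance mentioned in the abstract above, the sketch below runs generic Sinkhorn iterations for entropy-regularised optimal transport between two histograms. It is not the authors' MultiLevelOT implementation; the function name sinkhorn_distance, the regularisation strength eps, the iteration count, and the toy cost matrix are all assumptions made for this example.

```python
import numpy as np

def sinkhorn_distance(a, b, cost, eps=0.1, n_iters=200):
    """Approximate the optimal transport cost between histograms a and b.

    a, b  : 1-D arrays of non-negative weights, each summing to 1
    cost  : matrix with cost[i, j] = cost of moving mass from bin i to bin j
    eps   : entropy regularisation strength (assumed value, not from the paper)
    """
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # rescale to match column marginals
        u = a / (K @ v)                  # rescale to match row marginals
    plan = u[:, None] * K * v[None, :]   # regularised transport plan
    return float(np.sum(plan * cost)), plan

# Toy usage: two 3-bin histograms with an |i - j| ground cost.
a = np.array([0.7, 0.2, 0.1])
b = np.array([0.1, 0.2, 0.7])
cost = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
distance, plan = sinkhorn_distance(a, b, cost)
```

Smaller values of eps bring the result closer to the unregularised Wasserstein cost, at the price of slower convergence and a greater risk of numerical underflow in the kernel.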
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.14528v1
[DATE]
2024-12-19 12:51:06+08:00
[CATEGORIES]
cs.CL
Self-Generated Critiques Boost Reward Modeling for Language Models
[AUTHORS]
Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou
[ABSTRACT]
Reward modeling is crucial for aligning large language models (LLMs) with
human preferences, especially in reinforcement learning from human feedback
(RLHF). However, current reward models mainly produce scalar scores and
struggle to incorporate critiques in a natural language format. We hypothesize
that predicting both critiques and the scalar reward would improve reward
modeling ability. Motivated by this, we propose Critic-RM, a framework that
improves reward models using self-generated critiques without extra
supervision. Critic-RM employs a two-stage process: generating and filtering
high-quality critiques, followed by joint fine-tuning on reward prediction and
critique generation. Experiments across benchmarks show that Critic-RM improves
reward modeling accuracy by 3.7%-7.3% compared to standard reward models and
LLM judges, demonstrating strong performance and data efficiency. Additional
studies further validate the effectiveness of generated critiques in rectifying
flawed reasoning steps with 2.5%-3.2% gains in improving reasoning accuracy.
[COMMENTS]
20 pages
[LINK]
http://arxiv.org/abs/2411.16646v2
[DATE]
2024-12-19 12:50:45+08:00
[CATEGORIES]
cs.CL
cs.LG
KnowledgePrompts: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting
[AUTHORS]
Thilini Wijesiriwardene, Ruwan Wickramarachchi, Sreeram Vennam, Vinija Jain, Aman Chadha, Amitava Das, Ponnurangam Kumaraguru, Amit Sheth
[ABSTRACT]
Making analogies is fundamental to cognition. Proportional analogies, which
consist of four terms, are often used to assess linguistic and cognitive
abilities. For instance, completing analogies like “Oxygen is to Gas as
[COMMENTS]
Accepted at COLING 2025
[LINK]
http://arxiv.org/abs/2412.00869v2
[DATE]
2024-12-19 12:38:59+08:00
[CATEGORIES]
cs.CL
UOR: Universal Backdoor Attacks on Pre-trained Language Models
[AUTHORS]
Wei Du, Peixuan Li, Boqun Li, Haodong Zhao, Gongshen Liu
[COMMENTS]
ACL-Findings 2024
[LINK]
http://arxiv.org/abs/2305.09574v2
[DATE]
2024-12-19 12:35:08+08:00
[CATEGORIES]
cs.CL
Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment
[AUTHORS]
Teng Xiao, Yige Yuan, Huaisheng Zhu, Mingxiao Li, Vasant G Honavar
[COMMENTS]
Accepted by NeurIPS 2024 Main
[LINK]
http://arxiv.org/abs/2412.14516v1
[DATE]
2024-12-19 12:31:56+08:00
[CATEGORIES]
cs.LG
cs.CL
PA-RAG: RAG Alignment via Multi-Perspective Preference Optimization
[AUTHORS]
Jiayi Wu, Hengyi Cai, Lingyong Yan, Hao Sun, Xiang Li, Shuaiqiang Wang, Dawei Yin, Ming Gao
[ABSTRACT]
The emergence of Retrieval-augmented generation (RAG) has alleviated the
issues of outdated and hallucinatory content in the generation of large
language models (LLMs), yet it still reveals numerous limitations. When a
general-purpose LLM serves as the RAG generator, it often suffers from
inadequate response informativeness, response robustness, and citation quality.
Past approaches to tackle these limitations, either by incorporating additional
steps beyond generating responses or optimizing the generator through
supervised fine-tuning (SFT), still failed to align with the RAG requirement
thoroughly. Consequently, optimizing the RAG generator from multiple preference
perspectives while maintaining its end-to-end LLM form remains a challenge. To
bridge this gap, we propose Multiple Perspective Preference Alignment for
Retrieval-Augmented Generation (PA-RAG), a method for optimizing the generator
of RAG systems to align with RAG requirements comprehensively. Specifically, we
construct high-quality instruction fine-tuning data and multi-perspective
preference data by sampling varied quality responses from the generator across
scenarios with prompt documents of differing quality. Subsequently, we optimize the
generator using SFT and Direct Preference Optimization (DPO). Extensive
experiments conducted on four question-answer datasets across three LLMs
demonstrate that PA-RAG can significantly enhance the performance of RAG
generators. Our code and datasets are available at
https://github.com/wujwyi/PA-RAG.
[LINK]
http://arxiv.org/abs/2412.14510v1
[DATE]
2024-12-19 12:18:51+08:00
[CATEGORIES]
cs.CL
GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
[AUTHORS]
Saumya Saxena, Blake Buchanan, Chris Paxton, Bingqing Chen, Narunas Vaskevicius, Luigi Palmieri, Jonathan Francis, Oliver Kroemer
[ABSTRACT]
In Embodied Question Answering (EQA), agents must explore and develop a
semantic understanding of an unseen environment in order to answer a situated
question with confidence. This remains a challenging problem in robotics, due
to the difficulties in obtaining useful semantic representations, updating
these representations online, and leveraging prior world knowledge for
efficient exploration and planning. Aiming to address these limitations, we
propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic
scene graphs (3DSGs) and task relevant images as multi-modal memory for
grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen
environments. We employ a hierarchical planning approach that exploits the
hierarchical nature of 3DSGs for structured planning and semantic-guided
exploration. Through experiments in simulation on the HM-EQA dataset and in the
real world in home and office environments, we demonstrate that our method
outperforms key baselines by completing EQA tasks with higher success rates and
fewer planning steps.
[COMMENTS]
Project website: https://saumyasaxena.github.io/grapheqa
[LINK]
http://arxiv.org/abs/2412.14480v1
[DATE]
2024-12-19 11:04:34+08:00
[CATEGORIES]
cs.CL
cs.LG
Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs
[AUTHORS]
Koshiro Saito, Sakae Mizuki, Masanari Ohi, Taishi Nakamura, Taihei Shiotani, Koki Maeda, Youmi Ma, Kakeru Hattori, Kazuki Fujii, Takumi Okamoto, Shigeki Ishida, Hiroya Takamura, Rio Yokota, Naoaki Okazaki
[ABSTRACT]
Why do we build local large language models (LLMs)? What should a local LLM
learn from the target language? Which abilities can be transferred from other
languages? Do language-specific scaling laws exist? To explore these research
questions, we evaluated 35 Japanese, English, and multilingual LLMs on 19
evaluation benchmarks for Japanese and English, taking Japanese as a local
language. Adopting an observational approach, we analyzed correlations of
benchmark scores, and conducted principal component analysis (PCA) on the
scores to derive ability factors of local LLMs. We found that training
on English text can improve the scores of academic subjects in Japanese
(JMMLU). In addition, it is unnecessary to specifically train on Japanese text
to enhance abilities for solving Japanese code generation, arithmetic
reasoning, commonsense, and reading comprehension tasks. In contrast, training
on Japanese text could improve question-answering tasks about Japanese
knowledge and English-Japanese translation, which indicates that abilities for
solving these two tasks can be regarded as Japanese abilities for
LLMs. Furthermore, we confirmed that the Japanese abilities scale with the
computational budget for Japanese text.
[COMMENTS]
Preprint. Under review
[LINK]
http://arxiv.org/abs/2412.14471v1
[DATE]
2024-12-19 10:39:26+08:00
[CATEGORIES]
cs.CL
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
[AUTHORS]
Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, Xuwang Yin, Mantas Mazeika, Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
[COMMENTS]
NeurIPS 2024
[LINK]
http://arxiv.org/abs/2407.21792v2
[DATE]
2024-12-19 10:39:01+08:00
[CATEGORIES]
cs.LG
cs.CL
Agent Planning with World Knowledge Model
[AUTHORS]
Shuofei Qiao, Runnan Fang, Ningyu Zhang, Yuqi Zhu, Xiang Chen, Shumin Deng, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
[ABSTRACT]
Recent endeavors towards directly using large language models (LLMs) as agent
models to execute interactive planning tasks have shown commendable results.
Despite their achievements, however, they still struggle with brainless
trial-and-error in global planning and generating hallucinatory actions in
local planning due to their poor understanding of the “real” physical world.
Imitating humans’ mental world knowledge model which provides global prior
knowledge before the task and maintains local dynamic knowledge during the
task, in this paper, we introduce parametric World Knowledge Model (WKM) to
facilitate agent planning. Concretely, we steer the agent model to
self-synthesize knowledge from both expert and sampled trajectories. Then we
develop WKM, providing prior task knowledge to guide the global planning and
dynamic state knowledge to assist the local planning. Experimental results on
three complex real-world simulated datasets with three state-of-the-art
open-source LLMs, Mistral-7B, Gemma-7B, and Llama-3-8B, demonstrate that our
method can achieve superior performance compared to various strong baselines.
In addition, our analysis illustrates that WKM can effectively alleviate the
blind trial-and-error and hallucinatory action issues, providing strong support
for the agent’s understanding of the world. Other interesting findings include:
1) our instance-level task knowledge can generalize better to unseen tasks, 2)
weak WKM can guide strong agent model planning, and 3) unified WKM training has
promising potential for further development. The code is available at
https://github.com/zjunlp/WKM.
[COMMENTS]
NeurIPS 2024
[LINK]
http://arxiv.org/abs/2405.14205v3
[DATE]
2024-12-19 10:35:48+08:00
[CATEGORIES]
cs.CL
cs.LG
WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models
[AUTHORS]
Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
[COMMENTS]
NeurIPS 2024
[LINK]
http://arxiv.org/abs/2405.14768v3
[DATE]
2024-12-19 10:18:54+08:00
[CATEGORIES]
cs.CL
cs.LG
DialSim: A Real-Time Simulator for Evaluating Long-Term Multi-Party Dialogue Understanding of Conversational Agents
[AUTHORS]
Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, Edward Choi
[ABSTRACT]
Recent advancements in Large Language Models (LLMs) have significantly
enhanced the capabilities of conversational agents, making them applicable to
various fields (e.g., education). Despite their progress, the evaluation of the
agents often overlooks the complexities of real-world conversations, such as
real-time interactions, multi-party dialogues, and extended contextual
dependencies. To bridge this gap, we introduce DialSim, a real-time dialogue
simulator. In this simulator, an agent is assigned the role of a character from
popular TV shows, requiring it to respond to spontaneous questions using past
dialogue information and to distinguish between known and unknown information.
Key features of DialSim include assessing the agent’s ability to respond within
a reasonable time limit, handling long-term multi-party dialogues, and
evaluating performance under randomized questioning with LongDialQA, a novel,
high-quality question-answering dataset. Our experiments using DialSim reveal
the strengths and weaknesses of the latest conversational agents, offering
valuable insights for future advancements in conversational AI. DialSim is
available at https://dialsim.github.io/.
[LINK]
http://arxiv.org/abs/2406.13144v3
[DATE]
2024-12-19 10:14:09+08:00
[CATEGORIES]
cs.CL
Knowledge Circuits in Pretrained Transformers
[AUTHORS]
Yunzhi Yao, Ningyu Zhang, Zekun Xi, Mengru Wang, Ziwen Xu, Shumin Deng, Huajun Chen
[COMMENTS]
NeurIPS 2024, 26 pages
[LINK]
http://arxiv.org/abs/2405.17969v3
[DATE]
2024-12-19 10:10:00+08:00
[CATEGORIES]
cs.CL
cs.LG
All-in-One Tuning and Structural Pruning for Domain-Specific LLMs
[AUTHORS]
Lei Lu, Zhepeng Wang, Ruexue Bao, Mengbing Wang, Fangyi Li, Yawen Wu, Weiwen Jiang, Jie Xu, Yanzhi Wang, Shangqian Gao
[ABSTRACT]
Existing pruning techniques for large language models (LLMs) targeting
domain-specific applications typically follow a two-stage process: pruning the
pretrained general-purpose LLMs and then fine-tuning the pruned LLMs on
specific domains. However, the pruning decisions, derived from the pretrained
weights, remain unchanged during fine-tuning, even if the weights have been
updated. Therefore, such a combination of the pruning decisions and the
finetuned weights may be suboptimal, leading to non-negligible performance
degradation. To address these limitations, we propose ATP: All-in-One Tuning
and Structural Pruning, a unified one-stage structural pruning and fine-tuning
approach that dynamically identifies the current optimal substructure
throughout the fine-tuning phase via a trainable pruning decision generator.
Moreover, given the limited available data for domain-specific applications,
Low-Rank Adaptation (LoRA) becomes a common technique to fine-tune the LLMs. In
ATP, we introduce LoRA-aware forward and sparsity regularization to ensure that
the substructures corresponding to the learned pruning decisions can be
directly removed after the ATP process. ATP outperforms the state-of-the-art
two-stage pruning methods on tasks in the legal and healthcare domains. More
specifically, ATP recovers up to 88% and 91% performance of the dense model
when pruning 40% parameters of LLaMA2-7B and LLaMA3-8B models, respectively.
[LINK]
http://arxiv.org/abs/2412.14426v1
[DATE]
2024-12-19 08:41:40+08:00
[CATEGORIES]
cs.CL
In-Group Love, Out-Group Hate: A Framework to Measure Affective Polarization via Contentious Online Discussions
[AUTHORS]
Buddhika Nettasinghe, Ashwin Rao, Bohan Jiang, Allon Percus, Kristina Lerman
[ABSTRACT]
Affective polarization, the emotional divide between ideological groups
marked by in-group love and out-group hate, has intensified in the United
States, driving contentious issues like masking and lockdowns during the
COVID-19 pandemic. Despite its societal impact, existing models of opinion
change neither account for emotional dynamics nor offer methods to quantify
affective polarization robustly and in real-time. In this paper, we introduce a
discrete choice model that captures decision-making within affectively
polarized social networks and propose a statistical inference method to estimate
key parameters – in-group love and out-group hate – from social media data.
Through empirical validation from online discussions about the COVID-19
pandemic, we demonstrate that our approach accurately captures real-world
polarization dynamics and explains the rapid emergence of a partisan gap in
attitudes towards masking and lockdowns. This framework allows for tracking
affective polarization across contentious issues and has broad implications for
fostering constructive online dialogues in digital spaces.
[LINK]
http://arxiv.org/abs/2412.14414v1
[DATE]
2024-12-19 07:58:13+08:00
[CATEGORIES]
cs.CL
ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries
[AUTHORS]
Kishan Maharaj, Vitobha Munigala, Srikanth G. Tamilselvam, Prince Kumar, Sayandeep Sen, Palani Kodeswaran, Abhijit Mishra, Pushpak Bhattacharyya
[ABSTRACT]
Recent advancements in large language models (LLMs) have significantly
enhanced their ability to understand both natural language and code, driving
their use in tasks like natural language-to-code (NL2Code) and code
summarization. However, LLMs are prone to hallucinations: outputs that stray from
intended meanings. Detecting hallucinations in code summarization is especially
difficult due to the complex interplay between programming and natural
languages. We introduce a first-of-its-kind dataset with $\sim$10K samples,
curated specifically for hallucination detection in code summarization. We
further propose a novel Entity Tracing Framework (ETF) that a) utilizes static
program analysis to identify code entities from the program and b) uses LLMs to
map and verify these entities and their intents within generated code
summaries. Our experimental analysis demonstrates the effectiveness of the
framework, leading to a 0.73 F1 score. This approach provides an interpretable
method for detecting hallucinations by grounding entities, allowing us to
evaluate summary accuracy.
[COMMENTS]
11 pages, 6 Figures, 5 Tables
[LINK]
http://arxiv.org/abs/2410.14748v3
[DATE]
2024-12-19 07:36:03+08:00
[CATEGORIES]
cs.CL
Hands-Free VR
[AUTHORS]
Jorge Askur Vazquez Fernandez, Jae Joong Lee, Santiago Andrés Serrano Vacca, Alejandra Magana, Radim Pesam, Bedrich Benes, Voicu Popescu
[ABSTRACT]
The paper introduces Hands-Free VR, a voice-based natural-language interface
for VR. The user gives a command using their voice, the speech audio data is
converted to text using a speech-to-text deep learning model that is fine-tuned
for robustness to word phonetic similarity and to spoken English accents, and
the text is mapped to an executable VR command using a large language model
that is robust to natural language diversity. Hands-Free VR was evaluated in a
controlled within-subjects study (N = 22) that asked participants to find
specific objects and to place them in various configurations. In the control
condition participants used a conventional VR user interface to grab, carry,
and position the objects using the handheld controllers. In the experimental
condition participants used Hands-Free VR. The results confirm that: (1)
Hands-Free VR is robust to spoken English accents, as for 20 of our
participants English was not their first language, and to word phonetic
similarity, correctly transcribing the voice command 96.71% of the time; (2)
Hands-Free VR is robust to natural language diversity, correctly mapping the
transcribed command to an executable command 97.83% of the time; (3)
Hands-Free VR had a significant efficiency advantage over the conventional VR
interface in terms of task completion time, total viewpoint translation, total
view direction rotation, and total left and right hand translations; (4)
Hands-Free VR received high user preference ratings in terms of ease of use,
intuitiveness, ergonomics, reliability, and desirability.
[COMMENTS]
The first two authors contributed equally. Accepted VISIGRAPP@HUCAPP 2025
[LINK]
http://arxiv.org/abs/2402.15083v2
[DATE]
2024-12-19 07:11:48+08:00
[CATEGORIES]
cs.CL
Markovian Transformers for Informative Language Modeling
[AUTHORS]
Scott Viteri, Max Lamparth, Peter Chatain, Clark Barrett
[ABSTRACT]
Chain-of-Thought (CoT) reasoning holds great promise for explaining language
model outputs, but recent studies have highlighted significant challenges in
its practical application for interpretability. We propose to address this
issue by making CoT causally essential to prediction through two key
components: factoring next-token prediction through intermediate CoT text, and
training CoT to predict future tokens independently of other context. This
results in “Markovian” language models, where CoT serves as a fixed-size state
for future token prediction. Our approach optimizes for “informativeness” - the
improvement in next-token predictions using a trained CoT compared to a
baseline. Using Proximal Policy Optimization (PPO) for arithmetic problems and
policy gradient for GSM8K, we demonstrate effectiveness on both arithmetic
problems with Mistral 7B and the GSM8K benchmark with Llama 3.1 8B, where the
model learns to produce CoTs that are 33.20% more effective at predicting
answers than the pre-trained baseline. The increased sensitivity of model
performance to CoT perturbations provides strong evidence of CoT reliance.
Furthermore, we show that CoTs trained for one model generalize to help other
models predict answers, suggesting these CoTs capture reasoning patterns that
transfer across different interpreters. This work advances the development of
more interpretable language models, potentially enabling their extension to
arbitrarily long contexts and enhancing AI reasoning capabilities across
various domains.
[COMMENTS]
20 pages, 8 figures
[LINK]
http://arxiv.org/abs/2404.18988v4
[DATE]
2024-12-19 06:26:15+08:00
[CATEGORIES]
cs.CL
ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling
[AUTHORS]
William Han, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
[ABSTRACT]
Large Language Models (LLMs) have shown remarkable adaptability across
domains beyond text, specifically electrocardiograms (ECGs). More specifically,
there is a growing body of work exploring the task of generating text from a
multi-channeled ECG and corresponding textual prompt. Current approaches
typically involve pretraining an ECG-specific encoder with a self-supervised
learning (SSL) objective and using the features output by the pretrained
encoder to finetune an LLM for natural language generation (NLG). However, these
methods are limited by 1) inefficiency from two-stage training and 2)
interpretability challenges with encoder-generated features. To address these
limitations, we introduce ECG-Byte, an adapted byte pair encoding (BPE)
tokenizer pipeline for autoregressive language modeling of ECGs. This approach
compresses and encodes ECG signals into tokens, enabling end-to-end LLM
training by combining ECG and text tokens directly, while being much more
interpretable since the ECG tokens can be directly mapped back to the original
signal. Using ECG-Byte, we achieve competitive performance in NLG tasks in only
half the time and ~48% of the data required by two-stage approaches.
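The idea of applying byte pair encoding to a signal can be sketched as follows. This is a minimal illustration, not the ECG-Byte pipeline itself: the uniform quantisation step, the number of levels, the amplitude range, and the helper names quantize, learn_bpe, and decode are all assumptions made for this example.

```python
from collections import Counter

def quantize(signal, n_levels=8, lo=-1.0, hi=1.0):
    """Map each sample to an integer symbol in [0, n_levels) (assumed scheme)."""
    out = []
    for x in signal:
        x = min(max(x, lo), hi)                      # clip to the assumed range
        level = int((x - lo) / (hi - lo) * n_levels)
        out.append(min(level, n_levels - 1))         # keep x == hi in range
    return out

def learn_bpe(symbols, n_merges):
    """Greedily merge the most frequent adjacent token pair, n_merges times."""
    seq = [(s,) for s in symbols]    # each token is a tuple of base symbols
    merges = []
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:                # nothing repeats, so stop early
            break
        merges.append((left, right))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                merged.append(left + right)   # concatenate the two tuples
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

def decode(tokens):
    """Flatten tokens back into the quantised symbol sequence."""
    return [s for tok in tokens for s in tok]

symbols = [1, 2, 1, 2, 1, 2, 3]
merges, tokens = learn_bpe(symbols, n_merges=2)
```

Because every token is a tuple of base symbols, decoding is a simple flattening step, which mirrors the abstract's point that tokens can be mapped back to the signal (here, only up to the quantisation applied first).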
[COMMENTS]
26 pages, 17 figures
[LINK]
http://arxiv.org/abs/2412.14373v1
[DATE]
2024-12-19 06:13:21+08:00
[CATEGORIES]
cs.CL
ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals
[AUTHORS]
Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang
[ABSTRACT]
Post-training quantization (PTQ) of large language models (LLMs) holds
promise for reducing the prohibitive computational cost at inference time.
Quantization of all weight, activation and key-value (KV) cache tensors to
4-bit without significantly degrading generalizability is challenging, due to
the high quantization error caused by extreme outliers in activations. To
tackle this problem, we propose ResQ, a PTQ method that further advances the
state of the art. By means of principal component analysis (PCA), it identifies
a low-rank subspace (in practice 1/8 of the hidden dimension) in which
activation variances are highest, and keeps the coefficients within this
subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit.
Within each subspace, invariant random rotation is applied to further suppress
outliers. We show that this is a provably optimal mixed precision quantization
scheme that minimizes error. With the Llama families of models, we demonstrate
that ResQ outperforms recent uniform and mixed precision PTQ methods on a
variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than
the next best method SpinQuant, and a 2.4x speedup over 16-bit baseline. Code
is available at https://github.com/utkarsh-dmx/project-resq.
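The core idea described above, keeping a high-variance PCA subspace at higher precision and quantizing the rest more coarsely, can be sketched in NumPy. This is a simplified illustration and not the ResQ implementation: it omits the random rotations, uses a basic symmetric per-tensor quantizer, and the helper names fake_quantize and mixed_precision_project are assumptions made for this example.

```python
import numpy as np

def fake_quantize(x, bits):
    """Symmetric per-tensor uniform quantisation followed by dequantisation."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    if scale == 0:
        return x.copy()
    return np.round(x / scale).clip(-qmax, qmax) * scale

def mixed_precision_project(acts, high_frac=1 / 8, high_bits=8, low_bits=4):
    """Quantise PCA coefficients with more bits for the top-variance directions.

    acts : array of shape (n_samples, hidden_dim) holding calibration activations
    """
    d = acts.shape[1]
    r = max(1, int(d * high_frac))            # size of the high-precision subspace
    cov = np.cov(acts, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    basis = eigvecs[:, ::-1]                  # reorder to descending variance
    coeffs = acts @ basis                     # orthogonal change of basis
    high = fake_quantize(coeffs[:, :r], high_bits)
    low = fake_quantize(coeffs[:, r:], low_bits)
    return np.concatenate([high, low], axis=1) @ basis.T

# Toy usage: 16-dimensional activations with one large-variance channel.
rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 16))
acts[:, 0] *= 10.0
approx = mixed_precision_project(acts)
```

Since the basis is orthogonal, the squared reconstruction error equals the squared quantisation error of the coefficients, so spending extra bits on the highest-variance directions is where the added precision has the most effect in this toy setting.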
[COMMENTS]
14 pages, 6 figures, 6 tables
[LINK]
http://arxiv.org/abs/2412.14363v1
[DATE]
2024-12-19 06:01:55+08:00
[CATEGORIES]
cs.LG
cs.CL
Evidence Contextualization and Counterfactual Attribution for Conversational QA over Heterogeneous Data with RAG Systems
[AUTHORS]
Rishiraj Saha Roy, Joel Schlotthauer, Chris Hinze, Andreas Foltyn, Luzian Hahn, Fabian Kuech
[ABSTRACT]
Retrieval Augmented Generation (RAG) works as a backbone for interacting with
an enterprise’s own data via Conversational Question Answering (ConvQA). In a
RAG system, a retriever fetches passages from a collection in response to a
question, which are then included in the prompt of a large language model (LLM)
for generating a natural language (NL) answer. However, several RAG systems
today suffer from two shortcomings: (i) retrieved passages usually contain
their raw text and lack appropriate document context, negatively impacting both
retrieval and answering quality; and (ii) attribution strategies that explain
answer generation usually rely only on similarity between the answer and the
retrieved passages, thereby only generating plausible but not causal
explanations. In this work, we demonstrate RAGONITE, a RAG system that remedies
the above concerns by: (i) contextualizing evidence with source metadata and
surrounding text; and (ii) computing counterfactual attribution, a causal
explanation approach where the contribution of an evidence to an answer is
determined by the similarity of the original response to the answer obtained by
removing that evidence. To evaluate our proposals, we release a new benchmark
ConfQuestions, with 300 hand-created conversational questions, each in English
and German, coupled with ground truth URLs, completed questions, and answers
from 215 public Confluence pages, that are typical of enterprise wiki spaces
with heterogeneous elements. Experiments with RAGONITE on ConfQuestions show
the viability of our ideas: contextualization improves RAG performance, and
counterfactual attribution is effective at explaining RAG answers.
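The counterfactual attribution idea described above can be sketched as a leave-one-out loop. This is a minimal illustration under stated assumptions, not the RAGONITE implementation: the generate callable stands in for the LLM call, token-level Jaccard overlap stands in for whatever similarity measure the system uses, and scoring each evidence as one minus that similarity is a choice made for this example.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def counterfactual_attribution(question, evidences, generate, similarity=jaccard):
    """Score each evidence by how much the answer changes when it is removed."""
    original = generate(question, evidences)
    scores = []
    for i in range(len(evidences)):
        reduced = evidences[:i] + evidences[i + 1:]       # drop evidence i
        counterfactual = generate(question, reduced)
        scores.append(1.0 - similarity(original, counterfactual))
    return original, scores

def toy_generate(question, evidences):
    """Hypothetical stand-in for an LLM: echo evidences that mention 'vpn'."""
    hits = [e for e in evidences if "vpn" in e.lower()]
    return " ".join(hits) if hits else "unknown"

evidences = ["the vpn port is 443", "lunch is at noon"]
answer, scores = counterfactual_attribution("Which port does the VPN use?",
                                            evidences, toy_generate)
```

An evidence whose removal leaves the answer unchanged receives a score of 0, while one whose removal changes the answer completely receives a score of 1. This procedure needs one extra generation call per evidence, which is the main cost of the approach.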
[COMMENTS]
Accepted at WSDM 2025
[LINK]
http://arxiv.org/abs/2412.10571v2
[DATE]
2024-12-19 06:01:52+08:00
[CATEGORIES]
cs.CL
State Space Models are Strong Text Rerankers
[AUTHORS]
Zhichao Xu, Jinghua Yan, Ashim Gupta, Vivek Srikumar
[ABSTRACT]
Transformers dominate NLP and IR, but their inference inefficiencies and
challenges in extrapolating to longer contexts have sparked interest in
alternative model architectures. Among these, state space models (SSMs) like
Mamba offer promising advantages, particularly $O(1)$ time complexity in
inference. Despite their potential, SSMs’ effectiveness at text reranking – a
task requiring fine-grained query-document interaction and long-context
understanding – remains underexplored.
This study benchmarks SSM-based architectures (specifically, Mamba-1 and
Mamba-2) against transformer-based models across various scales, architectures,
and pre-training objectives, focusing on performance and efficiency in text
reranking tasks. We find that (1) Mamba architectures achieve competitive text
ranking performance, comparable to transformer-based models of similar size;
(2) they are less efficient in training and inference compared to transformers
with flash attention; and (3) Mamba-2 outperforms Mamba-1 in both performance
and efficiency. These results underscore the potential of state space models as
a transformer alternative and highlight areas for improvement in future IR
applications.
[COMMENTS]
The first two authors contributed equally, order decided randomly
[LINK]
http://arxiv.org/abs/2412.14354v1
[DATE]
2024-12-19 05:42:15+08:00
[CATEGORIES]
cs.CL
A Survey on LLM Inference-Time Self-Improvement
[AUTHORS]
Xiangjue Dong, Maria Teleki, James Caverlee
[ABSTRACT]
Techniques that enhance inference through increased computation at test-time
have recently gained attention. In this survey, we investigate the current
state of LLM Inference-Time Self-Improvement from three different perspectives:
Independent Self-improvement, focusing on enhancements via decoding or sampling
methods; Context-Aware Self-Improvement, leveraging additional context or
datastore; and Model-Aided Self-Improvement, achieving improvement through
model collaboration. We provide a comprehensive review of recent relevant
studies, contribute an in-depth taxonomy, and discuss challenges and
limitations, offering insights for future research.
[COMMENTS]
The first two authors contribute equally
[LINK]
http://arxiv.org/abs/2412.14352v1
[DATE]
2024-12-19 05:37:07+08:00
[CATEGORIES]
cs.CL
Is Peer-Reviewing Worth the Effort?
[AUTHORS]
Kenneth Church, Raman Chandrasekar, John E. Ortega, Ibrahim Said Ahmad
[ABSTRACT]
How effective is peer-reviewing in identifying important papers? We treat
this question as a forecasting task. Can we predict which papers will be highly
cited in the future based on venue and “early returns” (citations soon after
publication)? We show early returns are more predictive than venue. Finally, we
end with constructive suggestions to address scaling challenges: (a) too many
submissions and (b) too few qualified reviewers.
[COMMENTS]
The 31st International Conference on Computational Linguistics
(COLING 2025)
[LINK]
http://arxiv.org/abs/2412.14351v1
[DATE]
2024-12-19 05:34:42+08:00
[CATEGORIES]
cs.CL
From Prejudice to Parity: A New Approach to Debiasing Large Language Model Word Embeddings
[AUTHORS]
Aishik Rakshit, Smriti Singh, Shuvam Keshari, Arijit Ghosh Chowdhury, Vinija Jain, Aman Chadha
[COMMENTS]
Accepted at COLING 2025
[LINK]
http://arxiv.org/abs/2402.11512v5
[DATE]
2024-12-19 05:28:54+08:00
[CATEGORIES]
cs.CL
Semantic Role Labeling of NomBank Partitives
[AUTHORS]
Adam Meyers, Advait Pravin Savant, John E. Ortega
[ABSTRACT]
This article is about Semantic Role Labeling for English partitive nouns
(5%/REL of the price/ARG1; The price/ARG1 rose 5 percent/REL) in the NomBank
annotated corpus. Several systems are described using traditional and
transformer-based machine learning, as well as ensembling. Our highest scoring
system achieves an F1 of 91.74% using “gold” parses from the Penn Treebank and
91.12% when using the Berkeley Neural parser. This research includes both
classroom and experimental settings for system development.
[COMMENTS]
SUMEval-2: The 2nd Workshop on Scaling Up Multilingual &
Multi-Cultural Evaluation at the 31st International Conference on
Computational Linguistics (COLING 2025)
[LINK]
http://arxiv.org/abs/2412.14328v1
[DATE]
2024-12-19 04:56:11+08:00
[CATEGORIES]
cs.CL
The Role of Handling Attributive Nouns in Improving Chinese-To-English Machine Translation
[AUTHORS]
Haohao Wang, Adam Meyers, John E. Ortega, Rodolfo Zevallos
[ABSTRACT]
Translating between languages with drastically different grammatical
conventions poses challenges, not just for human interpreters but also for
machine translation systems. In this work, we specifically target the
translation challenges posed by attributive nouns in Chinese, which frequently
cause ambiguities in English translation. By manually inserting the omitted
particle X (‘DE’) in news article titles from the Penn Chinese Discourse
Treebank, we developed a targeted dataset to fine-tune Hugging Face Chinese to
English translation models, specifically improving how this critical function
word is handled. This focused approach not only complements the broader
strategies suggested by previous studies but also offers a practical
enhancement by specifically addressing a common error type in Chinese-English
translation.
[COMMENTS]
18th Workshop on Building and Using Comparable Corpora (BUCC) at the
31st International Conference on Computational Linguistics (COLING 2025)
[LINK]
http://arxiv.org/abs/2412.14323v1
[DATE]
2024-12-19 04:37:52+08:00
[CATEGORIES]
cs.CL
Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs
[AUTHORS]
David Restrepo, Chenwei Wu, Zhengxu Tang, Zitao Shuai, Thao Nguyen Minh Phan, Jun-En Ding, Cong-Tinh Dao, Jack Gallifant, Robyn Gayle Dychiao, Jose Carlo Artiaga, André Hiroshi Bando, Carolina Pelegrini Barbosa Gracitelli, Vincenz Ferrer, Leo Anthony Celi, Danielle Bitterman, Michael G Morley, Luis Filipe Nakayama
[ABSTRACT]
Current ophthalmology clinical workflows are plagued by over-referrals, long
waits, and complex and heterogeneous medical records. Large language models
(LLMs) present a promising solution to automate various procedures such as
triaging, preliminary tests like visual acuity assessment, and report
summaries. However, LLMs have demonstrated significantly varied performance
across different languages in natural language question-answering tasks,
potentially exacerbating healthcare disparities in Low and Middle-Income
Countries (LMICs). This study introduces the first multilingual
ophthalmological question-answering benchmark with manually curated questions
parallel across languages, allowing for direct cross-lingual comparisons. Our
evaluation of 6 popular LLMs across 7 different languages reveals substantial
bias across different languages, highlighting risks for clinical deployment of
LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought
or Retrieval-augmented generation (RAG) by themselves fall short of closing
this performance gap, often failing to improve performance across all languages
and lacking specificity for the medical domain. To address this issue, we
propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference time
de-biasing method leveraging retrieval augmented generation and
self-verification. Our approach not only improves performance across all
languages but also significantly reduces the multilingual bias gap,
facilitating equitable LLM application across the globe.
[COMMENTS]
Accepted at the AAAI 2025 Artificial Intelligence for Social Impact
Track (AAAI-AISI 2025)
[LINK]
http://arxiv.org/abs/2412.14304v1
[DATE]
2024-12-19 04:18:03+08:00
[CATEGORIES]
cs.CL
Fake News Detection: Comparative Evaluation of BERT-like Models and Large Language Models with Generative AI-Annotated Data
[AUTHORS]
Shaina Raza, Drai Paulen-Patterson, Chen Ding
[ABSTRACT]
Fake news poses a significant threat to public opinion and social stability
in modern society. This study presents a comparative evaluation of BERT-like
encoder-only models and autoregressive decoder-only large language models
(LLMs) for fake news detection. We introduce a dataset of news articles labeled
with GPT-4 assistance (an AI-labeling method) and verified by human experts to
ensure reliability. Both BERT-like encoder-only models and LLMs were fine-tuned
on this dataset. Additionally, we developed an instruction-tuned LLM approach
with majority voting during inference for label generation. Our analysis
reveals that BERT-like models generally outperform LLMs in classification
tasks, while LLMs demonstrate superior robustness against text perturbations.
Compared to weak labels (distant supervision) data, the results show that AI
labels with human supervision achieve better classification results. This study
highlights the effectiveness of combining AI-based annotation with human
oversight and demonstrates the performance of different families of machine
learning models for fake news detection.
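The majority-voting step mentioned above can be sketched as follows. The label strings and the number of sampled generations are hypothetical, and the tie-breaking rule (first label seen wins) is an assumption rather than something the abstract specifies.

```python
from collections import Counter
from typing import List


def majority_vote(labels: List[str]) -> str:
    """Return the most frequent label among several sampled generations."""
    if not labels:
        raise ValueError("need at least one label to vote on")
    # Counter.most_common keeps first-seen order among equal counts,
    # so ties resolve to the label that appeared first.
    return Counter(labels).most_common(1)[0][0]


# Hypothetical labels produced by three sampled LLM generations.
print(majority_vote(["fake", "real", "fake"]))
```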
[COMMENTS]
Accepted in Knowledge and Information Systems Journal
[LINK]
http://arxiv.org/abs/2412.14276v1
[DATE]
2024-12-19 03:15:17+08:00
[CATEGORIES]
cs.CL
Hypothesis Generation with Large Language Models
[AUTHORS]
Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, Chenhao Tan
[ABSTRACT]
Effective generation of novel hypotheses is instrumental to scientific
progress. So far, researchers have been the main powerhouse behind hypothesis
generation by painstaking data analysis and thinking (also known as the Eureka
moment). In this paper, we examine the potential of large language models
(LLMs) to generate hypotheses. We focus on hypothesis generation based on data
(i.e., labeled examples). To enable LLMs to handle arbitrarily long contexts,
we generate initial hypotheses from a small number of examples and then update
them iteratively to improve the quality of hypotheses. Inspired by multi-armed
bandits, we design a reward function to inform the exploitation-exploration
tradeoff in the update process. Our algorithm is able to generate hypotheses
that enable much better predictive performance than few-shot prompting in
classification tasks, improving accuracy by 31.7% on a synthetic dataset and by
13.9%, 3.3%, and 24.9% on three real-world datasets. We also outperform
supervised learning by 12.8% and 11.2% on two challenging real-world datasets.
Furthermore, we find that the generated hypotheses not only corroborate
human-verified theories but also uncover new insights for the tasks.
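To make the bandit-inspired reward more concrete, here is a minimal sketch of an upper-confidence-bound (UCB) style score for a hypothesis. The abstract does not give the formula, so the exact form, the function name `ucb_reward`, and the `alpha` exploration weight are all assumptions for illustration.

```python
import math


def ucb_reward(correct: int, tried: int, step: int, alpha: float = 0.5) -> float:
    """UCB-style score: observed accuracy plus an exploration bonus.

    correct: examples the hypothesis predicted correctly so far
    tried:   examples the hypothesis has been evaluated on so far
    step:    total number of examples processed so far (>= 1)
    """
    if tried == 0:
        # An untested hypothesis gets maximal priority for exploration.
        return float("inf")
    exploitation = correct / tried
    exploration = alpha * math.sqrt(math.log(step) / tried)
    return exploitation + exploration


# Same accuracy (0.8), but the less-tested hypothesis scores higher.
print(ucb_reward(4, 5, step=100) > ucb_reward(80, 100, step=100))
```

The bonus shrinks as a hypothesis is evaluated on more examples, so well-tested hypotheses are ranked mostly by accuracy while rarely tested ones still get a chance to be updated.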
[COMMENTS]
28 pages, 6 figures, code link:
https://github.com/ChicagoHAI/hypothesis_generation. Accepted by the 1st
Workshop on NLP for Science (NLP4Science) at EMNLP 2024
[LINK]
http://arxiv.org/abs/2404.04326v3
[DATE]
2024-12-19 03:00:00+08:00
[CATEGORIES]
cs.CL
cs.LG
Learning from Massive Human Videos for Universal Humanoid Pose Control
[AUTHORS]
Jiageng Mao, Siheng Zhao, Siqi Song, Tianheng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Jitendra Malik, Vitor Guizilini, Yue Wang
[ABSTRACT]
Scalable learning of humanoid robots is crucial for their deployment in
real-world applications. While traditional approaches primarily rely on
reinforcement learning or teleoperation to achieve whole-body control, they are
often limited by the diversity of simulated environments and the high costs of
demonstration collection. In contrast, human videos are ubiquitous and present
an untapped source of semantic and motion information that could significantly
enhance the generalization capabilities of humanoid robots. This paper
introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot
poses with corresponding text-based motion descriptions, designed to leverage
this abundant data. Humanoid-X is curated through a comprehensive pipeline:
data mining from the Internet, video caption generation, motion retargeting of
humans to humanoid robots, and policy learning for real-world deployment. With
Humanoid-X, we further train a large humanoid model, UH-1, which takes text
instructions as input and outputs corresponding actions to control a humanoid
robot. Extensive simulated and real-world experiments validate that our
scalable training approach leads to superior generalization in text-based
humanoid control, marking a significant step toward adaptable, real-world-ready
humanoid robots.
[LINK]
http://arxiv.org/abs/2412.14172v1
[DATE]
2024-12-19 02:59:56+08:00
[CATEGORIES]
cs.CL
Faster Transformer Decoding: N-gram Masked Self-Attention
[AUTHORS]
Ciprian Chelba, Mia Chen, Ankur Bapna, Noam Shazeer
[LINK]
http://arxiv.org/abs/2001.04589v2
[DATE]
2024-12-19 02:59:53+08:00
[CATEGORIES]
cs.LG
cs.CL
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
[AUTHORS]
Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2412.14161v1
[DATE]
2024-12-19 02:55:40+08:00
[CATEGORIES]
cs.CL
Representative Social Choice: From Learning Theory to AI Alignment
[AUTHORS]
Tianyi Qiu
[ABSTRACT]
Social choice theory is the study of preference aggregation across a
population, used both in mechanism design for human agents and in the
democratic alignment of language models. In this study, we propose the
representative social choice framework for the modeling of democratic
representation in collective decisions, where the numbers of issues and
individuals are too large for mechanisms to consider all preferences directly.
These scenarios are widespread in real-world decision-making processes, such as
jury trials, indirect elections, legislation processes, corporate governance,
and, more recently, language model alignment. In representative social choice,
the population is represented by a finite sample of individual-issue pairs
based on which social choice decisions are made. We show that many of the
deepest questions in representative social choice can be naturally formulated
as statistical learning problems, and prove the generalization properties of
social choice mechanisms using the theory of machine learning. We further
formulate axioms for representative social choice, and prove Arrow-like
impossibility theorems with new combinatorial tools of analysis. Our framework
introduces the representative approach to social choice, opening up research
directions at the intersection of social choice, learning theory, and AI
alignment.
[COMMENTS]
Full version (20 pages). Under review. Received Best Paper Award at
NeurIPS 2024 Pluralistic Alignment Workshop
[LINK]
http://arxiv.org/abs/2410.23953v3
[DATE]
2024-12-19 02:41:48+08:00
[CATEGORIES]
cs.LG
cs.CL
GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking
[AUTHORS]
Darshan Deshpande, Selvan Sunitha Ravi, Sky CH-Wang, Bartosz Mielczarek, Anand Kannappan, Rebecca Qian
[LINK]
http://arxiv.org/abs/2412.14140v1
[DATE]
2024-12-19 02:41:12+08:00
[CATEGORIES]
cs.CL
Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
[AUTHORS]
Ido Cohen, Daniela Gottesman, Mor Geva, Raja Giryes
[ABSTRACT]
Vision-language models (VLMs) excel at extracting and reasoning about
information from images. Yet, their capacity to leverage internal knowledge
about specific entities remains underexplored. This work investigates the
disparity in model performance when answering factual questions about an entity
described in text versus depicted in an image. Our results reveal a significant
accuracy drop – averaging 19% – when the entity is presented visually instead
of textually. We hypothesize that this decline arises from limitations in how
information flows from image tokens to query tokens. We use mechanistic
interpretability tools to reveal that, although image tokens are preprocessed
by the vision encoder, meaningful information flow from these tokens occurs
only in the much deeper layers. Furthermore, critical image processing happens
in the language model’s middle layers, leaving few layers for subsequent
reasoning, highlighting a potential inefficiency in how the model utilizes its
layers for reasoning. These insights shed light on the internal mechanics of
VLMs and offer pathways for enhancing their reasoning capabilities.
[LINK]
http://arxiv.org/abs/2412.14133v1
[DATE]
2024-12-19 02:22:30+08:00
[CATEGORIES]
cs.CL
SwitchCIT: Switching for Continual Instruction Tuning
[AUTHORS]
Xinbo Wu, Max Hartman, Vidhata Arjun Jayaraman, Lav R. Varshney
[ABSTRACT]
Large language models (LLMs) and multimodal models (MMs) have exhibited
impressive capabilities in various domains, particularly in general language
understanding and visual reasoning. However, these models, trained on massive
data, may not be finely optimized for specific tasks triggered by instructions.
Continual instruction tuning is crucial to adapt a large model to evolving
tasks and domains, ensuring their effectiveness and relevance across a wide
range of applications. In the context of continual instruction tuning, where
models are sequentially trained on different tasks, catastrophic forgetting can
occur, leading to performance degradation on previously learned tasks. This
work addresses catastrophic forgetting in continual instruction learning
through a switching mechanism for routing computations to parameter-efficient
tuned models. We demonstrate the effectiveness of our method through
experiments on continual instruction tuning of different natural language
generation tasks and vision-language tasks. We also showcase the advantages of
our proposed method in terms of efficiency, scalability, portability, and
privacy preservation.
[LINK]
http://arxiv.org/abs/2407.11780v2
[DATE]
2024-12-19 02:21:53+08:00
[CATEGORIES]
cs.CL
MagicPIG: LSH Sampling for Efficient LLM Generation
[AUTHORS]
Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen
[ABSTRACT]
Large language models (LLMs) with long context windows have gained
significant attention. However, the KV cache, stored to avoid re-computation,
becomes a bottleneck. Various dynamic sparse or TopK-based attention
approximation methods have been proposed to leverage the common insight that
attention is sparse. In this paper, we first show that TopK attention itself
suffers from quality degradation in certain downstream tasks because attention
is not always as sparse as expected. Rather than selecting the keys and values
with the highest attention scores, sampling with theoretical guarantees can
provide a better estimation for attention output. To make the sampling-based
approximation practical in LLM generation, we propose MagicPIG, a heterogeneous
system based on Locality Sensitive Hashing (LSH). MagicPIG significantly
reduces the workload of attention computation while preserving high accuracy
for diverse tasks. MagicPIG stores the LSH hash tables and runs the attention
computation on the CPU, which allows it to serve longer contexts and larger
batch sizes with high approximation accuracy. MagicPIG can improve decoding
throughput by up to $5\times$ across various GPU hardware and achieve 54ms
decoding latency on a single RTX 4090 for the Llama-3.1-8B-Instruct model with a
context of 96k tokens. The code is available at
https://github.com/Infini-AI-Lab/MagicPIG.
[LINK]
http://arxiv.org/abs/2410.16179v4
[DATE]
2024-12-19 01:36:36+08:00
[CATEGORIES]
cs.CL
cs.LG
SEKE: Specialised Experts for Keyword Extraction
[AUTHORS]
Matej Martinc, Hanh Thi Hong Tran, Senja Pollak, Boshko Koloski
[ABSTRACT]
Keyword extraction involves identifying the most descriptive words in a
document, allowing automatic categorisation and summarisation of large
quantities of diverse textual data. Relying on the insight that real-world
keyword detection often requires handling of diverse content, we propose a
novel supervised keyword extraction approach based on the mixture of experts
(MoE) technique. MoE uses a learnable routing sub-network to direct information
to specialised experts, allowing them to specialise in distinct regions of the
input space. SEKE, a mixture of Specialised Experts for supervised Keyword
Extraction, uses DeBERTa as the backbone model and builds on the MoE framework,
where experts attend to each token, by integrating it with a recurrent neural
network (RNN), to allow successful extraction even on smaller corpora, where
specialisation is harder due to lack of training data. The MoE framework also
provides an insight into inner workings of individual experts, enhancing the
explainability of the approach. We benchmark SEKE on multiple English datasets,
achieving state-of-the-art performance compared to strong supervised and
unsupervised baselines. Our analysis reveals that depending on data size and
type, experts specialise in distinct syntactic and semantic components, such as
punctuation, stopwords, parts-of-speech, or named entities. Code is available
at: https://github.com/matejMartinc/SEKE_keyword_extraction
[LINK]
http://arxiv.org/abs/2412.14087v1
[DATE]
2024-12-19 01:34:32+08:00
[CATEGORIES]
cs.CL
Compositional Generalization Across Distributional Shifts with Sparse Tree Operations
[AUTHORS]
Paul Soulos, Henry Conklin, Mattia Opper, Paul Smolensky, Jianfeng Gao, Roland Fernandez
[ABSTRACT]
Neural networks continue to struggle with compositional generalization, and
this issue is exacerbated by a lack of massive pre-training. One successful
approach for developing neural systems which exhibit human-like compositional
generalization is hybrid neurosymbolic techniques. However, these
techniques run into the core issues that plague symbolic approaches to AI:
scalability and flexibility. The reason for this failure is that at their core,
hybrid neurosymbolic models perform symbolic computation and relegate the
scalable and flexible neural computation to parameterizing a symbolic system.
We investigate a unified neurosymbolic system where transformations in
the network can be interpreted simultaneously as both symbolic and neural
computation. We extend a unified neurosymbolic architecture called the
Differentiable Tree Machine in two central ways. First, we significantly
increase the model’s efficiency through the use of sparse vector
representations of symbolic structures. Second, we enable its application
beyond the restricted set of tree2tree problems to the more general class of
seq2seq problems. The improved model retains its prior generalization
capabilities and, since there is a fully neural path through the network,
avoids the pitfalls of other neurosymbolic techniques that elevate symbolic
computation over neural computation.
[COMMENTS]
NeurIPS 2024. Code available at https://github.com/psoulos/sdtm
[LINK]
http://arxiv.org/abs/2412.14076v1
[DATE]
2024-12-19 01:20:19+08:00
[CATEGORIES]
cs.CL
To Label or Not to Label: Hybrid Active Learning for Neural Machine Translation
[AUTHORS]
Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza
[ABSTRACT]
Active learning (AL) techniques reduce labeling costs for training neural
machine translation (NMT) models by selecting smaller representative subsets
from unlabeled data for annotation. Diversity sampling techniques select
heterogeneous instances, while uncertainty sampling methods select instances
with the highest model uncertainty. Both approaches have limitations:
diversity methods may extract varied but trivial examples, while uncertainty
sampling can yield repetitive, uninformative instances. To bridge this gap, we
propose Hybrid Uncertainty and Diversity Sampling (HUDS), an AL strategy for
domain adaptation in NMT that combines uncertainty and diversity for sentence
selection. HUDS computes uncertainty scores for unlabeled sentences and
subsequently stratifies them. It then clusters sentence embeddings within each
stratum and computes diversity scores by distance to the centroid. A weighted
hybrid score that combines uncertainty and diversity is then used to select the
top instances for annotation in each AL iteration. Experiments on multi-domain
German-English and French-English datasets demonstrate the better performance
of HUDS over other strong AL baselines. We analyze the sentence selection with
HUDS and show that it prioritizes diverse instances having high model
uncertainty for annotation in early AL iterations.
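The selection pipeline described above (uncertainty scoring, stratification, distance to centroid, weighted hybrid score) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: each stratum is treated as a single cluster, scores are min-max normalised, and the names `hybrid_select`, `n_strata`, and `weight` are hypothetical.

```python
import math
from typing import List, Tuple


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def normalise(xs: List[float]) -> List[float]:
    """Min-max scale to [0, 1]; constant inputs map to 0."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def hybrid_select(
    uncertainty: List[float],
    embeddings: List[List[float]],
    n_strata: int = 2,
    weight: float = 0.5,
    top_k: int = 2,
) -> Tuple[List[int], List[float]]:
    n = len(uncertainty)
    # Stratify: sort sentences by uncertainty and cut into equal-sized bands.
    order = sorted(range(n), key=lambda i: uncertainty[i])
    size = math.ceil(n / n_strata)
    strata = [order[s:s + size] for s in range(0, n, size)]
    # Diversity: distance to the stratum centroid
    # (assumption: one cluster per stratum instead of a full clustering step).
    diversity = [0.0] * n
    for stratum in strata:
        c = centroid([embeddings[i] for i in stratum])
        for i in stratum:
            diversity[i] = math.dist(embeddings[i], c)
    # Weighted hybrid of normalised uncertainty and diversity.
    u, d = normalise(uncertainty), normalise(diversity)
    hybrid = [weight * u[i] + (1 - weight) * d[i] for i in range(n)]
    ranked = sorted(range(n), key=lambda i: hybrid[i], reverse=True)
    return ranked[:top_k], hybrid


selected, hybrid = hybrid_select(
    uncertainty=[0.1, 0.2, 0.8, 0.9],
    embeddings=[[0.0, 0.0], [0.0, 2.0], [1.0, 1.0], [1.0, 1.0]],
)
print(selected, hybrid)
```

Setting `weight` to 1.0 reduces this to pure uncertainty sampling, and 0.0 to pure diversity sampling, which is one way to see what the hybrid score trades off.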
[COMMENTS]
The 31st International Conference on Computational Linguistics
(COLING 2025)
[LINK]
http://arxiv.org/abs/2403.09259v2
[DATE]
2024-12-19 01:18:12+08:00
[CATEGORIES]
cs.CL
cs.LG
Transformer Layers as Painters
[AUTHORS]
Qi Sun, Marc Pickett, Aakash Kumar Nain, Llion Jones
[ABSTRACT]
Despite their nearly universal adoption for large language models, the
internal workings of transformers are not well understood. We aim to better
understand the impact of removing or reorganizing information throughout the
layers of a pretrained transformer. Such an understanding could both yield
better usage of existing models and guide architectural improvements that
produce new variants. We present a series of empirical studies on frozen
models that show that the lower and final layers of pretrained transformers
differ from middle layers, but that middle layers have a surprising amount of
uniformity. We further show that some classes of problems have robustness to
skipping layers, running the layers in an order different from how they were
trained, or running the layers in parallel. Our observations suggest that even
frozen pretrained models may gracefully trade accuracy for latency by skipping
layers or running layers in parallel.
[COMMENTS]
13 pages total, including references and appendices
[LINK]
http://arxiv.org/abs/2407.09298v3
[DATE]
2024-12-19 01:17:44+08:00
[CATEGORIES]
cs.CL
Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters
[AUTHORS]
Daniil Gurgurov, Mareike Hartmann, Simon Ostermann
[ABSTRACT]
This paper explores the integration of graph knowledge from linguistic
ontologies into multilingual Large Language Models (LLMs) using adapters to
improve performance for low-resource languages (LRLs) in sentiment analysis
(SA) and named entity recognition (NER). Building upon successful
parameter-efficient fine-tuning techniques, such as K-ADAPTER and MAD-X, we
propose a similar approach for incorporating knowledge from multilingual
graphs, connecting concepts in various languages with each other through
linguistic relationships, into multilingual LLMs for LRLs. Specifically, we
focus on eight LRLs – Maltese, Bulgarian, Indonesian, Nepali, Javanese,
Uyghur, Tibetan, and Sinhala – and employ language-specific adapters
fine-tuned on data extracted from the language-specific section of ConceptNet,
aiming to enable knowledge transfer across the languages covered by the
knowledge graph. We compare various fine-tuning objectives, including standard
Masked Language Modeling (MLM), MLM with full-word masking, and MLM with
targeted masking, to analyse their effectiveness in learning and integrating
the extracted graph data. Through empirical evaluation on language-specific
tasks, we assess how structured graph knowledge affects the performance of
multilingual LLMs for LRLs in SA and NER, providing insights into the potential
benefits of adapting language models for low-resource scenarios.
[COMMENTS]
9 pages, KaLLM workshop
[LINK]
http://arxiv.org/abs/2407.01406v3
[DATE]
2024-12-19 01:09:31+08:00
[CATEGORIES]
cs.CL
Digestion Algorithm in Hierarchical Symbolic Forests: A Fast Text Normalization Algorithm and Semantic Parsing Framework for Specific Scenarios and Lightweight Deployment
[AUTHORS]
Kevin You
[ABSTRACT]
Text Normalization and Semantic Parsing have numerous applications in natural
language processing, such as natural language programming, paraphrasing, data
augmentation, constructing expert systems, text matching, and more. Despite the
prominent achievements of deep learning in Large Language Models (LLMs), the
interpretability of neural network architectures is still poor, which affects
their credibility and hence limits their deployment in risk-sensitive scenarios.
In certain scenario-specific domains with scarce data, rapidly obtaining a
large number of supervised learning labels is challenging, and the workload of
manually labeling data would be enormous. Catastrophic forgetting in neural
networks further leads to low data utilization rates. In situations where swift
responses are vital, the density of the model makes local deployment difficult
and the response time long, which is not conducive to local applications in
these fields. Inspired by the multiplication rule, a principle of combinatorial
mathematics, and human thinking patterns, a multilayer framework along with its
algorithm, the Digestion Algorithm in Hierarchical Symbolic Forests (DAHSF), is
proposed to address the above issues, combining text normalization and
semantic parsing workflows. The Chinese Scripting Language “Fire Bunny
Intelligent Development Platform V2.0” is an important test and application of
the technology discussed in this paper. DAHSF can run locally in
scenario-specific domains on small datasets, with model size and memory usage
optimized by at least two orders of magnitude, thus improving the execution
speed, and possessing a promising optimization outlook.
[COMMENTS]
8 pages, 3 figures, 1 table
[LINK]
http://arxiv.org/abs/2412.14054v1
[DATE]
2024-12-19 01:05:49+08:00
[CATEGORIES]
cs.CL
Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation
[AUTHORS]
Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
[ABSTRACT]
Recent generative large language models (LLMs) show remarkable performance in
non-English languages, but when prompted in those languages they tend to
express higher harmful social biases and toxicity levels. Prior work has shown
that finetuning on specialized datasets can mitigate this behavior, and doing
so in English can transfer to other languages. In this work, we investigate the
impact of different finetuning methods on the model’s bias and toxicity, but
also on its ability to produce fluent and diverse text. Our results show that
finetuning on curated non-harmful text is more effective for mitigating bias,
and finetuning on direct preference optimization (DPO) datasets is more
effective for mitigating toxicity. The mitigation caused by applying these
methods in English also transfers to non-English languages. We find evidence
that the extent to which transfer takes place can be predicted by the amount of
data in a given language present in the model’s pretraining data. However, this
transfer of bias and toxicity mitigation often comes at the expense of
decreased language generation ability in non-English languages, highlighting
the importance of developing language-specific bias and toxicity mitigation
methods.
[LINK]
http://arxiv.org/abs/2412.14050v1
[DATE]
2024-12-19 01:05:08+08:00
[CATEGORIES]
cs.CL
XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser
[AUTHORS]
Xianfu Cheng, Hang Zhang, Jian Yang, Xiang Li, Weixiao Zhou, Fei Liu, Kui Wu, Xiangyuan Guan, Tao Sun, Xianjie Wu, Tongliang Li, Zhoujun Li
[ABSTRACT]
In the domain of Document AI, parsing semi-structured image forms is a crucial
Key Information Extraction (KIE) task. The advent of pre-trained multimodal
models significantly empowers Document AI frameworks to extract key information
from form documents in different formats such as PDF, Word, and images.
Nonetheless, form parsing is still encumbered by notable challenges like subpar
capabilities in multilingual parsing and diminished recall in industrial
contexts with rich text and rich visuals. In this work, we introduce a simple but
effective Multimodal and Multilingual semi-structured
FORM PARSER (XFormParser), which is anchored on a
comprehensive Transformer-based pre-trained language model and innovatively
amalgamates semantic entity recognition (SER) and relation extraction (RE) into
a unified framework. Combined with Bi-LSTM, the performance of multilingual
parsing is significantly improved. Furthermore, we develop InDFormSFT, a
pioneering supervised fine-tuning (SFT) industrial dataset that specifically
addresses the parsing needs of forms in various industrial contexts.
XFormParser has demonstrated its unparalleled effectiveness and robustness
through rigorous testing on established benchmarks. Compared to existing
state-of-the-art (SOTA) models, XFormParser notably achieves up to 1.79% F1
score improvement on RE tasks in language-specific settings. It also exhibits
exceptional cross-task performance improvements in multilingual and zero-shot
settings. The codes, datasets, and pre-trained models are publicly available at
https://github.com/zhbuaa0/xformparser.
[COMMENTS]
15 pages, 8 figures, 8 tables
[LINK]
http://arxiv.org/abs/2405.17336v2
[DATE]
2024-12-19 00:55:15+08:00
[CATEGORIES]
cs.CL
Hansel: Output Length Controlling Framework for Large Language Models
[AUTHORS]
Seoha Song, Junhyun Lee, Hyeonmok Ko
[ABSTRACT]
Despite the great success of large language models (LLMs), efficiently
controlling the length of the output sequence still remains a challenge. In
this paper, we propose Hansel, an efficient framework for length control in
LLMs without affecting their generation ability. Hansel utilizes periodically
outputted hidden special tokens to keep track of the remaining target length of
the output sequence. Together with techniques to avoid abrupt termination of
the output, this seemingly simple method proved to be efficient and versatile,
while not harming the coherency and fluency of the generated text. The
framework can be applied to any pre-trained LLM during the finetuning stage of
the model, regardless of its original positional encoding method. We
demonstrate this by finetuning four different LLMs with Hansel and show that
the mean absolute error of the output sequence decreases significantly in every
model and dataset compared to the prompt-based length control finetuning.
Moreover, the framework showed a substantially improved ability to extrapolate
to target lengths unseen during finetuning, such as long dialog responses or
extremely short summaries. This indicates that the model learns the general
means of length control, rather than learning to match output lengths to those
seen during training.
[COMMENTS]
13 pages, 6 figures; accepted to AAAI-25
[LINK]
http://arxiv.org/abs/2412.14033v1
[DATE]
2024-12-19 00:52:38+08:00
[CATEGORIES]
cs.CL
cs.LG
TEncDM: Understanding the Properties of the Diffusion Model in the Space of Language Model Encodings
[AUTHORS]
Alexander Shabalin, Viacheslav Meshchaninov, Egor Chimbulatov, Vladislav Lapikov, Roman Kim, Grigory Bartosh, Dmitry Molchanov, Sergey Markov, Dmitry Vetrov
[ABSTRACT]
This paper presents the Text Encoding Diffusion Model (TEncDM), a novel
approach to diffusion modeling that operates in the space of pre-trained
language model encodings. In contrast to traditionally used embeddings,
encodings integrate contextual information. In our approach, we also employ a
transformer-based decoder, specifically designed to incorporate context in the
token prediction process. We conduct a comprehensive examination of the
influence of the encoder, decoder, noise scheduler, and self-conditioning on
zero-shot generation. Furthermore, we compare TEncDM with previous approaches
on three conditional text generation tasks: QQP, XSum, and Wiki-Auto. The
results show that TEncDM exhibits superior performance compared to existing
non-autoregressive diffusion models. Our code is available at
https://github.com/M0RJIQUE/tencdm.
[COMMENTS]
15 pages, 13 figures
[LINK]
http://arxiv.org/abs/2402.19097v3
[DATE]
2024-12-19 00:30:58+08:00
[CATEGORIES]
cs.CL
MetaphorShare: A Dynamic Collaborative Repository of Open Metaphor Datasets
[AUTHORS]
Joanne Boisson, Arif Mehmood, Jose Camacho-Collados
[LINK]
http://arxiv.org/abs/2411.18260v2
[DATE]
2024-12-19 00:28:19+08:00
[CATEGORIES]
cs.CL
Cognition Chain for Explainable Psychological Stress Detection on Social Media
[AUTHORS]
Xin Wang, Boyan Gao, Yi Dai, Lei Cao, Liang Zhao, Yibo Yang, David Clifton
[ABSTRACT]
Stress is a pervasive global health issue that can lead to severe mental
health problems. Early detection offers timely intervention and prevention of
stress-related disorders. Current early detection models perform “black
box” inference and suffer from limited explainability and trust, which blocks
real-world clinical application. Thanks to the generative properties introduced
by the Large Language Models (LLMs), the decision and the prediction from such
models are semi-interpretable through the corresponding description. However,
the existing LLMs are mostly trained for general purposes without the guidance
of psychological cognitive theory. To this end, we first highlight the
importance of prior theory with the observation of performance boosted by the
chain-of-thoughts tailored for stress detection. This method termed Cognition
Chain explicates the generation of stress through a step-by-step cognitive
perspective based on cognitive appraisal theory with a progress pipeline:
Stimulus $\rightarrow$ Evaluation $\rightarrow$ Reaction $\rightarrow$ Stress
State, guiding LLMs to provide comprehensive reasoning explanations. We further
study the benefits brought by the proposed Cognition Chain format by utilising
it as a synthetic dataset generation template for LLMs instruction-tuning and
introduce CogInstruct, an instruction-tuning dataset for stress detection. This
dataset is developed using a three-stage self-reflective annotation pipeline
that enables LLMs to autonomously generate and refine instructional data. By
instruction-tuning Llama3 with CogInstruct, we develop CogLLM, an explainable
stress detection model. Evaluations demonstrate that CogLLM achieves
outstanding performance while enhancing explainability. Our work contributes a
novel approach by integrating cognitive theories into LLM reasoning processes,
offering a promising direction for future explainable AI research.
[LINK]
http://arxiv.org/abs/2412.14009v1
[DATE]
2024-12-19 00:26:47+08:00
[CATEGORIES]
cs.CL
FarExStance: Explainable Stance Detection for Farsi
[AUTHORS]
Majid Zarharan, Maryam Hashemi, Malika Behroozrazegh, Sauleh Eetemadi, Mohammad Taher Pilehvar, Jennifer Foster
[COMMENTS]
Accepted in COLING 2025
[LINK]
http://arxiv.org/abs/2412.14008v1
[DATE]
2024-12-19 00:24:20+08:00
[CATEGORIES]
cs.CL
Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning
[AUTHORS]
Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Alejandro J. Ruiz, Calla Beauregard, Ashley Fehr, Mikaela Irene Fudolig, Bradford Demarest, Yoshi Meke Bird, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds
[ABSTRACT]
Tokenization is a necessary component within the current architecture of many
language models, including the transformer-based large language models (LLMs)
of Generative AI, yet its impact on the model’s cognition is often overlooked.
We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is
sufficient for reasonably human-like language performance, and that the
emergence of human-meaningful linguistic units among tokens motivates
linguistically-informed interventions in existing, linguistically-agnostic
tokenization techniques, particularly with respect to their roles as (1)
semantic primitives and as (2) vehicles for conveying salient distributional
patterns from human language to the model. We explore tokenizations from a BPE
tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken;
and the information in exemplar token vectors as they move through the layers
of a RoBERTa (large) model. Besides creating sub-optimal semantic building
blocks and obscuring the model’s access to the necessary distributional
patterns, we describe how tokenization pretraining can be a backdoor for bias
and other unwanted content, which current alignment practices may not
remediate. Additionally, we relay evidence that the tokenization algorithm’s
objective function impacts the LLM’s cognition, despite being meaningfully
insulated from the main system intelligence.
[LINK]
http://arxiv.org/abs/2412.10924v2
[DATE]
2024-12-19 00:16:04+08:00
[CATEGORIES]
cs.CL
What makes a good metric? Evaluating automatic metrics for text-to-image consistency
[AUTHORS]
Candace Ross, Melissa Hall, Adriana Romero Soriano, Adina Williams
[COMMENTS]
Accepted and presented at COLM 2024
[LINK]
http://arxiv.org/abs/2412.13989v1
[DATE]
2024-12-19 00:09:42+08:00
[CATEGORIES]
cs.CL
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
[AUTHORS]
Hayder Elesedy, Pedro M. Esperança, Silviu Vlad Oprea, Mete Ozay
[ABSTRACT]
Guardrails have emerged as an alternative to safety alignment for content
moderation of large language models (LLMs). Existing model-based guardrails
have not been designed for resource-constrained portable devices,
such as mobile phones, more and more of which are running LLM-based
applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail
adaptation method that relies on knowledge sharing between LLMs and guardrail
models. LoRA-Guard extracts language features from the LLMs and adapts them for
the content moderation task using low-rank adapters, while a dual-path design
prevents any performance degradation on the generative task. We show that
LoRA-Guard outperforms existing approaches with 100-1000x lower parameter
overhead while maintaining accuracy, enabling on-device content moderation.
[LINK]
http://arxiv.org/abs/2407.02987v2
[DATE]
2024-12-19 00:07:28+08:00
[CATEGORIES]
cs.LG
cs.CL
Optimized Gradient Clipping for Noisy Label Learning
[AUTHORS]
Xichen Ye, Yifan Wu, Weizhong Zhang, Xiaoqiang Li, Yifan Chen, Cheng Jin
[ABSTRACT]
Previous research has shown that constraining the gradient of loss function
with respect to model-predicted probabilities can enhance the model robustness
against noisy labels. These methods typically specify a fixed optimal threshold
for gradient clipping through validation data to obtain the desired robustness
against noise. However, this common practice overlooks the dynamic distribution
of gradients from both clean and noisy-labeled samples at different stages of
training, significantly limiting the model capability to adapt to the variable
nature of gradients throughout the training process. To address this issue, we
propose a simple yet effective approach called Optimized Gradient Clipping
(OGC), which dynamically adjusts the clipping threshold based on the ratio of
noise gradients to clean gradients after clipping, estimated by modeling the
distributions of clean and noisy samples. This approach allows us to modify the
clipping threshold at each training step, effectively controlling the influence
of noise gradients. Additionally, we provide statistical analysis to certify
the noise-tolerance ability of OGC. Our extensive experiments across various
types of label noise, including symmetric, asymmetric, instance-dependent, and
real-world noise, demonstrate the effectiveness of our approach.
[COMMENTS]
Accepted by AAAI2025
[LINK]
http://arxiv.org/abs/2412.08941v3
[DATE]
2024-12-19 23:59:19+08:00
[CATEGORIES]
cs.LG
IDOL: Instant Photorealistic 3D Human Creation from a Single Image
[AUTHORS]
Yiyu Zhuang, Jiaxi Lv, Hao Wen, Qing Shuai, Ailing Zeng, Hao Zhu, Shifeng Chen, Yujiu Yang, Xun Cao, Wei Liu
[ABSTRACT]
Creating a high-fidelity, animatable 3D full-body avatar from a single image
is a challenging task due to the diverse appearance and poses of humans and the
limited availability of high-quality training data. To achieve fast and
high-quality human reconstruction, this work rethinks the task from the
perspectives of dataset, model, and representation. First, we introduce a
large-scale HUman-centric GEnerated dataset, HuGe100K, consisting of 100K
diverse, photorealistic sets of human images. Each set contains 24-view frames
in specific human poses, generated using a pose-controllable
image-to-multi-view model. Next, leveraging the diversity in views, poses, and
appearances within HuGe100K, we develop a scalable feed-forward transformer
model to predict a 3D human Gaussian representation in a uniform space from a
given human image. This model is trained to disentangle human pose, body shape,
clothing geometry, and texture. The estimated Gaussians can be animated without
post-processing. We conduct comprehensive experiments to validate the
effectiveness of the proposed dataset and method. Our model demonstrates the
ability to efficiently reconstruct photorealistic humans at 1K resolution from
a single input image using a single GPU instantly. Additionally, it seamlessly
supports various applications, as well as shape and texture editing tasks.
[COMMENTS]
21 pages, 15 figures, includes main content, supplementary materials,
and references
[LINK]
http://arxiv.org/abs/2412.14963v1
[DATE]
2024-12-19 23:43:05+08:00
[CATEGORIES]
cs.LG
Samudra: An AI Global Ocean Emulator for Climate
[AUTHORS]
Surya Dheeshjith, Adam Subel, Alistair Adcroft, Julius Busecke, Carlos Fernandez-Granda, Shubham Gupta, Laure Zanna
[ABSTRACT]
AI emulators for forecasting have emerged as powerful tools that can
outperform conventional numerical predictions. The next frontier is to build
emulators for long climate simulations with skill across a range of
spatiotemporal scales, a particularly important goal for the ocean. Our work
builds a skillful global emulator of the ocean component of a state-of-the-art
climate model. We emulate key ocean variables, sea surface height, horizontal
velocities, temperature, and salinity, across their full depth. We use a
modified ConvNeXt UNet architecture trained on multidepth levels of ocean data.
We show that the ocean emulator - Samudra - which exhibits no drift relative to
the truth, can reproduce the depth structure of ocean variables and their
interannual variability. Samudra is stable for centuries and 150 times faster
than the original ocean model. Samudra struggles to capture the correct
magnitude of the forcing trends and simultaneously remain stable, requiring
further work.
[LINK]
http://arxiv.org/abs/2412.03795v2
[DATE]
2024-12-19 23:43:00+08:00
[CATEGORIES]
cs.LG
Corn Ear Detection and Orientation Estimation Using Deep Learning
[AUTHORS]
Nathan Sprague, John Evans, Michael Mardikes
[ABSTRACT]
Monitoring growth behavior of maize plants such as the development of ears
can give key insights into the plant’s health and development. Traditionally,
the measurement of the angle of ears is performed manually, which can be
time-consuming and prone to human error. To address these challenges, this
paper presents a computer vision-based system for detecting and tracking ears
of corn in an image sequence. The proposed system could accurately detect,
track, and predict the ear’s orientation, which can be useful in monitoring
their growth behavior. This can significantly save time compared to manual
measurement and enables additional areas of ear orientation research and
potential increase in efficiencies for maize production. Using an object
detector with keypoint detection, the algorithm proposed could detect 90
percent of all ears. The cardinal estimation had a mean absolute error (MAE) of
18 degrees, compared to a mean 15 degree difference between two people
measuring by hand. These results demonstrate the feasibility of using computer
vision techniques for monitoring maize growth and can lead to further research
in this area.
[COMMENTS]
22 pages; 15 figures
[LINK]
http://arxiv.org/abs/2412.14954v1
[DATE]
2024-12-19 23:36:30+08:00
[CATEGORIES]
cs.LG
SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization
[AUTHORS]
Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen
[ABSTRACT]
Although quantization for linear layers has been widely used, its application
to accelerate the attention process remains limited. To further enhance the
efficiency of attention computation compared to SageAttention while maintaining
precision, we propose SageAttention2, which utilizes significantly faster 4-bit
matrix multiplication (Matmul) alongside additional precision-enhancing
techniques. First, we propose to quantize matrices $(Q, K)$ to INT4 at a
hardware-friendly thread-level granularity and quantize matrices $(\widetilde
P, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the
accuracy of INT4 $QK$. Third, we propose to use an FP32 Matmul buffer for $PV$
to enhance the accuracy of FP8 $\widetilde PV$. The operations per second (OPS)
of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 5x on
RTX4090, respectively. Comprehensive experiments confirm that our approach
incurs negligible end-to-end metrics loss across diverse models, including
those for large language processing, image generation, and video generation.
The codes are available at https://github.com/thu-ml/SageAttention.
[LINK]
http://arxiv.org/abs/2411.10958v2
[DATE]
2024-12-19 23:26:20+08:00
[CATEGORIES]
cs.LG
Scaling Laws for Imitation Learning in Single-Agent Games
[AUTHORS]
Jens Tuyls, Dhruv Madeka, Kari Torkkola, Dean Foster, Karthik Narasimhan, Sham Kakade
[ABSTRACT]
Imitation Learning (IL) is one of the most widely used methods in machine
learning. Yet, many works find it is often unable to fully recover the
underlying expert behavior, even in constrained environments like single-agent
games. However, none of these works deeply investigate the role of scaling up
the model and data size. Inspired by recent work in Natural Language Processing
(NLP) where “scaling up” has resulted in increasingly more capable LLMs, we
investigate whether carefully scaling up model and data size can bring similar
improvements in the imitation learning setting for single-agent games. We first
demonstrate our findings on a variety of Atari games, and thereafter focus on
the extremely challenging game of NetHack. In all games, we find that IL loss
and mean return scale smoothly with the compute budget (FLOPs) and are strongly
correlated, resulting in power laws for training compute-optimal IL agents.
Finally, we forecast and train several NetHack agents with IL and find they
outperform prior state-of-the-art by 1.5x in all settings. Our work both
demonstrates the scaling behavior of imitation learning in a variety of
single-agent games, as well as the viability of scaling up current approaches
for increasingly capable agents in NetHack, a game that remains elusively hard
for current AI systems.
[COMMENTS]
Accepted at TMLR 2024
[LINK]
http://arxiv.org/abs/2307.09423v3
[DATE]
2024-12-19 23:10:18+08:00
[CATEGORIES]
cs.LG
From Point to probabilistic gradient boosting for claim frequency and severity prediction
[AUTHORS]
Dominik Chevalier, Marie-Pier Côté
[ABSTRACT]
Gradient boosting for decision tree algorithms are increasingly used in
actuarial applications as they show superior predictive performance over
traditional generalized linear models. Many improvements and sophistications to
the first gradient boosting machine algorithm exist. We present in a unified
notation, and contrast, all the existing point and probabilistic gradient
boosting for decision tree algorithms: GBM, XGBoost, DART, LightGBM, CatBoost,
EGBM, PGBM, XGBoostLSS, cyclic GBM, and NGBoost. In this comprehensive
numerical study, we compare their performance on five publicly available
datasets for claim frequency and severity, of various sizes and comprising
different numbers of (high cardinality) categorical variables. We explain how
varying exposure-to-risk can be handled with boosting in frequency models. We
compare the algorithms on the basis of computational efficiency, predictive
performance, and model adequacy. LightGBM and XGBoostLSS win in terms of
computational efficiency. The fully interpretable EGBM achieves competitive
predictive performance compared to the black box algorithms considered. We find
that there is no trade-off between model adequacy and predictive accuracy: both
are achievable simultaneously.
[COMMENTS]
26 pages, 4 figures, 26 tables, 7 algorithms
[LINK]
http://arxiv.org/abs/2412.14916v1
[DATE]
2024-12-19 22:50:10+08:00
[CATEGORIES]
cs.LG
Union-over-Intersections: Object Detection beyond Winner-Takes-All
[AUTHORS]
Aritra Bhowmik, Pascal Mettes, Martin R. Oswald, Cees G. M. Snoek
[ABSTRACT]
This paper revisits the problem of predicting box locations in object
detection architectures. Typically, each box proposal or box query aims to
directly maximize the intersection-over-union score with the ground truth,
followed by a winner-takes-all non-maximum suppression where only the highest
scoring box in each region is retained. We observe that both steps are
sub-optimal: the first involves regressing proposals to the entire ground
truth, which is a difficult task even with large receptive fields, and the
second neglects valuable information from boxes other than the top candidate.
Instead of regressing proposals to the whole ground truth, we propose a simpler
approach: regress only to the area of intersection between the proposal and the
ground truth. This avoids the need for proposals to extrapolate beyond their
visual scope, improving localization accuracy. Rather than adopting a
winner-takes-all strategy, we take the union over the regressed intersections
of all boxes in a region to generate the final box outputs. Our plug-and-play
method integrates seamlessly into proposal-based, grid-based, and query-based
detection architectures with minimal modifications, consistently improving
object localization and instance segmentation. We demonstrate its broad
applicability and versatility across various detection and segmentation tasks.
[COMMENTS]
17 pages, 6 figures, 12 tables
[LINK]
http://arxiv.org/abs/2311.18512v2
[DATE]
2024-12-19 22:46:05+08:00
[CATEGORIES]
cs.LG
Online MDP with Transition Prototypes: A Robust Adaptive Approach
[AUTHORS]
Shuo Sun, Meng Qi, Zuo-Jun Max Shen
[ABSTRACT]
In this work, we consider an online robust Markov Decision Process (MDP)
where we have the information of finitely many prototypes of the underlying
transition kernel. We consider an adaptively updated ambiguity set of the
prototypes and propose an algorithm that efficiently identifies the true
underlying transition kernel while guaranteeing the performance of the
corresponding robust policy. To be more specific, we provide a sublinear regret
of the subsequent optimal robust policy. We also provide an early stopping
mechanism and a worst-case performance bound of the value function. In
numerical experiments, we demonstrate that our method outperforms existing
approaches, particularly in the early stage with limited data. This work
contributes to robust MDPs by considering possible prior information about the
underlying transition probability and online learning, offering both
theoretical insights and practical algorithms for improved decision-making
under uncertainty.
[LINK]
http://arxiv.org/abs/2412.14075v2
[DATE]
2024-12-19 22:41:03+08:00
[CATEGORIES]
cs.LG
Do Parameters Reveal More than Loss for Membership Inference?
[AUTHORS]
Anshuman Suri, Xiao Zhang, David Evans
[ABSTRACT]
Membership inference attacks are used as a key tool for disclosure auditing.
They aim to infer whether an individual record was used to train a model. While
such evaluations are useful to demonstrate risk, they are computationally
expensive and often make strong assumptions about potential adversaries’ access
to models and training environments, and thus do not provide tight bounds on
leakage from potential attacks. We show how prior claims around black-box
access being sufficient for optimal membership inference do not hold for
stochastic gradient descent, and that optimal membership inference indeed
requires white-box access. Our theoretical results lead to a new white-box
inference attack, IHA (Inverse Hessian Attack), that explicitly uses model
parameters by taking advantage of computing inverse-Hessian vector products.
Our results show that both auditors and adversaries may be able to benefit from
access to model parameters, and we advocate for further research into white-box
methods for membership inference.
[COMMENTS]
Accepted to Transactions on Machine Learning Research (TMLR)
[LINK]
http://arxiv.org/abs/2406.11544v4
[DATE]
2024-12-19 22:33:00+08:00
[CATEGORIES]
cs.LG
Diffusion priors for Bayesian 3D reconstruction from incomplete measurements
[AUTHORS]
Julian L. Möbius, Michael Habeck
[ABSTRACT]
Many inverse problems are ill-posed and need to be complemented by prior
information that restricts the class of admissible models. Bayesian approaches
encode this information as prior distributions that impose generic properties
on the model such as sparsity, non-negativity or smoothness. However, in case
of complex structured models such as images, graphs or three-dimensional (3D)
objects, generic prior distributions tend to favor models that differ largely
from those observed in the real world. Here we explore the use of diffusion
models as priors that are combined with experimental data within a Bayesian
framework. We use 3D point clouds to represent 3D objects such as household
items or biomolecular complexes formed from proteins and nucleic acids. We
train diffusion models that generate coarse-grained 3D structures at a medium
resolution and integrate these with incomplete and noisy experimental data. To
demonstrate the power of our approach, we focus on the reconstruction of
biomolecular assemblies from cryo-electron microscopy (cryo-EM) images, which
is an important inverse problem in structural biology. We find that posterior
sampling with diffusion model priors allows for 3D reconstruction from very
sparse, low-resolution and partial observations.
[LINK]
http://arxiv.org/abs/2412.14897v1
[DATE]
2024-12-19 22:28:00+08:00
[CATEGORIES]
cs.LG
Hybridization of Persistent Homology with Neural Networks for Time-Series Prediction: A Case Study in Wave Height
[AUTHORS]
Zixin Lin, Nur Fariha Syaqina Zulkepli, Mohd Shareduwan Mohd Kasihmuddin, R. U. Gobithaasan
[ABSTRACT]
Time-series prediction is an active area of research across various fields,
often challenged by the fluctuating influence of short-term and long-term
factors. In this study, we introduce a feature engineering method that enhances
the predictive performance of neural network models. Specifically, we leverage
computational topology techniques to derive valuable topological features from
input data, boosting the predictive accuracy of our models. Our focus is on
predicting wave heights, utilizing models based on topological features within
feedforward neural networks (FNNs), recurrent neural networks (RNNs), long
short-term memory networks (LSTM), and RNNs with gated recurrent units (GRU).
For time-ahead predictions, the enhancements in $R^2$ score were significant
for FNNs, RNNs, LSTM, and GRU models. Additionally, these models also showed
significant reductions in maximum errors and mean squared errors.
[COMMENTS]
the paper contains errors
[LINK]
http://arxiv.org/abs/2409.01519v3
[DATE]
2024-12-19 22:26:22+08:00
[CATEGORIES]
cs.LG
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
[AUTHORS]
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, Oriana Riva
[LINK]
http://arxiv.org/abs/2405.14573v4
[DATE]
2024-12-19 22:19:02+08:00
[CATEGORIES]
cs.LG
Erase then Rectify: A Training-Free Parameter Editing Approach for Cost-Effective Graph Unlearning
[AUTHORS]
Zhe-Rui Yang, Jindong Han, Chang-Dong Wang, Hao Liu
[ABSTRACT]
Graph unlearning, which aims to eliminate the influence of specific nodes,
edges, or attributes from a trained Graph Neural Network (GNN), is essential in
applications where privacy, bias, or data obsolescence is a concern. However,
existing graph unlearning techniques often necessitate additional training on
the remaining data, leading to significant computational costs, particularly
with large-scale graphs. To address these challenges, we propose a two-stage
training-free approach, Erase then Rectify (ETR), designed for efficient and
scalable graph unlearning while preserving the model utility. Specifically, we
first build a theoretical foundation showing that masking parameters critical
for unlearned samples enables effective unlearning. Building on this insight,
the Erase stage strategically edits model parameters to eliminate the impact of
unlearned samples and their propagated influence on intercorrelated nodes. To
further ensure the GNN’s utility, the Rectify stage devises a gradient
approximation method to estimate the model’s gradient on the remaining dataset,
which is then used to enhance model performance. Overall, ETR achieves graph
unlearning without additional training or full training data access,
significantly reducing computational overhead and preserving data privacy.
Extensive experiments on seven public datasets demonstrate the consistent
superiority of ETR in model utility, unlearning efficiency, and unlearning
effectiveness, establishing it as a promising solution for real-world graph
unlearning challenges.
[COMMENTS]
Accepted by AAAI2025
[LINK]
http://arxiv.org/abs/2409.16684v2
[DATE]
2024-12-19 22:18:15+08:00
[CATEGORIES]
cs.LG
AI-Powered Intracranial Hemorrhage Detection: A Co-Scale Convolutional Attention Model with Uncertainty-Based Fuzzy Integral Operator and Feature Screening
[AUTHORS]
Mehdi Hosseini Chagahi, Md. Jalil Piran, Niloufar Delfan, Behzad Moshiri, Jaber Hatam Parikhan
[ABSTRACT]
Intracranial hemorrhage (ICH) refers to the leakage or accumulation of blood
within the skull, which occurs due to the rupture of blood vessels in or around
the brain. If this condition is not diagnosed in a timely manner and
appropriately treated, it can lead to serious complications such as decreased
consciousness, permanent neurological disabilities, or even death. The primary
aim of this study is to detect the occurrence or non-occurrence of ICH,
followed by determining the type of subdural hemorrhage (SDH). These tasks are
framed as two separate binary classification problems. By adding two layers to
the co-scale convolutional attention (CCA) classifier architecture, we
introduce a novel approach for ICH detection. In the first layer, after
extracting features from different slices of computed tomography (CT) scan
images, we combine these features and select the 50 components that capture the
highest variance in the data, considering them as informative features. We then
assess the discriminative power of these features using the bootstrap forest
algorithm, discarding those that lack sufficient discriminative ability between
different classes. This algorithm explicitly determines the contribution of
each feature to the final prediction, assisting us in developing an explainable
AI model. The features feed into a boosting neural network as a latent feature
space. In the second layer, we introduce a novel uncertainty-based fuzzy
integral operator to fuse information from different CT scan slices. This
operator, by accounting for the dependencies between consecutive slices,
significantly improves detection accuracy.
[LINK]
http://arxiv.org/abs/2412.14869v1
[DATE]
2024-12-19 22:06:44+08:00
[CATEGORIES]
cs.LG
Hierarchical Subspaces of Policies for Continual Offline Reinforcement Learning
[AUTHORS]
Anthony Kobanda, Rémy Portelas, Odalric-Ambrym Maillard, Ludovic Denoyer
[ABSTRACT]
In dynamic domains such as autonomous robotics and video game simulations,
agents must continuously adapt to new tasks while retaining previously acquired
skills. This ongoing process, known as Continual Reinforcement Learning,
presents significant challenges, including the risk of forgetting past
knowledge and the need for scalable solutions as the number of tasks increases.
To address these issues, we introduce HIerarchical LOW-rank Subspaces of
Policies (HILOW), a novel framework designed for continual learning in offline
navigation settings. HILOW leverages hierarchical policy subspaces to enable
flexible and efficient adaptation to new tasks while preserving existing
knowledge. We demonstrate, through a careful experimental study, the
effectiveness of our method in both classical MuJoCo maze environments and
complex video game-like simulations, showcasing competitive performance and
satisfactory adaptability according to classical continual learning metrics, in
particular regarding memory usage. Our work provides a promising framework for
real-world applications where continuous learning from pre-collected data is
essential.
[LINK]
http://arxiv.org/abs/2412.14865v1
[DATE]
2024-12-19 22:00:03+08:00
[CATEGORIES]
cs.LG
Surrogate-assisted multi-objective design of complex multibody systems
[AUTHORS]
Augustina C. Amakor, Manuel B. Berkemeier, Meike Wohlleben, Walter Sextro, Sebastian Peitz
[ABSTRACT]
The optimization of large-scale multibody systems is a numerically
challenging task, in particular when considering multiple conflicting criteria
at the same time. In this situation, we need to approximate the Pareto set of
optimal compromises, which is significantly more expensive than finding a
single optimum in single-objective optimization. To prevent large costs, the
usage of surrogate models, constructed from a small but informative number of
expensive model evaluations, is a very popular and widely studied approach. The
central challenge then is to ensure a high quality (that is, near-optimality)
of the solutions that were obtained using the surrogate model, which can be
hard to guarantee with a single pre-computed surrogate. We present a
back-and-forth approach between surrogate modeling and multi-objective
optimization to improve the quality of the obtained solutions. Using the
example of an expensive-to-evaluate multibody system, we compare different
strategies regarding multi-objective optimization, sampling and also surrogate
modeling, to identify the most promising approach in terms of computational
efficiency and solution quality.
[COMMENTS]
arXiv admin note: text overlap with arXiv:2412.01566
[LINK]
http://arxiv.org/abs/2412.14854v1
[DATE]
2024-12-19 21:48:49+08:00
[CATEGORIES]
cs.LG
ASTM: Autonomous Smart Traffic Management System Using Artificial Intelligence CNN and LSTM
[AUTHORS]
Christofel Rio Goenawan
[ABSTRACT]
In the modern world, the development of Artificial Intelligence (AI) has
contributed to improvements in various areas, including automation, computer
vision, fraud detection, and more. AI can be leveraged to enhance the
efficiency of Autonomous Smart Traffic Management (ASTM) systems and reduce
traffic congestion rates. This paper presents an Autonomous Smart Traffic
Management (STM) system that uses AI to improve traffic flow rates. The system
employs the YOLO V5 Convolutional Neural Network to detect vehicles in traffic
management images. Additionally, it predicts the number of vehicles for the
next 12 hours using a Recurrent Neural Network with Long Short-Term Memory
(RNN-LSTM). The Smart Traffic Management Cycle Length Analysis manages the
traffic cycle length based on these vehicle predictions, aided by AI. From the
results of the RNN-LSTM model for predicting vehicle numbers over the next 12
hours, we observe that the model predicts traffic with a Mean Squared Error
(MSE) of 4.521 vehicles and a Root Mean Squared Error (RMSE) of 2.232 vehicles.
After simulating the STM system in the CARLA simulation environment, we found
that the Traffic Management Congestion Flow Rate with ASTM (21 vehicles per
minute) is 50\% higher than the rate without STM (around 15 vehicles per
minute). Additionally, the Traffic Management Vehicle Pass Delay with STM (5
seconds per vehicle) is 70\% lower than without STM (around 12 seconds per
vehicle). These results demonstrate that the STM system using AI can increase
traffic flow by 50\% and reduce vehicle pass delays by 70\%.
[COMMENTS]
In process to IEEE Intelligent Vehicle Symposium 2025
[LINK]
http://arxiv.org/abs/2410.10929v6
[DATE]
2024-12-19 21:39:55+08:00
[CATEGORIES]
cs.LG
Sum of Squares Circuits
[AUTHORS]
Lorenzo Loconte, Stefan Mengel, Antonio Vergari
[ABSTRACT]
Designing expressive generative models that support exact and efficient
inference is a core question in probabilistic ML. Probabilistic circuits (PCs)
offer a framework where this tractability-vs-expressiveness trade-off can be
analyzed theoretically. Recently, squared PCs encoding subtractive mixtures via
negative parameters have emerged as tractable models that can be exponentially
more expressive than monotonic PCs, i.e., PCs with positive parameters only. In
this paper, we provide a more precise theoretical characterization of the
expressiveness relationships among these models. First, we prove that squared
PCs can be less expressive than monotonic ones. Second, we formalize a novel
class of PCs – sum of squares PCs – that can be exponentially more expressive
than both squared and monotonic PCs. Around sum of squares PCs, we build an
expressiveness hierarchy that allows us to precisely unify and separate
different tractable model classes such as Born Machines and PSD models, and
other recently introduced tractable probabilistic models by using complex
parameters. Finally, we empirically show the effectiveness of sum of squares
circuits in performing distribution estimation.
[LINK]
http://arxiv.org/abs/2408.11778v2
[DATE]
2024-12-19 21:34:56+08:00
[CATEGORIES]
cs.LG
How to Re-enable PDE Loss for Physical Systems Modeling Under Partial Observation
[AUTHORS]
Haodong Feng, Yue Wang, Dixia Fan
[ABSTRACT]
In science and engineering, machine learning techniques are increasingly
successful in physical systems modeling (predicting future states of physical
systems). Effectively integrating PDE loss as a constraint of system transition
can improve the model’s prediction by overcoming generalization issues due to
data scarcity, especially when data acquisition is costly. However, in many
real-world scenarios, due to sensor limitations, the data we can obtain is
often only partially observed, making the calculation of PDE loss seem
infeasible, as the PDE loss heavily relies on high-resolution states. We
carefully study this problem and propose a novel framework named Re-enable PDE
Loss under Partial Observation (RPLPO). The key idea is that although enabling
PDE loss to constrain system transition solely is infeasible, we can re-enable
PDE loss by reconstructing the learnable high-resolution state and constraining
system transition simultaneously. Specifically, RPLPO combines an encoding
module for reconstructing learnable high-resolution states with a transition
module for predicting future states. The two modules are jointly trained by
data and PDE loss. We conduct experiments in various physical systems to
demonstrate that RPLPO significantly improves generalization, even when
observations are sparse, irregular, and noisy, and the PDE is inaccurate.
[COMMENTS]
Accepted by AAAI2025
[LINK]
http://arxiv.org/abs/2412.09116v3
[DATE]
2024-12-19 21:27:49+08:00
[CATEGORIES]
cs.LG
Entropy Regularized Task Representation Learning for Offline Meta-Reinforcement Learning
[AUTHORS]
Mohammadreza Nakhaei, Aidan Scannell, Joni Pajarinen
[ABSTRACT]
Offline meta-reinforcement learning aims to equip agents with the ability to
rapidly adapt to new tasks by training on data from a set of different tasks.
Context-based approaches utilize a history of state-action-reward transitions
– referred to as the context – to infer representations of the current task,
and then condition the agent, i.e., the policy and value function, on the task
representations. Intuitively, the better the task representations capture the
underlying tasks, the better the agent can generalize to new tasks.
Unfortunately, context-based approaches suffer from distribution mismatch, as
the context in the offline data does not match the context at test time,
limiting their ability to generalize to the test tasks. This leads to the task
representations overfitting to the offline training data. Intuitively, the task
representations should be independent of the behavior policy used to collect
the offline data. To address this issue, we approximately minimize the mutual
information between the distribution over the task representations and behavior
policy by maximizing the entropy of behavior policy conditioned on the task
representations. We validate our approach in MuJoCo environments, showing that
compared to baselines, our task representations more faithfully represent the
underlying tasks, yielding better performance than prior methods on both
in-distribution and out-of-distribution tasks.
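
The link between mutual information and conditional entropy that the abstract relies on can be checked numerically. The minimal Python sketch below is not the authors' method: the two joint tables over a task representation Z and a behavior action A are hypothetical toy values. It only illustrates the standard identity I(Z;A) = H(A) - H(A|Z), so that with the action marginal fixed by the offline data, raising H(A|Z) lowers I(Z;A).

```python
import math


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mi_terms(joint):
    """Return (I(Z;A), H(A), H(A|Z)) for a joint table joint[z][a]."""
    p_z = [sum(row) for row in joint]
    p_a = [sum(col) for col in zip(*joint)]
    h_a = entropy(p_a)
    # H(A|Z) = sum_z p(z) * H(A | Z=z)
    h_a_given_z = sum(
        pz * entropy([p / pz for p in row]) for pz, row in zip(p_z, joint) if pz > 0
    )
    # I(Z;A) computed directly from its definition as a KL divergence.
    mutual_info = sum(
        p * math.log(p / (p_z[i] * p_a[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )
    return mutual_info, h_a, h_a_given_z


# Hypothetical joint tables sharing the action marginal p(a) = (0.5, 0.5).
dependent = [[0.4, 0.1], [0.1, 0.4]]  # representation reveals the action
independent = [[0.25, 0.25], [0.25, 0.25]]  # representation carries no action info

mi_dep, h_a_dep, h_cond_dep = mi_terms(dependent)
mi_ind, h_a_ind, h_cond_ind = mi_terms(independent)
```

In this toy case the independent table has the larger conditional entropy and zero mutual information, which is the direction of the objective described above.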
[COMMENTS]
7 Pages, Accepted at AAAI 2025
[LINK]
http://arxiv.org/abs/2412.14834v1
[DATE]
2024-12-19 21:24:01+08:00
[CATEGORIES]
cs.LG
Discovering Continuous-Time Memory-Based Symbolic Policies using Genetic Programming
[AUTHORS]
Sigur de Vries, Sander Keemink, Marcel van Gerven
[ABSTRACT]
Artificial intelligence techniques are increasingly being applied to solve
control problems, but often rely on black-box methods without transparent
output generation. To improve the interpretability and transparency in control
systems, models can be defined as white-box symbolic policies described by
mathematical expressions. For better performance in partially observable and
volatile environments, the symbolic policies are extended with memory
represented by continuous-time latent variables, governed by differential
equations. Genetic programming is used for optimisation, resulting in
interpretable policies consisting of symbolic expressions. Our results show
that symbolic policies with memory perform comparably to black-box policies on a variety
of control tasks. Furthermore, the benefit of the memory in symbolic policies
is demonstrated in experiments where memory-less policies fall short. Overall,
we present a method for evolving high-performing symbolic policies that offer
interpretability and transparency, which black-box models lack.
[COMMENTS]
21 pages including references and appendix, 5 figures, 1 algorithm, 5
tables
[LINK]
http://arxiv.org/abs/2406.02765v5
[DATE]
2024-12-19 21:16:21+08:00
[CATEGORIES]
cs.LG
Answer Set Networks: Casting Answer Set Programming into Deep Learning
[AUTHORS]
Arseny Skryagin, Daniel Ochs, Phillip Deibert, Simon Kohaut, Devendra Singh Dhami, Kristian Kersting
[ABSTRACT]
Although Answer Set Programming (ASP) allows constraining neural-symbolic
(NeSy) systems, its employment is hindered by the prohibitive costs of
computing stable models and the CPU-bound nature of state-of-the-art solvers.
To this end, we propose Answer Set Networks (ASN), a NeSy solver. Based on
Graph Neural Networks (GNN), ASNs are a scalable approach to ASP-based Deep
Probabilistic Logic Programming (DPPL). Specifically, we show how to translate
ASPs into ASNs and demonstrate how ASNs can efficiently solve the encoded
problem by leveraging GPU’s batching and parallelization capabilities. Our
experimental evaluations demonstrate that ASNs outperform state-of-the-art
CPU-bound NeSy systems on multiple tasks. Simultaneously, we make the following
two contributions based on the strengths of ASNs. Namely, we are the first to
show the finetuning of Large Language Models (LLM) with DPPLs, employing ASNs
to guide the training with logic. Further, we show the “constitutional
navigation” of drones, i.e., encoding public aviation laws in an ASN for
routing Unmanned Aerial Vehicles in uncertain environments.
[COMMENTS]
16 pages, 9 figures
[LINK]
http://arxiv.org/abs/2412.14814v1
[DATE]
2024-12-19 21:09:06+08:00
[CATEGORIES]
cs.LG
Shape error prediction in 5-axis machining using graph neural networks
[AUTHORS]
Julia Huuk, Abheek Dhingra, Eirini Ntoutsi, Berend Denkena
[ABSTRACT]
This paper presents an innovative method for predicting shape errors in
5-axis machining using graph neural networks. The graph structure is defined
with nodes representing workpiece surface points and edges denoting the
neighboring relationships. The dataset encompasses data from a material removal
simulation, process data, and post-machining quality information. Experimental
results show that the presented approach can generalize the shape error
prediction for the investigated workpiece geometry. Moreover, by modelling
spatial and temporal connections within the workpiece, the approach handles a
low number of labels compared to non-graphical methods such as Support Vector
Machines.
[LINK]
http://arxiv.org/abs/2412.10341v2
[DATE]
2024-12-19 21:03:10+08:00
[CATEGORIES]
cs.LG
MARIA: a Multimodal Transformer Model for Incomplete Healthcare Data
[AUTHORS]
Camillo Maria Caruso, Paolo Soda, Valerio Guarrasi
[ABSTRACT]
In healthcare, the integration of multimodal data is pivotal for developing
comprehensive diagnostic and predictive models. However, managing missing data
remains a significant challenge in real-world applications. We introduce MARIA
(Multimodal Attention Resilient to Incomplete datA), a novel transformer-based
deep learning model designed to address these challenges through an
intermediate fusion strategy. Unlike conventional approaches that depend on
imputation, MARIA utilizes a masked self-attention mechanism, which processes
only the available data without generating synthetic values. This approach
enables it to effectively handle incomplete datasets, enhancing robustness and
minimizing biases introduced by imputation methods. We evaluated MARIA against
10 state-of-the-art machine learning and deep learning models across 8
diagnostic and prognostic tasks. The results demonstrate that MARIA outperforms
existing methods in terms of performance and resilience to varying levels of
data incompleteness, underscoring its potential for critical healthcare
applications.
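
As an illustration of the general idea of masking missing inputs in attention (not MARIA's actual implementation, which the abstract does not detail), here is a minimal Python sketch. The function name, the raw scores, and the availability flags are all hypothetical. Missing entries receive zero weight and no synthetic value is substituted for them.

```python
import math


def masked_attention_weights(scores, available):
    """Softmax over attention scores, restricted to available entries.

    scores: list of floats, one raw score per key.
    available: list of bools; False marks a missing modality or feature.
    Missing keys get weight 0.0 and are never imputed.
    """
    kept = [s for s, ok in zip(scores, available) if ok]
    if not kept:
        raise ValueError("at least one entry must be available")
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, available)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: the second of three inputs is missing.
weights = masked_attention_weights([2.0, 5.0, 2.0], [True, False, True])
```

Here the large score attached to the missing entry has no influence: the two available entries split the attention weight equally.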
[LINK]
http://arxiv.org/abs/2412.14810v1
[DATE]
2024-12-19 21:00:03+08:00
[CATEGORIES]
cs.LG
Extending TWIG: Zero-Shot Predictive Hyperparameter Selection for KGEs based on Graph Structure
[AUTHORS]
Jeffrey Sardina, John D. Kelleher, Declan O’Sullivan
[ABSTRACT]
Knowledge Graphs (KGs) have seen increasing use across various domains –
from biomedicine and linguistics to general knowledge modelling. In order to
facilitate the analysis of knowledge graphs, Knowledge Graph Embeddings (KGEs)
have been developed to automatically analyse KGs and predict new facts based on
the information in a KG, a task called “link prediction”. Many existing studies
have documented that the structure of a KG, KGE model components, and KGE
hyperparameters can significantly change how well KGEs perform and what
relationships they are able to learn. Recently, the Topologically-Weighted
Intelligence Generation (TWIG) model has been proposed as a solution to
modelling how each of these elements relate. In this work, we extend the
previous research on TWIG and evaluate its ability to simulate the output of
the KGE model ComplEx in the cross-KG setting. Our results are twofold. First,
TWIG is able to summarise KGE performance on a wide range of hyperparameter
settings and KGs being learned, suggesting that it represents a general
knowledge of how to predict KGE performance from KG structure. Second, we show
that TWIG can successfully predict hyperparameter performance on unseen KGs in
the zero-shot setting. This second observation leads us to propose that, with
additional research, optimal hyperparameter selection for KGE models could be
determined in a pre-hoc manner using TWIG-like methods, rather than by using a
full hyperparameter search.
[LINK]
http://arxiv.org/abs/2412.14801v1
[DATE]
2024-12-19 20:47:21+08:00
[CATEGORIES]
cs.LG
TRAIL: Trust-Aware Client Scheduling for Semi-Decentralized Federated Learning
[AUTHORS]
Gangqiang Hu, Jianfeng Lu, Jianmin Han, Shuqin Cao, Jing Liu, Hao Fu
[ABSTRACT]
Due to the sensitivity of data, Federated Learning (FL) is employed to enable
distributed machine learning while safeguarding data privacy and accommodating
the requirements of various devices. However, in the context of
semi-decentralized FL, clients’ communication and training states are dynamic.
This variability arises from local training fluctuations, heterogeneous data
distributions, and intermittent client participation. Most existing studies
primarily focus on stable client states, neglecting the dynamic challenges
inherent in real-world scenarios. To tackle this issue, we propose a
TRust-Aware clIent scheduLing mechanism called TRAIL, which assesses client
states and contributions, enhancing model training efficiency through selective
client participation. We focus on a semi-decentralized FL framework where edge
servers and clients train a shared global model using unreliable intra-cluster
model aggregation and inter-cluster model consensus. First, we propose an
adaptive hidden semi-Markov model to estimate clients’ communication states and
contributions. Next, we address a client-server association optimization
problem to minimize global training loss. Using convergence analysis, we
propose a greedy client scheduling algorithm. Finally, our experiments
conducted on real-world datasets demonstrate that TRAIL outperforms
state-of-the-art baselines, achieving an improvement of 8.7% in test accuracy
and a reduction of 15.3% in training loss.
[LINK]
http://arxiv.org/abs/2412.11448v3
[DATE]
2024-12-19 20:46:27+08:00
[CATEGORIES]
cs.LG
Accelerating Diffusion Transformers with Token-wise Feature Caching
[AUTHORS]
Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, Linfeng Zhang
[ABSTRACT]
Diffusion transformers have shown significant effectiveness in both image and
video synthesis at the expense of huge computation costs. To address this
problem, feature caching methods have been introduced to accelerate diffusion
transformers by caching the features in previous timesteps and reusing them in
the following timesteps. However, previous caching methods ignore that
different tokens exhibit different sensitivities to feature caching, and
feature caching on some tokens may degrade the
overall generation quality 10$\times$ more than on other tokens. In this paper, we
introduce token-wise feature caching, allowing us to adaptively select the most
suitable tokens for caching, and further enable us to apply different caching
ratios to neural layers in different types and depths. Extensive experiments on
PixArt-$\alpha$, OpenSora, and DiT demonstrate the effectiveness of our method in both image
and video generation without requiring any training. For instance,
2.36$\times$ and 1.93$\times$ acceleration are achieved on OpenSora and
PixArt-$\alpha$ with almost no drop in generation quality.
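
The selection step described above can be pictured with a small Python sketch. This is an assumption-laden illustration, not the paper's algorithm: the per-token `sensitivity` scores, the `cache_ratio` parameter, and both function names are hypothetical, and the abstract does not say how sensitivity is measured. The sketch reuses cached features for the least sensitive tokens and recomputes the others.

```python
def select_tokens_to_cache(sensitivity, cache_ratio):
    """Return indices of the least sensitive tokens, to be served from cache.

    sensitivity: hypothetical per-token score; higher means reuse hurts more.
    cache_ratio: fraction of tokens (0.0 to 1.0) whose features are reused.
    """
    n_cached = int(len(sensitivity) * cache_ratio)
    order = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i])
    return sorted(order[:n_cached])


def step_features(tokens, cache, cached_idx, compute):
    """Reuse cached features for cached_idx; recompute and refresh the rest."""
    reuse = set(cached_idx)
    out = []
    for i, tok in enumerate(tokens):
        if i in reuse and i in cache:
            out.append(cache[i])  # feature from an earlier timestep
        else:
            cache[i] = compute(tok)  # fresh feature, stored for later reuse
            out.append(cache[i])
    return out


# Hypothetical usage: four tokens, half of them served from cache.
cached_idx = select_tokens_to_cache([0.9, 0.1, 0.5, 0.2], cache_ratio=0.5)
cache = {}
first = step_features([1, 2, 3, 4], cache, cached_idx, lambda t: t * 10)
second = step_features([1, 2, 3, 4], cache, cached_idx, lambda t: t * 100)
```

On the first step the cache is empty, so every token is computed; on the second step only the two sensitive tokens are recomputed. A different `cache_ratio` could be passed per layer to mimic the layer-dependent ratios the abstract mentions.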
[COMMENTS]
In this version, we achieved a nearly lossless acceleration of 1.51
times for ToCa on FLUX in the appendix
[LINK]
http://arxiv.org/abs/2410.05317v3
[DATE]
2024-12-19 20:38:23+08:00
[CATEGORIES]
cs.LG
Probability Distribution Learning and Its Application in Deep Learning
[AUTHORS]
Binchuan Qi
[ABSTRACT]
This paper introduces a novel theoretical learning framework, termed
probability distribution learning (PD learning). Departing from the traditional
statistical learning framework, PD learning focuses on learning the underlying
probability distribution, which is modeled as a random variable within the
probability simplex. In this framework, the optimization objective is the
learning error, which quantifies the posterior expected discrepancy between the
model’s predicted distribution and the underlying true distribution, given
available sample data and prior knowledge. To optimize the learning error, this
paper proposes the necessary conditions for loss functions, models, and
optimization algorithms, ensuring that these conditions are met in real-world
machine learning scenarios. Based on these conditions, the non-convex
optimization mechanism corresponding to model training can be theoretically
resolved. Moreover, this paper provides model-dependent and model-independent
bounds on learning error, offering new insights into the model’s fitting and
generalization capabilities. Furthermore, the paper applies the PD learning
framework to elucidate the mechanisms by which various techniques, including
random parameter initialization, over-parameterization, and dropout, influence
deep model training. Finally, the paper substantiates the key conclusions of
the proposed framework through experimental results.
[COMMENTS]
arXiv admin note: text overlap with arXiv:2105.04026 by other
authors
[LINK]
http://arxiv.org/abs/2406.05666v9
[DATE]
2024-12-19 20:13:26+08:00
[CATEGORIES]
cs.LG
Score and Distribution Matching Policy: Advanced Accelerated Visuomotor Policies via Matched Distillation
[AUTHORS]
Bofang Jia, Pengxiang Ding, Can Cui, Mingyang Sun, Pengfang Qian, Siteng Huang, Zhaoxin Fan, Donglin Wang
[ABSTRACT]
Visual-motor policy learning has advanced with architectures like
diffusion-based policies, known for modeling complex robotic trajectories.
However, their prolonged inference times hinder high-frequency control tasks
requiring real-time feedback. While consistency distillation (CD) accelerates
inference, it introduces errors that compromise action quality. To address
these limitations, we propose the Score and Distribution Matching Policy (SDM
Policy), which transforms diffusion-based policies into single-step generators
through a two-stage optimization process: score matching ensures alignment with
true action distributions, and distribution matching minimizes KL divergence
for consistency. A dual-teacher mechanism integrates a frozen teacher for
stability and an unfrozen teacher for adversarial training, enhancing
robustness and alignment with target distributions. Evaluated on a 57-task
simulation benchmark, SDM Policy achieves a 6x inference speedup while maintaining
state-of-the-art action quality, providing an efficient and reliable framework
for high-frequency robotic tasks.
[LINK]
http://arxiv.org/abs/2412.09265v4
[DATE]
2024-12-19 20:11:13+08:00
[CATEGORIES]
cs.LG
On the Expressivity of Persistent Homology in Graph Learning
[AUTHORS]
Rubén Ballester, Bastian Rieck
[ABSTRACT]
Persistent homology, a technique from computational topology, has recently
shown strong empirical performance in the context of graph classification.
Being able to capture long range graph properties via higher-order topological
features, such as cycles of arbitrary length, in combination with multi-scale
topological descriptors, has improved predictive performance for data sets with
prominent topological structures, such as molecules. At the same time, the
theoretical properties of persistent homology have not been formally assessed
in this context. This paper intends to bridge the gap between computational
topology and graph machine learning by providing a brief introduction to
persistent homology in the context of graphs, as well as a theoretical
discussion and empirical analysis of its expressivity for graph learning tasks.
[COMMENTS]
Accepted at the 3rd Learning on Graphs Conference (LoG) 2024
[LINK]
http://arxiv.org/abs/2302.09826v4
[DATE]
2024-12-19 20:07:06+08:00
[CATEGORIES]
cs.LG
Agent-Temporal Credit Assignment for Optimal Policy Preservation in Sparse Multi-Agent Reinforcement Learning
[AUTHORS]
Aditya Kapoor, Sushant Swamy, Kale-ab Tessera, Mayank Baranwal, Mingfei Sun, Harshad Khadilkar, Stefano V. Albrecht
[ABSTRACT]
In multi-agent environments, agents often struggle to learn optimal policies
due to sparse or delayed global rewards, particularly in long-horizon tasks
where it is challenging to evaluate actions at intermediate time steps. We
introduce Temporal-Agent Reward Redistribution (TAR$^2$), a novel approach
designed to address the agent-temporal credit assignment problem by
redistributing sparse rewards both temporally and across agents. TAR$^2$
decomposes sparse global rewards into time-step-specific rewards and calculates
agent-specific contributions to these rewards. We theoretically prove that
TAR$^2$ is equivalent to potential-based reward shaping, ensuring that the
optimal policy remains unchanged. Empirical results demonstrate that TAR$^2$
stabilizes and accelerates the learning process. Additionally, we show that
when TAR$^2$ is integrated with single-agent reinforcement learning algorithms,
it performs as well as or better than traditional multi-agent reinforcement
learning methods.
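
For readers unfamiliar with potential-based reward shaping, the following minimal Python sketch shows the general mechanism the abstract appeals to. It is not the TAR$^2$ algorithm: the potential function `phi`, the discount `gamma`, and the toy episode are hypothetical. Adding gamma * phi(s') - phi(s) to each reward makes the rewards dense, yet the discounted return changes only by a term that depends on the start and end states.

```python
def shaped_rewards(states, rewards, phi, gamma):
    """Add the shaping term gamma * phi(s') - phi(s) to each step reward.

    states has one more entry than rewards: states[t] -> states[t + 1].
    """
    return [
        r + gamma * phi(states[t + 1]) - phi(states[t])
        for t, r in enumerate(rewards)
    ]


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical sparse-reward episode: reward only at the final step.
states = [0, 1, 2, 3]
rewards = [0.0, 0.0, 1.0]
gamma = 0.9


def phi(s):
    # Hypothetical potential: grows with the state index, zero at terminal state 3.
    return 0.0 if s == 3 else float(s)


dense = shaped_rewards(states, rewards, phi, gamma)
original = discounted_return(rewards, gamma)
shaped = discounted_return(dense, gamma)

# The telescoping sum leaves only gamma**T * phi(s_T) - phi(s_0).
offset = gamma ** len(rewards) * phi(states[-1]) - phi(states[0])
```

Because the offset does not depend on the actions taken in between, every policy's return from a given start state shifts by the same amount, which is why the optimal policy is unchanged.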
[COMMENTS]
12 pages, 1 figure
[LINK]
http://arxiv.org/abs/2412.14779v1
[DATE]
2024-12-19 20:05:13+08:00
[CATEGORIES]
cs.LG
Generalized Encouragement-Based Instrumental Variables for Counterfactual Regression
[AUTHORS]
Anpeng Wu, Kun Kuang, Ruoxuan Xiong, Xiangwei Chen, Zexu Sun, Fei Wu, Kun Zhang
[ABSTRACT]
In causal inference, encouragement designs (EDs) are widely used to analyze
causal effects, when randomized controlled trials (RCTs) are impractical or
compliance to treatment cannot be perfectly enforced. Unlike RCTs, which
directly allocate treatments, EDs randomly assign encouragement policies that
positively motivate individuals to engage in a specific treatment. These random
encouragements act as instrumental variables (IVs), facilitating the
identification of causal effects through leveraging exogenous perturbations in
discrete treatment scenarios. However, real-world applications of encouragement
designs often face challenges such as incomplete randomization, limited
experimental data, and significantly fewer encouragements compared to
treatments, hindering precise causal effect estimation. To address this, this
paper introduces novel theories and algorithms for identifying the Conditional
Average Treatment Effect (CATE) using variations in encouragement. Further, by
leveraging both observational and encouragement data, we propose a generalized
IV estimator, named Encouragement-based Counterfactual Regression (EnCounteR),
to effectively estimate the causal effects. Extensive experiments on both
synthetic and real-world datasets demonstrate the superiority of EnCounteR over
existing methods.
[LINK]
http://arxiv.org/abs/2408.05428v2
[DATE]
2024-12-19 19:57:19+08:00
[CATEGORIES]
cs.LG
Smoothness Really Matters: A Simple Yet Effective Approach for Unsupervised Graph Domain Adaptation
[AUTHORS]
Wei Chen, Guo Ye, Yakun Wang, Zhao Zhang, Libang Zhang, Daxin Wang, Zhiqiang Zhang, Fuzhen Zhuang
[ABSTRACT]
Unsupervised Graph Domain Adaptation (UGDA) seeks to bridge distribution
shifts between domains by transferring knowledge from labeled source graphs to
given unlabeled target graphs. Existing UGDA methods primarily focus on
aligning features in the latent space learned by graph neural networks (GNNs)
across domains, often overlooking structural shifts, resulting in limited
effectiveness when addressing structurally complex transfer scenarios. Given
the sensitivity of GNNs to local structural features, even slight discrepancies
between source and target graphs could lead to significant shifts in node
embeddings, thereby reducing the effectiveness of knowledge transfer. To
address this issue, we introduce a novel approach for UGDA called Target-Domain
Structural Smoothing (TDSS). TDSS is a simple and effective method designed to
perform structural smoothing directly on the target graph, thereby mitigating
structural distribution shifts and ensuring the consistency of node
representations. Specifically, by integrating smoothing techniques with
neighborhood sampling, TDSS maintains the structural coherence of the target
graph while mitigating the risk of over-smoothing. Our theoretical analysis
shows that TDSS effectively reduces target risk by improving model smoothness.
Empirical results on three real-world datasets demonstrate that TDSS
outperforms recent state-of-the-art baselines, achieving significant
improvements across six transfer scenarios. The code is available at
https://github.com/cwei01/TDSS.
[COMMENTS]
11 pages, Accepted by AAAI2025
[LINK]
http://arxiv.org/abs/2412.11654v2
[DATE]
2024-12-19 19:47:34+08:00
[CATEGORIES]
cs.LG
MetaSymNet: A Tree-like Symbol Network with Adaptive Architecture and Activation Functions
[AUTHORS]
Yanjie Li, Weijun Li, Lina Yu, Min Wu, Jinyi Liu, Wenqiang Li, Meilan Hao, Shu Wei, Yusong Deng
[ABSTRACT]
Mathematical formulas serve as the means of communication between humans and
nature, encapsulating the operational laws governing natural phenomena. The
concise formulation of these laws is a crucial objective in scientific research
and an important challenge for artificial intelligence (AI). While traditional
artificial neural networks (MLP) excel at data fitting, they often yield
uninterpretable black box results that hinder our understanding of the
relationship between variables x and predicted values y. Moreover, the fixed
network architecture in MLP often gives rise to redundancy in both network
structure and parameters. To address these issues, we propose MetaSymNet, a
novel neural network that dynamically adjusts its structure in real-time,
allowing for both expansion and contraction. This adaptive network employs the
PANGU meta function as its activation function, which is a unique type capable
of evolving into various basic functions during training to compose
mathematical formulas tailored to specific needs. We then evolve the neural
network into a concise, interpretable mathematical expression. To evaluate
MetaSymNet’s performance, we compare it with four state-of-the-art symbolic
regression algorithms across more than 10 public datasets comprising 222
formulas. Our experimental results demonstrate that our algorithm outperforms
others consistently regardless of noise presence or absence. Furthermore, we
assess MetaSymNet against MLP and SVM regarding their fitting ability and
extrapolation capability, two essential aspects of machine learning
algorithms. The findings reveal that our algorithm excels in both areas.
Finally, we compare MetaSymNet with an MLP using iterative pruning in terms of
network structure complexity. The results show that MetaSymNet’s network
structure complexity is clearly lower than that of the MLP at the same goodness of fit.
[COMMENTS]
This work has been accepted by AAAI2025
[LINK]
http://arxiv.org/abs/2311.07326v2
[DATE]
2024-12-19 19:41:28+08:00
[CATEGORIES]
cs.LG
A parametric algorithm is optimal for non-parametric regression of smooth functions
[AUTHORS]
Davide Maran, Marcello Restelli
[ABSTRACT]
We address the regression problem for a general function $f:[-1,1]^d\to
\mathbb R$ when the learner selects the training points $\{x_i\}_{i=1}^n$ to
achieve a uniform error bound across the entire domain. In this setting, known
historically as nonparametric regression, we aim to establish a sample
complexity bound that depends solely on the function’s degree of smoothness.
Assuming periodicity at the domain boundaries, we introduce PADUA, an algorithm
that, with high probability, provides performance guarantees optimal up to
constant or logarithmic factors across all problem parameters. Notably, PADUA
is the first parametric algorithm with optimal sample complexity for this
setting. Due to this feature, we prove that, unlike the
non-parametric state of the art, PADUA enjoys optimal space complexity in the
prediction phase. To validate these results, we perform numerical experiments
over functions coming from real audio data, where PADUA shows comparable
performance to state-of-the-art methods, while requiring only a fraction of the
computational time.
[LINK]
http://arxiv.org/abs/2412.14744v1
[DATE]
2024-12-19 19:22:52+08:00
[CATEGORIES]
cs.LG
Active Inference and Human–Computer Interaction
[AUTHORS]
Roderick Murray-Smith, John H. Williamson, Sebastian Stein
[ABSTRACT]
Active Inference is a closed-loop computational theoretical basis for
understanding behaviour, based on agents with internal probabilistic generative
models that encode their beliefs about how hidden states in their environment
cause their sensations. We review Active Inference and how it could be applied
to model the human-computer interaction loop. Active Inference provides a
coherent framework for managing generative models of humans, their
environments, sensors and interface components. It informs off-line design and
supports real-time, online adaptation. It provides model-based explanations for
behaviours observed in HCI, and new tools to measure important concepts such as
agency and engagement. We discuss how Active Inference offers a new basis for a
theory of interaction in HCI, tools for design of modern, complex sensor-based
systems, and integration of artificial intelligence technologies, enabling it
to cope with diversity in human users and contexts. We discuss the practical
challenges in implementing such Active Inference-based systems.
[LINK]
http://arxiv.org/abs/2412.14741v1
[DATE]
2024-12-19 19:17:31+08:00
[CATEGORIES]
cs.LG
Training Datasets Generation for Machine Learning: Application to Vision Based Navigation
[AUTHORS]
Jérémy Lebreton, Ingo Ahrns, Roland Brochard, Christoph Haskamp, Hans Krüger, Matthieu Le Goff, Nicolas Menga, Nicolas Ollagnier, Ralf Regele, Francesco Capolupo, Massimo Casasco
[ABSTRACT]
Vision Based Navigation consists in utilizing cameras as precision sensors
for GNC after extracting information from images. To enable the adoption of
machine learning for space applications, one of the obstacles is the demonstration
that available training datasets are adequate to validate the algorithms. The
objective of the study is to generate datasets of images and metadata suitable
for training machine learning algorithms. Two use cases were selected and a
robust methodology was developed to validate the datasets including the ground
truth. The first use case is in-orbit rendezvous with a man-made object: a
mockup of the ENVISAT satellite. The second use case is a Lunar landing scenario.
Datasets were produced from archival datasets (Chang’e 3), from the laboratory
at DLR TRON facility and at Airbus Robotic laboratory, from SurRender software
high fidelity image simulator using Model Capture and from Generative
Adversarial Networks. The use case definition included the selection of
algorithms as benchmark: an AI-based pose estimation algorithm and a dense
optical flow algorithm were selected. Eventually it is demonstrated that
datasets produced with SurRender and selected laboratory facilities are
adequate to train machine learning algorithms.
[COMMENTS]
6 pages, 4 figures, preprint of the proceedings of ESA SPAICE
conference 2024
[LINK]
http://arxiv.org/abs/2409.11383v2
[DATE]
2024-12-19 19:12:30+08:00
[CATEGORIES]
cs.LG
Boosting GNN Performance via Training Sample Selection Based on Adversarial Robustness Evaluation
[AUTHORS]
Yongyu Wang
[ABSTRACT]
Graph Neural Networks (GNNs) have established themselves as one of the most
powerful neural network architectures, excelling in leveraging graph topology
and node features for various tasks. However, GNNs are inherently vulnerable to
noise in their inputs. Such noise can significantly degrade their performance.
To address this challenge, we propose a novel approach that employs adversarial
robustness evaluation techniques to identify nodes in the graph that are most
susceptible to noise. By selecting and constructing a training set composed of
these particularly noise-prone nodes, we then use them to train a Graph
Convolutional Network (GCN). Our experimental results demonstrate that this
strategy leads to substantial improvements in the GCN’s performance.
[LINK]
http://arxiv.org/abs/2412.14738v1
[DATE]
2024-12-19 19:10:48+08:00
[CATEGORIES]
cs.LG
Feature selection in linear SVMs via a hard cardinality constraint: a scalable SDP decomposition approach
[AUTHORS]
Immanuel Bomze, Federico D’Onofrio, Laura Palagi, Bo Peng
[ABSTRACT]
In this paper, we study the embedded feature selection problem in linear
Support Vector Machines (SVMs), in which a cardinality constraint is employed,
leading to an interpretable classification model. The problem is NP-hard due to
the presence of the cardinality constraint, even though the original linear SVM
amounts to a problem solvable in polynomial time. To handle the hard problem,
we first introduce two mixed-integer formulations for which novel semidefinite
relaxations are proposed. Exploiting the sparsity pattern of the relaxations,
we decompose the problems and obtain equivalent relaxations in a much smaller
cone, making the conic approaches scalable. To make the best usage of the
decomposed relaxations, we propose heuristics using the information of its
optimal solution. Moreover, an exact procedure is proposed by solving a
sequence of mixed-integer decomposed semidefinite optimization problems.
Numerical results on classical benchmarking datasets are reported, showing the
efficiency and effectiveness of our approach.
[COMMENTS]
Submitted to European Journal of Operational Research. arXiv admin
note: text overlap with arXiv:1808.02435 by other authors
[LINK]
http://arxiv.org/abs/2404.10099v2
[DATE]
2024-12-19 19:06:39+08:00
[CATEGORIES]
cs.LG
Scalable Acceleration for Classification-Based Derivative-Free Optimization
[AUTHORS]
Tianyi Han, Jingya Li, Zhipeng Guo, Yuan Jin
[ABSTRACT]
Derivative-free optimization algorithms play an important role in scientific
and engineering design optimization problems, especially when derivative
information is not accessible. In this paper, we study the framework of
sequential classification-based derivative-free optimization algorithms. By
introducing a learning-theoretic concept, the hypothesis-target shattering rate, we
revisit the computational complexity upper bound of SRACOS (Hu, Qian, and Yu
2017). Inspired by the revisited upper bound, we propose an algorithm named
RACE-CARS, which adds a random region-shrinking step compared with SRACOS. We
further establish theorems showing the acceleration by region shrinking.
Experiments on the synthetic functions as well as black-box tuning for
language-model-as-a-service demonstrate empirically the efficiency of
RACE-CARS. An ablation experiment on the introduced hyperparameters is also
conducted, revealing the mechanism of RACE-CARS and putting forward an
empirical hyper-parameter tuning guidance.
[LINK]
http://arxiv.org/abs/2309.11036v2
[DATE]
2024-12-19 18:58:39+08:00
[CATEGORIES]
cs.LG
Generative AI for Banks: Benchmarks and Algorithms for Synthetic Financial Transaction Data
[AUTHORS]
Fabian Sven Karst, Sook-Yee Chong, Abigail A. Antenor, Enyu Lin, Mahei Manhai Li, Jan Marco Leimeister
[ABSTRACT]
The banking sector faces challenges in using deep learning due to data
sensitivity and regulatory constraints, but generative AI may offer a solution.
Thus, this study identifies effective algorithms for generating synthetic
financial transaction data and evaluates five leading models - Conditional
Tabular Generative Adversarial Networks (CTGAN), DoppelGANger (DGAN),
Wasserstein GAN, Financial Diffusion (FinDiff), and Tabular Variational
AutoEncoders (TVAE) - across five criteria: fidelity, synthesis quality,
efficiency, privacy, and graph structure. While none of the algorithms is able
to replicate the real data’s graph structure, each excels in specific areas:
DGAN is ideal for privacy-sensitive tasks, FinDiff and TVAE excel in data
replication and augmentation, and CTGAN achieves a balance across all five
criteria, making it suitable for general applications with moderate privacy
concerns. As a result, our findings offer valuable insights for choosing the
most suitable algorithm.
[COMMENTS]
Presented at the 34th Workshop on Information Technologies and
Systems (WITS 2024)
[LINK]
http://arxiv.org/abs/2412.14730v1
[DATE]
2024-12-19 18:56:18+08:00
[CATEGORIES]
cs.LG
FROC: Building Fair ROC from a Trained Classifier
[AUTHORS]
Avyukta Manjunatha Vummintala, Shantanu Das, Sujit Gujar
[ABSTRACT]
This paper considers the problem of fair probabilistic binary classification
with binary protected groups. The classifier assigns scores, and a practitioner
predicts labels using a certain cut-off threshold based on the desired
trade-off between false positives and false negatives. The practitioner derives these
thresholds from the ROC of the classifier. The resultant classifier may be
unfair to one of the two protected groups in the dataset. It is desirable that
no matter what threshold the practitioner uses, the classifier should be fair
to both the protected groups; that is, the $\mathcal{L}_p$ norm between FPRs
and TPRs of both the protected groups should be at most $\varepsilon$. We call
such fairness on ROCs of both the protected attributes
$\varepsilon_p$-Equalized ROC. Given a classifier not satisfying
$\varepsilon_1$-Equalized ROC, we aim to design a post-processing method to
transform the given (potentially unfair) classifier’s output (score) to a
suitable randomized yet fair classifier. That is, the resultant classifier must
satisfy $\varepsilon_1$-Equalized ROC. First, we introduce a threshold query
model on the ROC curves for each protected group. The resulting classifier is
bound to face a reduction in AUC. With the proposed query model, we provide a
rigorous theoretical analysis of the minimal AUC loss to achieve
$\varepsilon_1$-Equalized ROC. To achieve this, we design a linear time
algorithm, namely \texttt{FROC}, to transform a given classifier’s output to a
probabilistic classifier that satisfies $\varepsilon_1$-Equalized ROC. We prove
that under certain theoretical conditions, \texttt{FROC}\ achieves the
theoretical optimal guarantees. We also study the performance of our
\texttt{FROC}\ on multiple real-world datasets with many trained classifiers.
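The fairness criterion described above can be illustrated with a small check. This is only a sketch of the definition as stated in the abstract, not the FROC algorithm itself: the function names (`rates`, `roc_gap`, `is_equalized_roc`) and the finite grid of thresholds are assumptions made for illustration.

```python
def rates(scores, labels, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold.

    Assumes labels are 0/1 and that both classes are present.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return fp / neg, tp / pos


def roc_gap(group_a, group_b, threshold, p=1):
    """L_p distance between the (FPR, TPR) points of two groups."""
    fpr_a, tpr_a = rates(group_a[0], group_a[1], threshold)
    fpr_b, tpr_b = rates(group_b[0], group_b[1], threshold)
    return (abs(fpr_a - fpr_b) ** p + abs(tpr_a - tpr_b) ** p) ** (1.0 / p)


def is_equalized_roc(group_a, group_b, thresholds, eps, p=1):
    """Check the gap at every threshold in a (hypothetical) finite grid."""
    return all(roc_gap(group_a, group_b, t, p) <= eps for t in thresholds)


# Toy data: each group is a (scores, labels) pair.
group_a = ([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
group_b = ([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
print(roc_gap(group_a, group_b, 0.5, p=1))
```

A classifier that passes this check on a dense grid of thresholds is fair in the sense above regardless of which cut-off the practitioner picks.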
[COMMENTS]
51 pages, The 39th Annual AAAI Conference on Artificial Intelligence
[LINK]
http://arxiv.org/abs/2412.14724v1
[DATE]
2024-12-19 18:47:31+08:00
[CATEGORIES]
cs.LG
Guiding a Diffusion Model with a Bad Version of Itself
[AUTHORS]
Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, Samuli Laine
[ABSTRACT]
The primary axes of interest in image-generating diffusion models are image
quality, the amount of variation in the results, and how well the results align
with a given condition, e.g., a class label or a text prompt. The popular
classifier-free guidance approach uses an unconditional model to guide a
conditional model, leading to simultaneously better prompt alignment and
higher-quality images at the cost of reduced variation. These effects seem
inherently entangled, and thus hard to control. We make the surprising
observation that it is possible to obtain disentangled control over image
quality without compromising the amount of variation by guiding generation
using a smaller, less-trained version of the model itself rather than an
unconditional model. This leads to significant improvements in ImageNet
generation, setting record FIDs of 1.01 for 64x64 and 1.25 for 512x512, using
publicly available networks. Furthermore, the method is also applicable to
unconditional diffusion models, drastically improving their quality.
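Guidance of this kind is commonly written as a linear extrapolation away from the guiding model's output. The sketch below assumes that rule and treats model outputs as plain lists of floats; `guided_output` and its arguments are hypothetical names, and no real diffusion network is involved.

```python
def guided_output(main_out, guide_out, w):
    """Extrapolate from the guiding model's output toward the main model's.

    w = 1 returns the main model's output unchanged; w > 1 pushes the result
    further away from the guiding model. In classifier-free guidance the
    guide is an unconditional model; in the approach described above it is
    assumed to be a smaller, less-trained version of the main model.
    """
    return [g + w * (m - g) for m, g in zip(main_out, guide_out)]


main_out = [1.0, 2.0]   # hypothetical denoiser output of the main model
guide_out = [0.0, 1.0]  # hypothetical output of the weaker guiding model
print(guided_output(main_out, guide_out, 2.0))
```

Only the choice of guiding model differs between the two schemes under this assumption; the combination rule stays the same.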
[COMMENTS]
NeurIPS 2024
[LINK]
http://arxiv.org/abs/2406.02507v3
[DATE]
2024-12-19 18:43:11+08:00
[CATEGORIES]
cs.LG
Prototypical Calibrating Ambiguous Samples for Micro-Action Recognition
[AUTHORS]
Kun Li, Dan Guo, Guoliang Chen, Chunxiao Fan, Jingyuan Xu, Zhiliang Wu, Hehe Fan, Meng Wang
[ABSTRACT]
Micro-Action Recognition (MAR) has gained increasing attention due to its
crucial role as a form of non-verbal communication in social interactions, with
promising potential for applications in human communication and emotion
analysis. However, current approaches often overlook the inherent ambiguity in
micro-actions, which arises from the wide category range and subtle visual
differences between categories. This oversight hampers the accuracy of
micro-action recognition. In this paper, we propose a novel Prototypical
Calibrating Ambiguous Network (\textbf{PCAN}) to unleash and mitigate the
ambiguity of MAR. \textbf{Firstly}, we employ a hierarchical action-tree to
identify the ambiguous sample, categorizing them into distinct sets of
ambiguous samples of false negatives and false positives, considering both
body- and action-level categories. \textbf{Secondly}, we implement an ambiguous
contrastive refinement module to calibrate these ambiguous samples by
regulating the distance between ambiguous samples and their corresponding
prototypes. This calibration process aims to pull false negative
($\mathbb{FN}$) samples closer to their respective prototypes and push false
positive ($\mathbb{FP}$) samples apart from their affiliated prototypes. In
addition, we propose a new prototypical diversity amplification loss to
strengthen the model’s capacity by amplifying the differences between different
prototypes. \textbf{Finally}, we propose a prototype-guided rectification to
rectify prediction by incorporating the representability of prototypes.
Extensive experiments conducted on the benchmark dataset demonstrate the
superior performance of our method compared to existing approaches. The code is
available at https://github.com/kunli-cs/PCAN.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.14719v1
[DATE]
2024-12-19 18:41:24+08:00
[CATEGORIES]
cs.LG
A Comprehensive Forecasting Framework based on Multi-Stage Hierarchical Forecasting Reconciliation and Adjustment
[AUTHORS]
Zhengchao Yang, Mithun Ghosh, Anish Saha, Dong Xu, Konstantin Shmakov, Kuang-chih Lee
[ABSTRACT]
Ads demand forecasting for Walmart’s ad products plays a critical role in
enabling effective resource planning, allocation, and management of ads
performance. In this paper, we introduce a comprehensive demand forecasting
system that tackles hierarchical time series forecasting in business settings.
Though traditional hierarchical reconciliation methods ensure forecasting
coherence, they often trade off accuracy for coherence especially at lower
levels and fail to capture the seasonality unique to each time-series in the
hierarchy. Thus, we propose a novel framework “Multi-Stage Hierarchical
Forecasting Reconciliation and Adjustment (Multi-Stage HiFoReAd)” to address
the challenges of preserving seasonality, ensuring coherence, and improving
accuracy. Our system first utilizes diverse models, ensembled through Bayesian
Optimization (BO), achieving base forecasts. The generated base forecasts are
then passed into the Multi-Stage HiFoReAd framework. The initial stage refines
the hierarchy using Top-Down forecasts and “harmonic alignment.” The second
stage aligns the higher levels’ forecasts using the MinTrace algorithm, following
which the last two levels undergo “harmonic alignment” and “stratified
scaling”, to eventually achieve accurate and coherent forecasts across the
whole hierarchy. Our experiments on Walmart’s internal Ads-demand dataset and 3
other public datasets, each with 4 hierarchical levels, demonstrate that the
average Absolute Percentage Error from the cross-validation sets improves by
3% to 40% across levels against BO-ensemble of models (LGBM, MSTL+ETS, Prophet)
as well as by 1.2% to 92.9% against State-Of-The-Art models. In addition, the
forecasts at all hierarchical levels are proved to be coherent. The proposed
framework has been deployed and leveraged by Walmart’s ads, sales and
operations teams to track future demands, make informed decisions and plan
resources.
[COMMENTS]
Published in 2024 IEEE International Conference on Big Data (BigData)
[LINK]
http://arxiv.org/abs/2412.14718v1
[DATE]
2024-12-19 18:33:19+08:00
[CATEGORIES]
cs.LG
Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm
[AUTHORS]
Sarwan Ali, Haris Mansoor, Prakash Chourasia, Imdad Ullah Khan, Murray Patterson
[ABSTRACT]
In molecular structure data, SMILES (Simplified Molecular Input Line Entry
System) strings are used to analyze molecular structure design. Numerical
feature representation of SMILES strings is a challenging task. This work
proposes a kernel-based approach for encoding and analyzing molecular
structures from SMILES strings. The proposed approach involves computing a
kernel matrix using the Sinkhorn-Knopp algorithm while using kernel principal
component analysis (PCA) for dimensionality reduction. The resulting
low-dimensional embeddings are then used for classification and regression
analysis. The kernel matrix is computed by converting the SMILES strings into
molecular structures using the Morgan Fingerprint, which computes a fingerprint
for each molecule. The distance matrix is computed using the pairwise kernels
function. The Sinkhorn-Knopp algorithm is used to compute the final kernel
matrix that satisfies the constraints of a probability distribution. This is
achieved by iteratively adjusting the kernel matrix until the marginal
distributions of the rows and columns match the desired marginal distributions.
We provide a comprehensive empirical analysis of the proposed kernel method to
evaluate its quality in greater depth. The suggested method is assessed for
drug subcategory prediction (classification task) and solubility AlogPS
“Aqueous solubility and Octanol/Water partition coefficient” (regression task)
using the benchmark SMILES string dataset. The outcomes show the proposed
method outperforms several baseline methods in terms of supervised analysis and
has potential uses in molecular design and drug discovery. Overall, the
suggested method is a promising avenue for kernel methods-based molecular
structure analysis and design.
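The iterative marginal-matching step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a strictly positive square matrix and uniform target marginals (every row and column sums to 1), and `sinkhorn_knopp` is a name chosen here.

```python
def sinkhorn_knopp(matrix, iterations=200):
    """Scale a strictly positive matrix toward doubly stochastic form.

    Alternately rescales rows and columns; r and c hold the diagonal
    scaling factors, so the result is diag(r) * matrix * diag(c).
    """
    n, m = len(matrix), len(matrix[0])
    r = [1.0] * n
    c = [1.0] * m
    for _ in range(iterations):
        # Choose row factors so that every row of the scaled matrix sums to 1.
        r = [1.0 / sum(matrix[i][j] * c[j] for j in range(m)) for i in range(n)]
        # Choose column factors so that every column sums to 1.
        c = [1.0 / sum(matrix[i][j] * r[i] for i in range(n)) for j in range(m)]
    return [[r[i] * matrix[i][j] * c[j] for j in range(m)] for i in range(n)]


kernel = [[1.0, 2.0], [3.0, 4.0]]  # toy stand-in for a kernel matrix
scaled = sinkhorn_knopp(kernel)
print([sum(row) for row in scaled])
```

Non-uniform target marginals would replace the `1.0` numerators with the desired row and column sums.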
[LINK]
http://arxiv.org/abs/2412.14717v1
[DATE]
2024-12-19 18:31:25+08:00
[CATEGORIES]
cs.LG
Holistic Adversarially Robust Pruning
[AUTHORS]
Qi Zhao, Christian Wressnegger
[ABSTRACT]
Neural networks can be drastically shrunk in size by removing redundant
parameters. While crucial for the deployment on resource-constrained hardware,
oftentimes, compression comes with a severe drop in accuracy and lack of
adversarial robustness. Despite recent advances, counteracting both aspects has
only succeeded for moderate compression rates so far. We propose a novel
method, HARP, that copes with aggressive pruning significantly better than
prior work. For this, we consider the network holistically. We learn a global
compression strategy that optimizes how many parameters (compression rate) and
which parameters (scoring connections) to prune specific to each layer
individually. Our method fine-tunes an existing model with dynamic
regularization, that follows a step-wise incremental function balancing the
different objectives. It starts by favoring robustness before shifting focus on
reaching the target compression rate and only then handles the objectives
equally. The learned compression strategies allow us to maintain the
pre-trained model’s natural accuracy and its adversarial robustness for a
reduction by 99% of the network’s original size. Moreover, we observe a crucial
influence of non-uniform compression across layers.
[COMMENTS]
Accepted by ICLR 2023
[LINK]
http://arxiv.org/abs/2412.14714v1
[DATE]
2024-12-19 18:25:21+08:00
[CATEGORIES]
cs.LG
ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
[AUTHORS]
Ziteng Wang, Jianfei Chen, Jun Zhu
[ABSTRACT]
Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to
scale up model capacity without increasing the computation budget. However,
vanilla TopK routers are trained in a discontinuous, non-differentiable way,
limiting their performance and scalability. To address this issue, we propose
ReMoE, a fully differentiable MoE architecture that offers a simple yet
effective drop-in replacement for the conventional TopK+Softmax routing,
utilizing ReLU as the router instead. We further propose methods to regulate
the router’s sparsity while balancing the load among experts. ReMoE’s
continuous nature enables efficient dynamic allocation of computation across
tokens and layers, while also exhibiting domain specialization. Our experiments
demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across
various model sizes, expert counts, and levels of granularity. Furthermore,
ReMoE exhibits superior scalability with respect to the number of experts,
surpassing traditional MoE architectures. The implementation based on
Megatron-LM is available at https://github.com/thu-ml/ReMoE.
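The difference between the two routing rules can be seen on a single vector of router logits. The sketch below is an illustration under assumptions, not the released implementation: `topk_softmax_gates` applies softmax and then keeps the k largest gates (one common TopK variant), and `relu_gates` applies ReLU directly to the logits.

```python
import math


def topk_softmax_gates(logits, k):
    """Softmax over all experts, then zero out all but the k largest gates.

    The hard selection makes the gate a discontinuous function of the logits.
    """
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]


def relu_gates(logits):
    """ReLU routing: an expert is active exactly when its logit is positive.

    The gate is continuous in the logits, and the number of active experts
    can vary from token to token.
    """
    return [max(0.0, z) for z in logits]


logits = [-1.0, 0.5, 2.0]  # hypothetical router outputs for three experts
print(topk_softmax_gates(logits, k=1))
print(relu_gates(logits))
```

With ReLU gates the sparsity is no longer fixed by k, which is why the abstract mentions additional methods for regulating sparsity and balancing load.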
[LINK]
http://arxiv.org/abs/2412.14711v1
[DATE]
2024-12-19 18:21:20+08:00
[CATEGORIES]
cs.LG
Learning Deep Dissipative Dynamics
[AUTHORS]
Yuji Okamoto, Ryosuke Kojima
[ABSTRACT]
This study takes on the challenge of strictly guaranteeing “dissipativity” of a dynamical
system represented by neural networks learned from given time-series data.
Dissipativity is a crucial indicator for dynamical systems that generalizes
stability and input-output stability, known to be valid across various systems
including robotics, biological systems, and molecular dynamics. By analytically
proving the general solution to the nonlinear Kalman-Yakubovich-Popov (KYP)
lemma, which is the necessary and sufficient condition for dissipativity, we
propose a differentiable projection that transforms any dynamics represented by
neural networks into dissipative ones and a learning method for the transformed
dynamics. Utilizing the generality of dissipativity, our method strictly
guarantees stability, input-output stability, and energy conservation of trained
dynamical systems. Finally, we demonstrate the robustness of our method against
out-of-domain input through applications to robotic arms and fluid dynamics.
Code is available at https://github.com/kojima-r/DeepDissipativeModel
[COMMENTS]
AAAI 2025
[LINK]
http://arxiv.org/abs/2408.11479v2
[DATE]
2024-12-19 18:21:11+08:00
[CATEGORIES]
cs.LG
Grimm: A Plug-and-Play Perturbation Rectifier for Graph Neural Networks Defending against Poisoning Attacks
[AUTHORS]
Ao Liu, Wenshan Li, Beibei Li, Wengang Ma, Tao Li, Pan Zhou
[ABSTRACT]
Recent studies have revealed the vulnerability of graph neural networks
(GNNs) to adversarial poisoning attacks on node classification tasks. Current
defensive methods require substituting the original GNNs with defense models,
regardless of the original’s type. This approach, while targeting adversarial
robustness, compromises the enhancements developed in prior research to boost
GNNs’ practical performance. Here we introduce Grimm, the first plug-and-play
defense model. With just a minimal interface requirement for extracting
features from any layer of the protected GNNs, Grimm is thus enabled to
seamlessly rectify perturbations. Specifically, we utilize the feature
trajectories (FTs) generated by GNNs, as they evolve through epochs, to reflect
the training status of the networks. We then theoretically prove that the FTs
of victim nodes will inevitably exhibit discriminable anomalies. Consequently,
inspired by the natural parallelism between the biological nervous and immune
systems, we construct Grimm, a comprehensive artificial immune system for GNNs.
Grimm not only detects abnormal FTs and rectifies adversarial edges during
training but also operates efficiently in parallel, thereby mirroring the
concurrent functionalities of its biological counterparts. We experimentally
confirm that Grimm offers four empirically validated advantages: 1)
Harmlessness, as it does not actively interfere with GNN training; 2)
Parallelism, ensuring monitoring, detection, and rectification functions
operate independently of the GNN training process; 3) Generalizability,
demonstrating compatibility with mainstream GNNs such as GCN, GAT, and
GraphSAGE; and 4) Transferability, as the detectors for abnormal FTs can be
efficiently transferred across different systems for one-step rectification.
[COMMENTS]
19 pages, 13 figures
[LINK]
http://arxiv.org/abs/2412.08555v2
[DATE]
2024-12-19 18:12:38+08:00
[CATEGORIES]
cs.LG
Holdouts set for safe predictive model updating
[AUTHORS]
Sami Haidar-Wehbe, Samuel R Emerson, Louis J M Aslett, James Liley
[ABSTRACT]
Predictive risk scores for adverse outcomes are increasingly crucial in
guiding health interventions. Such scores may need to be periodically updated
due to change in the distributions they model. However, directly updating risk
scores used to guide intervention can lead to biased risk estimates. To address
this, we propose updating using a ‘holdout set’ - a subset of the population
that does not receive interventions guided by the risk score. Balancing the
holdout set size is essential to ensure good performance of the updated risk
score whilst minimising the number of held out samples. We prove that this
approach reduces adverse outcome frequency to an asymptotically optimal level
and argue that often there is no competitive alternative. We describe
conditions under which an optimal holdout size (OHS) can be readily identified,
and introduce parametric and semi-parametric algorithms for OHS estimation. We
apply our methods to the ASPRE risk score for pre-eclampsia to recommend a plan
for updating it in the presence of change in the underlying data distribution.
We show that, in order to minimise the number of pre-eclampsia cases over time,
this is best achieved using a holdout set of around 10,000 individuals.
[COMMENTS]
Manuscript includes supplementary materials and figures
[LINK]
http://arxiv.org/abs/2202.06374v5
[DATE]
2024-12-19 18:12:00+08:00
[CATEGORIES]
cs.LG
DG-Mamba: Robust and Efficient Dynamic Graph Structure Learning with Selective State Space Models
[AUTHORS]
Haonan Yuan, Qingyun Sun, Zhaonan Wang, Xingcheng Fu, Cheng Ji, Yongjian Wang, Bo Jin, Jianxin Li
[ABSTRACT]
Dynamic graphs exhibit intertwined spatio-temporal evolutionary patterns,
widely existing in the real world. Nevertheless, the structure incompleteness,
noise, and redundancy result in poor robustness for Dynamic Graph Neural
Networks (DGNNs). Dynamic Graph Structure Learning (DGSL) offers a promising
way to optimize graph structures. However, aside from encountering unacceptable
quadratic complexity, it overly relies on heuristic priors, making it hard to
discover underlying predictive patterns. How to efficiently refine the dynamic
structures, capture intrinsic dependencies, and learn robust representations,
remains under-explored. In this work, we propose the novel DG-Mamba, a robust
and efficient Dynamic Graph structure learning framework with the Selective
State Space Models (Mamba). To accelerate the spatio-temporal structure
learning, we propose a kernelized dynamic message-passing operator that reduces
the quadratic time complexity to linear. To capture global intrinsic dynamics,
we establish the dynamic graph as a self-contained system with State Space
Model. By discretizing the system states with the cross-snapshot graph
adjacency, we enable the capture of long-distance dependencies with the
selective snapshot scan. To make the learned dynamic structures more expressive
and informative, we propose the self-supervised Principle of Relevant
Information for DGSL to regularize the most relevant yet least redundant
information, enhancing global robustness. Extensive experiments demonstrate the
superiority of the robustness and efficiency of our DG-Mamba compared with the
state-of-the-art baselines against adversarial attacks.
[COMMENTS]
Accepted by the Main Technical Track of the 39th Annual AAAI
Conference on Artificial Intelligence (AAAI-2025)
[LINK]
http://arxiv.org/abs/2412.08160v4
[DATE]
2024-12-19 18:01:27+08:00
[CATEGORIES]
cs.LG
Lorentzian Residual Neural Networks
[AUTHORS]
Neil He, Menglin Yang, Rex Ying
[ABSTRACT]
Hyperbolic neural networks have emerged as a powerful tool for modeling
hierarchical data structures prevalent in real-world datasets. Notably,
residual connections, which facilitate the direct flow of information across
layers, have been instrumental in the success of deep neural networks. However,
current methods for constructing hyperbolic residual networks suffer from
limitations such as increased model complexity, numerical instability, and
errors due to multiple mappings to and from the tangent space. To address these
limitations, we introduce LResNet, a novel Lorentzian residual neural network
based on the weighted Lorentzian centroid in the Lorentz model of hyperbolic
geometry. Our method enables the efficient integration of residual connections
in Lorentz hyperbolic neural networks while preserving their hierarchical
representation capabilities. We demonstrate that our method can theoretically
derive previous methods while offering improved stability, efficiency, and
effectiveness. Extensive experiments on both graph and vision tasks showcase
the superior performance and robustness of our method compared to
state-of-the-art Euclidean and hyperbolic alternatives. Our findings highlight
the potential of LResNet for building more expressive neural networks in
hyperbolic embedding space as a generally applicable method to multiple
architectures, including CNNs, GNNs, and graph Transformers.
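A residual connection based on a weighted Lorentzian centroid can be sketched without leaving the hyperboloid. The code below is an illustration under stated assumptions, not the authors' code: curvature is fixed at -1, the weights `w_x` and `w_y` are positive constants, and all function names are chosen here.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the remaining products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the hyperboloid <x, x>_L = -1, x0 > 0."""
    return [math.sqrt(1.0 + sum(a * a for a in v))] + list(v)


def lorentz_residual(x, fx, w_x=1.0, w_y=1.0):
    """Weighted sum of two hyperboloid points, renormalised onto the hyperboloid.

    No mapping to or from the tangent space is needed.
    """
    z = [w_x * a + w_y * b for a, b in zip(x, fx)]
    norm = math.sqrt(abs(lorentz_inner(z, z)))
    return [a / norm for a in z]


x = lift_to_hyperboloid([0.3, -0.2])   # hypothetical layer input
fx = lift_to_hyperboloid([1.0, 0.5])   # hypothetical layer output
out = lorentz_residual(x, fx)
print(lorentz_inner(out, out))
```

The printed value should be close to -1, confirming that the combined point stays on the hyperboloid.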
[COMMENTS]
12 pages, 3 figures, KDD 2025
[LINK]
http://arxiv.org/abs/2412.14695v1
[DATE]
2024-12-19 17:56:01+08:00
[CATEGORIES]
cs.LG
T-JEPA: Augmentation-Free Self-Supervised Learning for Tabular Data
[AUTHORS]
Hugo Thimonier, José Lucas De Melo Costa, Fabrice Popineau, Arpad Rimmel, Bich-Liên Doan
[ABSTRACT]
Self-supervision is often used for pre-training to foster performance on a
downstream task by constructing meaningful representations of samples.
Self-supervised learning (SSL) generally involves generating different views of
the same sample and thus requires data augmentations that are challenging to
construct for tabular data. This constitutes one of the main challenges of
self-supervision for structured data. In the present work, we propose a novel
augmentation-free SSL method for tabular data. Our approach, T-JEPA, relies on
a Joint Embedding Predictive Architecture (JEPA) and is akin to mask
reconstruction in the latent space. It involves predicting the latent
representation of one subset of features from the latent representation of a
different subset within the same sample, thereby learning rich representations
without augmentations. We use our method as a pre-training technique and train
several deep classifiers on the obtained representation. Our experimental
results demonstrate a substantial improvement in both classification and
regression tasks, outperforming models trained directly on samples in their
original data space. Moreover, T-JEPA enables some methods to consistently
outperform or match the performance of traditional methods like Gradient
Boosted Decision Trees. To understand why, we extensively characterize the
obtained representations and show that T-JEPA effectively identifies relevant
features for downstream tasks without access to the labels. Additionally, we
introduce regularization tokens, a novel regularization method critical for
training of JEPA-based models on structured data.
[LINK]
http://arxiv.org/abs/2410.05016v2
[DATE]
2024-12-19 17:49:25+08:00
[CATEGORIES]
cs.LG
Leveraging Group Classification with Descending Soft Labeling for Deep Imbalanced Regression
[AUTHORS]
Ruizhi Pu, Gezheng Xu, Ruiyi Fang, Binkun Bao, Charles X. Ling, Boyu Wang
[ABSTRACT]
Deep imbalanced regression (DIR), where the target values have a highly
skewed distribution and are also continuous, is an intriguing yet
under-explored problem in machine learning.
While recent works have already shown that incorporating various
classification-based regularizers can produce enhanced outcomes, the role of
classification remains elusive in DIR.
Moreover, such regularizers (e.g., contrastive penalties) merely focus on
learning discriminative features of data, which inevitably results in ignorance
of either continuity or similarity across the data.
To address these issues, we first bridge the connection between the
objectives of DIR and classification from a Bayesian perspective.
Consequently, this motivates us to decompose the objective of DIR into a
combination of classification and regression tasks, which naturally guides us
toward solving the DIR problem in a divide-and-conquer manner.
Specifically, by aggregating the data at nearby labels into the same groups,
we introduce an ordinal group-aware contrastive learning loss along with a
multi-experts regressor to tackle the different groups of data thereby
maintaining the data continuity.
Meanwhile, considering the similarity between the groups, we also propose a
symmetric descending soft labeling strategy to exploit the intrinsic similarity
across the data, which allows classification to facilitate regression more
effectively.
Extensive experiments on real-world datasets also validate the effectiveness
of our method.
[LINK]
http://arxiv.org/abs/2412.12327v2
[DATE]
2024-12-19 17:34:30+08:00
[CATEGORIES]
cs.LG
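As a rough illustration of the symmetric descending soft labeling idea in the abstract above (Leveraging Group Classification with Descending Soft Labeling), the sketch below builds a label vector that peaks at the true group and falls off equally on both sides. The geometric decay, the `decay` parameter, and the function name are assumptions made for illustration; the abstract does not specify the exact weighting scheme.

```python
def descending_soft_labels(num_groups, true_group, decay=0.5):
    """Soft label vector that peaks at true_group and decays symmetrically.

    The geometric decay is a hypothetical choice, not the paper's scheme.
    """
    # Weight shrinks with the distance (in groups) from the true group.
    raw = [decay ** abs(g - true_group) for g in range(num_groups)]
    total = sum(raw)
    # Normalise so the weights form a probability distribution.
    return [w / total for w in raw]


labels = descending_soft_labels(num_groups=5, true_group=2)
```

Neighbouring groups receive more probability mass than distant ones, which is one way a classifier could be told that nearby label ranges are similar.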
Langevin dynamics for high-dimensional optimization: the case of multi-spiked tensor PCA
[AUTHORS]
Gérard Ben Arous, Cédric Gerbelot, Vanessa Piccolo
[ABSTRACT]
We study nonconvex optimization in high dimensions through Langevin dynamics,
focusing on the multi-spiked tensor PCA problem. This tensor estimation problem
involves recovering $r$ hidden signal vectors (spikes) from noisy Gaussian
tensor observations using maximum likelihood estimation. We study the number of
samples required for Langevin dynamics to efficiently recover the spikes and
determine the necessary separation condition on the signal-to-noise ratios
(SNRs) for exact recovery, distinguishing the cases $p \ge 3$ and $p=2$, where
$p$ denotes the order of the tensor. In particular, we show that the sample
complexity required for recovering the spike associated with the largest SNR
matches the well-known algorithmic threshold for the single-spike case, while
this threshold degrades when recovering all $r$ spikes. As a key step, we
provide a detailed characterization of the trajectory and interactions of
low-dimensional projections that capture the high-dimensional dynamics.
[COMMENTS]
65 pages
[LINK]
http://arxiv.org/abs/2408.06401v2
[DATE]
2024-12-19 17:30:05+08:00
[CATEGORIES]
cs.LG
LoLaFL: Low-Latency Federated Learning via Forward-only Propagation
[AUTHORS]
Jierui Zhang, Jianhao Huang, Kaibin Huang
[ABSTRACT]
Federated learning (FL) has emerged as a widely adopted paradigm for enabling
edge learning with distributed data while ensuring data privacy. However, the
traditional FL with deep neural networks trained via backpropagation can hardly
meet the low-latency learning requirements in the sixth generation (6G) mobile
networks. This challenge mainly arises from the high-dimensional model
parameters to be transmitted and the numerous rounds of communication required
for convergence due to the inherent randomness of the training process. To
address this issue, we adopt the state-of-the-art principle of maximal coding
rate reduction to learn linear discriminative features and extend the resultant
white-box neural network into FL, yielding the novel framework of Low-Latency
Federated Learning (LoLaFL) via forward-only propagation. LoLaFL enables
layer-wise transmissions and aggregation with significantly fewer communication
rounds, thereby considerably reducing latency. Additionally, we propose two
\emph{nonlinear} aggregation schemes for LoLaFL. The first scheme is based on
the proof that the optimal NN parameter aggregation in LoLaFL should be
harmonic-mean-like. The second scheme further exploits the low-rank structures
of the features and transmits the low-rank-approximated covariance matrices of
features to achieve additional latency reduction. Theoretical analysis and
experiments are conducted to evaluate the performance of LoLaFL. In comparison
with traditional FL, the two nonlinear aggregation schemes for LoLaFL can
achieve reductions in latency of over 91% and 98%, respectively, while
maintaining comparable accuracies.
[COMMENTS]
14 pages, 9 figures
[LINK]
http://arxiv.org/abs/2412.14668v1
[DATE]
2024-12-19 17:20:27+08:00
[CATEGORIES]
cs.LG
Trainable Adaptive Activation Function Structure (TAAFS) Enhances Neural Network Force Field Performance with Only Dozens of Additional Parameters
[AUTHORS]
Enji Li
[ABSTRACT]
At the heart of neural network force fields (NNFFs) is the architecture of
neural networks, where the capacity to model complex interactions is typically
enhanced through widening or deepening multilayer perceptrons (MLPs) or by
increasing layers of graph neural networks (GNNs). These enhancements, while
improving the model’s performance, often come at the cost of a substantial
increase in the number of parameters. By applying the Trainable Adaptive
Activation Function Structure (TAAFS), we introduce a method that selects
distinct mathematical formulations for non-linear activations, thereby
increasing the precision of NNFFs with an insignificant addition to the
parameter count. In this study, we integrate TAAFS into a variety of neural
network models, resulting in observed accuracy improvements, and further
validate these enhancements through molecular dynamics (MD) simulations using
DeepMD.
[LINK]
http://arxiv.org/abs/2412.14655v1
[DATE]
2024-12-19 17:06:39+08:00
[CATEGORIES]
cs.LG
Image Classification with Rotation-Invariant Variational Quantum Circuits
[AUTHORS]
Paul San Sebastian, Mikel Cañizo, Román Orús
[ABSTRACT]
Variational quantum algorithms are gaining attention as an early application
of Noisy Intermediate-Scale Quantum (NISQ) devices. One of the main problems of
variational methods lies in the phenomenon of Barren Plateaus, present in the
optimization of variational parameters. Adding geometric inductive bias to the
quantum models has been proposed as a potential solution to mitigate this
problem, leading to a new field called Geometric Quantum Machine Learning. In
this work, an equivariant architecture for variational quantum classifiers is
introduced to create a label-invariant model for image classification with
$C_4$ rotational label symmetry. The equivariant circuit is benchmarked against
two different architectures, and it is experimentally observed that the
geometric approach boosts the model’s performance. Finally, a classical
equivariant convolution operation is proposed to extend the quantum model for
the processing of larger images, employing the resources available in NISQ
devices.
[COMMENTS]
13 pages, 10 figures
[LINK]
http://arxiv.org/abs/2403.15031v2
[DATE]
2024-12-19 17:00:34+08:00
[CATEGORIES]
cs.LG
Permutation recovery of spikes in noisy high-dimensional tensor estimation
[AUTHORS]
Gérard Ben Arous, Cédric Gerbelot, Vanessa Piccolo
[ABSTRACT]
We study the dynamics of gradient flow in high dimensions for the
multi-spiked tensor problem, where the goal is to estimate $r$ unknown signal
vectors (spikes) from noisy Gaussian tensor observations. Specifically, we
analyze the maximum likelihood estimation procedure, which involves optimizing
a highly nonconvex random function. We determine the sample complexity required
for gradient flow to efficiently recover all spikes, without imposing any
assumptions on the separation of the signal-to-noise ratios (SNRs). More
precisely, our results provide the sample complexity required to guarantee
recovery of the spikes up to a permutation. Our work builds on our companion
paper [Ben Arous, Gerbelot, Piccolo 2024], which studies Langevin dynamics and
determines the sample complexity and separation conditions for the SNRs
necessary for ensuring exact recovery of the spikes (where the recovered
permutation matches the identity). During the recovery process, the
correlations between the estimators and the hidden vectors increase in a
sequential manner. The order in which these correlations become significant
depends on their initial values and the corresponding SNRs, which ultimately
determines the permutation of the recovered spikes.
[COMMENTS]
29 pages, 2 figures. arXiv admin note: substantial text overlap with
arXiv:2408.06401
[LINK]
http://arxiv.org/abs/2412.14650v1
[DATE]
2024-12-19 16:59:49+08:00
[CATEGORIES]
cs.LG
Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning
[AUTHORS]
Eric Brouwer, Jan Erik van Woerden, Gertjan Burghouts, Matias Valdenegro-Toro, Marco Zullich
[ABSTRACT]
Few-shot, fine-grained classification in computer vision poses significant
challenges due to the need to differentiate subtle class distinctions with
limited data. This paper presents a novel method that enhances the Contrastive
Language-Image Pre-Training (CLIP) model through adaptive prompt tuning, guided
by real-time visual inputs. Unlike existing techniques such as Context
Optimization (CoOp) and Visual Prompt Tuning (VPT), which are constrained by
static prompts or visual token reliance, the proposed approach leverages a
cross-attention mechanism to dynamically refine text prompts for the image at
hand. This enables an image-specific alignment of textual features with image
patches extracted from the Vision Transformer, making the model more effective
for datasets with high intra-class variance and low inter-class differences.
The method is evaluated on several datasets, including CUBirds, Oxford Flowers,
and FGVC Aircraft, showing significant performance gains over static prompt
tuning approaches. To ensure these performance gains translate into trustworthy
predictions, we integrate Monte-Carlo Dropout in our approach to improve the
reliability of the model predictions and uncertainty estimates. This
integration provides valuable insights into the model’s predictive confidence,
helping to identify when predictions can be trusted and when additional
verification is necessary. This dynamic approach offers a robust solution,
advancing the state-of-the-art for few-shot fine-grained classification.
[LINK]
http://arxiv.org/abs/2412.14640v1
[DATE]
2024-12-19 16:51:01+08:00
[CATEGORIES]
cs.LG
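The Monte-Carlo Dropout step mentioned in the abstract above (Adaptive Prompt Tuning) can be sketched on a toy model: dropout stays active at prediction time, several stochastic forward passes are drawn, and their spread serves as an uncertainty estimate. The linear model, the function names, and all parameter values below are hypothetical; the paper applies the idea to a CLIP-based classifier, which is not reproduced here.

```python
import random
import statistics


def forward_with_dropout(x, weights, drop_prob, rng):
    """One stochastic forward pass of a toy linear model with dropout active."""
    keep = 1.0 - drop_prob
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            # Inverted dropout: rescale kept units so the expectation is unchanged.
            total += xi * wi / keep
    return total


def mc_dropout_predict(x, weights, drop_prob=0.5, num_samples=200, seed=0):
    """Return the mean prediction and its standard deviation over sampled passes."""
    rng = random.Random(seed)
    samples = [forward_with_dropout(x, weights, drop_prob, rng)
               for _ in range(num_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


mean, std = mc_dropout_predict([1.0, 2.0, 3.0], [0.5, -0.25, 1.0])
```

A large standard deviation relative to the mean would flag a prediction as one that may need additional verification.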
Iterative Methods for Full-Scale Gaussian Process Approximations for Large Spatial Data
[AUTHORS]
Tim Gyger, Reinhard Furrer, Fabio Sigrist
[ABSTRACT]
Gaussian processes are flexible probabilistic regression models which are
widely used in statistics and machine learning. However, a drawback is their
limited scalability to large data sets. To alleviate this, we consider
full-scale approximations (FSAs) that combine predictive process methods and
covariance tapering, thus approximating both global and local structures. We
show how iterative methods can be used to reduce the computational costs for
calculating likelihoods, gradients, and predictive distributions with FSAs. We
introduce a novel preconditioner and show that it accelerates the conjugate
gradient method’s convergence speed and mitigates its sensitivity with respect
to the FSA parameters and the eigenvalue structure of the original covariance
matrix, and we demonstrate empirically that it outperforms a state-of-the-art
pivoted Cholesky preconditioner. Further, we present a novel, accurate, and
fast way to calculate predictive variances relying on stochastic estimations
and iterative methods. In both simulated and real-world data experiments, we
find that our proposed methodology achieves the same accuracy as Cholesky-based
computations with a substantial reduction in computational time. Finally, we
also compare different approaches for determining inducing points in predictive
process and FSA models. All methods are implemented in a free C++ software
library with high-level Python and R packages.
[LINK]
http://arxiv.org/abs/2405.14492v2
[DATE]
2024-12-19 16:46:02+08:00
[CATEGORIES]
cs.LG
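For readers unfamiliar with the preconditioned conjugate gradient method referred to in the abstract above (Iterative Methods for Full-Scale Gaussian Process Approximations), here is a minimal sketch on a small symmetric positive-definite system. It uses a simple Jacobi (diagonal) preconditioner as a stand-in; this is not the novel preconditioner proposed in the paper, and the matrix and helper names are made up for the example.

```python
def matvec(A, v):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def preconditioned_cg(A, b, apply_preconditioner, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A x for the start x = 0
    z = apply_preconditioner(r)       # preconditioned residual
    p = list(z)                       # first search direction
    rz = dot(r, z)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)       # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = apply_preconditioner(r)
        rz_new = dot(r, z)
        beta = rz_new / rz            # keeps directions A-conjugate
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]


def jacobi(r):
    """Jacobi preconditioner: divide by the diagonal of A (illustrative choice)."""
    return [ri / A[i][i] for i, ri in enumerate(r)]


x, iterations = preconditioned_cg(A, b, jacobi)
```

In a Gaussian process setting, `A` would be the (approximated) covariance matrix, and only matrix-vector products with it are needed, which is what makes iterative solvers attractive for large data sets.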
Robust PCA Based on Adaptive Weighted Least Squares and Low-Rank Matrix Factorization
[AUTHORS]
Kexin Li, You-wei Wen, Xu Xiao, Mingchao Zhao
[ABSTRACT]
Robust Principal Component Analysis (RPCA) is a fundamental technique for
decomposing data into low-rank and sparse components, which plays a critical
role for applications such as image processing and anomaly detection.
Traditional RPCA methods commonly use $\ell_1$ norm regularization to enforce
sparsity, but this approach can introduce bias and result in suboptimal
estimates, particularly in the presence of significant noise or outliers.
Non-convex regularization methods have been proposed to mitigate these
challenges, but they tend to be complex to optimize and sensitive to initial
conditions, leading to potential instability in solutions. To overcome these
challenges, in this paper, we propose a novel RPCA model that integrates
adaptive weighted least squares (AWLS) and low-rank matrix factorization
(LRMF). The model employs a self-attention-inspired mechanism in its weight
update process, allowing the weight matrix to dynamically adjust and emphasize
significant components during each iteration. By employing a weighted F-norm
for the sparse component, our method effectively reduces bias while simplifying
the computational process compared to traditional $\ell_1$-norm-based methods.
We use an alternating minimization algorithm, where each subproblem has an
explicit solution, thereby improving computational efficiency. Despite its
simplicity, numerical experiments demonstrate that our method outperforms
existing non-convex regularization approaches, offering superior performance
and stability, as well as enhanced accuracy and robustness in practical
applications.
[LINK]
http://arxiv.org/abs/2412.14629v1
[DATE]
2024-12-19 16:31:42+08:00
[CATEGORIES]
cs.LG
Qua$^2$SeDiMo: Quantifiable Quantization Sensitivity of Diffusion Models
[AUTHORS]
Keith G. Mills, Mohammad Salameh, Ruichen Chen, Negar Hassanpour, Wei Lu, Di Niu
[ABSTRACT]
Diffusion Models (DM) have democratized AI image generation through an
iterative denoising process. Quantization is a major technique to alleviate the
inference cost and reduce the size of DM denoiser networks. However, as
denoisers evolve from variants of convolutional U-Nets toward newer Transformer
architectures, it is of growing importance to understand the quantization
sensitivity of different weight layers, operations and architecture types to
performance. In this work, we address this challenge with Qua$^2$SeDiMo, a
mixed-precision Post-Training Quantization framework that generates explainable
insights on the cost-effectiveness of various model weight quantization methods
for different denoiser operation types and block structures. We leverage these
insights to make high-quality mixed-precision quantization decisions for a
myriad of diffusion models ranging from foundational U-Nets to state-of-the-art
Transformers. As a result, Qua$^2$SeDiMo can construct 3.4-bit, 3.9-bit,
3.65-bit and 3.7-bit weight quantization on PixArt-${\alpha}$,
PixArt-${\Sigma}$, Hunyuan-DiT and SDXL, respectively. We further pair our
weight-quantization configurations with 6-bit activation quantization and
outperform existing approaches in terms of quantitative metrics and generative
image quality.
[COMMENTS]
AAAI 2025; version includes supplementary material; 22 Pages, 18
Figures, 8 Tables
[LINK]
http://arxiv.org/abs/2412.14628v1
[DATE]
2024-12-19 16:30:54+08:00
[CATEGORIES]
cs.LG
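Fractional figures such as "3.4-bit" in the abstract above (Qua$^2$SeDiMo) are what one would expect from mixed precision: different layers get different integer bit-widths, and the reported number is an average. The sketch below computes a parameter-weighted average; the weighting by parameter count, the layer names, and the sizes are assumptions for illustration and are not taken from the paper.

```python
def average_bit_width(layers):
    """Parameter-weighted mean bit-width.

    layers: iterable of (name, num_weights, bits) tuples (hypothetical format).
    """
    total_weights = sum(num_weights for _, num_weights, _ in layers)
    total_bits = sum(num_weights * bits for _, num_weights, bits in layers)
    return total_bits / total_weights


# Made-up layer sizes and precisions, purely for illustration.
example_layers = [
    ("attention", 1000, 4),
    ("feed_forward", 3000, 3),
    ("conv", 1000, 5),
]
avg_bits = average_bit_width(example_layers)
```

Assigning fewer bits to large, insensitive layers pulls the average down fastest, which is why knowing per-layer quantization sensitivity matters.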
Gauss-Newton Dynamics for Neural Networks: A Riemannian Optimization Perspective
[AUTHORS]
Semih Cayci
[ABSTRACT]
We analyze the convergence of Gauss-Newton dynamics for training neural
networks with smooth activation functions. In the underparameterized regime,
the Gauss-Newton gradient flow induces a Riemannian gradient flow on a
low-dimensional, smooth, embedded submanifold of the Euclidean output space.
Using tools from Riemannian optimization, we prove \emph{last-iterate}
convergence of the Riemannian gradient flow to the optimal in-class predictor
at an \emph{exponential rate} that is independent of the conditioning of the
Gram matrix, \emph{without} requiring explicit regularization. We further
characterize the critical impacts of the neural network scaling factor and the
initialization on the convergence behavior. In the overparameterized regime, we
show that the Levenberg-Marquardt dynamics with an appropriately chosen damping
factor yields robustness to ill-conditioned kernels, analogous to the
underparameterized regime. These findings demonstrate the potential of
Gauss-Newton methods for efficiently optimizing neural networks, particularly
in ill-conditioned problems where kernel and Gram matrices have small singular
values.
[LINK]
http://arxiv.org/abs/2412.14031v2
[DATE]
2024-12-19 16:21:15+08:00
[CATEGORIES]
cs.LG
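The damped Gauss-Newton (Levenberg-Marquardt) update discussed in the abstract above (Gauss-Newton Dynamics for Neural Networks) has the standard form delta = (J^T J + lambda I)^{-1} J^T r, where J is the Jacobian of the model outputs and r the residual vector. The sketch below applies it to a toy two-parameter linear model y = a*x + b, chosen only so the 2x2 system can be solved by hand; the model, data, and function name are assumptions, and the paper's analysis concerns neural networks in continuous time.

```python
def lm_step(params, xs, ys, damping):
    """One Levenberg-Marquardt step for the toy model y = a * x + b."""
    a, b = params
    residuals = [a * x + b - y for x, y in zip(xs, ys)]
    # Each Jacobian row is [x, 1], so J^T J + damping * I is a 2x2 matrix.
    h00 = sum(x * x for x in xs) + damping
    h01 = sum(xs)
    h11 = len(xs) + damping
    # Gradient direction J^T r.
    g0 = sum(x * r for x, r in zip(xs, residuals))
    g1 = sum(residuals)
    # Solve the 2x2 linear system with the closed-form inverse.
    det = h00 * h11 - h01 * h01
    delta_a = (h11 * g0 - h01 * g1) / det
    delta_b = (h00 * g1 - h01 * g0) / det
    return (a - delta_a, b - delta_b)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]            # generated from y = 2x + 1

# With zero damping this is plain Gauss-Newton.
undamped = lm_step((0.0, 0.0), xs, ys, damping=0.0)

# With damping the steps are shorter but still converge.
params = (0.0, 0.0)
for _ in range(100):
    params = lm_step(params, xs, ys, damping=1.0)
```

The damping term keeps the linear system well conditioned when J^T J has small singular values, at the price of smaller steps.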
Continuous latent representations for modeling precipitation with deep learning
[AUTHORS]
Gokul Radhakrishnan, Rahul Sundar, Nishant Parashar, Antoine Blanchard, Daiwei Wang, Boyko Dodov
[ABSTRACT]
The sparse and spatio-temporally discontinuous nature of precipitation data
presents significant challenges for simulation and statistical processing for
bias correction and downscaling. These include incorrect representation of
intermittency and extreme values (critical for hydrology applications), Gibbs
phenomenon upon regridding, and lack of fine-scale details. To address these
challenges, a common approach is to transform the precipitation variable
nonlinearly into one that is more malleable. In this work, we explore how deep
learning can be used to generate a smooth, spatio-temporally continuous
variable as a proxy for simulation of precipitation data. We develop a normally
distributed field called pseudo-precipitation (PP) as an alternative for
simulating precipitation. The practical applicability of this variable is
investigated by applying it for downscaling precipitation from $1^\circ$
($\sim$100 km) to $0.25^\circ$ ($\sim$25 km).
[LINK]
http://arxiv.org/abs/2412.14620v1
[DATE]
2024-12-19 16:13:20+08:00
[CATEGORIES]
cs.LG
Pitfalls of topology-aware image segmentation
[AUTHORS]
Alexander H. Berger, Laurin Lux, Alexander Weers, Martin Menten, Daniel Rueckert, Johannes C. Paetzold
[ABSTRACT]
Topological correctness, i.e., the preservation of structural integrity and
specific characteristics of shape, is a fundamental requirement for medical
imaging tasks, such as neuron or vessel segmentation. Despite the recent surge
in topology-aware methods addressing this challenge, their real-world
applicability is hindered by flawed benchmarking practices. In this paper, we
identify critical pitfalls in model evaluation that include inadequate
connectivity choices, overlooked topological artifacts in ground truth
annotations, and inappropriate use of evaluation metrics. Through detailed
empirical analysis, we uncover these issues’ profound impact on the evaluation
and ranking of segmentation methods. Drawing from our findings, we propose a
set of actionable recommendations to establish fair and robust evaluation
standards for topology-aware medical image segmentation methods.
[COMMENTS]
Code is available at
https://github.com/AlexanderHBerger/topo-pitfalls
[LINK]
http://arxiv.org/abs/2412.14619v1
[DATE]
2024-12-19 16:11:42+08:00
[CATEGORIES]
cs.LG
Towards Scalable and Deep Graph Neural Networks via Noise Masking
[AUTHORS]
Yuxuan Liang, Wentao Zhang, Zeang Sheng, Ling Yang, Quanqing Xu, Jiawei Jiang, Yunhai Tong, Bin Cui
[ABSTRACT]
In recent years, Graph Neural Networks (GNNs) have achieved remarkable
success in many graph mining tasks. However, scaling them to large graphs is
challenging due to the high computational and storage costs of repeated feature
propagation and non-linear transformation during training. One commonly
employed approach to address this challenge is model-simplification, which
executes the Propagation (P) only once during pre-processing, Combines (C) the
resulting receptive fields in different ways, and then feeds them into a simple
model for better performance. Despite their high predictive performance and scalability,
these methods still face two limitations. First, existing approaches mainly
focus on exploring different C methods from the model perspective, neglecting
the crucial problem of performance degradation with increasing P depth from the
data-centric perspective, known as the over-smoothing problem. Second,
pre-processing overhead takes up most of the end-to-end processing time,
especially for large-scale graphs. To address these limitations, we present
random walk with noise masking (RMask), a plug-and-play module compatible with
the existing model-simplification works. This module enables the exploration of
deeper GNNs while preserving their scalability. Unlike the previous
model-simplification works, we focus on continuous P and find that the noise
present inside each P is the cause of the over-smoothing issue; we use an
efficient masking mechanism to eliminate it. Experimental results on six
real-world datasets demonstrate that model-simplification works equipped with
RMask yield superior performance compared to their original versions and
achieve a good trade-off between accuracy and efficiency.
[LINK]
http://arxiv.org/abs/2412.14602v1
[DATE]
2024-12-19 15:48:14+08:00
[CATEGORIES]
cs.LG
Fast inverse lithography based on a model-driven block stacking convolutional neural network
[AUTHORS]
Ruixiang Chen, Yang Zhao, Haoqin Li, Rui Chen
[ABSTRACT]
In the realm of lithography, Optical Proximity Correction (OPC) is a crucial
resolution enhancement technique that optimizes the transmission function of
photomasks on a pixel basis to effectively counter Optical Proximity Effects
(OPE). However, conventional pixel-based OPC methods often generate patterns
that pose manufacturing challenges, thereby increasing cost in
practical scenarios. This paper presents a novel inverse lithographic approach
to OPC, employing a model-driven, block stacking deep learning framework that
expedites the generation of masks conducive to manufacturing. This method is
founded on vector lithography modelling and streamlines the training process by
eliminating the requirement for extensive labeled datasets. Furthermore, the
diversity of mask patterns is enhanced by employing a wave function collapse
algorithm, which facilitates the random generation of a multitude of target
patterns, thereby significantly expanding the range of mask paradigms.
Numerical experiments have substantiated the efficacy of the proposed
end-to-end approach, highlighting its superior capability to manage mask
complexity within the context of advanced OPC lithography. This advancement is
anticipated to enhance the feasibility and economic viability of OPC technology
within actual manufacturing environments.
[COMMENTS]
21 pages, 7 figures
[LINK]
http://arxiv.org/abs/2412.14599v1
[DATE]
2024-12-19 15:42:07+08:00
[CATEGORIES]
cs.LG
DualDynamics: Synergizing Implicit and Explicit Methods for Robust Irregular Time Series Analysis
[AUTHORS]
YongKyung Oh, Dongyoung Lim, Sungil Kim
[ABSTRACT]
Real-world time series analysis faces significant challenges when dealing
with irregular and incomplete data. While Neural Differential Equation (NDE)
based methods have shown promise, they struggle with limited expressiveness,
scalability issues, and stability concerns. Conversely, Neural Flows offer
stability but falter with irregular data. We introduce ‘DualDynamics’, a novel
framework that synergistically combines NDE-based and Neural Flow-based
methods. This approach enhances expressive power while balancing computational
demands, addressing critical limitations of existing techniques. We demonstrate
DualDynamics’ effectiveness across diverse tasks: classification of robustness
to dataset shift, irregularly-sampled series analysis, interpolation of missing
data, and forecasting with partial observations. Our results show consistent
outperformance over state-of-the-art methods, indicating DualDynamics’
potential to advance irregular time series analysis significantly.
[COMMENTS]
Published at the 39th Annual AAAI Conference on Artificial
Intelligence (AAAI 2025)
[LINK]
http://arxiv.org/abs/2401.04979v3
[DATE]
2024-12-19 15:33:48+08:00
[CATEGORIES]
cs.LG
Alt-MoE: Multimodal Alignment via Alternating Optimization of Multi-directional MoE with Unimodal Models
[AUTHORS]
Hongyang Lei, Xiaolong Cheng, Dan Wang, Kun Fan, Qi Qin, Huazhen Huang, Yetao Wu, Qingqing Gu, Zhonglin Jiang, Yong Chen, Luo Ji
[ABSTRACT]
Recent Large Multi-Modal Models (LMMs) have made significant advancements in
multi-modal alignment by employing lightweight connection modules to facilitate
the representation and fusion of knowledge from existing pre-trained uni-modal
models. However, these methods still rely on modality-specific and
direction-specific connectors, leading to compartmentalized knowledge
representations and reduced computational efficiency, which limits the model’s
ability to form unified multi-modal representations. To address these issues,
we introduce a novel training framework, Alt-MoE, which employs the Mixture of
Experts (MoE) as a unified multi-directional connector across modalities, and
employs a multi-step sequential alternating unidirectional alignment strategy,
which converges to bidirectional alignment over iterations. Extensive
empirical studies reveal the following key points: 1) Alt-MoE achieves
competitive results by integrating diverse knowledge representations from
uni-modal models. This approach seamlessly fuses the specialized expertise of
existing high-performance uni-modal models, effectively synthesizing their
domain-specific knowledge into a cohesive multi-modal representation. 2)
Alt-MoE efficiently scales to new tasks and modalities without altering its
model architecture or training strategy. Furthermore, Alt-MoE operates in
latent space, supporting vector pre-storage and real-time retrieval via
lightweight multi-directional MoE, thereby facilitating massive data
processing. Our methodology has been validated on several well-performing
uni-modal models (LLAMA3, Qwen2, and DINOv2), achieving competitive results on
a wide range of downstream tasks and datasets.
[LINK]
http://arxiv.org/abs/2409.05929v2
[DATE]
2024-12-19 15:31:18+08:00
[CATEGORIES]
cs.LG
Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties
[AUTHORS]
Wenqiao Li, Bozhong Zheng, Xiaohao Xu, Jinye Gan, Fading Lu, Xiang Li, Na Ni, Zheng Tian, Xiaonan Huang, Shenghua Gao, Yingna Wu
[ABSTRACT]
Object anomaly detection is essential for industrial quality inspection, yet
traditional single-sensor methods face critical limitations. They fail to
capture the wide range of anomaly types, as single sensors are often
constrained to either external appearance, geometric structure, or internal
properties. To overcome these challenges, we introduce MulSen-AD, the first
high-resolution, multi-sensor anomaly detection dataset tailored for industrial
applications. MulSen-AD unifies data from RGB cameras, laser scanners, and
lock-in infrared thermography, effectively capturing external appearance,
geometric deformations, and internal defects. The dataset spans 15 industrial
products with diverse, real-world anomalies. We also present MulSen-AD Bench, a
benchmark designed to evaluate multi-sensor methods, and propose
MulSen-TripleAD, a decision-level fusion algorithm that integrates these three
modalities for robust, unsupervised object anomaly detection. Our experiments
demonstrate that multi-sensor fusion substantially outperforms single-sensor
approaches, achieving 96.1% AUROC in object-level detection accuracy. These
results highlight the importance of integrating multi-sensor data for
comprehensive industrial anomaly detection.
[LINK]
http://arxiv.org/abs/2412.14592v1
[DATE]
2024-12-19 15:23:17+08:00
[CATEGORIES]
cs.LG
Quantum Curriculum Learning
[AUTHORS]
Quoc Hoan Tran, Yasuhiro Endo, Hirotaka Oshima
[ABSTRACT]
Quantum machine learning (QML) requires significant quantum resources to
address practical real-world problems. When the underlying quantum information
exhibits hierarchical structures in the data, limitations persist in training
complexity and generalization. Research should prioritize both the efficient
design of quantum architectures and the development of learning strategies to
optimize resource usage. We propose a framework called quantum curriculum
learning (Q-CurL) for quantum data, where the curriculum introduces simpler
tasks or data to the learning model before progressing to more challenging
ones. Q-CurL exhibits robustness to noise and data limitations, which is
particularly relevant for current and near-term noisy intermediate-scale
quantum devices. We achieve this through a curriculum design based on quantum
data density ratios and a dynamic learning schedule that prioritizes the most
informative quantum data. Empirical evidence shows that Q-CurL significantly
enhances training convergence and generalization for unitary learning and
improves the robustness of quantum phase recognition tasks. Q-CurL is effective
with broad physical learning applications in condensed matter physics and
quantum chemistry.
[COMMENTS]
main 6 pages, supplementary materials 11 pages (update the
supplementary materials with more explanation on data-based Q-CurL)
[LINK]
http://arxiv.org/abs/2407.02419v3
[DATE]
2024-12-19 15:07:51+08:00
[CATEGORIES]
cs.LG
Learning Discretized Neural Networks under Ricci Flow
[AUTHORS]
Jun Chen, Hanwen Chen, Mengmeng Wang, Guang Dai, Ivor W. Tsang, Yong Liu
[ABSTRACT]
In this paper, we study Discretized Neural Networks (DNNs) composed of
low-precision weights and activations, which suffer from either infinite or
zero gradients due to the non-differentiable discrete function during training.
Most training-based DNNs in such scenarios employ the standard Straight-Through
Estimator (STE) to approximate the gradient w.r.t. discrete values. However,
the use of STE introduces the problem of gradient mismatch, arising from
perturbations in the approximated gradient. To address this problem, this paper
reveals that this mismatch can be interpreted as a metric perturbation in a
Riemannian manifold, viewed through the lens of duality theory. Building on
information geometry, we construct the Linearly Nearly Euclidean (LNE) manifold
for DNNs, providing a background for addressing perturbations. By introducing a
partial differential equation on metrics, i.e., the Ricci flow, we establish
the dynamical stability and convergence of the LNE metric with the $L^2$-norm
perturbation. In contrast to previous perturbation theories with convergence
rates in fractional powers, the metric perturbation under the Ricci flow
exhibits exponential decay in the LNE manifold. Experimental results across
various datasets demonstrate that our method achieves superior and more stable
performance for DNNs compared to other representative training-based methods.
[LINK]
http://arxiv.org/abs/2302.03390v5
[DATE]
2024-12-19 15:05:03+08:00
[CATEGORIES]
cs.LG
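The Straight-Through Estimator (STE) named in the abstract above (Learning Discretized Neural Networks under Ricci Flow) can be sketched with a single weight: the forward pass uses the quantized weight, while the backward pass pretends the rounding function has derivative 1 and applies the gradient to the underlying full-precision weight. The one-weight model, the quantization step, and the function names are assumptions for illustration; the paper's Ricci-flow method is not shown.

```python
def quantize(w, step=0.25):
    """Round w to the nearest multiple of step (the non-differentiable part)."""
    return round(w / step) * step


def ste_train(x, y, w=0.0, lr=0.1, steps=50, step_size=0.25):
    """Fit the quantized weight in (q(w) * x - y)^2 using the STE."""
    for _ in range(steps):
        w_q = quantize(w, step_size)       # forward pass uses the discrete weight
        error = w_q * x - y
        grad_wq = 2.0 * error * x          # exact gradient w.r.t. the quantized weight
        grad_w = grad_wq                   # STE: treat d q(w) / d w as 1
        w -= lr * grad_w                   # update the latent full-precision weight
    return w, quantize(w, step_size)


latent_w, quantized_w = ste_train(x=1.0, y=0.75)
```

Because the true derivative of `quantize` is zero almost everywhere, the gradient applied to the latent weight is only an approximation; this is the gradient mismatch that the abstract refers to.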
Selective Uncertainty Propagation in Offline RL
[AUTHORS]
Sanath Kumar Krishnamurthy, Tanmay Gangwani, Sumeet Katariya, Branislav Kveton, Shrey Modi, Anshuka Rangi
[ABSTRACT]
We consider the finite-horizon offline reinforcement learning (RL) setting,
and are motivated by the challenge of learning the policy at any step h in
dynamic programming (DP) algorithms. To learn this, it is sufficient to
evaluate the treatment effect of deviating from the behavioral policy at step h
after having optimized the policy for all future steps. Since the policy at any
step can affect next-state distributions, the related distributional shift
challenges can make this problem statistically far harder than estimating
such treatment effects in the stochastic contextual bandit setting. However,
the hardness of many real-world RL instances lies between the two regimes. We
develop a flexible and general method called selective uncertainty propagation
for confidence interval construction that adapts to the hardness of the
associated distribution shift challenges. We show benefits of our approach on
toy environments and demonstrate the benefits of these techniques for offline
policy learning.
[LINK]
http://arxiv.org/abs/2302.00284v3
[DATE]
2024-12-19 14:52:07+08:00
[CATEGORIES]
cs.LG
CAP: A General Algorithm for Online Selective Conformal Prediction with FCR Control
[AUTHORS]
Yajie Bao, Yuyang Huo, Haojie Ren, Changliang Zou
[ABSTRACT]
We study the problem of post-selection predictive inference in an online
fashion. To avoid devoting resources to unimportant units, a preliminary
selection of the current individual before reporting its prediction interval is
common and meaningful in online predictive tasks. Since the online selection
causes a temporal multiplicity in the selected prediction intervals, it is
important to control the real-time false coverage-statement rate (FCR) which
measures the overall miscoverage level. We develop a general framework named
CAP (Calibration after Adaptive Pick) that performs an adaptive pick rule on
historical data to construct a calibration set if the current individual is
selected and then outputs a conformal prediction interval for the unobserved
label. We provide tractable procedures for constructing the calibration set for
popular online selection rules. We prove that CAP can achieve an exact
selection-conditional coverage guarantee in the finite-sample and
distribution-free regimes. To account for the distribution shift in online
data, we also embed CAP into some recent dynamic conformal prediction
algorithms and show that the proposed method can deliver long-run FCR control.
Numerical results on both synthetic and real data corroborate that CAP can
effectively control FCR around the target level and yield narrower
prediction intervals than existing baselines across various settings.
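
As a rough illustration of the final step described above (turning a calibration set into a conformal prediction interval), the sketch below uses plain split conformal prediction with absolute residuals. It does not implement CAP's adaptive pick rule: the calibration residuals are assumed to be given, and the function name and arguments are hypothetical.

```python
import math

def conformal_interval(cal_residuals, prediction, alpha=0.1):
    """Split conformal interval around `prediction` (illustrative, not CAP itself).

    cal_residuals: absolute errors |y - y_hat| on an already-chosen calibration set.
    alpha: target miscoverage level.
    """
    n = len(cal_residuals)
    # Rank of the calibration residual used as the interval half-width.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial interval is valid.
        return (-math.inf, math.inf)
    half_width = sorted(cal_residuals)[k - 1]
    return (prediction - half_width, prediction + half_width)
```

With four calibration residuals and alpha = 0.5, the third-smallest residual sets the half-width; with alpha = 0.1 the same calibration set is too small and the interval is unbounded.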
[LINK]
http://arxiv.org/abs/2403.07728v3
[DATE]
2024-12-19 14:51:17+08:00
[CATEGORIES]
cs.LG
Accelerated Patient-Specific Calibration via Differentiable Hemodynamics Simulations
[AUTHORS]
Diego Renner, Georgios Kissas
[ABSTRACT]
One of the goals of personalized medicine is to tailor diagnostics to
individual patients. Diagnostics are performed in practice by measuring
quantities, called biomarkers, that indicate the existence and progress of a
disease. In common cardiovascular diseases, such as hypertension, biomarkers
that are closely related to the clinical representation of a patient can be
predicted using computational models. Personalizing computational models
translates to considering patient-specific flow conditions, for example, the
compliance of blood vessels that cannot be a priori known and quantities such
as the patient geometry that can be measured using imaging. Therefore, a
patient is identified by a set of measurable and nonmeasurable parameters
needed for a well-defined computational model; otherwise, the computational model is
not personalized, meaning it is prone to large prediction errors. Therefore, to
personalize a computational model, sufficient information needs to be extracted
from the data. The current methods by which this is done are either
inefficient, due to relying on slow-converging optimization methods, or hard to
interpret, due to using black-box deep-learning algorithms. We propose a
personalized diagnostic procedure based on a differentiable 0D-1D Navier-Stokes
reduced order model solver and fast parameter inference methods that take
advantage of gradients through the solver. By providing a faster method for
performing parameter inference and sensitivity analysis through
differentiability while maintaining the interpretability of well-understood
mathematical models and numerical methods, the best of both worlds is combined.
The performance of the proposed solver is validated against a well-established
process on different geometries, and different parameter inference processes
are successfully performed.
[LINK]
http://arxiv.org/abs/2412.14572v1
[DATE]
2024-12-19 14:42:57+08:00
[CATEGORIES]
cs.LG
Global Spatio-Temporal Fusion-based Traffic Prediction Algorithm with Anomaly Aware
[AUTHORS]
Chaoqun Liu, Xuanpeng Li, Chen Gong, Guangyu Li
[ABSTRACT]
Traffic prediction is an indispensable component of urban planning and
traffic management. Achieving accurate traffic prediction hinges on the ability
to capture the potential spatio-temporal relationships among road sensors.
However, the majority of existing works focus on local short-term
spatio-temporal correlations, failing to fully consider the interactions of
different sensors in the long-term state. In addition, these works do not
analyze the influence of anomalous factors, or have insufficient ability to
extract personalized features of anomalous factors, which makes them
ineffective at capturing the spatio-temporal influence of these factors on traffic prediction.
To address the aforementioned issues, we propose a global spatio-temporal
fusion-based traffic prediction algorithm that incorporates anomaly awareness.
Initially, based on the designed anomaly detection network, we construct an
efficient anomalous factors impacting module (AFIM), to evaluate the
spatio-temporal impact of unexpected external events on traffic prediction.
Furthermore, we propose a multi-scale spatio-temporal feature fusion module
(MTSFFL) based on the transformer architecture, to capture both
long- and short-term correlations among different sensors in a wide-area traffic
environment for accurate prediction of traffic flow. Finally, experiments are
implemented based on real-scenario public transportation datasets (PEMS04 and
PEMS08) to demonstrate that our approach can achieve state-of-the-art
performance.
[LINK]
http://arxiv.org/abs/2412.14569v1
[DATE]
2024-12-19 14:40:21+08:00
[CATEGORIES]
cs.LG
Learning Infinitesimal Generators of Continuous Symmetries from Data
[AUTHORS]
Gyeonghoon Ko, Hyunsu Kim, Juho Lee
[ABSTRACT]
Exploiting symmetry inherent in data can significantly improve the sample
efficiency of a learning procedure and the generalization of learned models.
When data clearly reveals underlying symmetry, leveraging this symmetry can
naturally inform the design of model architectures or learning strategies. Yet,
in numerous real-world scenarios, identifying the specific symmetry within a
given data distribution often proves ambiguous. To tackle this, some existing
works learn symmetry in a data-driven manner, parameterizing and learning
expected symmetry through data. However, these methods often rely on explicit
knowledge, such as pre-defined Lie groups, which are typically restricted to
linear or affine transformations. In this paper, we propose a novel symmetry
learning algorithm based on transformations defined with one-parameter groups,
continuously parameterized transformations flowing along the directions of
vector fields called infinitesimal generators. Our method is built upon minimal
inductive biases, encompassing not only commonly utilized symmetries rooted in
Lie groups but also extending to symmetries derived from nonlinear generators.
To learn these symmetries, we introduce a notion of a validity score that
examines whether the transformed data is still valid for the given task. The
validity score is designed to be fully differentiable and easily computable,
enabling effective searches for transformations that achieve symmetries innate
to the data. We apply our method mainly in two domains: image data and partial
differential equations, and demonstrate its advantages. Our code is available
at https://github.com/kogyeonghoon/learning-symmetry-from-scratch.git.
[COMMENTS]
NeurIPS 2024
[LINK]
http://arxiv.org/abs/2410.21853v2
[DATE]
2024-12-19 14:39:24+08:00
[CATEGORIES]
cs.LG
AIArena: A Blockchain-Based Decentralized AI Training Platform
[AUTHORS]
Zhipeng Wang, Rui Sun, Elizabeth Lui, Tuo Zhou, Yizhe Wen, Jiahao Sun
[ABSTRACT]
The rapid advancement of AI has underscored critical challenges in its
development and implementation, largely due to centralized control by a few
major corporations. This concentration of power intensifies biases within AI
models, resulting from inadequate governance and oversight mechanisms.
Additionally, it limits public involvement and heightens concerns about the
integrity of model generation. Such monopolistic control over data and AI
outputs threatens both innovation and fair data usage, as users inadvertently
contribute data that primarily benefits these corporations. In this work, we
propose AIArena, a blockchain-based decentralized AI training platform designed
to democratize AI development and alignment through on-chain incentive
mechanisms. AIArena fosters an open and collaborative environment where
participants can contribute models and computing resources. Its on-chain
consensus mechanism ensures fair rewards for participants based on their
contributions. We instantiate and implement AIArena on the public Base
blockchain Sepolia testnet, and the evaluation results demonstrate the
feasibility of AIArena in real-world applications.
[LINK]
http://arxiv.org/abs/2412.14566v1
[DATE]
2024-12-19 14:35:54+08:00
[CATEGORIES]
cs.LG
How Does the Smoothness Approximation Method Facilitate Generalization for Federated Adversarial Learning?
[AUTHORS]
Wenjun Ding, Ying An, Lixing Chen, Shichao Kan, Fan Wu, Zhe Qu
[ABSTRACT]
Federated Adversarial Learning (FAL) is a robust framework for resisting
adversarial attacks on federated learning. Although some FAL studies have
developed efficient algorithms, they primarily focus on convergence performance
and overlook generalization. Generalization is crucial for evaluating algorithm
performance on unseen data. However, generalization analysis is more
challenging due to non-smooth adversarial loss functions. A common approach to
addressing this issue is to leverage smoothness approximation. In this paper,
we develop algorithm stability measures to evaluate the generalization
performance of two popular FAL algorithms: Vanilla FAL (VFAL) and
Slack FAL (SFAL), using three different smooth approximation methods: (1)
Surrogate Smoothness Approximation (SSA), (2) Randomized
Smoothness Approximation (RSA), and (3) Over-Parameterized Smoothness
Approximation (OPSA). Based on our in-depth analysis, we answer the question
of how to properly set the smoothness approximation method to mitigate
generalization error in FAL. Moreover, we identify RSA as the most effective
method for reducing generalization error. In highly data-heterogeneous
scenarios, we also recommend employing SFAL to mitigate the deterioration of
generalization performance caused by heterogeneity. Based on our theoretical
results, we provide insights to help develop more efficient FAL algorithms,
such as designing new metrics and dynamic aggregation rules to mitigate
heterogeneity.
[LINK]
http://arxiv.org/abs/2412.08282v2
[DATE]
2024-12-19 14:35:21+08:00
[CATEGORIES]
cs.LG
Continual Learning: Forget-free Winning Subnetworks for Video Representations
[AUTHORS]
Haeyong Kang, Jaehong Yoon, Sung Ju Hwang, Chang D. Yoo
[ABSTRACT]
Inspired by the Lottery Ticket Hypothesis (LTH), which highlights the
existence of efficient subnetworks within larger, dense networks, a
high-performing Winning Subnetwork (WSN) in terms of task performance under
appropriate sparsity conditions is considered for various continual learning
tasks. It leverages pre-existing weights from dense networks to achieve
efficient learning in Task Incremental Learning (TIL) and Task-agnostic
Incremental Learning (TaIL) scenarios. In Few-Shot Class Incremental Learning
(FSCIL), a variation of WSN referred to as the Soft subnetwork (SoftNet) is
designed to prevent overfitting when the data samples are scarce. Furthermore,
the sparse reuse of WSN weights is considered for Video Incremental Learning
(VIL). The use of Fourier Subneural Operator (FSO) within WSN is considered. It
enables compact encoding of videos and identifies reusable subnetworks across
varying bandwidths. We have integrated FSO into different architectural
frameworks for continual learning, including VIL, TIL, and FSCIL. Our
comprehensive experiments demonstrate FSO’s effectiveness, significantly
improving task performance at various convolutional representational levels.
Specifically, FSO enhances higher-layer performance in TIL and FSCIL and
lower-layer performance in VIL.
[COMMENTS]
IEEE Transactions on Pattern Analysis and Machine Intelligence
(T-PAMI)
[LINK]
http://arxiv.org/abs/2312.11973v6
[DATE]
2024-12-19 14:29:38+08:00
[CATEGORIES]
cs.LG
PALM: Pushing Adaptive Learning Rate Mechanisms for Continual Test-Time Adaptation
[AUTHORS]
Sarthak Kumar Maharana, Baoming Zhang, Yunhui Guo
[ABSTRACT]
Real-world vision models in dynamic environments face rapid shifts in domain
distributions, leading to decreased recognition performance. Using unlabeled
test data, continuous test-time adaptation (CTTA) directly adjusts a
pre-trained source discriminative model to these changing domains. A highly
effective CTTA method involves applying layer-wise adaptive learning rates for
selectively adapting pre-trained layers. However, it suffers from the poor
estimation of domain shift and the inaccuracies arising from the pseudo-labels.
This work aims to overcome these limitations by identifying layers for
adaptation via quantifying model prediction uncertainty without relying on
pseudo-labels. We utilize the magnitude of gradients as a metric, calculated by
backpropagating the KL divergence between the softmax output and a uniform
distribution, to select layers for further adaptation. Subsequently, for the
parameters exclusively belonging to these selected layers, with the remaining
ones frozen, we evaluate their sensitivity to approximate the domain shift and
adjust their learning rates accordingly. We conduct extensive image
classification experiments on CIFAR-10C, CIFAR-100C, and ImageNet-C,
demonstrating the superior efficacy of our method compared to prior approaches.
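
The uncertainty signal mentioned above, the KL divergence between the softmax output and a uniform distribution, can be written out directly. The sketch below is only a worked formula: it assumes the direction KL(softmax || uniform), which the abstract does not specify, and it stops at the scalar value rather than backpropagating it to obtain per-layer gradient magnitudes. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_to_uniform(logits):
    """KL(p || u) for p = softmax(logits) and u uniform over the K classes.

    Equals log(K) minus the entropy of p: 0 for a uniform prediction,
    approaching log(K) for a one-hot prediction.
    """
    probs = softmax(logits)
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)
```

A maximally uncertain prediction gives a value of zero, while a confident prediction over three classes gives a value close to log(3).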
[COMMENTS]
AAAI 2025
[LINK]
http://arxiv.org/abs/2403.10650v3
[DATE]
2024-12-19 14:25:45+08:00
[CATEGORIES]
cs.LG
ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model
[AUTHORS]
Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, Ruimao Zhang
[ABSTRACT]
The scaling law has been validated in various domains, such as natural
language processing (NLP) and massive computer vision tasks; however, its
application to motion generation remains largely unexplored. In this paper, we
introduce a scalable motion generation framework that includes the motion
tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. Through
comprehensive experiments, we observe the scaling behavior of this system. For
the first time, we confirm the existence of scaling laws within the context of
motion generation. Specifically, our results demonstrate that the normalized
test loss of our prefix autoregressive models adheres to a logarithmic law in
relation to compute budgets. Furthermore, we also confirm the power law between
Non-Vocabulary Parameters, Vocabulary Parameters, and Data Tokens with respect
to compute budgets respectively. Leveraging the scaling law, we predict the
optimal transformer size, vocabulary size, and data requirements for a compute
budget of $1e18$. The test loss of the system, when trained with the optimal
model size, vocabulary size, and required data, aligns precisely with the
predicted test loss, thereby validating the scaling law.
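
A power-law relationship of the kind described above, y = a * C^b for compute budget C, is commonly fitted by ordinary least squares in log-log space. The sketch below shows that generic procedure; it is not the authors' fitting code, and the function name and any data passed to it are hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx
    # Intercept in log space gives log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b
```

Once a and b are estimated from small-budget runs, the fitted curve can be evaluated at a larger budget to extrapolate, which is the general idea behind predicting an optimal configuration for a target compute budget.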
[LINK]
http://arxiv.org/abs/2412.14559v1
[DATE]
2024-12-19 14:22:19+08:00
[CATEGORIES]
cs.LG
Generalization Analysis for Deep Contrastive Representation Learning
[AUTHORS]
Nong Minh Hieu, Antoine Ledent, Yunwen Lei, Cheng Yeaw Ku
[ABSTRACT]
In this paper, we present generalization bounds for the unsupervised risk in
the Deep Contrastive Representation Learning framework, which employs deep
neural networks as representation functions. We approach this problem from two
angles. On the one hand, we derive a parameter-counting bound that scales with
the overall size of the neural networks. On the other hand, we provide a
norm-based bound that scales with the norms of neural networks’ weight
matrices. Ignoring logarithmic factors, the bounds are independent of $k$, the
size of the tuples provided for contrastive learning. To the best of our
knowledge, this property is only shared by one other work, which employed a
different proof strategy and suffers from a very strong exponential dependence on
the depth of the network, due to its use of the peeling technique. Our
results circumvent this by leveraging powerful results on covering numbers with
respect to uniform norms over samples. In addition, we utilize loss
augmentation techniques to further reduce the dependency on matrix norms and
the implicit dependence on network depth. In fact, our techniques allow us to
produce many bounds for the contrastive learning setting with similar
architectural dependencies as in the study of the sample complexity of ordinary
loss functions, thereby bridging the gap between the learning theories of
contrastive learning and DNNs.
[COMMENTS]
Accepted at AAAI 2025
[LINK]
http://arxiv.org/abs/2412.12014v2
[DATE]
2024-12-19 14:21:35+08:00
[CATEGORIES]
cs.LG
Single-Loop Federated Actor-Critic across Heterogeneous Environments
[AUTHORS]
Ye Zhu, Xiaowen Gong
[ABSTRACT]
Federated reinforcement learning (FRL) has emerged as a promising paradigm,
enabling multiple agents to collaborate and learn a shared policy adaptable
across heterogeneous environments. Among the various reinforcement learning
(RL) algorithms, the actor-critic (AC) algorithm stands out for its low
variance and high sample efficiency. However, little to nothing is known
theoretically about AC in a federated manner, especially when each agent interacts
with a potentially different environment. The lack of such results is
attributed to various technical challenges: a two-level structure illustrating
the coupling effect between the actor and the critic, heterogeneous
environments, Markovian sampling and multiple local updates. In response, we
study Single-loop Federated Actor Critic (SFAC), where agents perform
actor-critic learning in a two-level federated manner while interacting with
heterogeneous environments. We then provide bounds on the convergence error of
SFAC. The results show that the convergence error asymptotically converges to a
near-stationary point, with the extent proportional to environment
heterogeneity. Moreover, the sample complexity exhibits a linear speed-up
through the federation of agents. We evaluate the performance of SFAC through
numerical experiments using common RL benchmarks, which demonstrate its
effectiveness.
[COMMENTS]
Extended version of paper accepted at AAAI’25
[LINK]
http://arxiv.org/abs/2412.14555v1
[DATE]
2024-12-19 14:13:59+08:00
[CATEGORIES]
cs.LG
Summary of Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images
[AUTHORS]
Kamorudeen A. Amuda, Almustapha A. Wakili
[ABSTRACT]
This study introduces a federated learning-based approach to predict HER2
status from hematoxylin and eosin (HE)-stained whole slide images (WSIs),
reducing costs and speeding up treatment decisions. To address label imbalance
and feature representation challenges in multisite datasets, a point
transformer is proposed, incorporating dynamic label distribution, an auxiliary
classifier, and farthest cosine sampling. Extensive experiments demonstrate
state-of-the-art performance across four sites (2687 WSIs) and strong
generalization to two unseen sites (229 WSIs).
[LINK]
http://arxiv.org/abs/2412.14545v1
[DATE]
2024-12-19 13:51:46+08:00
[CATEGORIES]
cs.LG
Transformer models are gauge invariant: A mathematical connection between AI and particle physics
[AUTHORS]
Leo van Nierop
[ABSTRACT]
In particle physics, the fundamental forces are subject to symmetries called
gauge invariance. It is a redundancy in the mathematical description of any
physical system. In this article I will demonstrate that the transformer
architecture exhibits the same properties, and show that the default
representation of transformers has partially, but not fully, removed the gauge
invariance.
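
One familiar redundancy of this kind, offered here as an illustration and not necessarily the one the article analyzes, is that attention scores depend on the query and key weights only through the product W_Q W_K^T. Replacing W_Q with W_Q A and W_K with W_K A^{-T}, for any invertible matrix A, leaves the scores unchanged. The sketch below checks this on small hand-picked matrices; all names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def attention_scores(x, w_q, w_k):
    """Unnormalized attention scores (X W_Q)(X W_K)^T."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    return matmul(q, transpose(k))

# Toy inputs: two tokens with two features each.
x = [[1.0, 2.0], [3.0, -1.0]]
w_q = [[1.0, 0.0], [2.0, 1.0]]
w_k = [[0.0, 1.0], [1.0, 3.0]]

# An invertible transformation A (determinant 1) and its inverse transpose.
a = [[2.0, 1.0], [1.0, 1.0]]
a_inv_t = [[1.0, -1.0], [-1.0, 2.0]]

original = attention_scores(x, w_q, w_k)
transformed = attention_scores(x, matmul(w_q, a), matmul(w_k, a_inv_t))
```

Here `original` and `transformed` are equal even though the individual weight matrices differ, so the weights contain degrees of freedom that do not affect the model's output.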
[COMMENTS]
10 pages, 2 figures, 1 table
[LINK]
http://arxiv.org/abs/2412.14543v1
[DATE]
2024-12-19 13:49:31+08:00
[CATEGORIES]
cs.LG
SCONE: A Novel Stochastic Sampling to Generate Contrastive Views and Hard Negative Samples for Recommendation
[AUTHORS]
Chaejeong Lee, Jeongwhan Choi, Hyowon Wi, Sung-Bae Cho, Noseong Park
[ABSTRACT]
Graph-based collaborative filtering (CF) has emerged as a promising approach
in recommender systems. Despite its achievements, graph-based CF models face
challenges due to data sparsity and negative sampling. In this paper, we
propose a novel Stochastic sampling for i) COntrastive views and ii) hard
NEgative samples (SCONE) to overcome these issues. SCONE generates dynamic
augmented views and diverse hard negative samples via a unified stochastic
sampling approach based on score-based generative models. Our extensive
experiments on 6 benchmark datasets show that SCONE consistently outperforms
state-of-the-art baselines. SCONE shows efficacy in addressing user sparsity
and item popularity issues, while enhancing performance for both cold-start
users and long-tail items. Furthermore, our approach improves the diversity of
the recommendation and the uniformity of the representations. The code is
available at https://github.com/jeongwhanchoi/SCONE.
[COMMENTS]
Accepted to WSDM 2025. Chaejeong Lee and Jeongwhan Choi are co-first
authors with equal contributions
[LINK]
http://arxiv.org/abs/2405.00287v2
[DATE]
2024-12-19 13:48:08+08:00
[CATEGORIES]
cs.LG
Mitigating Spurious Correlations via Disagreement Probability
[AUTHORS]
Hyeonggeun Han, Sehwan Kim, Hyungjun Joo, Sangwoo Hong, Jungwoo Lee
[ABSTRACT]
Models trained with empirical risk minimization (ERM) are prone to be biased
towards spurious correlations between target labels and bias attributes, which
leads to poor performance on data groups lacking spurious correlations. It is
particularly challenging to address this problem when access to bias labels is
not permitted. To mitigate the effect of spurious correlations without bias
labels, we first introduce a novel training objective designed to robustly
enhance model performance across all data samples, irrespective of the presence
of spurious correlations. From this objective, we then derive a debiasing
method, Disagreement Probability based Resampling for debiasing (DPR), which
does not require bias labels. DPR leverages the disagreement between the target
label and the prediction of a biased model to identify bias-conflicting
samples (those without spurious correlations) and upsamples them according to the
disagreement probability. Empirical evaluations on multiple benchmarks
demonstrate that DPR achieves state-of-the-art performance over existing
baselines that do not use bias labels. Furthermore, we provide a theoretical
analysis that details how DPR reduces dependency on spurious correlations.
[LINK]
http://arxiv.org/abs/2411.01757v3
[DATE]
2024-12-19 13:38:31+08:00
[CATEGORIES]
cs.LG
Downscaling Precipitation with Bias-informed Conditional Diffusion Model
[AUTHORS]
Ran Lyu, Linhan Wang, Yanshen Sun, Hedanqiu Bai, Chang-Tien Lu
[ABSTRACT]
Climate change is intensifying rainfall extremes, making high-resolution
precipitation projections crucial for society to better prepare for impacts
such as flooding. However, current Global Climate Models (GCMs) operate at
spatial resolutions too coarse for localized analyses. To address this
limitation, deep learning-based statistical downscaling methods offer promising
solutions, providing high-resolution precipitation projections with a moderate
computational cost. In this work, we introduce a bias-informed conditional
diffusion model for statistical downscaling of precipitation. Specifically, our
model leverages a conditional diffusion approach to learn distribution priors
from large-scale, high-resolution precipitation datasets. The long-tail
distribution of precipitation poses a unique challenge for training diffusion
models; to address this, we apply gamma correction during preprocessing.
Additionally, to correct biases in the downscaled results, we employ a
guided-sampling strategy. Our experiments
demonstrate that the proposed model achieves highly accurate results in an 8
times downscaling setting, outperforming previous deterministic methods. The
code and dataset are available at
https://github.com/RoseLV/research_super-resolution
[COMMENTS]
3 pages, 2 figures. Accepted by Proceedings of IEEE International
Conference on Big Data, Dec 15-18, 2024
[LINK]
http://arxiv.org/abs/2412.14539v1
[DATE]
2024-12-19 13:36:52+08:00
[CATEGORIES]
cs.LG
ST-ReP: Learning Predictive Representations Efficiently for Spatial-Temporal Forecasting
[AUTHORS]
Qi Zheng, Zihao Yao, Yaying Zhang
[ABSTRACT]
Spatial-temporal forecasting is crucial and widely applicable in various
domains such as traffic, energy, and climate. Benefiting from the abundance of
unlabeled spatial-temporal data, self-supervised methods are increasingly
adapted to learn spatial-temporal representations. However, they encounter three
key challenges: 1) the difficulty in selecting reliable negative pairs due to
the homogeneity of variables, hindering contrastive learning methods; 2)
overlooking spatial correlations across variables over time; 3) limitations of
efficiency and scalability in existing self-supervised learning methods. To
tackle these, we propose a lightweight representation-learning model ST-ReP,
integrating current value reconstruction and future value prediction into the
pre-training framework for spatial-temporal forecasting. We also design a new
spatial-temporal encoder to model fine-grained relationships. Moreover,
multi-time scale analysis is incorporated into the self-supervised loss to
enhance predictive capability. Experimental results across diverse domains
demonstrate that the proposed model surpasses pre-training-based baselines,
showcasing its ability to learn compact and semantically enriched
representations while exhibiting superior scalability.
[COMMENTS]
13 pages, 7 pages. Accepted by AAAI2025
[LINK]
http://arxiv.org/abs/2412.14537v1
[DATE]
2024-12-19 13:33:55+08:00
[CATEGORIES]
cs.LG
Spectral Motion Alignment for Video Motion Transfer using Diffusion Models
[AUTHORS]
Geon Yeong Park, Hyeonho Jeong, Sang Wan Lee, Jong Chul Ye
[ABSTRACT]
The evolution of diffusion models has greatly impacted video generation and
understanding. Particularly, text-to-video diffusion models (VDMs) have
significantly facilitated the customization of input video with target
appearance, motion, etc. Despite these advances, challenges persist in
accurately distilling motion information from video frames. While existing
works leverage the consecutive frame residual as the target motion vector, they
inherently lack global motion context and are vulnerable to frame-wise
distortions. To address this, we present Spectral Motion Alignment (SMA), a
novel framework that refines and aligns motion vectors using Fourier and
wavelet transforms. SMA learns motion patterns by incorporating
frequency-domain regularization, facilitating the learning of whole-frame
global motion dynamics, and mitigating spatial artifacts. Extensive experiments
demonstrate SMA’s efficacy in improving motion transfer while maintaining
computational efficiency and compatibility across various video customization
frameworks.
[COMMENTS]
AAAI 2025, Project page:
https://geonyeong-park.github.io/spectral-motion-alignment/
[LINK]
http://arxiv.org/abs/2403.15249v2
[DATE]
2024-12-19 13:30:55+08:00
[CATEGORIES]
cs.LG
Leveraging Time Series Categorization and Temporal Fusion Transformers to Improve Cryptocurrency Price Forecasting
[AUTHORS]
Arash Peik, Mohammad Ali Zare Chahooki, Amin Milani Fard, Mehdi Agha Sarram
[ABSTRACT]
Organizing and managing cryptocurrency portfolios and decision-making on
transactions is crucial in this market. Optimal selection of assets is one of
the main challenges that requires accurate prediction of the price of
cryptocurrencies. In this work, we categorize the financial time series into
several similar subseries to increase prediction accuracy by learning each
subseries category with similar behavior. For each category of the subseries,
we create a deep learning model based on the attention mechanism to predict the
next step of each subseries. Due to the limited amount of cryptocurrency data
for training models, if the number of categories increases, the amount of
training data for each model will decrease, and some complex models will not be
trained well due to the large number of parameters. To overcome this challenge,
we propose to combine the time series data of other cryptocurrencies to
increase the amount of data for each category, hence increasing the accuracy of
the models corresponding to each category.
[LINK]
http://arxiv.org/abs/2412.14529v1
[DATE]
2024-12-19 12:57:55+08:00
[CATEGORIES]
cs.LG
Accelerating AI Performance using Anderson Extrapolation on GPUs
[AUTHORS]
Saleem Abdul Fattah Ahmed Al Dajani, David E. Keyes
[COMMENTS]
6 pages, 6 figures, 1 table, Accepted by NeurIPS 2024 Workshop MLNCP
https://openreview.net/forum?id=wkP2ZFRn9e
[LINK]
http://arxiv.org/abs/2410.19460v2
[DATE]
2024-12-19 12:49:52+08:00
[CATEGORIES]
cs.LG
Statistical Undersampling with Mutual Information and Support Points
[AUTHORS]
Alex Mak, Shubham Sahoo, Shivani Pandey, Yidan Yue, Linglong Kong
[ABSTRACT]
Class imbalance and distributional differences in large datasets present
significant challenges for classification tasks in machine learning, often leading
to biased models and poor predictive performance for minority classes. This
work introduces two novel undersampling approaches: mutual information-based
stratified simple random sampling and support points optimization. These
methods prioritize representative data selection, effectively minimizing
information loss. Empirical results across multiple classification tasks
demonstrate that our methods outperform traditional undersampling techniques,
achieving higher balanced classification accuracy. These findings highlight the
potential of combining statistical concepts with machine learning to address
class imbalance in practical applications.
[LINK]
http://arxiv.org/abs/2412.14527v1
[DATE]
2024-12-19 12:48:29+08:00
[CATEGORIES]
cs.LG
Knowledge Distillation in RNN-Attention Models for Early Prediction of Student Performance
[AUTHORS]
Sukrit Leelaluk, Cheng Tang, Valdemar Švábenský, Atsushi Shimada
[ABSTRACT]
Educational data mining (EDM) is a part of applied computing that focuses on
automatically analyzing data from learning contexts. Early prediction for
identifying at-risk students is a crucial and widely researched topic in EDM
research. It enables instructors to support at-risk students to stay on track,
preventing student dropout or failure. Previous studies have predicted
students’ learning performance to identify at-risk students by using machine
learning on data collected from e-learning platforms. However, most studies
aimed to identify at-risk students utilizing the entire course data after the
course finished. This does not correspond to the real-world scenario in which
at-risk students may drop out before the course ends. To address this problem,
we introduce an RNN-Attention-KD (knowledge distillation) framework to predict
at-risk students early throughout a course. It leverages the strengths of
Recurrent Neural Networks (RNNs) in handling time-sequence data to predict
students’ performance at each time step and employs an attention mechanism to
focus on relevant time steps for improved predictive accuracy. At the same
time, KD is applied to compress the time steps to facilitate early prediction.
In an empirical evaluation, RNN-Attention-KD outperforms traditional neural
network models in terms of recall and F1-measure. For example, it obtained
recall and F1-measure of 0.49 and 0.51 for Weeks 1–3 and 0.51 and 0.61 for
Weeks 1–6 across all datasets from four years of a university course. Then, an
ablation study investigated the contributions of different knowledge transfer
methods (distillation objectives). We found that hint loss from the hidden
layer of RNN and context vector loss from the attention module on RNN could
enhance the model’s prediction performance for identifying at-risk students.
These results are relevant for EDM researchers employing deep learning models.
[COMMENTS]
Published in Proceedings of The 40th ACM/SIGAPP Symposium on Applied
Computing (SAC '25), see https://doi.org/10.1145/3672608.3707805
[LINK]
http://arxiv.org/abs/2412.14526v1
[DATE]
2024-12-19 12:46:06+08:00
[CATEGORIES]
cs.LG
CAE-T: A Channelwise AutoEncoder with Transformer for EEG Abnormality Detection
[AUTHORS]
Youshen Zhao, Keiji Iramina
[ABSTRACT]
Electroencephalogram (EEG) signals are critical for detecting abnormal brain
activity, but their high dimensionality and complexity pose significant
challenges for effective analysis. In this paper, we propose CAE-T, a novel
framework that combines a channelwise CNN-based autoencoder with a single-head
transformer classifier for efficient EEG abnormality detection. The channelwise
autoencoder compresses raw EEG signals while preserving channel independence,
reducing computational costs and retaining biologically meaningful features.
The compressed representations are then fed into the transformer-based
classifier, which efficiently models long-term dependencies to distinguish
between normal and abnormal signals. Evaluated on the TUH Abnormal EEG Corpus,
the proposed model achieves 85.0% accuracy, 76.2% sensitivity, and 91.2%
specificity at the per-case level, outperforming baseline models such as
EEGNet, Deep4Conv, and FusionCNN. Furthermore, CAE-T requires only 202M FLOPs
and 2.9M parameters, making it significantly more efficient than
transformer-based alternatives. The framework retains interpretability through
its channelwise design, demonstrating great potential for future applications
in neuroscience research and clinical practice. The source code is available at
https://github.com/YossiZhao/CAE-T.
[COMMENTS]
The manuscript consists of 10 pages, including 5 figures. The
experimental results are based on evaluations using the TUH Abnormal EEG
Corpus
[LINK]
http://arxiv.org/abs/2412.14522v1
[DATE]
2024-12-19 12:38:34+08:00
[CATEGORIES]
cs.LG
Dynamic User Interface Generation for Enhanced Human-Computer Interaction Using Variational Autoencoders
[AUTHORS]
Runsheng Zhang, Shixiao Wang, Tianfang Xie, Shiyu Duan, Mengmeng Chen
[ABSTRACT]
This study presents a novel approach for intelligent user interaction
interface generation and optimization, grounded in the variational autoencoder
(VAE) model. With the rapid advancement of intelligent technologies,
traditional interface design methods struggle to meet the evolving demands for
diversity and personalization, often lacking flexibility in real-time
adjustments to enhance the user experience. Human-Computer Interaction (HCI)
plays a critical role in addressing these challenges by focusing on creating
interfaces that are functional, intuitive, and responsive to user needs. This
research leverages the RICO dataset to train the VAE model, enabling the
simulation and creation of user interfaces that align with user aesthetics and
interaction habits. By integrating real-time user behavior data, the system
dynamically refines and optimizes the interface, improving usability and
underscoring the importance of HCI in achieving a seamless user experience.
Experimental findings indicate that the VAE-based approach significantly
enhances the quality and precision of interface generation compared to other
methods, including autoencoders (AE), generative adversarial networks (GAN),
conditional GANs (cGAN), deep belief networks (DBN), and VAE-GAN. This work
contributes valuable insights into HCI, providing robust technical solutions
for automated interface generation and enhanced user experience optimization.
[LINK]
http://arxiv.org/abs/2412.14521v1
[DATE]
2024-12-19 12:37:47+08:00
[CATEGORIES]
cs.LG
Guided Diffusion Model for Sensor Data Obfuscation
[AUTHORS]
Xin Yang, Omid Ardakanian
[ABSTRACT]
Sensor data collected by Internet of Things (IoT) devices carries detailed
information about individuals in their vicinity. Sharing this data with a
semi-trusted service provider may compromise the individuals’ privacy, as
sensitive information can be extracted by powerful machine learning models.
Data obfuscation empowered by generative models is a promising approach to
generate synthetic sensor data such that the useful information contained in
the original data is preserved and the sensitive information is obscured. This
newly generated data will then be shared with the service provider instead of
the original sensor data. In this work, we propose PrivDiffuser, a novel data
obfuscation technique based on a denoising diffusion model that attains a
superior trade-off between data utility and privacy through effective guidance
techniques. Specifically, we extract latent representations that contain
information about public and private attributes from sensor data to guide the
diffusion model, and impose mutual information-based regularization when
learning the latent representations to alleviate the entanglement of public and
private attributes, thereby increasing the effectiveness of guidance.
Evaluation on three real-world datasets containing different sensing modalities
reveals that PrivDiffuser yields a better privacy-utility trade-off than the
state-of-the-art obfuscation model, decreasing the utility loss by up to
$1.81\%$ and the privacy loss by up to $3.42\%$. Moreover, we show that users
with diverse privacy needs can use PrivDiffuser to protect their privacy
without having to retrain the model.
[LINK]
http://arxiv.org/abs/2412.14499v1
[DATE]
2024-12-19 11:47:12+08:00
[CATEGORIES]
cs.LG
Treatment Effects Estimation on Networked Observational Data using Disentangled Variational Graph Autoencoder
[AUTHORS]
Di Fan, Renlei Jiang, Yunhao Wen, Chuanhou Gao
[ABSTRACT]
Estimating individual treatment effect (ITE) from observational data has
gained increasing attention across various domains, with a key challenge being
the identification of latent confounders affecting both treatment and outcome.
Networked observational data offer new opportunities to address this issue by
utilizing network information to infer latent confounders. However, most
existing approaches assume observed variables and network information serve
only as proxy variables for latent confounders, which often fails in practice,
as some variables influence treatment but not outcomes, and vice versa. Recent
advances in disentangled representation learning, which disentangles latent
factors into instrumental, confounding, and adjustment factors, have shown
promise for ITE estimation. Building on this, we propose a novel disentangled
variational graph autoencoder that learns disentangled factors for treatment
effect estimation on networked observational data. Our graph encoder further
ensures factor independence using the Hilbert-Schmidt Independence Criterion.
Extensive experiments on two semi-synthetic datasets derived from real-world
social networks and one synthetic dataset demonstrate that our method achieves
state-of-the-art performance.
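
For readers unfamiliar with the Hilbert-Schmidt Independence Criterion, the sketch below computes the standard biased empirical estimate, trace(KHLH) / (n - 1)^2, for scalar samples with Gaussian kernels. The kernel choice, bandwidth, and toy data are assumptions made for illustration, and the code is not taken from the paper.

```python
import math
import random

def rbf_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel on scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in xs]
            for a in xs]

def center(K):
    """Double-center a symmetric Gram matrix, i.e. compute H K H."""
    n = len(K)
    row = [sum(r) / n for r in K]  # row means equal column means by symmetry
    grand = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    Kc = center(rbf_gram(xs, sigma))
    L = rbf_gram(ys, sigma)
    # trace(HKH @ L) equals the elementwise sum because L is symmetric.
    total = sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n))
    return total / (n - 1) ** 2

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]
ys_dependent = [x + 0.1 * rng.gauss(0.0, 1.0) for x in xs]
ys_independent = [rng.gauss(0.0, 1.0) for _ in range(100)]

hsic_dependent = hsic(xs, ys_dependent)
hsic_independent = hsic(xs, ys_independent)
```

A value near zero suggests independence, so a term of this kind can be used as a penalty that pushes learned factors apart.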
[COMMENTS]
21 pages, 6 figures
[LINK]
http://arxiv.org/abs/2412.14497v1
[DATE]
2024-12-19 11:44:49+08:00
[CATEGORIES]
cs.LG
Stochastic first-order methods with multi-extrapolated momentum for highly smooth unconstrained optimization
[AUTHORS]
Chuan He
[ABSTRACT]
In this paper we consider an unconstrained stochastic optimization problem
where the objective function exhibits a high order of smoothness. In
particular, we propose a stochastic first-order method (SFOM) with
multi-extrapolated momentum, in which multiple extrapolations are performed in
each iteration, followed by a momentum step based on these extrapolations. We
show that our proposed SFOM with multi-extrapolated momentum can accelerate
optimization by exploiting the high-order smoothness of the objective function
$f$. Specifically, assuming that the gradient and the $p$th-order derivative of
$f$ are Lipschitz continuous for some $p\ge2$, and under some additional mild
assumptions, we establish that our method achieves a sample complexity of
$\widetilde{\mathcal{O}}(\epsilon^{-(3p+1)/p})$ for finding a point $x$
satisfying $\mathbb{E}[\|\nabla f(x)\|]\le\epsilon$. To the best of our
knowledge, our method is the first SFOM to leverage arbitrary order smoothness
of the objective function for acceleration, resulting in a sample complexity
that strictly improves upon the best-known results without assuming the average
smoothness condition. Finally, preliminary numerical experiments validate the
practical performance of our method and corroborate our theoretical findings.
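
As a quick worked reading of the stated rate (simple arithmetic on the exponent, not an additional result from the paper), the exponent can be rewritten and evaluated for small $p$:

```latex
\[
\frac{3p+1}{p} = 3 + \frac{1}{p}
\quad\Longrightarrow\quad
\epsilon^{-(3p+1)/p}\Big|_{p=2} = \epsilon^{-7/2},
\qquad
\epsilon^{-(3p+1)/p}\Big|_{p=3} = \epsilon^{-10/3},
\qquad
\lim_{p\to\infty} \epsilon^{-(3p+1)/p} = \epsilon^{-3}.
\]
```

The exponent therefore decreases monotonically from $7/2$ toward $3$ as the assumed order of smoothness $p$ grows.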
[LINK]
http://arxiv.org/abs/2412.14488v1
[DATE]
2024-12-19 11:22:47+08:00
[CATEGORIES]
cs.LG
Robust and Communication-Efficient Federated Domain Adaptation via Random Features
[AUTHORS]
Zhanbo Feng, Yuanjie Wang, Jie Li, Fan Yang, Jiong Lou, Tiebin Mi, Robert C. Qiu, Zhenyu Liao
[ABSTRACT]
Modern machine learning (ML) models have grown to a scale where training them
on a single machine becomes impractical. As a result, there is a growing trend
to leverage federated learning (FL) techniques to train large ML models in a
distributed and collaborative manner. These models, however, when deployed on
new devices, might struggle to generalize well due to domain shifts. In this
context, federated domain adaptation (FDA) emerges as a powerful approach to
address this challenge.
Most existing FDA approaches focus on aligning the distributions of the
source and target domains by minimizing a distance between them (e.g., MMD).
Such strategies, however, inevitably introduce high communication overheads and
can be highly sensitive to network reliability.
In this paper, we introduce RF-TCA, an enhancement to the standard Transfer
Component Analysis approach that significantly accelerates computation without
compromising theoretical and empirical performance. Leveraging the
computational advantage of RF-TCA, we further extend it to the FDA setting with
FedRF-TCA. The proposed FedRF-TCA protocol has a communication complexity that
is independent of the sample size, while maintaining performance that is
comparable to or even surpasses that of state-of-the-art FDA methods. We present
extensive experiments to showcase the superior performance and robustness (to
network conditions) of FedRF-TCA.
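
To illustrate why random features can decouple communication cost from sample size, the sketch below uses random Fourier features to approximate a Gaussian kernel and estimates a squared MMD from two fixed-length mean embeddings. It is a generic illustration under assumed settings (2-D Gaussian toy data, 200 features, unit bandwidth) with hypothetical function names; it is not the RF-TCA or FedRF-TCA algorithm.

```python
import math
import random

def make_rff(dim, num_features, sigma, rng):
    """Random Fourier feature map approximating a Gaussian kernel."""
    ws = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
          for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def featurize(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]

    return featurize

def mean_embedding(points, featurize):
    """Average feature vector; its length does not depend on len(points)."""
    feats = [featurize(x) for x in points]
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]

def mmd2(mean_a, mean_b):
    """Squared MMD estimate from two mean embeddings."""
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))

rng = random.Random(0)
featurize = make_rff(dim=2, num_features=200, sigma=1.0, rng=rng)

def sample(center, n):
    return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(n)]

source_mean = mean_embedding(sample(0.0, 200), featurize)
same_mean = mean_embedding(sample(0.0, 200), featurize)
shifted_mean = mean_embedding(sample(2.0, 200), featurize)

mmd_same = mmd2(source_mean, same_mean)
mmd_shifted = mmd2(source_mean, shifted_mean)
```

Each party only needs to share a vector of `num_features` numbers, however many samples it holds, which is the property the sketch is meant to show.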
[COMMENTS]
22 pages, 7 figures, 17 tables, accepted by IEEE Trans. KDE
[LINK]
http://arxiv.org/abs/2311.04686v2
[DATE]
2024-12-19 11:09:23+08:00
[CATEGORIES]
cs.LG
THESAURUS: Contrastive Graph Clustering by Swapping Fused Gromov-Wasserstein Couplings
[AUTHORS]
Bowen Deng, Tong Wang, Lele Fu, Sheng Huang, Chuan Chen, Tao Zhang
[ABSTRACT]
Graph node clustering is a fundamental unsupervised task. Existing methods
typically train an encoder through self-supervised learning and then apply
K-means to the encoder output. Some methods use this clustering result directly
as the final assignment, while others initialize centroids based on this
initial clustering and then finetune both the encoder and these learnable
centroids. However, due to their reliance on K-means, these methods inherit its
drawbacks when the cluster separability of encoder output is low, facing
challenges from the Uniform Effect and Cluster Assimilation. We summarize three
reasons for the low cluster separability in existing methods: (1) lack of
contextual information prevents discrimination between similar nodes from
different clusters; (2) training tasks are not sufficiently aligned with the
downstream clustering task; (3) the cluster information in the graph structure
is not appropriately exploited. To address these issues, we propose conTrastive
grapH clustEring by SwApping fUsed gRomov-wasserstein coUplingS (THESAURUS).
Our method introduces semantic prototypes to provide contextual information,
and employs a cross-view assignment prediction pretext task that aligns well
with the downstream clustering task. Additionally, it utilizes
Gromov-Wasserstein Optimal Transport (GW-OT) along with the proposed prototype
graph to thoroughly exploit cluster information in the graph structure. To
adapt to diverse real-world data, THESAURUS updates the prototype graph and the
prototype marginal distribution in OT by using momentum. Extensive experiments
demonstrate that THESAURUS achieves higher cluster separability than the prior
art, effectively mitigating the Uniform Effect and Cluster Assimilation issues.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11550v2
[DATE]
2024-12-19 11:00:59+08:00
[CATEGORIES]
cs.LG
Graph-Structured Topic Modeling for Documents with Spatial or Covariate Dependencies
[AUTHORS]
Yeo Jin Jung, Claire Donnat
[ABSTRACT]
We address the challenge of incorporating document-level metadata into topic
modeling to improve topic mixture estimation. To overcome the computational
complexity and lack of theoretical guarantees in existing Bayesian methods, we
extend probabilistic latent semantic indexing (pLSI), a frequentist framework
for topic modeling, by incorporating document-level covariates or known
similarities between documents through a graph formalism. Modeling documents as
nodes and edges denoting similarities, we propose a new estimator based on a
fast graph-regularized iterative singular value decomposition (SVD) that
encourages similar documents to share similar topic mixture proportions. We
characterize the estimation error of our proposed method by deriving
high-probability bounds and develop a specialized cross-validation method to
optimize our regularization parameters. We validate our model through
comprehensive experiments on synthetic datasets and three real-world corpora,
demonstrating improved performance and faster inference compared to existing
Bayesian methods.
[LINK]
http://arxiv.org/abs/2412.14477v1
[DATE]
2024-12-19 11:00:26+08:00
[CATEGORIES]
cs.LG
Benign Overfitting in Out-of-Distribution Generalization of Linear Models
[AUTHORS]
Shange Tang, Jiayun Wu, Jianqing Fan, Chi Jin
[ABSTRACT]
Benign overfitting refers to the phenomenon where an over-parameterized model
fits the training data perfectly, including noise in the data, but still
generalizes well to the unseen test data. While prior work provides some
theoretical understanding of this phenomenon under the in-distribution setup,
modern machine learning often operates in a more challenging
Out-of-Distribution (OOD) regime, where the target (test) distribution can be
rather different from the source (training) distribution. In this work, we take
an initial step towards understanding benign overfitting in the OOD regime by
focusing on the basic setup of over-parameterized linear models under covariate
shift. We provide non-asymptotic guarantees proving that benign overfitting
occurs in standard ridge regression, even under the OOD regime when the target
covariance satisfies certain structural conditions. We identify several vital
quantities relating to source and target covariance, which govern the
performance of OOD generalization. Our result is sharp: it provably recovers
the prior in-distribution benign overfitting guarantee [Tsigler and Bartlett,
2023], as well as the under-parameterized OOD guarantee [Ge et al., 2024], when
specialized to each setup. Moreover, we present theoretical results for a
more general family of target covariance matrices, where standard ridge
regression only achieves a slow statistical rate of $O(1/\sqrt{n})$ for the
excess risk, while Principal Component Regression (PCR) is guaranteed to
achieve the fast rate $O(1/n)$, where $n$ is the number of samples.
[COMMENTS]
58 pages, 1 figure
[LINK]
http://arxiv.org/abs/2412.14474v1
[DATE]
2024-12-19 10:47:39+08:00
[CATEGORIES]
cs.LG
Online Policy Learning and Inference by Matrix Completion
[AUTHORS]
Congyuan Duan, Jingyang Li, Dong Xia
[ABSTRACT]
Is it possible to make online decisions when personalized covariates are
unavailable? We take a collaborative-filtering approach for decision-making
based on collective preferences. By assuming low-dimensional latent features,
we formulate the covariate-free decision-making problem as a matrix completion
bandit. We propose a policy learning procedure that combines an
$\varepsilon$-greedy policy for decision-making with an online gradient descent
algorithm for bandit parameter estimation. Our novel two-phase design balances
policy learning accuracy and regret performance. For policy inference, we
develop an online debiasing method based on inverse propensity weighting and
establish its asymptotic normality. Our methods are applied to data from the
San Francisco parking pricing project, revealing intriguing discoveries and
outperforming the benchmark policy.
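
The two generic building blocks mentioned above can be sketched in a few lines: an $\varepsilon$-greedy action rule that also records the probability of the chosen action, and an inverse-propensity-weighted (IPW) estimate of an action's mean reward. The two-armed setup, reward values, and function names are hypothetical, and the sketch omits the matrix completion, gradient descent, and debiasing steps of the paper.

```python
import random

def epsilon_greedy(estimates, epsilon, rng):
    """Pick an action and return it with its selection probability."""
    k = len(estimates)
    greedy = max(range(k), key=lambda a: estimates[a])
    if rng.random() < epsilon:
        action = rng.randrange(k)  # explore uniformly over all actions
    else:
        action = greedy            # exploit the current best estimate
    # Every action gets epsilon / k from exploration; the greedy action
    # additionally gets the remaining 1 - epsilon.
    propensity = epsilon / k + ((1.0 - epsilon) if action == greedy else 0.0)
    return action, propensity

def ipw_value(logs, target_action):
    """IPW estimate of the mean reward of always playing target_action."""
    total = sum(r / p for a, r, p in logs if a == target_action)
    return total / len(logs)

rng = random.Random(0)
estimates = [0.1, 0.9]    # hypothetical current reward estimates
true_reward = [0.3, 0.7]  # hypothetical deterministic rewards
logs = []
for _ in range(20000):
    action, propensity = epsilon_greedy(estimates, 0.2, rng)
    logs.append((action, true_reward[action], propensity))

value_0 = ipw_value(logs, 0)
value_1 = ipw_value(logs, 1)
```

Dividing by the propensity corrects for the fact that the rarely explored action contributes few observations, which is why its estimate is still close to its true mean reward.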
[LINK]
http://arxiv.org/abs/2404.17398v2
[DATE]
2024-12-19 10:45:16+08:00
[CATEGORIES]
cs.LG
E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling
[AUTHORS]
Zhihang Yuan, Yuzhang Shang, Hanling Zhang, Tongcheng Fang, Rui Xie, Bingxin Xu, Yan Yan, Shengen Yan, Guohao Dai, Yu Wang
[ABSTRACT]
Recent advances in autoregressive (AR) models with continuous tokens for
image generation show promising results by eliminating the need for discrete
tokenization. However, these models face efficiency challenges due to their
sequential token generation nature and reliance on computationally intensive
diffusion-based sampling. We present ECAR (Efficient Continuous Auto-Regressive
Image Generation via Multistage Modeling), an approach that addresses these
limitations through two intertwined innovations: (1) a stage-wise continuous
token generation strategy that reduces computational complexity and provides
progressively refined token maps as hierarchical conditions, and (2) a
multistage flow-based distribution modeling method that transforms only
partially denoised distributions at each stage, compared to complete denoising in
standard diffusion models. Holistically, ECAR operates by generating tokens at
increasing resolutions while simultaneously denoising the image at each stage.
This design not only reduces token-to-image transformation cost by a factor of
the stage number but also enables parallel processing at the token level. Our
approach not only enhances computational efficiency but also aligns naturally
with image generation principles by operating in continuous token space and
following a hierarchical generation process from coarse to fine details.
Experimental results demonstrate that ECAR achieves image quality comparable to
DiT [Peebles & Xie, 2023] while requiring 10$\times$ fewer FLOPs and achieving a
5$\times$ speedup when generating a 256$\times$256 image.
[LINK]
http://arxiv.org/abs/2412.14170v2
[DATE]
2024-12-19 10:42:46+08:00
[CATEGORIES]
cs.LG
Multi-view Granular-ball Contrastive Clustering
[AUTHORS]
Peng Su, Shudong Huang, Weihong Ma, Deng Xiong, Jiancheng Lv
[ABSTRACT]
Previous multi-view contrastive learning methods typically operate at two
scales: instance-level and cluster-level. Instance-level approaches construct
positive and negative pairs based on sample correspondences, aiming to bring
positive pairs closer and push negative pairs further apart in the latent
space. Cluster-level methods focus on calculating cluster assignments for
samples under each view and maximize view consensus by reducing distribution
discrepancies, e.g., minimizing KL divergence or maximizing mutual information.
However, these two types of methods either introduce false negatives, leading
to reduced model discriminability, or overlook local structures and cannot
measure relationships between clusters across views explicitly. To this end, we
propose a method named Multi-view Granular-ball Contrastive Clustering (MGBCC).
MGBCC segments the sample set into coarse-grained granular balls, and
establishes associations between intra-view and cross-view granular balls.
These associations are reinforced in a shared latent space, thereby achieving
multi-granularity contrastive learning. Granular balls lie between instances
and clusters, naturally preserving the local topological structure of the
sample set. We conduct extensive experiments to validate the effectiveness of
the proposed method.
[COMMENTS]
9 pages, 5 figures, 2 tables, AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13550v2
[DATE]
2024-12-19 10:41:59+08:00
[CATEGORIES]
cs.LG
DECRL: A Deep Evolutionary Clustering Jointed Temporal Knowledge Graph Representation Learning Approach
[AUTHORS]
Qian Chen, Ling Chen
[ABSTRACT]
Temporal Knowledge Graph (TKG) representation learning aims to map temporally
evolving entities and relations to embedded representations in a continuous
low-dimensional vector space. However, existing approaches cannot capture the
temporal evolution of high-order correlations in TKGs. To this end, we propose
a Deep Evolutionary Clustering jointed temporal knowledge graph Representation
Learning approach (DECRL). Specifically, a deep evolutionary clustering module
is proposed to capture the temporal evolution of high-order correlations among
entities. Furthermore, a cluster-aware unsupervised alignment mechanism is
introduced to ensure the precise one-to-one alignment of soft overlapping
clusters across timestamps, thereby maintaining the temporal smoothness of
clusters. In addition, an implicit correlation encoder is introduced to capture
latent correlations between any pair of clusters under the guidance of a global
graph. Extensive experiments on seven real-world datasets demonstrate that
DECRL achieves state-of-the-art performance, outperforming the best
baseline by an average of 9.53%, 12.98%, 10.42%, and 14.68% in MRR, Hits@1,
Hits@3, and Hits@10, respectively.
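
For reference, the ranking metrics quoted above are commonly computed from the 1-based rank of the correct entity for each test query. The sketch below follows that common definition; the example ranks are made up for illustration and are not results from the paper.

```python
from typing import Sequence

def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """MRR: average of 1 / rank over all queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Hits@k: fraction of queries whose correct answer ranks in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the correct entity for five test queries.
ranks = [1, 3, 2, 10, 50]

mrr = mean_reciprocal_rank(ranks)
hits1 = hits_at_k(ranks, 1)
hits3 = hits_at_k(ranks, 3)
hits10 = hits_at_k(ranks, 10)
```

Higher is better for all four values, and Hits@1 is never larger than Hits@3, which is never larger than Hits@10.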
[COMMENTS]
Accepted by NeurIPS 2024, 17 pages, and 3 figures
[LINK]
http://arxiv.org/abs/2410.22631v2
[DATE]
2024-12-19 10:31:00+08:00
[CATEGORIES]
cs.LG
Nonstationary Sparse Spectral Permanental Process
[AUTHORS]
Zicheng Sun, Yixuan Zhang, Zenan Ling, Xuhui Fan, Feng Zhou
[ABSTRACT]
Existing permanental processes often impose constraints on kernel types or
stationarity, limiting the model’s expressiveness. To overcome these
limitations, we propose a novel approach utilizing the sparse spectral
representation of nonstationary kernels. This technique relaxes the constraints
on kernel types and stationarity, allowing for more flexible modeling while
reducing computational complexity to the linear level. Additionally, we
introduce a deep kernel variant by hierarchically stacking multiple spectral
feature mappings, further enhancing the model’s expressiveness to capture
complex patterns in data. Experimental results on both synthetic and real-world
datasets demonstrate the effectiveness of our approach, particularly in
scenarios with pronounced data nonstationarity. Additionally, ablation studies
are conducted to provide insights into the impact of various hyperparameters on
model performance.
[LINK]
http://arxiv.org/abs/2410.03581v3
[DATE]
2024-12-19 10:24:54+08:00
[CATEGORIES]
cs.LG
Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation
[AUTHORS]
Shengqi Liu, Yuhao Cheng, Zhuo Chen, Xingyu Ren, Wenhan Zhu, Lincheng Li, Mengxiao Bi, Xiaokang Yang, Yichao Yan
[ABSTRACT]
Generating sewing patterns in garment design is receiving increasing
attention due to its CG-friendly and flexible-editing nature. Previous sewing
pattern generation methods have been able to produce exquisite clothing, but
struggle to design complex garments with detailed control. To address these
issues, we propose SewingLDM, a multi-modal generative model that generates
sewing patterns controlled by text prompts, body shapes, and garment sketches.
Initially, we extend the original vector of sewing patterns into a more
comprehensive representation to cover more intricate details and then compress
them into a compact latent space. To learn the sewing pattern distribution in
the latent space, we design a two-step training strategy to inject the
multi-modal conditions, i.e., body shapes, text prompts, and garment sketches,
into a diffusion model, ensuring the generated garments are body-suited and
detail-controlled. Comprehensive qualitative and quantitative experiments show
the effectiveness of our proposed method, significantly surpassing previous
approaches in terms of complex garment design and adaptability to various body shapes.
Our project page: https://shengqiliu1.github.io/SewingLDM.
[COMMENTS]
Our project page: https://shengqiliu1.github.io/SewingLDM
[LINK]
http://arxiv.org/abs/2412.14453v1
[DATE]
2024-12-19 10:05:28+08:00
[CATEGORIES]
cs.LG
CLDG: Contrastive Learning on Dynamic Graphs
[AUTHORS]
Yiming Xu, Bin Shi, Teng Ma, Bo Dong, Haoyi Zhou, Qinghua Zheng
[ABSTRACT]
Graphs with complex annotations are among the most powerful data types, and
their constant evolution motivates further exploration of unsupervised dynamic
graph representation. One of the representative paradigms is graph contrastive
learning. It constructs self-supervised signals by maximizing the mutual
information between the augmented views of a static graph. However, the
semantics and labels may change within the augmentation process, causing a
significant performance drop in downstream tasks. This drawback becomes greatly
magnified on dynamic graphs. To address this problem, we design a simple yet
effective framework named CLDG. First, we explain that dynamic graphs have
temporal translation invariance at different levels. Then, we propose a
sampling layer to extract the temporally persistent signals. It encourages
each node to maintain consistent local and global representations, i.e.,
temporal translation invariance under the timespan views. Extensive
experiments on seven datasets demonstrate the effectiveness and efficiency of
the method, which outperforms eight unsupervised state-of-the-art baselines and
is competitive with four semi-supervised methods. Compared with the
existing dynamic graph method, the number of model parameters and the training
time are reduced by an average of 2,001.86 times and 130.31 times on seven
datasets, respectively.
[COMMENTS]
Accepted by ICDE2023
[LINK]
http://arxiv.org/abs/2412.14451v1
[DATE]
2024-12-19 09:59:24+08:00
[CATEGORIES]
cs.LG
VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision
[AUTHORS]
Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P. Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M. Wolff, Xin Huang
[ABSTRACT]
Human drivers rely on commonsense reasoning to navigate diverse and dynamic
real-world scenarios. Existing end-to-end (E2E) autonomous driving (AD) models
are typically optimized to mimic driving patterns observed in data, without
capturing the underlying reasoning processes. This limitation constrains their
ability to handle challenging driving scenarios. To close this gap, we propose
VLM-AD, a method that leverages vision-language models (VLMs) as teachers to
enhance training by providing additional supervision that incorporates
unstructured reasoning information and structured action labels. Such
supervision enhances the model’s ability to learn richer feature
representations that capture the rationale behind driving patterns.
Importantly, our method does not require a VLM during inference, making it
practical for real-time deployment. When integrated with state-of-the-art
methods, VLM-AD achieves significant improvements in planning accuracy and
reduced collision rates on the nuScenes dataset.
[LINK]
http://arxiv.org/abs/2412.14446v1
[DATE]
2024-12-19 09:53:36+08:00
[CATEGORIES]
cs.LG
GenHMR: Generative Human Mesh Recovery
[AUTHORS]
Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Srijan Das, Chen Chen
[ABSTRACT]
Human mesh recovery (HMR) is crucial in many computer vision applications,
from health to arts and entertainment. HMR from monocular images has
predominantly been addressed by deterministic methods that output a single
prediction for a given 2D image. However, HMR from a single image is an
ill-posed problem due to depth ambiguity and occlusions. Probabilistic methods
have attempted to address this by generating and fusing multiple plausible 3D
reconstructions, but their performance has often lagged behind deterministic
approaches. In this paper, we introduce GenHMR, a novel generative framework
that reformulates monocular HMR as an image-conditioned generative task,
explicitly modeling and mitigating uncertainties in the 2D-to-3D mapping
process. GenHMR comprises two key components: (1) a pose tokenizer to convert
3D human poses into a sequence of discrete tokens in a latent space, and (2) an
image-conditional masked transformer to learn the probabilistic distributions
of the pose tokens, conditioned on the input image prompt along with a randomly
masked token sequence. During inference, the model samples from the learned
conditional distribution to iteratively decode high-confidence pose tokens,
thereby reducing 3D reconstruction uncertainties. To further refine the
reconstruction, a 2D pose-guided refinement technique is proposed to directly
fine-tune the decoded pose tokens in the latent space, which forces the
projected 3D body mesh to align with the 2D pose clues. Experiments on
benchmark datasets demonstrate that GenHMR significantly outperforms
state-of-the-art methods. Project website can be found at
https://m-usamasaleem.github.io/publication/GenHMR/GenHMR.html
[LINK]
http://arxiv.org/abs/2412.14444v1
[DATE]
2024-12-19 09:45:58+08:00
[CATEGORIES]
cs.LG
Balanced Gradient Sample Retrieval for Enhanced Knowledge Retention in Proxy-based Continual Learning
[AUTHORS]
Hongye Xu, Jan Wasilewski, Bartosz Krawczyk
[ABSTRACT]
Continual learning in deep neural networks often suffers from catastrophic
forgetting, where representations for previous tasks are overwritten during
subsequent training. We propose a novel sample retrieval strategy from the
memory buffer that leverages both gradient-conflicting and gradient-aligned
samples to effectively retain knowledge about past tasks within a supervised
contrastive learning framework. Gradient-conflicting samples are selected for
their potential to reduce interference by re-aligning gradients, thereby
preserving past task knowledge. Meanwhile, gradient-aligned samples are
incorporated to reinforce stable, shared representations across tasks. By
balancing gradient correction from conflicting samples with alignment
reinforcement from aligned ones, our approach increases the diversity among
retrieved instances and achieves superior alignment in parameter space,
significantly enhancing knowledge retention and mitigating proxy drift.
Empirical results demonstrate that using both sample types outperforms methods
relying solely on one sample type or random retrieval. Experiments on popular
continual learning benchmarks in computer vision validate our method’s
state-of-the-art performance in mitigating forgetting while maintaining
competitive accuracy on new tasks.
[LINK]
http://arxiv.org/abs/2412.14430v1
[DATE]
2024-12-19 09:08:09+08:00
[CATEGORIES]
cs.LG
Generative Semi-supervised Graph Anomaly Detection
[AUTHORS]
Hezhe Qiao, Qingsong Wen, Xiaoli Li, Ee-Peng Lim, Guansong Pang
[ABSTRACT]
This work considers a practical semi-supervised graph anomaly detection (GAD)
scenario, where part of the nodes in a graph are known to be normal,
in contrast to the extensively explored unsupervised setting with a fully
unlabeled graph. We reveal that having access to the normal nodes, even just a
small percentage of normal nodes, helps enhance the detection performance of
existing unsupervised GAD methods when they are adapted to the semi-supervised
setting. However, their utilization of these normal nodes is limited. In this
paper, we propose a novel Generative GAD approach (namely GGAD) for the
semi-supervised scenario to better exploit the normal nodes. The key idea is to
generate pseudo anomaly nodes, referred to as ‘outlier nodes’, for providing
effective negative node samples in training a discriminative one-class
classifier. The main challenge here lies in the lack of ground truth
information about real anomaly nodes. To address this challenge, GGAD is
designed to leverage two important priors about the anomaly nodes – asymmetric
local affinity and egocentric closeness – to generate reliable outlier nodes
that assimilate anomaly nodes in both graph structure and feature
representations. Comprehensive experiments on six real-world GAD datasets are
performed to establish a benchmark for semi-supervised GAD and show that GGAD
substantially outperforms state-of-the-art unsupervised and semi-supervised GAD
methods with varying numbers of training normal nodes. Code will be made
available at https://github.com/mala-lab/GGAD.
[COMMENTS]
Accepted by NeurIPS 2024
[LINK]
http://arxiv.org/abs/2402.11887v7
[DATE]
2024-12-19 08:45:48+08:00
[CATEGORIES]
cs.LG
SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing
[AUTHORS]
Xueting Li, Ye Yuan, Shalini De Mello, Gilles Daviet, Jonathan Leaf, Miles Macklin, Jan Kautz, Umar Iqbal
[ABSTRACT]
We introduce SimAvatar, a framework designed to generate simulation-ready
clothed 3D human avatars from a text prompt. Current text-driven human avatar
generation methods either model hair, clothing, and the human body using a
unified geometry or produce hair and garments that are not easily adaptable for
simulation within existing simulation pipelines. The primary challenge lies in
representing the hair and garment geometry in a way that allows leveraging
established prior knowledge from foundational image diffusion models (e.g.,
Stable Diffusion) while being simulation-ready using either physics or neural
simulators. To address this task, we propose a two-stage framework that
combines the flexibility of 3D Gaussians with simulation-ready hair strands and
garment meshes. Specifically, we first employ three text-conditioned 3D
generative models to generate garment mesh, body shape and hair strands from
the given text prompt. To leverage prior knowledge from foundational diffusion
models, we attach 3D Gaussians to the body mesh, garment mesh, as well as hair
strands and learn the avatar appearance through optimization. To drive the
avatar given a pose sequence, we first apply physics simulators onto the
garment meshes and hair strands. We then transfer the motion onto 3D Gaussians
through carefully designed mechanisms for each body part. As a result, our
synthesized avatars have vivid texture and realistic dynamic motion. To the
best of our knowledge, our method is the first to produce highly realistic,
fully simulation-ready 3D avatars, surpassing the capabilities of current
approaches.
[COMMENTS]
Project website: https://nvlabs.github.io/SimAvatar/
[LINK]
http://arxiv.org/abs/2412.09545v2
[DATE]
2024-12-19 08:30:08+08:00
[CATEGORIES]
cs.LG
Enhancing Diffusion Models for High-Quality Image Generation
[AUTHORS]
Jaineet Shah, Michael Gromis, Rickston Pinto
[ABSTRACT]
This report presents the comprehensive implementation, evaluation, and
optimization of Denoising Diffusion Probabilistic Models (DDPMs) and Denoising
Diffusion Implicit Models (DDIMs), which are state-of-the-art generative
models. During inference, these models take random noise as input and
iteratively generate high-quality images as output. The study focuses on
enhancing their generative capabilities by incorporating advanced techniques
such as Classifier-Free Guidance (CFG), Latent Diffusion Models with
Variational Autoencoders (VAE), and alternative noise scheduling strategies.
The motivation behind this work is the growing demand for efficient and
scalable generative AI models that can produce realistic images across diverse
datasets, addressing challenges in applications such as art creation, image
synthesis, and data augmentation. Evaluations were conducted on datasets
including CIFAR-10 and ImageNet-100, with a focus on improving inference speed,
computational efficiency, and image quality metrics like Frechet Inception
Distance (FID). Results demonstrate that DDIM + CFG achieves faster inference
and superior image quality. Challenges with VAE and noise scheduling are also
highlighted, suggesting opportunities for future optimization. This work lays
the groundwork for developing scalable, efficient, and high-quality generative
AI systems to benefit industries ranging from entertainment to robotics.
[LINK]
http://arxiv.org/abs/2412.14422v1
[DATE]
2024-12-19 08:23:15+08:00
[CATEGORIES]
cs.LG
DriveGPT: Scaling Autoregressive Behavior Models for Driving
[AUTHORS]
Xin Huang, Eric M. Wolff, Paul Vernaza, Tung Phan-Minh, Hongge Chen, David S. Hayden, Mark Edmonds, Brian Pierce, Xinxin Chen, Pratik Elias Jacob, Xiaobai Chen, Chingiz Tairbekov, Pratik Agarwal, Tianshi Gao, Yuning Chai, Siddhartha Srinivasa
[ABSTRACT]
We present DriveGPT, a scalable behavior model for autonomous driving. We
model driving as a sequential decision-making task, and learn a transformer
model to predict future agent states as tokens in an autoregressive fashion. We
scale up our model parameters and training data by multiple orders of
magnitude, enabling us to explore the scaling properties in terms of dataset
size, model parameters, and compute. We evaluate DriveGPT across different
scales in a planning task, through both quantitative metrics and qualitative
examples including closed-loop driving in complex real-world scenarios. In a
separate prediction task, DriveGPT outperforms a state-of-the-art baseline and
exhibits improved performance by pretraining on a large-scale dataset, further
validating the benefits of data scaling.
[COMMENTS]
14 pages, 16 figures, 9 tables, and 1 video link
[LINK]
http://arxiv.org/abs/2412.14415v1
[DATE]
2024-12-19 08:06:09+08:00
[CATEGORIES]
cs.LG
Multi-task Representation Learning for Mixed Integer Linear Programming
[AUTHORS]
Junyang Cai, Taoan Huang, Bistra Dilkina
[LINK]
http://arxiv.org/abs/2412.14409v1
[DATE]
2024-12-19 07:33:32+08:00
[CATEGORIES]
cs.LG
DTW+S: Shape-based Comparison of Time-series with Ordered Local Trend
[AUTHORS]
Ajitesh Srivastava
[ABSTRACT]
Measuring distance or similarity between time-series data is a fundamental
aspect of many applications including classification, clustering, and
ensembling/alignment. Existing measures may fail to capture similarities among
local trends (shapes) and may even produce misleading results. Our goal is to
develop a measure that looks for similar trends occurring around similar times
and is easily interpretable for researchers in applied domains. This is
particularly useful for applications where time-series have a sequence of
meaningful local trends that are ordered, such as in epidemics (a surge to an
increase to a peak to a decrease). We propose a novel measure, DTW+S, which
creates an interpretable “closeness-preserving” matrix representation of the
time-series, where each column represents local trends, and then it applies
Dynamic Time Warping to compute distances between these matrices. We present a
theoretical analysis that supports the choice of this representation. We
demonstrate the utility of DTW+S in several tasks. For the clustering of
epidemic curves, we show that DTW+S is the only measure able to produce good
clustering compared to the baselines. For ensemble building, we propose a
combination of DTW+S and barycenter averaging that results in the best
preservation of characteristics of the underlying trajectories. We also
demonstrate that our approach results in better classification compared to
Dynamic Time Warping for a class of datasets, particularly when local trends
rather than scale play a decisive role.
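The Dynamic Time Warping step that the abstract applies to its matrix
representations can be illustrated with the textbook dynamic-programming
recurrence. The sketch below is a generic DTW over sequences of column
vectors with a Euclidean local cost. It is not the authors' implementation,
it does not build the "closeness-preserving" representation, and the name
`dtw_distance` and the toy inputs are assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two sequences of equal-length vectors.

    Each element of `a` and `b` is one column (one time step) given as a
    tuple of floats. The local cost is the Euclidean distance between columns.
    """
    n, m = len(a), len(b)
    # acc[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            acc[i][j] = cost + min(
                acc[i - 1][j],      # a advances, b repeats its column
                acc[i][j - 1],      # b advances, a repeats its column
                acc[i - 1][j - 1],  # both advance
            )
    return acc[n][m]


# Toy example: the same rise-and-fall shape, shifted by one time step.
x = [(0.0,), (1.0,), (2.0,), (1.0,), (0.0,)]
y = [(0.0,), (0.0,), (1.0,), (2.0,), (1.0,), (0.0,)]
print(dtw_distance(x, y))
```

Because warping lets one series repeat a column, the two toy series align
at zero cost even though they differ in length.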
[COMMENTS]
Longer version of the paper “Aligning Time-series by Local Trends:
Applications in Public Health” accepted at The 39th Annual AAAI Conference on
Artificial Intelligence (AAAI 2025)
[LINK]
http://arxiv.org/abs/2309.03579v3
[DATE]
2024-12-19 07:23:16+08:00
[CATEGORIES]
cs.LG
Fingerprinting Codes Meet Geometry: Improved Lower Bounds for Private Query Release and Adaptive Data Analysis
[AUTHORS]
Xin Lyu, Kunal Talwar
[ABSTRACT]
Fingerprinting codes are a crucial tool for proving lower bounds in
differential privacy. They have been used to prove tight lower bounds for
several fundamental questions, especially in the “low accuracy” regime.
Unlike reconstruction/discrepancy approaches, however, they are more suited for
query sets that arise naturally from the fingerprinting codes construction. In
this work, we propose a general framework for proving fingerprinting type lower
bounds, that allows us to tailor the technique to the geometry of the query
set. Our approach allows us to prove several new results, including the
following.
First, we show that any (sample- and population-)accurate algorithm for
answering $Q$ arbitrary adaptive counting queries over a universe $\mathcal{X}$
to accuracy $\alpha$ needs $\Omega(\frac{\sqrt{\log |\mathcal{X}|}\cdot \log
Q}{\alpha^3})$ samples, matching known upper bounds. This shows that the
approaches based on differential privacy are optimal for this question, and
improves significantly on the previously known lower bounds of $\frac{\log
Q}{\alpha^2}$ and $\min(\sqrt{Q}, \sqrt{\log |\mathcal{X}|})/\alpha^2$.
Second, we show that any $(\varepsilon,\delta)$-DP algorithm for answering
$Q$ counting queries to accuracy $\alpha$ needs $\Omega(\frac{\sqrt{
\log|\mathcal{X}| \log(1/\delta)} \log Q}{\varepsilon\alpha^2})$ samples,
matching known upper bounds up to constants. Our framework allows for proving
this bound via a direct correlation analysis and improves the prior bound of
[BUV’14] by $\sqrt{\log(1/\delta)}$.
Third, we characterize the sample complexity of answering a set of random
$0$-$1$ queries under approximate differential privacy. We give new upper and
lower bounds in different regimes. By combining them with known results, we can
complete the whole picture.
[COMMENTS]
Abstract slightly shortened to meet the arXiv requirement; 50 Pages
and 1 Figure
[LINK]
http://arxiv.org/abs/2412.14396v1
[DATE]
2024-12-19 07:11:07+08:00
[CATEGORIES]
cs.LG
Nemesis: Noise-randomized Encryption with Modular Efficiency and Secure Integration in Machine Learning Systems
[AUTHORS]
Dongfang Zhao
[LINK]
http://arxiv.org/abs/2412.14392v1
[DATE]
2024-12-19 06:52:12+08:00
[CATEGORIES]
cs.LG
A Clinical Trial Design Approach to Auditing Language Models in Healthcare Setting
[AUTHORS]
Lovedeep Gondara, Jonathan Simkin
[ABSTRACT]
We present an audit mechanism for language models, with a focus on models
deployed in the healthcare setting. Our proposed mechanism takes inspiration
from clinical trial design, where we posit the language model audit as a
single-blind equivalence trial, with the comparison of interest being the
subject matter experts. We show that using our proposed method, we can follow
principled sample size and power calculations, leading to the requirement of
sampling a minimum number of records while maintaining the audit integrity and
statistical soundness. Finally, we provide a real-world example of the audit
used in a production environment in a large-scale public health network.
[LINK]
http://arxiv.org/abs/2411.16702v2
[DATE]
2024-12-19 06:48:15+08:00
[CATEGORIES]
cs.LG
The Data-Driven Censored Newsvendor Problem
[AUTHORS]
Chamsi Hssaine, Sean R. Sinclair
[ABSTRACT]
We study a censored variant of the data-driven newsvendor problem, where the
decision-maker must select an ordering quantity that minimizes expected overage
and underage costs based only on offline censored sales data, rather than
historical demand realizations. Our goal is to understand how the degree of
historical demand censoring affects the performance of any learning algorithm
for this problem. To isolate this impact, we adopt a distributionally robust
optimization framework, evaluating policies according to their worst-case
regret over an ambiguity set of distributions. This set is defined by the
largest historical order quantity (the observable boundary of the dataset), and
contains all distributions matching the true demand distribution up to this
boundary, while allowing them to be arbitrary afterwards. We demonstrate a
spectrum of achievability under demand censoring by deriving a natural
necessary and sufficient condition under which vanishing regret is an
achievable goal. In regimes in which it is not, we exactly characterize the
information loss due to censoring: an insurmountable lower bound on the
performance of any policy, even when the decision-maker has access to
infinitely many demand samples. We then leverage these sharp characterizations
to propose a natural robust algorithm that adapts to the historical level of
demand censoring. We derive finite-sample guarantees for this algorithm across
all possible censoring regimes and show its near-optimality with matching lower
bounds (up to polylogarithmic factors). We moreover demonstrate its robust
performance via extensive numerical experiments on both synthetic and
real-world datasets.
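For context on the overage and underage costs mentioned above: in the
classical uncensored newsvendor problem, the cost-minimising order quantity
is the demand quantile at the critical ratio b / (b + h), where b is the
per-unit underage cost and h the per-unit overage cost. The sketch below
applies that rule to fully observed demand samples (sample average
approximation). It is not the censored, distributionally robust algorithm
proposed in the paper, and the function names are assumptions made for
illustration.

```python
import math


def saa_order_quantity(demands, underage_cost, overage_cost):
    """Empirical critical-ratio quantile of fully observed demand samples."""
    if not demands:
        raise ValueError("need at least one demand sample")
    ratio = underage_cost / (underage_cost + overage_cost)
    ordered = sorted(demands)
    # Smallest sample whose empirical CDF value reaches the critical ratio.
    k = max(1, math.ceil(ratio * len(ordered)))
    return ordered[k - 1]


def average_cost(q, demands, underage_cost, overage_cost):
    """Average newsvendor cost of ordering q against the observed demands."""
    total = 0.0
    for d in demands:
        total += underage_cost * max(d - q, 0) + overage_cost * max(q - d, 0)
    return total / len(demands)


demands = [10, 20, 30, 40, 50]
q = saa_order_quantity(demands, underage_cost=3, overage_cost=1)
print(q, average_cost(q, demands, 3, 1))
```

With censored sales data, demand above the historical order quantity is
never observed, so this empirical quantile cannot be computed once the
critical ratio falls beyond the observable boundary.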
[COMMENTS]
72 pages, 9 tables, 7 figures
[LINK]
http://arxiv.org/abs/2412.01763v2
[DATE]
2024-12-19 06:34:36+08:00
[CATEGORIES]
cs.LG
Balans: Multi-Armed Bandits-based Adaptive Large Neighborhood Search for Mixed-Integer Programming Problem
[AUTHORS]
Junyang Cai, Serdar Kadioglu, Bistra Dilkina
[ABSTRACT]
Mixed-Integer Programming (MIP) is a powerful paradigm for modeling and
solving various important combinatorial optimization problems. Recently,
learning-based approaches have shown potential to speed up MIP solving via
offline training that then guides important design decisions during search.
However, a significant drawback of these methods is their heavy reliance on
offline training, which requires collecting training datasets and
computationally costly training epochs, yet offers only limited generalization
to unseen (larger) instances. In this paper, we propose Balans, an adaptive
meta-solver for MIPs with online learning capability that does not require any
supervision or a priori training. At its core, Balans is based on adaptive
large-neighborhood search, operating on top of a MIP solver by successive
applications of destroy and repair neighborhood operators. During the search,
the selection among different neighborhood definitions is guided on the fly for
the instance at hand via multi-armed bandit algorithms. Our extensive
experiments on hard optimization instances show that Balans offers significant
performance gains over the default MIP solver, is better than committing to any
single best neighborhood, and improves over the state-of-the-art
large-neighborhood search for MIPs. Finally, we release Balans as a highly
configurable, MIP-solver-agnostic, open-source software.
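The idea of letting a multi-armed bandit choose among neighborhood
definitions during the search can be sketched with a simple epsilon-greedy
rule. The abstract does not say which bandit algorithms Balans uses, so the
policy, the class name `EpsilonGreedyBandit`, the operator names, and the
reward values below are all assumptions made for illustration, not the
Balans API.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy selection over a fixed set of arms (neighborhoods)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {arm: 0 for arm in self.arms}
        self.means = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Try every arm once before exploiting the estimates.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.means[arm])

    def update(self, arm, reward):
        # Incremental running mean of the rewards observed for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Hypothetical operator names and rewards (e.g. objective improvement).
bandit = EpsilonGreedyBandit(["op_a", "op_b", "op_c"], epsilon=0.0)
first_picks = []
for reward in (0.1, 0.9, 0.5):
    arm = bandit.select()
    first_picks.append(arm)
    bandit.update(arm, reward)
print(first_picks, bandit.select())
```

In a large-neighborhood search loop, each iteration would call `select`,
apply the chosen destroy operator followed by a repair step, and feed the
observed improvement back through `update`.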
[LINK]
http://arxiv.org/abs/2412.14382v1
[DATE]
2024-12-19 06:32:13+08:00
[CATEGORIES]
cs.LG
Explanations that reveal all through the definition of encoding
[AUTHORS]
Aahlad Puli, Nhi Nguyen, Rajesh Ranganath
[COMMENTS]
36 pages, 7 figures, 6 tables, 38th Conference on Neural Information
Processing Systems (NeurIPS 2024)
[LINK]
http://arxiv.org/abs/2411.02664v2
[DATE]
2024-12-19 06:18:45+08:00
[CATEGORIES]
cs.LG
Causal Bayesian Optimization via Exogenous Distribution Learning
[AUTHORS]
Shaogang Ren, Xiaoning Qian
[ABSTRACT]
Maximizing a target variable as an operational objective in a structural
causal model is an important problem. Existing Causal Bayesian
Optimization (CBO) methods either rely on hard interventions that alter the
causal structure to maximize the reward, or introduce action nodes to
endogenous variables so that the data generation mechanisms are adjusted to
achieve the objective. In this paper, a novel method is introduced to learn the
distribution of exogenous variables, which is typically ignored or marginalized
through expectation by existing methods. Exogenous distribution learning
improves the approximation accuracy of structural causal models in a surrogate
model that is usually trained with limited observational data. Moreover, the
learned exogenous distribution extends existing CBO to general causal schemes
beyond Additive Noise Models (ANM). The recovery of exogenous variables allows
us to use a more flexible prior for noise or unobserved hidden variables. We
develop a new CBO method by leveraging the learned exogenous distribution.
Experiments on different datasets and applications show the benefits of our
proposed method.
[LINK]
http://arxiv.org/abs/2402.02277v8
[DATE]
2024-12-19 06:17:10+08:00
[CATEGORIES]
cs.LG
Scaling Deep Learning Training with MPMD Pipeline Parallelism
[AUTHORS]
Anxhelo Xhebraj, Sean Lee, Hanfeng Chen, Vinod Grover
[ABSTRACT]
We present JaxPP, a system for efficiently scaling the training of large deep
learning models with flexible pipeline parallelism. We introduce a seamless
programming model that allows implementing user-defined pipeline schedules for
gradient accumulation. JaxPP automatically distributes tasks, corresponding to
pipeline stages, over a cluster of nodes and automatically infers the
communication among them. We implement an MPMD runtime for asynchronous
execution of SPMD tasks. The pipeline parallelism implementation of JaxPP
improves hardware utilization by up to $1.11\times$ with respect to the best
performing SPMD configuration.
[LINK]
http://arxiv.org/abs/2412.14374v1
[DATE]
2024-12-19 06:15:11+08:00
[CATEGORIES]
cs.LG
Implementing TD3 to train a Neural Network to fly a Quadcopter through an FPV Gate
[AUTHORS]
Patrick Thomas, Kevin Schroeder, Jonathan Black
[ABSTRACT]
Deep reinforcement learning has been shown to be a powerful tool for developing
policies in environments where an optimal solution is unclear. In this paper,
we attempt to apply Twin Delayed Deep Deterministic Policy Gradients to train a
neural network to act as a velocity controller for a quadcopter. The
quadcopter’s objective is to quickly fly through a gate while avoiding crashing
into the gate. We transfer our trained policy to the real world by deploying it
on a quadcopter in a laboratory environment. Finally, we demonstrate that the
trained policy is able to navigate the drone to the gate in the real world.
[LINK]
http://arxiv.org/abs/2412.14367v1
[DATE]
2024-12-19 06:04:29+08:00
[CATEGORIES]
cs.LG
Enabling Realtime Reinforcement Learning at Scale with Staggered Asynchronous Inference
[AUTHORS]
Matthew Riemer, Gopeshh Subbaraj, Glen Berseth, Irina Rish
[ABSTRACT]
Realtime environments change even as agents perform action inference and
learning, thus requiring high interaction frequencies to effectively minimize
regret. However, recent advances in machine learning involve larger neural
networks with longer inference times, raising questions about their
applicability in realtime systems where reaction time is crucial. We present an
analysis of lower bounds on regret in realtime reinforcement learning (RL)
environments to show that minimizing long-term regret is generally impossible
within the typical sequential interaction and learning paradigm, but often
becomes possible when sufficient asynchronous compute is available. We propose
novel algorithms for staggering asynchronous inference processes to ensure that
actions are taken at consistent time intervals, and demonstrate that use of
models with high action inference times is only constrained by the
environment’s effective stochasticity over the inference horizon, and not by
action frequency. Our analysis shows that the number of inference processes
needed scales linearly with increasing inference times while enabling use of
models that are multiple orders of magnitude larger than existing approaches
when learning from a realtime simulation of Game Boy games such as Pokémon
and Tetris.
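The claim that the number of inference processes scales linearly with
inference time can be illustrated with a simple round-robin calculation.
The sketch assumes a fixed inference time and a fixed action interval, and
that one process handles one inference at a time. It is an illustration of
the scaling argument, not the staggering algorithms proposed in the paper,
and the function names are assumptions.

```python
import math


def processes_needed(inference_time, action_interval):
    """Fewest processes that can emit one action every action_interval."""
    if inference_time <= 0 or action_interval <= 0:
        raise ValueError("times must be positive")
    return math.ceil(inference_time / action_interval)


def staggered_schedule(inference_time, action_interval, num_actions):
    """Round-robin schedule as (process_id, start_time, finish_time) tuples."""
    n = processes_needed(inference_time, action_interval)
    schedule = []
    for k in range(num_actions):
        start = k * action_interval
        # Process k % n last started n intervals ago, so it is free again.
        schedule.append((k % n, start, start + inference_time))
    return schedule


# Times in milliseconds: a 350 ms model acting every 100 ms.
sched = staggered_schedule(350, 100, 6)
for entry in sched:
    print(entry)
```

Under these assumptions, doubling the inference time roughly doubles the
number of processes while the interval between emitted actions stays fixed.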
[LINK]
http://arxiv.org/abs/2412.14355v1
[DATE]
2024-12-19 05:43:40+08:00
[CATEGORIES]
cs.LG
NeuralPLexer3: Accurate Biomolecular Complex Structure Prediction with Flow Models
[AUTHORS]
Zhuoran Qiao, Feizhi Ding, Thomas Dresselhaus, Mia A. Rosenfeld, Xiaotian Han, Owen Howell, Aniketh Iyengar, Stephen Opalenski, Anders S. Christensen, Sai Krishna Sirumalla, Frederick R. Manby, Thomas F. Miller III, Matthew Welborn
[ABSTRACT]
Structure determination is essential to a mechanistic understanding of
diseases and the development of novel therapeutics. Machine-learning-based
structure prediction methods have made significant advancements by
computationally predicting protein and bioassembly structures from sequences
and molecular topology alone. Despite substantial progress in the field,
challenges remain to deliver structure prediction models to real-world drug
discovery. Here, we present NeuralPLexer3 – a physics-inspired flow-based
generative model that achieves state-of-the-art prediction accuracy on key
biomolecular interaction types and improves training and sampling efficiency
compared to its predecessors and alternative methodologies. Examined through
newly developed benchmarking strategies, NeuralPLexer3 excels in vital areas
that are crucial to structure-based drug design, such as physical validity and
ligand-induced conformational changes.
[LINK]
http://arxiv.org/abs/2412.10743v2
[DATE]
2024-12-19 05:35:10+08:00
[CATEGORIES]
cs.LG
Conditioned quantum-assisted deep generative surrogate for particle-calorimeter interactions
[AUTHORS]
J. Quetzalcoatl Toledo-Marin, Sebastian Gonzalez, Hao Jia, Ian Lu, Deniz Sogutlu, Abhishek Abhishek, Colin Gay, Eric Paquet, Roger Melko, Geoffrey C. Fox, Maximilian Swiatlowski, Wojciech Fedorko
[ABSTRACT]
Particle collisions at accelerators such as the Large Hadron Collider,
recorded and analyzed by experiments such as ATLAS and CMS, enable exquisite
measurements of the Standard Model and searches for new phenomena. Simulations
of collision events at these detectors have played a pivotal role in shaping
the design of future experiments and analyzing ongoing ones. However, the quest
for accuracy in Large Hadron Collider (LHC) collisions comes at an imposing
computational cost, with projections estimating the need for millions of
CPU-years annually during the High Luminosity LHC (HL-LHC) run
\cite{collaboration2022atlas}. Simulating a single LHC event with
Geant4 currently devours around 1000 CPU seconds, with simulations of
the calorimeter subdetectors in particular imposing substantial computational
demands \cite{rousseau2023experimental}. To address this challenge, we propose
a conditioned quantum-assisted deep generative model. Our model integrates a
conditioned variational autoencoder (VAE) on the exterior with a conditioned
Restricted Boltzmann Machine (RBM) in the latent space, providing enhanced
expressiveness compared to conventional VAEs. The RBM nodes and connections are
meticulously engineered to enable the use of qubits and couplers on D-Wave’s
Pegasus-structured Advantage quantum annealer (QA) for sampling. We
introduce a novel method for conditioning the quantum-assisted RBM using
flux biases. We further propose a novel adaptive mapping to estimate
the effective inverse temperature in quantum annealers. The effectiveness of
our framework is illustrated using Dataset 2 of the CaloChallenge
\cite{calochallenge}.
[COMMENTS]
27 pages, 10 figures, 8 appendices
[LINK]
http://arxiv.org/abs/2410.22870v5
[DATE]
2024-12-19 05:25:06+08:00
[CATEGORIES]
cs.LG
Stable Learning Using Spiking Neural Networks Equipped With Affine Encoders and Decoders
[AUTHORS]
A. Martina Neuman, Dominik Dold, Philipp Christian Petersen
[ABSTRACT]
We study the learning problem associated with spiking neural networks.
Specifically, we focus on spiking neural networks composed of simple spiking
neurons having only positive synaptic weights, equipped with an affine encoder
and decoder. These neural networks are shown to depend continuously on their
parameters, which facilitates classical covering number-based generalization
statements and supports stable gradient-based training. We demonstrate that the
positivity of the weights continues to enable a wide range of expressivity
results, including rate-optimal approximation of smooth functions and
dimension-independent approximation of Barron regular functions. In particular,
we show in theory and simulations that affine spiking neural networks are
capable of approximating shallow ReLU neural networks. Furthermore, we apply
these neural networks to standard machine learning benchmarks, reaching
competitive results. Finally, and remarkably, we observe that from a
generalization perspective, contrary to feedforward neural networks or previous
results for general spiking neural networks, the depth has little to no adverse
effect on the generalization capabilities.
[LINK]
http://arxiv.org/abs/2404.04549v2
[DATE]
2024-12-19 05:17:16+08:00
[CATEGORIES]
cs.LG
A Unifying Information-theoretic Perspective on Evaluating Generative Models
[AUTHORS]
Alexis Fox, Samarth Swarup, Abhijin Adiga
[ABSTRACT]
Considering the difficulty of interpreting generative model output, there is
significant current research focused on determining meaningful evaluation
metrics. Several recent approaches utilize “precision” and “recall,” borrowed
from the classification domain, to individually quantify the output fidelity
(realism) and output diversity (representation of the real data variation),
respectively. With the increase in metric proposals, there is a need for a
unifying perspective, allowing for easier comparison and clearer explanation of
their benefits and drawbacks. To this end, we unify a class of
kth-nearest-neighbors (kNN)-based metrics under an information-theoretic lens
using approaches from kNN density estimation. Additionally, we propose a
tri-dimensional metric composed of Precision Cross-Entropy (PCE), Recall
Cross-Entropy (RCE), and Recall Entropy (RE), which separately measure fidelity
and two distinct aspects of diversity, inter- and intra-class. Our
domain-agnostic metric, derived from the information-theoretic concepts of
entropy and cross-entropy, can be dissected for both sample- and mode-level
analysis. Our detailed experimental results demonstrate the sensitivity of our
metric components to their respective qualities and reveal undesirable
behaviors of other metrics.
[LINK]
http://arxiv.org/abs/2412.14340v1
[DATE]
2024-12-19 05:17:02+08:00
[CATEGORIES]
cs.LG
Covariances for Free: Exploiting Mean Distributions for Federated Learning with Pre-Trained Models
[AUTHORS]
Dipam Goswami, Simone Magistri, Kai Wang, Bartłomiej Twardowski, Andrew D. Bagdanov, Joost van de Weijer
[ABSTRACT]
Using pre-trained models has been found to reduce the effect of data
heterogeneity and speed up federated learning algorithms. Recent works have
investigated the use of first-order statistics and second-order statistics to
aggregate local client data distributions at the server and achieve very high
performance without any training. In this work we propose a training-free
method based on an unbiased estimator of class covariance matrices. Our method,
which only uses first-order statistics in the form of class means communicated
by clients to the server, incurs only a fraction of the communication costs
required by methods based on communicating second-order statistics. We show how
these estimated class covariances can be used to initialize a linear
classifier, thus exploiting the covariances without actually sharing them. When
compared to state-of-the-art methods which also share only class means, our
approach improves performance in the range of 4-26% with exactly the same
communication cost. Moreover, our method achieves performance competitive or
superior to sharing second-order statistics with dramatically less
communication overhead. Finally, using our method to initialize classifiers and
then performing federated fine-tuning yields better and faster convergence.
Code is available at https://github.com/dipamgoswami/FedCOF.
[LINK]
http://arxiv.org/abs/2412.14326v1
[DATE]
2024-12-19 04:40:14+08:00
[CATEGORIES]
cs.LG
On the Robustness of Spectral Algorithms for Semirandom Stochastic Block Models
[AUTHORS]
Aditya Bhaskara, Agastya Vibhuti Jha, Michael Kapralov, Naren Sarayu Manoj, Davide Mazzali, Weronika Wrzos-Kaminska
[ABSTRACT]
In a graph bisection problem, we are given a graph $G$ with two equally-sized
unlabeled communities, and the goal is to recover the vertices in these
communities. A popular heuristic, known as spectral clustering, is to output an
estimated community assignment based on the eigenvector corresponding to the
second smallest eigenvalue of the Laplacian of $G$. Spectral algorithms can be
shown to provably recover the cluster structure for graphs generated from
certain probabilistic models, such as the Stochastic Block Model (SBM).
However, spectral clustering is known to be non-robust to model
mis-specification. Techniques based on semidefinite programming have been shown
to be more robust, but they incur significant computational overheads.
In this work, we study the robustness of spectral algorithms against
semirandom adversaries. Informally, a semirandom adversary is allowed to
“helpfully” change the specification of the model in a way that is consistent
with the ground-truth solution. Our semirandom adversaries in particular are
allowed to add edges inside clusters or increase the probability that an edge
appears inside a cluster. Semirandom adversaries are a useful tool to determine
the extent to which an algorithm has overfit to statistical assumptions on the
input.
On the positive side, we identify classes of semirandom adversaries under
which spectral bisection using the unnormalized Laplacian is strongly
consistent, i.e., it exactly recovers the planted partitioning. On the negative
side, we show that in these classes spectral bisection with the normalized
Laplacian outputs a partitioning that makes a classification mistake on a
constant fraction of the vertices. Finally, we demonstrate numerical
experiments that complement our theoretical findings.
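The spectral bisection heuristic described in the abstract, splitting
vertices by the eigenvector of the second smallest eigenvalue of the
Laplacian, can be sketched without external libraries. The sketch uses the
unnormalized Laplacian L = D - A and approximates the eigenvector by power
iteration on (c I - L) restricted to vectors orthogonal to the all-ones
vector. The function name `fiedler_bisection`, the fixed iteration count,
and the toy graph are assumptions made for illustration, and the input is
assumed to be a simple connected graph.

```python
import math


def fiedler_bisection(n, edges, iterations=500):
    """Label vertices 0/1 by the sign of an approximate Fiedler vector."""
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    degree = [len(adj) for adj in neighbors]
    shift = 2 * max(degree)  # upper bound on the largest eigenvalue of L

    # Deterministic start vector, centred so it is orthogonal to all-ones.
    vec = [i - (n - 1) / 2 for i in range(n)]
    for _ in range(iterations):
        # Multiply by (shift * I - L), where L = D - A.
        nxt = [
            (shift - degree[i]) * vec[i] + sum(vec[j] for j in neighbors[i])
            for i in range(n)
        ]
        # Project out the all-ones direction, then normalise.
        mean = sum(nxt) / n
        nxt = [x - mean for x in nxt]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return [0 if x < 0 else 1 for x in vec]


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
labels = fiedler_bisection(6, edges)
print(labels)
```

Power iteration only converges to the intended eigenvector when the start
vector is not orthogonal to it and the second smallest eigenvalue is
simple. A production implementation would use a dedicated eigensolver.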
[COMMENTS]
45 pages. NeurIPS 2024
[LINK]
http://arxiv.org/abs/2412.14315v1
[DATE]
2024-12-19 04:35:02+08:00
[CATEGORIES]
cs.LG
XTSFormer: Cross-Temporal-Scale Transformer for Irregular-Time Event Prediction in Clinical Applications
[AUTHORS]
Tingsong Xiao, Zelin Xu, Wenchong He, Zhengkun Xiao, Yupu Zhang, Zibo Liu, Shigang Chen, My T. Thai, Jiang Bian, Parisa Rashidi, Zhe Jiang
[ABSTRACT]
Adverse clinical events related to unsafe care are among the top ten causes
of death in the U.S. Accurate modeling and prediction of clinical events from
electronic health records (EHRs) play a crucial role in patient safety
enhancement. An example is modeling de facto care pathways that characterize
common step-by-step plans for treatment or care. However, clinical event data
pose several unique challenges, including the irregularity of time intervals
between consecutive events, the existence of cycles, periodicity, multi-scale
event interactions, and the high computational costs associated with long event
sequences. Existing neural temporal point process (TPP) methods do not
effectively capture the multi-scale nature of event interactions, which is
common in many real-world clinical applications. To address these issues, we
propose the cross-temporal-scale transformer (XTSFormer), specifically designed
for irregularly timed event data. Our model consists of two vital components: a
novel Feature-based Cycle-aware Time Positional Encoding (FCPE) that adeptly
captures the cyclical nature of time, and a hierarchical multi-scale temporal
attention mechanism, where different temporal scales are determined by a
bottom-up clustering approach. Extensive experiments on several real-world EHR
datasets show that our XTSFormer outperforms multiple baseline methods. The
code is available at https://github.com/spatialdatasciencegroup/XTSFormer.
[COMMENTS]
Accepted at AAAI 2025
[LINK]
http://arxiv.org/abs/2402.02258v2
[DATE]
2024-12-19 04:31:28+08:00
[CATEGORIES]
cs.LG
Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning
[AUTHORS]
Brett Barkley, David Fridovich-Keil
[ABSTRACT]
Dyna-style off-policy model-based reinforcement learning (DMBRL) algorithms
are a family of techniques for generating synthetic state transition data and
thereby enhancing the sample efficiency of off-policy RL algorithms. This paper
identifies and investigates a surprising performance gap observed when applying
DMBRL algorithms across different benchmark environments with proprioceptive
observations. We show that, while DMBRL algorithms perform well in OpenAI Gym,
their performance can drop significantly in DeepMind Control Suite (DMC), even
though these settings offer similar tasks and identical physics backends.
Modern techniques designed to address several key issues that arise in these
settings do not provide a consistent improvement across all environments, and
overall our results show that adding synthetic rollouts to the training process
– the backbone of Dyna-style algorithms – significantly degrades performance
across most DMC environments. Our findings contribute to a deeper understanding
of several fundamental challenges in model-based RL and show that, like many
optimization fields, there is no free lunch when evaluating performance across
diverse benchmarks in RL.
[LINK]
http://arxiv.org/abs/2412.14312v1
[DATE]
2024-12-19 04:25:04+08:00
[CATEGORIES]
cs.LG
Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation
[AUTHORS]
Benjamin Steenhoek, Michele Tufano, Neel Sundaresan, Alexey Svyatkovskiy
[ABSTRACT]
Software testing is a crucial but time-consuming aspect of software
development, and recently, Large Language Models (LLMs) have gained popularity
for automated test case generation. However, because LLMs are trained on vast
amounts of open-source code, they often generate test cases that do not adhere
to best practices and may even contain test smells (anti-patterns). To address
this issue, we propose Reinforcement Learning from Static Quality Metrics
(RLSQM), wherein we utilize Reinforcement Learning to generate high-quality
unit tests based on static analysis-based quality metrics. First, we analyzed
LLM-generated tests and showed that LLMs frequently generate undesirable test
smells – up to 37% of the time. Then, we implemented a lightweight static
analysis-based reward model and trained LLMs using this reward model to
optimize for five code quality metrics. Our experimental results demonstrate
that the RL-optimized Codex model consistently generated higher-quality test
cases than the base LLM, improving quality metrics by up to 23%, and generated
nearly 100% syntactically-correct code. RLSQM also outperformed GPT-4 on all
code quality metrics, in spite of training a substantially cheaper Codex model.
We provide insights into how to reliably utilize RL to improve test generation
quality and show that RLSQM is a significant step towards enhancing the overall
efficiency and reliability of automated software testing. Our data are
available at https://doi.org/10.6084/m9.figshare.25983166.
[COMMENTS]
Accepted to DeepTest 2025 (ICSE Workshop). arXiv admin note: text
overlap with arXiv:2310.02368
[LINK]
http://arxiv.org/abs/2412.14308v1
[DATE]
2024-12-19 04:20:01+08:00
[CATEGORIES]
cs.LG
What Has Been Overlooked in Contrastive Source-Free Domain Adaptation: Leveraging Source-Informed Latent Augmentation within Neighborhood Context
[AUTHORS]
Jing Wang, Wonho Bae, Jiahong Chen, Kuangen Zhang, Leonid Sigal, Clarence W. de Silva
[ABSTRACT]
Source-free domain adaptation (SFDA) involves adapting a model originally
trained using a labeled dataset ({\em source domain}) to perform effectively on
an unlabeled dataset ({\em target domain}) without relying on any source data
during adaptation. This adaptation is especially crucial when significant
disparities in data distributions exist between the two domains and when there
are privacy concerns regarding the source model’s training data. The absence of
access to source data during adaptation makes it challenging to analytically
estimate the domain gap. To tackle this issue, various techniques have been
proposed, such as unsupervised clustering, contrastive learning, and continual
learning. In this paper, we first conduct an extensive theoretical analysis of
SFDA based on contrastive learning, primarily because it has demonstrated
superior performance compared to other techniques. Motivated by the obtained
insights, we then introduce a straightforward yet highly effective latent
augmentation method tailored for contrastive SFDA. This augmentation method
leverages the dispersion of latent features within the neighborhood of the
query sample, guided by the source pre-trained model, to enhance the
informativeness of positive keys. Our approach, based on a single InfoNCE-based
contrastive loss, outperforms state-of-the-art SFDA methods on widely
recognized benchmark datasets.
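
For readers unfamiliar with the InfoNCE objective named above, the following is a minimal sketch of the standard single-query form of the loss, not the authors' implementation. The function names, the toy vectors, and the temperature value are illustrative assumptions, and the sketch leaves out the latent augmentation of positive keys that the abstract describes.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(query, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss for one query.

    Computes -log( exp(q.k+ / t) / sum_i exp(q.k_i / t) ), where the sum
    runs over the positive key and all negative keys.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max logit for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Hypothetical 2-D features: the loss is small when the positive key is
# aligned with the query and large when a negative key is aligned instead.
query = [1.0, 0.0]
aligned = info_nce(query, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(query, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(aligned, misaligned)
```

The loss depends on the positive key only through its similarity to the query relative to the negatives, which is why the choice of positive keys matters for this objective.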
[COMMENTS]
ICLR 2025
[LINK]
http://arxiv.org/abs/2412.14301v1
[DATE]
2024-12-19 04:09:46+08:00
[CATEGORIES]
cs.LG
The Multiplex Classification Framework: optimizing multi-label classifiers through problem transformation, ontology engineering, and model ensembling
[AUTHORS]
Mauro Nievas Offidani, Facundo Roffet, Claudio Augusto Delrieux, Maria Carolina Gonzalez Galtier, Marcos Zarate
[ABSTRACT]
Classification is a fundamental task in machine learning. While conventional
methods, such as binary, multiclass, and multi-label classification, are
effective for simpler problems, they may not adequately address the
complexities of some real-world scenarios. This paper introduces the Multiplex
Classification Framework, a novel approach developed to tackle these and
similar challenges through the integration of problem transformation, ontology
engineering, and model ensembling. The framework offers several advantages,
including adaptability to any number of classes and logical constraints, an
innovative method for managing class imbalance, the elimination of confidence
threshold selection, and a modular structure. Two experiments were conducted to
compare the performance of conventional classification models with the
Multiplex approach. Our results demonstrate that the Multiplex approach can
improve classification performance significantly (up to 10% gain in overall F1
score), particularly in classification problems with a large number of classes
and pronounced class imbalances. However, it also has limitations, as it
requires a thorough understanding of the problem domain and some experience
with ontology engineering, and it involves training multiple models, which can
make the whole process more intricate. Overall, this methodology provides a
valuable tool for researchers and practitioners dealing with complex
classification problems in machine learning.
[COMMENTS]
43 pages, 15 figures, submitted to Applied Ontology
[LINK]
http://arxiv.org/abs/2412.14299v1
[DATE]
2024-12-19 04:07:27+08:00
[CATEGORIES]
cs.LG
Distributionally Robust Policy Learning under Concept Drifts
[AUTHORS]
Jingyuan Wang, Zhimei Ren, Ruohan Zhan, Zhengyuan Zhou
[ABSTRACT]
Distributionally robust policy learning aims to find a policy that performs
well under the worst-case distributional shift, and yet most existing methods
for robust policy learning consider the worst-case joint distribution of the
covariate and the outcome. The joint-modeling strategy can be unnecessarily
conservative when we have more information on the source of distributional
shifts. This paper studies a more nuanced problem – robust policy learning
under the concept drift, when only the conditional relationship between the
outcome and the covariate changes. To this end, we first provide a
doubly-robust estimator for evaluating the worst-case average reward of a given
policy under a set of perturbed conditional distributions. We show that the
policy value estimator enjoys asymptotic normality even if the nuisance
parameters are estimated with a slower-than-root-$n$ rate. We then propose a
learning algorithm that outputs the policy maximizing the estimated policy
value within a given policy class $\Pi$, and show that the sub-optimality gap
of the proposed algorithm is of the order $\kappa(\Pi)n^{-1/2}$, where
$\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance and
$n$ is the sample size. A matching lower bound is provided to show the
optimality of the rate. The proposed methods are implemented and evaluated in
numerical studies, demonstrating substantial improvement compared with existing
benchmarks.
[LINK]
http://arxiv.org/abs/2412.14297v1
[DATE]
2024-12-19 03:53:56+08:00
[CATEGORIES]
cs.LG
Temporally Consistent Object-Centric Learning by Contrasting Slots
[AUTHORS]
Anna Manasyan, Maximilian Seitzer, Filip Radovic, Georg Martius, Andrii Zadaianchuk
[ABSTRACT]
Unsupervised object-centric learning from videos is a promising approach to
extract structured representations from large, unlabeled collections of videos.
To support downstream tasks like autonomous control, these representations must
be both compositional and temporally consistent. Existing approaches based on
recurrent processing often lack long-term stability across frames because their
training objective does not enforce temporal consistency. In this work, we
introduce a novel object-level temporal contrastive loss for video
object-centric models that explicitly promotes temporal consistency. Our method
significantly improves the temporal consistency of the learned object-centric
representations, yielding more reliable video decompositions that facilitate
challenging downstream tasks such as unsupervised object dynamics prediction.
Furthermore, the inductive bias added by our loss strongly improves object
discovery, leading to state-of-the-art results on both synthetic and real-world
datasets, outperforming even weakly-supervised methods that leverage motion
masks as additional cues.
[LINK]
http://arxiv.org/abs/2412.14295v1
[DATE]
2024-12-19 03:46:04+08:00
[CATEGORIES]
cs.LG
TRecViT: A Recurrent Video Transformer
[AUTHORS]
Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu
[ABSTRACT]
We propose a novel block for video modelling. It relies on a
time-space-channel factorisation with dedicated blocks for each dimension:
gated linear recurrent units (LRUs) perform information mixing over time,
self-attention layers perform mixing over space, and MLPs over channels. The
resulting architecture TRecViT performs well on sparse and dense tasks, trained
in supervised or self-supervised regimes. Notably, our model is causal and
outperforms or is on par with a pure attention model ViViT-L on large-scale
video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters,
$12\times$ smaller memory footprint, and $5\times$ lower FLOPs count. Code and
checkpoints will be made available online at
https://github.com/google-deepmind/trecvit.
[LINK]
http://arxiv.org/abs/2412.14294v1
[DATE]
2024-12-19 03:44:30+08:00
[CATEGORIES]
cs.LG
Projected gradient methods for nonconvex and stochastic optimization: new complexities and auto-conditioned stepsizes
[AUTHORS]
Guanghui Lan, Tianjiao Li, Yangyang Xu
[ABSTRACT]
We present a novel class of projected gradient (PG) methods for minimizing a
smooth but not necessarily convex function over a convex compact set. We first
provide a novel analysis of the “vanilla” PG method, achieving the best-known
iteration complexity for finding an approximate stationary point of the
problem. We then develop an “auto-conditioned” projected gradient (AC-PG)
variant that achieves the same iteration complexity without requiring the input
of the Lipschitz constant of the gradient or any line search procedure. The key
idea is to estimate the Lipschitz constant using first-order information
gathered from the previous iterations, and to show that the error caused by
underestimating the Lipschitz constant can be properly controlled. We then
generalize the PG methods to the stochastic setting, by proposing a stochastic
projected gradient (SPG) method and a variance-reduced stochastic gradient
(VR-SPG) method, achieving new complexity bounds in different oracle settings.
We also present auto-conditioned stepsize policies for both stochastic PG
methods and establish comparable convergence guarantees.
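
The projected gradient iteration described above can be sketched as follows. This is a toy illustration under stated assumptions, not the AC-PG method of the paper: the feasible set is assumed to be a box, the objective is a simple quadratic (convex, for ease of checking), and the running Lipschitz estimate (the largest observed ratio of gradient change to iterate change) is a generic stand-in for the authors' auto-conditioned stepsize policy. All function and parameter names are hypothetical.

```python
def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(xi, lo), hi) for xi in x]


def projected_gradient(grad, x0, lo, hi, lipschitz0=1.0, iters=200):
    """Projected gradient with stepsize 1/L, where L is estimated on the fly.

    L starts at the guess `lipschitz0` and is raised whenever the observed
    ratio ||g_new - g|| / ||x_new - x|| exceeds it, so no Lipschitz constant
    and no line search are required as input.
    """
    x = project_box(x0, lo, hi)
    g = grad(x)
    L = lipschitz0
    for _ in range(iters):
        x_new = project_box([xi - gi / L for xi, gi in zip(x, g)], lo, hi)
        g_new = grad(x_new)
        dx = sum((a - b) ** 2 for a, b in zip(x_new, x)) ** 0.5
        dg = sum((a - b) ** 2 for a, b in zip(g_new, g)) ** 0.5
        if dx > 0:
            L = max(L, dg / dx)  # first-order information from past iterates
        x, g = x_new, g_new
    return x


def toy_grad(x):
    """Gradient of the toy objective f(x) = (x0 - 2)^2 + 5 * (x1 + 1)^2."""
    return [2.0 * (x[0] - 2.0), 10.0 * (x[1] + 1.0)]


# Over the box [0, 1]^2 the constrained minimizer of the toy objective is (1, 0).
solution = projected_gradient(toy_grad, [0.5, 0.5], 0.0, 1.0)
print(solution)
```

Starting from a deliberately small guess for L means the early steps are too long; the projection keeps the iterates feasible while the estimate catches up.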
[LINK]
http://arxiv.org/abs/2412.14291v1
[DATE]
2024-12-19 03:34:16+08:00
[CATEGORIES]
cs.LG
Split Learning in Computer Vision for Semantic Segmentation Delay Minimization
[AUTHORS]
Nikos G. Evgenidis, Nikos A. Mitsiou, Sotiris A. Tegos, Panagiotis D. Diamantoulakis, George K. Karagiannidis
[ABSTRACT]
In this paper, we propose a novel approach to minimize the inference delay in
semantic segmentation using split learning (SL), tailored to the needs of
real-time computer vision (CV) applications for resource-constrained devices.
Semantic segmentation is essential for applications such as autonomous vehicles
and smart city infrastructure, but faces significant latency challenges due to
high computational and communication loads. Traditional centralized processing
methods are inefficient for such scenarios, often resulting in unacceptable
inference delays. SL offers a promising alternative by partitioning deep neural
networks (DNNs) between edge devices and a central server, enabling localized
data processing and reducing the amount of data required for transmission. Our
contribution includes the joint optimization of bandwidth allocation, cut layer
selection of the edge devices’ DNN, and the central server’s processing
resource allocation. We investigate both parallel and serial data processing
scenarios and propose low-complexity heuristic solutions that maintain
near-optimal performance while reducing computational requirements. Numerical
results show that our approach effectively reduces inference delay,
demonstrating the potential of SL for improving real-time CV applications in
dynamic, resource-constrained environments.
[LINK]
http://arxiv.org/abs/2412.14272v1
[DATE]
2024-12-19 03:07:25+08:00
[CATEGORIES]
cs.LG
A Unified Framework for Human-Allied Learning of Probabilistic Circuits
[AUTHORS]
Athresh Karanam, Saurabh Mathur, Sahil Sidheekh, Sriraam Natarajan
[ABSTRACT]
Probabilistic Circuits (PCs) have emerged as an efficient framework for
representing and learning complex probability distributions. Nevertheless, the
existing body of research on PCs predominantly concentrates on data-driven
parameter learning, often neglecting the potential of knowledge-intensive
learning, a particular issue in data-scarce/knowledge-rich domains such as
healthcare. To bridge this gap, we propose a novel unified framework that can
systematically integrate diverse domain knowledge into the parameter learning
process of PCs. Experiments on several benchmarks as well as real-world
datasets show that our proposed framework can both effectively and efficiently
leverage domain knowledge to achieve superior performance compared to purely
data-driven learning approaches.
[LINK]
http://arxiv.org/abs/2405.02413v2
[DATE]
2024-12-19 03:02:40+08:00
[CATEGORIES]
cs.LG
VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
[AUTHORS]
Runtao Liu, Haoyu Wu, Zheng Ziqiang, Chen Wei, Yingqing He, Renjie Pi, Qifeng Chen
[ABSTRACT]
Recent progress in generative diffusion models has greatly advanced
text-to-video generation. While text-to-video models trained on large-scale,
diverse datasets can produce varied outputs, these generations often deviate
from user preferences, highlighting the need for preference alignment on
pre-trained models. Although Direct Preference Optimization (DPO) has
demonstrated significant improvements in language and image generation, we
pioneer its adaptation to video diffusion models and propose a VideoDPO
pipeline by making several key adjustments. Unlike previous image alignment
methods that focus solely on either (i) visual quality or (ii) semantic
alignment between text and videos, we comprehensively consider both dimensions
and construct a preference score accordingly, which we term the OmniScore. We
design a pipeline to automatically collect preference pair data based on the
proposed OmniScore and discover that re-weighting these pairs based on the
score significantly impacts overall preference alignment. Our experiments
demonstrate substantial improvements in both visual quality and semantic
alignment, ensuring that no preference aspect is neglected. Code and data will
be shared at https://videodpo.github.io/.
[LINK]
http://arxiv.org/abs/2412.14167v1
[DATE]
2024-12-19 02:59:49+08:00
[CATEGORIES]
cs.LG
Syzygy: Dual Code-Test C to (safe) Rust Translation using LLMs and Dynamic Analysis
[AUTHORS]
Manish Shetty, Naman Jain, Adwait Godbole, Sanjit A. Seshia, Koushik Sen
[ABSTRACT]
Despite extensive usage in high-performance, low-level systems programming
applications, C is susceptible to vulnerabilities due to manual memory
management and unsafe pointer operations. Rust, a modern systems programming
language, offers a compelling alternative. Its unique ownership model and type
system ensure memory safety without sacrificing performance.
In this paper, we present Syzygy, an automated approach to translate C to
safe Rust. Our technique uses a synergistic combination of LLM-driven code and
test translation guided by dynamic-analysis-generated execution information.
This paired translation runs incrementally in a loop over the program in
dependency order of the code elements while maintaining per-step correctness.
Our approach exposes novel insights on combining the strengths of LLMs and
dynamic analysis in the context of scaling and combining code generation with
testing. We apply our approach to successfully translate Zopfli, a
high-performance compression library with ~3000 lines of code and 98 functions.
We validate the translation by testing equivalence with the source C program on
a set of inputs. To our knowledge, this is the largest automated and
test-validated C to safe Rust code translation achieved so far.
[COMMENTS]
Project Webpage: https://syzygy-project.github.io/. Preliminary
version accepted at LLM4Code 2025, 34 pages
[LINK]
http://arxiv.org/abs/2412.14234v1
[DATE]
2024-12-19 02:55:46+08:00
[CATEGORIES]
cs.LG
A Staged Deep Learning Approach to Spatial Refinement in 3D Temporal Atmospheric Transport
[AUTHORS]
M. Giselle Fernández-Godino, Wai Tong Chung, Akshay A. Gowardhan, Matthias Ihme, Qingkai Kong, Donald D. Lucas, Stephen C. Myers
[ABSTRACT]
High-resolution spatiotemporal simulations effectively capture the
complexities of atmospheric plume dispersion in complex terrain. However, their
high computational cost makes them impractical for applications requiring rapid
responses or iterative processes, such as optimization, uncertainty
quantification, or inverse modeling. To address this challenge, this work
introduces the Dual-Stage Temporal Three-dimensional UNet Super-resolution
(DST3D-UNet-SR) model, a highly efficient deep learning model for plume
dispersion prediction. DST3D-UNet-SR is composed of two sequential modules: the
temporal module (TM), which predicts the transient evolution of a plume in
complex terrain from low-resolution temporal data, and the spatial refinement
module (SRM), which subsequently enhances the spatial resolution of the TM
predictions. We train DST3D-UNet-SR using a comprehensive dataset derived from
high-resolution large eddy simulations (LES) of plume transport. We propose the
DST3D-UNet-SR model to significantly accelerate LES simulations of
three-dimensional plume dispersion by three orders of magnitude. Additionally,
the model demonstrates the ability to dynamically adapt to evolving conditions
through the incorporation of new observational data, substantially improving
prediction accuracy in high-concentration regions near the source.
Keywords: Atmospheric sciences, Geosciences, Plume transport, 3D temporal
sequences, Artificial intelligence, CNN, LSTM, Autoencoder, Autoregressive
model, U-Net, Super-resolution, Spatial Refinement.
[COMMENTS]
12 pages, 10 figures
[LINK]
http://arxiv.org/abs/2412.10945v2
[DATE]
2024-12-19 02:46:15+08:00
[CATEGORIES]
cs.LG
On Calibration in Multi-Distribution Learning
[AUTHORS]
Rajeev Verma, Volker Fischer, Eric Nalisnick
[ABSTRACT]
Modern challenges of robustness, fairness, and decision-making in machine
learning have led to the formulation of multi-distribution learning (MDL)
frameworks in which a predictor is optimized across multiple distributions. We
study the calibration properties of MDL to better understand how the predictor
performs uniformly across the multiple distributions. Through classical results
on decomposing proper scoring losses, we first derive the Bayes optimal rule
for MDL, demonstrating that it maximizes the generalized entropy of the
associated loss function. Our analysis reveals that while this approach ensures
minimal worst-case loss, it can lead to non-uniform calibration errors across
the multiple distributions and there is an inherent calibration-refinement
trade-off, even at Bayes optimality. Our results highlight a critical
limitation: despite the promise of MDL, one must use caution when designing
predictors tailored to multiple distributions so as to minimize disparity.
[LINK]
http://arxiv.org/abs/2412.14142v1
[DATE]
2024-12-19 02:41:40+08:00
[CATEGORIES]
cs.LG
Walk Wisely on Graph: Knowledge Graph Reasoning with Dual Agents via Efficient Guidance-Exploration
[AUTHORS]
Zijian Wang, Bin Wang, Haifeng Jing, Huayu Li, Hongbo Dou
[ABSTRACT]
In recent years, multi-hop reasoning has been widely studied for knowledge graph
(KG) reasoning due to its efficacy and interpretability. However, previous
multi-hop reasoning approaches are subject to two primary shortcomings. First,
agents struggle to learn effective and robust policies at the early phase due
to sparse rewards. Second, these approaches often falter on specific datasets
like sparse knowledge graphs, where agents are required to traverse lengthy
reasoning paths. To address these problems, we propose a multi-hop reasoning
model with dual agents based on hierarchical reinforcement learning (HRL),
which is named FULORA. FULORA tackles the above reasoning challenges by
eFficient GUidance-ExpLORAtion between dual agents. The high-level agent walks
on the simplified knowledge graph to provide stage-wise hints for the low-level
agent walking on the original knowledge graph. In this framework, the low-level
agent optimizes a value function that balances two objectives: (1) maximizing
return, and (2) integrating efficient guidance from the high-level agent.
Experiments conducted on three real-world knowledge graph datasets demonstrate
that FULORA outperforms RL-based baselines, especially in the case of
long-distance reasoning.
[COMMENTS]
Accepted by AAAI-25
[LINK]
http://arxiv.org/abs/2408.01880v4
[DATE]
2024-12-19 02:31:42+08:00
[CATEGORIES]
cs.LG
Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective
[AUTHORS]
Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang, Shimin Li, Yunhua Zhou, Qipeng Guo, Xuanjing Huang, Xipeng Qiu
[LINK]
http://arxiv.org/abs/2412.14135v1
[DATE]
2024-12-19 02:24:47+08:00
[CATEGORIES]
cs.LG
jinns: a JAX Library for Physics-Informed Neural Networks
[AUTHORS]
Hugo Gangloff, Nicolas Jouvin
[ABSTRACT]
jinns is an open-source Python library for physics-informed neural networks,
built to tackle both forward and inverse problems, as well as meta-model
learning. Rooted in the JAX ecosystem, it provides a versatile framework for
efficiently prototyping real problems, while easily allowing extensions to
specific needs. Furthermore, the implementation leverages existing popular JAX
libraries such as equinox and optax for model definition and optimisation,
bringing a sense of familiarity to the user. Many models are available as
baselines, and the documentation provides reference implementations of
different use-cases along with step-by-step tutorials for extensions to
specific needs. The code is available on GitLab at
https://gitlab.com/mia_jinns/jinns.
[COMMENTS]
12 pages, 1 figure
[LINK]
http://arxiv.org/abs/2412.14132v1
[DATE]
2024-12-19 02:21:41+08:00
[CATEGORIES]
cs.LG
Go With the Flow: Fast Diffusion for Gaussian Mixture Models
[AUTHORS]
George Rapakoulias, Ali Reza Pedram, Panagiotis Tsiotras
[ABSTRACT]
Schrödinger Bridges (SB) are diffusion processes that steer, in finite
time, a given initial distribution to another final one while minimizing a
suitable cost functional. Although various methods for computing SBs have
recently been proposed in the literature, most of these approaches require
computationally expensive training schemes, even for solving low-dimensional
problems. In this work, we propose an analytic parametrization of a set of
feasible policies for steering the distribution of a dynamical system from one
Gaussian Mixture Model (GMM) to another. Instead of relying on standard
non-convex optimization techniques, the optimal policy within the set can be
approximated as the solution of a low-dimensional linear program whose
dimension scales linearly with the number of components in each mixture.
Furthermore, our method generalizes naturally to more general classes of
dynamical systems such as controllable Linear Time-Varying systems that cannot
currently be solved using traditional neural SB approaches. We showcase the
potential of this approach in low-to-moderate dimensional problems such as
image-to-image translation in the latent space of an autoencoder, and various
other examples. We also benchmark our approach on an Entropic Optimal Transport
(EOT) problem and show that it outperforms state-of-the-art methods in cases
where the boundary distributions are mixture models while requiring virtually
no training.
[LINK]
http://arxiv.org/abs/2412.09059v2
[DATE]
2024-12-19 02:20:14+08:00
[CATEGORIES]
cs.LG
Machine Learning Co-pilot for Screening of Organic Molecular Additives for Perovskite Solar Cells
[AUTHORS]
Yang Pu, Zhiyuan Dai, Yifan Zhou, Ning Jia, Hongyue Wang, Yerzhan Mukhametkarimov, Ruihao Chen, Hongqiang Wang, Zhe Liu
[ABSTRACT]
Machine learning (ML) has been extensively employed in planar perovskite
photovoltaics to screen effective organic molecular additives, while
encountering predictive biases for novel materials due to small datasets and
reliance on predefined descriptors. The present work thus proposes an effective
approach, Co-Pilot for Perovskite Additive Screener (Co-PAS), an ML-driven
framework designed to accelerate additive screening for perovskite solar cells
(PSCs). Co-PAS overcomes predictive biases by integrating the Molecular
Scaffold Classifier (MSC) for scaffold-based pre-screening and utilizing
Junction Tree Variational Autoencoder (JTVAE) latent vectors to enhance
molecular structure representation, thereby enhancing the accuracy of power
conversion efficiency (PCE) predictions. Leveraging Co-PAS, we integrate domain
knowledge to screen an extensive dataset of 250,000 molecules from PubChem,
prioritizing candidates based on predicted PCE values and key molecular
properties such as donor number, dipole moment, and hydrogen bond acceptor
count. This workflow leads to the identification of several promising
passivating molecules, including the novel Boc-L-threonine N-hydroxysuccinimide
ester (BTN), which, to our knowledge, has not been explored as an additive in
PSCs and achieves a device PCE of 25.20%. Our results underscore the potential
of Co-PAS in advancing additive discovery for high-performance PSCs.
[LINK]
http://arxiv.org/abs/2412.14109v1
[DATE]
2024-12-19 01:52:45+08:00
[CATEGORIES]
cs.LG
Parameter-efficient Fine-tuning for improved Convolutional Baseline for Brain Tumor Segmentation in Sub-Saharan Africa Adult Glioma Dataset
[AUTHORS]
Bijay Adhikari, Pratibha Kulung, Jakesh Bohaju, Laxmi Kanta Poudel, Confidence Raymond, Dong Zhang, Udunna C Anazodo, Bishesh Khanal, Mahesh Shakya
[ABSTRACT]
Automating brain tumor segmentation using deep learning methods is an ongoing
challenge in medical imaging. Multiple lingering issues exist, including
domain shift and applications in low-resource settings, which bring a unique
set of challenges including scarcity of data. As a step towards solving these
specific problems, we propose Convolutional adapter-inspired
Parameter-efficient Fine-tuning (PEFT) of the MedNeXt architecture. To validate our
idea, we show our method performs comparably to full fine-tuning with the added
benefit of reduced training compute using BraTS-2021 as the pre-training dataset
and BraTS-Africa as the fine-tuning dataset. BraTS-Africa consists of a small
dataset (60 train / 35 validation) from the Sub-Saharan African population with
marked shift in the MRI quality compared to BraTS-2021 (1251 train samples). We
first show that models trained on the BraTS-2021 dataset do not generalize well to
BraTS-Africa, as shown by a 20% reduction in mean Dice on BraTS-Africa validation
samples. Then, we show that PEFT can leverage both the BraTS-2021 and
BraTS-Africa datasets to obtain a mean Dice of 0.8 compared to 0.72 when trained
only on BraTS-Africa. Finally, we show that PEFT (0.80 mean Dice) results in
comparable performance to full fine-tuning (0.77 mean Dice), which may suggest PEFT
is better on average, but the boxplots show that full fine-tuning results in
much lower variance in performance. Nevertheless, on disaggregation of the
Dice metrics, we find that the model has a tendency to oversegment, as shown by
high specificity (0.99) compared to relatively low sensitivity (0.75). The
source code is available at
https://github.com/CAMERA-MRI/SPARK2024/tree/main/PEFT_MedNeXt
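
The three metrics reported above follow the standard definitions for binary masks, which the sketch below computes from voxel-wise true/false positives and negatives. It is a minimal illustration, not code from the linked repository; the function name and the 20-voxel toy masks are assumptions chosen only to make the arithmetic easy to follow.

```python
def segmentation_metrics(pred, truth):
    """Dice, sensitivity and specificity for two flat binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # fraction of tumor voxels found
        "specificity": ratio(tn, tn + fp),  # fraction of background kept clean
    }


# Hypothetical masks: 4 tumor voxels among 20, of which the prediction finds 3.
truth = [1, 1, 1, 1] + [0] * 16
pred = [1, 1, 1, 0] + [0] * 16
metrics = segmentation_metrics(pred, truth)
print(metrics)
```

Because background voxels usually far outnumber tumor voxels, specificity tends to sit close to 1 in this kind of task, so it is worth reading alongside sensitivity and Dice rather than on its own.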
[COMMENTS]
Accepted to “The International Brain Tumor Segmentation (BraTS)
challenge organized at MICCAI 2024 conference”
[LINK]
http://arxiv.org/abs/2412.14100v1
[DATE]
2024-12-19 01:48:32+08:00
[CATEGORIES]
cs.LG
Adaptive Concept Bottleneck for Foundation Models Under Distribution Shifts
[AUTHORS]
Jihye Choi, Jayaram Raghuram, Yixuan Li, Somesh Jha
[COMMENTS]
The preliminary version of the work appeared in the ICML 2024
Workshop on Foundation Models in the Wild
[LINK]
http://arxiv.org/abs/2412.14097v1
[DATE]
2024-12-19 01:47:46+08:00
[CATEGORIES]
cs.LG
HeteroSample: Meta-path Guided Sampling for Heterogeneous Graph Representation Learning
[AUTHORS]
Ao Liu, Jing Chen, Ruiying Du, Cong Wu, Yebo Feng, Teng Li, Jianfeng Ma
[ABSTRACT]
The rapid expansion of Internet of Things (IoT) has resulted in vast,
heterogeneous graphs that capture complex interactions among devices, sensors,
and systems. Efficient analysis of these graphs is critical for deriving
insights in IoT scenarios such as smart cities, industrial IoT, and intelligent
transportation systems. However, the scale and diversity of IoT-generated data
present significant challenges, and existing methods often struggle with
preserving the structural integrity and semantic richness of these complex
graphs. Many current approaches fail to maintain the balance between
computational efficiency and the quality of the insights generated, leading to
potential loss of critical information necessary for accurate decision-making
in IoT applications. We introduce HeteroSample, a novel sampling method
designed to address these challenges by preserving the structural integrity,
node and edge type distributions, and semantic patterns of IoT-related graphs.
HeteroSample works by incorporating the novel top-leader selection, balanced
neighborhood expansion, and meta-path guided sampling strategies. The key idea
is to leverage the inherent heterogeneous structure and semantic relationships
encoded by meta-paths to guide the sampling process. This approach ensures that
the resulting subgraphs are representative of the original data while
significantly reducing computational overhead. Extensive experiments
demonstrate that HeteroSample outperforms state-of-the-art methods, achieving
up to 15% higher F1 scores in tasks such as link prediction and node
classification, while reducing runtime by 20%. These advantages make
HeteroSample a transformative tool for scalable and accurate IoT applications,
enabling more effective and efficient analysis of complex IoT systems,
ultimately driving advancements in smart cities, industrial IoT, and beyond.
[COMMENTS]
11 pages
[LINK]
http://arxiv.org/abs/2411.07022v2
[DATE]
2024-12-19 01:46:57+08:00
[CATEGORIES]
cs.LG
CiTrus: Squeezing Extra Performance out of Low-data Bio-signal Transfer Learning
[AUTHORS]
Eloy Geenjaar, Lie Lu
[ABSTRACT]
Transfer learning for bio-signals has recently become an important technique
to improve prediction performance on downstream tasks with small bio-signal
datasets. Recent works have shown that pre-training a neural network model on a
large dataset (e.g., EEG) with a self-supervised task, replacing the
self-supervised head with a linear classification head, and fine-tuning the
model on different downstream bio-signal datasets (e.g., EMG or ECG) can
dramatically improve the performance on those datasets. In this paper, we
propose a new convolution-transformer hybrid model architecture with masked
auto-encoding for low-data bio-signal transfer learning, introduce a
frequency-based masked auto-encoding task, employ a more comprehensive
evaluation framework, and evaluate how much and when (multimodal) pre-training
improves fine-tuning performance. We also introduce a dramatically more
performant method of aligning a downstream dataset with a different temporal
length and sampling rate to the original pre-training dataset. Our findings
indicate that the convolution-only part of our hybrid model can achieve
state-of-the-art performance on some low-data downstream tasks. The performance
is often improved even further with our full model. In the case of
transformer-based models we find that pre-training especially improves
performance on downstream datasets, multimodal pre-training often increases
those gains further, and our frequency-based pre-training performs the best on
average for the lowest and highest data regimes.
[LINK]
http://arxiv.org/abs/2412.11695v2
[DATE]
2024-12-19 01:40:34+08:00
[CATEGORIES]
cs.LG
A Novel Generative Multi-Task Representation Learning Approach for Predicting Postoperative Complications in Cardiac Surgery Patients
[AUTHORS]
Junbo Shen, Bing Xue, Thomas Kannampallil, Chenyang Lu, Joanna Abraham
[ABSTRACT]
Early detection of surgical complications allows for timely therapy and
proactive risk mitigation. Machine learning (ML) can be leveraged to identify
and predict patient risks for postoperative complications. We developed and
validated the effectiveness of predicting postoperative complications using a
novel surgical Variational Autoencoder (surgVAE) that uncovers intrinsic
patterns via cross-task and cross-cohort representation learning. This
retrospective cohort study used data from the electronic health records of
adult surgical patients over four years (2018 - 2021). Six key postoperative
complications for cardiac surgery were assessed: acute kidney injury, atrial
fibrillation, cardiac arrest, deep vein thrombosis or pulmonary embolism, blood
transfusion, and other intraoperative cardiac events. We compared prediction
performances of surgVAE against widely-used ML models and advanced
representation learning and generative models under 5-fold cross-validation.
89,246 surgeries (49% male, median (IQR) age: 57 (45-69)) were included, with
6,502 in the targeted cardiac surgery cohort (61% male, median (IQR) age: 60
(53-70)). surgVAE demonstrated superior performance over existing ML solutions
across all postoperative complications of cardiac surgery patients, achieving
macro-averaged AUPRC of 0.409 and macro-averaged AUROC of 0.831, which were
3.4% and 3.7% higher, respectively, than the best alternative method (by AUPRC
scores). Model interpretation using Integrated Gradients highlighted key risk
factors based on preoperative variable importance. surgVAE showed excellent
discriminatory performance for predicting postoperative complications and
addressing the challenges of data complexity, small cohort sizes, and
low-frequency positive events. surgVAE enables data-driven predictions of
patient risks and prognosis while enhancing the interpretability of patient
risk profiles.
[COMMENTS]
This article has been accepted for publication in Journal of the
American Medical Informatics Association Published by Oxford University
Press. Codes are publicly available at:
https://github.com/ai4biomedicine/surgVAE
[LINK]
http://arxiv.org/abs/2412.01950v2
[DATE]
2024-12-19 01:40:15+08:00
[CATEGORIES]
cs.LG
Future Research Avenues for Artificial Intelligence in Digital Gaming: An Exploratory Report
[AUTHORS]
Markus Dablander
[ABSTRACT]
Video games are a natural and synergistic application domain for artificial
intelligence (AI) systems, offering both the potential to enhance player
experience and immersion, as well as providing valuable benchmarks and virtual
environments to advance AI technologies in general. This report presents a
high-level overview of five promising research pathways for applying
state-of-the-art AI methods, particularly deep learning, to digital gaming
within the context of the current research landscape. The objective of this
work is to outline a curated, non-exhaustive list of encouraging research
directions at the intersection of AI and video games that may serve to inspire
more rigorous and comprehensive research efforts in the future. We discuss (i)
investigating large language models as core engines for game agent modelling,
(ii) using neural cellular automata for procedural game content generation,
(iii) accelerating computationally expensive in-game simulations via deep
surrogate modelling, (iv) leveraging self-supervised learning to obtain useful
video game state embeddings, and (v) training generative models of interactive
worlds using unlabelled video data. We also briefly address current technical
challenges associated with the integration of advanced deep learning systems
into video game development, and indicate key areas where further progress is
likely to be beneficial.
[LINK]
http://arxiv.org/abs/2412.14085v1
[DATE]
2024-12-19 01:32:27+08:00
[CATEGORIES]
cs.LG
Transformers Can Navigate Mazes With Multi-Step Prediction
[AUTHORS]
Niklas Nolte, Ouail Kitouni, Adina Williams, Mike Rabbat, Mark Ibrahim
[ABSTRACT]
Despite their remarkable success in language modeling, transformers trained
to predict the next token in a sequence struggle with long-term planning. This
limitation is particularly evident in tasks requiring foresight to plan
multiple steps ahead such as maze navigation. The standard next single token
prediction objective, however, offers no explicit mechanism to predict multiple
steps ahead - or revisit the path taken so far. Consequently, in this work we
study whether explicitly predicting multiple steps ahead (and backwards) can
improve transformers’ maze navigation. We train parameter-matched transformers
from scratch, under identical settings, to navigate mazes of varying types and
sizes with standard next token prediction and MLM-U, an objective explicitly
predicting multiple steps ahead and backwards. We find that MLM-U considerably
improves transformers’ ability to navigate mazes compared to standard next
token prediction across maze types and complexities. We also find MLM-U
training is 4x more sample efficient and converges 2x faster in terms of GPU
training hours relative to next token training. Finally, for more complex mazes
we find MLM-U benefits from scaling to larger transformers. Remarkably, we find
transformers trained with MLM-U outperform larger transformers trained with
next token prediction using additional supervision from A* search traces. We
hope these findings underscore the promise of learning objectives to advance
transformers’ capacity for long-term planning. The code can be found at
https://github.com/facebookresearch/maze_navigation_MLMU
[COMMENTS]
20 pages, 15 figures
[LINK]
http://arxiv.org/abs/2412.05117v2
[DATE]
2024-12-19 01:16:23+08:00
[CATEGORIES]
cs.LG
Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference
[AUTHORS]
Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane
[ABSTRACT]
While transformer models have been highly successful, they are
computationally inefficient. We observe that for each layer, the full width of
the layer may be needed only for a small subset of tokens inside a batch and
that the “effective” width needed to process a token can vary from layer to
layer. Motivated by this observation, we introduce the Adaptive Computation
Module (ACM), a generic module that dynamically adapts its computational load
to match the estimated difficulty of the input on a per-token basis. An ACM
consists of a sequence of learners that progressively refine the output of
their preceding counterparts. An additional gating mechanism determines the
optimal number of learners to execute for each token. We also propose a
distillation technique to replace any pre-trained model with an “ACMized”
variant. Our evaluation of transformer models in computer vision and speech
recognition demonstrates that substituting layers with ACMs significantly
reduces inference costs without degrading the downstream accuracy for a wide
interval of user-defined budgets.
[LINK]
http://arxiv.org/abs/2312.10193v2
[DATE]
2024-12-19 01:13:41+08:00
[CATEGORIES]
cs.LG
Deep hybrid models: infer and plan in a dynamic world
[AUTHORS]
Matteo Priorelli, Ivilin Peev Stoianov
[ABSTRACT]
In order to determine an optimal plan for a complex task, one often deals
with dynamic and hierarchical relationships between several entities.
Traditionally, such problems are tackled with optimal control, which relies on
the optimization of cost functions; instead, a recent biologically-motivated
proposal casts planning and control as an inference process. Active inference
assumes that action and perception are two complementary aspects of life
whereby the role of the former is to fulfill the predictions inferred by the
latter. In this study, we present a solution, based on active inference, for
complex control tasks. The proposed architecture exploits hybrid (discrete and
continuous) processing, and it is based on three features: the representation
of potential body configurations related to the objects of interest; the use of
hierarchical relationships that enable the agent to flexibly expand its body
schema for tool use; the definition of potential trajectories related to the
agent’s intentions, used to infer and plan with dynamic elements at different
temporal scales. We evaluate this deep hybrid model on a habitual task:
reaching a moving object after having picked a moving tool. We show that the
model can tackle the presented task under different conditions. This study
extends past work on planning as inference and advances an alternative
direction to optimal control.
[LINK]
http://arxiv.org/abs/2402.10088v3
[DATE]
2024-12-19 01:05:38+08:00
[CATEGORIES]
cs.LG
Neural Combinatorial Optimization for Stochastic Flexible Job Shop Scheduling Problems
[AUTHORS]
Igor G. Smit, Yaoxin Wu, Pavel Troubil, Yingqian Zhang, Wim P. M. Nuijten
[ABSTRACT]
Neural combinatorial optimization (NCO) has gained significant attention due
to the potential of deep learning to efficiently solve combinatorial
optimization problems. NCO has been widely applied to job shop scheduling
problems (JSPs) with the current focus predominantly on deterministic problems.
In this paper, we propose a novel attention-based scenario processing module
(SPM) to extend NCO methods for solving stochastic JSPs. Our approach
explicitly incorporates stochastic information by an attention mechanism that
captures the embedding of sampled scenarios (i.e., an approximation of
stochasticity). Fed with the embedding, the base neural network is intervened
by the attended scenarios, which accordingly learns an effective policy under
stochasticity. We also propose a training paradigm that works harmoniously with
either the expected makespan or Value-at-Risk objective. Results demonstrate
that our approach outperforms existing learning and non-learning methods for
the flexible JSP problem with stochastic processing times on a variety of
instances. In addition, our approach holds significant generalizability to
varied numbers of scenarios and disparate distributions.
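The two objectives named above, expected makespan and Value-at-Risk, can be estimated from a set of sampled scenarios roughly as follows. This is a minimal sketch, not the authors' training code: the function names and the empirical-quantile convention used for VaR are assumptions made for illustration.

```python
import math
from typing import Sequence


def expected_makespan(makespans: Sequence[float]) -> float:
    """Average makespan of one schedule over the sampled scenarios."""
    return sum(makespans) / len(makespans)


def value_at_risk(makespans: Sequence[float], alpha: float) -> float:
    """Empirical VaR: smallest sampled makespan m such that the
    fraction of scenarios with makespan <= m is at least alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(makespans)
    index = max(math.ceil(alpha * len(ordered)) - 1, 0)
    return ordered[index]


# Hypothetical makespans of a single schedule under five sampled scenarios.
scenario_makespans = [10.0, 12.0, 11.0, 20.0, 13.0]
print(expected_makespan(scenario_makespans))
print(value_at_risk(scenario_makespans, alpha=0.9))
```

The expectation rewards schedules that are good on average, while VaR penalizes schedules with a heavy tail of bad scenarios, which is why the two can rank the same schedules differently.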
[COMMENTS]
Accepted by the 39th Annual AAAI Conference on Artificial
Intelligence (AAAI-25)
[LINK]
http://arxiv.org/abs/2412.14052v1
[DATE]
2024-12-19 01:05:33+08:00
[CATEGORIES]
cs.LG
Evidential Deep Learning for Probabilistic Modelling of Extreme Storm Events
[AUTHORS]
Ayush Khot, Xihaier Luo, Ai Kagawa, Shinjae Yoo
[COMMENTS]
14 pages, 10 figures
[LINK]
http://arxiv.org/abs/2412.14048v1
[DATE]
2024-12-19 01:03:19+08:00
[CATEGORIES]
cs.LG
Spatio-Temporal SIR Model of Pandemic Spread During Warfare with Optimal Dual-use Healthcare System Administration using Deep Reinforcement Learning
[AUTHORS]
Adi Shuchami, Teddy Lazebnik
[LINK]
http://arxiv.org/abs/2412.14039v1
[DATE]
2024-12-19 00:54:27+08:00
[CATEGORIES]
cs.LG
Machine learning in wastewater treatment: insights from modelling a pilot denitrification reactor
[AUTHORS]
Eivind Bøhn, Sølve Eidnes, Kjell Rune Jonassen
[ABSTRACT]
Wastewater treatment plants are increasingly recognized as promising
candidates for machine learning applications, due to their societal importance
and high availability of data. However, their varied designs, operational
conditions, and influent characteristics hinder straightforward automation. In
this study, we use data from a pilot reactor at the Veas treatment facility in
Norway to explore how machine learning can be used to optimize biological
nitrate ($\mathrm{NO_3^-}$) reduction to molecular nitrogen ($\mathrm{N_2}$) in
the biogeochemical process known as \textit{denitrification}. Rather than
focusing solely on predictive accuracy, our approach prioritizes understanding
the foundational requirements for effective data-driven modelling of wastewater
treatment. Specifically, we aim to identify which process parameters are most
critical, the necessary data quantity and quality, how to structure data
effectively, and what properties are required by the models. We find that
nonlinear models perform best on the training and validation data sets,
indicating nonlinear relationships to be learned, but linear models transfer
better to the unseen test data, which comes later in time. The variable
measuring the water temperature has a particularly detrimental effect on the
models, owing to a significant change in distributions between training and
test data. We therefore conclude that multiple years of data are necessary to
learn robust machine learning models. By addressing foundational elements,
particularly in the context of the climatic variability faced by northern
regions, this work lays the groundwork for a more structured and tailored
approach to machine learning for wastewater treatment. We share publicly both
the data and code used to produce the results in the paper.
[LINK]
http://arxiv.org/abs/2412.14030v1
[DATE]
2024-12-19 00:49:23+08:00
[CATEGORIES]
cs.LG
Flow Exporter Impact on Intelligent Intrusion Detection Systems
[AUTHORS]
Daniela Pinto, João Vitorino, Eva Maia, Ivone Amorim, Isabel Praça
[ABSTRACT]
High-quality datasets are critical for training machine learning models, as
inconsistencies in feature generation can hinder the accuracy and reliability
of threat detection. For this reason, ensuring the quality of the data in
network intrusion detection datasets is important. A key component of this is
using reliable tools to generate the flows and features present in the
datasets. This paper investigates the impact of flow exporters on the
performance and reliability of machine learning models for intrusion detection.
Using HERA, a tool designed to export flows and extract features, the raw
network packets of two widely used datasets, UNSW-NB15 and CIC-IDS2017, were
processed from PCAP files to generate new versions of these datasets. These
were compared to the original ones in terms of their influence on the
performance of several models, including Random Forest, XGBoost, LightGBM, and
Explainable Boosting Machine. The results obtained were significant. Models
trained on the HERA version of the datasets consistently outperformed those
trained on the original dataset, showing improvements in accuracy and
indicating a better generalisation. This highlighted the importance of flow
generation in the model’s ability to differentiate between benign and malicious
traffic.
[COMMENTS]
9 pages, 10 tables, ICISSP 2025 conference
[LINK]
http://arxiv.org/abs/2412.14021v1
[DATE]
2024-12-19 00:38:20+08:00
[CATEGORIES]
cs.LG
Deep Representation Learning for Forecasting Recursive and Multi-Relational Events in Temporal Networks
[AUTHORS]
Tony Gracious, Ambedkar Dukkipati
[ABSTRACT]
Understanding relations arising out of interactions among entities can be
very difficult, and predicting them is even more challenging. This problem has
many applications in various fields, such as financial networks and e-commerce.
These relations can be far more complex than merely involving more than two
entities. One such scenario is evolving recursive relations between
multiple entities, and so far, this is still an open problem. This work
addresses the problem of forecasting higher-order interaction events that can
be multi-relational and recursive. We pose the problem in the framework of
representation learning of temporal hypergraphs that can capture complex
relationships involving multiple entities. The proposed model,
\textit{Relational Recursive Hyperedge Temporal Point Process} (RRHyperTPP),
uses an encoder that learns a dynamic node representation based on the
historical interaction patterns and then a hyperedge link prediction-based
decoder to model the occurrence of interaction events. These learned
representations are then used for downstream tasks involving forecasting the
type and time of interactions. The main challenge in learning from hyperedge
events is that the number of possible hyperedges grows exponentially with the
number of nodes in the network. This will make the computation of negative
log-likelihood of the temporal point process expensive, as the calculation of
survival function requires a summation over all possible hyperedges. In our
work, we develop a noise contrastive estimation method to learn the parameters
of our model, and we have experimentally shown that our models perform better
than previous state-of-the-art methods for interaction forecasting.
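The cost mentioned above can be seen in the generic form of the negative log-likelihood of a marked temporal point process. The notation below is an assumption for illustration and is not taken from the paper: $\lambda_h(t)$ is the conditional intensity of hyperedge $h$, $(t_i, h_i)$ for $i = 1, \dots, n$ are the observed events in $[0, T]$, and $\mathcal{H}$ is the set of all candidate hyperedges.

```latex
-\log \mathcal{L}
  \;=\; -\sum_{i=1}^{n} \log \lambda_{h_i}(t_i)
  \;+\; \sum_{h \in \mathcal{H}} \int_{0}^{T} \lambda_{h}(t)\,\mathrm{d}t
```

The second term comes from the survival function and sums over every $h \in \mathcal{H}$. Since $|\mathcal{H}|$ grows exponentially with the number of nodes, evaluating it exactly becomes impractical, which is the motivation given for a noise contrastive estimation objective.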
[COMMENTS]
AAAI-2025
[LINK]
http://arxiv.org/abs/2404.17943v2
[DATE]
2024-12-19 00:33:51+08:00
[CATEGORIES]
cs.LG
Adversarial Robustness of Link Sign Prediction in Signed Graphs
[AUTHORS]
Jialong Zhou, Xing Ai, Yuni Lai, Tomasz Michalak, Gaolei Li, Jianhua Li, Kai Zhou
[ABSTRACT]
Signed graphs serve as fundamental data structures for representing positive
and negative relationships in social networks, with signed graph neural
networks (SGNNs) emerging as the primary tool for their analysis. Our
investigation reveals that balance theory, while essential for modeling signed
relationships in SGNNs, inadvertently introduces exploitable vulnerabilities to
black-box attacks. To demonstrate this vulnerability, we propose
balance-attack, a novel adversarial strategy specifically designed to
compromise graph balance degree, and develop an efficient heuristic algorithm
to solve the associated NP-hard optimization problem. While existing approaches
attempt to restore attacked graphs through balance learning techniques, they
face a critical challenge we term “Irreversibility of Balance-related
Information,” where restored edges fail to align with original attack targets.
To address this limitation, we introduce Balance Augmented-Signed Graph
Contrastive Learning (BA-SGCL), an innovative framework that combines
contrastive learning with balance augmentation techniques to achieve robust
graph representations. By maintaining high balance degree in the latent space,
BA-SGCL effectively circumvents the irreversibility challenge and enhances
model resilience. Extensive experiments across multiple SGNN architectures and
real-world datasets demonstrate both the effectiveness of our proposed
balance-attack and the superior robustness of BA-SGCL, advancing the security
and reliability of signed graph analysis in social networks. Datasets and codes
of the proposed framework are at the GitHub repository
https://anonymous.4open.science/r/BA-SGCL-submit-DF41/.
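To make "balance degree" concrete, here is a minimal sketch of one common triangle-based definition: the fraction of triangles whose edge signs multiply to a positive value. The abstract does not give the paper's exact definition, so the function name, the edge-list input format, and the choice of this particular measure are assumptions.

```python
from itertools import combinations
from typing import Hashable, Iterable, Tuple


def triangle_balance_degree(
    signed_edges: Iterable[Tuple[Hashable, Hashable, int]]
) -> float:
    """Fraction of triangles that are balanced (product of signs > 0).

    signed_edges holds (u, v, sign) triples with sign in {+1, -1}.
    Returns 1.0 for a graph without triangles.
    """
    sign = {}
    nodes = set()
    for u, v, s in signed_edges:
        sign[frozenset((u, v))] = s
        nodes.update((u, v))

    balanced = total = 0
    # Brute-force enumeration of node triples: fine for a small example,
    # far too slow for real-world graphs.
    for a, b, c in combinations(sorted(nodes), 3):
        keys = (frozenset((a, b)), frozenset((b, c)), frozenset((a, c)))
        if all(k in sign for k in keys):
            total += 1
            if sign[keys[0]] * sign[keys[1]] * sign[keys[2]] > 0:
                balanced += 1
    return balanced / total if total else 1.0


# Flipping a single edge sign turns a balanced triangle into an unbalanced one.
print(triangle_balance_degree([(1, 2, 1), (2, 3, 1), (1, 3, 1)]))
print(triangle_balance_degree([(1, 2, 1), (2, 3, 1), (1, 3, -1)]))
```

Under this definition, an attack that targets balance would look for the sign flips that lower the ratio the most within a perturbation budget.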
[LINK]
http://arxiv.org/abs/2401.10590v2
[DATE]
2024-12-19 00:33:32+08:00
[CATEGORIES]
cs.LG
FedSTaS: Client Stratification and Client Level Sampling for Efficient Federated Learning
[AUTHORS]
Jordan Slessor, Dezheng Kong, Xiaofen Tang, Zheng En Than, Linglong Kong
[ABSTRACT]
Federated learning (FL) is a machine learning methodology that involves the
collaborative training of a global model across multiple decentralized clients
in a privacy-preserving way. Several FL methods have been introduced to tackle
communication inefficiencies but do not address how to sample participating
clients in each round effectively and in a privacy-preserving manner. In this
paper, we propose \textit{FedSTaS}, a client and data-level sampling method
inspired by \textit{FedSTS} and \textit{FedSampling}. In each federated
learning round, \textit{FedSTaS} stratifies clients based on their compressed
gradients, re-allocates the number of clients to sample using an optimal Neyman
allocation, and samples local data from each participating client using a data
uniform sampling strategy. Experiments on three datasets show that
\textit{FedSTaS} can achieve higher accuracy scores than those of
\textit{FedSTS} within a fixed number of training rounds.
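The Neyman allocation step can be sketched as follows. In its textbook form, a stratum with $N_h$ members and within-stratum standard deviation $S_h$ receives a share of the sampling budget proportional to $N_h S_h$. What FedSTaS uses as the per-stratum variability measure is not stated in the abstract, so the inputs, the function name, and the rounding rule below are assumptions.

```python
from typing import List, Sequence


def neyman_allocation(
    sizes: Sequence[int], stds: Sequence[float], budget: int
) -> List[int]:
    """Split `budget` samples across strata in proportion to N_h * S_h."""
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # No variability anywhere: fall back to proportional allocation.
        weights = [float(n) for n in sizes]
        total = sum(weights)

    raw = [budget * w / total for w in weights]
    alloc = [int(r) for r in raw]  # round down first

    # Give the leftover samples to the strata with the largest remainders.
    leftover = budget - sum(alloc)
    order = sorted(range(len(raw)), key=lambda h: raw[h] - alloc[h], reverse=True)
    for h in order[:leftover]:
        alloc[h] += 1
    return alloc


# Hypothetical example: three client strata and a budget of 10 clients per round.
print(neyman_allocation(sizes=[100, 50, 50], stds=[1.0, 2.0, 4.0], budget=10))
```

This sketch does not cap an allocation at the stratum size. A practical version would clip each count to $N_h$ and redistribute the excess.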
[COMMENTS]
6 pages, 3 figures, to be submitted to ICML
[LINK]
http://arxiv.org/abs/2412.14226v1
[DATE]
2024-12-19 00:31:34+08:00
[CATEGORIES]
cs.LG
Airfoil Diffusion: Denoising Diffusion Model For Conditional Airfoil Generation
[AUTHORS]
Reid Graves, Amir Barati Farimani
[ABSTRACT]
The design of aerodynamic shapes, such as airfoils, has traditionally
required significant computational resources and relied on predefined design
parameters, which limit the potential for novel shape synthesis. In this work,
we introduce a data-driven methodology for airfoil generation using a diffusion
model. Trained on a dataset of preexisting airfoils, our model can generate an
arbitrary number of new airfoils from random vectors, which can be conditioned
on specific aerodynamic performance metrics such as lift and drag, or geometric
criteria. Our results demonstrate that the diffusion model effectively produces
airfoil shapes with realistic aerodynamic properties, offering substantial
improvements in efficiency, flexibility, and the potential for discovering
innovative airfoil designs. This approach significantly expands the design
space, facilitating the synthesis of high-performance aerodynamic shapes that
transcend the limitations of traditional methods.
[COMMENTS]
20 Pages, 6 figures
[LINK]
http://arxiv.org/abs/2408.15898v3
[DATE]
2024-12-19 00:29:07+08:00
[CATEGORIES]
cs.LG
Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes
[AUTHORS]
Katarzyna Kobalczyk, Claudio Fanconi, Hao Sun, Mihaela van der Schaar
[ABSTRACT]
As large language models (LLMs) become increasingly embedded in everyday
applications, ensuring their alignment with the diverse preferences of
individual users has become a critical challenge. Currently deployed approaches
typically assume homogeneous user objectives and rely on single-objective
fine-tuning. However, human preferences are inherently heterogeneous,
influenced by various unobservable factors, leading to conflicting signals in
preference data. Existing solutions addressing this diversity often require
costly datasets labelled for specific objectives and involve training multiple
reward models or LLM policies, which is computationally expensive and
impractical. In this work, we present a novel framework for few-shot steerable
alignment, where users’ underlying preferences are inferred from a small sample
of their choices. To achieve this, we extend the Bradley-Terry-Luce model to
handle heterogeneous preferences with unobserved variability factors and
propose its practical implementation for reward modelling and LLM fine-tuning.
Thanks to our proposed approach of functional parameter-space conditioning,
LLMs trained with our framework can be adapted to individual preferences at
inference time, generating outputs over a continuum of behavioural modes. We
empirically validate the effectiveness of our methods, demonstrating their ability
to capture and align with diverse human preferences in a data-efficient manner.
Our code is made available at:
https://github.com/kasia-kobalczyk/few-shot-steerable-alignment.
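For reference, the standard Bradley-Terry-Luce model that the abstract builds on can be sketched as below. This covers only the homogeneous case, where a single scalar reward per response determines the preference probability. The paper's extension to unobserved variability factors is not reproduced, and the function names are illustrative.

```python
import math
from typing import Iterable, Tuple


def btl_preference_prob(reward_a: float, reward_b: float) -> float:
    """P(a preferred over b) = exp(r_a) / (exp(r_a) + exp(r_b))
    = sigmoid(r_a - r_b), computed in a numerically stable way."""
    diff = reward_a - reward_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def btl_negative_log_likelihood(pairs: Iterable[Tuple[float, float]]) -> float:
    """Loss over (reward_chosen, reward_rejected) pairs, as commonly
    used to fit a reward model to pairwise preference data."""
    return -sum(math.log(btl_preference_prob(rc, rr)) for rc, rr in pairs)


# Equal rewards give a coin flip; a larger reward gap gives a stronger preference.
print(btl_preference_prob(1.0, 1.0))
print(btl_preference_prob(3.0, 0.0))
```

Because the probability depends only on the reward difference, users with conflicting preferences cannot be represented by one shared reward function, which is the limitation the abstract points to.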
[LINK]
http://arxiv.org/abs/2412.13998v1
[DATE]
2024-12-19 00:14:59+08:00
[CATEGORIES]
cs.LG
Modality-Independent Graph Neural Networks with Global Transformers for Multimodal Recommendation
[AUTHORS]
Jun Hu, Bryan Hooi, Bingsheng He, Yinwei Wei
[ABSTRACT]
Multimodal recommendation systems can learn users’ preferences from existing
user-item interactions as well as the semantics of multimodal data associated
with items. Many existing methods model this through a multimodal user-item
graph, approaching multimodal recommendation as a graph learning task. Graph
Neural Networks (GNNs) have shown promising performance in this domain. Prior
research has capitalized on GNNs’ capability to capture neighborhood
information within certain receptive fields (typically denoted by the number of
hops, $K$) to enrich user and item semantics. We observe that the optimal
receptive fields for GNNs can vary across different modalities. In this paper,
we propose GNNs with Modality-Independent Receptive Fields, which employ
separate GNNs with independent receptive fields for different modalities to
enhance performance. Our results indicate that the optimal $K$ for certain
modalities on specific datasets can be as low as 1 or 2, which may restrict the
GNNs’ capacity to capture global information. To address this, we introduce a
Sampling-based Global Transformer, which utilizes uniform global sampling to
effectively integrate global information for GNNs. We conduct comprehensive
experiments that demonstrate the superiority of our approach over existing
methods. Our code is publicly available at
https://github.com/CrawlScript/MIG-GT.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13994v1
[DATE]
2024-12-19 00:12:26+08:00
[CATEGORIES]
cs.LG
Variance-based loss function for improved regularization
[AUTHORS]
John M. Hanna, Irene E. Vignon-Clemental
[LINK]
http://arxiv.org/abs/2412.13993v1
[DATE]
2024-12-19 00:11:45+08:00
[CATEGORIES]
cs.LG
Risk-averse Total-reward MDPs with ERM and EVaR
[AUTHORS]
Xihong Su, Julien Grand-Clément, Marek Petrik
[ABSTRACT]
Optimizing risk-averse objectives in discounted MDPs is challenging because
most models do not admit direct dynamic programming equations and require
complex history-dependent policies. In this paper, we show that the risk-averse
{\em total reward criterion}, under the Entropic Risk Measure (ERM) and
Entropic Value at Risk (EVaR) risk measures, can be optimized by a stationary
policy, making it simple to analyze, interpret, and deploy. We propose
exponential value iteration, policy iteration, and linear programming to
compute optimal policies. Compared with prior work, our results only require
the relatively mild condition of transient MDPs and allow for {\em both}
positive and negative rewards. Our results indicate that the total reward
criterion may be preferable to the discounted criterion in a broad range of
risk-averse reinforcement learning domains.
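As a small illustration of the Entropic Risk Measure, the sketch below evaluates it for a discrete reward distribution using the common reward-based convention $\mathrm{ERM}_\beta[X] = -\beta^{-1} \log \mathbb{E}[e^{-\beta X}]$ with $\beta > 0$. Sign and parameter conventions vary across the literature, so treat this form and the function name as assumptions. EVaR, which is built on top of ERM by optimizing over $\beta$, is not shown.

```python
import math
from typing import Sequence


def entropic_risk(
    rewards: Sequence[float], probs: Sequence[float], beta: float
) -> float:
    """ERM_beta[X] = -(1 / beta) * log E[exp(-beta * X)] for discrete X."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Log-sum-exp trick to avoid overflow for large |beta * x|.
    exponents = [math.log(p) - beta * x for x, p in zip(rewards, probs) if p > 0]
    m = max(exponents)
    log_mgf = m + math.log(sum(math.exp(e - m) for e in exponents))
    return -log_mgf / beta


# A 50/50 gamble between rewards 0 and 10 has mean 5, but a risk-averse
# agent values it less, and increasingly so as beta grows.
for beta in (0.01, 0.1, 1.0):
    print(beta, entropic_risk([0.0, 10.0], [0.5, 0.5], beta))
```

For a deterministic reward the measure returns the reward itself, and as $\beta$ approaches zero it approaches the plain expectation.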
[LINK]
http://arxiv.org/abs/2408.17286v2
[DATE]
2024-12-19 00:10:18+08:00
[CATEGORIES]
cs.LG
RAG for Effective Supply Chain Security Questionnaire Automation
[AUTHORS]
Zaynab Batool Reza, Abdul Rafay Syed, Omer Iqbal, Ethel Mensah, Qian Liu, Maxx Richard Rahman, Wolfgang Maass
[ABSTRACT]
In an era where digital security is crucial, efficient processing of
security-related inquiries through supply chain security questionnaires is
imperative. This paper introduces a novel approach using Natural Language
Processing (NLP) and Retrieval-Augmented Generation (RAG) to automate these
responses. We developed QuestSecure, a system that interprets diverse document
formats and generates precise responses by integrating large language models
(LLMs) with an advanced retrieval system. Our experiments show that QuestSecure
significantly improves response accuracy and operational efficiency. By
employing advanced NLP techniques and tailored retrieval mechanisms, the system
consistently produces contextually relevant and semantically rich responses,
reducing cognitive load on security teams and minimizing potential errors. This
research offers promising avenues for automating complex security management
tasks, enhancing organizational security processes.
[LINK]
http://arxiv.org/abs/2412.13988v1
[DATE]
2024-12-19 00:07:32+08:00
[CATEGORIES]
cs.LG
A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data
[AUTHORS]
Andrej Tschalzev, Sascha Marton, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt
[ABSTRACT]
Tabular data is prevalent in real-world machine learning applications, and
new models for supervised learning of tabular data are frequently proposed.
Comparative studies assessing the performance of models typically consist of
model-centric evaluation setups with overly standardized data preprocessing.
This paper demonstrates that such model-centric evaluations are biased, as
real-world modeling pipelines often require dataset-specific preprocessing and
feature engineering. Therefore, we propose a data-centric evaluation framework.
We select 10 relevant datasets from Kaggle competitions and implement
expert-level preprocessing pipelines for each dataset. We conduct experiments
with different preprocessing pipelines and hyperparameter optimization (HPO)
regimes to quantify the impact of model selection, HPO, feature engineering,
and test-time adaptation. Our main findings are: 1. After dataset-specific
feature engineering, model rankings change considerably, performance
differences decrease, and the importance of model selection reduces. 2. Recent
models, despite their measurable progress, still significantly benefit from
manual feature engineering. This holds true for both tree-based models and
neural networks. 3. While tabular data is typically considered static, samples
are often collected over time, and adapting to distribution shifts can be
important even in supposedly static data. These insights suggest that research
efforts should be directed toward a data-centric perspective, acknowledging
that tabular data requires feature engineering and often exhibits temporal
characteristics. Our framework is available under:
https://github.com/atschalz/dc_tabeval.
[LINK]
http://arxiv.org/abs/2407.02112v3
[DATE]
2024-12-19 00:07:04+08:00
[CATEGORIES]
cs.LG
LeStrat-Net: Lebesgue style stratification for Monte Carlo simulations powered by machine learning
[AUTHORS]
Kayoung Ban, Myeonghun Park, Raymundo Ramos
[ABSTRACT]
We develop a machine learning algorithm to turn around stratification in
Monte Carlo sampling. We use a different way to divide the domain space of the
integrand, based on the height of the function being sampled, similar to what
is done in Lebesgue integration. This means that isocontours of the function
define regions that can have any shape depending on the behavior of the
function. We take advantage of the capacity of neural networks to learn
complicated functions in order to predict these complicated divisions and
preclassify large samples of the domain space. From this preclassification we
can select the required number of points to perform a number of tasks such as
variance reduction, integration and even event selection. The network
ultimately defines the regions with what it learned and is also used to
calculate the multi-dimensional volume of each region.
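The idea of dividing the domain by the height of the function can be sketched without a neural network. The example below evaluates the function directly instead of using a learned classifier, assigns uniform samples from the unit hypercube to regions bounded by height thresholds, and estimates each region's volume from the sample fractions. The function name, the thresholds, and the test function are all illustrative assumptions.

```python
import bisect
import random
from typing import Callable, List, Sequence, Tuple


def stratify_by_height(
    f: Callable[[Sequence[float]], float],
    thresholds: Sequence[float],
    n_samples: int,
    dim: int,
    seed: int = 0,
) -> Tuple[List[float], List[float]]:
    """Return (volumes, mean_heights) for the regions
    {x : thresholds[k-1] <= f(x) < thresholds[k]} inside [0, 1]^dim.

    `thresholds` must be sorted in ascending order.
    """
    rng = random.Random(seed)
    n_regions = len(thresholds) + 1
    counts = [0] * n_regions
    sums = [0.0] * n_regions
    for _ in range(n_samples):
        x = [rng.random() for _ in range(dim)]
        y = f(x)
        k = bisect.bisect_right(thresholds, y)  # region index from the height
        counts[k] += 1
        sums[k] += y
    # The unit hypercube has volume 1, so sample fractions estimate volumes.
    volumes = [c / n_samples for c in counts]
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return volumes, means


# A peaked 2D function: the high region is a small disc around the centre.
peak = lambda x: 1.0 / (0.01 + (x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2)
volumes, means = stratify_by_height(peak, thresholds=[5.0, 20.0], n_samples=20000, dim=2)
print(volumes, means)
```

With regions defined this way, the spread of function values inside each region is bounded by the threshold spacing, which is what makes per-region sampling attractive for variance reduction.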
[COMMENTS]
44 pages, 17 figures
[LINK]
http://arxiv.org/abs/2412.13982v1
[DATE]
2024-12-19 00:03:37+08:00
[CATEGORIES]
cs.LG
Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation
[AUTHORS]
Jia Gu, Liang Pang, Huawei Shen, Xueqi Cheng
[ABSTRACT]
With the rapid advancement of large language models (LLMs) for handling
complex language tasks, an increasing number of studies are employing LLMs as
agents to emulate the sequential decision-making processes of humans often
represented as Markov decision-making processes (MDPs). The actions in MDPs
adhere to specific probability distributions and require iterative sampling.
This arouses curiosity regarding the capacity of LLM agents to comprehend
probability distributions, thereby guiding the agent’s behavioral
decision-making through probabilistic sampling and generating behavioral
sequences. To answer the above question, we divide the problem into two main
aspects: sequence simulation with known probability distribution and sequence
simulation with unknown probability distribution. Our analysis indicates that
LLM agents can understand probabilities, but they struggle with probability
sampling. Their ability to perform probabilistic sampling can be improved to
some extent by integrating coding tools, but this level of sampling precision
still makes it difficult to simulate human behavior as agents.
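
One way to check whether a generated action sequence follows a known distribution is to compare empirical frequencies with the target. The sketch below uses total variation distance; the abstract does not state which measure the authors use, so the metric and the name `total_variation` are assumptions made for illustration.

```python
from collections import Counter


def total_variation(samples, target):
    """Total variation distance between the empirical distribution of
    `samples` and a `target` dict mapping each action to its probability."""
    counts = Counter(samples)
    n = len(samples)
    support = set(target) | set(counts)
    return 0.5 * sum(
        abs(counts.get(action, 0) / n - target.get(action, 0.0))
        for action in support
    )


# Usage: a sequence that always picks "a" against a fair two-way target.
distance = total_variation(["a", "a", "a", "a"], {"a": 0.5, "b": 0.5})
```

A distance of 0 means the sampled frequencies match the target exactly, and a distance of 1 means the two distributions share no probability mass.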
[COMMENTS]
The 31st International Conference on Computational Linguistics
(COLING 2025)
[LINK]
http://arxiv.org/abs/2404.09043v3
[DATE]
2024-12-18 23:56:29+08:00
[CATEGORIES]
cs.CL
Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
[AUTHORS]
Jingze Shi, Bingheng Wu
[ABSTRACT]
In order to make the foundation model more efficient and effective, our idea
is combining sequence transformation and state transformation. First, we prove
the availability of rotary position embedding in the state space duality
algorithm, which reduces the perplexity of the hybrid quadratic causal
self-attention and state space duality by more than 4%, to ensure that the
combining sequence transformation unifies position encoding. Second, we propose
dynamic mask attention, which maintains 100% accuracy in the more challenging
multi-query associative recall task, improving by more than 150% compared to
quadratic causal self-attention and state space duality, to ensure that the
combining sequence transformation selectively filters relevant information.
Third, we design cross domain mixture of experts, which makes the computational
speed of expert retrieval with more than 1024 experts 8 to 10 times faster than
the mixture of experts, to ensure that the combining state transformation
quickly retrieves the mixture. Finally, we summarize these matrix algorithms that
can form the foundation model: Wonderful Matrices, which can be a competitor to
popular model architectures.
[COMMENTS]
The code is open-sourced at
https://github.com/LoserCheems/WonderfulMatrices
[LINK]
http://arxiv.org/abs/2412.11834v2
[DATE]
2024-12-18 23:53:26+08:00
[CATEGORIES]
cs.LG
cs.CL
Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence
[AUTHORS]
Jinghan He, Kuan Zhu, Haiyun Guo, Junfeng Fang, Zhenglin Hua, Yuheng Jia, Ming Tang, Tat-Seng Chua, Jinqiao Wang
[ABSTRACT]
Large vision-language models (LVLMs) have made substantial progress in
integrating large language models (LLMs) with visual inputs, enabling advanced
multimodal reasoning. Despite their success, a persistent challenge is
hallucination, where generated text fails to accurately reflect visual
content, undermining both accuracy and reliability. Existing methods focus on
alignment training or decoding refinements but primarily address symptoms at
the generation stage without probing the underlying causes. In this work, we
investigate the internal mechanisms driving hallucination in LVLMs, with an
emphasis on the multi-head attention module. Specifically, we introduce
Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of
attention head outputs to visual context. Based on this, our findings reveal
the presence of vision-aware attention heads that are more attuned to visual
information; however, the model’s overreliance on its prior language patterns
is closely related to hallucinations. Building on these insights, we propose
Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate
hallucination by enhancing the role of vision-aware attention heads. Extensive
experiments demonstrate that our method achieves superior performance compared
to state-of-the-art approaches in mitigating hallucinations, while maintaining
high efficiency with negligible additional time overhead.
[LINK]
http://arxiv.org/abs/2412.13949v1
[DATE]
2024-12-18 23:29:30+08:00
[CATEGORIES]
cs.CL
Persona-SQ: A Personalized Suggested Question Generation Framework For Real-world Documents
[AUTHORS]
Zihao Lin, Zichao Wang, Yuanting Pan, Varun Manjunatha, Ryan Rossi, Angela Lau, Lifu Huang, Tong Sun
[ABSTRACT]
Suggested questions (SQs) provide an effective initial interface for users to
engage with their documents in AI-powered reading applications. In practical
reading sessions, users have diverse backgrounds and reading goals, yet current
SQ features typically ignore such user information, resulting in homogeneous or
ineffective questions. We introduce a pipeline that generates personalized SQs
by incorporating reader profiles (professions and reading goals) and
demonstrate its utility in two ways: 1) as an improved SQ generation pipeline
that produces higher quality and more diverse questions compared to current
baselines, and 2) as a data generator to fine-tune extremely small models that
perform competitively with much larger models on SQ generation. Our approach
can not only serve as a drop-in replacement in current SQ systems to
immediately improve their performance but also help develop on-device SQ models
that can run locally to deliver fast and private SQ experience.
[COMMENTS]
38 pages, 26 figures
[LINK]
http://arxiv.org/abs/2412.12445v2
[DATE]
2024-12-18 23:28:43+08:00
[CATEGORIES]
cs.CL
Language verY Rare for All
[AUTHORS]
Ibrahim Merad, Amos Wolf, Ziad Mazzawi, Yannick Léo
[ABSTRACT]
In the quest to overcome language barriers, encoder-decoder models like NLLB
have expanded machine translation to rare languages, with some models (e.g.,
NLLB 1.3B) even trainable on a single GPU. While general-purpose LLMs perform
well in translation, open LLMs prove highly competitive when fine-tuned for
specific tasks involving unknown corpora. We introduce LYRA (Language verY Rare
for All), a novel approach that combines open LLM fine-tuning,
retrieval-augmented generation (RAG), and transfer learning from related
high-resource languages. This study is exclusively focused on single-GPU
training to facilitate ease of adoption. Our study focuses on two-way
translation between French and Monégasque, a rare language unsupported by
existing translation tools due to limited corpus availability. Our results
demonstrate LYRA’s effectiveness, frequently surpassing and consistently
matching state-of-the-art encoder-decoder models in rare language translation.
[LINK]
http://arxiv.org/abs/2412.13924v1
[DATE]
2024-12-18 23:07:23+08:00
[CATEGORIES]
cs.CL
cs.LG
Towards Deployable OCR models for Indic languages
[AUTHORS]
Minesh Mathew, Ajoy Mondal, CV Jawahar
[ABSTRACT]
Recognition of text on word or line images, without the need for sub-word
segmentation, has become the mainstream of research and development of text
recognition for Indian languages. Modelling unsegmented sequences using
Connectionist Temporal Classification (CTC) is the most commonly used approach
for segmentation-free OCR. In this work we present a comprehensive empirical
study of various neural network models that use CTC for transcribing step-wise
predictions in the neural network output to a Unicode sequence. The study is
conducted for 13 Indian languages, using an internal dataset that has around
1000 pages per language. We study the choice of line vs word as the recognition
unit, and use of synthetic data to train the models. We compare our models with
popular publicly available OCR tools for end-to-end document image recognition.
Our end-to-end pipeline that employs our recognition models and existing text
segmentation tools outperforms these public OCR tools for 8 out of the 13
languages. We also introduce a new public dataset called Mozhi for word and
line recognition in Indian languages. The dataset contains more than 1.2 million
annotated word images (120 thousand text lines) across 13 Indian languages. Our
code, trained models and the Mozhi dataset will be made available at
http://cvit.iiit.ac.in/research/projects/cvit-projects/
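
The step of turning per-step CTC predictions into an output sequence can be sketched with standard greedy (best-path) decoding: merge consecutive repeats, then drop the blank label. This is a generic illustration, not the authors' code; the function name, the choice of `0` as the blank index, and the tiny `id_to_char` table are hypothetical.

```python
def ctc_greedy_decode(step_labels, blank=0):
    """Collapse per-step argmax labels into a label sequence:
    merge consecutive repeats, then remove blank labels."""
    decoded = []
    previous = None
    for label in step_labels:
        # Emit a label only when it differs from the previous step
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Usage: hypothetical label-to-character table for a Unicode script.
id_to_char = {1: "\u0915", 2: "\u092e"}  # Devanagari KA and MA
labels = ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0])
text = "".join(id_to_char[i] for i in labels)
```

The blank between the two runs of label 1 is what allows the same character to appear twice in a row in the output.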
[COMMENTS]
presented at ICPR 2024;
https://link.springer.com/chapter/10.1007/978-3-031-78495-8_11
[LINK]
http://arxiv.org/abs/2205.06740v2
[DATE]
2024-12-18 22:41:53+08:00
[CATEGORIES]
cs.CL
PersonaGym: Evaluating Persona Agents and LLMs
[AUTHORS]
Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, Vishvak Murahari
[ABSTRACT]
Persona agents, which are LLM agents that act according to an assigned
persona, have demonstrated impressive contextual response capabilities across
various applications. These persona agents offer significant enhancements
across diverse sectors, such as education, healthcare, and entertainment, where
model developers can align agent responses to different user requirements
thereby broadening the scope of agent applications. However, evaluating persona
agent performance is incredibly challenging due to the complexity of assessing
persona adherence in free-form interactions across various environments that
are relevant to each persona agent. We introduce PersonaGym, the first dynamic
evaluation framework for assessing persona agents, and PersonaScore, the first
automated human-aligned metric grounded in decision theory for comprehensive
large-scale evaluation of persona agents. Our evaluation of 6 open and
closed-source LLMs, using a benchmark encompassing 200 personas and 10,000
questions, reveals significant opportunities for advancement in persona agent
capabilities across state-of-the-art models. For example, Claude 3.5 Sonnet
only has a 2.97% relative improvement in PersonaScore over GPT 3.5 despite
being a much more advanced model. Importantly, we find that increased model
size and complexity do not necessarily imply enhanced persona agent
capabilities thereby highlighting the pressing need for algorithmic and
architectural invention towards faithful and performant persona agents.
[COMMENTS]
21 pages, 5 figures
[LINK]
http://arxiv.org/abs/2407.18416v3
[DATE]
2024-12-18 22:25:08+08:00
[CATEGORIES]
cs.CL
cs.LG
Understanding and Analyzing Model Robustness and Knowledge-Transfer in Multilingual Neural Machine Translation using TX-Ray
[AUTHORS]
Vageesh Saxena, Sharid Loáiciga, Nils Rethmeier
[ABSTRACT]
Neural networks have demonstrated significant advancements in Neural Machine
Translation (NMT) compared to conventional phrase-based approaches. However,
Multilingual Neural Machine Translation (MNMT) in extremely low-resource
settings remains underexplored. This research investigates how knowledge
transfer across languages can enhance MNMT in such scenarios. Using the Tatoeba
translation challenge dataset from Helsinki NLP, we perform English-German,
English-French, and English-Spanish translations, leveraging minimal parallel
data to establish cross-lingual mappings. Unlike conventional methods relying
on extensive pre-training for specific language pairs, we pre-train our model
on English-English translations, setting English as the source language for all
tasks. The model is fine-tuned on target language pairs using joint multi-task
and sequential transfer learning strategies. Our work addresses three key
questions: (1) How can knowledge transfer across languages improve MNMT in
extremely low-resource scenarios? (2) How does pruning neuron knowledge affect
model generalization, robustness, and catastrophic forgetting? (3) How can
TX-Ray interpret and quantify knowledge transfer in trained models? Evaluation
using BLEU-4 scores demonstrates that sequential transfer learning outperforms
baselines on a 40k parallel sentence corpus, showcasing its efficacy. However,
pruning neuron knowledge degrades performance, increases catastrophic
forgetting, and fails to improve robustness or generalization. Our findings
provide valuable insights into the potential and limitations of knowledge
transfer and pruning in MNMT for extremely low-resource settings.
[COMMENTS]
103 pages, Master’s thesis
[LINK]
http://arxiv.org/abs/2412.13881v1
[DATE]
2024-12-18 22:21:58+08:00
[CATEGORIES]
cs.CL
Crabs: Consuming Resource via Auto-generation for LLM-DoS Attack under Black-box Settings
[AUTHORS]
Yuanhe Zhang, Zhenhong Zhou, Wei Zhang, Xinyue Wang, Xiaojun Jia, Yang Liu, Sen Su
[ABSTRACT]
Large Language Models (LLMs) have demonstrated remarkable performance across
diverse tasks. LLMs continue to be vulnerable to external threats, particularly
Denial-of-Service (DoS) attacks. Specifically, LLM-DoS attacks aim to exhaust
computational resources and block services. However, prior works tend to focus
on performing white-box attacks, overlooking black-box settings. In this work,
we propose an automated algorithm designed for black-box LLMs, called
Auto-Generation for LLM-DoS Attack (AutoDoS). AutoDoS introduces DoS Attack
Tree and optimizes the prompt node coverage to enhance effectiveness under
black-box conditions. Our method can bypass existing defense with enhanced
stealthiness via semantic improvement of prompt nodes. Furthermore, we reveal
that implanting Length Trojan in Basic DoS Prompt aids in achieving higher
attack efficacy. Experimental results show that AutoDoS amplifies service
response latency by over 250$\times$, leading to severe resource
consumption in terms of GPU utilization and memory usage. Our code is available
at https://github.com/shuita2333/AutoDoS.
[COMMENTS]
20 pages, 7 figures, 11 tables
[LINK]
http://arxiv.org/abs/2412.13879v1
[DATE]
2024-12-18 22:19:23+08:00
[CATEGORIES]
cs.CL
Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model
[AUTHORS]
Yuzhong Hong, Hanshan Zhang, Junwei Bao, Hongfei Jiang, Yang Song
[ABSTRACT]
Since the debut of DPO, it has been shown that aligning a target LLM with
human preferences via the KL-constrained RLHF loss is mathematically equivalent
to a special kind of reward modeling task. Concretely, the task requires: 1)
using the target LLM to parameterize the reward model, and 2) tuning the reward
model so that it has a 1:1 linear relationship with the true reward. However,
we identify a significant issue: the DPO loss might have multiple minimizers,
of which only one satisfies the required linearity condition. The problem
arises from a well-known issue of the underlying Bradley-Terry preference
model: it does not always have a unique maximum likelihood estimator (MLE).
Consequently, the minimizer of the RLHF loss might be unattainable because it is
merely one among many minimizers of the DPO loss. As a better alternative, we
propose an energy-based model (EBM) that always has a unique MLE, inherently
satisfying the linearity requirement. To approximate the MLE in practice, we
propose a contrastive loss named Energy Preference Alignment (EPA), wherein
each positive sample is contrasted against one or more strong negatives as well
as many free weak negatives. Theoretical properties of our EBM enable the
approximation error of EPA to almost surely vanish when a sufficient number of
negatives are used. Empirically, we demonstrate that EPA consistently delivers
better performance on open benchmarks compared to DPO, thereby showing the
superiority of our EBM.
[LINK]
http://arxiv.org/abs/2412.13862v1
[DATE]
2024-12-18 21:55:42+08:00
[CATEGORIES]
cs.LG
cs.CL
Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali
[AUTHORS]
Sharad Duwal, Suraj Prasai, Suresh Manandhar
[ABSTRACT]
Continual learning has emerged as an important research direction due to the
infeasibility of retraining large language models (LLMs) from scratch in the
event of new data availability. Of great interest is the domain-adaptive
pre-training (DAPT) paradigm, which focuses on continually training a
pre-trained language model to adapt it to a domain it was not originally
trained on. In this work, we evaluate the feasibility of DAPT in a low-resource
setting, namely the Nepali language. We use synthetic data to continue training
Llama 3 8B to adapt it to the Nepali language in a 4-bit QLoRA setting. We
evaluate the adapted model on its performance, forgetting, and knowledge
acquisition. We compare the base model and the final model on their Nepali
generation abilities, their performance on popular benchmarks, and run
case-studies to probe their linguistic knowledge in Nepali. We see some
unsurprising forgetting in the final model, but also surprisingly find that
increasing the number of shots during evaluation yields better percent
increases in the final model (as high as 19.29% increase) compared to the base
model (4.98%), suggesting latent retention. We also explore layer-head
self-attention heatmaps to establish dependency resolution abilities of the
final model in Nepali.
[COMMENTS]
10 pages, 2 figures
[LINK]
http://arxiv.org/abs/2412.13860v1
[DATE]
2024-12-18 21:53:59+08:00
[CATEGORIES]
cs.CL
cs.LG
TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation
[AUTHORS]
Alfredo Garrachón Ruiz, Tomás de la Rosa, Daniel Borrajo
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2412.07682v3
[DATE]
2024-12-18 21:39:47+08:00
[CATEGORIES]
cs.CL
EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation
[AUTHORS]
Taeho Hwang, Sukmin Cho, Soyeong Jeong, Hoyun Song, SeungYoon Han, Jong C. Park
[ABSTRACT]
We introduce EXIT, an extractive context compression framework that enhances
both the effectiveness and efficiency of retrieval-augmented generation (RAG)
in question answering (QA). Current RAG systems often struggle when retrieval
models fail to rank the most relevant documents, leading to the inclusion of
more context at the expense of latency and accuracy. While abstractive
compression methods can drastically reduce token counts, their token-by-token
generation process significantly increases end-to-end latency. Conversely,
existing extractive methods reduce latency but rely on independent,
non-adaptive sentence selection, failing to fully utilize contextual
information. EXIT addresses these limitations by classifying sentences from
retrieved documents - while preserving their contextual dependencies - enabling
parallelizable, context-aware extraction that adapts to query complexity and
retrieval quality. Our evaluations on both single-hop and multi-hop QA tasks
show that EXIT consistently surpasses existing compression methods and even
uncompressed baselines in QA accuracy, while also delivering substantial
reductions in inference time and token count. By improving both effectiveness
and efficiency, EXIT provides a promising direction for developing scalable,
high-quality QA solutions in RAG pipelines. Our code is available at
https://github.com/ThisIsHwang/EXIT
[COMMENTS]
Under Review
[LINK]
http://arxiv.org/abs/2412.12559v2
[DATE]
2024-12-18 21:08:36+08:00
[CATEGORIES]
cs.CL
Are Your LLMs Capable of Stable Reasoning?
[AUTHORS]
Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen
[ABSTRACT]
The rapid advancement of Large Language Models (LLMs) has demonstrated
remarkable progress in complex reasoning tasks. However, a significant
discrepancy persists between benchmark performances and real-world
applications. We identify this gap as primarily stemming from current
evaluation protocols and metrics, which inadequately capture the full spectrum
of LLM capabilities, particularly in complex reasoning tasks where both
accuracy and consistency are crucial. This work makes two key contributions.
First, we introduce G-Pass@k, a novel evaluation metric that provides a
continuous assessment of model performance across multiple sampling attempts,
quantifying both the model’s peak performance potential and its stability.
Second, we present LiveMathBench, a dynamic benchmark comprising challenging,
contemporary mathematical problems designed to minimize data leakage risks
during evaluation. Through extensive experiments using G-Pass@k on
state-of-the-art LLMs with LiveMathBench, we provide comprehensive insights
into both their maximum capabilities and operational consistency. Our findings
reveal substantial room for improvement in LLMs’ “realistic” reasoning
capabilities, highlighting the need for more robust evaluation methods. The
benchmark and detailed results are available at:
https://github.com/open-compass/GPassK.
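
The abstract does not give the formula for G-Pass@k. As a hedged illustration of a metric that captures both peak performance and stability, the sketch below uses a hypergeometric tail: given n generations of which c are correct, it computes the probability that at least a fraction tau of k draws (without replacement) are correct. The name `g_pass_at_k_sketch` and the parameter `tau` are assumptions, and the paper's exact definition may differ.

```python
from math import ceil, comb


def pass_at_k(n, c, k):
    """Standard pass@k estimator: probability that at least one of k
    draws from n generations (c of them correct) is correct."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def g_pass_at_k_sketch(n, c, k, tau):
    """Probability that at least ceil(tau * k) of k draws are correct.

    Hypothetical formulation for illustration only. tau close to 0 recovers
    pass@k (at least one correct); tau = 1 requires all k draws correct.
    """
    need = max(1, ceil(tau * k))
    hits = sum(
        comb(c, j) * comb(n - c, k - j)
        for j in range(need, min(c, k) + 1)
    )
    return hits / comb(n, k)


# Usage: 10 generations, 5 correct, 2 draws.
lenient = g_pass_at_k_sketch(10, 5, 2, tau=0.5)  # at least 1 of 2 correct
strict = g_pass_at_k_sketch(10, 5, 2, tau=1.0)   # both draws correct
```

Sweeping tau from near 0 to 1 traces how quickly a model's score falls as more of the sampled attempts are required to be correct, which is one way to separate peak capability from consistency.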
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2412.13147v2
[DATE]
2024-12-18 21:05:24+08:00
[CATEGORIES]
cs.CL
Enhancing Rhetorical Figure Annotation: An Ontology-Based Web Application with RAG Integration
[AUTHORS]
Ramona Kühn, Jelena Mitrović, Michael Granitzer
[ABSTRACT]
Rhetorical figures play an important role in our communication. They are used
to convey subtle, implicit meaning, or to emphasize statements. We notice them
in hate speech, fake news, and propaganda. By improving the systems for
computational detection of rhetorical figures, we can also improve tasks such
as hate speech and fake news detection, sentiment analysis, opinion mining, or
argument mining. Unfortunately, there is a lack of annotated data, as well as
qualified annotators that would help us build large corpora to train machine
learning models for the detection of rhetorical figures. The situation is
particularly difficult in languages other than English, and for rhetorical
figures other than metaphor, sarcasm, and irony. To overcome this issue, we
develop a web application called “Find your Figure” that facilitates the
identification and annotation of German rhetorical figures. The application is
based on the German Rhetorical ontology GRhOOT which we have specially adapted
for this purpose. In addition, we improve the user experience with Retrieval
Augmented Generation (RAG). In this paper, we present the restructuring of the
ontology, the development of the web application, and the built-in RAG
pipeline. We also identify the optimal RAG settings for our application. Our
approach is one of the first to practically use rhetorical ontologies in
combination with RAG and shows promising results.
[COMMENTS]
The 31st International Conference on Computational Linguistics
(COLING 2025)
[LINK]
http://arxiv.org/abs/2412.13799v1
[DATE]
2024-12-18 20:45:55+08:00
[CATEGORIES]
cs.CL
Physics Reasoner: Knowledge-Augmented Reasoning for Solving Physics Problems with Large Language Models
[AUTHORS]
Xinyu Pang, Ruixin Hong, Zhanke Zhou, Fangrui Lv, Xinwei Yang, Zhilong Liang, Bo Han, Changshui Zhang
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2412.13791v1
[DATE]
2024-12-18 20:33:50+08:00
[CATEGORIES]
cs.CL
Knowledge Editing with Dynamic Knowledge Graphs for Multi-hop Question Answering
[AUTHORS]
Yifan Lu, Yigeng Zhou, Jing Li, Yequan Wang, Xuebo Liu, Daojing He, Fangming Liu, Min Zhang
[ABSTRACT]
Multi-hop question answering (MHQA) poses a significant challenge for large
language models (LLMs) due to the extensive knowledge demands involved.
Knowledge editing, which aims to precisely modify the LLMs to incorporate
specific knowledge without negatively impacting other unrelated knowledge,
offers a potential solution for addressing MHQA challenges with LLMs. However,
current solutions struggle to effectively resolve issues of knowledge
conflicts. Most parameter-preserving editing methods are hindered by inaccurate
retrieval and overlook secondary editing issues, which can introduce noise into
the reasoning process of LLMs. In this paper, we introduce KEDKG, a novel
knowledge editing method that leverages a dynamic knowledge graph for MHQA,
designed to ensure the reliability of answers. KEDKG involves two primary
steps: dynamic knowledge graph construction and knowledge graph augmented
generation. Initially, KEDKG autonomously constructs a dynamic knowledge graph
to store revised information while resolving potential knowledge conflicts.
Subsequently, it employs a fine-grained retrieval strategy coupled with an
entity and relation detector to enhance the accuracy of graph retrieval for LLM
generation. Experimental results on benchmarks show that KEDKG surpasses
previous state-of-the-art models, delivering more accurate and reliable answers
in environments with dynamic information.
[COMMENTS]
AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13782v1
[DATE]
2024-12-18 20:21:46+08:00
[CATEGORIES]
cs.CL
Semantic Convergence: Harmonizing Recommender Systems via Two-Stage Alignment and Behavioral Semantic Tokenization
[AUTHORS]
Guanghan Li, Xun Zhang, Yufei Zhang, Yifan Yin, Guojun Yin, Wei Lin
[ABSTRACT]
Large language models (LLMs), endowed with exceptional reasoning
capabilities, are adept at discerning profound user interests from historical
behaviors, thereby presenting a promising avenue for the advancement of
recommendation systems. However, a notable discrepancy persists between the
sparse collaborative semantics typically found in recommendation systems and
the dense token representations within LLMs. In our study, we propose a novel
framework that harmoniously merges traditional recommendation models with the
prowess of LLMs. We initiate this integration by transforming ItemIDs into
sequences that align semantically with the LLMs space, through the proposed
Alignment Tokenization module. Additionally, we design a series of specialized
supervised learning tasks aimed at aligning collaborative signals with the
subtleties of natural language semantics. To ensure practical applicability, we
optimize online inference by pre-caching the top-K results for each user,
reducing latency and improving efficiency. Extensive experimental evidence
indicates that our model markedly improves recall metrics and displays
remarkable scalability of recommendation systems.
[COMMENTS]
7 pages, 3 figures, AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13771v1
[DATE]
2024-12-18 20:07:58+08:00
[CATEGORIES]
cs.CL
KlF: Knowledge Localization and Fusion for Language Model Continual Learning
[AUTHORS]
Yujie Feng, Xu Chu, Yongxin Xu, Zexin Lu, Bo Liu, Philip S. Yu, Xiao-Ming Wu
[ABSTRACT]
Language model continual learning (CL) has recently attracted significant
interest for its ability to adapt large language models (LLMs) to dynamic
real-world scenarios without retraining. A major challenge in this domain is
catastrophic forgetting, where models lose previously acquired knowledge upon
learning new tasks. Existing approaches commonly utilize multiple
parameter-efficient fine-tuning (PEFT) blocks to acquire task-specific
knowledge, yet these methods are inefficient and fail to leverage potential
knowledge transfer across tasks. In this paper, we introduce a novel CL
framework for language models, named Knowledge Localization and Fusion (KlF),
which boosts knowledge transfer without depending on memory replay. KlF
initially segregates the model into ‘skill units’ based on parameter
dependencies, allowing for more precise control. Subsequently, it employs a
novel group-wise knowledge localization technique to ascertain the importance
distribution of skill units for a new task. By comparing this importance
distribution with those from previous tasks, we implement a fine-grained
knowledge fusion strategy that retains task-specific knowledge, thereby
preventing forgetting, and updates task-shared knowledge, which facilitates
bi-directional knowledge transfer. As a result, KlF achieves an optimal balance
between retaining prior knowledge and excelling in new tasks. KlF also
demonstrates strong generalizability, making it suitable for various base
models and adaptable to PEFT methods like LoRA. Furthermore, it offers notable
extensibility, supporting enhancements through integration with memory replay
techniques. Comprehensive experiments conducted on two CL benchmarks, involving
models ranging from 220M to 7B parameters, affirm the effectiveness of KlF and
its variants across different settings.
[COMMENTS]
This version updates the model name from Task Skill Localization and
Consolidation (TaSL) to Knowledge Localization and Fusion (KlF). It is an
extension of the ACL 2024 paper titled Continual Dialog State Tracking via
Task Skill Localization and Consolidation
[LINK]
http://arxiv.org/abs/2408.05200v3
[DATE]
2024-12-18 20:07:27+08:00
[CATEGORIES]
cs.CL
REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space
[AUTHORS]
Tomer Ashuach, Martin Tutek, Yonatan Belinkov
[ABSTRACT]
Language models (LMs) risk inadvertently memorizing and divulging sensitive
or personally identifiable information (PII) seen in training data, causing
privacy concerns. Current approaches to address this issue involve costly
dataset scrubbing, or model filtering through unlearning and model editing,
which can be bypassed through extraction attacks. We propose REVS, a novel
non-gradient-based method for unlearning sensitive information from LMs. REVS
identifies and modifies a small subset of neurons relevant for constituent
tokens which form sensitive information. To adequately evaluate our method on
truly sensitive information, we curate two datasets: an email dataset naturally
memorized by Llama-3-8B and GPT-J-6B, and a synthetic social security number
dataset that we tune the models to memorize. Compared to other methods, REVS
demonstrates superior performance in unlearning sensitive information and
robustness to extraction attacks, while retaining underlying model integrity.
[COMMENTS]
18 pages, 3 figures
[LINK]
http://arxiv.org/abs/2406.09325v2
[DATE]
2024-12-18 19:56:33+08:00
[CATEGORIES]
cs.CL
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
[AUTHORS]
Zhuoran Jin, Hongbang Yuan, Tianyi Men, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
[COMMENTS]
26 pages, 12 figures, 6 tables
[LINK]
http://arxiv.org/abs/2412.13746v1
[DATE]
2024-12-18 19:28:05+08:00
[CATEGORIES]
cs.CL
Learning Complex Word Embeddings in Classical and Quantum Spaces
[AUTHORS]
Carys Harvey, Stephen Clark, Douglas Brown, Konstantinos Meichanetzidis
[ABSTRACT]
We present a variety of methods for training complex-valued word embeddings,
based on the classical Skip-gram model, with a straightforward adaptation
simply replacing the real-valued vectors with arbitrary vectors of complex
numbers. In a more “physically-inspired” approach, the vectors are produced by
parameterised quantum circuits (PQCs), which are unitary transformations
resulting in normalised vectors which have a probabilistic interpretation. We
develop a complex-valued version of the highly optimised C code version of
Skip-gram, which allows us to easily produce complex embeddings trained on a
3.8B-word corpus for a vocabulary size of over 400k, for which we are then able
to train a separate PQC for each word. We evaluate the complex embeddings on a
set of standard similarity and relatedness datasets, for some models obtaining
results competitive with the classical baseline. We find that, while training
the PQCs directly tends to harm performance, the quantum word embeddings from
the two-stage process perform as well as the classical Skip-gram embeddings
with comparable numbers of parameters. This enables a highly scalable route to
learning embeddings in complex spaces which scales with the size of the
vocabulary rather than the size of the training corpus. In summary, we
demonstrate how to produce a large set of high-quality word embeddings for use
in complex-valued and quantum-inspired NLP models, and for exploring potential
advantage in quantum NLP models.
[LINK]
http://arxiv.org/abs/2412.13745v1
[DATE]
2024-12-18 19:26:51+08:00
[CATEGORIES]
cs.CL
Federated Learning and RAG Integration: A Scalable Approach for Medical Large Language Models
[AUTHORS]
Jincheol Jung, Hongju Jeong, Eui-Nam Huh
[ABSTRACT]
This study analyzes the performance of domain-specific Large Language Models
(LLMs) for the medical field by integrating Retrieval-Augmented Generation
(RAG) systems within a federated learning framework. Leveraging the inherent
advantages of federated learning, such as preserving data privacy and enabling
distributed computation, this research explores the integration of RAG systems
with models trained under varying client configurations to optimize
performance. Experimental results demonstrate that the federated learning-based
models integrated with RAG systems consistently outperform their non-integrated
counterparts across all evaluation metrics. This study highlights the potential
of combining federated learning and RAG systems for developing domain-specific
LLMs in the medical field, providing a scalable and privacy-preserving solution
for enhancing text generation capabilities.
[LINK]
http://arxiv.org/abs/2412.13720v1
[DATE]
2024-12-18 19:00:58+08:00
[CATEGORIES]
cs.CL
Towards Automatic Evaluation for Image Transcreation
[AUTHORS]
Simran Khanuja, Vivek Iyer, Claire He, Graham Neubig
[ABSTRACT]
Beyond conventional paradigms of translating speech and text, recently, there
has been interest in automated transcreation of images to facilitate
localization of visual content across different cultures. Attempts to define
this as a formal Machine Learning (ML) problem have been impeded by the lack of
automatic evaluation mechanisms, with previous work relying solely on human
evaluation. In this paper, we seek to close this gap by proposing a suite of
automatic evaluation metrics inspired by machine translation (MT) metrics,
categorized into: a) Object-based, b) Embedding-based, and c) VLM-based.
Drawing on theories from translation studies and real-world transcreation
practices, we identify three critical dimensions of image transcreation:
cultural relevance, semantic equivalence and visual similarity, and design our
metrics to evaluate systems along these axes. Our results show that proprietary
VLMs best identify cultural relevance and semantic equivalence, while
vision-encoder representations are adept at measuring visual similarity.
Meta-evaluation across 7 countries shows our metrics agree strongly with human
ratings, with average segment-level correlations ranging from 0.55 to 0.87.
Finally, through a discussion of the merits and demerits of each metric, we
offer a robust framework for automated image transcreation evaluation, grounded
in both theoretical foundations and practical application. Our code can be
found here: https://github.com/simran-khanuja/automatic-eval-transcreation
[LINK]
http://arxiv.org/abs/2412.13717v1
[DATE]
2024-12-18 18:55:58+08:00
[CATEGORIES]
cs.CL
Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation
[AUTHORS]
Minkyoung Kim, Yunha Kim, Hyeram Seo, Heejung Choi, Jiye Han, Gaeun Kee, Soyoung Ko, HyoJe Jung, Byeolhee Kim, Young-Hak Kim, Sanghyun Park, Tae Joon Jun
[ABSTRACT]
Large language models (LLMs) have exhibited outstanding performance in
natural language processing tasks. However, these models remain susceptible to
adversarial attacks in which slight input perturbations can lead to harmful or
misleading outputs. A gradient-based defensive suffix generation algorithm is
designed to bolster the robustness of LLMs. By appending carefully optimized
defensive suffixes to input prompts, the algorithm mitigates adversarial
influences while preserving the models’ utility. To enhance adversarial
understanding, a novel total loss function ($L_{\text{total}}$) combining
defensive loss ($L_{\text{def}}$) and adversarial loss ($L_{\text{adv}}$)
generates defensive suffixes more effectively. Experimental evaluations
conducted on open-source LLMs such as Gemma-7B, mistral-7B, Llama2-7B, and
Llama2-13B show that the proposed method reduces attack success rates (ASR) by
an average of 11% compared to models without defensive suffixes. Additionally,
the perplexity score of Gemma-7B decreased from 6.57 to 3.93 when applying the
defensive suffix generated by openELM-270M. Furthermore, TruthfulQA evaluations
demonstrate consistent improvements with Truthfulness scores increasing by up
to 10% across tested configurations. This approach significantly enhances the
security of LLMs in critical applications without requiring extensive
retraining.
[COMMENTS]
9 pages, 2 figures
[LINK]
http://arxiv.org/abs/2412.13705v1
[DATE]
2024-12-18 18:49:41+08:00
[CATEGORIES]
cs.CL
Discerning and Characterising Types of Competency Questions for Ontologies
[AUTHORS]
C. Maria Keet, Zubeida Casmod Khan
[ABSTRACT]
Competency Questions (CQs) are widely used in ontology development by
guiding, among others, the scoping and validation stages. However, very limited
guidance exists for formulating CQs and assessing whether they are good CQs,
leading to issues such as ambiguity and unusable formulations. To solve this,
one requires insight into the nature of CQs for ontologies and their
constituent parts, as well as which ones are not. We aim to contribute to such
theoretical foundations in this paper, which is informed by analysing
questions, their uses, and the myriad of ontology development tasks. This
resulted in a first Model for Competency Questions, which comprises five main
types of CQs, each with a different purpose: Scoping (SCQ), Validating (VCQ),
Foundational (FCQ), Relationship (RCQ), and Metaproperty (MpCQ) questions. This
model enhances the clarity of CQs and therewith aims to improve on the
effectiveness of CQs in ontology development, thanks to their respective
identifiable distinct constituent elements. We illustrate and evaluate them
with a user story and demonstrate which type can be used where in ontology
development tasks. To foster use and research, we created an annotated
repository of 438 CQs, the Repository of Ontology Competency QuestionS (ROCQS),
incorporating an existing CQ dataset and new CQs and CQ templates, which
further demonstrate distinctions among types of CQs.
[COMMENTS]
16 pages, 5 figures
[LINK]
http://arxiv.org/abs/2412.13688v1
[DATE]
2024-12-18 18:26:29+08:00
[CATEGORIES]
cs.CL
Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation
[AUTHORS]
Aneta Zugecova, Dominik Macko, Ivan Srba, Robert Moro, Jakub Kopal, Katarina Marcincinova, Matus Mesarcik
[ABSTRACT]
The capabilities of recent large language models (LLMs) to generate
high-quality content indistinguishable by humans from human-written texts raise
many concerns regarding their misuse. Previous research has shown that LLMs can
be effectively misused for generating disinformation news articles following
predefined narratives. Their capabilities to generate personalized (in various
aspects) content have also been evaluated and mostly found usable. However, a
combination of personalization and disinformation abilities of LLMs has not
been comprehensively studied yet. Such a dangerous combination should trigger
the integrated safety filters of the LLMs, if there are any. This study fills
this gap by evaluating the vulnerabilities of recent open and closed LLMs and
their willingness to generate personalized disinformation news articles in
English. We further explore whether the LLMs can reliably meta-evaluate the
personalization quality and whether the personalization affects the
detectability of the generated texts. Our results demonstrate the need for
stronger safety filters and disclaimers, as those are not properly functioning
in most of the evaluated LLMs. Additionally, our study revealed that the
personalization actually reduces the safety-filter activations, thus
effectively functioning as a jailbreak. Such behavior must be urgently
addressed by LLM developers and service providers.
[LINK]
http://arxiv.org/abs/2412.13666v1
[DATE]
2024-12-18 17:48:53+08:00
[CATEGORIES]
cs.CL
TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning
[AUTHORS]
Shivam Shandilya, Menglin Xia, Supriyo Ghosh, Huiqiang Jiang, Jue Zhang, Qianhui Wu, Victor Rühle
[ABSTRACT]
The increasing prevalence of large language models (LLMs) such as GPT-4 in
various applications has led to a surge in the size of prompts required for
optimal performance, leading to challenges in computational efficiency. Prompt
compression aims to reduce the inference cost by minimizing input tokens
without compromising on the task performance. However, existing prompt
compression techniques either rely on sub-optimal metrics such as information
entropy or model it as a task-agnostic token classification problem that fails
to capture task-specific information. To address these issues, we propose a
novel and efficient reinforcement learning (RL) based task-aware prompt
compression method. To meet low-latency requirements, we leverage an existing
Transformer encoder-based token classification model while guiding the learning
process with task-specific reward signals using the lightweight REINFORCE
algorithm. We evaluate the performance of our method on three diverse and
challenging tasks including text summarization, question answering and code
summarization. We demonstrate that our RL-guided compression method improves
the task performance by 8% to 189% across these three scenarios over
state-of-the-art compression techniques while satisfying the same compression
rate and latency requirements.
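The abstract above does not give the training details, so the following is only a minimal sketch of the general idea: a per-token keep/drop policy updated with REINFORCE from a task-level reward. The reward function, the token budget, the moving-average baseline and all names (`toy_reward`, `reinforce_step`, `train`) are hypothetical stand-ins, and plain per-token logits replace the Transformer encoder classifier.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def toy_reward(tokens, mask, important, budget):
    """Hypothetical task reward: share of important tokens kept, 0 if over budget."""
    kept = [t for t, m in zip(tokens, mask) if m]
    if len(kept) > budget:
        return 0.0
    return sum(1 for t in kept if t in important) / len(important)


def reinforce_step(logits, tokens, important, budget, lr, baseline, rng):
    """Sample a keep/drop mask and apply one REINFORCE update to the logits."""
    probs = [sigmoid(z) for z in logits]
    mask = [1 if rng.random() < p else 0 for p in probs]
    reward = toy_reward(tokens, mask, important, budget)
    advantage = reward - baseline
    # For a Bernoulli policy, d/dz log pi(a | z) = a - sigmoid(z).
    new_logits = [z + lr * advantage * (a - p)
                  for z, a, p in zip(logits, mask, probs)]
    return new_logits, reward


def train(tokens, important, budget, steps=2000, lr=0.5, seed=0):
    """Run REINFORCE with a moving-average baseline (an assumed variance reducer)."""
    rng = random.Random(seed)
    logits = [0.0] * len(tokens)
    baseline = 0.0
    for _ in range(steps):
        logits, reward = reinforce_step(
            logits, tokens, important, budget, lr, baseline, rng)
        baseline = 0.9 * baseline + 0.1 * reward
    return logits
```

In this toy setting the reward only depends on which tokens survive, which mirrors the task-aware signal described above; a real system would replace `toy_reward` with a downstream task metric.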
[LINK]
http://arxiv.org/abs/2409.13035v3
[DATE]
2024-12-18 17:48:32+08:00
[CATEGORIES]
cs.CL
cs.LG
Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation
[AUTHORS]
Xiaoyu Wang, Ningyuan Xi, Teng Chen, Qingqing Gu, Yue Zhao, Xiaokai Chen, Zhonglin Jiang, Yong Chen, Luo Ji
[ABSTRACT]
Large Language Models (LLMs) are usually fine-tuned to participate in dyadic
or two-party dialogues and therefore do not adapt well to multi-party
dialogues (MPD), which hinders their application in scenarios such as
multi-person meetings, discussions and daily communication. Previous LLM-based
research mainly focuses on the multi-agent framework, while the base LLMs are
still fine-tuned on pairwise dialogues. In this work, we design a
multi-party fine-tuning framework (MuPaS) for LLMs on multi-party dialogue
datasets, and prove that such a straightforward framework lets the LLM align
with the multi-party conversation style efficiently and effectively. We also
design two training strategies which can convert MuPaS into an MPD simulator.
Substantial experiments show that MuPaS achieves state-of-the-art
multi-party responses, higher next-speaker prediction accuracy, higher
utterance quality under human and automatic evaluation, and can even generate
reasonable responses for out-of-distribution scene, topic and role
descriptions. The MuPaS framework bridges LLM training with more complicated
multi-party applications, such as conversation generation, virtual rehearsal or
the meta-universe.
[LINK]
http://arxiv.org/abs/2412.05342v2
[DATE]
2024-12-18 17:47:53+08:00
[CATEGORIES]
cs.CL
PsyDT: Using LLMs to Construct the Digital Twin of Psychological Counselor with Personalized Counseling Style for Psychological Counseling
[AUTHORS]
Haojie Xie, Yirong Chen, Xiaofen Xing, Jingkai Lin, Xiangmin Xu
[ABSTRACT]
Currently, large language models (LLMs) have made significant progress in the
field of psychological counseling. However, existing mental health LLMs
overlook a critical issue: different psychological counselors exhibit
different personal styles, including linguistic style and therapy techniques.
As a result, these LLMs fail to
satisfy the individual needs of clients who seek different counseling styles.
To help bridge this gap, we propose PsyDT, a novel framework using LLMs to
construct the Digital Twin of a psychological counselor with a personalized
counseling style. Compared to the time-consuming and costly approach of
collecting a large number of real-world counseling cases to create a specific
counselor’s digital twin, our framework offers a faster and more cost-effective
solution. To construct PsyDT, we utilize dynamic one-shot learning by using
GPT-4 to capture a counselor’s unique counseling style, mainly focusing on
linguistic style and therapy techniques. Subsequently, using existing
single-turn long-text dialogues with client’s questions, GPT-4 is guided to
synthesize multi-turn dialogues of a specific counselor. Finally, we fine-tune
the LLMs on the synthetic dataset, PsyDTCorpus, to achieve the digital twin of
a psychological counselor with a personalized counseling style. Experimental
results indicate that our proposed PsyDT framework can synthesize multi-turn
dialogues that closely resemble real-world counseling cases and demonstrate
better performance compared to other baselines, thereby showing that our
framework can effectively construct the digital twin of a psychological
counselor with a specific counseling style.
[COMMENTS]
9 pages, 6 figures
[LINK]
http://arxiv.org/abs/2412.13660v1
[DATE]
2024-12-18 17:38:43+08:00
[CATEGORIES]
cs.CL
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
[AUTHORS]
Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, Deyu Zhou
[ABSTRACT]
Key-Value (KV) cache has become a bottleneck of LLMs for long-context
generation. Despite the numerous efforts in this area, the optimization for the
decoding phase is generally ignored. However, we believe such optimization is
crucial, especially for long-output generation tasks based on the following two
observations: (i) Excessive compression during the prefill phase, which
requires specific full context, impairs the comprehension of the reasoning task;
(ii) Deviation of heavy hitters occurs in the reasoning tasks with long
outputs. Therefore, SCOPE, a simple yet efficient framework that separately
performs KV cache optimization during the prefill and decoding phases, is
introduced. Specifically, the KV cache during the prefill phase is preserved to
maintain the essential information, while a novel strategy based on sliding is
proposed to select essential heavy hitters for the decoding phase. Memory usage
and memory transfer are further optimized using adaptive and discontinuous
strategies. Extensive experiments on LongGenBench show the effectiveness and
generalization of SCOPE and its compatibility as a plug-in to other
prefill-only KV compression methods.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2412.13649v1
[DATE]
2024-12-18 17:27:33+08:00
[CATEGORIES]
cs.CL
On the Role of Model Prior in Real-World Inductive Reasoning
[AUTHORS]
Zhuo Liu, Ding Yu, Hangfeng He
[ABSTRACT]
Large Language Models (LLMs) show impressive inductive reasoning
capabilities, enabling them to generate hypotheses that could generalize
effectively to new instances when guided by in-context demonstrations. However,
in real-world applications, LLMs’ hypothesis generation is not solely
determined by these demonstrations but is significantly shaped by task-specific
model priors. Despite their critical influence, the distinct contributions of
model priors versus demonstrations to hypothesis generation have been
underexplored. This study bridges this gap by systematically evaluating three
inductive reasoning strategies across five real-world tasks with three LLMs.
Our empirical findings reveal that hypothesis generation is primarily driven
by the model’s inherent priors; removing demonstrations results in minimal loss
of hypothesis quality and downstream usage. Further analysis shows the result
is consistent across various label formats with different label configurations,
and that the prior is hard to override, even under flipped labeling. These insights
advance our understanding of the dynamics of hypothesis generation in LLMs and
highlight the potential for better utilizing model priors in real-world
inductive reasoning tasks.
[LINK]
http://arxiv.org/abs/2412.13645v1
[DATE]
2024-12-18 17:22:08+08:00
[CATEGORIES]
cs.CL
Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning
[AUTHORS]
Eitan Wagner, Nitay Alon, Joseph M. Barnby, Omri Abend
[ABSTRACT]
Theory of Mind (ToM) capabilities in LLMs have recently become a central
object of investigation. Cognitive science distinguishes between two steps
required for ToM tasks: 1) determining whether to invoke ToM, which includes the
appropriate Depth of Mentalizing (DoM), or level of recursion required to
complete a task; and 2) applying the correct inference given the DoM. In this
position paper, we first identify several lines of work in different
communities in AI, including LLM benchmarking, ToM add-ons, ToM probing, and
formal models for ToM. We argue that recent work in AI tends to focus
exclusively on the second step, which is typically framed as static logic
problems. We conclude with suggestions for improved evaluation of ToM
capabilities inspired by dynamic environments used in cognitive tasks.
[COMMENTS]
4 pages, 2 figures
[LINK]
http://arxiv.org/abs/2412.13631v1
[DATE]
2024-12-18 17:06:48+08:00
[CATEGORIES]
cs.CL
On Affine Homotopy between Language Encoders
[AUTHORS]
Robin SM Chan, Reda Boumasmoud, Anej Svete, Yuxin Ren, Qipeng Guo, Zhijing Jin, Shauli Ravfogel, Mrinmaya Sachan, Bernhard Schölkopf, Mennatallah El-Assady, Ryan Cotterell
[ABSTRACT]
Pre-trained language encoders – functions that represent text as vectors –
are an integral component of many NLP tasks. We tackle a natural question in
language encoder analysis: What does it mean for two encoders to be similar? We
contend that a faithful measure of similarity needs to be \emph{intrinsic},
that is, task-independent, yet still be informative of \emph{extrinsic}
similarity – the performance on downstream tasks. It is common to consider two
encoders similar if they are \emph{homotopic}, i.e., if they can be aligned
through some transformation. In this spirit, we study the properties of
\emph{affine} alignment of language encoders and its implications on extrinsic
similarity. We find that while affine alignment is fundamentally an asymmetric
notion of similarity, it is still informative of extrinsic similarity. We
confirm this on datasets of natural language representations. Beyond providing
useful bounds on extrinsic similarity, affine intrinsic similarity also allows
us to begin uncovering the structure of the space of pre-trained encoders by
defining an order over them.
[COMMENTS]
10 pages, Accepted at NeurIPS 2024 (Main)
[LINK]
http://arxiv.org/abs/2406.02329v2
[DATE]
2024-12-18 16:56:43+08:00
[CATEGORIES]
cs.CL
cs.LG
REVECA: Adaptive Planning and Trajectory-based Validation in Cooperative Language Agents using Information Relevance and Relative Proximity
[AUTHORS]
SeungWon Seo, SeongRae Noh, Junhyeok Lee, SooBin Lim, Won Hee Lee, HyeongYeop Kang
[ABSTRACT]
We address the challenge of multi-agent cooperation, where agents achieve a
common goal by cooperating with decentralized agents under complex partial
observations. Existing cooperative agent systems often struggle with
efficiently processing continuously accumulating information, managing globally
suboptimal planning due to lack of consideration of collaborators, and
addressing false planning caused by environmental changes introduced by other
collaborators. To overcome these challenges, we propose the RElevance,
Proximity, and Validation-Enhanced Cooperative Language Agent (REVECA), a novel
cognitive architecture powered by GPT-4o-mini. REVECA enables efficient memory
management, optimal planning, and cost-effective prevention of false planning
by leveraging Relevance Estimation, Adaptive Planning, and Trajectory-based
Validation. Extensive experimental results demonstrate REVECA’s superiority
over existing methods across various benchmarks, while a user study reveals its
potential for achieving trustworthy human-AI cooperation.
[COMMENTS]
v2 is the AAAI’25 camera-ready version, including the appendix, which
has been enhanced based on the reviewers’ comments
[LINK]
http://arxiv.org/abs/2405.16751v2
[DATE]
2024-12-18 16:38:06+08:00
[CATEGORIES]
cs.CL
Beyond Outcomes: Transparent Assessment of LLM Reasoning in Games
[AUTHORS]
Wenye Lin, Jonathan Roberts, Yunhan Yang, Samuel Albanie, Zongqing Lu, Kai Han
[ABSTRACT]
Large Language Models (LLMs) are increasingly deployed in real-world
applications that demand complex reasoning. To track progress, robust
benchmarks are required to evaluate their capabilities beyond superficial
pattern recognition. However, current LLM reasoning benchmarks often face
challenges such as insufficient interpretability, performance saturation or
data contamination. To address these challenges, we introduce GAMEBoT, a gaming
arena designed for rigorous and transparent assessment of LLM reasoning
capabilities. GAMEBoT decomposes complex reasoning in games into predefined
modular subproblems. This decomposition allows us to design a suite of
Chain-of-Thought (CoT) prompts that leverage domain knowledge to guide LLMs in
addressing these subproblems before action selection. Furthermore, we develop a
suite of rule-based algorithms to generate ground truth for these subproblems,
enabling rigorous validation of the LLMs’ intermediate reasoning steps. This
approach facilitates evaluation of both the quality of final actions and the
accuracy of the underlying reasoning process. GAMEBoT also naturally alleviates
the risk of data contamination through dynamic games and head-to-head LLM
competitions. We benchmark 17 prominent LLMs across eight games, encompassing
various strategic abilities and game characteristics. Our results suggest that
GAMEBoT presents a significant challenge, even when LLMs are provided with
detailed CoT prompts. Project page: https://visual-ai.github.io/gamebot
[COMMENTS]
8 pages
[LINK]
http://arxiv.org/abs/2412.13602v1
[DATE]
2024-12-18 16:32:53+08:00
[CATEGORIES]
cs.CL
Unlocking the Potential of Weakly Labeled Data: A Co-Evolutionary Learning Framework for Abnormality Detection and Report Generation
[AUTHORS]
Jinghan Sun, Dong Wei, Zhe Xu, Donghuan Lu, Hong Liu, Hong Wang, Sotirios A. Tsaftaris, Steven McDonagh, Yefeng Zheng, Liansheng Wang
[ABSTRACT]
Anatomical abnormality detection and report generation of chest X-ray (CXR)
are two essential tasks in clinical practice. The former aims at localizing and
characterizing cardiopulmonary radiological findings in CXRs, while the latter
summarizes the findings in a detailed report for further diagnosis and
treatment. Existing methods often focused on either task separately, ignoring
their correlation. This work proposes a co-evolutionary abnormality detection
and report generation (CoE-DG) framework. The framework utilizes both fully
labeled (with bounding box annotations and clinical reports) and weakly labeled
(with reports only) data to achieve mutual promotion between the abnormality
detection and report generation tasks. Specifically, we introduce a
bi-directional information interaction strategy with generator-guided
information propagation (GIP) and detector-guided information propagation
(DIP). For semi-supervised abnormality detection, GIP takes the informative
feature extracted by the generator as an auxiliary input to the detector and
uses the generator’s prediction to refine the detector’s pseudo labels. We
further propose an intra-image-modal self-adaptive non-maximum suppression
module (SA-NMS). This module dynamically rectifies pseudo detection labels
generated by the teacher detection model with high-confidence predictions by
the student. Inversely, for report generation, DIP takes the abnormalities’
categories and locations predicted by the detector as input and guidance for
the generator to improve the generated reports.
[LINK]
http://arxiv.org/abs/2412.13599v1
[DATE]
2024-12-18 16:31:26+08:00
[CATEGORIES]
cs.CL
Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment
[AUTHORS]
Tianyu Peng, Jiajun Zhang
[ABSTRACT]
Knowledge distillation (KD) is an effective model compression method that can
transfer the internal capabilities of large language models (LLMs) to smaller
ones. However, the multi-modal probability distribution predicted by teacher
LLMs causes difficulties for student models to learn. In this paper, we first
demonstrate the importance of multi-modal distribution alignment with
experiments and then highlight the inefficiency of existing KD approaches in
learning multi-modal distributions. To address this problem, we propose Ranking
Loss based Knowledge Distillation (RLKD), which encourages the consistency of
the ranking of peak predictions between the teacher and student models. By
incorporating word-level ranking loss, we ensure excellent compatibility with
existing distillation objectives while fully leveraging the fine-grained
information between different categories in the peaks of the two predicted
distributions. Experimental results demonstrate that our method enables the
student model to better learn the multi-modal distributions of the teacher
model, leading to a significant performance improvement in various downstream
tasks.
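The exact RLKD objective is not stated in the abstract, so the following is only an illustrative sketch of one way to encourage ranking consistency on peak predictions: a pairwise hinge loss over the teacher's top-k tokens. The function names, the choice of `k`, and the `margin` parameter are assumptions made for illustration, not the authors' formulation.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def pairwise_ranking_loss(teacher_logits, student_logits, k=3, margin=0.0):
    """Hinge loss over pairs of teacher peaks that the student orders wrongly.

    For every pair (hi, lo) among the teacher's top-k tokens, where the
    teacher ranks hi above lo, the student is penalised when its score for
    hi does not exceed its score for lo by at least `margin`.
    """
    peaks = top_k_indices(teacher_logits, k)
    loss = 0.0
    pairs = 0
    for a in range(len(peaks)):
        for b in range(a + 1, len(peaks)):
            hi, lo = peaks[a], peaks[b]
            gap = student_logits[hi] - student_logits[lo]
            loss += max(0.0, margin - gap)
            pairs += 1
    return loss / pairs if pairs else 0.0
```

A term like this could be added to an existing distillation objective such as a KL term, which is the kind of compatibility the abstract alludes to; how the terms are weighted is left open here.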
[COMMENTS]
Accepted by COLING 2025, 19 pages
[LINK]
http://arxiv.org/abs/2409.12545v2
[DATE]
2024-12-18 16:23:30+08:00
[CATEGORIES]
cs.CL
EvoWiki: Evaluating LLMs on Evolving Knowledge
[AUTHORS]
Wei Tang, Yixin Cao, Yang Deng, Jiahao Ying, Bo Wang, Yizhe Yang, Yuyue Zhao, Qi Zhang, Xuanjing Huang, Yugang Jiang, Yong Liao
[ABSTRACT]
Knowledge utilization is a critical aspect of LLMs, and understanding how
they adapt to evolving knowledge is essential for their effective deployment.
However, existing benchmarks are predominantly static, failing to capture the
evolving nature of LLMs and knowledge, leading to inaccuracies and
vulnerabilities such as contamination. In this paper, we introduce EvoWiki, an
evolving dataset designed to reflect knowledge evolution by categorizing
information into stable, evolved, and uncharted states. EvoWiki is fully
auto-updatable, enabling precise evaluation of continuously changing knowledge
and newly released LLMs. Through experiments with Retrieval-Augmented
Generation (RAG) and Continual Learning (CL), we evaluate how effectively LLMs
adapt to evolving knowledge. Our results indicate that current models often
struggle with evolved knowledge, frequently providing outdated or incorrect
responses. Moreover, the dataset highlights a synergistic effect between RAG
and CL, demonstrating their potential to better adapt to evolving knowledge.
EvoWiki provides a robust benchmark for advancing future research on the
knowledge evolution capabilities of large language models.
[LINK]
http://arxiv.org/abs/2412.13582v1
[DATE]
2024-12-18 16:04:57+08:00
[CATEGORIES]
cs.CL
Socio-Culturally Aware Evaluation Framework for LLM-Based Content Moderation
[AUTHORS]
Shanu Kumar, Gauri Kholkar, Saish Mendke, Anubhav Sadana, Parag Agrawal, Sandipan Dandapat
[ABSTRACT]
With the growth of social media and large language models, content moderation
has become crucial. Many existing datasets lack adequate representation of
different groups, resulting in unreliable assessments. To tackle this, we
propose a socio-culturally aware evaluation framework for LLM-driven content
moderation and introduce a scalable method for creating diverse datasets using
persona-based generation. Our analysis reveals that these datasets provide
broader perspectives and pose greater challenges for LLMs than
diversity-focused generation methods without personas. This challenge is
especially pronounced in smaller LLMs, emphasizing the difficulties they
encounter in moderating such diverse content.
[COMMENTS]
Accepted in SUMEval Workshop in COLING 2025
[LINK]
http://arxiv.org/abs/2412.13578v1
[DATE]
2024-12-18 15:57:18+08:00
[CATEGORIES]
cs.CL
Generating Long-form Story Using Dynamic Hierarchical Outlining with Memory-Enhancement
[AUTHORS]
Qianyue Wang, Jinwu Hu, Zhengping Li, Yufeng Wang, daiyuan li, Yu Hu, Mingkui Tan
[ABSTRACT]
The long-form story generation task aims to produce coherent and sufficiently
lengthy text, essential for applications such as novel writing and interactive
storytelling. However, existing methods, including LLMs, rely on rigid outlines
or lack macro-level planning, making it difficult to achieve both contextual
consistency and coherent plot development in long-form story generation. To
address these issues, we propose a Dynamic Hierarchical Outlining with
Memory-Enhancement long-form story generation method, named DOME, to generate
long-form stories with coherent content and plot. Specifically, the Dynamic
Hierarchical Outline (DHO) mechanism incorporates novel-writing theory into
outline planning and fuses the plan and writing stages together, improving the
coherence of the plot by ensuring the plot completeness and adapting to the
uncertainty during story generation. A Memory-Enhancement Module (MEM) based on
temporal knowledge graphs is introduced to store and access the generated
content, reducing contextual conflicts and improving story coherence. Finally,
we propose a Temporal Conflict Analyzer leveraging temporal knowledge graphs to
automatically evaluate the contextual consistency of long-form stories.
Experiments demonstrate that DOME significantly improves the fluency,
coherence, and overall quality of generated long stories compared to
state-of-the-art methods.
[COMMENTS]
39 pages
[LINK]
http://arxiv.org/abs/2412.13575v1
[DATE]
2024-12-18 15:50:54+08:00
[CATEGORIES]
cs.CL
Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation
[AUTHORS]
Changsun Lee, Sangjoon Park, Cheong-Il Shin, Woo Hee Choi, Hyun Jeong Park, Jeong Eun Lee, Jong Chul Ye
[ABSTRACT]
Recent medical vision-language models (VLMs) have shown promise in 2D medical
image interpretation. However, extending them to 3D medical imaging has been
challenging due to computational complexities and data scarcity. Although a few
recent VLMs specified for 3D medical imaging have emerged, all are limited to
learning volumetric representation of a 3D medical image as a set of
sub-volumetric features. Such a process introduces overly correlated
representations along the z-axis that neglect slice-specific clinical details,
particularly for 3D medical images where adjacent slices have low redundancy.
To address this limitation, we introduce MS-VLM, which mimics radiologists’
workflow in 3D medical image interpretation. Specifically, radiologists analyze
3D medical images by examining individual slices sequentially and synthesizing
information across slices and views. Likewise, MS-VLM leverages self-supervised
2D transformer encoders to learn a volumetric representation that captures
inter-slice dependencies from a sequence of slice-specific features. Unbound by
sub-volumetric patchification, MS-VLM is capable of obtaining useful volumetric
representations from 3D medical images with any slice length and from multiple
images acquired from different planes and phases. We evaluate MS-VLM on the
publicly available chest CT dataset CT-RATE and an in-house rectal MRI dataset. In
both scenarios, MS-VLM surpasses existing methods in radiology report
generation, producing more coherent and clinically relevant reports. These
findings highlight the potential of MS-VLM to advance 3D medical image
interpretation and improve the robustness of medical VLMs.
[LINK]
http://arxiv.org/abs/2412.13558v1
[DATE]
2024-12-18 15:19:48+08:00
[CATEGORIES]
cs.CL
cs.LG
Pruning Large Language Models with Semi-Structural Adaptive Sparse Training
[AUTHORS]
Weiyu Huang, Yuezhou Hu, Guohao Jian, Jun Zhu, Jianfei Chen
[ABSTRACT]
The remarkable success of Large Language Models (LLMs) relies heavily on
their substantial scale, which poses significant challenges during model
deployment in terms of latency and memory consumption. Recently, numerous
studies have attempted to compress LLMs using one-shot pruning methods.
However, these methods often suffer from considerable performance degradation
on complex language understanding tasks, raising concerns about the feasibility
of pruning in LLMs. To address this issue, we propose Adaptive Sparse Trainer
(AST), a novel and efficient retraining framework tailored for semi-structured
sparse models. AST enables models to learn optimal masks during the weight
update process without incurring additional computational overhead.
Furthermore, we demonstrate that incorporating knowledge distillation
significantly improves retraining efficiency and enhances model performance
under fixed computational constraints. Additionally, a supplementary set of
well-initialized parameters is integrated to further augment the model’s
efficacy. AST achieves state-of-the-art performance with minimal training cost.
When applied to the LLaMA2-7B model, AST reduces the perplexity and zero-shot
accuracy gap between dense and 2:4 semi-structured sparse models to 0.6 and
1.16%, respectively, utilizing less than 0.4% of the pretraining tokens and GPU
hours. Our work demonstrates the feasibility of deploying semi-structured
sparse LLMs and offers a promising alternative for achieving highly compressed
models when combined with existing quantization techniques.
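Illustrative sketch (not from the paper): the 2:4 semi-structured pattern
mentioned above keeps at most two non-zero weights in every group of four
consecutive weights. The snippet below builds such a mask by weight magnitude,
which is an assumption made purely for illustration; AST itself is described as
learning masks during retraining, and the function name `mask_2_of_4` is
hypothetical.

```python
def mask_2_of_4(weights):
    """Return a 0/1 mask keeping the 2 largest-magnitude weights per group of 4."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    mask = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest magnitudes; ties go to the earlier position
        # because Python's sort is stable.
        keep = sorted(range(4), key=lambda i: -abs(group[i]))[:2]
        mask.extend(1 if i in keep else 0 for i in range(4))
    return mask


# Toy weight row (made-up values), two groups of four.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
mask = mask_2_of_4(row)
sparse_row = [w * m for w, m in zip(row, mask)]
```

Every group of four in `mask` sums to exactly two, which is the property that
2:4 sparse hardware kernels rely on.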
[COMMENTS]
Accepted at AAAI25
[LINK]
http://arxiv.org/abs/2407.20584v3
[DATE]
2024-12-18 15:14:33+08:00
[CATEGORIES]
cs.CL
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
[AUTHORS]
Jianlyu Chen, Nan Wang, Chaofan Li, Bo Wang, Shitao Xiao, Han Xiao, Hao Liao, Defu Lian, Zheng Liu
[ABSTRACT]
Evaluation plays a crucial role in the advancement of information retrieval
(IR) models. However, current benchmarks, which are based on predefined domains
and human-labeled data, face limitations in addressing evaluation needs for
emerging domains both cost-effectively and efficiently. To address this
challenge, we propose the Automated Heterogeneous Information Retrieval
Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1)
Automated. The testing data in AIR-Bench is automatically generated by large
language models (LLMs) without human intervention. 2) Heterogeneous. The
testing data in AIR-Bench is generated with respect to diverse tasks, domains
and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are
constantly augmented to provide an increasingly comprehensive evaluation
benchmark for community developers. We develop a reliable and robust data
generation pipeline to automatically create diverse and high-quality evaluation
datasets based on real-world corpora. Our findings demonstrate that the
generated testing data in AIR-Bench aligns well with human-labeled testing
data, making AIR-Bench a dependable benchmark for evaluating IR models. The
resources in AIR-Bench are publicly available at
https://github.com/AIR-Bench/AIR-Bench.
[COMMENTS]
31 pages, 6 figures; Update Table 5
[LINK]
http://arxiv.org/abs/2412.13102v2
[DATE]
2024-12-18 15:06:07+08:00
[CATEGORIES]
cs.CL
EscapeBench: Pushing Language Models to Think Outside the Box
[AUTHORS]
Cheng Qian, Peixuan Han, Qinyu Luo, Bingxiang He, Xiusi Chen, Yuji Zhang, Hongyi Du, Jiarui Yao, Xiaocheng Yang, Denghui Zhang, Yunzhu Li, Heng Ji
[COMMENTS]
23 pages, 15 figures
[LINK]
http://arxiv.org/abs/2412.13549v1
[DATE]
2024-12-18 14:50:39+08:00
[CATEGORIES]
cs.CL
cs.LG
Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning
[AUTHORS]
Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Min Zhang
[ABSTRACT]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance
across diverse tasks. Despite great success, recent studies show that LVLMs
encounter substantial limitations when engaging with visual graphs. To study
the reason behind these limitations, we propose VGCure, a comprehensive
benchmark covering 22 tasks for examining the fundamental graph understanding
and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs
reveal that LVLMs are weak in basic graph understanding and reasoning tasks,
particularly those concerning relational or structurally complex information.
Based on this observation, we propose a structure-aware fine-tuning framework
to enhance LVLMs with structure learning abilities through 3 self-supervised
learning tasks. Experiments validate the effectiveness of our method in
improving LVLMs’ zero-shot performance on fundamental graph learning tasks, as
well as enhancing the robustness of LVLMs against complex visual graphs.
[LINK]
http://arxiv.org/abs/2412.13540v1
[DATE]
2024-12-18 14:35:18+08:00
[CATEGORIES]
cs.CL
MetaRuleGPT: Recursive Numerical Reasoning of Language Models Trained with Simple Rules
[AUTHORS]
Kejie Chen, Lin Wang, Qinghai Zhang, Renjun Xu
[ABSTRACT]
Recent studies have highlighted the limitations of large language models in
mathematical reasoning, particularly their inability to capture the underlying
logic. Inspired by meta-learning, we propose that models should acquire not
only task-specific knowledge but also transferable problem-solving skills. We
introduce MetaRuleGPT, a novel Transformer-based architecture that performs
precise numerical calculations and complex logical operations by learning and
combining different rules. In contrast with traditional training sets, which
are heavily composed of massive raw instance data, MetaRuleGPT is pre-trained
on much less abstract datasets containing basic, compound, and iterative rules
for mathematical reasoning. Extensive experimental results demonstrate that
MetaRuleGPT can mimic humans’ rule-following capabilities, break down
complexity, and iteratively derive accurate results for complex mathematical
problems. These findings prove the potential of rule learning to enhance the
numerical reasoning abilities of language models.
[COMMENTS]
8 pages, 6 figures
[LINK]
http://arxiv.org/abs/2412.13536v1
[DATE]
2024-12-18 14:27:10+08:00
[CATEGORIES]
cs.CL
Information-Theoretic Generative Clustering of Documents
[AUTHORS]
Xin Du, Kumiko Tanaka-Ishii
[ABSTRACT]
We present generative clustering (GC) for clustering a set of
documents, $\mathrm{X}$, by using texts $\mathrm{Y}$ generated by large
language models (LLMs) instead of by clustering the original documents
$\mathrm{X}$. Because LLMs provide probability distributions, the similarity
between two documents can be rigorously defined in an information-theoretic
manner by the KL divergence. We also propose a natural, novel clustering
algorithm by using importance sampling. We show that GC achieves
state-of-the-art performance, outperforming any previous clustering method,
often by a large margin. Furthermore, we show an application to generative
document retrieval in which documents are indexed via hierarchical clustering
and our method improves the retrieval accuracy.
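Illustrative sketch (not from the paper): the KL divergence referred to above,
computed for two small discrete distributions. The distributions are made-up
toy values standing in for LLM-provided probabilities, and the paper's
importance-sampling estimator is not reproduced here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                # p puts mass where q has none: the divergence is infinite.
                return math.inf
            total += pi * math.log(pi / qi)
    return total


# Hypothetical distributions for two documents over the same two outcomes.
doc_a = [0.5, 0.5]
doc_b = [0.9, 0.1]
kl_ab = kl_divergence(doc_a, doc_b)
kl_ba = kl_divergence(doc_b, doc_a)
```

Note that KL divergence is not symmetric, so `kl_ab` and `kl_ba` differ; any
clustering built on it has to fix a direction or symmetrize.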
[COMMENTS]
Accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13534v1
[DATE]
2024-12-18 14:21:21+08:00
[CATEGORIES]
cs.LG
cs.CL
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
[AUTHORS]
Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal
[ABSTRACT]
Humans distill complex experiences into fundamental abstractions that enable
rapid learning and adaptation. Similarly, autoregressive transformers exhibit
adaptive learning through in-context learning (ICL), which begs the question of
how. In this paper, we propose a concept encoding-decoding mechanism to explain
ICL by studying how transformers form and use internal abstractions in their
representations. On synthetic ICL tasks, we analyze the training dynamics of a
small transformer and report the coupled emergence of concept encoding and
decoding. As the model learns to encode different latent concepts (e.g.,
“Finding the first noun in a sentence.”) into distinct, separable
representations, it concurrently builds conditional decoding algorithms and
improves its ICL performance. We validate the existence of this mechanism across
pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B).
Further, through mechanistic interventions and controlled finetuning, we
demonstrate that the quality of concept encoding is causally related to and
predictive of ICL performance. Our empirical insights shed light on the
success and failure modes of large language models via their
representations.
[LINK]
http://arxiv.org/abs/2412.12276v2
[DATE]
2024-12-18 14:02:03+08:00
[CATEGORIES]
cs.CL
cs.LG
AdaCQR: Enhancing Query Reformulation for Conversational Search via Sparse and Dense Retrieval Alignment
[AUTHORS]
Yilong Lai, Jialong Wu, Congzhi Zhang, Haowen Sun, Deyu Zhou
[ABSTRACT]
Conversational Query Reformulation (CQR) has significantly advanced in
addressing the challenges of conversational search, particularly those stemming
from the latent user intent and the need for historical context. Recent works
aimed to boost the performance of CQR through alignment. However, they are
designed for one specific retrieval system, which potentially results in poor
generalization. To overcome this limitation, we present a novel framework,
AdaCQR. By aligning reformulation models with both term-based and
semantic-based retrieval systems, AdaCQR enhances the generalizability of
information-seeking queries across diverse retrieval environments through a
dual-phase training strategy. We also developed two effective approaches for
acquiring superior labels and diverse input candidates, boosting the efficiency
and robustness of the framework. Experimental evaluations on the TopiOCQA and
QReCC datasets demonstrate that AdaCQR significantly outperforms existing
methods, offering both quantitative and qualitative improvements in
conversational query reformulation.
[COMMENTS]
Accepted by COLING 2025
[LINK]
http://arxiv.org/abs/2407.01965v2
[DATE]
2024-12-18 13:46:41+08:00
[CATEGORIES]
cs.CL
CEHA: A Dataset of Conflict Events in the Horn of Africa
[AUTHORS]
Rui Bai, Di Lu, Shihao Ran, Elizabeth Olson, Hemank Lamba, Aoife Cahill, Joel Tetreault, Alex Jaimes
[COMMENTS]
Accepted by COLING 2025
[LINK]
http://arxiv.org/abs/2412.13511v1
[DATE]
2024-12-18 13:22:33+08:00
[CATEGORIES]
cs.CL
Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning
[AUTHORS]
Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan Lok Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, Robert Mankoff, Robert Nowak
[ABSTRACT]
We present a novel multimodal preference dataset for creative tasks,
consisting of over 250 million human ratings on more than 2.2 million captions,
collected through crowdsourcing rating data for The New Yorker’s weekly cartoon
caption contest over the past eight years. This unique dataset supports the
development and evaluation of multimodal large language models and
preference-based fine-tuning algorithms for humorous caption generation. We
propose novel benchmarks for judging the quality of model-generated captions,
utilizing both GPT4 and human judgments to establish ranking-based evaluation
strategies. Our experimental results highlight the limitations of current
fine-tuning methods, such as RLHF and DPO, when applied to creative tasks.
Furthermore, we demonstrate that even state-of-the-art models like GPT4 and
Claude currently underperform top human contestants in generating humorous
captions. As we conclude this extensive data collection effort, we release the
entire preference dataset to the research community, fostering further
advancements in AI humor generation and evaluation.
[LINK]
http://arxiv.org/abs/2406.10522v2
[DATE]
2024-12-18 13:21:24+08:00
[CATEGORIES]
cs.LG
cs.CL
Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval
[AUTHORS]
Rui Cai, Zhiyu Dong, Jianfeng Dong, Xun Wang
[ABSTRACT]
Existing cross-modal retrieval methods typically rely on large-scale
vision-language pair data. This makes it challenging to efficiently develop a
cross-modal retrieval model for under-resourced languages of interest.
Therefore, Cross-lingual Cross-modal Retrieval (CCR), which aims to align
vision and the low-resource language (the target language) without using any
human-labeled target-language data, has gained increasing attention. As a
general parameter-efficient way, a common solution is to utilize adapter
modules to transfer the vision-language alignment ability of Vision-Language
Pretraining (VLP) models from a source language to a target language. However,
these adapters are usually static once learned, making it difficult to adapt to
target-language captions with varied expressions. To alleviate this, we propose
Dynamic Adapter with Semantics Disentangling (DASD), whose parameters are
dynamically generated conditioned on the characteristics of the input captions.
Considering that the semantics and expression styles of the input caption
largely influence how to encode it, we propose a semantic disentangling module
to extract the semantic-related and semantic-agnostic features from the input,
ensuring that generated adapters are well-suited to the characteristics of
the input caption. Extensive experiments on two image-text datasets and one
video-text dataset demonstrate the effectiveness of our model for cross-lingual
cross-modal retrieval, as well as its good compatibility with various VLP
models.
[COMMENTS]
Accepted by the 39th AAAI Conference on Artificial Intelligence
(AAAI-25)
[LINK]
http://arxiv.org/abs/2412.13510v1
[DATE]
2024-12-18 13:19:09+08:00
[CATEGORIES]
cs.CL
Boosting Long-Context Management via Query-Guided Activation Refilling
[AUTHORS]
Hongjin Qian, Zheng Liu, Peitian Zhang, Zhicheng Dou, Defu Lian
[ABSTRACT]
Processing long contexts poses a significant challenge for large language
models (LLMs) due to their inherent context-window limitations and the
computational burden of extensive key-value (KV) activations, which severely
impact efficiency. For information-seeking tasks, full context perception is
often unnecessary, as a query’s information needs can dynamically range from
localized details to a global perspective, depending on its complexity.
However, existing methods struggle to adapt effectively to these dynamic
information needs.
In this paper, we propose a method for processing long-context
information-seeking tasks via query-guided Activation Refilling (ACRE). ACRE
constructs a Bi-layer KV Cache for long contexts, where the layer-1 (L1) cache
compactly captures global information, and the layer-2 (L2) cache provides
detailed and localized information. ACRE establishes a proxying relationship
between the two caches, allowing the input query to attend to the L1 cache and
dynamically refill it with relevant entries from the L2 cache. This mechanism
integrates global understanding with query-specific local details, thus
improving answer decoding. Experiments on a variety of long-context
information-seeking datasets demonstrate ACRE’s effectiveness, achieving
improvements in both performance and efficiency.
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2412.12486v2
[DATE]
2024-12-18 13:08:39+08:00
[CATEGORIES]
cs.CL
VaeDiff-DocRE: End-to-end Data Augmentation Framework for Document-level Relation Extraction
[AUTHORS]
Khai Phan Tran, Wen Hua, Xue Li
[ABSTRACT]
Document-level Relation Extraction (DocRE) aims to identify relationships
between entity pairs within a document. However, most existing methods assume a
uniform label distribution, resulting in suboptimal performance on real-world,
imbalanced datasets. To tackle this challenge, we propose a novel data
augmentation approach using generative models to enhance data from the
embedding space. Our method leverages the Variational Autoencoder (VAE)
architecture to capture all relation-wise distributions formed by entity pair
representations and augment data for underrepresented relations. To better
capture the multi-label nature of DocRE, we parameterize the VAE’s latent space
with a Diffusion Model. Additionally, we introduce a hierarchical training
framework to integrate the proposed VAE-based augmentation module into DocRE
systems. Experiments on two benchmark datasets demonstrate that our method
outperforms state-of-the-art models, effectively addressing the long-tail
distribution problem in DocRE.
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2412.13503v1
[DATE]
2024-12-18 12:55:29+08:00
[CATEGORIES]
cs.CL
Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models
[AUTHORS]
Xinxin Liu, Aaron Thomas, Cheng Zhang, Jianyi Cheng, Yiren Zhao, Xitong Gao
[ABSTRACT]
Parameter-Efficient Fine-Tuning (PEFT) has gained prominence through low-rank
adaptation methods like LoRA. In this paper, we focus on sparsity-based PEFT
(SPEFT), which introduces trainable sparse adaptations to the weight matrices
in the model, offering greater flexibility in selecting fine-tuned parameters
compared to low-rank methods. We conduct the first systematic evaluation of
salience metrics for SPEFT, inspired by zero-cost NAS proxies, and find that
simple gradient-based metrics are reliable, with results on par with the best
alternatives, offering both computational efficiency and robust performance.
Additionally, we compare static and dynamic masking strategies, finding that
static masking, which predetermines non-zero entries before training, delivers
efficiency without sacrificing performance, while dynamic masking offers no
substantial benefits. Across NLP tasks, a simple gradient-based, static SPEFT
consistently outperforms other fine-tuning methods for LLMs, providing a simple
yet effective baseline for SPEFT. Our work challenges the notion that
complexity is necessary for effective PEFT. Our work is open source and
available to the community at https://github.com/0-ml/speft.
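Illustrative sketch (not from the paper): static masking with a gradient-based
salience score, as described above. The mask is fixed once from calibration
gradients, and only the selected entries are updated afterwards. Using the
absolute gradient as the score, the flat weight list, and the helper names
`static_topk_mask` and `masked_sgd_step` are assumptions for illustration, not
the authors' implementation.

```python
def static_topk_mask(grads, k):
    """Mark the k entries with the largest |gradient| as trainable (1), others frozen (0)."""
    order = sorted(range(len(grads)), key=lambda i: -abs(grads[i]))
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(grads))]


def masked_sgd_step(weights, grads, mask, lr):
    """Plain SGD update applied only where the static mask is 1."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


# Made-up weights and calibration gradients.
weights = [0.5, -1.0, 0.25, 2.0]
calib_grads = [0.01, -0.8, 0.3, -0.05]

# The mask is computed once, before training, and then reused for every step.
mask = static_topk_mask(calib_grads, k=2)
updated = masked_sgd_step(weights, calib_grads, mask, lr=0.1)
```

Because the mask never changes, the set of trainable entries is known up
front, which is what makes the static variant cheap compared with dynamic
masking.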
[LINK]
http://arxiv.org/abs/2412.13488v1
[DATE]
2024-12-18 12:14:35+08:00
[CATEGORIES]
cs.CL
T$^3$-S2S: Training-free Triplet Tuning for Sketch to Scene Generation
[AUTHORS]
Zhenhong Sun, Yifu Wang, Yonhon Ng, Yunfei Duan, Daoyi Dong, Hongdong Li, Pan Ji
[ABSTRACT]
Scene generation is crucial to many computer graphics applications. Recent
advances in generative AI have streamlined sketch-to-image workflows, easing
the workload for artists and designers in creating scene concept art. However,
these methods often struggle with complex scenes with multiple detailed objects,
sometimes missing small or uncommon instances. In this paper, we propose a
Training-free Triplet Tuning for Sketch-to-Scene (T3-S2S) generation after
reviewing the entire cross-attention mechanism. This scheme revitalizes the
existing ControlNet model, enabling effective handling of multi-instance
generations, involving prompt balance, characteristics prominence, and dense
tuning. Specifically, this approach enhances keyword representation via the
prompt balance module, reducing the risk of missing critical instances. It also
includes a characteristics prominence module that highlights TopK indices in
each channel, ensuring essential features are better represented based on token
sketches. Additionally, it employs dense tuning to refine contour details in
the attention map, compensating for instance-related regions. Experiments
validate that our triplet tuning approach substantially improves the
performance of existing sketch-to-image models. It consistently generates
detailed, multi-instance 2D images, closely adhering to the input prompts and
enhancing visual quality in complex multi-instance scenes. Code is available at
https://github.com/chaos-sun/t3s2s.git.
[LINK]
http://arxiv.org/abs/2412.13486v1
[DATE]
2024-12-18 12:01:32+08:00
[CATEGORIES]
cs.CL
Curriculum Learning for Cross-Lingual Data-to-Text Generation With Noisy Data
[AUTHORS]
Kancharla Aditya Hari, Manish Gupta, Vasudeva Varma
[ABSTRACT]
Curriculum learning has been used to improve the quality of text generation
systems by ordering the training samples according to a particular schedule in
various tasks. In the context of data-to-text generation (DTG), previous
studies used various difficulty criteria to order the training samples for
monolingual DTG. These criteria, however, do not generalize to the crosslingual
variant of the problem and do not account for noisy data. We explore multiple
criteria that can be used for improving the performance of cross-lingual DTG
systems with noisy data using two curriculum schedules. Using the alignment
score criterion for ordering samples and an annealing schedule to train the
model, we show an increase in BLEU score of up to 4 points, and improvements in
faithfulness and coverage of generations by 5-15% on average across 11 Indian
languages and English in 2 separate datasets. We make code and data publicly
available.
[LINK]
http://arxiv.org/abs/2412.13484v1
[DATE]
2024-12-18 12:00:18+08:00
[CATEGORIES]
cs.CL
A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models
[AUTHORS]
Bowen Chen, Namgi Han, Yusuke Miyao
[ABSTRACT]
The lack of data transparency in Large Language Models (LLMs) has highlighted
the importance of Membership Inference Attack (MIA), which differentiates
trained (member) and untrained (non-member) data. Though it shows success in
previous studies, recent research reported a near-random performance in
different settings, highlighting a significant performance inconsistency. We
assume that a single setting doesn’t represent the distribution of the vast
corpora, causing members and non-members with different distributions to be
sampled and causing inconsistency. In this study, instead of a single setting,
we statistically revisit MIA methods from various settings with thousands of
experiments for each MIA method, along with studies of text features,
embeddings, threshold decisions, and decoding dynamics of members and
non-members. We found that (1) MIA performance improves with model size and
varies with domains, while most methods do not statistically outperform
baselines, (2) though MIA performance is generally low, a notable number of
differentiable member and non-member outliers exist and vary across MIA
methods, (3) deciding a threshold to separate members and non-members is an
overlooked challenge, (4) text dissimilarity and long text benefit MIA
performance, (5) whether a sample is differentiable is reflected in the LLM
embedding, and (6) members and non-members show different decoding dynamics.
[COMMENTS]
main content 8 pages, 6 figures
[LINK]
http://arxiv.org/abs/2412.13475v1
[DATE]
2024-12-18 11:39:42+08:00
[CATEGORIES]
cs.CL
Gradual Vigilance and Interval Communication: Enhancing Value Alignment in Multi-Agent Debates
[AUTHORS]
Rui Zou, Mengqi Wei, Jintian Feng, Qian Wan, Jianwen Sun, Sannyuya Liu
[ABSTRACT]
In recent years, large language models have shown exceptional performance in
fulfilling diverse human needs. However, their training data can introduce
harmful content, underscoring the necessity for robust value alignment.
Mainstream methods, which depend on feedback learning and supervised training,
are resource-intensive and may constrain the full potential of the models.
Multi-Agent Debate (MAD) offers a more efficient and innovative solution by
enabling the generation of reliable answers through agent interactions. To
apply MAD to value alignment, we examine the relationship between the
helpfulness and harmlessness of debate outcomes and individual responses, and
propose a MAD-based framework, Gradual Vigilance and Interval Communication
(GVIC). GVIC allows agents to assess risks with varying levels of vigilance and
to exchange diverse information through interval communication. We
theoretically prove that GVIC optimizes debate efficiency while reducing
communication overhead. Experimental results demonstrate that GVIC consistently
outperforms baseline methods across various tasks and datasets, particularly
excelling in harmfulness mitigation and fraud prevention. Additionally, GVIC
exhibits strong adaptability across different base model sizes, including both
unaligned and aligned models, and across various task types.
[LINK]
http://arxiv.org/abs/2412.13471v1
[DATE]
2024-12-18 11:36:08+08:00
[CATEGORIES]
cs.CL
Transducer Tuning: Efficient Model Adaptation for Software Tasks Using Code Property Graphs
[AUTHORS]
Imam Nur Bani Yusuf, Lingxiao Jiang
[ABSTRACT]
Large language models have demonstrated promising performance across various
software engineering tasks. While fine-tuning is a common practice to adapt
these models for downstream tasks, it becomes challenging in
resource-constrained environments due to increased memory requirements from
growing trainable parameters in increasingly large language models. We
introduce Transducer Tuning, a technique to adapt large models for downstream
code tasks using Code Property Graphs (CPGs). Our approach introduces a modular
component called transducer that enriches code embeddings with structural and
dependency information from CPGs. The Transducer comprises two key components:
Graph Vectorization Engine (GVE) and Attention-Based Fusion Layer (ABFL). GVE
extracts CPGs from input source code and transforms them into graph feature
vectors. ABFL then fuses those graph feature vectors with initial code
embeddings from a large language model. By optimizing these transducers for
different downstream tasks, our approach enhances the models without the need
to fine-tune them for specific tasks. We have evaluated Transducer Tuning on
three downstream tasks: code summarization, assert generation, and code
translation. Our results demonstrate competitive performance compared to full
parameter fine-tuning while reducing trainable parameters by up to 99% to save
memory. Transducer Tuning also remains competitive against other fine-tuning
approaches (e.g., LoRA, Prompt-Tuning, Prefix-Tuning) while using only
1.5%-80% of their
trainable parameters. Our findings show that integrating structural and
dependency information through Transducer Tuning enables more efficient model
adaptation, making it easier for users to adapt large models in
resource-constrained settings.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2412.13467v1
[DATE]
2024-12-18 11:25:17+08:00
[CATEGORIES]
cs.CL
GenX: Mastering Code and Test Generation with Execution Feedback
[AUTHORS]
Nan Wang, Yafei Liu, Chen Chen, Haonan Lu
[ABSTRACT]
Recent advancements in language modeling have enabled the translation of
natural language into code, and the use of execution feedback to improve code
generation. However, these methods often rely heavily on pre-existing test
cases, which may not always be available or comprehensive. In this work, we
propose a novel approach that concurrently trains a code generation model and a
test generation model, utilizing execution feedback to refine and enhance the
performance of both. We introduce two strategies for test and code data
augmentation and a new scoring function for code and test ranking. We
experiment on the APPS dataset and demonstrate that our approach can
effectively generate and augment test cases, filter and synthesize correct code
solutions, and rank the quality of generated code and tests. The results
demonstrate that our models, when iteratively trained with an increasing number
of test cases and code solutions, outperform those trained on the original
dataset.
[LINK]
http://arxiv.org/abs/2412.13464v1
[DATE]
2024-12-18 11:18:21+08:00
[CATEGORIES]
cs.CL
Quantifying Semantic Emergence in Language Models
[AUTHORS]
Hang Chen, Xinyu Yang, Jiaying Zhu, Wenya Wang
[ABSTRACT]
Large language models (LLMs) are widely recognized for their exceptional
capacity to capture semantic meaning. Yet, there remains no established metric
to quantify this capability. In this work, we introduce a quantitative metric,
Information Emergence (IE), designed to measure LLMs’ ability to extract
semantics from input tokens. We formalize “semantics” as the meaningful
information abstracted from a sequence of tokens and quantify this by comparing
the entropy reduction observed for a sequence of tokens (macro-level) and
individual tokens (micro-level). To achieve this, we design a lightweight
estimator to compute the mutual information at each transformer layer, which is
agnostic to different tasks and language model architectures. We apply IE in
both synthetic in-context learning (ICL) scenarios and natural sentence
contexts. Experiments demonstrate informativeness and patterns about semantics.
While some of these patterns confirm the conventional prior linguistic
knowledge, the rest are relatively unexpected, which may provide new insights.
[COMMENTS]
17 pages
[LINK]
http://arxiv.org/abs/2405.12617v2
[DATE]
2024-12-18 11:03:08+08:00
[CATEGORIES]
cs.CL
Design Principle Transfer in Neural Architecture Search via Large Language Models
[AUTHORS]
Xun Zhou, Xingyu Wu, Liang Feng, Zhichao Lu, Kay Chen Tan
[ABSTRACT]
Transferable neural architecture search (TNAS) has been introduced to design
efficient neural architectures for multiple tasks, to enhance the practical
applicability of NAS in real-world scenarios. In TNAS, architectural knowledge
accumulated in previous search processes is reused to warm up the architecture
search for new tasks. However, existing TNAS methods still search in an
extensive search space, necessitating the evaluation of numerous architectures.
To overcome this challenge, this work proposes a novel transfer paradigm, i.e.,
design principle transfer. In this work, the linguistic description of various
structural components’ effects on architectural performance is termed design
principles. They are learned from established architectures and then can be
reused to reduce the search space by discarding unpromising architectures.
Searching in the refined search space can boost both the search performance and
efficiency for new NAS tasks. To this end, a large language model
(LLM)-assisted design principle transfer (LAPT) framework is devised. In LAPT,
an LLM is applied to automatically infer the design principles from a set of
given architectures, and then a principle adaptation method is applied to
refine these principles progressively based on the new search results.
Experimental results show that LAPT can beat the state-of-the-art TNAS methods
on most tasks and achieve comparable performance on others.
[LINK]
http://arxiv.org/abs/2408.11330v2
[DATE]
2024-12-18 10:51:50+08:00
[CATEGORIES]
cs.LG
cs.CL
FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding
[AUTHORS]
Zhuo Cao, Bingqing Zhang, Heming Du, Xin Yu, Xue Li, Sen Wang
[ABSTRACT]
Text-guided Video Temporal Grounding (VTG) aims to localize relevant segments
in untrimmed videos based on textual descriptions, encompassing two subtasks:
Moment Retrieval (MR) and Highlight Detection (HD). Although previous typical
methods have achieved commendable results, it is still challenging to retrieve
short video moments. This is primarily due to the reliance on sparse and
limited decoder queries, which significantly constrain the accuracy of
predictions. Furthermore, suboptimal outcomes often arise because previous
methods rank predictions based on isolated predictions, neglecting the broader
video context. To tackle these issues, we introduce FlashVTG, a framework
featuring a Temporal Feature Layering (TFL) module and an Adaptive Score
Refinement (ASR) module. The TFL module replaces the traditional decoder
structure to capture nuanced video content variations across multiple temporal
scales, while the ASR module improves prediction ranking by integrating context
from adjacent moments and multi-temporal-scale features. Extensive experiments
demonstrate that FlashVTG achieves state-of-the-art performance on four widely
adopted datasets in both MR and HD. Specifically, on the QVHighlights dataset,
it boosts mAP by 5.8% for MR and 3.3% for HD. For short-moment retrieval,
FlashVTG increases mAP to 125% of previous SOTA performance. All these
improvements are made without adding training burdens, underscoring its
effectiveness. Our code is available at https://github.com/Zhuo-Cao/FlashVTG.
[COMMENTS]
Accepted to WACV 2025
[LINK]
http://arxiv.org/abs/2412.13441v1
[DATE]
2024-12-18 10:23:33+08:00
[CATEGORIES]
cs.CL
Lightweight Safety Classification Using Pruned Language Models
[AUTHORS]
Mason Sawtell, Tula Masterman, Sandi Besen, Jim Brown
[ABSTRACT]
In this paper, we introduce a novel technique for content safety and prompt
injection classification for Large Language Models. Our technique, Layer
Enhanced Classification (LEC), trains a Penalized Logistic Regression (PLR)
classifier on the hidden state of an LLM’s optimal intermediate transformer
layer. By combining the computational efficiency of a streamlined PLR
classifier with the sophisticated language understanding of an LLM, our
approach delivers superior performance surpassing GPT-4o and special-purpose
models fine-tuned for each task. We find that small general-purpose models
(Qwen 2.5 sizes 0.5B, 1.5B, and 3B) and other transformer-based architectures
like DeBERTa v3 are robust feature extractors allowing simple classifiers to be
effectively trained on fewer than 100 high-quality examples. Importantly, the
intermediate transformer layers of these models typically outperform the final
layer across both classification tasks. Our results indicate that a single
general-purpose LLM can be used to classify content safety, detect prompt
injections, and simultaneously generate output tokens. Alternatively, these
relatively small LLMs can be pruned to the optimal intermediate layer and used
exclusively as robust feature extractors. Since our results are consistent on
different transformer architectures, we infer that robust feature extraction is
an inherent capability of most, if not all, LLMs.
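As a rough illustration of the classification step described above, the sketch below fits an L2-penalized logistic regression to fixed feature vectors. The two-dimensional vectors stand in for hidden states taken from an intermediate layer; the data, the `fit_plr` helper, and all hyperparameters are hypothetical and are not taken from the paper.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_plr(features, labels, l2=0.01, lr=0.5, steps=500):
    """Fit L2-penalized logistic regression with batch gradient descent."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(steps):
        grad_w = [l2 * wi for wi in w]  # gradient of the penalty (l2/2)*||w||^2
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    # Threshold the predicted probability at 0.5.
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Toy stand-ins for intermediate-layer hidden states (hypothetical data).
hidden_states = [[2.0, 1.5], [1.8, 2.2], [2.5, 1.9],
                 [-1.7, -2.0], [-2.2, -1.4], [-1.9, -2.4]]
is_unsafe = [1, 1, 1, 0, 0, 0]

w, b = fit_plr(hidden_states, is_unsafe)
predictions = [predict(w, b, x) for x in hidden_states]
```

In a real setting the feature vectors would come from a forward pass through the chosen layer, and the penalty strength would be tuned on held-out data.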
[LINK]
http://arxiv.org/abs/2412.13435v1
[DATE]
2024-12-18 10:13:13+08:00
[CATEGORIES]
cs.CL
cs.LG
Enhancing Talk Moves Analysis in Mathematics Tutoring through Classroom Teaching Discourse
[AUTHORS]
Jie Cao, Abhijit Suresh, Jennifer Jacobs, Charis Clevenger, Amanda Howard, Chelsea Brown, Brent Milne, Tom Fischaber, Tamara Sumner, James H. Martin
[ABSTRACT]
Human tutoring interventions play a crucial role in supporting student
learning, improving academic performance, and promoting personal growth. This
paper focuses on analyzing mathematics tutoring discourse using talk moves - a
framework of dialogue acts grounded in Accountable Talk theory. However,
scaling the collection, annotation, and analysis of extensive tutoring
dialogues to develop machine learning models is a challenging and
resource-intensive task. To address this, we present SAGA22, a compact dataset,
and explore various modeling strategies, including dialogue context, speaker
information, pretraining datasets, and further fine-tuning. By leveraging
existing datasets and models designed for classroom teaching, our results
demonstrate that supplementary pretraining on classroom data enhances model
performance in tutoring settings, particularly when incorporating longer
context and speaker information. Additionally, we conduct extensive ablation
studies to underscore the challenges in talk move modeling.
[COMMENTS]
Accepted to COLING 2025
[LINK]
http://arxiv.org/abs/2412.13395v1
[DATE]
2024-12-18 08:13:04+08:00
[CATEGORIES]
cs.CL
Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation
[AUTHORS]
Samin Mahdizadeh Sani, Pouya Sadeghi, Thuy-Trang Vu, Yadollah Yaghoobzadeh, Gholamreza Haffari
[ABSTRACT]
Large language models (LLMs) have made great progress in classification and
text generation tasks. However, they are mainly trained on English data and
often struggle with low-resource languages. In this study, we explore adding a
new language, i.e., Persian, to Llama (a model with a limited understanding of
Persian) using parameter-efficient fine-tuning. We employ a multi-stage
approach involving pretraining on monolingual Persian data, aligning
representations through bilingual pretraining and instruction datasets, and
instruction-tuning with task-specific datasets. We evaluate the model’s
performance at each stage on generation and classification tasks. Our findings
suggest that incorporating the Persian language, through bilingual data
alignment, can enhance classification accuracy for Persian tasks, with no
adverse impact and sometimes even improvements on English tasks. Additionally,
the results highlight the model’s initial strength as a critical factor when
working with limited training data, with cross-lingual alignment offering
minimal benefits for the low-resource language. Knowledge transfer from English
to Persian has a marginal effect, primarily benefiting simple classification
tasks.
[COMMENTS]
Accepted at COLING 2025
[LINK]
http://arxiv.org/abs/2412.13375v1
[DATE]
2024-12-18 07:18:06+08:00
[CATEGORIES]
cs.CL
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
[AUTHORS]
Qinqing Zheng, Mikael Henaff, Amy Zhang, Aditya Grover, Brandon Amos
[ABSTRACT]
Automatically synthesizing dense rewards from natural language descriptions
is a promising paradigm in reinforcement learning (RL), with applications to
sparse reward problems, open-ended exploration, and hierarchical skill design.
Recent works have made promising steps by exploiting the prior knowledge of
large language models (LLMs). However, these approaches suffer from important
limitations: they are either not scalable to problems requiring billions of
environment samples, due to requiring LLM annotations for each observation, or
they require a diverse offline dataset, which may not exist or be impossible to
collect. In this work, we address these limitations through a combination of
algorithmic and systems-level contributions. We propose ONI, a distributed
architecture that simultaneously learns an RL policy and an intrinsic reward
function using LLM feedback. Our approach annotates the agent’s collected
experience via an asynchronous LLM server, which is then distilled into an
intrinsic reward model. We explore a range of algorithmic choices for reward
modeling with varying complexity, including hashing, classification, and
ranking models. By studying their relative tradeoffs, we shed light on
questions regarding intrinsic reward design for sparse reward problems. Our
approach achieves state-of-the-art performance across a range of challenging,
sparse reward tasks from the NetHack Learning Environment in a simple unified
process, solely using the agent’s gathered experience, without requiring
external datasets. We make our code available at
https://github.com/facebookresearch/oni.
[LINK]
http://arxiv.org/abs/2410.23022v2
[DATE]
2024-12-18 06:29:46+08:00
[CATEGORIES]
cs.LG
cs.CL
Class-RAG: Real-Time Content Moderation with Retrieval Augmented Generation
[AUTHORS]
Jianfa Chen, Emily Shen, Trupti Bavalatti, Xiaowen Lin, Yongkai Wang, Shuming Hu, Harihar Subramanyam, Ksheeraj Sai Vepuri, Ming Jiang, Ji Qi, Li Chen, Nan Jiang, Ankit Jain
[ABSTRACT]
Robust content moderation classifiers are essential for the safety of
Generative AI systems. In this task, differences between safe and unsafe inputs
are often extremely subtle, making it difficult for classifiers (and indeed,
even humans) to properly distinguish violating vs. benign samples without
context or explanation. Scaling risk discovery and mitigation through
continuous model fine-tuning is also slow, challenging and costly, preventing
developers from being able to respond quickly and effectively to emergent
harms. We propose a Classification approach employing Retrieval-Augmented
Generation (Class-RAG). Class-RAG extends the capability of its base LLM
through access to a retrieval library which can be dynamically updated to
enable semantic hotfixing for immediate, flexible risk mitigation. Compared to
model fine-tuning, Class-RAG demonstrates flexibility and transparency in
decision-making, outperforms on classification and is more robust against
adversarial attack, as evidenced by empirical studies. Our findings also
suggest that Class-RAG performance scales with retrieval library size,
indicating that increasing the library size is a viable and low-cost approach
to improve content moderation.
[COMMENTS]
11 pages, submitted to ACL
[LINK]
http://arxiv.org/abs/2410.14881v2
[DATE]
2024-12-18 06:07:18+08:00
[CATEGORIES]
cs.CL
SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation
[AUTHORS]
Hangfeng He, Hongming Zhang, Dan Roth
[ABSTRACT]
To comprehensively gauge the capacity of current models for complex
reasoning, it is crucial to assess their step-by-step reasoning in a scalable
manner. Established reference-based evaluation metrics rely on human-annotated
reasoning chains as references to assess the model-derived chains. However,
such “gold-standard” human-written reasoning chains may not be unique and their
acquisition is often labor-intensive. Existing reference-free reasoning
evaluation metrics, while eliminating the need for human-crafted reasoning
chains as references, often require fine-tuning with human-derived chains
before evaluation, complicating the process and questioning their adaptability
to other datasets. To address these challenges, we harness GPT-4 to
automatically evaluate reasoning chain quality, thereby removing the dependency
on human-written reasoning chains for both model fine-tuning and evaluative
purposes. Leveraging the Socratic method, we develop SocREval (Socratic
Method-Inspired Reasoning Evaluation), a novel approach for prompt
design in reference-free reasoning evaluation. Empirical results from four
human annotated datasets reveal that SocREval significantly improves GPT-4’s
performance, surpassing existing reference-free and reference-based reasoning
evaluation metrics. Beyond its demonstrated efficacy, SocREval proves to be
both cost-efficient and robust to prompt writing and example selection, as
substantiated by our in-depth analysis.
[LINK]
http://arxiv.org/abs/2310.00074v3
[DATE]
2024-12-18 05:56:45+08:00
[CATEGORIES]
cs.CL
Experience of Training a 1.7B-Parameter LLaMa Model From Scratch
[AUTHORS]
Miles Q. Li, Benjamin C. M. Fung, Shih-Chia Huang
[ABSTRACT]
Pretraining large language models is a complex endeavor influenced by
multiple factors, including model architecture, data quality, training
continuity, and hardware constraints. In this paper, we share insights gained
from the experience of training DMaS-LLaMa-Lite, a fully open source,
1.7-billion-parameter, LLaMa-based model, on approximately 20 billion tokens of
carefully curated data. We chronicle the full training trajectory, documenting
how evolving validation loss levels and downstream benchmarks reflect
transitions from incoherent text to fluent, contextually grounded output.
Beyond standard quantitative metrics, we highlight practical considerations
such as the importance of restoring optimizer states when resuming from
checkpoints, and the impact of hardware changes on training stability and
throughput. While qualitative evaluation provides an intuitive understanding of
model improvements, our analysis extends to various performance benchmarks,
demonstrating how high-quality data and thoughtful scaling enable competitive
results with significantly fewer training tokens. By detailing these
experiences and offering training logs, checkpoints, and sample outputs, we aim
to guide future researchers and practitioners in refining their pretraining
strategies. The training script is available on GitHub at
https://github.com/McGill-DMaS/DMaS-LLaMa-Lite-Training-Code. The model
checkpoints are available on Hugging Face at
https://huggingface.co/collections/McGill-DMaS/dmas-llama-lite-6761d97ba903f82341954ceb.
[LINK]
http://arxiv.org/abs/2412.13335v1
[DATE]
2024-12-18 05:15:52+08:00
[CATEGORIES]
cs.CL
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs
[AUTHORS]
Basel Mousi, Nadir Durrani, Fatema Ahmad, Md. Arid Hasan, Maram Hasanain, Tameem Kabbani, Fahim Dalvi, Shammur Absar Chowdhury, Firoj Alam
[ABSTRACT]
Arabic, with its rich diversity of dialects, remains significantly
underrepresented in Large Language Models, particularly in dialectal
variations. We address this gap by introducing seven synthetic datasets in
dialects alongside Modern Standard Arabic (MSA), created using Machine
Translation (MT) combined with human post-editing. We present AraDiCE, a
benchmark for Arabic Dialect and Cultural Evaluation. We evaluate LLMs on
dialect comprehension and generation, focusing specifically on low-resource
Arabic dialects. Additionally, we introduce the first-ever fine-grained
benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and
Levant regions, providing a novel dimension to LLM evaluation. Our findings
demonstrate that while Arabic-specific models like Jais and AceGPT outperform
multilingual models on dialectal tasks, significant challenges persist in
dialect identification, generation, and translation. This work contributes
$\approx$45K post-edited samples, a cultural benchmark, and highlights the
importance of tailored training to improve LLM performance in capturing the
nuances of diverse Arabic dialects and cultural contexts. We have released the
dialectal translation models and benchmarks developed in this study
(https://huggingface.co/datasets/QCRI/AraDiCE).
[COMMENTS]
Benchmarking, Culturally Informed, Large Language Models, Arabic NLP,
LLMs, Arabic Dialect, Dialectal Benchmarking
[LINK]
http://arxiv.org/abs/2409.11404v3
[DATE]
2024-12-18 05:15:26+08:00
[CATEGORIES]
cs.CL
Contrastive Learning and Mixture of Experts Enables Precise Vector Embeddings
[AUTHORS]
Logan Hallee, Rohan Kapur, Arjun Patel, Jason P. Gleghorn, Bohdan Khomtchouk
[ABSTRACT]
The advancement of transformer neural networks has significantly elevated the
capabilities of sentence similarity models, but they still struggle with highly
discriminative tasks and may produce sub-optimal representations of important
documents like scientific literature. With the increased reliance on retrieval
augmentation and search, representing diverse documents as concise and
descriptive vectors is crucial. This paper improves upon the vector embeddings
of scientific text by assembling niche datasets using co-citations as a
similarity metric, focusing on biomedical domains. We apply a novel Mixture of
Experts (MoE) extension pipeline to pretrained BERT models, where every
multi-layer perceptron section is enlarged and copied into multiple distinct
experts. Our MoE variants perform well over $N$ scientific domains with $N$
dedicated experts, whereas standard BERT models excel in only one domain at a
time. Notably, extending just a single transformer block to MoE captures 85% of
the benefit seen from full MoE extension at every layer. This holds promise for
versatile and efficient One-Size-Fits-All transformer networks for numerically
representing diverse inputs. Our methodology marks advancements in
representation learning and holds promise for enhancing vector database search
and compilation.
[LINK]
http://arxiv.org/abs/2401.15713v3
[DATE]
2024-12-18 04:58:26+08:00
[CATEGORIES]
cs.LG
cs.CL
Synthetic Lyrics Detection Across Languages and Genres
[AUTHORS]
Yanis Labrak, Markus Frohmann, Gabriel Meseguer-Brocal, Elena V. Epure
[ABSTRACT]
In recent years, the use of large language models (LLMs) to generate music
content, particularly lyrics, has gained in popularity. These advances provide
valuable tools for artists and enhance their creative processes, but they also
raise concerns about copyright violations, consumer satisfaction, and content
spamming. Previous research has explored content detection in various domains.
However, no work has focused on the modality of lyrics in music. To address
this gap, we curated a diverse dataset of real and synthetic lyrics from
multiple languages, music genres, and artists. The generation pipeline was
validated using both humans and automated methods. We conducted a comprehensive
evaluation of existing synthetic text detection features on this novel data
type. Additionally, we explored strategies to adjust the best feature for
lyrics using unsupervised adaptation. Adhering to constraints of our
application domain, we investigated cross-lingual generalization, data
scalability, robustness to language combinations, and the impact of genre
novelty in a few-shot detection scenario. Our findings show promising results
within language families and similar genres, yet challenges persist with lyrics
in languages that exhibit distinct semantic structures.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2406.15231v2
[DATE]
2024-12-18 04:50:40+08:00
[CATEGORIES]
cs.CL
cs.LG
FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering
[AUTHORS]
Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji
[ABSTRACT]
Multimodal multihop question answering is a complex task that requires
reasoning over multiple sources of information, such as images and text, to
answer questions. While there has been significant progress in visual question
answering, the multihop setting remains unexplored due to the lack of
high-quality datasets. Current methods focus on single-hop question answering
or a single modality, which makes them unsuitable for real-world scenarios such
as analyzing multimodal educational materials, summarizing lengthy academic
articles, or interpreting scientific studies that combine charts, images, and
text. To address this gap, we propose a novel methodology, introducing the
first framework for creating a high-quality dataset that enables training
models for multimodal multihop question answering. Our approach consists of a
5-stage pipeline that involves acquiring relevant multimodal documents from
Wikipedia, synthetically generating high-level questions and answers, and
validating them through rigorous criteria to ensure quality data. We evaluate
our methodology by training models on our synthesized dataset and testing on
two benchmarks. Our results demonstrate that, with an equal sample size, models
trained on our synthesized data outperform those trained on human-collected
data by 1.9 in exact match (EM) on average. We believe our data synthesis
method will serve as a strong foundation for training and evaluating multimodal
multihop question answering models.
[COMMENTS]
20 pages, 11 figures, 10 tables, Submitted to CVPR 2025
[LINK]
http://arxiv.org/abs/2412.07030v2
[DATE]
2024-12-18 04:38:21+08:00
[CATEGORIES]
cs.CL
cs.LG
Hint Marginalization for Improved Reasoning in Large Language Models
[AUTHORS]
Soumyasundar Pal, Didier Chételat, Yingxue Zhang, Mark Coates
[ABSTRACT]
Large Language Models (LLMs) have exhibited an impressive capability to
perform reasoning tasks, especially if they are encouraged to generate a
sequence of intermediate steps. Reasoning performance can be improved by
suitably combining multiple LLM responses, generated either in parallel in a
single query, or via sequential interactions with LLMs throughout the reasoning
process. Existing strategies for combination, such as self-consistency and
progressive-hint-prompting, make inefficient usage of the LLM responses. We
present Hint Marginalization, a novel and principled algorithmic framework to
enhance the reasoning capabilities of LLMs. Our approach can be viewed as an
iterative sampling strategy for forming a Monte Carlo approximation of an
underlying distribution of answers, with the goal of identifying the mode, i.e.,
the most likely answer. Empirical evaluation on several benchmark datasets for
arithmetic reasoning demonstrates the superiority of the proposed approach.
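The idea of approximating an answer distribution by sampling and then reading off its mode can be sketched as follows. This is only a generic Monte Carlo mode estimate, not the Hint Marginalization algorithm itself; `sample_answer` is a hypothetical stand-in for an LLM call, and its answer probabilities are invented for illustration.

```python
import random
from collections import Counter

def sample_answer(rng):
    # Hypothetical stand-in for one LLM response to an arithmetic question.
    return rng.choices(["42", "41", "24"], weights=[0.6, 0.3, 0.1])[0]

def estimate_mode(num_samples, seed=0):
    """Return the empirical answer distribution and its most frequent answer."""
    rng = random.Random(seed)
    counts = Counter(sample_answer(rng) for _ in range(num_samples))
    distribution = {answer: c / num_samples for answer, c in counts.items()}
    mode = max(distribution, key=distribution.get)
    return distribution, mode

distribution, mode = estimate_mode(num_samples=1000)
```

With more samples the empirical frequencies concentrate around the underlying probabilities, so the estimated mode becomes more reliable at the cost of more model calls.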
[LINK]
http://arxiv.org/abs/2412.13292v1
[DATE]
2024-12-18 03:45:53+08:00
[CATEGORIES]
cs.CL
Enhancing Persona Classification in Dialogue Systems: A Graph Neural Network Approach
[AUTHORS]
Konstantin Zaitsev
[ABSTRACT]
In recent years, Large Language Models (LLMs) have gained considerable attention for
their potential to enhance personalized experiences in virtual assistants and
chatbots. A key area of interest is the integration of personas into LLMs to
improve dialogue naturalness and user engagement. This study addresses the
challenge of persona classification, a crucial component in dialogue
understanding, by proposing a framework that combines text embeddings with
Graph Neural Networks (GNNs) for effective persona classification. Given the
absence of dedicated persona classification datasets, we create a manually
annotated dataset to facilitate model training and evaluation. Our method
involves extracting semantic features from persona statements using text
embeddings and constructing a graph where nodes represent personas and edges
capture their similarities. The GNN component uses this graph structure to
propagate relevant information, thereby improving classification performance.
Experimental results show that our approach, in particular the integration of
GNNs, significantly improves classification performance, especially with
limited data. Our contributions include the development of a persona
classification framework and the creation of a dataset.
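The graph construction and propagation described above can be sketched in a few lines. The embeddings, the similarity threshold, and the single mean-aggregation step are all illustrative assumptions; the embedding model and GNN architecture actually used are not specified in this abstract.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def build_graph(embeddings, threshold=0.8):
    """Connect two persona nodes when their embeddings are similar enough."""
    n = len(embeddings)
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors

def propagate(embeddings, neighbors):
    """One message-passing step: average each node with its neighbors."""
    updated = []
    for i, own in enumerate(embeddings):
        group = [own] + [embeddings[j] for j in neighbors[i]]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated

# Hypothetical 2-d embeddings of four persona statements.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
neighbors = build_graph(embeddings)
smoothed = propagate(embeddings, neighbors)
```

A trained GNN would replace the plain average with learned weights and a nonlinearity, but the flow of information along similarity edges is the same.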
[LINK]
http://arxiv.org/abs/2412.13283v1
[DATE]
2024-12-18 03:27:24+08:00
[CATEGORIES]
cs.CL
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
[AUTHORS]
Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
[ABSTRACT]
The process of creating training data to teach models is currently driven by
humans, who manually analyze model weaknesses and plan how to create data that
improves a student model. Approaches using LLMs as annotators reduce human
effort, but still require humans to interpret feedback from evaluations and
control the LLM to produce data the student needs. Automating this
labor-intensive process by creating autonomous data generation agents - or
teachers - is desirable, but requires environments that can simulate the
feedback-driven, iterative, closed loop of data creation. To enable rapid,
scalable testing for such agents and their modules, we introduce DataEnvGym, a
testbed of teacher environments for data generation agents. DataEnvGym frames
data generation as a sequential decision-making task, involving an agent
consisting of a data generation policy (which generates a plan for creating
training data) and a data generation engine (which transforms the plan into
data), inside an environment that provides student feedback. The agent’s goal
is to improve student performance. Students are iteratively trained and
evaluated on generated data, and their feedback (in the form of errors or weak
skills) is reported to the agent after each iteration. DataEnvGym includes
multiple teacher environment instantiations across 3 levels of structure in the
state representation and action space. More structured environments are based
on inferred skills and offer more interpretability and curriculum control. We
support 4 domains (math, code, VQA, and tool-use) and test multiple students
and teachers. Example agents in our teaching environments can iteratively
improve students across tasks and settings. Moreover, we show that environments
teach different skill levels and test variants of key modules, pointing to
future work in improving data generation agents, engines, and feedback
mechanisms.
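The closed loop described above (the policy plans, the engine generates data, the student is trained and evaluated, and feedback returns to the agent) might be sketched as below. Every function, skill name, and number here is a toy placeholder invented for illustration and does not reflect the DataEnvGym API.

```python
def policy(feedback):
    # Plan: target the skill with the lowest current accuracy.
    return min(feedback, key=feedback.get)

def engine(skill, n=5):
    # Turn the plan into (toy) training examples for that skill.
    return [(skill, i) for i in range(n)]

def train_and_evaluate(student, data):
    # Toy student: each example raises its skill's accuracy by 0.05, capped at 1.0.
    for skill, _ in data:
        student[skill] = min(1.0, student[skill] + 0.05)
    return dict(student)  # feedback: per-skill accuracy

student = {"math": 0.2, "code": 0.5, "vqa": 0.4}
feedback = dict(student)
history = []
for _ in range(4):
    plan = policy(feedback)                        # data generation policy
    data = engine(plan)                            # data generation engine
    feedback = train_and_evaluate(student, data)   # environment feedback
    history.append(plan)
```

Here the feedback is a per-skill accuracy table, which corresponds loosely to the more structured, skill-based environments mentioned in the abstract.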
[COMMENTS]
Project Page: https://DataEnvGym.github.io
[LINK]
http://arxiv.org/abs/2410.06215v2
[DATE]
2024-12-18 02:54:45+08:00
[CATEGORIES]
cs.CL
cs.LG
DnDScore: Decontextualization and Decomposition for Factuality Verification in Long-Form Text Generation
[AUTHORS]
Miriam Wanner, Benjamin Van Durme, Mark Dredze
[ABSTRACT]
The decompose-then-verify strategy for verification of Large Language Model
(LLM) generations decomposes claims that are then independently verified.
Decontextualization augments text (claims) to ensure it can be verified outside
of the original context, enabling reliable verification. While decomposition
and decontextualization have been explored independently, their interactions in
a complete system have not been investigated. Their conflicting purposes can
create tensions: decomposition isolates atomic facts while decontextualization
inserts relevant information. Furthermore, a decontextualized subclaim presents
a challenge to the verification step: what part of the augmented text should be
verified as it now contains multiple atomic facts? We conduct an evaluation of
different decomposition, decontextualization, and verification strategies and
find that the choice of strategy matters in the resulting factuality scores.
Additionally, we introduce DnDScore, a decontextualization-aware verification
method which validates subclaims in the context of contextual information.
[LINK]
http://arxiv.org/abs/2412.13175v1
[DATE]
2024-12-18 02:54:01+08:00
[CATEGORIES]
cs.CL
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
[AUTHORS]
Jeffrey Cheng, Benjamin Van Durme
[ABSTRACT]
Chain-of-thought (CoT) decoding enables language models to improve reasoning
performance at the cost of high generation latency in decoding. Recent
proposals have explored variants of contemplation tokens, a term we introduce
that refers to special tokens used during inference to allow for extra
computation. Prior work has considered fixed-length sequences drawn from a
discrete set of embeddings as contemplation tokens. Here we propose Compressed
Chain-of-Thought (CCoT), a framework to generate contentful and continuous
contemplation tokens of variable sequence length. The generated contemplation
tokens are compressed representations of explicit reasoning chains, and our
method can be applied to off-the-shelf decoder language models. Through
experiments, we illustrate how CCoT enables additional reasoning over dense
contentful representations to achieve corresponding improvements in accuracy.
Moreover, the reasoning improvements can be adaptively modified on demand by
controlling the number of contemplation tokens generated.
[LINK]
http://arxiv.org/abs/2412.13171v1
[DATE]
2024-12-18 02:50:33+08:00
[CATEGORIES]
cs.CL
Syntactic Transfer to Kyrgyz Using the Treebank Translation Method
[AUTHORS]
Anton Alekseev, Alina Tillabaeva, Gulnara Dzh. Kabaeva, Sergey I. Nikolenko
[ABSTRACT]
The Kyrgyz language, as a low-resource language, requires significant effort
to create high-quality syntactic corpora. This study proposes an approach to
simplify the development process of a syntactic corpus for Kyrgyz. We present a
tool for transferring syntactic annotations from Turkish to Kyrgyz based on a
treebank translation method. The effectiveness of the proposed tool was
evaluated using the TueCL treebank. The results demonstrate that this approach
achieves higher syntactic annotation accuracy compared to a monolingual model
trained on the Kyrgyz KTMU treebank. Additionally, the study introduces a
method for assessing the complexity of manual annotation for the resulting
syntactic trees, contributing to further optimization of the annotation
process.
[COMMENTS]
To be published in the Journal of Math. Sciences. Zapiski version (in
Russian): http://www.pdmi.ras.ru/znsl/2024/v540/abs252.html
[LINK]
http://arxiv.org/abs/2412.13146v1
[DATE]
2024-12-18 02:12:33+08:00
[CATEGORIES]
cs.CL
Reinforcement Learning Enhanced LLMs: A Survey
[AUTHORS]
Shuhe Wang, Shengyu Zhang, Jie Zhang, Runyi Hu, Xiaoya Li, Tianwei Zhang, Jiwei Li, Fei Wu, Guoyin Wang, Eduard Hovy
[LINK]
http://arxiv.org/abs/2412.10400v2
[DATE]
2024-12-18 02:05:11+08:00
[CATEGORIES]
cs.CL
cs.LG
RISCORE: Enhancing In-Context Riddle Solving in Language Models through Context-Reconstructed Example Augmentation
[AUTHORS]
Ioannis Panagiotopoulos, Giorgos Filandrianos, Maria Lymperaiou, Giorgos Stamou
[COMMENTS]
Accepted at COLING 2025
[LINK]
http://arxiv.org/abs/2409.16383v4
[DATE]
2024-12-18 01:42:18+08:00
[CATEGORIES]
cs.CL
Systematic Biases in LLM Simulations of Debates
[AUTHORS]
Amir Taubenfeld, Yaniv Dover, Roi Reichart, Ariel Goldstein
[COMMENTS]
Published as a conference paper at EMNLP 2024
[LINK]
http://arxiv.org/abs/2402.04049v3
[DATE]
2024-12-18 01:17:21+08:00
[CATEGORIES]
cs.CL
AI PERSONA: Towards Life-long Personalization of LLMs
[AUTHORS]
Tiannan Wang, Meiling Tao, Ruoyu Fang, Huilin Wang, Shuai Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou
[ABSTRACT]
In this work, we introduce the task of life-long personalization of large
language models. While recent mainstream efforts in the LLM community mainly
focus on scaling data and compute for improved capabilities of LLMs, we argue
that it is also very important to enable LLM systems, or language agents, to
continuously adapt to the diverse and ever-changing profiles of every distinct
user and provide up-to-date personalized assistance. We provide a clear task
formulation and introduce a simple, general, effective, and scalable framework
for life-long personalization of LLM systems and language agents. To facilitate
future research on LLM personalization, we also introduce methods to synthesize
realistic benchmarks and robust evaluation metrics. We will release all code
and data for building and benchmarking life-long personalized LLM systems.
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2412.13103v1
[DATE]
2024-12-18 01:17:03+08:00
[CATEGORIES]
cs.CL
Uchaguzi-2022: A Dataset of Citizen Reports on the 2022 Kenyan Election
[AUTHORS]
Roberto Mondini, Neema Kotonya, Robert L. Logan IV, Elizabeth M Olson, Angela Oduor Lungati, Daniel Duke Odongo, Tim Ombasa, Hemank Lamba, Aoife Cahill, Joel R. Tetreault, Alejandro Jaimes
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2412.13098v1
[DATE]
2024-12-18 01:08:35+08:00
[CATEGORIES]
cs.CL
LMUnit: Fine-grained Evaluation with Natural Language Unit Tests
[AUTHORS]
Jon Saad-Falcon, Rajan Vivek, William Berrios, Nandita Shankar Naik, Matija Franklin, Bertie Vidgen, Amanpreet Singh, Douwe Kiela, Shikib Mehri
[ABSTRACT]
As language models become integral to critical workflows, assessing their
behavior remains a fundamental challenge – human evaluation is costly and
noisy, while automated metrics provide only coarse, difficult-to-interpret
signals. We introduce natural language unit tests, a paradigm that decomposes
response quality into explicit, testable criteria, along with a unified scoring
model, LMUnit, which combines multi-objective training across preferences,
direct ratings, and natural language rationales. Through controlled human
studies, we show this paradigm significantly improves inter-annotator agreement
and enables more effective LLM development workflows. LMUnit achieves
state-of-the-art performance on evaluation benchmarks (FLASK, BigGenBench) and
competitive results on RewardBench. These results validate both our proposed
paradigm and scoring model, suggesting a promising path forward for language
model evaluation and development.
[LINK]
http://arxiv.org/abs/2412.13091v1
[DATE]
2024-12-18 01:01:15+08:00
[CATEGORIES]
cs.CL
PersonaMark: Personalized LLM watermarking for model protection and user attribution
[AUTHORS]
Yuehan Zhang, Peizhuo Lv, Yinpeng Liu, Yongqiang Ma, Wei Lu, Xiaofeng Wang, Xiaozhong Liu, Jiawei Liu
[ABSTRACT]
The rapid advancement of customized Large Language Models (LLMs) offers
considerable convenience. However, it also intensifies concerns regarding the
protection of copyright/confidential information. With the extensive adoption
of private LLMs, safeguarding model copyright and ensuring data privacy have
become critical. Text watermarking has emerged as a viable solution for
detecting AI-generated content and protecting models. However, existing methods
fall short in providing individualized watermarks for each user, a critical
feature for enhancing accountability and traceability. In this paper, we
introduce PersonaMark, a novel personalized text watermarking scheme designed
to protect LLMs’ copyrights and bolster accountability. PersonaMark leverages
sentence structure as a subtle carrier of watermark information and optimizes
the generation process to maintain the natural output of the model. By
employing a personalized hashing function, unique watermarks are embedded for
each user, enabling high-quality text generation without compromising the
model’s performance. This approach is both time-efficient and scalable, capable
of handling large numbers of users through a multi-user hashing mechanism. To
the best of our knowledge, this is a pioneering study exploring personalized
watermarking in LLMs. We conduct extensive evaluations across four LLMs,
analyzing various metrics such as perplexity, sentiment, alignment, and
readability. The results validate that PersonaMark preserves text quality,
ensures unbiased watermark insertion, and offers robust watermark detection
capabilities, all while maintaining the model’s behavior with minimal
disruption.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2409.09739v2
[DATE]
2024-12-18 00:52:12+08:00
[CATEGORIES]
cs.CL
Rethinking the Alignment of Psychotherapy Dialogue Generation with Motivational Interviewing Strategies
[AUTHORS]
Xin Sun, Xiao Tang, Abdallah El Ali, Zhuying Li, Pengjie Ren, Jan de Wit, Jiahuan Pei, Jos A. Bosch
[ABSTRACT]
Recent advancements in large language models (LLMs) have shown promise in
generating psychotherapeutic dialogues, particularly in the context of
motivational interviewing (MI). However, the inherent lack of transparency in
LLM outputs presents significant challenges given the sensitive nature of
psychotherapy. Applying MI strategies, a set of MI skills, to generate more
controllable therapeutic-adherent conversations with explainability provides a
possible solution. In this work, we explore the alignment of LLMs with MI
strategies by first prompting the LLMs to predict the appropriate strategies as
reasoning and then utilizing these strategies to guide the subsequent dialogue
generation. We seek to investigate whether such alignment leads to more
controllable and explainable generations. Multiple experiments including
automatic and human evaluations are conducted to validate the effectiveness of
MI strategies in aligning psychotherapy dialogue generation. Our findings
demonstrate the potential of LLMs in producing strategically aligned dialogues
and suggest directions for practical applications in psychotherapeutic
settings.
[LINK]
http://arxiv.org/abs/2408.06527v2
[DATE]
2024-12-18 00:44:16+08:00
[CATEGORIES]
cs.CL
Modality-Inconsistent Continual Learning of Multimodal Large Language Models
[AUTHORS]
Weiguo Pian, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian
[ABSTRACT]
In this paper, we introduce Modality-Inconsistent Continual Learning (MICL),
a new continual learning scenario for Multimodal Large Language Models (MLLMs)
that involves tasks with inconsistent modalities (image, audio, or video) and
varying task types (captioning or question-answering). Unlike existing
vision-only or modality-incremental settings, MICL combines modality and task
type shifts, both of which drive catastrophic forgetting. To address these
challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module
to mitigate forgetting caused by task type shifts in previously seen
modalities. It also incorporates Instruction-based Knowledge Distillation to
preserve the model’s ability to handle previously learned modalities when new
ones are introduced. We benchmark MICL using a total of six tasks and conduct
experiments to validate the effectiveness of our proposed MoInCL. The
experimental results highlight the superiority of MoInCL, showing significant
improvements over representative and state-of-the-art continual learning
baselines.
[LINK]
http://arxiv.org/abs/2412.13050v1
[DATE]
2024-12-18 00:13:56+08:00
[CATEGORIES]
cs.LG
cs.CL
Causal Prompting: Debiasing Large Language Model Prompting based on Front-Door Adjustment
[AUTHORS]
Congzhi Zhang, Linhai Zhang, Jialong Wu, Yulan He, Deyu Zhou
[ABSTRACT]
Despite the notable advancements of existing prompting methods, such as
In-Context Learning and Chain-of-Thought for Large Language Models (LLMs), they
still face challenges related to various biases. Traditional debiasing methods
primarily focus on the model training stage, including approaches based on data
augmentation and reweighting, yet they struggle with the complex biases
inherent in LLMs. To address such limitations, the causal relationship behind
the prompting methods is uncovered using a structural causal model, and a novel
causal prompting method based on front-door adjustment is proposed to
effectively mitigate LLM biases. Specifically, causal intervention is achieved
by designing the prompts without accessing the parameters and logits of LLMs.
The chain-of-thought generated by the LLM is employed as the mediator variable and
the causal effect between input prompts and output answers is calculated
through front-door adjustment to mitigate model biases. Moreover, to accurately
represent the chains of thought and estimate the causal effects, contrastive
learning is used to fine-tune the encoder of chain-of-thought by aligning its
space with that of the LLM. Experimental results show that the proposed causal
prompting approach achieves excellent performance across seven natural language
processing datasets on both open-source and closed-source LLMs.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2403.02738v3
[DATE]
2024-12-18 00:10:26+08:00
[CATEGORIES]
cs.CL
Harnessing Event Sensory Data for Error Pattern Prediction in Vehicles: A Language Model Approach
[AUTHORS]
Hugo Math, Rainer Lienhart, Robin Schön
[ABSTRACT]
In this paper, we draw an analogy between processing natural languages and
processing multivariate event streams from vehicles in order to predict
$\textit{when}$ and $\textit{what}$ error pattern is most likely to occur in
the future for a given car. Our approach leverages the temporal dynamics and
contextual relationships of our event data from a fleet of cars. Event data is
composed of discrete values of error codes as well as continuous values such as
time and mileage. Modelled by two causal Transformers, we can anticipate
vehicle failures and malfunctions before they happen. Thus, we introduce
$\textit{CarFormer}$, a Transformer model trained via a new self-supervised
learning strategy, and $\textit{EPredictor}$, an autoregressive Transformer
decoder model capable of predicting $\textit{when}$ and $\textit{what}$ error
pattern will most likely occur after a given error code appears. Despite the
challenges of high cardinality of event types, their unbalanced frequency of
appearance and limited labelled data, our experimental results demonstrate the
excellent predictive ability of our novel model. Specifically, with sequences
of $160$ error codes on average, our model needs only half of the error codes
to achieve an $80\%$ F1 score for predicting $\textit{what}$ error pattern
will occur, and it achieves an average absolute error of $58.4 \pm 13.2$h
$\textit{when}$ forecasting the time of occurrence, thus enabling confident
predictive maintenance and enhancing vehicle safety.
[COMMENTS]
10 pages, 8 figures, accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13041v1
[DATE]
2024-12-18 00:05:30+08:00
[CATEGORIES]
cs.CL
cs.LG
Model-Agnostic Cosmological Inference with SDSS-IV eBOSS: Simultaneous Probing for Background and Perturbed Universe
[AUTHORS]
Purba Mukherjee, Anjan A. Sen
[ABSTRACT]
Here we explore certain subtle features imprinted in data from the completed
Sloan Digital Sky Survey IV (SDSS-IV) extended Baryon Oscillation Spectroscopic
Survey (eBOSS) as a combined probe for the background and perturbed Universe.
We reconstruct the Baryon Acoustic Oscillation (BAO) and Redshift Space
Distortion (RSD) observables as functions of redshift, using measurements from
SDSS alone. We apply the Multi-Task Gaussian Process (MTGP) framework to model
the interdependencies of cosmological observables $D_M(z)/r_d$, $D_H(z)/r_d$,
and $f\sigma_8(z)$, and track their evolution across different redshifts.
Subsequently, we obtain a constrained three-dimensional phase space containing
$D_M(z)/r_d$, $D_H(z)/r_d$, and $f\sigma_8(z)$ at different redshifts probed by
the SDSS-IV eBOSS survey. Furthermore, assuming the $\Lambda$CDM model, we
obtain constraints on model parameters $\Omega_{m}$, $H_{0}r_{d}$, $\sigma_{8}$
and $S_{8}$ at each redshift probed by SDSS-IV eBOSS. This indicates
redshift-dependent trends in $H_0$, $\Omega_m$, $\sigma_8$ and $S_8$ in the
$\Lambda$CDM model, suggesting a possible inconsistency in the $\Lambda$CDM
model. Ours is a template for model-independent extraction of information for
both background and perturbed Universe using a single galaxy survey taking into
account all the existing correlations between background and perturbed
observables and this can be easily extended to future DESI-3YR as well as
Euclid results.
[COMMENTS]
13 pages, 7 sets of figures, 3 tables. Comments are welcome
[LINK]
http://arxiv.org/abs/2412.13973v1
[DATE]
2024-12-18 23:50:50+08:00
[CATEGORIES]
cs.LG
Optimally Solving Simultaneous-Move Dec-POMDPs: The Sequential Central Planning Approach
[AUTHORS]
Johan Peralez, Aurèlien Delage, Jacopo Castellini, Rafael F. Cunha, Jilles S. Dibangoye
[ABSTRACT]
The centralized training for decentralized execution paradigm emerged as the
state-of-the-art approach to $\epsilon$-optimally solving decentralized
partially observable Markov decision processes. However, scalability remains a
significant issue. This paper presents a novel and more scalable alternative,
namely the sequential-move centralized training for decentralized execution.
This paradigm further pushes the applicability of Bellman’s principle of
optimality, raising three new properties. First, it allows a central planner to
reason upon sufficient sequential-move statistics instead of prior
simultaneous-move ones. Next, it proves that $\epsilon$-optimal value functions
are piecewise linear and convex in such sufficient sequential-move statistics.
Finally, it drops the complexity of the backup operators from double
exponential to polynomial at the expense of longer planning horizons. Besides,
it makes it easy to use single-agent methods, e.g., the SARSA algorithm enhanced
with these findings, while still preserving convergence guarantees. Experiments
on two- as well as many-agent domains from the literature against
$\epsilon$-optimal simultaneous-move solvers confirm the superiority of our
novel approach. This paradigm opens the door for efficient planning and
reinforcement learning methods for multi-agent systems.
[LINK]
http://arxiv.org/abs/2408.13139v2
[DATE]
2024-12-18 23:46:10+08:00
[CATEGORIES]
cs.LG
Comparative Analysis of Machine Learning-Based Imputation Techniques for Air Quality Datasets with High Missing Data Rates
[AUTHORS]
Sen Yan, David J. O’Connor, Xiaojun Wang, Noel E. O’Connor, Alan F. Smeaton, Mingming Liu
[ABSTRACT]
Urban pollution poses serious health risks, particularly in relation to
traffic-related air pollution, which remains a major concern in many cities.
Vehicle emissions contribute to respiratory and cardiovascular issues,
especially for vulnerable and exposed road users like pedestrians and cyclists.
Therefore, accurate air quality monitoring with high spatial resolution is
vital for good urban environmental management. This study aims to provide
insights for processing spatiotemporal datasets with high missing data rates.
In this study, the challenge of high missing data rates is a result of the
limited data available and the fine granularity required for precise
classification of PM2.5 levels. The data used for analysis and imputation were
collected from both mobile sensors and fixed stations by Dynamic Parcel
Distribution, the Environmental Protection Agency, and Google in Dublin,
Ireland, where the missing data rate was approximately 82.42%, making accurate
Particulate Matter 2.5 level predictions particularly difficult. Various
imputation and prediction approaches were evaluated and compared, including
ensemble methods, deep learning models, and diffusion models. External features
such as traffic flow, weather conditions, and data from the nearest stations
were incorporated to enhance model performance. The results indicate that
diffusion methods with external features achieved the highest F1 score,
reaching 0.9486 (Accuracy: 94.26%, Precision: 94.42%, Recall: 94.82%), with
ensemble models achieving the highest accuracy of 94.82%, illustrating that
good performance can be obtained despite a high missing data rate.
[COMMENTS]
Accepted by IEEE CIETES 2025, with 8 pages, 3 figures, and 2 tables
[LINK]
http://arxiv.org/abs/2412.13966v1
[DATE]
2024-12-18 23:45:08+08:00
[CATEGORIES]
cs.LG
Harvesting energy from turbulent winds with Reinforcement Learning
[AUTHORS]
Lorenzo Basile, Maria Grazia Berni, Antonio Celani
[ABSTRACT]
Airborne Wind Energy (AWE) is an emerging technology designed to harness the
power of high-altitude winds, offering a solution to several limitations of
conventional wind turbines. AWE is based on flying devices (usually gliders or
kites) that, tethered to a ground station and driven by the wind, convert its
mechanical energy into electrical energy by means of a generator. Such systems
are usually controlled by manoeuvring the kite so as to follow a predefined
path prescribed by optimal control techniques, such as model-predictive
control. These methods are strongly dependent on the specific model in use and
difficult to generalize, especially in unpredictable conditions such as the
turbulent atmospheric boundary layer. Our aim is to explore the possibility of
replacing these techniques with an approach based on Reinforcement Learning
(RL). Unlike traditional methods, RL does not require a predefined model,
making it robust to variability and uncertainty. Our experimental results in
complex simulated environments demonstrate that AWE agents trained with RL can
effectively extract energy from turbulent flows, relying on minimal local
information about the kite orientation and speed relative to the wind.
[LINK]
http://arxiv.org/abs/2412.13961v1
[DATE]
2024-12-18 23:40:40+08:00
[CATEGORIES]
cs.LG
Towards Precise Prediction Uncertainty in GNNs: Refining GNNs with Topology-grouping Strategy
[AUTHORS]
Hyunjin Seo, Kyusung Seo, Joonhyung Park, Eunho Yang
[ABSTRACT]
Recent advancements in graph neural networks (GNNs) have highlighted the
critical need for calibrating model predictions, with neighborhood prediction
similarity recognized as a pivotal component. Existing studies suggest that
nodes with analogous neighborhood prediction similarity often exhibit similar
calibration characteristics. Building on this insight, recent approaches
incorporate neighborhood similarity into node-wise temperature scaling
techniques. However, our analysis reveals that this assumption does not hold
universally. Calibration errors can differ significantly even among nodes with
comparable neighborhood similarity, depending on their confidence levels. This
necessitates a re-evaluation of existing GNN calibration methods, as a single,
unified approach may lead to sub-optimal calibration. In response, we introduce
Simi-Mailbox, a novel approach that categorizes nodes by both neighborhood
similarity and their own confidence, irrespective of proximity or connectivity.
Our method allows fine-grained calibration by employing group-specific
temperature scaling, with each temperature tailored to address the specific
miscalibration level of affiliated nodes, rather than adhering to a uniform
trend based on neighborhood similarity. Extensive experiments demonstrate the
effectiveness of our Simi-Mailbox across diverse datasets on different GNN
architectures, achieving up to 13.79% error reduction compared to uncalibrated
GNN predictions.
[COMMENTS]
Accepted at AAAI 2025
[LINK]
http://arxiv.org/abs/2412.14223v1
[DATE]
2024-12-18 23:39:57+08:00
[CATEGORIES]
cs.LG
Self-attentive Transformer for Fast and Accurate Postprocessing of Temperature and Wind Speed Forecasts
[AUTHORS]
Aaron Van Poecke, Tobias Sebastian Finn, Ruoke Meng, Joris Van den Bergh, Geert Smet, Jonathan Demaeyer, Piet Termonia, Hossein Tabari, Peter Hellinckx
[ABSTRACT]
Current postprocessing techniques often require separate models for each lead
time and disregard possible inter-ensemble relationships by either correcting
each member separately or by employing distributional approaches. In this work,
we tackle these shortcomings with an innovative, fast and accurate Transformer
which postprocesses each ensemble member individually while allowing
information exchange across variables, spatial dimensions and lead times by
means of multi-headed self-attention. Weather forecasts are postprocessed over
20 lead times simultaneously while including up to twelve meteorological
predictors. We use the EUPPBench dataset for training which contains ensemble
predictions from the European Center for Medium-range Weather Forecasts’
integrated forecasting system alongside corresponding observations. The work
presented here is the first to postprocess the ten- and one hundred-meter wind
speed forecasts within this benchmark dataset, while also correcting the
two-meter temperature. Our approach significantly improves the original
forecasts, as measured by the CRPS, by 17.5% for two-meter temperature,
nearly 5% for ten-meter wind speed and 5.3% for one hundred-meter wind speed,
outperforming a classical member-by-member approach employed as a competitive
benchmark. Furthermore, being up to 75 times faster, it fulfills the demand for
rapid operational weather forecasts in various downstream applications,
including renewable energy forecasting.
[COMMENTS]
21 pages, 7 figures, submitted to Artificial Intelligence for the
Earth Systems (AIES)
[LINK]
http://arxiv.org/abs/2412.13957v1
[DATE]
2024-12-18 23:37:09+08:00
[CATEGORIES]
cs.LG
Spatio-Temporal Forecasting of PM2.5 via Spatial-Diffusion guided Encoder-Decoder Architecture
[AUTHORS]
Malay Pandey, Vaishali Jain, Nimit Godhani, Sachchida Nand Tripathi, Piyush Rai
[ABSTRACT]
In many problem settings that require spatio-temporal forecasting, the values
in the time-series not only exhibit spatio-temporal correlations but are also
influenced by spatial diffusion across locations. One such example is
forecasting the concentration of fine particulate matter (PM2.5) in the
atmosphere which is influenced by many complex factors, the most important ones
being diffusion due to meteorological factors as well as transport across vast
distances over a period of time. We present a novel Spatio-Temporal Graph
Neural Network architecture that specifically captures these dependencies to
forecast the PM2.5 concentration. Our model is based on an encoder-decoder
architecture where the encoder and decoder parts leverage gated recurrent units
(GRU) augmented with a graph neural network (TransformerConv) to account for
spatial diffusion. Our model can also be seen as a generalization of various
existing models for time-series or spatio-temporal forecasting. We demonstrate
the model’s effectiveness on two real-world PM2.5 datasets: (1) data collected
by us using a recently deployed network of low-cost PM$_{2.5}$ sensors from 511
locations spanning the entirety of the Indian state of Bihar over a period of
one year, and (2) another publicly available dataset that covers severely
polluted regions from China for a period of 4 years. Our experimental results
show our model’s impressive ability to account for both spatial as well as
temporal dependencies precisely.
[COMMENTS]
9 pages, 4 figures, International Conference on Data Science and
Management of Data (CODS-COMAD), IIT Jodhpur, 2024
[LINK]
http://arxiv.org/abs/2412.13935v1
[DATE]
2024-12-18 23:18:12+08:00
[CATEGORIES]
cs.LG
Investigating the Effects of Diffusion-based Conditional Generative Speech Models Used for Speech Enhancement on Dysarthric Speech
[AUTHORS]
Joanna Reszka, Parvaneh Janbakhshi, Tilak Purohit, Sadegh Mohammadi
[ABSTRACT]
In this study, we aim to explore the effect of pre-trained conditional
generative speech models for the first time on dysarthric speech due to
Parkinson’s disease recorded in an ideal/non-noisy condition. Considering one
category of generative models, i.e., diffusion-based speech enhancement, these
models are previously trained to learn the distribution of clean (i.e., recorded
in a noise-free environment) typical speech signals. Therefore, we hypothesized
that when being exposed to dysarthric speech they might remove the unseen
atypical paralinguistic cues during the enhancement process. By considering the
automatic dysarthric speech detection task, in this study, we experimentally
show that during the enhancement process of dysarthric speech data recorded in
an ideal non-noisy environment, some of the acoustic dysarthric speech cues are
lost. Therefore, such pre-trained models are not yet suitable in the context of
dysarthric speech enhancement since they manipulate the pathological speech
cues when they process clean dysarthric speech. Furthermore, we show that the
acoustic cues removed by the enhancement models, in the form of a residual
speech signal, can provide complementary dysarthric cues when fused with the
original input speech signal in the feature space.
[COMMENTS]
Accepted at ICASSP 2025 Satellite Workshop: Workshop on Speech
Pathology Analysis and DEtection (SPADE)
[LINK]
http://arxiv.org/abs/2412.13933v1
[DATE]
2024-12-18 23:18:05+08:00
[CATEGORIES]
cs.LG
Preconditioned Subspace Langevin Monte Carlo
[AUTHORS]
Tyler Maunu, Jiayi Yao
[ABSTRACT]
We develop a new efficient method for high-dimensional sampling called
Subspace Langevin Monte Carlo. The primary application of these methods is to
efficiently implement Preconditioned Langevin Monte Carlo. To demonstrate the
usefulness of this new method, we extend ideas from subspace descent methods in
Euclidean space to solving a specific optimization problem over Wasserstein
space. Our theoretical analysis demonstrates the advantageous convergence
regimes of the proposed method, which depend on relative conditioning
assumptions common to mirror descent methods. We back up our theory with
experimental evidence on sampling from an ill-conditioned Gaussian
distribution.
[COMMENTS]
19 pages, 2 figures, 1 table
[LINK]
http://arxiv.org/abs/2412.13928v1
[DATE]
2024-12-18 23:12:41+08:00
[CATEGORIES]
cs.LG
Speech Watermarking with Discrete Intermediate Representations
[AUTHORS]
Shengpeng Ji, Ziyue Jiang, Jialong Zuo, Minghui Fang, Yifu Chen, Tao Jin, Zhou Zhao
[ABSTRACT]
Speech watermarking techniques can proactively mitigate the potential harmful
consequences of instant voice cloning techniques. These techniques involve the
insertion of signals into speech that are imperceptible to humans but can be
detected by algorithms. Previous approaches typically embed watermark messages
into continuous space. However, intuitively, embedding watermark information
into robust discrete latent space can significantly improve the robustness of
watermarking systems. In this paper, we propose DiscreteWM, a novel speech
watermarking framework that injects watermarks into the discrete intermediate
representations of speech. Specifically, we map speech into discrete latent
space with a vector-quantized autoencoder and inject watermarks by changing the
modular arithmetic relation of discrete IDs. To ensure the imperceptibility of
watermarks, we also propose a manipulator model to select the candidate tokens
for watermark embedding. Experimental results demonstrate that our framework
achieves state-of-the-art performance in robustness and imperceptibility,
simultaneously. Moreover, our flexible frame-wise approach can serve as an
efficient solution for both voice cloning detection and information hiding.
Additionally, DiscreteWM can encode 1 to 150 bits of watermark information
within a 1-second speech clip, indicating its encoding capacity. Audio samples
are available at https://DiscreteWM.github.io/discrete_wm.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13917v1
[DATE]
2024-12-18 22:57:06+08:00
[CATEGORIES]
cs.LG
Automated Inference of Graph Transformation Rules
[AUTHORS]
Jakob L. Andersen, Akbar Davoodi, Rolf Fagerberg, Christoph Flamm, Walter Fontana, Juri Kolčák, Christophe V. F. P. Laurent, Daniel Merkle, Nikolai Nøjgaard
[ABSTRACT]
The explosion of data available in life sciences is fueling an increasing
demand for expressive models and computational methods. Graph transformation is
a model for dynamic systems with a large variety of applications. We introduce
a novel method for graph transformation model construction, combining
generative and dynamical viewpoints to give a fully automated data-driven model
inference method.
The method takes the input dynamical properties, given as a “snapshot” of the
dynamics encoded by explicit transitions, and constructs a compatible model.
The obtained model is guaranteed to be minimal, thus framing the approach as
model compression (from a set of transitions into a set of rules). The
compression is permissive to a lossy case, where the constructed model is
allowed to exhibit behavior outside of the input transitions, thus suggesting a
completion of the input dynamics.
The task of graph transformation model inference is naturally highly
challenging due to the combinatorics involved. We tackle the exponential
explosion by proposing a heuristically minimal translation of the task into a
well-established problem, set cover, for which highly optimized solutions
exist. We further showcase how our results relate to Kolmogorov complexity
expressed in terms of graph transformation.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2404.02692v2
[DATE]
2024-12-18 22:47:17+08:00
[CATEGORIES]
cs.LG
Flexible and efficient emulation of spatial extremes processes via variational autoencoders
[AUTHORS]
Likun Zhang, Xiaoyu Ma, Christopher K. Wikle, Raphaël Huser
[ABSTRACT]
Many real-world processes have complex tail dependence structures that cannot
be characterized using classical Gaussian processes. More flexible spatial
extremes models exhibit appealing extremal dependence properties but are often
prohibitively expensive to fit and simulate from in high dimensions. In this
paper, we aim to push the boundaries on computation and modeling of
high-dimensional spatial extremes via integrating a new spatial extremes model
that has flexible and non-stationary dependence properties in the
encoding-decoding structure of a variational autoencoder called the XVAE. The
XVAE can emulate spatial observations and produce outputs that have the same
statistical properties as the inputs, especially in the tail. Our approach also
provides a novel way of making fast inference with complex extreme-value
processes. Through extensive simulation studies, we show that our XVAE is
substantially more time-efficient than traditional Bayesian inference while
outperforming many spatial extremes models with a stationary dependence
structure. Lastly, we analyze a high-resolution satellite-derived dataset of
sea surface temperature in the Red Sea, which includes 30 years of daily
measurements at 16703 grid cells. We demonstrate how to use XVAE to identify
regions susceptible to marine heatwaves under climate change and examine the
spatial and temporal variability of the extremal dependence structure.
[COMMENTS]
30 pages, 8 figures
[LINK]
http://arxiv.org/abs/2307.08079v4
[DATE]
2024-12-18 22:46:23+08:00
[CATEGORIES]
cs.LG
Data-Efficient Inference of Neural Fluid Fields via SciML Foundation Model
[AUTHORS]
Yuqiu Liu, Jingxuan Xu, Mauricio Soroco, Yunchao Wei, Wuyang Chen
[ABSTRACT]
Recent developments in 3D vision have enabled successful progress in
inferring neural fluid fields and realistic rendering of fluid dynamics.
However, these methods require real-world flow captures, which demand dense
video sequences and specialized lab setups, making the process costly and
challenging. Scientific machine learning (SciML) foundation models, which are
pretrained on extensive simulations of partial differential equations (PDEs),
encode rich multiphysics knowledge and thus provide promising sources of domain
priors for inferring fluid fields. Nevertheless, their potential to advance
real-world vision problems remains largely underexplored, raising questions
about the transferability and practical utility of these foundation models. In
this work, we demonstrate that SciML foundation models can significantly improve
the data efficiency of inferring real-world 3D fluid dynamics with improved
generalization. At the core of our method is leveraging the strong forecasting
capabilities and meaningful representations of SciML foundation models. We
equip neural fluid fields with a novel collaborative training approach that
utilizes augmented views and fluid features extracted by our foundation model.
Our method demonstrates significant improvements in both quantitative metrics
and visual quality, showcasing the practical applicability of SciML foundation
models in real-world fluid dynamics.
[LINK]
http://arxiv.org/abs/2412.13897v1
[DATE]
2024-12-18 22:39:43+08:00
[CATEGORIES]
cs.LG
A Survey on Integrated Sensing, Communication, and Computation
[AUTHORS]
Dingzhu Wen, Yong Zhou, Xiaoyang Li, Yuanming Shi, Kaibin Huang, Khaled B. Letaief
[ABSTRACT]
The forthcoming generation of wireless technology, 6G, aims to usher in an
era of ubiquitous intelligent services, where everything is interconnected and
intelligent. This vision requires the seamless integration of three fundamental
modules: sensing for information acquisition, communication for information
sharing, and computation for information processing and decision-making. These
modules are intricately linked, especially in complex tasks such as edge
learning and inference. However, the performance of these modules is
interdependent, creating a resource competition for time, energy, and
bandwidth. Existing techniques like integrated communication and computation
(ICC), integrated sensing and computation (ISC), and integrated sensing and
communication (ISAC) have made partial strides in addressing this challenge,
but they fall short of meeting the extreme performance requirements. To
overcome these limitations, it is essential to develop new techniques that
comprehensively integrate sensing, communication, and computation. This
integrated approach, known as Integrated Sensing, Communication, and
Computation (ISCC), offers a systematic perspective for enhancing task
performance. This paper begins with a comprehensive survey of historic and
related techniques such as ICC, ISC, and ISAC, highlighting their strengths and
limitations. It then discusses the benefits, functions, and challenges of ISCC.
Subsequently, the state-of-the-art signal designs for ISCC, along with network
resource management strategies specifically tailored for ISCC are explored.
Furthermore, this paper discusses the exciting research opportunities that lie
ahead for implementing ISCC in future advanced networks, and the unresolved
issues requiring further investigation. ISCC is expected to unlock the full
potential of intelligent connectivity, paving the way for groundbreaking
applications and services.
[COMMENTS]
This version is accepted by IEEE Communications Surveys & Tutorials
on Dec. 18, 2024
[LINK]
http://arxiv.org/abs/2408.08074v3
[DATE]
2024-12-18 22:38:31+08:00
[CATEGORIES]
cs.LG
Neural Temporal Point Processes for Forecasting Directional Relations in Evolving Hypergraphs
[AUTHORS]
Tony Gracious, Arman Gupta, Ambedkar Dukkipati
[ABSTRACT]
Forecasting relations between entities is paramount in the current era of
data and AI. However, it is often overlooked that real-world relationships are
inherently directional, involve more than two entities, and can change with
time. In this paper, we provide a comprehensive solution to the problem of
forecasting directional relations in a general setting, where relations are
higher-order, i.e., directed hyperedges in a hypergraph. This problem has not
been previously explored in the existing literature. The primary challenge in
solving this problem is that the number of possible hyperedges is exponential
in the number of nodes at each event time. To overcome this, we propose a
sequential generative approach that segments the forecasting process into
multiple stages, each contingent upon the preceding stages, thereby reducing
the search space involved in predictions of hyperedges. The first stage
involves a temporal point process-based node event forecasting module that
identifies the subset of nodes involved in an event. The second stage is a
candidate generation module that predicts hyperedge sizes and adjacency vectors
for nodes observing events. The final stage is a directed hyperedge predictor
that identifies the truth by searching over the set of candidate hyperedges. To
validate the effectiveness of our model, we compiled five datasets and
conducted an extensive empirical study to assess each downstream task. Our
proposed method achieves a performance gain of 32% and 41% compared to the
state-of-the-art pairwise and hyperedge event forecasting models, respectively,
for the event type prediction.
[COMMENTS]
AAAI-2025
[LINK]
http://arxiv.org/abs/2301.12210v3
[DATE]
2024-12-18 22:36:47+08:00
[CATEGORIES]
cs.LG
Graph-Driven Models for Gas Mixture Identification and Concentration Estimation on Heterogeneous Sensor Array Signals
[AUTHORS]
Ding Wang, Lei Wang, Huilin Yin, Guoqing Gu, Zhiping Lin, Wenwen Zhang
[ABSTRACT]
Accurately identifying gas mixtures and estimating their concentrations are
crucial across various industrial applications using gas sensor arrays.
However, existing models face challenges in generalizing across heterogeneous
datasets, which limits their scalability and practical applicability. To
address this problem, this study develops two novel deep-learning models that
integrate temporal graph structures for enhanced performance: a Graph-Enhanced
Capsule Network (GraphCapsNet) employing dynamic routing for gas mixture
classification and a Graph-Enhanced Attention Network (GraphANet) leveraging
self-attention for concentration estimation. Both models were validated on
datasets from the University of California, Irvine (UCI) Machine Learning
Repository and a custom dataset, demonstrating superior performance in gas
mixture identification and concentration estimation compared to recent models.
In classification tasks, GraphCapsNet achieved over 98.00% accuracy across
multiple datasets, while in concentration estimation, GraphANet attained an $R^2$
score exceeding 0.96 across various gas components. Both GraphCapsNet and
GraphANet exhibited significantly higher accuracy and stability, positioning
them as promising solutions for scalable gas analysis in industrial settings.
[LINK]
http://arxiv.org/abs/2412.13891v1
[DATE]
2024-12-18 22:32:30+08:00
[CATEGORIES]
cs.LG
A Survey on Inference Optimization Techniques for Mixture of Experts Models
[AUTHORS]
Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li
[ABSTRACT]
The emergence of large-scale Mixture of Experts (MoE) models has marked a
significant advancement in artificial intelligence, offering enhanced model
capacity and computational efficiency through conditional computation. However,
the deployment and inference of these models present substantial challenges in
terms of computational resources, latency, and energy efficiency. This
comprehensive survey systematically analyzes the current landscape of inference
optimization techniques for MoE models across the entire system stack. We first
establish a taxonomical framework that categorizes optimization approaches into
model-level, system-level, and hardware-level optimizations. At the model
level, we examine architectural innovations including efficient expert design,
attention mechanisms, various compression techniques such as pruning,
quantization, and knowledge distillation, as well as algorithmic improvements
including dynamic routing strategies and expert merging methods. At the system
level, we investigate distributed computing approaches, load balancing
mechanisms, and efficient scheduling algorithms that enable scalable
deployment. Furthermore, we delve into hardware-specific optimizations and
co-design strategies that maximize throughput and energy efficiency. This
survey not only provides a structured overview of existing solutions but also
identifies key challenges and promising research directions in MoE inference
optimization. Our comprehensive analysis serves as a valuable resource for
researchers and practitioners working on large-scale deployment of MoE models
in resource-constrained environments. To facilitate ongoing updates and the
sharing of cutting-edge advances in MoE inference optimization research, we
have established a repository accessible at
https://github.com/MoE-Inf/awesome-moe-inference/.
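As a rough illustration of the conditional computation mentioned above, the following minimal sketch shows top-k gating, one common routing scheme for MoE layers. The function names, the toy scalar experts, and the gate logits are all hypothetical and are not taken from the survey.

```python
import math

def top_k_route(logits, k=2):
    """Pick the k experts with the largest gate logits and softmax-normalise their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, experts, gate_logits, k=2):
    """Evaluate only the selected experts and mix their outputs by the gate weights."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical toy experts acting on a scalar input.
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x + 1]
routes = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
output = moe_forward(3.0, experts, [0.1, 2.0, -1.0, 2.0], k=2)
```

Only the k selected experts are evaluated, which is where the inference savings of sparse MoE models come from.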
[COMMENTS]
Work in Progress
[LINK]
http://arxiv.org/abs/2412.14219v1
[DATE]
2024-12-18 22:11:15+08:00
[CATEGORIES]
cs.LG
Exploring Gradient Subspaces: Addressing and Overcoming LoRA’s Limitations in Federated Fine-Tuning of Large Language Models
[AUTHORS]
Navyansh Mahla, Ganesh Ramakrishnan
[ABSTRACT]
Large Language Models (LLMs) have demonstrated remarkable capabilities across
various domains, particularly in task generalization for both text and vision
data. While fine-tuning these models can significantly enhance their
performance on specific downstream tasks, it often requires high-quality data
that cannot be shared due to privacy concerns. Federated Learning (FL) offers a
promising solution for collaborative training without direct data sharing.
However, many parameter-efficient fine-tuning strategies for LLMs in FL,
particularly those based on Low-Rank Adaptation (LoRA), face limitations. In
this paper, we critically analyze the convergence and performance guarantees of
popular FL frameworks utilizing LoRA, highlighting its suboptimal nature due to
constrained subspace learning of low-rank matrices. This limitation hinders
effective fine-tuning of LLMs in federated settings. Through rigorous
analytical and empirical evaluations, we demonstrate that direct weight
averaging outperforms LoRA-based strategies, leading to superior performance
for fine-tuned models. Our comprehensive comparison unmasks inefficiencies in
LoRA approaches and underscores the advantages of direct weight aggregation. We
extend our analysis to low-rank gradient-based optimizers, such as GaLore, used
during local training steps. Our findings show that GaLore along with
direct-weight aggregation is a more effective approach, outperforming federated
LoRA methods like FlexLoRA and FFA-LoRA across both text and image modalities.
While privacy remains paramount in FL discourse, our focus is on assessing
performance outcomes of federated fine-tuned models and evaluating various FL
frameworks from both theoretical and empirical perspectives. Our findings
advocate reassessing the reliance on LoRA within FL contexts, paving the way
for more efficient training methodologies.
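One commonly discussed aggregation issue with federated LoRA can be sketched as follows: averaging the low-rank factors separately is not the same as averaging the resulting weight updates. This is a toy illustration with two hypothetical clients and rank-1 factors; it is not claimed to be the paper's analysis.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mean(mats):
    """Element-wise mean of a list of equally shaped matrices."""
    n = len(mats)
    return [[sum(m[i][j] for m in mats) / n for j in range(len(mats[0][0]))]
            for i in range(len(mats[0]))]

# Hypothetical rank-1 LoRA factors for two clients: B is 2x1, A is 1x2.
B1, A1 = [[1.0], [0.0]], [[1.0, 0.0]]
B2, A2 = [[0.0], [1.0]], [[0.0, 1.0]]

# Direct aggregation of the full updates B_i A_i.
avg_of_products = mean([matmul(B1, A1), matmul(B2, A2)])
# Aggregating the factors separately, then multiplying.
product_of_avgs = matmul(mean([B1, B2]), mean([A1, A2]))
```

Here the averaged update has rank 2, so no pair of rank-1 factors can reproduce it, whereas the factor-wise average yields a different rank-1 matrix.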
[LINK]
http://arxiv.org/abs/2410.23111v3
[DATE]
2024-12-18 22:10:10+08:00
[CATEGORIES]
cs.LG
SHAP scores fail pervasively even when Lipschitz succeeds
[AUTHORS]
Olivier Letoffe, Xuanxiang Huang, Joao Marques-Silva
[ABSTRACT]
The ubiquitous use of Shapley values in eXplainable AI (XAI) has been
triggered by the tool SHAP; as a result, Shapley values are commonly referred
to as SHAP scores. Recent work devised examples of machine learning (ML)
classifiers for which the computed SHAP scores are thoroughly unsatisfactory,
in that they can mislead human decision-makers. Nevertheless, such examples
could be perceived as somewhat artificial, since the selected classes must be
interpreted as numeric. Furthermore, it was unclear how general the issues
identified with SHAP scores were. This paper answers these criticisms. First,
the paper shows that for
Boolean classifiers there are arbitrarily many examples for which the SHAP
scores must be deemed unsatisfactory. Second, the paper shows that the issues
with SHAP scores are also observed in the case of regression models. In
addition, the paper studies the class of regression models that respect
Lipschitz continuity, a measure of a function’s rate of change that finds
important recent uses in ML, including model robustness. Concretely, the paper
shows that the issues with SHAP scores occur even for regression models that
respect Lipschitz continuity. Finally, the paper shows that the same issues are
guaranteed to exist for arbitrarily differentiable regression models.
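For readers unfamiliar with the quantity under discussion, the sketch below computes exact SHAP scores for a small Boolean classifier by brute force. It assumes a uniform input distribution and the usual conditional-expectation value function; the example classifier and instance are hypothetical and are not among the paper's counterexamples.

```python
from itertools import combinations, product
from math import factorial

def value(f, instance, fixed, n):
    """Expected f over uniform Boolean inputs, with features in `fixed` pinned to the instance."""
    free = [i for i in range(n) if i not in fixed]
    total = 0
    for bits in product([0, 1], repeat=len(free)):
        x = list(instance)
        for i, b in zip(free, bits):
            x[i] = b
        total += f(x)
    return total / 2 ** len(free)

def shap_scores(f, instance):
    """Exact Shapley values of each feature, enumerating all coalitions."""
    n = len(instance)
    scores = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += w * (value(f, instance, s | {i}, n) - value(f, instance, s, n))
        scores.append(phi)
    return scores

# Hypothetical classifier: x0 AND (x1 OR x2).
f = lambda x: int(x[0] == 1 and (x[1] == 1 or x[2] == 1))
instance = (1, 1, 1)
scores = shap_scores(f, instance)
expected_f = value(f, instance, set(), 3)
```

By the efficiency property, the scores sum to the prediction minus the expected prediction. The enumeration is exponential in the number of features, so this is only practical for tiny examples.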
[COMMENTS]
arXiv admin note: text overlap with arXiv:2405.00076
[LINK]
http://arxiv.org/abs/2412.13866v1
[DATE]
2024-12-18 22:02:15+08:00
[CATEGORIES]
cs.LG
Constructing sensible baselines for Integrated Gradients
[AUTHORS]
Jai Bardhan, Cyrin Neeraj, Mihir Rawat, Subhadip Mitra
[ABSTRACT]
Machine learning methods have seen a meteoric rise in their applications in
the scientific community. However, little effort has been put into
understanding these “black box” models. Using a case study in particle
physics, we show how one can apply integrated gradients (IGs) to understand
these models by designing different baselines. We find that the zero-vector
baseline does not provide good feature attributions and that an averaged
baseline sampled from the background events provides consistently more
reasonable attributions.
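The role of the baseline can be seen in a minimal sketch of integrated gradients on a toy function. The function `F`, its gradient, the input, and the "background" samples below are all hypothetical stand-ins; the paper's models and data are not reproduced here.

```python
def integrated_gradients(grad_f, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of IG along the straight path baseline -> x."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]

# Hypothetical toy model and its analytic gradient.
F = lambda v: v[0] ** 2 + 3.0 * v[1]
grad_F = lambda v: [2.0 * v[0], 3.0]

x = [2.0, 1.0]
zero_baseline = [0.0, 0.0]
# Averaged baseline: the mean of some hypothetical background samples.
background = [[1.0, 0.0], [1.0, 2.0]]
mean_baseline = [sum(col) / len(background) for col in zip(*background)]

ig_zero = integrated_gradients(grad_F, x, zero_baseline)
ig_mean = integrated_gradients(grad_F, x, mean_baseline)
```

In both cases the attributions sum to the difference between the model output at the input and at the baseline (the completeness property), so the choice of baseline directly changes what the attributions are measured against.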
[COMMENTS]
7 pages, 5 figures. Accepted to 4th Annual AAAI Workshop on AI to
Accelerate Science and Engineering (AI2ASE)
[LINK]
http://arxiv.org/abs/2412.13864v1
[DATE]
2024-12-18 22:01:15+08:00
[CATEGORIES]
cs.LG
IDEQ: an improved diffusion model for the TSP
[AUTHORS]
Mickael Basson, Philippe Preux
[ABSTRACT]
We investigate diffusion models to solve the Traveling Salesman Problem.
Building on the recent DIFUSCO and T2TCO approaches, we propose IDEQ. IDEQ
improves the quality of the solutions by leveraging the constrained structure
of the state space of the TSP. Another key component of IDEQ consists in
replacing the last stages of DIFUSCO curriculum learning by considering a
uniform distribution over the Hamiltonian tours whose orbits by the 2-opt
operator converge to the optimal solution as the training objective. Our
experiments show that IDEQ improves the state of the art for such neural
network based techniques on synthetic instances. More importantly, our
experiments show that IDEQ performs very well on the instances of the TSPlib, a
reference benchmark in the TSP community: it closely matches the performance of
the best heuristic, LKH3, and even obtains better solutions than LKH3
on 2 instances of the TSPlib defined on 1577 and 3795 cities. IDEQ obtains 0.3%
optimality gap on TSP instances made of 500 cities, and 0.5% on TSP instances
with 1000 cities. This sets a new SOTA for neural-based methods solving the
TSP. Moreover, IDEQ exhibits lower variance and scales better with the
number of cities than DIFUSCO and T2TCO.
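The 2-opt operator referred to above can be sketched as a plain local search: repeatedly reverse a segment of the tour whenever doing so shortens it. This is a generic textbook version on hypothetical points, not IDEQ's training procedure.

```python
import math

def tour_length(tour, pts):
    """Length of the closed tour visiting pts in the given order."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, pts):
    """Apply improving 2-opt moves (segment reversals) until none remains."""
    tour = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, pts) < tour_length(tour, pts) - 1e-12:
                    tour = candidate
                    improved = True
    return tour

# Hypothetical instance: the corners of a unit square, visited in a crossing order.
pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
start = [0, 2, 1, 3]
result = two_opt(start, pts)
```

Recomputing the full tour length for every candidate keeps the sketch short; practical implementations evaluate only the two edges that change.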
[LINK]
http://arxiv.org/abs/2412.13858v1
[DATE]
2024-12-18 21:52:50+08:00
[CATEGORIES]
cs.LG
Diagnosising Helicobacter pylori using AutoEncoders and Limited Annotations through Anomalous Staining Patterns in IHC Whole Slide Images
[AUTHORS]
Pau Cano, Eva Musulen, Debora Gil
[ABSTRACT]
Purpose: This work addresses the detection of Helicobacter pylori (H. pylori)
in histological images with immunohistochemical staining. This analysis is a
time-consuming task, currently done by an expert pathologist who visually
inspects the samples. Given the effort required to localise the pathogen in
images, a limited number of annotations might be available in an initial
setting. Our goal is to design an approach that, using a limited set of
annotations, is capable of obtaining results good enough to be used as a
support tool. Methods: We propose to use autoencoders to learn the latent
patterns of healthy patches and formulate a specific measure of the
reconstruction error of the image in HSV space. ROC analysis is used to set the
optimal threshold of this measure and the percentage of positive patches in a
sample that determines the presence of H. pylori. Results: Our method has been
tested on our own database of 245 Whole Slide Images (WSI), comprising 117 cases
without H. pylori and different densities of the bacteria in the remaining ones.
The database has 1211 annotated patches, with only 163 positive patches. This
dataset of positive annotations was used to train a baseline thresholding and
an SVM using the features of pre-trained ResNet18 and ViT models. A 10-fold
cross-validation shows that our method has better performance with 91%
accuracy, 86% sensitivity, 96% specificity and 0.97 AUC in the diagnosis of H.
pylori. Conclusion: Unlike classification approaches, our shallow autoencoder
with threshold adaptation for the detection of anomalous staining is able to
achieve competitive results with a limited set of annotated data. This initial
approach is good enough to be used as a guide for fast annotation of infected
patches.
[LINK]
http://arxiv.org/abs/2412.13857v1
[DATE]
2024-12-18 21:52:42+08:00
[CATEGORIES]
cs.LG
RadField3D: A Data Generator and Data Format for Deep Learning in Radiation-Protection Dosimetry for Medical Applications
[AUTHORS]
Felix Lehner, Pasquale Lombardo, Susana Castillo, Oliver Hupe, Marcus Magnor
[ABSTRACT]
In this research work, we present our open-source Geant4-based Monte-Carlo
simulation application, called RadField3D, for generating three-dimensional
radiation field datasets for dosimetry. Alongside it, we introduce a fast,
machine-interpretable data format with a Python API for easy integration into
neural network research, that we call RadFiled3D. Both developments are
intended to be used to research alternative radiation simulation methods using
deep learning.
[LINK]
http://arxiv.org/abs/2412.13852v1
[DATE]
2024-12-18 21:47:37+08:00
[CATEGORIES]
cs.LG
Differentiable Interacting Multiple Model Particle Filtering
[AUTHORS]
John-Joseph Brady, Yuhui Luo, Wenwu Wang, Víctor Elvira, Yunpeng Li
[ABSTRACT]
We propose a sequential Monte Carlo algorithm for parameter learning when the
studied model exhibits random discontinuous jumps in behaviour. To facilitate
the learning of high dimensional parameter sets, such as those associated with
neural networks, we adopt the emerging framework of differentiable particle
filtering, wherein parameters are trained by gradient descent. We design a new
differentiable interacting multiple model particle filter to be capable of
learning the individual behavioural regimes and the model which controls the
jumping simultaneously. In contrast to previous approaches, our algorithm
allows control of the computational effort assigned per regime whilst using the
probability of being in a given regime to guide sampling. Furthermore, we
develop a new gradient estimator that has a lower variance than established
approaches and remains fast to compute, for which we prove consistency. We
establish new theoretical results of the presented algorithms and demonstrate
superior numerical performance compared to the previous state-of-the-art
algorithms.
[LINK]
http://arxiv.org/abs/2410.00620v2
[DATE]
2024-12-18 21:43:43+08:00
[CATEGORIES]
cs.LG
A Concept-Centric Approach to Multi-Modality Learning
[AUTHORS]
Yuchong Geng, Ao Tang
[ABSTRACT]
In an effort to create a more efficient AI system, we introduce a new
multi-modality learning framework that leverages a modality-agnostic concept
space possessing abstract knowledge and a set of modality-specific projection
models tailored to process distinct modality inputs and map them onto the
concept space. Decoupled from specific modalities and their associated
projection models, the concept space focuses on learning abstract knowledge
that is universally applicable across modalities. Subsequently, the knowledge
embedded into the concept space streamlines the learning processes of
modality-specific projection models. We evaluate our framework on two popular
tasks: Image-Text Matching and Visual Question Answering. Our framework
achieves performance on par with benchmark models while demonstrating more
efficient learning curves.
[LINK]
http://arxiv.org/abs/2412.13847v1
[DATE]
2024-12-18 21:40:21+08:00
[CATEGORIES]
cs.LG
Extreme Multi-label Completion for Semantic Document Labelling with Taxonomy-Aware Parallel Learning
[AUTHORS]
Julien Audiffren, Christophe Broillet, Ljiljana Dolamic, Philippe Cudré-Mauroux
[ABSTRACT]
In Extreme Multi Label Completion (XMLCo), the objective is to predict the
missing labels of a collection of documents. Together with XML Classification,
XMLCo is arguably one of the most challenging document classification tasks, as
the number of labels (at least tens of thousands) is generally very
large compared to the number of available labelled documents in the training
dataset. Such a task is often accompanied by a taxonomy that encodes the labels'
organic relationships, and many methods have been proposed to leverage this
hierarchy to improve the results of XMLCo algorithms. In this paper, we propose
a new approach to this problem, TAMLEC (Taxonomy-Aware Multi-task Learning for
Extreme multi-label Completion). TAMLEC divides the problem into several
Taxonomy-Aware Tasks, i.e. subsets of labels adapted to the hierarchical paths
of the taxonomy, and trains on these tasks using a dynamic Parallel Feature
sharing approach, where some parts of the model are shared between tasks while
others are task-specific. Then, at inference time, TAMLEC uses the labels
available in a document to infer the appropriate tasks and to predict missing
labels. To achieve this result, TAMLEC uses a modified transformer architecture
that predicts ordered sequences of labels on a Weak-Semilattice structure that
is naturally induced by the tasks. This approach yields multiple advantages.
First, our experiments on real-world datasets show that TAMLEC outperforms
state-of-the-art methods for various XMLCo problems. Second, TAMLEC is by
construction particularly suited for few-shot XML tasks, where new tasks or
labels are introduced with only a few examples, and extensive evaluations
highlight its strong performance compared to existing methods.
[LINK]
http://arxiv.org/abs/2412.13809v1
[DATE]
2024-12-18 20:57:49+08:00
[CATEGORIES]
cs.LG
A Hitchhiker’s Guide to Understanding Performances of Two-Class Classifiers
[AUTHORS]
Anaïs Halin, Sébastien Piérard, Anthony Cioppa, Marc Van Droogenbroeck
[ABSTRACT]
Properly understanding the performances of classifiers is essential in
various scenarios. However, the literature often relies only on one or two
standard scores to compare classifiers, which fails to capture the nuances of
application-specific requirements, potentially leading to suboptimal classifier
selection. Recently, a paper on the foundations of the theory of
performance-based ranking introduced a tool, called the Tile, that organizes an
infinity of ranking scores into a 2D map. Thanks to the Tile, it is now
possible to evaluate and compare classifiers efficiently, displaying all
possible application-specific preferences instead of having to rely on a pair
of scores. In this paper, we provide a first hitchhiker’s guide for
understanding the performances of two-class classifiers by presenting four
scenarios, each showcasing a different user profile: a theoretical analyst, a
method designer, a benchmarker, and an application developer. Particularly, we
show that we can provide different interpretative flavors that are adapted to
the user’s needs by mapping different values on the Tile. As an illustration,
we leverage the newly introduced Tile tool and the different flavors to rank
and analyze the performances of 74 state-of-the-art semantic segmentation
models in two-class classification through the eyes of the four user profiles.
Through these user profiles, we demonstrate that the Tile effectively captures
the behavior of classifiers in a single visualization, while accommodating an
infinite number of ranking scores.
[LINK]
http://arxiv.org/abs/2412.04377v2
[DATE]
2024-12-18 20:55:49+08:00
[CATEGORIES]
cs.LG
The Tile: A 2D Map of Ranking Scores for Two-Class Classification
[AUTHORS]
Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, Marc Van Droogenbroeck
[ABSTRACT]
In the computer vision and machine learning communities, as well as in many
other research domains, rigorous evaluation of any new method, including
classifiers, is essential. One key component of the evaluation process is the
ability to compare and rank methods. However, ranking classifiers and
accurately comparing their performances, especially when taking
application-specific preferences into account, remains challenging. For
instance, commonly used evaluation tools like Receiver Operating Characteristic
(ROC) and Precision/Recall (PR) spaces display performances based on two
scores. Hence, they are inherently limited in their ability to compare
classifiers across a broader range of scores and lack the capability to
establish a clear ranking among classifiers. In this paper, we present a novel
versatile tool, named the Tile, that organizes an infinity of ranking scores in
a single 2D map for two-class classifiers, including common evaluation scores
such as the accuracy, the true positive rate, the positive predictive value,
Jaccard’s coefficient, and all F-beta scores. Furthermore, we study the
properties of the underlying ranking scores, such as the influence of the
priors or the correspondences with the ROC space, and depict how to
characterize any other score by comparing them to the Tile. Overall, we
demonstrate that the Tile is a powerful tool that effectively captures all the
rankings in a single visualization and allows interpreting them.
[LINK]
http://arxiv.org/abs/2412.04309v2
[DATE]
2024-12-18 20:50:29+08:00
[CATEGORIES]
cs.LG
The Certainty Ratio $C_\rho$: a novel metric for assessing the reliability of classifier predictions
[AUTHORS]
Jesus S. Aguilar-Ruiz
[ABSTRACT]
Evaluating the performance of classifiers is critical in machine learning,
particularly in high-stakes applications where the reliability of predictions
can significantly impact decision-making. Traditional performance measures,
such as accuracy and F-score, often fail to account for the uncertainty
inherent in classifier predictions, leading to potentially misleading
assessments. This paper introduces the Certainty Ratio ($C_\rho$), a novel
metric designed to quantify the contribution of confident (certain) versus
uncertain predictions to any classification performance measure. By integrating
the Probabilistic Confusion Matrix ($CM^\star$) and decomposing predictions
into certainty and uncertainty components, $C_\rho$ provides a more
comprehensive evaluation of classifier reliability. Experimental results across
21 datasets and multiple classifiers, including Decision Trees, Naive-Bayes,
3-Nearest Neighbors, and Random Forests, demonstrate that $C_\rho$ reveals
critical insights that conventional metrics often overlook. These findings
emphasize the importance of incorporating probabilistic information into
classifier evaluation, offering a robust tool for researchers and practitioners
seeking to improve model trustworthiness in complex environments.
[LINK]
http://arxiv.org/abs/2411.01973v2
[DATE]
2024-12-18 20:46:01+08:00
[CATEGORIES]
cs.LG
Foundations of the Theory of Performance-Based Ranking
[AUTHORS]
Sébastien Piérard, Anaïs Halin, Anthony Cioppa, Adrien Deliège, Marc Van Droogenbroeck
[ABSTRACT]
Ranking entities such as algorithms, devices, methods, or models based on
their performances, while accounting for application-specific preferences, is a
challenge. To address this challenge, we establish the foundations of a
universal theory for performance-based ranking. First, we introduce a rigorous
framework built on top of both the probability and order theories. Our new
framework encompasses the elements necessary to (1) manipulate performances as
mathematical objects, (2) express which performances are worse than or
equivalent to others, (3) model tasks through a variable called satisfaction,
(4) consider properties of the evaluation, (5) define scores, and (6) specify
application-specific preferences through a variable called importance. On top
of this framework, we propose the first axiomatic definition of performance
orderings and performance-based rankings. Then, we introduce a universal
parametric family of scores, called ranking scores, that can be used to
establish rankings satisfying our axioms, while considering
application-specific preferences. Finally, we show, in the case of two-class
classification, that the family of ranking scores encompasses well-known
performance scores, including the accuracy, the true positive rate (recall,
sensitivity), the true negative rate (specificity), the positive predictive
value (precision), and F1. However, we also show that some other scores
commonly used to compare classifiers are unsuitable to derive performance
orderings satisfying the axioms. Therefore, this paper provides the computer
vision and machine learning communities with a rigorous framework for
evaluating and ranking entities.
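The well-known two-class scores named above can all be computed from the four confusion-matrix counts, as in the short sketch below. It uses the standard textbook definitions only; it does not implement the paper's parametric family of ranking scores, and the counts are made up.

```python
def two_class_scores(tp, fn, fp, tn):
    """Standard two-class performance scores from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),              # recall, sensitivity
        "tnr": tn / (tn + fp),              # specificity
        "ppv": tp / (tp + fp),              # precision
        "f1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
    }

# Hypothetical counts for one classifier.
scores = two_class_scores(tp=40, fn=10, fp=20, tn=30)
```

Different scores can rank the same pair of classifiers differently, which is why the choice of score has to reflect application-specific preferences.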
[LINK]
http://arxiv.org/abs/2412.04227v2
[DATE]
2024-12-18 20:45:58+08:00
[CATEGORIES]
cs.LG
Neural Controlled Differential Equations with Quantum Hidden Evolutions
[AUTHORS]
Lingyi Yang, Zhen Shao
[ABSTRACT]
We introduce a class of neural controlled differential equations inspired by
quantum mechanics. Neural quantum controlled differential equations (NQDEs)
model the dynamics by analogy with the Schrödinger equation. Specifically,
the hidden state represents the wave function, and its collapse leads to an
interpretation of the classification probability. We implement and compare the
results of four variants of NQDEs on a toy spiral classification problem.
[COMMENTS]
Code available at: https://github.com/lingyiyang/NQDE
[LINK]
http://arxiv.org/abs/2404.19673v2
[DATE]
2024-12-18 20:42:10+08:00
[CATEGORIES]
cs.LG
Conditional Diffusion Models Based Conditional Independence Testing
[AUTHORS]
Yanfeng Yang, Shuai Li, Yingjie Zhang, Zhuoran Sun, Hai Shu, Ziqi Chen, Renming Zhang
[ABSTRACT]
Conditional independence (CI) testing is a fundamental task in modern
statistics and machine learning. The conditional randomization test (CRT) was
recently introduced to test whether two random variables, $X$ and $Y$, are
conditionally independent given a potentially high-dimensional set of random
variables, $Z$. The CRT operates exceptionally well under the assumption that
the conditional distribution $X|Z$ is known. However, since this distribution
is typically unknown in practice, accurately approximating it becomes crucial.
In this paper, we propose using conditional diffusion models (CDMs) to learn
the distribution of $X|Z$. Theoretically and empirically, it is shown that CDMs
closely approximate the true conditional distribution. Furthermore, CDMs offer
a more accurate approximation of $X|Z$ compared to GANs, potentially leading to
a CRT that performs better than those based on GANs. To accommodate complex
dependency structures, we utilize a computationally efficient classifier-based
conditional mutual information (CMI) estimator as our test statistic. The
proposed testing procedure performs effectively without requiring assumptions
about specific distribution forms or feature dependencies, and is capable of
handling mixed-type conditioning sets that include both continuous and discrete
variables. Theoretical analysis shows that our proposed test achieves a valid
control of the type I error. A series of experiments on synthetic data
demonstrates that our new test effectively controls both type-I and type-II
errors, even in high dimensional scenarios.
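The basic CRT procedure can be sketched in a few lines when the conditional distribution $X|Z$ is known. In this toy version the sampler is an exact Gaussian rather than a learned conditional diffusion model, and the test statistic is a simple residual product rather than the classifier-based CMI estimator; both substitutions, and all names and data below, are assumptions made for illustration.

```python
import random

def crt_p_value(x, y, z, sample_x_given_z, statistic, num_resamples=200, rng=None):
    """CRT p-value: compare the observed statistic with statistics on resampled X ~ X|Z."""
    rng = rng or random.Random(0)
    t_obs = statistic(x, y, z)
    exceed = 0
    for _ in range(num_resamples):
        x_tilde = [sample_x_given_z(zi, rng) for zi in z]
        if statistic(x_tilde, y, z) >= t_obs:
            exceed += 1
    return (1 + exceed) / (1 + num_resamples)

def sample_x_given_z(zi, rng):
    """Assumed known conditional: X | Z=z ~ N(z, 1)."""
    return rng.gauss(zi, 1.0)

def statistic(x, y, z):
    """Toy statistic: absolute sum of products of X and Y residuals around Z."""
    return abs(sum((xi - zi) * (yi - zi) for xi, yi, zi in zip(x, y, z)))

rng = random.Random(42)
z = [rng.gauss(0, 1) for _ in range(200)]
x = [sample_x_given_z(zi, rng) for zi in z]
y_null = [zi + rng.gauss(0, 1) for zi in z]   # Y depends on Z only: CI holds
y_dep = [xi + rng.gauss(0, 0.1) for xi in x]  # Y depends on X directly: CI fails

p_null = crt_p_value(x, y_null, z, sample_x_given_z, statistic, rng=random.Random(1))
p_dep = crt_p_value(x, y_dep, z, sample_x_given_z, statistic, rng=random.Random(2))
```

Because the resampled copies of $X$ are drawn from $X|Z$ without looking at $Y$, the p-value is valid under the null whenever the sampler matches the true conditional, which is why the quality of the approximation matters.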
[COMMENTS]
17 pages, 7 figures, AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11744v2
[DATE]
2024-12-18 20:34:00+08:00
[CATEGORIES]
cs.LG
Rehearsal-Free Continual Federated Learning with Synergistic Regularization
[AUTHORS]
Yichen Li, Yuying Wang, Tianzhe Xiao, Haozhao Wang, Yining Qi, Ruixuan Li
[ABSTRACT]
Continual Federated Learning (CFL) allows distributed devices to
collaboratively learn novel concepts from continuously shifting training data
while avoiding knowledge forgetting of previously seen tasks. To tackle this
challenge, most current CFL approaches rely on extensive rehearsal of previous
data. Despite its effectiveness, rehearsal comes at a cost to memory, and it may
also violate data privacy. Given these drawbacks, we seek to apply regularization
techniques to CFL, motivated by their cost efficiency: they do not
require sample caching or rehearsal. Specifically, we first apply traditional
regularization techniques to CFL and observe that existing regularization
techniques, especially synaptic intelligence, can achieve promising results
under homogeneous data distribution but fail when the data is heterogeneous.
Based on this observation, we propose a simple yet effective regularization
algorithm for CFL named FedSSI, which tailors the synaptic intelligence for the
CFL with heterogeneous data settings. FedSSI can not only reduce computational
overhead without rehearsal but also address the data heterogeneity issue.
Extensive experiments show that FedSSI achieves superior performance compared
to state-of-the-art methods.
[LINK]
http://arxiv.org/abs/2412.13779v1
[DATE]
2024-12-18 20:16:41+08:00
[CATEGORIES]
cs.LG
QuLTSF: Long-Term Time Series Forecasting with Quantum Machine Learning
[AUTHORS]
Hari Hara Suthan Chittoor, Paul Robert Griffin, Ariel Neufeld, Jayne Thompson, Mile Gu
[ABSTRACT]
Long-term time series forecasting (LTSF) involves predicting a large number
of future values of a time series based on the past values and is an essential
task in a wide range of domains including weather forecasting, stock market
analysis, and disease outbreak prediction. Over the decades LTSF algorithms have
transitioned from statistical models to deep learning models like transformer
models. Despite the complex architecture of transformer-based LTSF models,
“Are Transformers Effective for Time Series Forecasting?” (Zeng et al., 2023) showed
that simple linear models can outperform the state-of-the-art transformer-based
LTSF models. Recently, quantum machine learning (QML) has been evolving as a domain
to enhance the capabilities of classical machine learning models. In this paper
we initiate the application of QML to LTSF problems by proposing QuLTSF, a
simple hybrid QML model for multivariate LTSF. Through extensive experiments on
a widely used weather dataset we show the advantages of QuLTSF over the
state-of-the-art classical linear models, in terms of reduced mean squared
error and mean absolute error.
[COMMENTS]
submitted for conference publication
[LINK]
http://arxiv.org/abs/2412.13769v1
[DATE]
2024-12-18 20:06:52+08:00
[CATEGORIES]
cs.LG
Cultivating Archipelago of Forests: Evolving Robust Decision Trees through Island Coevolution
[AUTHORS]
Adam Żychowski, Andrew Perrault, Jacek Mańdziuk
[ABSTRACT]
Decision trees are widely used in machine learning due to their simplicity
and interpretability, but they often lack robustness to adversarial attacks and
data perturbations. The paper proposes a novel island-based coevolutionary
algorithm (ICoEvoRDF) for constructing robust decision tree ensembles. The
algorithm operates on multiple islands, each containing populations of decision
trees and adversarial perturbations. The populations on each island evolve
independently, with periodic migration of top-performing decision trees between
islands. This approach fosters diversity and enhances the exploration of the
solution space, leading to more robust and accurate decision tree ensembles.
ICoEvoRDF utilizes a popular game theory concept of mixed Nash equilibrium for
ensemble weighting, which further leads to improvement in results. ICoEvoRDF is
evaluated on 20 benchmark datasets, demonstrating its superior performance
compared to state-of-the-art methods in optimizing both adversarial accuracy
and minimax regret. The flexibility of ICoEvoRDF allows for the integration of
decision trees from various existing methods, providing a unified framework for
combining diverse solutions. Our approach offers a promising direction for
developing robust and interpretable machine learning models.
[LINK]
http://arxiv.org/abs/2412.13762v1
[DATE]
2024-12-18 19:59:24+08:00
[CATEGORIES]
cs.LG
Diffusion map particle systems for generative modeling
[AUTHORS]
Fengyi Li, Youssef Marzouk
[ABSTRACT]
We propose a novel diffusion map particle system (DMPS) for generative
modeling, based on diffusion maps and Laplacian-adjusted Wasserstein gradient
descent (LAWGD). Diffusion maps are used to approximate the generator of the
corresponding Langevin diffusion process from samples, and hence to learn the
underlying data-generating manifold. On the other hand, LAWGD enables efficient
sampling from the target distribution given a suitable choice of kernel, which
we construct here via a spectral approximation of the generator, computed with
diffusion maps. Our method requires no offline training and minimal tuning, and
can outperform other approaches on data sets of moderate dimension.
[LINK]
http://arxiv.org/abs/2304.00200v5
[DATE]
2024-12-18 19:51:43+08:00
[CATEGORIES]
cs.LG
Optimal Exact Recovery in Semi-Supervised Learning: A Study of Spectral Methods and Graph Convolutional Networks
[AUTHORS]
Hai-Xiao Wang, Zhichao Wang
[ABSTRACT]
We delve into the challenge of semi-supervised node classification on the
Contextual Stochastic Block Model (CSBM) dataset. Here, nodes from the
two-cluster Stochastic Block Model (SBM) are coupled with feature vectors,
which are derived from a Gaussian Mixture Model (GMM) that corresponds to their
respective node labels. With only a subset of the CSBM node labels accessible
for training, our primary objective becomes the accurate classification of the
remaining nodes. Venturing into the transductive learning landscape, we, for
the first time, pinpoint the information-theoretical threshold for the exact
recovery of all test nodes in CSBM. Concurrently, we design an optimal spectral
estimator inspired by Principal Component Analysis (PCA) with the training
labels and essential data from both the adjacency matrix and feature vectors.
We also evaluate the efficacy of graph ridge regression and Graph Convolutional
Networks (GCN) on this synthetic dataset. Our findings underscore that graph
ridge regression and GCN possess the ability to achieve the information
threshold of exact recovery in a manner akin to the optimal estimator when
using the optimal weighted self-loops. This highlights the potential role of
feature learning in augmenting the proficiency of GCN, especially in the realm
of semi-supervised learning.
[COMMENTS]
Accepted by ICML 2024. The conference version can be accessed via
https://openreview.net/forum?id=8m4V6Fx6ma
[LINK]
http://arxiv.org/abs/2412.13754v1
[DATE]
2024-12-18 19:44:19+08:00
[CATEGORIES]
cs.LG
ManiBox: Enhancing Spatial Grasping Generalization via Scalable Simulation Data Generation
[AUTHORS]
Hengkai Tan, Xuezhou Xu, Chengyang Ying, Xinyi Mao, Songming Liu, Xingxing Zhang, Hang Su, Jun Zhu
[ABSTRACT]
Learning a precise robotic grasping policy is crucial for embodied agents
operating in complex real-world manipulation tasks. Despite significant
advancements, most models still struggle with accurate spatial positioning of
objects to be grasped. We first show that this spatial generalization challenge
stems primarily from the extensive data requirements for adequate spatial
understanding. However, collecting such data with real robots is prohibitively
expensive, and relying on simulation data often leads to visual generalization
gaps upon deployment. To overcome these challenges, we then focus on
state-based policy generalization and present ManiBox, a novel
bounding-box-guided manipulation method built on a simulation-based
teacher-student framework. The teacher policy efficiently generates scalable
simulation data using bounding boxes, which are proven to uniquely determine
the objects’ spatial positions. The student policy then utilizes these
low-dimensional spatial states to enable zero-shot transfer to real robots.
Through comprehensive evaluations in simulated and real-world environments,
ManiBox demonstrates a marked improvement in spatial grasping generalization
and adaptability to diverse objects and backgrounds. Further, our empirical
study into scaling laws for policy performance indicates that spatial volume
generalization scales with data volume in a power law. For a certain level of
spatial volume, the success rate of grasping empirically follows
Michaelis-Menten kinetics relative to data volume, showing a saturation effect
as data increases. Our videos and code are available at
https://thkkk.github.io/manibox.
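The saturation effect described above can be illustrated with the standard Michaelis-Menten form, rate = max_rate * D / (half_saturation + D). The following is a minimal sketch only: the function name, parameter names, and numeric values are hypothetical and are not taken from the ManiBox paper or its code.

```python
def michaelis_menten(data_volume, max_rate, half_saturation):
    """Standard Michaelis-Menten saturating curve.

    data_volume     -- amount of training data D (hypothetical units)
    max_rate        -- asymptotic success rate approached as D grows
    half_saturation -- value of D at which the curve reaches max_rate / 2
    """
    if data_volume < 0 or half_saturation <= 0:
        raise ValueError("data_volume must be >= 0 and half_saturation > 0")
    return max_rate * data_volume / (half_saturation + data_volume)


# Illustrative values only (assumed, not reported results).
for d in (0, 100, 1_000, 10_000):
    print(d, michaelis_menten(d, max_rate=1.0, half_saturation=100.0))
```

With these assumed parameters the curve rises quickly at small data volumes and flattens as it approaches max_rate, which is the qualitative behaviour the abstract refers to.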
[LINK]
http://arxiv.org/abs/2411.01850v2
[DATE]
2024-12-18 19:25:55+08:00
[CATEGORIES]
cs.LG
Uncertainty separation via ensemble quantile regression
[AUTHORS]
Navid Ansari, Hans-Peter Seidel, Vahid Babaei
[ABSTRACT]
This paper introduces a novel and scalable framework for uncertainty
estimation and separation with applications in data-driven modeling in science
and engineering tasks where reliable uncertainty quantification is critical.
Leveraging an ensemble of quantile regression (E-QR) models, our approach
enhances aleatoric uncertainty estimation while preserving the quality of
epistemic uncertainty, surpassing competing methods, such as Deep Ensembles
(DE) and Monte Carlo (MC) dropout. To address challenges in separating
uncertainty types, we propose an algorithm that iteratively improves separation
through progressive sampling in regions of high uncertainty. Our framework is
scalable to large datasets and demonstrates superior performance on synthetic
benchmarks, offering a robust tool for uncertainty quantification in
data-driven applications.
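For context, quantile regression models are commonly trained with the pinball (quantile) loss. The sketch below shows that generic loss only, under the assumption that it is a reasonable stand-in here; it is not the E-QR training procedure or the separation algorithm from the paper, and all names are illustrative.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Pinball loss for a single prediction at the given quantile level.

    Under-prediction (y_true > y_pred) is weighted by `quantile`,
    over-prediction by `1 - quantile`.
    """
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must lie strictly between 0 and 1")
    diff = y_true - y_pred
    return max(quantile * diff, (quantile - 1.0) * diff)


def mean_pinball_loss(targets, predictions, quantile):
    """Average pinball loss over paired targets and predictions."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(pinball_loss(t, p, quantile) for t, p in zip(targets, predictions))
    return total / len(targets)
```

At a high quantile level such as 0.9, under-predicting costs nine times as much as over-predicting by the same amount, which pushes the fitted value towards the upper tail of the target distribution.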
[LINK]
http://arxiv.org/abs/2412.13738v1
[DATE]
2024-12-18 19:15:32+08:00
[CATEGORIES]
cs.LG
THÖR-MAGNI Act: Actions for Human Motion Modeling in Robot-Shared Industrial Spaces
[AUTHORS]
Tiago Rodrigues de Almeida, Tim Schreiter, Andrey Rudenko, Luigi Palmieiri, Johannes A. Stork, Achim J. Lilienthal
[ABSTRACT]
Accurate human activity and trajectory prediction are crucial for ensuring
safe and reliable human-robot interactions in dynamic environments, such as
industrial settings, with mobile robots. Datasets with fine-grained action
labels for moving people in industrial environments with mobile robots are
scarce, as most existing datasets focus on social navigation in public spaces.
This paper introduces the THÖR-MAGNI Act dataset, a substantial extension of
the THÖR-MAGNI dataset, which captures participant movements alongside robots
in diverse semantic and spatial contexts. THÖR-MAGNI Act provides 8.3 hours
of manually labeled participant actions derived from egocentric videos recorded
via eye-tracking glasses. These actions, aligned with the provided THÖR-MAGNI
motion cues, follow a long-tailed distribution with diversified acceleration,
velocity, and navigation distance profiles. We demonstrate the utility of
THÖR-MAGNI Act for two tasks: action-conditioned trajectory prediction and
joint action and trajectory prediction. We propose two efficient
transformer-based models that outperform the baselines to address these tasks.
These results underscore the potential of THÖR-MAGNI Act to develop
predictive models for enhanced human-robot interaction in complex environments.
[COMMENTS]
This paper has been accepted to the 20th edition of the IEEE/ACM
International Conference on Human-Robot Interaction (HRI’25), which will be
held in Melbourne, Australia on March 4-6, 2025. Code:
https://github.com/tmralmeida/thor-magni-actions
[LINK]
http://arxiv.org/abs/2412.13729v1
[DATE]
2024-12-18 19:08:25+08:00
[CATEGORIES]
cs.LG
Context Matters: Leveraging Contextual Features for Time Series Forecasting
[AUTHORS]
Sameep Chattopadhyay, Pulkit Paliwal, Sai Shankar Narasimhan, Shubhankar Agarwal, Sandeep P. Chinchali
[ABSTRACT]
Time series forecasts are often influenced by exogenous contextual features
in addition to their corresponding history. For example, in financial settings,
it is hard to accurately predict a stock price without considering public
sentiments and policy decisions in the form of news articles, tweets, etc.
Though this is common knowledge, the current state-of-the-art (SOTA)
forecasting models fail to incorporate such contextual information, owing to
its heterogeneity and multimodal nature. To address this, we introduce
ContextFormer, a novel plug-and-play method to surgically integrate multimodal
contextual information into existing pre-trained forecasting models.
ContextFormer effectively distills forecast-specific information from rich
multimodal contexts, including categorical, continuous, time-varying, and even
textual information, to significantly enhance the performance of existing base
forecasters. ContextFormer outperforms SOTA forecasting models by up to 30% on
a range of real-world datasets spanning energy, traffic, environmental, and
financial domains.
[LINK]
http://arxiv.org/abs/2410.12672v4
[DATE]
2024-12-18 19:01:18+08:00
[CATEGORIES]
cs.LG
Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA
[AUTHORS]
Lifeng Qiao, Peng Ye, Yuchen Ren, Weiqiang Bai, Chaoqi Liang, Xinzhu Ma, Nanqing Dong, Wanli Ouyang
[ABSTRACT]
Foundation models have made significant strides in understanding the genomic
language of DNA sequences. However, previous models typically adopt the
tokenization methods designed for natural language, which are unsuitable for
DNA sequences due to their unique characteristics. In addition, the optimal
approach to tokenize DNA remains largely under-explored, and may not be
intuitively understood by humans even if discovered. To address these
challenges, we introduce MxDNA, a novel framework where the model autonomously
learns an effective DNA tokenization strategy through gradient descent. MxDNA
employs a sparse Mixture of Convolution Experts coupled with a deformable
convolution to model the tokenization process, with the discontinuous,
overlapping, and ambiguous nature of meaningful genomic segments explicitly
considered. On Nucleotide Transformer Benchmarks and Genomic Benchmarks, MxDNA
demonstrates superior performance to existing methods with less pretraining
data and time, highlighting its effectiveness. Finally, we show that MxDNA
learns a unique tokenization strategy distinct from those of previous methods and
captures genomic functionalities at a token level during self-supervised
pretraining. Our MxDNA aims to provide a new perspective on DNA tokenization,
potentially offering broad applications in various domains and yielding
profound insights.
[COMMENTS]
Accepted by NeurIPS 2024
[LINK]
http://arxiv.org/abs/2412.13716v1
[DATE]
2024-12-18 18:55:43+08:00
[CATEGORIES]
cs.LG
SSE-SAM: Balancing Head and Tail Classes Gradually through Stage-Wise SAM
[AUTHORS]
Xingyu Lyu, Qianqian Xu, Zhiyong Yang, Shaojie Lyu, Qingming Huang
[ABSTRACT]
Real-world datasets often exhibit a long-tailed distribution, where the vast
majority of classes, known as tail classes, have only a few samples. Traditional
methods tend to overfit on these tail classes. Recently, a new approach called
Imbalanced SAM (ImbSAM) was proposed to leverage the generalization benefits of
Sharpness-Aware Minimization (SAM) for long-tailed distributions. The main
strategy is to merely enhance the smoothness of the loss function for tail
classes. However, we argue that improving generalization in long-tail scenarios
requires a careful balance between head and tail classes. We show that neither
SAM nor ImbSAM alone can fully achieve this balance. For SAM, we prove that
although it enhances the model’s generalization ability by escaping saddle
points in the overall loss landscape, it does not effectively address this for
tail-class losses. Conversely, while ImbSAM is more effective at avoiding
saddle points in tail classes, the head classes are trained insufficiently,
resulting in significant performance drops. Based on these insights, we propose
Stage-wise Saddle Escaping SAM (SSE-SAM), which uses complementary strengths of
ImbSAM and SAM in a phased approach. Initially, SSE-SAM follows the majority
sample to avoid saddle points of the head-class loss. During the later phase,
it focuses on tail-classes to help them escape saddle points. Our experiments
confirm that SSE-SAM escapes saddle points more effectively on both head and
tail classes and improves performance.
[LINK]
http://arxiv.org/abs/2412.13715v1
[DATE]
2024-12-18 18:53:36+08:00
[CATEGORIES]
cs.LG
Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation
[AUTHORS]
Clément Chadebec, Onur Tasar, Eyal Benaroche, Benjamin Aubin
[ABSTRACT]
In this paper, we propose an efficient, fast, and versatile distillation
method to accelerate the generation of pre-trained diffusion models: Flash
Diffusion. The method reaches state-of-the-art performance in terms of FID and
CLIP-Score for few-step image generation on the COCO2014 and COCO2017
datasets, while requiring only several GPU hours of training and fewer
trainable parameters than existing methods. In addition to its efficiency, the
versatility of the method is demonstrated across several tasks such as
text-to-image, inpainting, face-swapping, and super-resolution, and with different
backbones such as UNet-based denoisers (SD1.5, SDXL) or DiT (Pixart-$\alpha$),
as well as adapters. In all cases, the method drastically reduces the
number of sampling steps while maintaining very high-quality image generation.
The official implementation is available at
https://github.com/gojasper/flash-diffusion.
[COMMENTS]
Accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2406.02347v3
[DATE]
2024-12-18 18:45:06+08:00
[CATEGORIES]
cs.LG
Splitting criteria for ordinal decision trees: an experimental study
[AUTHORS]
Rafael Ayllón-Gavilán, Francisco José Martínez-Estudillo, David Guijo-Rubio, César Hervás-Martínez, Pedro Antonio Gutiérrez
[ABSTRACT]
Ordinal Classification (OC) is a machine learning field that addresses
classification tasks where the labels exhibit a natural order. Unlike nominal
classification, which treats all classes as equally distinct, OC takes the
ordinal relationship into account, producing more accurate and relevant
results. This is particularly critical in applications where the magnitude of
classification errors has implications. Despite this, OC problems are often
tackled using nominal methods, leading to suboptimal solutions. Although
decision trees are one of the most popular classification approaches, ordinal
tree-based approaches have received less attention when compared to other
classifiers. This work conducts an experimental study of tree-based
methodologies specifically designed to capture ordinal relationships. A
comprehensive survey of ordinal splitting criteria is provided, standardising
the notations used in the literature for clarity. Three ordinal splitting
criteria, Ordinal Gini (OGini), Weighted Information Gain (WIG), and Ranking
Impurity (RI), are compared to the nominal counterparts of the first two (Gini
and information gain), by incorporating them into a decision tree classifier.
An extensive repository considering 45 publicly available OC datasets is
presented, supporting the first experimental comparison of ordinal and nominal
splitting criteria using well-known OC evaluation metrics. Statistical analysis
of the results highlights OGini as the most effective ordinal splitting
criterion to date. Source code, datasets, and results are made available to the
research community.
[COMMENTS]
11 pages, 3 figures, 6 tables
[LINK]
http://arxiv.org/abs/2412.13697v1
[DATE]
2024-12-18 18:41:44+08:00
[CATEGORIES]
cs.LG
Denoising Diffusion Probabilistic Models for Magnetic Resonance Fingerprinting
[AUTHORS]
Perla Mayo, Carolin M. Pirkl, Alin Achim, Bjoern H. Menze, Mohammad Golbabaee
[ABSTRACT]
Magnetic Resonance Fingerprinting (MRF) is a time-efficient approach to
quantitative MRI, enabling the mapping of multiple tissue properties from a
single, accelerated scan. However, achieving accurate reconstructions remains
challenging, particularly in highly accelerated and undersampled acquisitions,
which are crucial for reducing scan times. While deep learning techniques have
advanced image reconstruction, the recent introduction of diffusion models
offers new possibilities for imaging tasks, though their application in the
medical field is still emerging. Notably, diffusion models have not yet been
explored for the MRF problem. In this work, we propose for the first time a
conditional diffusion probabilistic model for MRF image reconstruction.
Qualitative and quantitative comparisons on in-vivo brain scan data demonstrate
that the proposed approach can outperform established deep learning and
compressed sensing algorithms for MRF reconstruction. Extensive ablation
studies also explore strategies to improve computational efficiency of our
approach.
[COMMENTS]
13 pages, 5 figures, 3 tables, 2 algorithms
[LINK]
http://arxiv.org/abs/2410.23318v2
[DATE]
2024-12-18 18:32:31+08:00
[CATEGORIES]
cs.LG
Neuro-Symbolic Embedding for Short and Effective Feature Selection via Autoregressive Generation
[AUTHORS]
Nanxu Gong, Wangyang Ying, Dongjie Wang, Yanjie Fu
[ABSTRACT]
Feature selection aims to identify the optimal feature subset for enhancing
downstream models. Effective feature selection can remove redundant features,
save computational resources, accelerate the model learning process, and
improve the model’s overall performance. However, existing methods often take
a long time to identify an effective feature subset within high-dimensional
feature spaces. Meanwhile, these methods mainly use performance on a single
downstream task as the selection criterion, leading to selected subsets
that are not only redundant but also lack generalizability. To bridge these
gaps, we reformulate feature selection through a neuro-symbolic lens and
introduce a novel generative framework aimed at identifying short and effective
feature subsets. More specifically, we found that feature ID tokens of the
selected subset can be formulated as symbols to reflect the intricate
correlations among features. Thus, in this framework, we first create a data
collector to automatically collect numerous feature selection samples
consisting of feature ID tokens, model performance, and the measurement of
feature subset redundancy. Building on the collected data, an
encoder-decoder-evaluator learning paradigm is developed to preserve the
intelligence of feature selection into a continuous embedding space for
efficient search. Within the learned embedding space, we leverage a
multi-gradient search algorithm to find more robust and generalized embeddings
with the objective of improving model performance and reducing feature subset
redundancy. These embeddings are then utilized to reconstruct the feature ID
tokens for executing the final feature selection. Ultimately, comprehensive
experiments and case studies are conducted to validate the effectiveness of the
proposed framework.
[COMMENTS]
Accepted to ACM TIST
[LINK]
http://arxiv.org/abs/2404.17157v2
[DATE]
2024-12-18 18:32:06+08:00
[CATEGORIES]
cs.LG
Personalized Clustering via Targeted Representation Learning
[AUTHORS]
Xiwen Geng, Suyun Zhao, Yixin Yu, Borui Peng, Pan Du, Hong Chen, Cuiping Li, Mengdie Wang
[ABSTRACT]
Clustering traditionally aims to reveal a natural grouping structure model
from unlabeled data. However, this model may not always align with users’
preference. In this paper, we propose a personalized clustering method that
explicitly performs targeted representation learning by interacting with users
via a modicum of task information (e.g., $\textit{must-link}$ or
$\textit{cannot-link}$ pairs) to guide the clustering direction. We query users
with the most informative pairs, i.e., the pairs that are hardest to cluster and
those easiest to miscluster, to facilitate representation learning in
terms of the clustering preference. Moreover, by exploiting an attention
mechanism, the targeted representation is learned and augmented. By leveraging
the targeted representation together with a constrained contrastive loss,
personalized clustering is obtained. Theoretically, we verify that the risk of
personalized clustering is tightly bounded, guaranteeing that active queries to
users do mitigate the clustering risk. Experimentally, extensive results show
that our method performs well across different clustering tasks and datasets,
even with a limited number of queries.
[COMMENTS]
Accepted to AAAI 2025 main conference
[LINK]
http://arxiv.org/abs/2412.13690v1
[DATE]
2024-12-18 18:28:51+08:00
[CATEGORIES]
cs.LG
Evolutionary Large Language Model for Automated Feature Transformation
[AUTHORS]
Nanxu Gong, Chandan K. Reddy, Wangyang Ying, Haifeng Chen, Yanjie Fu
[ABSTRACT]
Feature transformation aims to reconstruct the feature space of raw features
to enhance the performance of downstream models. However, the exponential
growth in the combinations of features and operations poses a challenge, making
it difficult for existing methods to efficiently explore a wide space.
Additionally, their optimization is solely driven by the accuracy of downstream
models in specific domains, neglecting the acquisition of general feature
knowledge. To fill this research gap, we propose an evolutionary LLM framework
for automated feature transformation. This framework consists of two parts: 1)
constructing a multi-population database through an RL data collector while
utilizing evolutionary algorithm strategies for database maintenance, and 2)
exploiting the sequence-understanding ability of Large Language Models (LLMs)
by employing few-shot prompts that guide the LLM to generate superior samples based on
feature transformation sequence distinction. Leveraging the multi-population
database initially provides a wide search scope to discover excellent
populations. Through culling and evolution, the high-quality populations are
afforded greater opportunities, thereby furthering the pursuit of optimal
individuals. Through the integration of LLMs with evolutionary algorithms, we
achieve efficient exploration within a vast space, while harnessing feature
knowledge to propel optimization, thus realizing a more adaptable search
paradigm. Finally, we empirically demonstrate the effectiveness and generality
of our proposed method.
[COMMENTS]
Accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2405.16203v2
[DATE]
2024-12-18 18:25:12+08:00
[CATEGORIES]
cs.LG
Understanding Key Point Cloud Features for Development Three-dimensional Adversarial Attacks
[AUTHORS]
Hanieh Naderi, Chinthaka Dinesh, Ivan V. Bajic, Shohreh Kasaei
[ABSTRACT]
Adversarial attacks pose serious challenges for deep neural network
(DNN)-based analysis of various input signals. In the case of three-dimensional
point clouds, methods have been developed to identify points that play a key
role in network decision, and these become crucial in generating existing
adversarial attacks. For example, a saliency map approach is a popular method
for identifying adversarial drop points, whose removal would significantly
impact the network decision. This paper seeks to enhance the understanding of
three-dimensional adversarial attacks by exploring which point cloud features
are most important for predicting adversarial points. Specifically, fourteen
key point cloud features such as edge intensity and distance from the centroid
are defined, and multiple linear regression is employed to assess their
predictive power for adversarial points. Based on critical feature selection
insights, a new attack method has been developed to evaluate whether the
selected features can generate an attack successfully. Unlike traditional
attack methods that rely on model-specific vulnerabilities, this approach
focuses on the intrinsic characteristics of the point clouds themselves. It is
demonstrated that these features can predict adversarial points across four
different DNN architectures: Point Network (PointNet), PointNet++, Dynamic
Graph Convolutional Neural Networks (DGCNN), and Point Convolutional Network
(PointConv), outperforming random guessing and achieving results comparable to
saliency map-based attacks. This study has important engineering applications,
such as enhancing the security and robustness of three-dimensional point
cloud-based systems in fields like robotics and autonomous driving.
[COMMENTS]
10 pages, 6 figures
[LINK]
http://arxiv.org/abs/2210.14164v4
[DATE]
2024-12-18 18:16:59+08:00
[CATEGORIES]
cs.LG
Graph Spring Neural ODEs for Link Sign Prediction
[AUTHORS]
Andrin Rehmann, Alexandre Bovet
[ABSTRACT]
Signed graphs allow for encoding positive and negative relations between
nodes and are used to model various online activities. Node representation
learning for signed graphs is a well-studied task with important applications
such as sign prediction. While the size of datasets is ever-increasing, recent
methods often sacrifice scalability for accuracy. We propose a novel
message-passing layer architecture called Graph Spring Network (GSN) modeled
after spring forces. We combine it with a Graph Neural Ordinary Differential
Equations (ODEs) formalism to optimize the system dynamics in embedding space
to solve a downstream prediction task. Once the dynamics are learned, embedding
generation for novel datasets is done by solving the ODEs in time using a
numerical integration scheme. Our GSN layer leverages the fast-to-compute edge
vector directions and learnable scalar functions that only depend on nodes’
distances in latent space to compute the nodes’ positions. Conversely, Graph
Convolution and Graph Attention Network layers rely on learnable vector
functions that require the full positions of input nodes in latent space. We
propose a specific implementation called Spring-Neural-Network (SPR-NN) using a
set of small neural networks mimicking attracting and repulsing spring forces
that we train for link sign prediction. Experiments show that our method
achieves accuracy close to the state-of-the-art methods with node generation
time speedup factors of up to 28,000 on large graphs.
[LINK]
http://arxiv.org/abs/2412.12916v2
[DATE]
2024-12-18 18:16:59+08:00
[CATEGORIES]
cs.LG
Comparing Differentiable and Dynamic Ray Tracing: Introducing the Multipath Lifetime Map
[AUTHORS]
Jérome Eertmans, Enrico Maria Vittuci, Vittorio Degli-Esposti, Laurent Jacques, Claude Oestges
[ABSTRACT]
With the increasing presence of dynamic scenarios, such as Vehicle-to-Vehicle
communications, radio propagation modeling tools must adapt to the rapidly
changing nature of the radio channel. Recently, both Differentiable and Dynamic
Ray Tracing frameworks have emerged to address these challenges. However, there
is often confusion about how these approaches differ and which one should be
used in specific contexts. In this paper, we provide an overview of these two
techniques and a comparative analysis against two state-of-the-art tools:
3DSCAT from UniBo and Sionna from NVIDIA. To provide a more precise
characterization of the scope of these methods, we introduce a novel
simulation-based metric, the Multipath Lifetime Map, which enables the
evaluation of spatial and temporal coherence in radio channels based only on
the geometrical description of the environment. Finally, our metrics are
evaluated on a classic urban street canyon scenario, yielding similar results
to those obtained from measurement campaigns.
[COMMENTS]
5 pages, 5 figures, 1 table, accepted at EuCAP 2025
[LINK]
http://arxiv.org/abs/2410.14535v4
[DATE]
2024-12-18 18:10:56+08:00
[CATEGORIES]
cs.LG
On Enhancing Root Cause Analysis with SQL Summaries for Failures in Database Workload Replays at SAP HANA
[AUTHORS]
Neetha Jambigi, Joshua Hammesfahr, Moritz Mueller, Thomas Bach, Michael Felderer
[ABSTRACT]
Capturing the workload of a database and replaying this workload for a new
version of the database can be an effective approach for regression testing.
However, false positive errors caused by factors such as data privacy
limitations, time dependency, or non-determinism in multi-threaded environments
can reduce its effectiveness. Therefore, we employ a machine
learning based framework to automate the root cause analysis of failures found
during replays. However, handling unseen novel issues not found in the training
data is one general challenge of machine learning approaches with respect to
generalizability of the learned model. We describe how we continue to address
this challenge for more robust long-term solutions. From our experience,
retraining with new failures is inadequate due to features overlapping across
distinct root causes. Hence, we leverage a large language model (LLM) to
analyze failed SQL statements and extract concise failure summaries as an
additional feature to enhance the classification process. Our experiments show
the F1-Macro score improved by 4.77% for our data. We consider our approach
beneficial for providing end users with additional information to gain more
insights into the found issues and to improve the assessment of the replay
results.
[COMMENTS]
The 35th IEEE International Symposium on Software Reliability
Engineering
[LINK]
http://arxiv.org/abs/2412.13679v1
[DATE]
2024-12-18 18:07:54+08:00
[CATEGORIES]
cs.LG
Urban traffic analysis and forecasting through shared Koopman eigenmodes
[AUTHORS]
Chuhan Yang, Fares B. Mehouachi, Monica Menendez, Saif Eddin Jabari
[ABSTRACT]
Predicting traffic flow in data-scarce cities is challenging due to limited
historical data. To address this, we leverage transfer learning by identifying
periodic patterns common to data-rich cities using a customized variant of
Dynamic Mode Decomposition (DMD): constrained Hankelized DMD (TrHDMD). This
method uncovers common eigenmodes (urban heartbeats) in traffic patterns and
transfers them to data-scarce cities, significantly enhancing prediction
performance. TrHDMD reduces the need for extensive training datasets by
utilizing prior knowledge from other cities. By applying Koopman operator
theory to multi-city loop detector data, we identify stable, interpretable, and
time-invariant traffic modes. Injecting “urban heartbeats” into forecasting
tasks improves prediction accuracy and has the potential to enhance traffic
management strategies for cities with varying data infrastructures. Our work
introduces cross-city knowledge transfer via shared Koopman eigenmodes,
offering actionable insights and reliable forecasts for data-scarce urban
environments.
[LINK]
http://arxiv.org/abs/2409.04728v2
[DATE]
2024-12-18 18:00:35+08:00
[CATEGORIES]
cs.LG
Learning Dynamic Mechanisms in Unknown Environments: A Reinforcement Learning Approach
[AUTHORS]
Shuang Qiu, Boxiang Lyu, Qinglin Meng, Zhaoran Wang, Zhuoran Yang, Michael I. Jordan
[ABSTRACT]
Dynamic mechanism design studies how mechanism designers should allocate
resources among agents in a time-varying environment. We consider the problem
where the agents interact with the mechanism designer according to an unknown
Markov Decision Process (MDP), where agent rewards and the mechanism designer’s
state evolve according to an episodic MDP with unknown reward functions and
transition kernels. We focus on the online setting with linear function
approximation and propose novel learning algorithms to recover the dynamic
Vickrey-Clarke-Groves (VCG) mechanism over multiple rounds of interaction. A key
contribution of our approach is incorporating reward-free online Reinforcement
Learning (RL) to aid exploration over a rich policy space to estimate prices in
the dynamic VCG mechanism. We show that the regret of our proposed method is
upper bounded by $\tilde{\mathcal{O}}(T^{2/3})$ and further devise a lower
bound to show that our algorithm is efficient, incurring the same
$\Omega(T^{2/3})$ regret as the lower bound, where $T$ is the total number of rounds. Our
work establishes the regret guarantee for online RL in solving dynamic
mechanism design problems without prior knowledge of the underlying model.
[COMMENTS]
Accepted in JMLR 2024
[LINK]
http://arxiv.org/abs/2202.12797v3
[DATE]
2024-12-18 17:54:22+08:00
[CATEGORIES]
cs.LG
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
[AUTHORS]
Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
[ABSTRACT]
Video large language models (VLLMs) have significantly advanced recently in
processing complex video content, yet their inference efficiency remains
constrained because of the high computational cost stemming from the thousands
of visual tokens generated from the video inputs. We empirically observe that,
unlike single image inputs, VLLMs typically attend to visual tokens from different
frames at different decoding iterations, making a one-shot pruning strategy
prone to removing important tokens by mistake. Motivated by this, we present
DyCoke, a training-free token compression method to optimize token
representation and accelerate VLLMs. DyCoke incorporates a plug-and-play
temporal compression module to minimize temporal redundancy by merging
redundant tokens across frames, and applies dynamic KV cache reduction to prune
spatially redundant tokens selectively. It ensures high-quality inference by
dynamically retaining the critical tokens at each decoding step. Extensive
experimental results demonstrate that DyCoke can outperform the prior SoTA
counterparts, achieving 1.5X inference speedup, 1.4X memory reduction against
the baseline VLLM, while still improving the performance, with no training.
[COMMENTS]
12 pages, 6 figures
[LINK]
http://arxiv.org/abs/2411.15024v2
[DATE]
2024-12-18 17:47:25+08:00
[CATEGORIES]
cs.LG
Time-Reversible Bridges of Data with Machine Learning
[AUTHORS]
Ludwig Winkler
[ABSTRACT]
The analysis of dynamical systems is a fundamental tool in the natural
sciences and engineering. It is used to understand the evolution of systems as
large as entire galaxies and as small as individual molecules. With predefined
conditions on the evolution of dynamical systems, the underlying differential
equations have to fulfill specific constraints in time and space. This class of
problems is known as boundary value problems. This thesis presents novel
approaches to learn time-reversible deterministic and stochastic dynamics
constrained by initial and final conditions. The dynamics are inferred by
machine learning algorithms from observed data, which is in contrast to the
traditional approach of solving differential equations by numerical
integration. The work in this thesis examines a set of problems of increasing
difficulty each of which is concerned with learning a different aspect of the
dynamics. Initially, we consider learning deterministic dynamics from ground
truth solutions which are constrained by deterministic boundary conditions.
Secondly, we study a boundary value problem in discrete state spaces, where the
forward dynamics follow a stochastic jump process and the boundary conditions
are discrete probability distributions. In particular, the stochastic dynamics
of a specific jump process, the Ehrenfest process, is considered and the
reverse time dynamics are inferred with machine learning. Finally, we
investigate the problem of inferring the dynamics of a continuous-time
stochastic process between two probability distributions without any reference
information. Here, we propose a novel criterion to learn time-reversible
dynamics of two stochastic processes to solve the Schrödinger Bridge Problem.
[LINK]
http://arxiv.org/abs/2412.13665v1
[DATE]
2024-12-18 17:47:19+08:00
[CATEGORIES]
cs.LG
When Should We Prefer State-to-Visual DAgger Over Visual Reinforcement Learning?
[AUTHORS]
Tongzhou Mu, Zhaoyang Li, Stanisław Wiktor Strzelecki, Xiu Yuan, Yunchao Yao, Litian Liang, Hao Su
[ABSTRACT]
Learning policies from high-dimensional visual inputs, such as pixels and
point clouds, is crucial in various applications. Visual reinforcement learning
is a promising approach that directly trains policies from visual observations,
although it faces challenges in sample efficiency and computational costs. This
study conducts an empirical comparison of State-to-Visual DAgger, a two-stage
framework that initially trains a state policy before adopting online imitation
to learn a visual policy, and Visual RL across a diverse set of tasks. We
evaluate both methods across 16 tasks from three benchmarks, focusing on their
asymptotic performance, sample efficiency, and computational costs.
Surprisingly, our findings reveal that State-to-Visual DAgger does not
universally outperform Visual RL but shows significant advantages in
challenging tasks, offering more consistent performance. In contrast, its
benefits in sample efficiency are less pronounced, although it often reduces
the overall wall-clock time required for training. Based on our findings, we
provide recommendations for practitioners and hope that our results contribute
valuable perspectives for future research in visual policy learning.
[COMMENTS]
Accepted by The 39th Annual AAAI Conference on Artificial
Intelligence (AAAI 2025)
[LINK]
http://arxiv.org/abs/2412.13662v1
[DATE]
2024-12-18 17:39:12+08:00
[CATEGORIES]
cs.LG
FFHFlow: A Flow-based Variational Approach for Learning Diverse Dexterous Grasps with Shape-Aware Introspection
[AUTHORS]
Qian Feng, Jianxiang Feng, Zhaopeng Chen, Rudolph Triebel, Alois Knoll
[ABSTRACT]
Synthesizing diverse dexterous grasps from uncertain partial observation is
an important yet challenging task for physically intelligent embodiments.
Previous works on generative grasp synthesis fell short of precisely capturing
the complex grasp distribution and reasoning about shape uncertainty in the
unstructured and often partially perceived reality. In this work, we introduce
a novel model that can generate diverse grasps for a multi-fingered hand while
introspectively handling perceptual uncertainty and recognizing unknown object
geometry to avoid performance degradation. Specifically, we devise a Deep
Latent Variable Model (DLVM) based on Normalizing Flows (NFs), facilitating
hierarchical and expressive latent representation for modeling versatile
grasps. Our model design counteracts typical pitfalls of its popular
alternative in generative grasping, i.e., conditional Variational Autoencoders
(cVAEs) whose performance is limited by mode collapse and misspecified prior
issues. Moreover, the resultant feature hierarchy and the exact flow likelihood
computation endow our model with shape-aware introspective capabilities,
enabling it to quantify the shape uncertainty of partial point clouds and
detect objects of novel geometry. We further achieve performance gain by fusing
this information with a discriminative grasp evaluator, facilitating a novel
hybrid way for grasp evaluation. Comprehensive simulated and real-world
experiments show that the proposed idea gains superior performance and higher
run-time efficiency against strong baselines, including diffusion models. We
also demonstrate substantial benefits of greater diversity for grasping objects
in clutter and a confined workspace in the real world.
[COMMENTS]
The first two authors contributed equally, with ordering decided by
coin toss. Under Review
[LINK]
http://arxiv.org/abs/2407.15161v2
[DATE]
2024-12-18 17:07:47+08:00
[CATEGORIES]
cs.LG
Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model
[AUTHORS]
Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Mengke Zhang, Hao Su
[ABSTRACT]
Recent advancements in robot learning have used imitation learning with large
models and extensive demonstrations to develop effective policies. However,
these models are often limited by the quantity, quality, and diversity of
demonstrations. This paper explores improving offline-trained imitation
learning models through online interactions with the environment. We introduce
Policy Decorator, which uses a model-agnostic residual policy to refine large
imitation learning models during online interactions. By implementing
controlled exploration strategies, Policy Decorator enables stable,
sample-efficient online learning. Our evaluation spans eight tasks across two
benchmarks (ManiSkill and Adroit) and involves two state-of-the-art imitation
learning models (Behavior Transformer and Diffusion Policy). The results show
Policy Decorator effectively improves the offline-trained policies and
preserves the smooth motion of imitation learning models, avoiding the erratic
behaviors of pure RL policies. See our project page
(https://policydecorator.github.io) for videos.
[COMMENTS]
Explore videos, data, code, and more at
https://policydecorator.github.io
[LINK]
http://arxiv.org/abs/2412.13630v1
[DATE]
2024-12-18 17:06:16+08:00
[CATEGORIES]
cs.LG
TAUDiff: Improving statistical downscaling for extreme weather events using generative diffusion models
[AUTHORS]
Rahul Sundar, Nishant Parashar, Antoine Blanchard, Boyko Dodov
[ABSTRACT]
Deterministic regression-based downscaling models for climate variables often
suffer from spectral bias, which can be mitigated by generative models like
diffusion models. To enable efficient and reliable simulation of extreme
weather events, it is crucial to achieve rapid turnaround, dynamical
consistency, and accurate spatio-temporal spectral recovery. We propose an
efficient correction diffusion model, TAUDiff, that combines a deterministic
spatio-temporal model for mean field downscaling with a smaller generative
diffusion model for recovering the fine-scale stochastic features. We
demonstrate the efficacy of this approach on downscaling atmospheric wind
velocity fields obtained from coarse GCM simulations. Our approach can not only
ensure quicker simulation of extreme events but also reduce overall carbon
footprint due to low inference times.
[LINK]
http://arxiv.org/abs/2412.13627v1
[DATE]
2024-12-18 17:05:19+08:00
[CATEGORIES]
cs.LG
Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization
[AUTHORS]
Zongkai Liu, Qian Lin, Chao Yu, Xiawei Wu, Yile Liang, Donghui Li, Xuetao Ding
[ABSTRACT]
Offline Multi-Agent Reinforcement Learning (MARL) is an emerging field that
aims to learn optimal multi-agent policies from pre-collected datasets.
Compared to the single-agent case, the multi-agent setting involves a large joint
state-action space and coupled behaviors of multiple agents, which bring extra
complexity to offline policy optimization. In this work, we revisit the
existing offline MARL methods and show that in certain scenarios they can be
problematic, leading to uncoordinated behaviors and out-of-distribution (OOD)
joint actions. To address these issues, we propose a new offline MARL
algorithm, named In-Sample Sequential Policy Optimization (InSPO). InSPO
sequentially updates each agent’s policy in an in-sample manner, which not only
avoids selecting OOD joint actions but also carefully considers teammates’
updated policies to enhance coordination. Additionally, by thoroughly exploring
low-probability actions in the behavior policy, InSPO can well address the
issue of premature convergence to sub-optimal solutions. Theoretically, we
prove InSPO guarantees monotonic policy improvement and converges to quantal
response equilibrium (QRE). Experimental results demonstrate the effectiveness
of our method compared to current state-of-the-art offline MARL methods.
[LINK]
http://arxiv.org/abs/2412.07639v2
[DATE]
2024-12-18 17:04:32+08:00
[CATEGORIES]
cs.LG
PreMixer: MLP-Based Pre-training Enhanced MLP-Mixers for Large-scale Traffic Forecasting
[AUTHORS]
Tongtong Zhang, Zhiyong Cui, Bingzhang Wang, Yilong Ren, Haiyang Yu, Pan Deng, Yinhai Wang
[ABSTRACT]
In urban computing, precise and swift forecasting of multivariate time series
data from traffic networks is crucial. This data incorporates additional
spatial contexts such as sensor placements and road network layouts, and
exhibits complex temporal patterns that amplify challenges for predictive
learning in traffic management, smart mobility demand, and urban planning.
Consequently, there is an increasing need to forecast traffic flow across
broader geographic regions and for higher temporal coverage. However, current
research encounters limitations because of the inherent inefficiency of models
and their unsuitability for large-scale traffic network applications due to
model complexity. This paper proposes a novel framework, named PreMixer,
designed to bridge this gap. It features a predictive model and a pre-training
mechanism, both based on the principles of Multi-Layer Perceptrons (MLP). The
PreMixer comprehensively considers temporal dependencies of traffic patterns in
different time windows and processes the spatial dynamics as well.
Additionally, we integrate spatio-temporal positional encoding to manage
spatiotemporal heterogeneity without relying on predefined graphs. Furthermore,
our innovative pre-training model uses a simple patch-wise MLP to conduct
masked time series modeling, learning from long-term historical data segmented
into patches to generate enriched contextual representations. This approach
enhances the downstream forecasting model without incurring significant time
consumption or computational resource demands owing to improved learning
efficiency and data handling flexibility. Our framework achieves comparable
state-of-the-art performance while maintaining high computational efficiency,
as verified by extensive experiments on large-scale traffic datasets.
[LINK]
http://arxiv.org/abs/2412.13607v1
[DATE]
2024-12-18 16:35:40+08:00
[CATEGORIES]
cs.LG
Sinkhorn Algorithm for Sequentially Composed Optimal Transports
[AUTHORS]
Kazuki Watanabe, Noboru Isobe
[ABSTRACT]
The Sinkhorn algorithm is the de facto standard approximation algorithm for
optimal transport, which has been applied to a variety of applications,
including image processing and natural language processing. In theory, the
proof of its convergence follows from the convergence of the Sinkhorn–Knopp
algorithm for the matrix scaling problem, and Altschuler et al. show that its
worst-case time complexity is near-linear. Very recently, sequentially
composed optimal transports were proposed by Watanabe and Isobe as a
hierarchical extension of optimal transports. In this paper, we present an
efficient approximation algorithm, namely Sinkhorn algorithm for sequentially
composed optimal transports, for its entropic regularization. Furthermore, we
present a theoretical analysis of the Sinkhorn algorithm, namely (i) its
exponential convergence to the optimal solution with respect to the Hilbert
pseudometric, and (ii) a worst-case complexity analysis for the case of one
sequential composition.
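For orientation, the standard Sinkhorn iteration for a single entropic optimal transport problem can be sketched as below. This is the classical matrix-scaling scheme, not the sequentially composed variant proposed in the paper; the function name `sinkhorn`, the regularization value `eps`, and the toy histograms are assumptions chosen for illustration.

```python
import math


def sinkhorn(a, b, cost, eps=0.5, n_iters=200):
    """Approximate the entropic OT plan between histograms a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so that the row sums of the plan match a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that the column sums of the plan match b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v).
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem (hypothetical data): two bins on each side.
plan = sinkhorn([0.5, 0.5], [0.25, 0.75], [[0.0, 1.0], [1.0, 0.0]])
```

After enough iterations the row and column sums of `plan` agree with the two input histograms up to numerical tolerance. Smaller values of `eps` approximate unregularized transport more closely but typically converge more slowly.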
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2412.03120v3
[DATE]
2024-12-18 16:34:23+08:00
[CATEGORIES]
cs.LG
Diffusion Model from Scratch
[AUTHORS]
Wang Zhen, Dong Yunyun
[ABSTRACT]
Diffusion generative models are currently the most popular generative models.
However, their underlying modeling process is quite complex, and starting
directly with the seminal paper Denoising Diffusion Probabilistic Models (DDPM)
can be challenging. This paper aims to assist readers in building a
foundational understanding of generative models by tracing the evolution from
VAEs to DDPM through detailed mathematical derivations and a problem-oriented
analytical approach. It also explores the core ideas and improvement strategies
of current mainstream methodologies, providing guidance for undergraduate and
graduate students interested in learning about diffusion models.
[COMMENTS]
There were problems with the typography of our illustrations, and
there were problems with the derivation of the 200-step formula
[LINK]
http://arxiv.org/abs/2412.10824v2
[DATE]
2024-12-18 16:25:55+08:00
[CATEGORIES]
cs.LG
PASCO (PArallel Structured COarsening): an overlay to speed up graph clustering algorithms
[AUTHORS]
Etienne Lasalle, Rémi Vaudaine, Titouan Vayer, Pierre Borgnat, Rémi Gribonval, Paulo Gonçalves, Màrton Karsai
[ABSTRACT]
Clustering the nodes of a graph is a cornerstone of graph analysis and has
been extensively studied. However, some popular methods are not suitable for
very large graphs: e.g., spectral clustering requires the computation of the
spectral decomposition of the Laplacian matrix, which is not applicable for
large graphs with a large number of communities. This work introduces PASCO, an
overlay that accelerates clustering algorithms. Our method consists of three
steps: (1) we compute several independent small graphs representing the input
graph by applying an efficient and structure-preserving coarsening algorithm;
(2) a clustering algorithm is run in parallel on each small graph, providing
several partitions of the initial graph; (3) these partitions are aligned and
combined with an optimal transport method to output the final partition. The
PASCO framework is based on two key contributions: a novel global algorithm
structure designed to enable parallelization and a fast, empirically validated
graph coarsening algorithm that preserves structural properties. We demonstrate
the strong performance of PASCO in terms of computational efficiency,
structural preservation, and output partition quality, evaluated on both
synthetic and real-world graph datasets.
[LINK]
http://arxiv.org/abs/2412.13592v1
[DATE]
2024-12-18 16:15:55+08:00
[CATEGORIES]
cs.LG
e-COP : Episodic Constrained Optimization of Policies
[AUTHORS]
Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Sahil Singla
[ABSTRACT]
In this paper, we present the $\texttt{e-COP}$ algorithm, the first policy
optimization algorithm for constrained Reinforcement Learning (RL) in episodic
(finite horizon) settings. Such formulations are applicable when there are
separate sets of optimization criteria and constraints on a system’s behavior.
We approach this problem by first establishing a policy difference lemma for
the episodic setting, which provides the theoretical foundation for the
algorithm. Then, we propose to combine a set of established and novel solution
ideas to yield the $\texttt{e-COP}$ algorithm that is easy to implement and
numerically stable, and provide a theoretical guarantee on optimality under
certain scaling assumptions. Through extensive empirical analysis using
benchmarks in the Safety Gym suite, we show that our algorithm has similar or
better performance than SoTA (non-episodic) algorithms adapted for the episodic
setting. The scalability of the algorithm opens the door to its application in
safety-constrained Reinforcement Learning from Human Feedback for Large
Language or Diffusion Models.
[LINK]
http://arxiv.org/abs/2406.09563v2
[DATE]
2024-12-18 16:15:09+08:00
[CATEGORIES]
cs.LG
Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models
[AUTHORS]
Seungeun Oh, Jinhyuk Kim, Jihong Park, Seung-Woo Ko, Tony Q. S. Quek, Seong-Lyun Kim
[ABSTRACT]
This paper studies a hybrid language model (HLM) architecture that integrates
a small language model (SLM) operating on a mobile device with a large language
model (LLM) hosted at the base station (BS) of a wireless network. The HLM
token generation process follows the speculative inference principle: the SLM’s
vocabulary distribution is uploaded to the LLM, which either accepts or rejects
it, with rejected tokens being resampled by the LLM. While this approach
ensures alignment between the vocabulary distributions of the SLM and LLM, it
suffers from low token throughput due to uplink transmission and the
computation costs of running both language models. To address this, we propose
a novel HLM structure coined Uncertainty-aware opportunistic HLM (U-HLM),
wherein the SLM locally measures its output uncertainty and skips both uplink
transmissions and LLM operations for tokens that are likely to be accepted.
This opportunistic skipping is enabled by our empirical finding of a linear
correlation between the SLM’s uncertainty and the LLM’s rejection probability.
We analytically derive the uncertainty threshold and evaluate its expected risk
of rejection. Simulations show that U-HLM reduces uplink transmissions and LLM
computations by 45.93%, while achieving up to 97.54% of the LLM’s inference
accuracy and 2.54$\times$ faster token throughput than HLM without skipping.
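The opportunistic skipping rule can be pictured with the minimal sketch below. Everything in it is hypothetical: the uncertainty proxy (one minus the top token probability), the linear-fit coefficients, and the rejection budget are illustrative assumptions, not the measure or the threshold derived in the paper.

```python
def uncertainty(probs):
    """Hypothetical uncertainty proxy: one minus the top token probability."""
    return 1.0 - max(probs)


def threshold_from_linear_fit(slope, intercept, max_rejection_prob):
    """Largest uncertainty whose predicted rejection probability fits the budget.

    Assumes rejection_prob ~= slope * uncertainty + intercept with slope > 0.
    """
    return (max_rejection_prob - intercept) / slope


def should_skip_uplink(probs, threshold):
    """Skip uplink and LLM verification when the SLM looks confident enough."""
    return uncertainty(probs) <= threshold


# Hypothetical fit and budget, chosen only to make the example concrete.
u_th = threshold_from_linear_fit(slope=0.8, intercept=0.05, max_rejection_prob=0.25)
confident = should_skip_uplink([0.9, 0.05, 0.05], u_th)   # low uncertainty
unsure = should_skip_uplink([0.5, 0.3, 0.2], u_th)        # high uncertainty
```

Under these assumed numbers the confident distribution is kept on the device, while the uncertain one would be uploaded for verification by the LLM.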
[COMMENTS]
6 pages, 6 figures; This work has been submitted to the IEEE for
possible publication
[LINK]
http://arxiv.org/abs/2412.12687v2
[DATE]
2024-12-18 16:14:35+08:00
[CATEGORIES]
cs.LG
SemiDFL: A Semi-Supervised Paradigm for Decentralized Federated Learning
[AUTHORS]
Xinyang Liu, Pengchao Han, Xuan Li, Bo Liu
[ABSTRACT]
Decentralized federated learning (DFL) realizes cooperative model training
among connected clients without relying on a central server, thereby mitigating
communication bottlenecks and eliminating the single-point failure issue
present in centralized federated learning (CFL). Most existing work on DFL
focuses on supervised learning, assuming each client possesses sufficient
labeled data for local training. However, in real-world applications, much of
the data is unlabeled. We address this by considering a challenging yet
practical semi-supervised learning (SSL) scenario in DFL, where clients may have
varying data sources: some with few labeled samples, some with purely unlabeled
data, and others with both. In this work, we propose SemiDFL, the first
semi-supervised DFL method that enhances DFL performance in SSL scenarios by
establishing a consensus in both data and model spaces. Specifically, we
utilize neighborhood information to improve the quality of pseudo-labeling,
which is crucial for effectively leveraging unlabeled data. We then design a
consensus-based diffusion model to generate synthesized data, which is used in
combination with pseudo-labeled data to create mixed datasets. Additionally, we
develop an adaptive aggregation method that leverages the model accuracy of
synthesized data to further enhance SemiDFL performance. Through extensive
experimentation, we demonstrate the remarkable performance superiority of the
proposed SemiDFL method over existing CFL and DFL schemes in both IID and
non-IID SSL scenarios.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13589v1
[DATE]
2024-12-18 16:12:55+08:00
[CATEGORIES]
cs.LG
Timer-XL: Long-Context Transformers for Unified Time Series Forecasting
[AUTHORS]
Yong Liu, Guo Qin, Xiangdong Huang, Jianmin Wang, Mingsheng Long
[ABSTRACT]
We present Timer-XL, a generative Transformer for unified time series
forecasting. To uniformly predict 1D and 2D time series, we generalize next
token prediction, predominantly adopted for causal generation of 1D sequences,
to multivariate next token prediction. The proposed paradigm uniformly
formulates various forecasting scenarios as a long-context generation problem.
We opt for the generative Transformer, which can capture global-range and
causal dependencies while providing contextual flexibility, to implement
unified forecasting on univariate series characterized by non-stationarity,
multivariate time series with complicated dynamics and correlations, and
covariate-informed contexts that include both endogenous and exogenous time
series. Technically, we propose a universal TimeAttention to facilitate
generative Transformers on multiple time series, which can effectively capture
fine-grained intra- and inter-series dependencies of flattened time series
tokens (patches), and is further enhanced by deftly designed position
embeddings for the temporal and variable dimensions. Timer-XL achieves
state-of-the-art performance across challenging forecasting benchmarks through
a unified approach. Based on large-scale pre-training, Timer-XL also
demonstrates notable zero-shot performance, making it a promising architecture
for large time series models.
[LINK]
http://arxiv.org/abs/2410.04803v3
[DATE]
2024-12-18 16:12:18+08:00
[CATEGORIES]
cs.LG
Careful Seeding for k-Medoids Clustering with Incremental k-Means++ Initialization
[AUTHORS]
Difei Cheng, Yunfeng Zhang, Ruinan Jin
[ABSTRACT]
K-medoids clustering is a popular variant of k-means clustering and widely
used in pattern recognition and machine learning. A main drawback of k-medoids
clustering is that an improper initialization can cause it to get trapped in
local optima. An improved k-medoids clustering algorithm, called INCKM
algorithm, which is the first to apply incremental initialization to k-medoids
clustering, was recently proposed to overcome this drawback. The INCKM
algorithm requires the construction of a subset of candidate medoids determined
by one hyperparameter for initialization, and meanwhile, it always fails when
dealing with imbalanced datasets with an incorrect hyperparameter selection. In
this paper, we propose a novel k-medoids clustering algorithm, called
incremental k-means++ (INCKPP) algorithm, which initializes in a novel
incremental manner, attempting to optimally add one new cluster center at each
stage through a nonparametric and stochastic k-means++ initialization. The
INCKPP algorithm overcomes the difficulty of hyperparameter selection in the
INCKM algorithm, improves the clustering performance, and can deal with
imbalanced datasets well. However, the INCKPP algorithm is not computationally
efficient enough. To deal with this, we further propose an improved INCKPP
algorithm, called INCKPPsample algorithm, which improves the clustering
efficiency while maintaining the clustering performance of the INCKPP
algorithm. Extensive results from experiments on both synthetic and real-world
datasets, including imbalanced datasets, illustrate that the proposed
algorithms outperform the other compared algorithms.
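As background for the k-means++ initialization that the abstract builds on, a minimal sketch of standard D^2 seeding is given below. It returns indices of data points, so the seeds could serve as initial medoids, but it is not the incremental INCKPP procedure itself. The function name `kmeanspp_seeds`, the fixed random seed, and the assumption that the data contain at least `k` distinct points are choices made for this example.

```python
import random


def kmeanspp_seeds(points, k, rng=None):
    """Pick k seed indices by D^2 sampling (standard k-means++ seeding)."""
    rng = rng or random.Random(0)

    def sqdist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    # The first seed is drawn uniformly at random.
    seeds = [rng.randrange(len(points))]
    # d2[i] is the squared distance from point i to its nearest chosen seed.
    d2 = [sqdist(p, points[seeds[0]]) for p in points]
    while len(seeds) < k:
        # Sample the next seed with probability proportional to d2.
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        seeds.append(idx)
        d2 = [min(d, sqdist(p, points[idx])) for d, p in zip(d2, points)]
    return seeds


# Toy data (hypothetical): two well-separated groups of points.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeanspp_seeds(data, k=2)
```

Points far from every existing seed are more likely to be drawn next, which tends to spread the seeds across clusters. An already chosen point has weight zero and is therefore not drawn again.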
[COMMENTS]
18 pages, 43 figures, 5 tables
[LINK]
http://arxiv.org/abs/2207.02404v2
[DATE]
2024-12-18 16:00:59+08:00
[CATEGORIES]
cs.LG
Scaling Laws for Pre-training Agents and World Models
[AUTHORS]
Tim Pearce, Tabish Rashid, Dave Bignell, Raluca Georgescu, Sam Devlin, Katja Hofmann
[ABSTRACT]
The performance of embodied agents has been shown to improve by increasing
model parameters, dataset size, and compute. This has been demonstrated in
domains from robotics to video games, when generative learning objectives on
offline datasets (pre-training) are used to model an agent’s behavior
(imitation learning) or their environment (world modeling). This paper
characterizes the role of scale in these tasks more precisely. Going beyond the
simple intuition that 'bigger is better', we show that the same types of power
laws found in language modeling also arise in world modeling and imitation
learning (e.g. between loss and optimal model size). However, the coefficients
of these laws are heavily influenced by the tokenizer, task, and architecture;
this has important implications on the optimal sizing of models and data.
[LINK]
http://arxiv.org/abs/2411.04434v2
[DATE]
2024-12-18 15:54:33+08:00
[CATEGORIES]
cs.LG
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
[AUTHORS]
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, Jingren Zhou
[ABSTRACT]
In our previous work, we introduced CosyVoice, a multilingual speech
synthesis model based on supervised discrete speech tokens. By employing
progressive semantic decoding with two popular generative models, language
models (LMs) and Flow Matching, CosyVoice demonstrated high prosody
naturalness, content consistency, and speaker similarity in speech in-context
learning. Recently, significant progress has been made in multi-modal large
language models (LLMs), where the response latency and real-time factor of
speech synthesis play a crucial role in the interactive experience. Therefore,
in this report, we present an improved streaming speech synthesis model,
CosyVoice 2, which incorporates comprehensive and systematic optimizations.
Specifically, we introduce finite-scalar quantization to improve the codebook
utilization of speech tokens. For the text-speech LM, we streamline the model
architecture to allow direct use of a pre-trained LLM as the backbone. In
addition, we develop a chunk-aware causal flow matching model to support
various synthesis scenarios, enabling both streaming and non-streaming
synthesis within a single model. By training on a large-scale multilingual
dataset, CosyVoice 2 achieves human-parity naturalness, minimal response
latency, and virtually lossless synthesis quality in the streaming mode. We
invite readers to listen to the demos at
https://funaudiollm.github.io/cosyvoice2.
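The general idea behind finite-scalar quantization can be sketched as follows: each latent coordinate is squashed into a bounded range and rounded to one of a small number of integer levels. This is a simplified illustration restricted to odd level counts, with hypothetical names and values; it is not CosyVoice 2's tokenizer, whose details are not given in the abstract.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each coordinate of z to one of levels[i] integer codes.

    Simplified sketch: assumes every entry of `levels` is odd, so the codes
    are symmetric around zero.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (n_levels - 1) // 2
        bounded = math.tanh(value) * half  # squash into [-half, half]
        codes.append(round(bounded))       # snap to the nearest integer level
    return codes


# Hypothetical 3-dimensional latent with 5 levels per dimension.
codes = fsq_quantize([0.0, 10.0, -10.0], [5, 5, 5])
# codes == [0, 2, -2]
```

Because every combination of per-dimension levels is a valid code, the implicit codebook is the product of the level counts and needs no learned codebook vectors.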
[COMMENTS]
Tech report, work in progress
[LINK]
http://arxiv.org/abs/2412.10117v2
[DATE]
2024-12-18 15:37:37+08:00
[CATEGORIES]
cs.LG
Indirect Query Bayesian Optimization with Integrated Feedback
[AUTHORS]
Mengyan Zhang, Shahine Bouabid, Cheng Soon Ong, Seth Flaxman, Dino Sejdinovic
[ABSTRACT]
We develop the framework of Indirect Query Bayesian Optimization (IQBO), a
new class of Bayesian optimization problems where the integrated feedback is
given via a conditional expectation of the unknown function $f$ to be
optimized. The underlying conditional distribution can be unknown and learned
from data. The goal is to find the global optimum of $f$ by adaptively querying
and observing in the space transformed by the conditional distribution. This is
motivated by real-world applications where one cannot access direct feedback
due to privacy, hardware or computational constraints. We propose the
Conditional Max-Value Entropy Search (CMES) acquisition function to address
this novel setting, and propose a hierarchical search algorithm to address the
multi-resolution setting and improve the computational efficiency. We show
regret bounds for our proposed methods and demonstrate the effectiveness of our
approaches on simulated optimization tasks.
[COMMENTS]
Preliminary work. Under review
[LINK]
http://arxiv.org/abs/2412.13559v1
[DATE]
2024-12-18 15:20:33+08:00
[CATEGORIES]
cs.LG
The Distributional Reward Critic Framework for Reinforcement Learning Under Perturbed Rewards
[AUTHORS]
Xi Chen, Zhihui Zhu, Andrew Perrault
[ABSTRACT]
The reward signal plays a central role in defining the desired behaviors of
agents in reinforcement learning (RL). Rewards collected from realistic
environments could be perturbed, corrupted, or noisy due to an adversary,
sensor error, or because they come from subjective human feedback. Thus, it is
important to construct agents that can learn under such rewards. Existing
methodologies for this problem make strong assumptions, including that the
perturbation is known in advance, clean rewards are accessible, or that the
perturbation preserves the optimal policy. We study a new, more general, class
of unknown perturbations, and introduce a distributional reward critic
framework for estimating reward distributions and perturbations during
training. Our proposed methods are compatible with any RL algorithm. Despite
their increased generality, we show that they achieve comparable or better
rewards than existing methods in a variety of environments, including those
with clean rewards. Under the challenging and generalized perturbations we
study, we win/tie the highest return in 44/48 tested settings (compared to
11/48 for the best baseline). Our results broaden and deepen our ability to
perform RL in reward-perturbed environments.
[COMMENTS]
to be published in AAAI 2025
[LINK]
http://arxiv.org/abs/2401.05710v2
[DATE]
2024-12-18 15:04:12+08:00
[CATEGORIES]
cs.LG
Rethinking Multi-domain Generalization with A General Learning Objective
[AUTHORS]
Zhaorui Tan, Xi Yang, Kaizhu Huang
[ABSTRACT]
Multi-domain generalization (mDG) universally aims to minimize the
discrepancy between training and testing distributions to enhance
marginal-to-label distribution mapping. However, existing mDG literature lacks
a general learning objective paradigm and often imposes constraints on static
target marginal distributions. In this paper, we propose to leverage a
$Y$-mapping to relax the constraint. We rethink the learning objective for mDG
and design a new general learning objective to interpret and analyze
most existing mDG wisdom. This general objective is bifurcated into two
synergistic aims: learning domain-independent conditional features and
maximizing a posterior. Explorations also extend to two effective
regularization terms that incorporate prior information and suppress invalid
causality, alleviating the issues that come with relaxed constraints. We
theoretically contribute an upper bound for the domain alignment of
domain-independent conditional features, disclosing that many previous mDG
endeavors actually partially optimize the objective and thus lead to
limited performance. As such, our study distills a general learning objective
into four practical components, providing a general, robust, and flexible
mechanism to handle complex domain shifts. Extensive empirical results indicate
that the proposed objective with $Y$-mapping leads to substantially better mDG
performance in various downstream tasks, including regression, segmentation,
and classification.
[COMMENTS]
Accepted by CVPR24
[LINK]
http://arxiv.org/abs/2402.18853v2
[DATE]
2024-12-18 14:55:31+08:00
[CATEGORIES]
cs.LG
Spatio-Temporal Fuzzy-oriented Multi-Modal Meta-Learning for Fine-grained Emotion Recognition
[AUTHORS]
Jingyao Wang, Yuxuan Yang, Wenwen Qiang, Changwen Zheng, Hui Xiong
[ABSTRACT]
Fine-grained emotion recognition (FER) plays a vital role in various fields,
such as disease diagnosis, personalized recommendations, and multimedia mining.
However, existing FER methods face three key challenges in real-world
applications: (i) they rely on large amounts of continuously annotated data to
ensure accuracy since emotions are complex and ambiguous in reality, which is
costly and time-consuming; (ii) they cannot capture the temporal heterogeneity
caused by changing emotion patterns, because they usually assume that the
temporal correlation within sampling periods is the same; (iii) they do not
consider the spatial heterogeneity of different FER scenarios, that is, the
distribution of emotion information in different data may have bias or
interference. To address these challenges, we propose a Spatio-Temporal
Fuzzy-oriented Multi-modal Meta-learning framework (ST-F2M). Specifically,
ST-F2M first divides the multi-modal videos into multiple views, and each view
corresponds to one modality of one emotion. Multiple randomly selected views
for the same emotion form a meta-training task. Next, ST-F2M uses an integrated
module with spatial and temporal convolutions to encode the data of each task,
reflecting the spatial and temporal heterogeneity. Then it adds fuzzy semantic
information to each task based on generalized fuzzy rules, which helps handle
the complexity and ambiguity of emotions. Finally, ST-F2M learns
emotion-related general meta-knowledge through meta-recurrent neural networks
to achieve fast and robust fine-grained emotion recognition. Extensive
experiments show that ST-F2M outperforms various state-of-the-art methods in
terms of accuracy and model efficiency. In addition, we construct ablation
studies and further analysis to explore why ST-F2M performs well.
[COMMENTS]
13 pages, submitted to TMM on 30 May 2024
[LINK]
http://arxiv.org/abs/2412.13541v1
[DATE]
2024-12-18 14:40:53+08:00
[CATEGORIES]
cs.LG
Speed Up the Cold-Start Learning in Two-Sided Bandits with Many Arms
[AUTHORS]
Mohsen Bayati, Junyu Cao, Wanning Chen
[ABSTRACT]
Multi-armed bandit (MAB) algorithms are efficient approaches to reduce the
opportunity cost of online experimentation and are used by companies to find
the best product from periodically refreshed product catalogs. However, these
algorithms face the so-called cold-start at the onset of the experiment due to
a lack of knowledge of customer preferences for new products, requiring an
initial data collection phase known as the burn-in period. During this period,
standard MAB algorithms operate like randomized experiments, incurring large
burn-in costs which scale with the large number of products. We attempt to
reduce the burn-in by identifying that many products can be cast into two-sided
products, and then naturally model the rewards of the products with a matrix,
whose rows and columns represent the two sides respectively. Next, we design
two-phase bandit algorithms that first use subsampling and low-rank matrix
estimation to obtain a substantially smaller targeted set of products and then
apply a UCB procedure on the target products to find the best one. We
theoretically show that the proposed algorithms lower costs and expedite the
experiment in cases when there is limited experimentation time along with a
large product set. Our analysis also reveals three regimes of long, short, and
ultra-short horizon experiments, depending on dimensions of the matrix.
Empirical evidence from both synthetic data and a real-world dataset on music
streaming services validates this superior performance.
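The second phase described above applies a UCB procedure to the reduced target set. As a rough illustration only, the sketch below runs the textbook UCB1 rule on a small, hypothetical target set; the subsampling and low-rank estimation phase is omitted, and the function name, the reward values, and the noise-free rewards are assumptions made for the example rather than details taken from the paper.

```python
import math


def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds; `pull(arm)` returns a reward in [0, 1]."""
    counts = [0] * n_arms    # number of times each arm was played
    totals = [0.0] * n_arms  # cumulative reward collected from each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once to initialise its estimate
        else:
            # Choose the arm with the largest optimistic index:
            # empirical mean plus an exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        totals[arm] += reward
    return counts


# Hypothetical target set of three products with fixed (noise-free) rewards.
target_means = [0.2, 0.5, 0.9]
pulls = ucb1(lambda a: target_means[a], n_arms=3, horizon=1000)
best = max(range(3), key=lambda a: pulls[a])
```

Because the exploration bonus shrinks as an arm is played, most of the 1000 rounds go to the product with the highest mean reward, which is why shrinking the candidate set before this phase reduces the burn-in cost.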
[LINK]
http://arxiv.org/abs/2210.00340v3
[DATE]
2024-12-18 14:31:11+08:00
[CATEGORIES]
cs.LG
KA-GNN: Kolmogorov-Arnold Graph Neural Networks for Molecular Property Prediction
[AUTHORS]
Longlong Li, Yipeng Zhang, Guanghui Wang, Kelin Xia
[ABSTRACT]
As key models in geometric deep learning, graph neural networks have
demonstrated enormous power in molecular data analysis. Recently, a
specially-designed learning scheme, known as Kolmogorov-Arnold Network (KAN),
shows unique potential for the improvement of model accuracy, efficiency, and
explainability. Here we propose the first non-trivial Kolmogorov-Arnold
Network-based Graph Neural Networks (KA-GNNs), including KAN-based graph
convolutional networks (KA-GCN) and KAN-based graph attention networks (KA-GAT).
The essential idea is to utilize KAN’s unique power to optimize GNN
architectures at three major levels: node embedding, message passing,
and readout. Further, with the strong approximation capability of Fourier
series, we develop a Fourier series-based KAN model and provide a rigorous
mathematical proof of the robust approximation capability of this Fourier KAN
architecture. To validate our KA-GNNs, we consider seven of the most widely used
benchmark datasets for molecular property prediction and extensively compare
with existing state-of-the-art models. It has been found that our KA-GNNs can
outperform traditional GNN models. More importantly, our Fourier KAN module can
not only increase the model accuracy but also reduce the computational time.
This work not only highlights the great power of KA-GNNs in molecular property
prediction but also provides a novel geometric deep learning framework for the
general non-Euclidean data analysis.
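To make the idea of a Fourier series-based KAN module concrete, here is a minimal sketch of the generic construction: every edge carries a learnable one-dimensional function written as a truncated Fourier series, and each output sums those functions over the inputs. The function names, the coefficient layout, and the absence of a bias term are assumptions for illustration; this is not the authors' KA-GNN code.

```python
import math


def fourier_feature(x, cos_coef, sin_coef):
    """Learnable 1-D function: sum_k a_k*cos(k*x) + b_k*sin(k*x), k = 1..K."""
    return sum(
        a * math.cos(k * x) + b * math.sin(k * x)
        for k, (a, b) in enumerate(zip(cos_coef, sin_coef), start=1)
    )


def fourier_kan_layer(inputs, cos_coefs, sin_coefs):
    """KAN-style layer: output j sums a separate 1-D function of every input i.

    cos_coefs[j][i] and sin_coefs[j][i] hold the K Fourier coefficients of the
    function on the edge from input i to output j.
    """
    return [
        sum(
            fourier_feature(x, cos_coefs[j][i], sin_coefs[j][i])
            for i, x in enumerate(inputs)
        )
        for j in range(len(cos_coefs))
    ]


# One output, two inputs, K = 1: the layer computes cos(x0) + sin(x1).
outputs = fourier_kan_layer(
    [0.0, math.pi / 2],
    cos_coefs=[[[1.0], [0.0]]],
    sin_coefs=[[[0.0], [1.0]]],
)
```

In a trained model the coefficients would be learned parameters; here they are fixed by hand so the output can be checked against cos(0) + sin(pi/2).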
[LINK]
http://arxiv.org/abs/2410.11323v2
[DATE]
2024-12-18 14:23:29+08:00
[CATEGORIES]
cs.LG
Optimal Rates for Robust Stochastic Convex Optimization
[AUTHORS]
Changyu Gao, Andrew Lowy, Xingyu Zhou, Stephen J. Wright
[ABSTRACT]
Machine learning algorithms in high-dimensional settings are highly
susceptible to the influence of even a small fraction of structured outliers,
making robust optimization techniques essential. In particular, within the
$\epsilon$-contamination model, where an adversary can inspect and replace up
to an $\epsilon$-fraction of the samples, a fundamental open problem is
determining the optimal rates for robust stochastic convex optimization (SCO)
under such contamination. We develop novel algorithms that achieve
minimax-optimal excess risk (up to logarithmic factors) under the
$\epsilon$-contamination model. Our approach improves over existing algorithms,
which are not only suboptimal but also require stringent assumptions, including
Lipschitz continuity and smoothness of individual sample functions. By
contrast, our optimal algorithms do not require these restrictive assumptions,
and can handle nonsmooth but Lipschitz population loss functions. We complement
our algorithmic developments with a tight lower bound for robust SCO.
[LINK]
http://arxiv.org/abs/2412.11003v2
[DATE]
2024-12-18 14:17:12+08:00
[CATEGORIES]
cs.LG
Quantum Machine Learning in Log-based Anomaly Detection: Challenges and Opportunities
[AUTHORS]
Jiaxing Qi, Chang Zeng, Zhongzhi Luan, Shaohan Huang, Shu Yang, Yao Lu, Bin Han, Hailong Yang, Depei Qian
[ABSTRACT]
Log-based anomaly detection (LogAD) is the main component of Artificial
Intelligence for IT Operations (AIOps), which can detect anomalies that occur
during system operation on the fly. Existing methods commonly extract log sequence
features using classical machine learning techniques to identify whether a new
sequence is an anomaly or not. However, these classical approaches often
require trade-offs between efficiency and accuracy. The advent of quantum
machine learning (QML) offers a promising alternative. By transforming parts of
classical machine learning computations into parameterized quantum circuits
(PQCs), QML can significantly reduce the number of trainable parameters while
maintaining accuracy comparable to classical counterparts. In this work, we
introduce a unified framework, \ourframework{}, for evaluating QML models in
the context of LogAD. This framework incorporates diverse log data, integrated
QML models, and comprehensive evaluation metrics. State-of-the-art methods such
as DeepLog, LogAnomaly, and LogRobust, along with their quantum-transformed
counterparts, are included in our framework. Beyond standard metrics like F1
score, precision, and recall, our evaluation extends to factors critical to QML
performance, such as specificity, the number of circuits, circuit design, and
quantum state encoding. Using \ourframework{}, we conduct extensive experiments
to assess the performance of these models and their quantum counterparts,
uncovering valuable insights and paving the way for future research in QML
model selection and design for LogAD.
[LINK]
http://arxiv.org/abs/2412.13529v1
[DATE]
2024-12-18 14:13:49+08:00
[CATEGORIES]
cs.LG
Analyzing (In)Abilities of SAEs via Formal Languages
[AUTHORS]
Abhinav Menon, Manish Shrivastava, David Krueger, Ekdeep Singh Lubana
[ABSTRACT]
Autoencoders have been used for finding interpretable and disentangled
features underlying neural network representations in both image and text
domains. While the efficacy and pitfalls of such methods are well-studied in
vision, there is a lack of corresponding results, both qualitative and
quantitative, for the text domain. We aim to address this gap by training
sparse autoencoders (SAEs) on a synthetic testbed of formal languages.
Specifically, we train SAEs on the hidden representations of models trained on
formal languages (Dyck-2, Expr, and English PCFG) under a wide variety of
hyperparameter settings, finding interpretable latents often emerge in the
features learned by our SAEs. However, similar to vision, we find performance
turns out to be highly sensitive to inductive biases of the training pipeline.
Moreover, we show latents correlating to certain features of the input do not
always induce a causal impact on the model’s computation. We thus argue that
causality has to become a central target in SAE training: learning of causal
features should be incentivized from the ground up. Motivated by this, we
propose and perform preliminary investigations for an approach that promotes
learning of causally relevant features in our formal language setting.
[COMMENTS]
NeurIPS workshop on Foundation Model Interventions (Awarded best
paper)
[LINK]
http://arxiv.org/abs/2410.11767v2
[DATE]
2024-12-18 14:13:46+08:00
[CATEGORIES]
cs.LG
Mastering AI: Big Data, Deep Learning, and the Evolution of Large Language Models – AutoML from Basics to State-of-the-Art Techniques
[AUTHORS]
Pohsun Feng, Ziqian Bi, Yizhu Wen, Benji Peng, Junyu Liu, Caitlyn Heqi Yin, Tianyang Wang, Keyu Chen, Sen Zhang, Ming Li, Jiawei Xu, Ming Liu, Xuanhe Pan, Jinlang Wang, Qian Niu
[ABSTRACT]
A comprehensive guide to Automated Machine Learning (AutoML) is presented,
covering fundamental principles, practical implementations, and future trends.
The paper is structured to assist both beginners and experienced practitioners,
with detailed discussions on popular AutoML tools such as TPOT, AutoGluon, and
Auto-Keras. Emerging topics like Neural Architecture Search (NAS) and AutoML’s
applications in deep learning are also addressed. It is anticipated that this
work will contribute to ongoing research and development in the field of AI and
machine learning.
[COMMENTS]
This book contains 169 pages and 5 figures
[LINK]
http://arxiv.org/abs/2410.09596v2
[DATE]
2024-12-18 14:10:58+08:00
[CATEGORIES]
cs.LG
Towards a Dynamic Future with Adaptable Computing and Network Convergence (ACNC)
[AUTHORS]
Masoud Shokrnezhad, Hao Yu, Tarik Taleb, Richard Li, Kyunghan Lee, Jaeseung Song, Cedric Westphal
[ABSTRACT]
In the context of advancing 6G, a substantial paradigm shift is anticipated,
highlighting comprehensive everything-to-everything interactions characterized
by numerous connections and stringent adherence to Quality of
Service/Experience (QoS/E) prerequisites. The imminent challenge stems from
resource scarcity, prompting a deliberate transition to Computing-Network
Convergence (CNC) as an auspicious approach for joint resource orchestration.
While CNC-based mechanisms have garnered attention, their effectiveness in
realizing future services, particularly in use cases like the Metaverse, may
encounter limitations due to the continually changing nature of users,
services, and resources. Hence, this paper presents the concept of Adaptable
CNC (ACNC) as an autonomous Machine Learning (ML)-aided mechanism crafted for
the joint orchestration of computing and network resources, catering to dynamic
and voluminous user requests with stringent requirements. ACNC encompasses two
primary functionalities: state recognition and context detection. Given the
intricate nature of the user-service-computing-network space, the paper employs
dimension reduction to generate live, holistic, abstract system states in a
hierarchical structure. To address the challenges posed by dynamic changes,
Continual Learning (CL) is employed, classifying the system state into contexts
controlled by dedicated ML agents, enabling them to operate efficiently. These
two functionalities are intricately linked within a closed loop overseen by the
End-to-End (E2E) orchestrator to allocate resources. The paper introduces the
components of ACNC, proposes a Metaverse scenario to exemplify ACNC’s role in
resource provisioning with Segment Routing v6 (SRv6), outlines ACNC’s workflow,
details a numerical analysis for efficiency assessment, and concludes with
discussions on relevant challenges and potential avenues for future research.
[LINK]
http://arxiv.org/abs/2403.07573v3
[DATE]
2024-12-18 13:54:46+08:00
[CATEGORIES]
cs.LG
PRAGA: Prototype-aware Graph Adaptive Aggregation for Spatial Multi-modal Omics Analysis
[AUTHORS]
Xinlei Huang, Zhiqi Ma, Dian Meng, Yanran Liu, Shiwei Ruan, Qingqiang Sun, Xubin Zheng, Ziyue Qiao
[ABSTRACT]
Spatial multi-modal omics technology, highlighted by Nature Methods as an
advanced biological technique in 2023, plays a critical role in resolving
biological regulatory processes with spatial context. Recently, graph neural
networks based on K-nearest neighbor (KNN) graphs have gained prominence in
spatial multi-modal omics methods due to their ability to model semantic
relations between sequencing spots. However, the fixed KNN graph fails to
capture the latent semantic relations hidden by the inevitable data
perturbations during the biological sequencing process, resulting in the loss
of semantic information. In addition, the common lack of spot annotation and
class number priors in practice further hinders the optimization of spatial
multi-modal omics models. Here, we propose a novel spatial multi-modal omics
resolved framework, termed PRototype-Aware Graph Adaptive Aggregation for
Spatial Multi-modal Omics Analysis (PRAGA). PRAGA constructs a dynamic graph to
capture latent semantic relations and comprehensively integrate spatial
information and feature semantics. The learnable graph structure can also
denoise perturbations by learning cross-modal knowledge. Moreover, a dynamic
prototype contrastive learning is proposed based on the dynamic adaptability of
Bayesian Gaussian Mixture Models to optimize the multi-modal omics
representations for unknown biological priors. Quantitative and qualitative
experiments on simulated and real datasets with 7 competing methods demonstrate
the superior performance of PRAGA. Code is available at
https://github.com/Xubin-s-Lab/PRAGA.
[COMMENTS]
Accepted by AAAI 2025; full version including appendix
[LINK]
http://arxiv.org/abs/2409.12728v5
[DATE]
2024-12-18 13:54:22+08:00
[CATEGORIES]
cs.LG
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking
[AUTHORS]
Cassidy Laidlaw, Shivam Singhal, Anca Dragan
[ABSTRACT]
Because it is difficult to precisely specify complex objectives,
reinforcement learning policies are often optimized using proxy reward
functions that only approximate the true goal. However, optimizing proxy
rewards frequently leads to reward hacking: the optimized reward function
ceases to be a good proxy and the resulting policy performs poorly with respect
to the unspecified true reward. Principled solutions to reward hacking have
been impeded by the lack of a good definition for the problem. To address this
gap, we introduce a definition of reward hacking based on the correlation
between proxy and true rewards for states and actions seen by a “base policy”
that breaks down under optimization. We show that this definition captures
reward hacking behavior across several realistic settings, including in
reinforcement learning from human feedback (RLHF). Using our formulation, we
show theoretically that regularization to the base policy can effectively
prevent reward hacking. While the current practice in RLHF applies a KL penalty
between action distributions for this purpose, our theory suggests regularizing
the $\chi^2$ divergence between the policies’ occupancy measures can be more
effective. We intuitively show the benefits of this type of regularization and
demonstrate that it better mitigates reward hacking in practice across four
realistic settings, including RLHF. Our code is available at
https://github.com/cassidylaidlaw/orpo.
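The two regularizers contrasted above differ in the divergence they penalize. As a small, hedged illustration, the sketch below computes the standard KL and chi-squared divergences for a pair of discrete distributions; the two-element "occupancy measures" are made-up numbers, and the code is not taken from the linked repository.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi2_divergence(p, q):
    """Chi-squared divergence: sum_i (p_i - q_i)^2 / q_i."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


# Hypothetical occupancy measures over two state-action pairs.
learned = [0.5, 0.5]
base = [0.25, 0.75]
kl = kl_divergence(learned, base)
chi2 = chi2_divergence(learned, base)
```

The chi-squared term divides by the base distribution's mass directly, so it grows quickly when the learned policy puts weight on state-action pairs the base policy rarely visits.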
[LINK]
http://arxiv.org/abs/2403.03185v3
[DATE]
2024-12-18 13:51:35+08:00
[CATEGORIES]
cs.LG
Open-Source Protein Language Models for Function Prediction and Protein Design
[AUTHORS]
Shivasankaran Vanaja Pandi, Bharath Ramsundar
[ABSTRACT]
Protein language models (PLMs) have shown promise in improving the
understanding of protein sequences, contributing to advances in areas such as
function prediction and protein engineering. However, training these models
from scratch requires significant computational resources, limiting their
accessibility. To address this, we integrate a PLM into DeepChem, an
open-source framework for computational biology and chemistry, to provide a
more accessible platform for protein-related tasks.
We evaluate the performance of the integrated model on various protein
prediction tasks, showing that it achieves reasonable results across
benchmarks. Additionally, we present an exploration of generating
plastic-degrading enzyme candidates using the model’s embeddings and latent
space manipulation techniques. While the results suggest that further
refinement is needed, this approach provides a foundation for future work in
enzyme design. This study aims to facilitate the use of PLMs in research fields
like synthetic biology and environmental sustainability, even for those with
limited computational resources.
[COMMENTS]
To be published in 4th Annual AAAI workshop on AI to Accelerate
Science and Engineering
[LINK]
http://arxiv.org/abs/2412.13519v1
[DATE]
2024-12-18 13:41:15+08:00
[CATEGORIES]
cs.LG
Learning Causal Transition Matrix for Instance-dependent Label Noise
[AUTHORS]
Jiahui Li, Tai-Wei Chang, Kun Kuang, Ximing Li, Long Chen, Jun Zhou
[ABSTRACT]
Noisy labels are both inevitable and problematic in machine learning methods,
as they negatively impact models’ generalization ability by causing
overfitting. In the context of learning with noise, the transition matrix plays
a crucial role in the design of statistically consistent algorithms. However,
the transition matrix is often considered unidentifiable. One strand of methods
typically addresses this problem by assuming that the transition matrix is
instance-independent; that is, the probability of mislabeling a particular
instance is not influenced by its characteristics or attributes. This
assumption is clearly invalid in complex real-world scenarios. To better
understand the transition relationship and relax this assumption, we propose to
study the data generation process of noisy labels from a causal perspective. We
discover that an unobservable latent variable can affect either the instance
itself, the label annotation procedure, or both, which complicates the
identification of the transition matrix. To address various scenarios, we have
unified these observations within a new causal graph. In this graph, the input
instance is divided into a noise-resistant component and a noise-sensitive
component based on whether they are affected by the latent variable. These two
components contribute to identifying the “causal transition matrix”, which
approximates the true transition matrix with theoretical guarantee. In line
with this, we have designed a novel training framework that explicitly models
this causal relationship and, as a result, achieves a more accurate model for
inferring the clean label.
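For readers unfamiliar with the transition matrix, the following sketch shows its basic role in the simpler instance-independent setting: it maps clean class probabilities to the probabilities of the observed noisy labels. The function name and the matrix values are illustrative assumptions, and the paper's causal, instance-dependent construction is not reproduced here.

```python
def noisy_posterior(clean_posterior, transition):
    """Map clean class probabilities to noisy-label probabilities.

    transition[i][j] is the probability that true class i is annotated as j.
    """
    n_classes = len(transition)
    return [
        sum(clean_posterior[i] * transition[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]


# Hypothetical two-class transition matrix; each row sums to 1.
T = [[0.8, 0.2],
     [0.3, 0.7]]
noisy = noisy_posterior([0.5, 0.5], T)
```

In the instance-dependent case studied in the abstract, the matrix would vary with the input instance instead of being a single fixed table.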
[LINK]
http://arxiv.org/abs/2412.13516v1
[DATE]
2024-12-18 13:33:16+08:00
[CATEGORIES]
cs.LG
Contextual Decision-Making with Knapsacks Beyond the Worst Case
[AUTHORS]
Zhaohua Chen, Rui Ai, Mingwei Yang, Yuqi Pan, Chang Wang, Xiaotie Deng
[ABSTRACT]
We study the framework of a dynamic decision-making scenario with resource
constraints. In this framework, an agent, whose target is to maximize the total
reward under the initial inventory, selects an action in each round upon
observing a random request, leading to a reward and resource consumptions that
are further associated with an unknown random external factor. While previous
research has already established an $\widetilde{O}(\sqrt{T})$ worst-case regret
for this problem, this work offers two results that go beyond the worst-case
perspective: one for the worst-case gap between benchmarks and another for
logarithmic regret rates. We first show that an $\Omega(\sqrt{T})$ distance
between the commonly used fluid benchmark and the online optimum is unavoidable
when the former has a degenerate optimal solution. On the algorithmic side, we
merge the re-solving heuristic with distribution estimation skills and propose
an algorithm that achieves an $\widetilde{O}(1)$ regret as long as the fluid LP
has a unique and non-degenerate solution. Furthermore, we prove that our
algorithm maintains a near-optimal $\widetilde{O}(\sqrt{T})$ regret even in the
worst cases and extend these results to the setting where the request and
external factor are continuous. Regarding information structure, our regret
results are obtained under two feedback models, respectively, where the
algorithm accesses the external factor at the end of each round and at the end
of a round only when a non-null action is executed.
[COMMENTS]
39 pages. A preliminary version appeared at NeurIPS 2024
[LINK]
http://arxiv.org/abs/2211.13952v2
[DATE]
2024-12-18 13:29:19+08:00
[CATEGORIES]
cs.LG
Climate-Driven Doubling of U.S. Maize Loss Probability: Interactive Simulation with Neural Network Monte Carlo
[AUTHORS]
A Samuel Pottinger, Lawson Connor, Brookie Guzder-Williams, Maya Weltman-Fahs, Nick Gondek, Timothy Bowles
[ABSTRACT]
Climate change not only threatens agricultural producers but also strains
related public agencies and financial institutions. These important food system
actors include government entities tasked with insuring grower livelihoods and
supporting response to continued global warming. We examine future risk within
the U.S. Corn Belt geographic region for one such crucial institution: the U.S.
Federal Crop Insurance Program. Specifically, we predict the impacts of
climate-driven crop loss at a policy-salient “risk unit” scale. Built through
our presented neural network Monte Carlo method, simulations anticipate both
more frequent and more severe losses that would result in a costly doubling of
the annual probability of maize Yield Protection insurance claims at
mid-century. We also provide an open source pipeline and interactive
visualization tools to explore these results with configurable statistical
treatments. Altogether, we fill an important gap in current understanding for
climate adaptation by bridging existing historic yield estimation and climate
projection to predict crop loss metrics at policy-relevant granularity.
[COMMENTS]
20 page main document followed by supplemental
[LINK]
http://arxiv.org/abs/2408.02217v2
[DATE]
2024-12-18 13:25:39+08:00
[CATEGORIES]
cs.LG
One Node One Model: Featuring the Missing-Half for Graph Clustering
[AUTHORS]
Xuanting Xie, Bingheng Li, Erlin Pan, Zhaochen Guo, Zhao Kang, Wenyu Chen
[ABSTRACT]
Most existing graph clustering methods primarily focus on exploiting
topological structure, often neglecting the “missing-half” node feature
information, especially how these features can enhance clustering performance.
This issue is further compounded by the challenges associated with
high-dimensional features. Feature selection in graph clustering is
particularly difficult because it requires simultaneously discovering clusters
and identifying the relevant features for these clusters. To address this gap,
we introduce a novel paradigm called “one node one model”, which builds an
exclusive model for each node and defines the node label as a combination of
predictions for node groups. Specifically, the proposed “Feature Personalized
Graph Clustering (FPGC)” method identifies cluster-relevant features for each
node using a squeeze-and-excitation block, integrating these features into each
model to form the final representations. Additionally, the concept of feature
cross is developed as a data augmentation technique to learn low-order feature
interactions. Extensive experimental results demonstrate that FPGC outperforms
state-of-the-art clustering methods. Moreover, the plug-and-play nature of our
method provides a versatile solution to enhance GNN-based models from a feature
perspective.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.09902v2
[DATE]
2024-12-18 12:51:33+08:00
[CATEGORIES]
cs.LG
Federated t-SNE and UMAP for Distributed Data Visualization
[AUTHORS]
Dong Qiao, Xinxian Ma, Jicong Fan
[ABSTRACT]
High-dimensional data visualization is crucial in the big data era, and
techniques such as t-SNE and UMAP have been widely used in science and
engineering. Big data, however, is often distributed across multiple data
centers and subject to security and privacy concerns, which leads to
difficulties for the standard algorithms of t-SNE and UMAP. To tackle the
challenge, this work proposes Fed-tSNE and Fed-UMAP, which provide
high-dimensional data visualization under the framework of federated learning,
without exchanging data across clients or sending data to the central server.
The main idea of Fed-tSNE and Fed-UMAP is implicitly learning the distribution
information of data in a manner of federated learning and then estimating the
global distance matrix for t-SNE and UMAP. To further enhance the protection of
data privacy, we propose Fed-tSNE+ and Fed-UMAP+. We also extend our idea to
federated spectral clustering, yielding algorithms of clustering distributed
data. In addition to these new algorithms, we offer theoretical guarantees of
optimization convergence, distance and similarity estimation, and differential
privacy. Experiments on multiple datasets demonstrate that, compared to the
original algorithms, the accuracy drops of our federated algorithms are tiny.
[COMMENTS]
The paper was accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13495v1
[DATE]
2024-12-18 12:33:11+08:00
[CATEGORIES]
cs.LG
Efficient Language-instructed Skill Acquisition via Reward-Policy Co-Evolution
[AUTHORS]
Changxin Huang, Yanbin Chang, Junfan Lin, Junyang Liang, Runhao Zeng, Jianqiang Li
[ABSTRACT]
The ability to autonomously explore and resolve tasks with minimal human
guidance is crucial for the self-development of embodied intelligence. Although
reinforcement learning methods can largely ease human effort, it’s challenging
to design reward functions for real-world tasks, especially for
high-dimensional robotic control, due to complex relationships among joints and
tasks. Recent advancements in large language models (LLMs) enable automatic reward
function design. However, existing approaches evaluate reward functions by re-training
policies from scratch, placing an undue burden on the reward function, expecting
it to be effective throughout the whole policy improvement process. We argue
for a more practical strategy in robotic autonomy, focusing on refining
existing policies with policy-dependent reward functions rather than a
universal one. To this end, we propose a novel reward-policy co-evolution
framework where the reward function and the learned policy benefit from each
other’s progressive on-the-fly improvements, resulting in more efficient and
higher-performing skill acquisition. Specifically, the reward evolution process
translates the robot’s previous best reward function, descriptions of tasks and
environment into text inputs. These inputs are used to query LLMs to generate a
dynamic amount of reward function candidates, ensuring continuous improvement
at each round of evolution. For policy evolution, our method generates new
policy populations by hybridizing historically optimal and random policies.
Through an improved Bayesian optimization, our approach efficiently and
robustly identifies the most capable and plastic reward-policy combination,
which then proceeds to the next round of co-evolution. Despite using less data,
our approach demonstrates an average normalized improvement of 95.3% across
various high-dimensional robotic skill learning tasks.
[COMMENTS]
14 pages, 5 figures, published to AAAI2025
[LINK]
http://arxiv.org/abs/2412.13492v1
[DATE]
2024-12-18 12:20:33+08:00
[CATEGORIES]
cs.LG
Clustering by Mining Density Distributions and Splitting Manifold Structure
[AUTHORS]
Zhichang Xu, Zhiguo Long, Hua Meng
[ABSTRACT]
Spectral clustering requires the time-consuming decomposition of the
Laplacian matrix of the similarity graph, thus limiting its applicability to
large datasets. To improve the efficiency of spectral clustering, a top-down
approach was recently proposed, which first divides the data into several
micro-clusters (granular-balls), then splits these micro-clusters when they are
not “compact”, and finally uses these micro-clusters as nodes to construct a
similarity graph for more efficient spectral clustering. However, this top-down
approach is challenging to adapt to unevenly distributed or structurally
complex data. This is because constructing micro-clusters as a rough ball
struggles to capture the shape and structure of data in a local range, and the
simplistic splitting rule that solely targets “compactness” is susceptible to
noise and variations in data density and leads to micro-clusters with varying
shapes, making it challenging to accurately measure the similarity between
them. To resolve these issues and improve spectral clustering, this paper first
proposes to start from local structures to obtain micro-clusters, such that the
complex structural information inside local neighborhoods is well captured by
them. Moreover, by noting that Euclidean distance is more suitable for convex
sets, this paper further proposes a data splitting rule that couples local
density and data manifold structures, so that the similarities of the obtained
micro-clusters can be easily characterized. A novel similarity measure between
micro-clusters is then proposed for the final spectral clustering. A series of
experiments based on synthetic and real-world datasets demonstrate that the
proposed method has better adaptability to structurally complex data than
granular-ball based methods.
[LINK]
http://arxiv.org/abs/2408.10493v2
[DATE]
2024-12-18 11:44:25+08:00
[CATEGORIES]
cs.LG
Efficient Fine-Tuning of Single-Cell Foundation Models Enables Zero-Shot Molecular Perturbation Prediction
[AUTHORS]
Sepideh Maleki, Jan-Christian Huetter, Kangway V. Chuang, Gabriele Scalia, Tommaso Biancalani
[ABSTRACT]
Predicting transcriptional responses to novel drugs provides a unique
opportunity to accelerate biomedical research and advance drug discovery
efforts. However, the inherent complexity and high dimensionality of cellular
responses, combined with the extremely limited available experimental data,
make the task challenging. In this study, we leverage single-cell foundation
models (FMs) pre-trained on tens of millions of single cells, encompassing
multiple cell types, states, and disease annotations, to address molecular
perturbation prediction. We introduce a drug-conditional adapter that allows
efficient fine-tuning by training less than 1% of the original foundation
model, thus enabling molecular conditioning while preserving the rich
biological representation learned during pre-training. The proposed strategy
allows not only the prediction of cellular responses to novel drugs, but also
the zero-shot generalization to unseen cell lines. We establish a robust
evaluation framework to assess model performance across different
generalization tasks, demonstrating state-of-the-art results across all
settings, with significant improvements in the few-shot and zero-shot
generalization to new cell lines compared to existing baselines.
[LINK]
http://arxiv.org/abs/2412.13478v1
[DATE]
2024-12-18 11:42:20+08:00
[CATEGORIES]
cs.LG
Generating Unseen Nonlinear Evolution in Sea Surface Temperature Using a Deep Learning-Based Latent Space Data Assimilation Framework
[AUTHORS]
Qingyu Zheng, Guijun Han, Wei Li, Lige Cao, Gongfu Zhou, Haowen Wu, Qi Shao, Ru Wang, Xiaobo Wu, Xudong Cui, Hong Li, Xuan Wang
[ABSTRACT]
Advances in data assimilation (DA) methods have greatly improved the accuracy
of Earth system predictions. To fuse multi-source data and reconstruct the
nonlinear evolution missing from observations, geoscientists are developing
future-oriented DA methods. In this paper, we redesign a purely data-driven
latent space DA framework (DeepDA) that employs a generative artificial
intelligence model to capture the nonlinear evolution in sea surface
temperature. Under variational constraints, DeepDA embedded with nonlinear
features can effectively fuse heterogeneous data. The results show that DeepDA
remains highly stable in capturing and generating nonlinear evolutions even
when a large amount of observational information is missing. When only 10% of
the observation information is available, the error increase of DeepDA does not
exceed 40%. Furthermore, DeepDA has been shown to
be robust in the fusion of real observations and ensemble simulations. In
particular, this paper provides a mechanism analysis of the nonlinear evolution
generated by DeepDA from the perspective of physical patterns, which reveals
the inherent explainability of our DL model in capturing multi-scale ocean
signals.
[COMMENTS]
31 pages, 14 figures
[LINK]
http://arxiv.org/abs/2412.13477v1
[DATE]
2024-12-18 11:41:34+08:00
[CATEGORIES]
cs.LG
Deep encoder-decoder hierarchical convolutional neural networks for conjugate heat transfer surrogate modeling
[AUTHORS]
Takiah Ebbs-Picken, David A. Romero, Carlos M. Da Silva, Cristina H. Amon
[ABSTRACT]
Conjugate heat transfer (CHT) analyses are vital for the design of many
energy systems. However, high-fidelity CHT numerical simulations are
computationally intensive, which limits their applications such as design
optimization, where hundreds to thousands of evaluations are required. In this
work, we develop a modular deep encoder-decoder hierarchical (DeepEDH)
convolutional neural network, a novel deep-learning-based surrogate modeling
methodology for computationally intensive CHT analyses. Leveraging convective
temperature dependencies, we propose a two-stage temperature prediction
architecture that couples velocity and temperature fields. The proposed DeepEDH
methodology is demonstrated by modeling the pressure, velocity, and temperature
fields for a liquid-cooled cold-plate-based battery thermal management system
with variable channel geometry. A computational mesh and CHT formulation of the
cold plate is created and solved using the finite element method (FEM),
generating a dataset of 1,500 simulations. Our performance analysis covers the
impact of the novel architecture, separate DeepEDH models for each field,
output geometry masks, multi-stage temperature field predictions, and
optimizations of the hyperparameters and architecture. Furthermore, we quantify
the influence of the CHT analysis’ thermal boundary conditions on surrogate
model performance, highlighting improved temperature model performance with
higher heat fluxes. Compared to other deep learning neural network surrogate
models, such as U-Net and DenseED, the proposed DeepEDH architecture for CHT
analyses exhibits up to a 65% enhancement in the coefficient of determination
$R^{2}$. (*Due to the arXiv requirement that “The Abstract field cannot be
longer than 1,920 characters”, the abstract shown here is shortened. For the
full abstract, please download the article.)
[COMMENTS]
Revised version published in Applied Energy
(https://doi.org/10.1016/j.apenergy.2024.123723)
[LINK]
http://arxiv.org/abs/2311.17068v2
[DATE]
2024-12-18 11:28:16+08:00
[CATEGORIES]
cs.LG
Exploring Context Generalizability in Citywide Crowd Mobility Prediction: An Analytic Framework and Benchmark
[AUTHORS]
Liyue Chen, Xiaoxiang Wang, Leye Wang
[ABSTRACT]
Contextual features are important data sources for building citywide crowd
mobility prediction models. However, the difficulty of applying context lies in
the unknown generalizability of contextual features (e.g., weather, holiday,
and points of interest) and context modeling techniques across different
scenarios. In this paper, we present a unified analytic framework and a
large-scale benchmark for evaluating context generalizability. The benchmark
includes crowd mobility data, contextual data, and advanced prediction models.
We conduct comprehensive experiments in several crowd mobility prediction tasks
such as bike flow, metro passenger flow, and electric vehicle charging demand.
Our results reveal several important observations: (1) Using more contextual
features may not always result in better prediction with existing context
modeling techniques; in particular, the combination of holiday and temporal
position can provide more generalizable beneficial information than other
contextual feature combinations. (2) In context modeling techniques, using a
gated unit to incorporate raw contextual features into the deep prediction
model has good generalizability. Besides, we offer several suggestions about
incorporating contextual factors for building crowd mobility prediction
applications. From our findings, we call for future research efforts devoted to
developing new context modeling solutions.
[LINK]
http://arxiv.org/abs/2106.16046v5
[DATE]
2024-12-18 11:12:43+08:00
[CATEGORIES]
cs.LG
Let Curves Speak: A Continuous Glucose Monitor based Large Sensor Foundation Model for Diabetes Management
[AUTHORS]
Junjie Luo, Abhimanyu Kumbara, Mansur Shomali, Rui Han, Anand Iyer, Ritu Agarwal, Gordon Gao
[ABSTRACT]
While previous studies of AI in diabetes management focus on long-term risk,
research on near-future glucose prediction remains limited but important as it
enables timely diabetes self-management. Integrating AI with continuous glucose
monitoring (CGM) holds promise for near-future glucose prediction. However,
existing models have limitations in capturing patterns of blood glucose
fluctuations and demonstrate poor generalizability. A robust approach is needed
to leverage massive CGM data for near-future glucose prediction. We propose
large sensor models (LSMs) to capture knowledge in CGM data by modeling
patients as sequences of glucose. CGM-LSM is pretrained on 15.96 million
glucose records from 592 diabetes patients for near-future glucose prediction.
We evaluated CGM-LSM against state-of-the-art methods using the OhioT1DM
dataset across various metrics, prediction horizons, and unseen patients.
Additionally, we assessed its generalizability across factors like diabetes
type, age, gender, and hour of day. CGM-LSM achieved exceptional performance,
with an rMSE of 29.81 mg/dL for type 1 diabetes patients and 23.49 mg/dL for
type 2 diabetes patients in a two-hour prediction horizon. For the OhioT1DM
dataset, CGM-LSM achieved a one-hour rMSE of 15.64 mg/dL, halving the previous
best of 31.97 mg/dL. Robustness analyses revealed consistent performance not
only for unseen patients and future periods, but also across diabetes type,
age, and gender. The model demonstrated adaptability to different hours of day,
maintaining accuracy across periods of various activity intensity levels.
CGM-LSM represents a transformative step in diabetes management by leveraging
pretraining to uncover latent glucose generation patterns in sensor data. Our
findings also underscore the broader potential of LSMs to drive innovation
across domains involving complex sensor data.
[LINK]
http://arxiv.org/abs/2412.09727v2
[DATE]
2024-12-18 11:07:48+08:00
[CATEGORIES]
cs.LG
Communication-Efficient Personalized Federal Graph Learning via Low-Rank Decomposition
[AUTHORS]
Ruyue Liu, Rong Yin, Xiangzhen Bo, Xiaoshuai Hao, Xingrui Zhou, Yong Liu, Can Ma, Weiping Wang
[ABSTRACT]
Federated graph learning (FGL) has gained significant attention for enabling
heterogeneous clients to process their private graph data locally while
interacting with a centralized server, thus maintaining privacy. However, graph
data on clients are typically non-IID, posing a challenge for a single model to
perform well across all clients. Another major bottleneck of FGL is the high
cost of communication. To address these challenges, we propose a
communication-efficient personalized federated graph learning algorithm, CEFGL.
Our method decomposes the model parameters into low-rank generic and sparse
private models. We employ a dual-channel encoder to learn sparse local
knowledge in a personalized manner and low-rank global knowledge in a shared
manner. Additionally, we perform multiple local stochastic gradient descent
iterations between communication phases and integrate efficient compression
techniques into the algorithm. The advantage of CEFGL lies in its ability to
capture common and individual knowledge more precisely. By utilizing low-rank
and sparse parameters along with compression techniques, CEFGL significantly
reduces communication complexity. Extensive experiments demonstrate that our
method achieves optimal classification accuracy in a variety of heterogeneous
environments across sixteen datasets. Specifically, compared to the
state-of-the-art method FedStar, the proposed method (with GIN as the base
model) improves accuracy by 5.64% on the cross-dataset setting CHEM, reduces
communication bits by a factor of 18.58, and reduces the communication time by
a factor of 1.65.
[LINK]
http://arxiv.org/abs/2412.13442v1
[DATE]
2024-12-18 10:26:07+08:00
[CATEGORIES]
cs.LG
Rare Event Detection in Imbalanced Multi-Class Datasets Using an Optimal MIP-Based Ensemble Weighting Approach
[AUTHORS]
Georgios Tertytchny, Georgios L. Stavrinides, Maria K. Michael
[ABSTRACT]
To address the challenges of imbalanced multi-class datasets typically used
for rare event detection in critical cyber-physical systems, we propose an
optimal, efficient, and adaptable mixed integer programming (MIP) ensemble
weighting scheme. Our approach leverages the diverse capabilities of the
classifier ensemble on a granular per class basis, while optimizing the weights
of classifier-class pairs using elastic net regularization for improved
robustness and generalization. Additionally, it seamlessly and optimally
selects a predefined number of classifiers from a given set. We evaluate and
compare our MIP-based method against six well-established weighting schemes,
using representative datasets and suitable metrics, under various ensemble
sizes. The experimental results reveal that MIP outperforms all existing
approaches, achieving an improvement in balanced accuracy ranging from 0.99% to
7.31%, with an overall average of 4.53% across all datasets and ensemble sizes.
Furthermore, it attains an overall average increase of 4.63%, 4.60%, and 4.61%
in macro-averaged precision, recall, and F1-score, respectively, while
maintaining computational efficiency.
[COMMENTS]
To be published in the Proceedings of the 39th AAAI Conference on
Artificial Intelligence (AAAI25)
[LINK]
http://arxiv.org/abs/2412.13439v1
[DATE]
2024-12-18 10:20:51+08:00
[CATEGORIES]
cs.LG
The Impact of Geometric Complexity on Neural Collapse in Transfer Learning
[AUTHORS]
Michael Munn, Benoit Dherin, Javier Gonzalvo
[COMMENTS]
Accepted as a NeurIPS 2024 paper
[LINK]
http://arxiv.org/abs/2405.15706v3
[DATE]
2024-12-18 09:53:47+08:00
[CATEGORIES]
cs.LG
Exploring Transformer-Augmented LSTM for Temporal and Spatial Feature Learning in Trajectory Prediction
[AUTHORS]
Chandra Raskoti, Weizi Li
[ABSTRACT]
Accurate vehicle trajectory prediction is crucial for ensuring safe and
efficient autonomous driving. This work explores integrating a
Transformer-based model with Long Short-Term Memory (LSTM) based techniques to
enhance spatial and temporal feature learning in vehicle trajectory prediction.
We propose a hybrid model that combines LSTMs for temporal encoding with a
Transformer encoder for capturing complex interactions between vehicles.
Spatial trajectory features of the neighboring vehicles are processed and
passed through a masked scatter mechanism in a grid-based environment, and the
result is then combined with the temporal trajectories of the vehicles. The
combined trajectory data are learned by sequential LSTM encoding and
Transformer-based attention layers. The proposed model is benchmarked against
predecessor LSTM-based methods, including STA-LSTM, SA-LSTM, CS-LSTM, and
NaiveLSTM. Our results, while not outperforming these predecessors, demonstrate
the potential of integrating Transformers with LSTM-based techniques to build
interpretable trajectory prediction models. Future work will explore
alternative architectures using Transformer applications to further enhance
performance. This study provides a promising direction for improving trajectory
prediction models by leveraging Transformer-based architectures, paving the way
for more robust and interpretable vehicle trajectory prediction systems.
[LINK]
http://arxiv.org/abs/2412.13419v1
[DATE]
2024-12-18 09:31:08+08:00
[CATEGORIES]
cs.LG
Combining Priors with Experience: Confidence Calibration Based on Binomial Process Modeling
[AUTHORS]
Jinzong Dong, Zhaohui Jiang, Dong Pan, Haoyang Yu
[ABSTRACT]
Confidence calibration of classification models is a technique to estimate
the true posterior probability of the predicted class, which is critical for
ensuring reliable decision-making in practical applications. Existing
confidence calibration methods mostly use statistical techniques to estimate
the calibration curve from data or fit a user-defined calibration function, but
often overlook fully mining and utilizing the prior distribution behind the
calibration curve. However, a well-informed prior distribution can provide
valuable insights beyond the empirical data when data are limited or in
low-density regions of confidence scores. To fill this gap, this paper proposes
a new method that integrates the prior distribution behind the calibration
curve with empirical data to estimate a continuous calibration curve, which is
realized by modeling the sampling process of calibration data as a binomial
process and maximizing the likelihood function of the binomial process. We
prove that the calibration curve estimating method is Lipschitz continuous with
respect to data distribution and requires a sample size of $3/B$ of that
required for histogram binning, where $B$ represents the number of bins. Also,
a new calibration metric ($TCE_{bpm}$), which leverages the estimated
calibration curve to estimate the true calibration error (TCE), is designed.
$TCE_{bpm}$ is proven to be a consistent calibration measure. Furthermore,
realistic calibration datasets can be generated by the binomial process
modeling from a preset true calibration curve and confidence score
distribution, which can serve as a benchmark to measure and compare the
discrepancy between existing calibration metrics and the true calibration
error. The effectiveness of our calibration method and metric is verified on
real-world and simulated data.
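As a point of reference for the histogram-binning baseline mentioned above, the
following minimal Python sketch estimates a binned calibration curve and the
resulting expected calibration error with $B$ equal-width bins. It illustrates
the standard baseline only, not the binomial-process method proposed in the
paper; the function names are hypothetical.

```python
def histogram_binning(confidences, correct, num_bins):
    """Return per-bin (mean confidence, accuracy, count), or None for empty bins."""
    stats = [[0.0, 0, 0] for _ in range(num_bins)]  # [confidence sum, hits, count]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        stats[idx][0] += conf
        stats[idx][1] += hit
        stats[idx][2] += 1
    return [(s / n, h / n, n) if n else None for s, h, n in stats]


def expected_calibration_error(confidences, correct, num_bins):
    """Count-weighted average gap between accuracy and mean confidence per bin."""
    total = len(confidences)
    ece = 0.0
    for entry in histogram_binning(confidences, correct, num_bins):
        if entry is not None:
            mean_conf, accuracy, count = entry
            ece += count / total * abs(accuracy - mean_conf)
    return ece
```

With few samples per bin the per-bin accuracy is noisy, which is the
data-efficiency concern that the abstract's $3/B$ sample-size claim speaks to.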
[COMMENTS]
Accepted by AAAI-25
[LINK]
http://arxiv.org/abs/2412.10658v2
[DATE]
2024-12-18 09:20:05+08:00
[CATEGORIES]
cs.LG
Lightweight yet Fine-grained: A Graph Capsule Convolutional Network with Subspace Alignment for Shared-account Sequential Recommendation
[AUTHORS]
Jinyu Zhang, Zhongying Zhao, Chao Li, Yanwei Yu
[ABSTRACT]
Shared-account Sequential Recommendation (SSR) aims to provide personalized
recommendations for accounts shared by multiple users with varying sequential
preferences. Previous studies on SSR struggle to capture the fine-grained
associations between interactions and different latent users within the shared
account’s hybrid sequences. Moreover, most existing SSR methods (e.g.,
RNN-based or GCN-based methods) have quadratic computational complexities,
hindering the deployment of SSRs on resource-constrained devices. To this end,
we propose a Lightweight Graph Capsule Convolutional Network with subspace
alignment for shared-account sequential recommendation, named LightGC$^2$N.
Specifically, we devise a lightweight graph capsule convolutional network. It
facilitates the fine-grained matching between interactions and latent users by
attentively propagating messages on the capsule graphs. Besides, we present an
efficient subspace alignment method. This method refines the sequence
representations and then aligns them with the finely clustered preferences of
latent users. The experimental results on four real-world datasets indicate
that LightGC$^2$N outperforms nine state-of-the-art methods in accuracy and
efficiency.
[COMMENTS]
11 pages, 6 figures, accepted by AAAI-2025 conference
[LINK]
http://arxiv.org/abs/2412.13408v1
[DATE]
2024-12-18 08:56:16+08:00
[CATEGORIES]
cs.LG
Pattern Matching in AI Compilers and its Formalization (Extended Version)
[AUTHORS]
Joseph W. Cutler, Alex Collins, Bin Fan, Mahesh Ravishankar, Vinod Grover
[ABSTRACT]
PyPM is a Python-based domain specific language (DSL) for building
rewrite-based optimization passes on machine learning computation graphs. Users
define individual optimizations by writing (a) patterns that match subgraphs of
a computation graph and (b) corresponding rules which replace a matched
subgraph with an optimized kernel. PyPM is distinguished from the many other
DSLs for defining rewriting passes by its complex and novel pattern language
which borrows concepts from logic programming. PyPM patterns can be recursive,
nondeterministic, and can require checking domain-specific constraints such as
the shapes of tensors. The PyPM implementation is thus similarly complicated,
consisting of thousands of lines of C++ code. In this paper, we present our
work on building PyPM, as well as formalizing and distilling this
complexity to an understandable mathematical core. We have developed a formal
core calculus expressing the main operations of the PyPM pattern language. We
define both a declarative semantics - describing which patterns match which
terms - and an algorithmic semantics - an idealized version of the PyPM pattern
interpreter - and prove their equivalence. The development is fully mechanized
in the Coq proof assistant.
[COMMENTS]
To appear at CGO’25
[LINK]
http://arxiv.org/abs/2412.13398v1
[DATE]
2024-12-18 08:29:09+08:00
[CATEGORIES]
cs.LG
Navigating the Noise: Bringing Clarity to ML Parameterization Design with O(100) Ensembles
[AUTHORS]
Jerry Lin, Sungduk Yu, Liran Peng, Tom Beucler, Eliot Wong-Toi, Zeyuan Hu, Pierre Gentine, Margarita Geleta, Mike Pritchard
[ABSTRACT]
Machine-learning (ML) parameterizations of subgrid processes (here of
turbulence, convection, and radiation) may one day replace conventional
parameterizations by emulating high-resolution physics without the cost of
explicit simulation. However, uncertainty about the relationship between
offline and online performance (i.e., when integrated with a large-scale
general circulation model (GCM)) hinders their development. Much of this
uncertainty stems from limited sampling of the noisy, emergent effects of
upstream ML design decisions on downstream online hybrid simulation. Our work
rectifies the sampling issue via the construction of a semi-automated,
end-to-end pipeline for $\mathcal{O}(100)$ size ensembles of hybrid
simulations, revealing important nuances in how systematic reductions in
offline error manifest in changes to online error and online stability. For
example, removing dropout and switching from a Mean Squared Error (MSE) to a
Mean Absolute Error (MAE) loss both reduce offline error, but they have
opposite effects on online error and online stability. Other design decisions,
like incorporating memory, converting moisture input from specific humidity to
relative humidity, using batch normalization, and training on multiple climates
do not come with any such compromises. Finally, we show that ensemble sizes of
$\mathcal{O}(100)$ may be necessary to reliably detect causally relevant
differences online. By enabling rapid online experimentation at scale, we can
empirically settle debates regarding subgrid ML parameterization design that
would have otherwise remained unresolved in the noise.
[COMMENTS]
Main text: 21 pages, 5 figures. SI: 24 pages, 35 figures
[LINK]
http://arxiv.org/abs/2309.16177v3
[DATE]
2024-12-18 08:27:15+08:00
[CATEGORIES]
cs.LG
Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing
[AUTHORS]
Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar, Yang Song
[ABSTRACT]
Diffusion models have recently achieved success in solving Bayesian inverse
problems with learned data priors. Current methods build on top of the
diffusion sampling process, where each denoising step makes small modifications
to samples from the previous step. However, this process struggles to correct
errors from earlier sampling steps, leading to worse performance in complicated
nonlinear inverse problems, such as phase retrieval. To address this challenge,
we propose a new method called Decoupled Annealing Posterior Sampling (DAPS)
that relies on a novel noise annealing process. Specifically, we decouple
consecutive steps in a diffusion sampling trajectory, allowing them to vary
considerably from one another while ensuring their time-marginals anneal to the
true posterior as we reduce noise levels. This approach enables the exploration
of a larger solution space, improving the success rate for accurate
reconstructions. We demonstrate that DAPS significantly improves sample quality
and stability across multiple image restoration tasks, particularly in
complicated nonlinear inverse problems.
[LINK]
http://arxiv.org/abs/2407.01521v2
[DATE]
2024-12-18 08:26:39+08:00
[CATEGORIES]
cs.LG
MMHMR: Generative Masked Modeling for Hand Mesh Recovery
[AUTHORS]
Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Mayur Jagdishbhai Patel, Hongfei Xue, Ahmed Helmy, Srijan Das, Pu Wang
[ABSTRACT]
Reconstructing a 3D hand mesh from a single RGB image is challenging due to
complex articulations, self-occlusions, and depth ambiguities. Traditional
discriminative methods, which learn a deterministic mapping from a 2D image to
a single 3D mesh, often struggle with the inherent ambiguities in 2D-to-3D
mapping. To address this challenge, we propose MMHMR, a novel generative masked
model for hand mesh recovery that synthesizes plausible 3D hand meshes by
learning and sampling from the probabilistic distribution of the ambiguous
2D-to-3D mapping process. MMHMR consists of two key components: (1) a VQ-MANO,
which encodes 3D hand articulations as discrete pose tokens in a latent space,
and (2) a Context-Guided Masked Transformer that randomly masks out pose tokens
and learns their joint distribution, conditioned on corrupted token sequences,
image context, and 2D pose cues. This learned distribution facilitates
confidence-guided sampling during inference, producing mesh reconstructions
with low uncertainty and high precision. Extensive evaluations on benchmark and
real-world datasets demonstrate that MMHMR achieves state-of-the-art accuracy,
robustness, and realism in 3D hand mesh reconstruction. Project website:
https://m-usamasaleem.github.io/publication/MMHMR/mmhmr.html
[LINK]
http://arxiv.org/abs/2412.13393v1
[DATE]
2024-12-18 08:10:00+08:00
[CATEGORIES]
cs.LG
GLACIAL: Granger and Learning-based Causality Analysis for Longitudinal Imaging Studies
[AUTHORS]
Minh Nguyen, Gia H. Ngo, Mert R. Sabuncu
[ABSTRACT]
The Granger framework is useful for discovering causal relations in
time-varying signals. However, most Granger causality (GC) methods are
developed for densely sampled timeseries data. A substantially different
setting, particularly common in medical imaging, is the longitudinal study
design, where multiple subjects are followed and sparsely observed over time.
Longitudinal studies commonly track several biomarkers, which are likely
governed by nonlinear dynamics that might have subject-specific idiosyncrasies
and exhibit both direct and indirect causes. Furthermore, real-world
longitudinal data often suffer from widespread missingness. GC methods are not
well-suited to handle these issues. In this paper, we propose an approach named
GLACIAL (Granger and LeArning-based CausalIty Analysis for Longitudinal
studies) to fill this methodological gap by marrying GC with a multi-task
neural forecasting model. GLACIAL treats subjects as independent samples and
uses the model’s average prediction accuracy on hold-out subjects to probe
causal links. Input dropout and model interpolation are used to efficiently
learn nonlinear dynamic relationships between a large number of variables and
to handle missing values respectively. Extensive simulations and experiments on
a real longitudinal medical imaging dataset show GLACIAL beating competitive
baselines and confirm its utility. Our code is available at
https://github.com/mnhng/GLACIAL.
[COMMENTS]
Accepted for publication at the Journal of Machine Learning for
Biomedical Imaging (MELBA) https://melba-journal.org/2024:028
[LINK]
http://arxiv.org/abs/2210.07416v2
[DATE]
2024-12-18 08:07:45+08:00
[CATEGORIES]
cs.LG
Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion
[AUTHORS]
Massimiliano Viola, Kevin Qu, Nando Metzger, Bingxin Ke, Alexander Becker, Konrad Schindler, Anton Obukhov
[ABSTRACT]
Depth completion upgrades sparse depth measurements into dense depth maps
guided by a conventional image. Existing methods for this highly ill-posed task
operate in tightly constrained settings and tend to struggle when applied to
images outside the training domain or when the available depth measurements are
sparse, irregularly distributed, or of varying density. Inspired by recent
advances in monocular depth estimation, we reframe depth completion as an
image-conditional depth map generation guided by sparse measurements. Our
method, Marigold-DC, builds on a pretrained latent diffusion model for
monocular depth estimation and injects the depth observations as test-time
guidance via an optimization scheme that runs in tandem with the iterative
inference of denoising diffusion. The method exhibits excellent zero-shot
generalization across a diverse range of environments and handles even
extremely sparse guidance effectively. Our results suggest that contemporary
monocular depth priors greatly robustify depth completion: it may be better to
view the task as recovering dense depth from (dense) image pixels, guided by
sparse depth; rather than as inpainting (sparse) depth, guided by an image.
Project website: https://MarigoldDepthCompletion.github.io/
[LINK]
http://arxiv.org/abs/2412.13389v1
[DATE]
2024-12-18 08:06:41+08:00
[CATEGORIES]
cs.LG
GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning
[AUTHORS]
Abdessalam Ed-dib, Zhanibek Datbayev, Amine Mohamed Aboussalah
[ABSTRACT]
Fine-tuning large language models (LLMs) is computationally intensive because
it requires updating all parameters. Low-Rank Adaptation (LoRA) improves
efficiency by modifying only a subset of weights but introduces a trade-off
between expressivity and computational cost: lower ranks reduce resources but
limit expressiveness, while higher ranks enhance expressivity at increased
cost. Despite recent advances in adaptive LoRA techniques, existing methods
fail to provide a theoretical basis for optimizing the trade-off between model
performance and efficiency. We propose Geometric Low-Rank Adaptation (GeLoRA),
a novel framework that computes the intrinsic dimensionality of hidden state
representations to adaptively select LoRA ranks. We demonstrate that the
intrinsic dimension provides a lower bound for the optimal rank of LoRA
matrices, allowing for a principled selection that balances efficiency and
expressivity. GeLoRA dynamically adjusts the rank for each layer based on the
intrinsic dimensionality of its input and output representations, recognizing
that not all model parameters equally impact fine-tuning. Empirical validation
on multiple tasks shows that GeLoRA consistently outperforms recent baselines
within the same parameter budget.
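To make the rank trade-off concrete, here is a minimal Python sketch of the
standard LoRA update, in which a frozen weight matrix $W$ is augmented by a
low-rank product $BA$. It is not GeLoRA's rank-selection procedure; the helper
names and the `scale` argument are illustrative assumptions.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


def matmul(a, b):
    """Plain nested-list matrix product, enough for a small illustration."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A); W itself stays frozen during fine-tuning."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]
```

For a 768 x 768 layer, rank 8 trains 12,288 parameters instead of 589,824,
which is why the rank chosen for each layer dominates the parameter budget.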
[LINK]
http://arxiv.org/abs/2412.09250v3
[DATE]
2024-12-18 07:41:39+08:00
[CATEGORIES]
cs.LG
Metric Entropy-Free Sample Complexity Bounds for Sample Average Approximation in Convex Stochastic Programming
[AUTHORS]
Hongcheng Liu, Jindong Tong
[ABSTRACT]
This paper studies sample average approximation (SAA) in solving convex or
strongly convex stochastic programming (SP) problems. Under some common
regularity conditions, we show – perhaps for the first time – that SAA’s
sample complexity can be completely free from any quantification of metric
entropy (such as the logarithm of the covering number), leading to a
significantly more efficient rate with dimensionality $d$ than most existing
results. From the newly established complexity bounds, an important revelation
is that SAA and the canonical stochastic mirror descent (SMD) method, two
mainstream solution approaches to SP, entail almost identical rates of sample
efficiency, lifting a theoretical discrepancy of SAA from SMD by the order of
$O(d)$. Furthermore, this paper explores non-Lipschitzian scenarios where SAA
maintains provable efficacy but the corresponding results for SMD remain mostly
unexplored, indicating the potential of SAA’s better applicability in some
irregular settings.
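
As a minimal illustration of what SAA means (not the paper's analysis), the sketch below replaces an expectation with a sample average and minimizes the result. The toy objective E[(x - xi)^2], the Gaussian noise model, and all function names are assumptions chosen for illustration; for this strongly convex loss the SAA solution is simply the sample mean.

```python
import random


def saa_objective(x, samples):
    """Sample-average approximation of E[(x - xi)^2]."""
    return sum((x - s) ** 2 for s in samples) / len(samples)


def saa_minimizer(samples):
    """Closed-form minimizer of the sample-average quadratic: the sample mean."""
    return sum(samples) / len(samples)


# Hypothetical toy problem: xi ~ N(2, 1), so the true minimizer is x* = 2.
rng = random.Random(0)
true_mean = 2.0
samples = [rng.gauss(true_mean, 1.0) for _ in range(10_000)]

x_hat = saa_minimizer(samples)
print(f"SAA solution: {x_hat:.4f} (true minimizer: {true_mean})")
```

The gap between `x_hat` and the true minimizer shrinks as the sample size grows; sample complexity bounds of the kind discussed in the abstract quantify how many samples are needed for a given accuracy.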
[LINK]
http://arxiv.org/abs/2401.00664v5
[DATE]
2024-12-18 06:43:30+08:00
[CATEGORIES]
cs.LG
Visual Deformation Detection Using Soft Material Simulation for Pre-training of Condition Assessment Models
[AUTHORS]
Joel Sol, Amir M. Soufi Enayati, Homayoun Najjaran
[ABSTRACT]
This paper addresses the challenge of geometric quality assurance in
manufacturing, particularly when human assessment is required. It proposes
using Blender, an open-source simulation tool, to create synthetic datasets for
machine learning (ML) models. The process involves translating expert
information into shape key parameters to simulate deformations, generating
images for both deformed and non-deformed objects. The study explores the
impact of discrepancies between real and simulated environments on ML model
performance and investigates the effect of different simulation backgrounds on
model sensitivity. Additionally, the study aims to enhance the model’s
robustness to camera positioning by generating datasets with a variety of
randomized viewpoints. The entire process, from data synthesis to model
training and testing, is implemented using a Python API interfacing with
Blender. An experiment with a soda can object validates the accuracy of the
proposed pipeline.
[COMMENTS]
6 pages, 4 figures, 2024 IEEE 20th International Conference on
Automation Science and Engineering (CASE)
[LINK]
http://arxiv.org/abs/2405.14877v2
[DATE]
2024-12-18 06:12:21+08:00
[CATEGORIES]
cs.LG
Wind Speed Forecasting Based on Data Decomposition and Deep Learning Models: A Case Study of a Wind Farm in Saudi Arabia
[AUTHORS]
Yasmeen Aldossary, Nabil Hewahi, Abdulla Alasaadi
[ABSTRACT]
With industrial and technological development and the increasing demand for
electric power, wind energy has gradually become the fastest-growing and most
environmentally friendly new energy source. Nevertheless, wind power generation
is always accompanied by uncertainty due to the wind speed’s volatility. Wind
speed forecasting (WSF) is essential for power grids’ dispatch, stability, and
controllability, and its accuracy is crucial to effectively using wind
resources. Therefore, this study proposes a novel WSF framework for stationary
data based on a hybrid decomposition method and the Bidirectional Long
Short-term Memory (BiLSTM) to achieve high forecasting accuracy for the Dumat
Al-Jandal wind farm in Al-Jouf, Saudi Arabia. The hybrid decomposition method
combines the Wavelet Packet Decomposition (WPD) and the Seasonal Adjustment
Method (SAM). The SAM method eliminates the seasonal component of the
decomposed subseries generated by WPD to reduce forecasting complexity. The
BiLSTM is applied to forecast all the deseasonalized decomposed subseries. Five
years of hourly wind speed observations acquired from a location in the Al-Jouf
region were used to prove the effectiveness of the proposed model. The
comparative experimental results, including 27 other models, demonstrated the
proposed model’s superiority in single and multiple WSF with an overall average
mean absolute error of 0.176549, root mean square error of 0.247069, and
R-squared error of 0.985987.
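
For readers unfamiliar with the three reported metrics, the following sketch shows the standard textbook definitions of mean absolute error, root mean square error, and the coefficient of determination (R-squared). The `observed` and `forecast` values are made-up numbers for illustration, not data from the paper, and the paper's exact evaluation protocol is not reproduced here.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical hourly wind speeds (m/s) and forecasts, for illustration only.
observed = [5.0, 6.0, 7.0, 8.0]
forecast = [5.5, 6.0, 6.5, 8.0]

print(mae(observed, forecast), rmse(observed, forecast), r_squared(observed, forecast))
```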
[LINK]
http://arxiv.org/abs/2412.13356v1
[DATE]
2024-12-18 06:04:46+08:00
[CATEGORIES]
cs.LG
A Novel Machine Learning Classifier Based on Genetic Algorithms and Data Importance Reformatting
[AUTHORS]
A. K. Alkhayyata, N. M. Hewahi
[ABSTRACT]
In this paper, a novel classification algorithm that is based on Data
Importance (DI) reformatting and Genetic Algorithms (GA) named GADIC is
proposed to overcome the issues related to the nature of data which may hinder
the performance of the Machine Learning (ML) classifiers. GADIC comprises three
phases: a data reformatting phase, which depends on the DI concept; a training
phase, where GA is applied to the reformatted training dataset; and a testing
phase, where the instances of the reformatted testing dataset are averaged
based on similar instances in the training dataset. GADIC is an approach that
utilizes the existing ML classifiers with involvement of data reformatting,
using GA to tune the inputs, and averaging the similar instances to the unknown
instance. The averaging of the instances becomes the unknown instance to be
classified in the stage of testing. GADIC has been tested on five existing ML
classifiers which are Support Vector Machine (SVM), K-Nearest Neighbour (KNN),
Logistic Regression (LR), Decision Tree (DT), and Naïve Bayes (NB). All were
evaluated using seven open-source UCI ML repository and Kaggle datasets which
are Cleveland heart disease, Indian liver patient, Pima Indian diabetes,
employee future prediction, telecom churn prediction, bank customer churn, and
tech students. In terms of accuracy, the results showed that, with the
exception of approximately 1% decrease in the accuracy of NB classifier in
Cleveland heart disease dataset, GADIC significantly enhanced the performance
of most ML classifiers using various datasets. In addition, KNN with GADIC
showed the greatest performance gain when compared with other ML classifiers
with GADIC followed by SVM while LR had the lowest improvement. The lowest
average improvement that GADIC could achieve is 5.96%, whereas the maximum
average improvement reached 16.79%.
[LINK]
http://arxiv.org/abs/2412.13350v1
[DATE]
2024-12-18 05:54:55+08:00
[CATEGORIES]
cs.LG
Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs
[AUTHORS]
Aldo Pareja, Nikhil Shivakumar Nayak, Hao Wang, Krishnateja Killamsetty, Shivchander Sudalairaj, Wenlong Zhao, Seungwook Han, Abhishek Bhandwaldar, Guangxuan Xu, Kai Xu, Ligong Han, Luke Inglis, Akash Srivastava
[ABSTRACT]
The rise of large language models (LLMs) has created a significant disparity:
industrial research labs with their computational resources, expert teams, and
advanced infrastructures, can effectively fine-tune LLMs, while individual
developers and small organizations face barriers due to limited resources. In
this paper, we aim to bridge this gap by presenting a comprehensive study on
supervised fine-tuning of LLMs using instruction-tuning datasets spanning
diverse knowledge domains and skills. We focus on small-sized LLMs (3B to 7B
parameters) for their cost-efficiency and accessibility. We explore various
training configurations and strategies across four open-source pre-trained
models. We provide detailed documentation of these configurations, revealing
findings that challenge several common training practices, including
hyperparameter recommendations from TULU and phased training recommended by
Orca. Key insights from our work include: (i) larger batch sizes paired with
lower learning rates lead to improved model performance on benchmarks such as
MMLU, MTBench, and Open LLM Leaderboard; (ii) early-stage training dynamics,
such as lower gradient norms and higher loss values, are strong indicators of
better final model performance, enabling early termination of sub-optimal runs
and significant computational savings; (iii) through a thorough exploration of
hyperparameters like warmup steps and learning rate schedules, we provide
guidance for practitioners and find that certain simplifications do not
compromise performance; and (iv) we observed no significant difference in
performance between phased and stacked training strategies, but stacked
training is simpler and more sample efficient. With these findings holding
robustly across datasets and models, we hope this study serves as a guide for
practitioners fine-tuning small LLMs and promotes a more inclusive environment
for LLM research.
[COMMENTS]
33 pages, 19 figures. Appendix included in submission. Submitted to
ICLR 2025
[LINK]
http://arxiv.org/abs/2412.13337v1
[DATE]
2024-12-18 05:16:59+08:00
[CATEGORIES]
cs.LG
A Library for Learning Neural Operators
[AUTHORS]
Jean Kossaifi, Nikola Kovachki, Zongyi Li, David Pitt, Miguel Liu-Schiaffini, Robert Joseph George, Boris Bonev, Kamyar Azizzadenesheli, Julius Berner, Anima Anandkumar
[ABSTRACT]
We present NeuralOperator, an open-source Python library for operator
learning. Neural operators generalize neural networks to maps between function
spaces instead of finite-dimensional Euclidean spaces. They can be trained and
inferenced on input and output functions given at various discretizations,
satisfying a discretization convergence property. Built on top of PyTorch,
NeuralOperator provides all the tools for training and deploying neural
operator models, as well as developing new ones, in a high-quality, tested,
open-source package. It combines cutting-edge models and customizability with a
gentle learning curve and simple user interface for newcomers.
[LINK]
http://arxiv.org/abs/2412.10354v2
[DATE]
2024-12-18 05:15:33+08:00
[CATEGORIES]
cs.LG
Structured Unrestricted-Rank Matrices for Parameter Efficient Fine-tuning
[AUTHORS]
Arijit Sehanobish, Avinava Dubey, Krzysztof Choromanski, Somnath Basu Roy Chowdhury, Deepali Jain, Vikas Sindhwani, Snigdha Chaturvedi
[ABSTRACT]
Recent efforts to scale Transformer models have demonstrated rapid progress
across a wide range of tasks (Wei et al., 2022). However, fine-tuning these
models for downstream tasks is expensive due to their large parameter counts.
Parameter-efficient fine-tuning (PEFT) approaches have emerged as a viable
alternative by allowing us to fine-tune models by updating only a small number
of parameters. In this work, we propose a general framework for parameter
efficient fine-tuning (PEFT), based on structured unrestricted-rank matrices
(SURM) which can serve as a drop-in replacement for popular approaches such as
Adapters and LoRA. Unlike other methods like LoRA, SURMs provide more
flexibility in finding the right balance between compactness and
expressiveness. This is achieved by using low displacement rank matrices
(LDRMs), which have not been used in this context before. SURMs remain
competitive with baselines, often providing significant quality improvements
while using a smaller parameter budget. SURMs achieve 5-7% accuracy gains on
various image classification tasks while replacing low-rank matrices in LoRA.
They also result in up to 12x reduction of the number of parameters in adapters
(with virtually no loss in quality) on the GLUE benchmark.
[COMMENTS]
Accepted at NeurIPS 2024
[LINK]
http://arxiv.org/abs/2406.17740v3
[DATE]
2024-12-18 05:11:27+08:00
[CATEGORIES]
cs.LG
LossLens: Diagnostics for Machine Learning through Loss Landscape Visual Analytics
[AUTHORS]
Tiankai Xie, Jiaqing Chen, Yaoqing Yang, Caleb Geniesse, Ge Shi, Ajinkya Chaudhari, John Kevin Cava, Michael W. Mahoney, Talita Perciano, Gunther H. Weber, Ross Maciejewski
[ABSTRACT]
Modern machine learning often relies on optimizing a neural network’s
parameters using a loss function to learn complex features. Beyond training,
examining the loss function with respect to a network’s parameters (i.e., as a
loss landscape) can reveal insights into the architecture and learning process.
While the local structure of the loss landscape surrounding an individual
solution can be characterized using a variety of approaches, the global
structure of a loss landscape, which includes potentially many local minima
corresponding to different solutions, remains far more difficult to
conceptualize and visualize. To address this difficulty, we introduce LossLens,
a visual analytics framework that explores loss landscapes at multiple scales.
LossLens integrates metrics from global and local scales into a comprehensive
visual representation, enhancing model diagnostics. We demonstrate LossLens
through two case studies: visualizing how residual connections influence a
ResNet-20, and visualizing how physical parameters influence a physics-informed
neural network (PINN) solving a simple convection problem.
[LINK]
http://arxiv.org/abs/2412.13321v1
[DATE]
2024-12-18 04:40:06+08:00
[CATEGORIES]
cs.LG
Automated Phytosensing: Ozone Exposure Classification Based on Plant Electrical Signals
[AUTHORS]
Till Aust, Eduard Buss, Felix Mohr, Heiko Hamann
[ABSTRACT]
In our project WatchPlant, we propose to use a decentralized network of
living plants as air-quality sensors by measuring their electrophysiology to
infer the environmental state, also called phytosensing. We conducted in-lab
experiments exposing ivy (Hedera helix) plants to ozone, an important pollutant
to monitor, and measured their electrophysiological response. However, there is
no well-established automated way of detecting ozone exposure in plants. We
propose a generic automatic toolchain to select a high-performance subset of
features and highly accurate models for plant electrophysiology. Our approach
derives plant- and stimulus-generic features from the electrophysiological
signal using the tsfresh library. Based on these features, we automatically
select and optimize machine learning models using AutoML. We use forward
feature selection to increase model performance. We show that our approach
successfully classifies plant ozone exposure with accuracies of up to 94.6% on
unseen data. We also show that our approach can be used for other plant species
and stimuli. Our toolchain automates the development of monitoring algorithms
for plants as pollutant monitors. Our results help implement significant
advancements for phytosensing devices contributing to the development of
cost-effective, high-density urban air monitoring systems in the future.
[COMMENTS]
Submitted and Accepted at 2025 IEEE Symposia on CI for Energy,
Transport and Environmental Sustainability (IEEE CIETES)
[LINK]
http://arxiv.org/abs/2412.13312v1
[DATE]
2024-12-18 04:29:00+08:00
[CATEGORIES]
cs.LG
Efficient Gravitational Wave Parameter Estimation via Knowledge Distillation: A ResNet1D-IAF Approach
[AUTHORS]
Xihua Zhu, Yiqian Yang, Fan Zhang
[ABSTRACT]
With the rapid development of gravitational wave astronomy, the increasing
number of detected events necessitates efficient methods for parameter
estimation and model updates. This study presents a novel approach using
knowledge distillation techniques to enhance computational efficiency in
gravitational wave analysis. We develop a framework combining ResNet1D and
Inverse Autoregressive Flow (IAF) architectures, where knowledge from a complex
teacher model is transferred to a lighter student model. Our experimental
results show that the student model achieves a validation loss of 3.70 with
optimal configuration (40,100,0.75), compared to the teacher model’s 4.09,
while reducing the number of parameters by 43%. The Jensen-Shannon divergence
between teacher and student models remains below 0.0001 across network layers,
indicating successful knowledge transfer. By optimizing ResNet layers (7-16)
and hidden features (70-120), we achieve a 35% reduction in inference time
while maintaining parameter estimation accuracy. This work demonstrates
significant improvements in computational efficiency for gravitational wave
data analysis, providing valuable insights for real-time event processing.
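
The Jensen-Shannon divergence cited above can be sketched for discrete distributions as follows. This is the standard definition (natural logarithm, so the maximum value is ln 2); how the authors discretize or estimate the teacher and student distributions is not stated in the abstract, so the example inputs are purely illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical teacher and student output distributions over three bins.
teacher = [0.70, 0.20, 0.10]
student = [0.69, 0.21, 0.10]
print(js_divergence(teacher, student))
```

Unlike the KL divergence, this quantity is symmetric and bounded, which makes a fixed threshold such as 0.0001 easier to interpret.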
[COMMENTS]
Due to new experimental results to add to the paper, this version no
longer accurately reflects the current state of our research. Therefore,
while further experiments are conducted, we are withdrawing the paper. A new
version will be submitted in the future and we apologize for any
inconvenience this may cause
[LINK]
http://arxiv.org/abs/2412.08672v2
[DATE]
2024-12-18 04:15:09+08:00
[CATEGORIES]
cs.LG
FastVLM: Efficient Vision Encoding for Vision Language Models
[AUTHORS]
Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari
[ABSTRACT]
Scaling the input image resolution is essential for enhancing the performance
of Vision Language Models (VLMs), particularly in text-rich image understanding
tasks. However, popular visual encoders such as ViTs become inefficient at high
resolutions due to the large number of tokens and high encoding latency caused
by stacked self-attention layers. At different operational resolutions, the
vision encoder of a VLM can be optimized along two axes: reducing encoding
latency and minimizing the number of visual tokens passed to the LLM, thereby
lowering overall latency. Based on a comprehensive efficiency analysis of the
interplay between image resolution, vision latency, token count, and LLM size,
we introduce FastVLM, a model that achieves an optimized trade-off between
latency, model size and accuracy. FastVLM incorporates FastViTHD, a novel
hybrid vision encoder designed to output fewer tokens and significantly reduce
encoding time for high-resolution images. Unlike previous methods, FastVLM
achieves the optimal balance between visual token count and image resolution
solely by scaling the input image, eliminating the need for additional token
pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM
achieves 3.2$\times$ improvement in time-to-first-token (TTFT) while
maintaining similar performance on VLM benchmarks compared to prior works.
Compared to LLaVa-OneVision at the highest resolution (1152$\times$1152),
FastVLM achieves comparable performance on key benchmarks like SeedBench and
MMMU, using the same 0.5B LLM, but with 85$\times$ faster TTFT and a vision
encoder that is 3.4$\times$ smaller.
[LINK]
http://arxiv.org/abs/2412.13303v1
[DATE]
2024-12-18 04:09:55+08:00
[CATEGORIES]
cs.LG
Image registration is a geometric deep learning task
[AUTHORS]
Vasiliki Sideri-Lampretsa, Nil Stolt-Ansó, Martin Menten, Huaqi Qiu, Julian McGinnis, Daniel Rueckert
[ABSTRACT]
Data-driven deformable image registration methods predominantly rely on
operations that process grid-like inputs. However, applying deformable
transformations to an image results in a warped space that deviates from a
rigid grid structure. Consequently, data-driven approaches with sequential
deformations have to apply grid resampling operations between each deformation
step. While artifacts caused by resampling are negligible in high-resolution
images, the resampling of sparse, high-dimensional feature grids introduces
errors that affect the deformation modeling process. Taking inspiration from
Lagrangian reference frames of deformation fields, our work introduces a novel
paradigm for data-driven deformable image registration that utilizes geometric
deep-learning principles to model deformations without grid requirements.
Specifically, we model image features as a set of nodes that freely move in
Euclidean space, update their coordinates under graph operations, and
dynamically readjust their local neighborhoods. We employ this formulation to
construct a multi-resolution deformable registration model, where deformation
layers iteratively refine the overall transformation at each resolution without
intermediate resampling operations on the feature grids. We investigate our
method’s ability to fully deformably capture large deformations across a number
of medical imaging registration tasks. In particular, we apply our approach
(GeoReg) to the registration of inter-subject brain MR images and inhale-exhale
lung CT images, showing on-par performance with the current state-of-the-art
methods. We believe our contribution opens up avenues of research to reduce the
black-box nature of current learned registration paradigms by explicitly
modeling the transformation within the architecture.
[COMMENTS]
22 Pages
[LINK]
http://arxiv.org/abs/2412.13294v1
[DATE]
2024-12-18 03:47:10+08:00
[CATEGORIES]
cs.LG
Posterior Mean Matching: Generative Modeling through Online Bayesian Inference
[AUTHORS]
Sebastian Salazar, Michal Kucer, Yixin Wang, Emily Casleton, David Blei
[ABSTRACT]
This paper introduces posterior mean matching (PMM), a new method for
generative modeling that is grounded in Bayesian inference. PMM uses conjugate
pairs of distributions to model complex data of various modalities like images
and text, offering a flexible alternative to existing methods like diffusion
models. PMM models iteratively refine noisy approximations of the target
distribution using updates from online Bayesian inference. PMM is flexible
because its mechanics are based on general Bayesian models. We demonstrate this
flexibility by developing specialized examples: a generative PMM model of
real-valued data using the Normal-Normal model, a generative PMM model of count
data using a Gamma-Poisson model, and a generative PMM model of discrete data
using a Dirichlet-Categorical model. For the Normal-Normal PMM model, we
establish a direct connection to diffusion models by showing that its
continuous-time formulation converges to a stochastic differential equation
(SDE). Additionally, for the Gamma-Poisson PMM, we derive a novel SDE driven by
a Cox process, which is a significant departure from traditional Brownian
motion-based generative models. PMMs achieve performance that is competitive
with generative models for language modeling and image generation.
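
The online Bayesian update underlying the Normal-Normal case can be illustrated with the textbook conjugate formula. The sketch below shows only how a posterior mean is refined one observation at a time under a known noise variance; it is not the PMM generative procedure itself, and the function name, prior, and observations are assumptions for illustration.

```python
def normal_normal_update(mean, precision, y, noise_var):
    """One online conjugate update for a Normal prior and Normal likelihood.

    The prior is N(mean, 1 / precision); the observation is y ~ N(theta, noise_var).
    Precisions add, and the new mean is a precision-weighted average.
    """
    obs_precision = 1.0 / noise_var
    new_precision = precision + obs_precision
    new_mean = (precision * mean + obs_precision * y) / new_precision
    return new_mean, new_precision


# Hypothetical prior N(0, 1) and three observations of the same value.
mean, precision = 0.0, 1.0
for y in [2.0, 2.0, 2.0]:
    mean, precision = normal_normal_update(mean, precision, y, noise_var=1.0)
    print(f"posterior mean = {mean:.4f}, posterior precision = {precision:.1f}")
```

Each update moves the posterior mean toward the data while the precision grows, which is the sense in which a noisy estimate is iteratively refined.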
[LINK]
http://arxiv.org/abs/2412.13286v1
[DATE]
2024-12-18 03:34:58+08:00
[CATEGORIES]
cs.LG
Stochastic weight matrix dynamics during learning and Dyson Brownian motion
[AUTHORS]
Gert Aarts, Biagio Lucini, Chanju Park
[ABSTRACT]
We demonstrate that the update of weight matrices in learning algorithms can
be described in the framework of Dyson Brownian motion, thereby inheriting many
features of random matrix theory. We relate the level of stochasticity to the
ratio of the learning rate and the mini-batch size, providing more robust
evidence for a previously conjectured scaling relationship. We discuss universal
and non-universal features in the resulting Coulomb gas distribution and
identify the Wigner surmise and Wigner semicircle explicitly in a
teacher-student model and in the (near-)solvable case of the Gaussian
restricted Boltzmann machine.
[COMMENTS]
14 pages, 16 figures; minor changes and references added, version to
appear in Physical Review E
[LINK]
http://arxiv.org/abs/2407.16427v2
[DATE]
2024-12-18 03:13:36+08:00
[CATEGORIES]
cs.LG
GPgym: A Remote Service Platform with Gaussian Process Regression for Online Learning
[AUTHORS]
Xiaobing Dai, Zewen Yang
[ABSTRACT]
Machine learning is now widely applied across various domains, including
industry, engineering, and research. While numerous mature machine learning
models have been open-sourced on platforms like GitHub, their deployment often
requires writing scripts in specific programming languages, such as Python,
C++, or MATLAB. This dependency on particular languages creates a barrier for
professionals outside the field of machine learning, making it challenging to
integrate these algorithms into their workflows. To address this limitation, we
propose GPgym, a remote service node based on Gaussian process regression.
GPgym enables experts from diverse fields to seamlessly and flexibly
incorporate machine learning techniques into their existing specialized
software, without needing to write or manage complex script code.
[LINK]
http://arxiv.org/abs/2412.13276v1
[DATE]
2024-12-18 03:10:59+08:00
[CATEGORIES]
cs.LG
ExBody2: Advanced Expressive Humanoid Whole-Body Control
[AUTHORS]
Mazeyu Ji, Xuanbin Peng, Fangchen Liu, Jialong Li, Ge Yang, Xuxin Cheng, Xiaolong Wang
[ABSTRACT]
This paper enables real-world humanoid robots to maintain stability while
performing expressive motions like humans do. We propose ExBody2, a generalized
whole-body tracking framework that can take any reference motion inputs and
control the humanoid to mimic the motion. The model is trained in simulation
with Reinforcement Learning and then transferred to the real world. It
decouples keypoint tracking from velocity control, and effectively leverages a
privileged teacher policy to distill precise mimic skills into the target
student policy, which enables high-fidelity replication of dynamic movements
such as running, crouching, dancing, and other challenging motions. We present
a comprehensive qualitative and quantitative analysis of crucial design factors
in the paper. We conduct our experiments on two humanoid platforms and
demonstrate the superiority of our approach against state-of-the-art methods,
providing practical guidelines to pursue the extreme of whole-body control for
humanoid robots.
[COMMENTS]
website: https://exbody2.github.io
[LINK]
http://arxiv.org/abs/2412.13196v1
[DATE]
2024-12-18 02:59:51+08:00
[CATEGORIES]
cs.LG
Tilted Quantile Gradient Updates for Quantile-Constrained Reinforcement Learning
[AUTHORS]
Chenglin Li, Guangchun Ruan, Hua Geng
[ABSTRACT]
Safe reinforcement learning (RL) is a popular and versatile paradigm to learn
reward-maximizing policies with safety guarantees. Previous works tend to
express the safety constraints in an expectation form due to the ease of
implementation, but this turns out to be ineffective in maintaining safety
constraints with high probability. To this end, we move to the
quantile-constrained RL that enables a higher level of safety without any
expectation-form approximations. We directly estimate the quantile gradients
through sampling and provide the theoretical proofs of convergence. Then a
tilted update strategy for quantile gradients is implemented to compensate for the
asymmetric distributional density, with a direct benefit of return performance.
Experiments demonstrate that the proposed model fully meets safety requirements
(quantile constraints) while outperforming the state-of-the-art benchmarks with
higher return.
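
The difference between an expectation-form constraint and a quantile constraint can be seen in a small numerical sketch. The episode costs, the threshold, and the 0.9 level below are made-up values, and the empirical quantile is a simple order-statistic estimate rather than the sampling-based gradient estimator proposed in the paper.

```python
import math


def empirical_quantile(samples, level):
    """Smallest sample value whose empirical CDF is at least `level`."""
    ordered = sorted(samples)
    index = max(0, math.ceil(level * len(ordered)) - 1)
    return ordered[index]


# Hypothetical safety costs from 10 episodes: mostly safe, occasionally very unsafe.
costs = [0.0] * 8 + [5.0] * 2
threshold = 2.0

mean_cost = sum(costs) / len(costs)
q90_cost = empirical_quantile(costs, 0.9)

print("expectation constraint satisfied:", mean_cost <= threshold)
print("0.9-quantile constraint satisfied:", q90_cost <= threshold)
```

Here the average cost is within the threshold even though 20% of episodes violate it, which is the failure mode that motivates constraining a quantile instead of the mean.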
[COMMENTS]
Accepted by the 39th AAAI Conference on Artificial Intelligence
(AAAI-25)
[LINK]
http://arxiv.org/abs/2412.13184v1
[DATE]
2024-12-18 02:58:00+08:00
[CATEGORIES]
cs.LG
ORFormer: Occlusion-Robust Transformer for Accurate Facial Landmark Detection
[AUTHORS]
Jui-Che Chiang, Hou-Ning Hu, Bo-Syuan Hou, Chia-Yu Tseng, Yu-Lun Liu, Min-Hung Chen, Yen-Yu Lin
[ABSTRACT]
Although facial landmark detection (FLD) has made significant progress,
existing FLD methods still suffer from performance drops on partially
non-visible faces, such as faces with occlusions or under extreme lighting
conditions or poses. To address this issue, we introduce ORFormer, a novel
transformer-based method that can detect non-visible regions and recover their
missing features from visible parts. Specifically, ORFormer associates each
image patch token with one additional learnable token called the messenger
token. The messenger token aggregates features from all but its patch. This
way, the consensus between a patch and other patches can be assessed by
referring to the similarity between its regular and messenger embeddings,
enabling non-visible region identification. Our method then recovers occluded
patches with features aggregated by the messenger tokens. Leveraging the
recovered features, ORFormer compiles high-quality heatmaps for the downstream
FLD task. Extensive experiments show that our method generates heatmaps
resilient to partial occlusions. By integrating the resultant heatmaps into
existing FLD methods, our method performs favorably against the state of the
art on challenging datasets such as WFLW and COFW.
[COMMENTS]
WACV 2025
[LINK]
http://arxiv.org/abs/2412.13174v1
[DATE]
2024-12-18 02:53:43+08:00
[CATEGORIES]
cs.LG
QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning
[AUTHORS]
Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, Yuriy Brun
[ABSTRACT]
Formal verification is a promising method for producing reliable software,
but the difficulty of manually writing verification proofs severely limits its
utility in practice. Recent methods have automated some proof synthesis by
guiding a search through the proof space using a theorem prover. Unfortunately,
the theorem prover provides only the crudest estimate of progress, resulting in
effectively undirected search. To address this problem, we create
QEDCartographer, an automated proof-synthesis tool that combines supervised and
reinforcement learning to more effectively explore the proof space.
QEDCartographer incorporates the proofs’ branching structure, enabling
reward-free search and overcoming the sparse reward problem inherent to formal
verification. We evaluate QEDCartographer using the CoqGym benchmark of 68.5K
theorems from 124 open-source Coq projects. QEDCartographer fully automatically
proves 21.4% of the test-set theorems. Previous search-based proof-synthesis
tools Tok, Tac, ASTactic, Passport, and Proverbot9001, which rely only on
supervised learning, prove 9.6%, 9.8%, 10.9%, 12.5%, and 19.8%, respectively.
Diva, which combines 62 tools, proves 19.2%. Compared to the most effective
prior tool, Proverbot9001, QEDCartographer produces 34% shorter proofs 29%
faster, on average over the theorems both tools prove. Together,
QEDCartographer and non-learning-based CoqHammer prove 30.3% of the theorems,
while CoqHammer alone proves 26.6%. Our work demonstrates that reinforcement
learning is a fruitful research direction for improving proof-synthesis tools’
search mechanisms.
[COMMENTS]
Authors could not agree on final revision. Please see author websites
for individual versions of paper
[LINK]
http://arxiv.org/abs/2408.09237v7
[DATE]
2024-12-18 02:37:07+08:00
[CATEGORIES]
cs.LG
On Model Extrapolation in Marginal Shapley Values
[AUTHORS]
Ilya Rozenfeld
[ABSTRACT]
As the use of complex machine learning models continues to grow, so does the
need for reliable explainability methods. One of the most popular methods for
model explainability is based on Shapley values. The two most commonly used
approaches to calculating Shapley values, conditional and marginal, produce
different results when features are correlated. In our previous work,
it was demonstrated that the conditional approach is fundamentally flawed due
to implicit assumptions of causality. However, it is a well-known fact that
the marginal approach to calculating Shapley values leads to model extrapolation
where it might not be well defined. In this paper we explore the impacts of
model extrapolation on Shapley values in the case of a simple linear spline
model. Furthermore, we propose an approach which, while using marginal averaging,
avoids model extrapolation and, with the addition of causal information, replicates
causal Shapley values. Finally, we demonstrate our method on a real data
example.
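
To make the marginal approach concrete, the sketch below computes exact marginal Shapley values by enumerating feature subsets and averaging the model over a background dataset. The linear model, the background points, and the function names are assumptions for illustration; the paper's linear spline setting and its proposed extrapolation-free method are not reproduced here.

```python
from itertools import combinations
from math import factorial


def marginal_value(model, x, background, subset):
    """Average model output with features in `subset` fixed to x, others drawn from background."""
    total = 0.0
    for b in background:
        z = [x[i] if i in subset else b[i] for i in range(len(x))]
        total += model(z)
    return total / len(background)


def marginal_shapley(model, x, background):
    """Exact Shapley values under the marginal value function (exponential in feature count)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_i = marginal_value(model, x, background, set(subset) | {i})
                without_i = marginal_value(model, x, background, set(subset))
                phi[i] += weight * (with_i - without_i)
    return phi


def linear_model(z):
    """Hypothetical model f(z) = 2*z0 - z1."""
    return 2.0 * z[0] - 1.0 * z[1]


background = [[0.0, 0.0], [2.0, 4.0]]
x = [3.0, 1.0]
phi = marginal_shapley(linear_model, x, background)
print(phi)
```

Note that `marginal_value` evaluates the model at hybrid points such as `[3.0, 4.0]`, which mix features of `x` with features of a background row and need not resemble any real observation. This is the model extrapolation that the abstract refers to.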
[LINK]
http://arxiv.org/abs/2412.13158v1
[DATE]
2024-12-18 02:33:14+08:00
[CATEGORIES]
cs.LG
Learning Visuotactile Estimation and Control for Non-prehensile Manipulation under Occlusions
[AUTHORS]
Juan Del Aguila Ferrandis, João Moura, Sethu Vijayakumar
[ABSTRACT]
Manipulation without grasping, known as non-prehensile manipulation, is
essential for dexterous robots in contact-rich environments, but presents many
challenges relating to underactuation, hybrid-dynamics, and frictional
uncertainty. Additionally, object occlusions in a scenario of contact
uncertainty and where the motion of the object evolves independently from the
robot become a critical problem, which previous literature fails to address.
We present a method for learning visuotactile state estimators and
uncertainty-aware control policies for non-prehensile manipulation under
occlusions, by leveraging diverse interaction data from privileged policies
trained in simulation. We formulate the estimator within a Bayesian deep
learning framework, to model its uncertainty, and then train uncertainty-aware
control policies by incorporating the pre-learned estimator into the
reinforcement learning (RL) loop, both of which lead to significantly improved
estimator and policy performance. Therefore, unlike prior non-prehensile
research that relies on complex external perception set-ups, our method
successfully handles occlusions after sim-to-real transfer to robotic hardware
with a simple onboard camera. See our video: https://youtu.be/hW-C8i_HWgs.
[COMMENTS]
Conference on Robot Learning (CoRL 2024)
[LINK]
http://arxiv.org/abs/2412.13157v1
[DATE]
2024-12-18 02:33:05+08:00
[CATEGORIES]
cs.LG
SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction
[AUTHORS]
Chao Ma, Wenbo Gong, Meyer Scetbon, Edward Meeds
[ABSTRACT]
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the
success of large language models. However, they maintain additional moving
average states throughout training, which results in memory requirements
several times greater than the model. This overhead imposes constraints on
scalability and computational efficiency. On the other hand, while stochastic
gradient descent (SGD) is optimal in terms of memory efficiency, its
capability in LLM training is limited (Zhao et al., 2024b).
To address this dilemma, we show that pre-processing SGD is sufficient to
reach Adam-level performance on LLMs. Specifically, we propose to preprocess
the instantaneous stochastic gradients with two simple operators:
$\mathtt{GradNorm}$ and $\mathtt{GradWhitening}$. $\mathtt{GradNorm}$
stabilizes gradient distributions, and $\mathtt{GradWhitening}$ counteracts the
local curvature of the loss landscape. This results in SWAN (SGD
with Whitening And Normalization), a stochastic optimizer that eliminates the
need to store any accumulative state variables. Empirically, SWAN has the same
memory footprint as SGD, achieving $\approx 50\%$ reduction on total end-to-end
memory compared to Adam. In language modeling tasks, SWAN matches or even
substantially improves over Adam. Specifically, when pre-training
the LLaMa model with 350M and 1.3B parameters, SWAN achieves a 2x speedup by
reaching the same evaluation perplexity with fewer than half the tokens seen.
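
The stateless update described in this abstract can be sketched roughly as
follows. The abstract does not give the operator definitions, so the forms
used here are assumptions: grad_norm standardizes each row of the gradient
matrix, and grad_whitening multiplies the gradient by the inverse square root
of G G^T. The function names, the learning rate, and the row orientation are
hypothetical and are not taken from the paper.

```python
import numpy as np

def grad_norm(grad, eps=1e-8):
    """Standardize each row of the gradient (assumed form of GradNorm)."""
    mean = grad.mean(axis=1, keepdims=True)
    std = grad.std(axis=1, keepdims=True)
    return (grad - mean) / (std + eps)

def grad_whitening(grad, eps=1e-8):
    """Return (G G^T)^(-1/2) G via an eigendecomposition (assumed form of GradWhitening)."""
    cov = grad @ grad.T
    eigvals, eigvecs = np.linalg.eigh(cov)
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(np.maximum(eigvals, eps))) @ eigvecs.T
    return inv_sqrt @ grad

def swan_like_step(weight, grad, lr=1e-3):
    """One update that keeps no optimizer state: preprocess the gradient, then take an SGD step."""
    return weight - lr * grad_whitening(grad_norm(grad))

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 16))   # toy weight matrix
grad = rng.normal(size=(4, 16))     # toy stochastic gradient
new_weight = swan_like_step(weight, grad)
whitened = grad_whitening(grad_norm(grad))
```

Nothing is carried from one step to the next, which is the property the
abstract credits for the SGD-level memory footprint. After whitening, the rows
of the processed gradient are orthonormal in this sketch.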
[LINK]
http://arxiv.org/abs/2412.13148v1
[DATE]
2024-12-18 02:13:18+08:00
[CATEGORIES]
cs.LG
Practicable Black-box Evasion Attacks on Link Prediction in Dynamic Graphs – A Graph Sequential Embedding Method
[AUTHORS]
Jiate Li, Meng Pang, Binghui Wang
[ABSTRACT]
Link prediction in dynamic graphs (LPDG) has been widely applied to
real-world applications such as website recommendation, traffic flow
prediction, organizational studies, etc. These models are usually kept local
and secure, with only the interactive interface restrictively available to the
public. Thus, the problem of the black-box evasion attack on the LPDG model,
where model interactions and data perturbations are restricted, seems to be
essential and meaningful in practice. In this paper, we propose the first
practicable black-box evasion attack method that achieves effective attacks
against the target LPDG model, within a limited amount of interactions and
perturbations. To perform effective attacks under limited perturbations, we
develop a graph sequential embedding model to find the desired state embedding
of the dynamic graph sequences, under a deep reinforcement learning framework.
To overcome the scarcity of interactions, we design a multi-environment
training pipeline and train our agent for multiple instances, by sharing an
aggregate interaction buffer. Finally, we evaluate our attack against three
advanced LPDG models on three real-world graph datasets of different scales and
compare its performance with related methods under the interaction and
perturbation constraints. Experimental results show that our attack is both
effective and practicable.
[LINK]
http://arxiv.org/abs/2412.13134v1
[DATE]
2024-12-18 01:53:32+08:00
[CATEGORIES]
cs.LG
Enhancing Internet of Things Security through Self-Supervised Graph Neural Networks
[AUTHORS]
Safa Ben Atitallah, Maha Driss, Wadii Boulila, Anis Koubaa
[ABSTRACT]
With the rapid rise of the Internet of Things (IoT), ensuring the security of
IoT devices has become essential. One of the primary challenges in this field
is that new types of attacks often have significantly fewer samples than more
common attacks, leading to unbalanced datasets. Existing research on detecting
intrusions in these unbalanced labeled datasets primarily employs Convolutional
Neural Networks (CNNs) or conventional Machine Learning (ML) models, which
result in incomplete detection, especially for new attacks. To handle these
challenges, we suggest a new approach to IoT intrusion detection using
Self-Supervised Learning (SSL) with a Markov Graph Convolutional Network
(MarkovGCN). Graph learning excels at modeling complex relationships within
data, while SSL mitigates the issue of limited labeled data for emerging
attacks. Our approach leverages the inherent structure of IoT networks to
pre-train a GCN, which is then fine-tuned for the intrusion detection task. The
integration of Markov chains in GCN uncovers network structures and enriches
node and edge features with contextual information. Experimental results
demonstrate that our approach significantly improves detection accuracy and
robustness compared to conventional supervised learning methods. Using the
EdgeIIoT-set dataset, we attained an accuracy of 98.68%, a precision of
98.18%, a recall of 98.35%, and an F1-Score of 98.40%.
[LINK]
http://arxiv.org/abs/2412.13240v1
[DATE]
2024-12-18 01:40:14+08:00
[CATEGORIES]
cs.LG
Active Reinforcement Learning Strategies for Offline Policy Improvement
[AUTHORS]
Ambedkar Dukkipati, Ranga Shaarad Ayyagari, Bodhisattwa Dasgupta, Parag Dutta, Prabhas Reddy Onteru
[ABSTRACT]
Learning agents that excel at sequential decision-making tasks must
continuously resolve the problem of exploration and exploitation for optimal
learning. However, such interactions with the environment online might be
prohibitively expensive and may involve some constraints, such as a limited
budget for agent-environment interactions and restricted exploration in certain
regions of the state space. Examples include selecting candidates for medical
trials and training agents in complex navigation environments. This problem
necessitates the study of active reinforcement learning strategies that collect
minimal additional experience trajectories by reusing existing offline data
previously collected by some unknown behavior policy. In this work, we propose
a representation-aware uncertainty-based active trajectory collection method
that intelligently decides interaction strategies that consider the
distribution of the existing offline data. With extensive experimentation, we
demonstrate that our proposed method reduces additional online interaction with
the environment by up to 75% over competitive baselines across various
continuous control environments.
[COMMENTS]
AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13106v1
[DATE]
2024-12-18 01:22:52+08:00
[CATEGORIES]
cs.LG
TKAN: Temporal Kolmogorov-Arnold Networks
[AUTHORS]
Remi Genet, Hugo Inzirillo
[ABSTRACT]
Recurrent Neural Networks (RNNs) have revolutionized many areas of machine
learning, particularly in natural language and data sequence processing. Long
Short-Term Memory (LSTM) has demonstrated its ability to capture long-term
dependencies in sequential data. Inspired by Kolmogorov-Arnold Networks
(KANs), a promising alternative to Multi-Layer Perceptrons (MLPs), we propose
a new neural network architecture combining KANs and the LSTM: Temporal
Kolmogorov-Arnold Networks (TKANs). TKANs combine the strengths of both
networks and are composed of Recurring Kolmogorov-Arnold Network (RKAN) layers
embedding memory management. This innovation enables us to perform multi-step
time series forecasting with enhanced accuracy and efficiency. By addressing
the limitations of traditional models in handling complex sequential patterns,
the TKAN architecture offers significant potential for advancements in fields
requiring more than one step ahead forecasting.
[LINK]
http://arxiv.org/abs/2405.07344v3
[DATE]
2024-12-18 01:13:03+08:00
[CATEGORIES]
cs.LG
Incremental Online Learning of Randomized Neural Network with Forward Regularization
[AUTHORS]
Junda Wang, Minghui Hu, Ning Li, Abdulaziz Al-Ali, Ponnuthurai Nagaratnam Suganthan
[ABSTRACT]
Online learning of deep neural networks suffers from challenges such as
hysteretic non-incremental updating, increasing memory usage, past
retrospective retraining, and catastrophic forgetting. To alleviate these
drawbacks and achieve progressive immediate decision-making, we propose a novel
Incremental Online Learning (IOL) process of Randomized Neural Networks
(Randomized NN), a framework facilitating continuous improvements to Randomized
NN performance in restrictive online scenarios. Within the framework, we
further introduce IOL with ridge regularization (-R) and IOL with forward
regularization (-F). -R generates stepwise incremental updates without
retrospective retraining and avoids catastrophic forgetting. Moreover, we
substituted -R with -F as it enhanced precognition learning ability using
semi-supervision and realized better online regrets to offline global experts
compared to -R during IOL. The algorithms of IOL for Randomized NN with -R/-F
on non-stationary batch stream were derived respectively, featuring recursive
weight updates and variable learning rates. Additionally, we conducted a
detailed analysis and theoretically derived relative cumulative regret bounds
of the Randomized NN learners with -R/-F in IOL under adversarial assumptions
using a novel methodology and presented several corollaries, from which we
observed the superiority on online learning acceleration and regret bounds of
employing -F in IOL. Finally, our proposed methods were rigorously examined
across regression and classification tasks on diverse datasets, which
distinctly validated the efficacy of IOL frameworks of Randomized NN and the
advantages of forward regularization.
[LINK]
http://arxiv.org/abs/2412.13096v1
[DATE]
2024-12-18 01:06:33+08:00
[CATEGORIES]
cs.LG
Reservoir Computing for Fast, Simplified Reinforcement Learning on Memory Tasks
[AUTHORS]
Kevin McKee
[ABSTRACT]
Tasks in which rewards depend upon past information not available in the
current observation set can only be solved by agents that are equipped with
short-term memory. Usual choices for memory modules include trainable recurrent
hidden layers, often with gated memory. Reservoir computing presents an
alternative, in which a recurrent layer is not trained, but rather has a set of
fixed, sparse recurrent weights. The weights are scaled to produce stable
dynamical behavior such that the reservoir state contains a high-dimensional,
nonlinear impulse response function of the inputs. An output decoder network
can then be used to map the compressive history represented by the reservoir’s
state to any outputs, including agent actions or predictions. In this study, we
find that reservoir computing greatly simplifies and speeds up reinforcement
learning on memory tasks by (1) eliminating the need for backpropagation of
gradients through time, (2) presenting all recent history simultaneously to the
downstream network, and (3) performing many useful and generic nonlinear
computations upstream from the trained modules. In particular, these findings
offer significant benefit to meta-learning that depends primarily on efficient
and highly general memory systems.
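
As a rough illustration of the reservoir idea in this abstract, the sketch
below builds a fixed, sparse recurrent layer, rescales it to a chosen spectral
radius, and trains only a linear readout. All names, sizes, and the delayed
recall task are assumptions made for the example. The least-squares readout
stands in for the trained decoder network and is not the author's setup.

```python
import numpy as np

def make_reservoir(n_inputs, n_units, density=0.1, spectral_radius=0.9, seed=0):
    """Create fixed input weights and sparse recurrent weights (never trained)."""
    rng = np.random.default_rng(seed)
    w_rec = rng.normal(size=(n_units, n_units))
    w_rec *= rng.random((n_units, n_units)) < density  # keep ~10% of the weights
    radius = np.max(np.abs(np.linalg.eigvals(w_rec)))
    w_rec *= spectral_radius / radius                   # rescale for stable dynamics
    w_in = rng.normal(size=(n_units, n_inputs))
    return w_in, w_rec

def run_reservoir(w_in, w_rec, inputs):
    """Drive the reservoir with an input sequence and collect its states."""
    state = np.zeros(w_rec.shape[0])
    states = []
    for x in inputs:
        state = np.tanh(w_in @ x + w_rec @ state)
        states.append(state)
    return np.array(states)

rng = np.random.default_rng(1)
inputs = rng.uniform(-1, 1, size=(300, 1))
w_in, w_rec = make_reservoir(n_inputs=1, n_units=100)
states = run_reservoir(w_in, w_rec, inputs)

# Memory task (assumed): recall the input seen 3 steps earlier.
delay = 3
targets = inputs[:-delay, 0]
readout, *_ = np.linalg.lstsq(states[delay:], targets, rcond=None)
predictions = states[delay:] @ readout
```

Only the readout is fitted, so no gradients are propagated back through time,
which matches point (1) of the abstract.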
[COMMENTS]
9 pages, 6 figures
[LINK]
http://arxiv.org/abs/2412.13093v1
[DATE]
2024-12-18 01:02:06+08:00
[CATEGORIES]
cs.LG
Identifying Bias in Deep Neural Networks Using Image Transforms
[AUTHORS]
Sai Teja Erukude, Akhil Joshi, Lior Shamir
[ABSTRACT]
CNNs have become one of the most commonly used computational tools in the past
two decades. One of the primary downsides of CNNs is that they work as a
“black box”, where the user cannot necessarily know how the image data are
analyzed, and therefore needs to rely on empirical evaluation to test the
efficacy of a trained CNN. This can lead to hidden biases that affect the
performance evaluation of neural networks, but are difficult to identify. Here
we discuss examples of such hidden biases in common and widely used benchmark
datasets, and propose techniques for identifying dataset biases that can affect
the standard performance evaluation metrics. One effective approach to identify
dataset bias is to perform image classification by using merely blank
background parts of the original images. However, in some situations a blank
background in the images is not available, making it more difficult to separate
foreground or contextual information from the bias. To overcome this, we
propose a method to identify dataset bias without the need to crop background
information from the images. That method is based on applying several image
transforms to the original images, including Fourier transform, wavelet
transforms, median filter, and their combinations. These transforms were
applied to recover background bias information that CNNs use to classify
images. These transformations affect the contextual visual information
differently from the way they affect the systemic background bias. Therefore, the
method can distinguish between contextual information and the bias, and alert
on the presence of background bias even without the need to separate sub-image
parts from the blank background of the original images. Code used in the
experiments is publicly available.
[COMMENTS]
Computers, published
[LINK]
http://arxiv.org/abs/2412.13079v1
[DATE]
2024-12-18 00:51:44+08:00
[CATEGORIES]
cs.LG
Dual Interpretation of Machine Learning Forecasts
[AUTHORS]
Philippe Goulet Coulombe, Maximilian Goebel, Karin Klieber
[ABSTRACT]
Machine learning predictions are typically interpreted as the sum of
contributions of predictors. Yet, each out-of-sample prediction can also be
expressed as a linear combination of in-sample values of the predicted
variable, with weights corresponding to pairwise proximity scores between
current and past economic events. While this dual route leads nowhere in some
contexts (e.g., large cross-sectional datasets), it provides sparser
interpretations in settings with many regressors and little training data, like
macroeconomic forecasting. In this case, the sequence of contributions can be
visualized as a time series, allowing analysts to explain predictions as
quantifiable combinations of historical analogies. Moreover, the weights can be
viewed as those of a data portfolio, inspiring new diagnostic measures such as
forecast concentration, short position, and turnover. We show how weights can
be retrieved seamlessly for (kernel) ridge regression, random forest, boosted
trees, and neural networks. Then, we apply these tools to analyze post-pandemic
forecasts of inflation, GDP growth, and recession probabilities. In all cases,
the approach opens the black box from a new angle and demonstrates how machine
learning models leverage history partly repeating itself.
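
For plain ridge regression, the dual view described in this abstract follows
from standard algebra: the prediction x_new^T (X^T X + lam I)^(-1) X^T y can
be regrouped as w^T y, with one weight per training observation. The sketch
below checks this numerically on synthetic data. The data, the penalty lam,
and the variable names are assumptions for illustration, and it does not
reproduce the authors' procedure for forests, boosting, or neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # 20 in-sample observations, 5 predictors
y = rng.normal(size=20)        # in-sample values of the predicted variable
x_new = rng.normal(size=5)     # out-of-sample predictor vector
lam = 1.0                      # ridge penalty (arbitrary choice)

gram = X.T @ X + lam * np.eye(X.shape[1])

# Primal route: estimate coefficients, then sum predictor contributions.
beta = np.linalg.solve(gram, X.T @ y)
primal_prediction = x_new @ beta

# Dual route: one weight per training observation, applied to past outcomes.
weights = X @ np.linalg.solve(gram, x_new)
dual_prediction = weights @ y
```

Both routes give the same number. The weights vector is what would be plotted
over time to read the forecast as a combination of historical observations.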
[LINK]
http://arxiv.org/abs/2412.13076v1
[DATE]
2024-12-18 00:44:39+08:00
[CATEGORIES]
cs.LG
Optimized two-stage AI-based Neural Decoding for Enhanced Visual Stimulus Reconstruction from fMRI Data
[AUTHORS]
Lorenzo Veronese, Andrea Moglia, Luca Mainardi, Pietro Cerveri
[ABSTRACT]
AI-based neural decoding reconstructs visual perception by leveraging
generative models to map brain activity, measured through functional MRI
(fMRI), into latent hierarchical representations. Traditionally, ridge linear
models transform fMRI into a latent space, which is then decoded using latent
diffusion models (LDM) via a pre-trained variational autoencoder (VAE). Due to
the complexity and noisiness of fMRI data, newer approaches split the
reconstruction into two sequential steps, the first one providing a rough
visual approximation, the second one improving the stimulus prediction via LDM
endowed with CLIP embeddings. This work proposes a non-linear deep network to
improve fMRI latent space representation, optimizing the dimensionality as well.
Experiments on the Natural Scenes Dataset showed that the proposed architecture
improved the structural similarity of the reconstructed image by about 2% with
respect to the state-of-the-art model, based on ridge linear transform. The
reconstructed image’s semantics improved by about 4%, measured by perceptual
similarity, with respect to the state-of-the-art. The noise sensitivity
analysis of the LDM showed that the role of the first stage was fundamental to
predict the stimulus featuring high structural similarity. Conversely,
providing a large noise stimulus affected the semantics of the predicted
stimulus less, while the structural similarity between the ground truth and
predicted stimulus was very poor. The findings underscore the importance of
leveraging non-linear relationships between BOLD signal and the latent
representation and two-stage generative AI for optimizing the fidelity of
reconstructed visual stimuli from noisy fMRI data.
[COMMENTS]
14 pages, 5 figures
[LINK]
http://arxiv.org/abs/2412.13237v1
[DATE]
2024-12-18 00:42:55+08:00
[CATEGORIES]
cs.LG
Predicting Change, Not States: An Alternate Framework for Neural PDE Surrogates
[AUTHORS]
Anthony Zhou, Amir Barati Farimani
[ABSTRACT]
Neural surrogates for partial differential equations (PDEs) have become
popular due to their potential to quickly simulate physics. With a few
exceptions, neural surrogates generally treat the forward evolution of
time-dependent PDEs as a black box by directly predicting the next state. While
this is a natural and easy framework for applying neural surrogates, it can be
an over-simplified and rigid framework for predicting physics. In this work, we
propose an alternative framework in which neural solvers predict the temporal
derivative and an ODE integrator forwards the solution in time, which has
little overhead and is broadly applicable across model architectures and PDEs.
We find that by simply changing the training target and introducing numerical
integration during inference, neural surrogates can gain accuracy and
stability. Predicting temporal derivatives also allows models to not be
constrained to a specific temporal discretization, allowing for flexible
time-stepping during inference or training on higher-resolution PDE data.
Lastly, we investigate why this new framework can be beneficial and in what
situations it works well.
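
The framework in this abstract, where a model predicts the temporal derivative
and a numerical integrator advances the state, can be sketched as below. The
predicted_derivative function is a stand-in for a trained network and simply
returns the exact derivative of du/dt = -u. The choice of a classical RK4
integrator and all names are assumptions made for illustration.

```python
import numpy as np

def predicted_derivative(u):
    """Stand-in for a trained surrogate that outputs du/dt (here: du/dt = -u)."""
    return -u

def rk4_step(f, u, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    k1 = f(u)
    k2 = f(u + 0.5 * dt * k1)
    k3 = f(u + 0.5 * dt * k2)
    k4 = f(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def rollout(f, u0, dt, n_steps):
    """Integrate the predicted derivative forward for n_steps steps."""
    u = u0
    for _ in range(n_steps):
        u = rk4_step(f, u, dt)
    return u

u0 = np.array([1.0, 2.0, 3.0])
coarse = rollout(predicted_derivative, u0, dt=0.1, n_steps=10)   # reach t = 1
fine = rollout(predicted_derivative, u0, dt=0.05, n_steps=20)    # same t, smaller step
```

Because the model outputs a derivative rather than the next state, the same
function can be rolled out with either step size, which illustrates the
flexible time-stepping mentioned in the abstract.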
[COMMENTS]
18 pages, 7 figures
[LINK]
http://arxiv.org/abs/2412.13074v1
[DATE]
2024-12-18 00:41:53+08:00
[CATEGORIES]
cs.LG
On the Readiness of Scientific Data for a Fair and Transparent Use in Machine Learning
[AUTHORS]
Joan Giner-Miguelez, Abel Gómez, Jordi Cabot
[ABSTRACT]
To ensure the fairness and trustworthiness of machine learning (ML) systems,
recent legislative initiatives and relevant research in the ML community have
pointed out the need to document the data used to train ML models. Besides,
data-sharing practices in many scientific domains have evolved in recent years
for reproducibility purposes. In this sense, academic institutions’ adoption of
these practices has encouraged researchers to publish their data and technical
documentation in peer-reviewed publications such as data papers. In this study,
we analyze how this broader scientific data documentation meets the needs of
the ML community and regulatory bodies for its use in ML technologies. We
examine a sample of 4041 data papers of different domains, assessing their
completeness, coverage of the requested dimensions, and trends in recent years.
We focus on the most and least documented dimensions and compare the results
with those of an ML-focused venue (NeurIPS D&B track) publishing papers
describing datasets. As a result, we propose a set of recommendation guidelines
for data creators and scientific data publishers to increase their data’s
preparedness for its transparent and fairer use in ML technologies.
[LINK]
http://arxiv.org/abs/2401.10304v2
[DATE]
2024-12-18 00:34:49+08:00
[CATEGORIES]
cs.LG
Learning of Patch-Based Smooth-Plus-Sparse Models for Image Reconstruction
[AUTHORS]
Stanislas Ducotterd, Sebastian Neumayer, Michael Unser
[ABSTRACT]
We aim at the solution of inverse problems in imaging, by combining a
penalized sparse representation of image patches with an unconstrained smooth
one. This allows for a straightforward interpretation of the reconstruction. We
formulate the optimization as a bilevel problem. The inner problem deploys
classical algorithms while the outer problem optimizes the dictionary and the
regularizer parameters through supervised learning. The process is carried out
via implicit differentiation and gradient-based optimization. We evaluate our
method for denoising, super-resolution, and compressed-sensing
magnetic-resonance imaging. We compare it to other classical models as well as
deep-learning-based methods and show that it always outperforms the former and
also the latter in some instances.
[LINK]
http://arxiv.org/abs/2412.13070v1
[DATE]
2024-12-18 00:34:32+08:00
[CATEGORIES]
cs.LG
On Distilling the Displacement Knowledge for Few-Shot Class-Incremental Learning
[AUTHORS]
Pengfei Fang, Yongchun Qin, Hui Xue
[ABSTRACT]
Few-shot Class-Incremental Learning (FSCIL) addresses the challenges of
evolving data distributions and the difficulty of data acquisition in
real-world scenarios. To counteract the catastrophic forgetting typically
encountered in FSCIL, knowledge distillation is employed as a way to maintain
the knowledge from learned data distribution. Recognizing the limitations of
generating discriminative feature representations in a few-shot context, our
approach incorporates structural information between samples into knowledge
distillation. This structural information serves as a remedy for the low
quality of features. Diverging from traditional structured distillation methods
that compute sample similarity, we introduce the Displacement Knowledge
Distillation (DKD) method. DKD utilizes displacement rather than similarity
between samples, incorporating both distance and angular information to
significantly enhance the information density retained through knowledge
distillation. Observing performance disparities in feature distribution between
base and novel classes, we propose the Dual Distillation Network (DDNet). This
network applies traditional knowledge distillation to base classes and DKD to
novel classes, challenging the conventional integration of novel classes with
base classes. Additionally, we implement an instance-aware sample selector
during inference to dynamically adjust dual branch weights, thereby leveraging
the complementary strengths of each approach. Extensive testing on three
benchmarks demonstrates that DDNet achieves state-of-the-art results. Moreover,
through rigorous experimentation and comparison, we establish the robustness
and general applicability of our proposed DKD method.
[LINK]
http://arxiv.org/abs/2412.11017v2
[DATE]
2024-12-18 00:27:21+08:00
[CATEGORIES]
cs.LG
VidTok: A Versatile and Open-Source Video Tokenizer
[AUTHORS]
Anni Tang, Tianyu He, Junliang Guo, Xinle Cheng, Li Song, Jiang Bian
[ABSTRACT]
Encoding video content into compact latent tokens has become a fundamental
step in video generation and understanding, driven by the need to address the
inherent redundancy in pixel-level representations. Consequently, there is a
growing demand for high-performance, open-source video tokenizers as
video-centric research gains prominence. We introduce VidTok, a versatile video
tokenizer that delivers state-of-the-art performance in both continuous and
discrete tokenizations. VidTok incorporates several key advancements over
existing approaches: 1) model architecture such as convolutional layers and
up/downsampling modules; 2) to address the training instability and codebook
collapse commonly associated with conventional Vector Quantization (VQ), we
integrate Finite Scalar Quantization (FSQ) into discrete video tokenization; 3)
improved training strategies, including a two-stage training process and the
use of reduced frame rates. By integrating these advancements, VidTok achieves
substantial improvements over existing methods, demonstrating superior
performance across multiple metrics, including PSNR, SSIM, LPIPS, and FVD,
under standardized evaluation settings.
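
As background for the Finite Scalar Quantization step mentioned in this
abstract, the sketch below shows the basic idea under simplifying assumptions:
each latent channel is bounded with tanh and rounded to a small, odd number of
integer levels, so the codebook is implicit rather than learned. The level
counts and function names are hypothetical, the straight-through gradient used
during training is omitted, and this is not the VidTok implementation.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Bound each channel with tanh, then round to integer levels (odd level counts assumed)."""
    half = (np.asarray(levels) - 1) / 2
    return np.round(np.tanh(z) * half).astype(int)

def code_index(codes, levels):
    """Map per-channel integer codes to a single index into the implicit codebook."""
    levels = np.asarray(levels)
    digits = codes + (levels - 1) // 2                       # shift to 0..L-1
    bases = np.concatenate(([1], np.cumprod(levels[:-1])))   # mixed-radix place values
    return digits @ bases

levels = [5, 5, 5]   # implicit codebook of 5 * 5 * 5 = 125 entries
z = np.array([[0.0, 0.0, 0.0],
              [10.0, -10.0, 0.3]])
codes = fsq_quantize(z, levels)
indices = code_index(codes, levels)
```

Every combination of levels is a valid code by construction, so no codebook
entry has to be learned or kept alive, which is the usual argument for why this
kind of quantizer avoids the codebook collapse associated with VQ.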
[COMMENTS]
Code & Models: https://github.com/microsoft/VidTok
[LINK]
http://arxiv.org/abs/2412.13061v1
[DATE]
2024-12-18 00:27:11+08:00
[CATEGORIES]
cs.LG
An Ad-hoc graph node vector embedding algorithm for general knowledge graphs using Kinetica-Graph
[AUTHORS]
B. Kaan Karamete, Eli Glaser
[ABSTRACT]
This paper discusses how to generate general graph node embeddings from
knowledge graph representations. The embedded space is composed of a number of
sub-features to mimic both local affinity and remote structural relevance.
These sub-feature dimensions are defined by several indicators that we
speculate to catch nodal similarities, such as hop-based topological patterns,
the number of overlapping labels, the transitional probabilities (Markov-chain
probabilities), and the cluster indices computed by our recursive spectral
bisection (RSB) algorithm. These measures are flattened over the one
dimensional vector space into their respective sub-component ranges such that
the entire set of vector similarity functions could be used for finding similar
nodes. The error is defined by the sum of pairwise square differences across a
randomly selected sample of graph nodes between the assumed embeddings and the
ground truth estimates as our novel loss function. The ground truth is
estimated to be a combination of pairwise Jaccard similarity and the number of
overlapping labels. Finally, we demonstrate a multi-variate stochastic gradient
descent (SGD) algorithm to compute the weighting factors among sub-vector spaces
to minimize the average error using a random sampling logic.
[COMMENTS]
11 pages, 16 figures, 16 references
[LINK]
http://arxiv.org/abs/2407.15906v2
[DATE]
2024-12-18 00:25:06+08:00
[CATEGORIES]
cs.LG
COSEE: Consistency-Oriented Signal-Based Early Exiting via Calibrated Sample Weighting Mechanism
[AUTHORS]
Jianing He, Qi Zhang, Hongyun Zhang, Xuanjing Huang, Usman Naseem, Duoqian Miao
[ABSTRACT]
Early exiting is an effective paradigm for improving the inference efficiency
of pre-trained language models (PLMs) by dynamically adjusting the number of
executed layers for each sample. However, in most existing works, easy and hard
samples are treated equally by each classifier during training, which neglects
the test-time early exiting behavior, leading to inconsistency between training
and testing. Although some methods have tackled this issue under a fixed
speed-up ratio, the challenge of flexibly adjusting the speed-up ratio while
maintaining consistency between training and testing is still under-explored.
To bridge the gap, we propose a novel Consistency-Oriented Signal-based Early
Exiting (COSEE) framework, which leverages a calibrated sample weighting
mechanism to enable each classifier to emphasize the samples that are more
likely to exit at that classifier under various acceleration scenarios.
Extensive experiments on the GLUE benchmark demonstrate the effectiveness of
our COSEE across multiple exiting signals and backbones, yielding a better
trade-off between performance and efficiency.
[COMMENTS]
AAAI 2025, 11 pages
[LINK]
http://arxiv.org/abs/2412.13236v1
[DATE]
2024-12-18 00:24:55+08:00
[CATEGORIES]
cs.LG
On the Hardness of Training Deep Neural Networks Discretely
[AUTHORS]
Ilan Doron-Arad
[ABSTRACT]
We study neural network training (NNT): optimizing a neural network’s
parameters to minimize the training loss over a given dataset. NNT has been
studied extensively under theoretic lenses, mainly on two-layer networks with
linear or ReLU activation functions where the parameters can take any real
value (here referred to as continuous NNT (C-NNT)). However, less is known
about deeper neural networks, which exhibit substantially stronger capabilities
in practice. In addition, the complexity of the discrete variant of the problem
(D-NNT in short), in which the parameters are taken from a given finite set of
options, has remained less explored despite its theoretical and practical
significance.
In this work, we show that the hardness of NNT is dramatically affected by
the network depth. Specifically, we show that, under standard complexity
assumptions, D-NNT is not in the complexity class NP even for instances with
fixed dimensions and dataset size, having a deep architecture. This separates
D-NNT from any NP-complete problem. Furthermore, using a polynomial reduction
we show that the above result also holds for C-NNT, albeit with more structured
instances. We complement these results with a comprehensive list of NP-hardness
lower bounds for D-NNT on two-layer networks, showing that fixing the number of
dimensions, the dataset size, or the number of neurons in the hidden layer
leaves the problem challenging. Finally, we obtain a pseudo-polynomial
algorithm for D-NNT on a two-layer network with a fixed dataset size.
[COMMENTS]
Accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13057v1
[DATE]
2024-12-18 00:20:29+08:00
[CATEGORIES]
cs.LG
SMOSE: Sparse Mixture of Shallow Experts for Interpretable Reinforcement Learning in Continuous Control Tasks
[AUTHORS]
Mátyás Vincze, Laura Ferrarotti, Leonardo Lucio Custode, Bruno Lepri, Giovanni Iacca
[ABSTRACT]
Continuous control tasks often involve high-dimensional, dynamic, and
non-linear environments. State-of-the-art performance in these tasks is
achieved through complex closed-box policies that are effective, but suffer
from an inherent opacity. Interpretable policies, while generally
underperforming compared to their closed-box counterparts, advantageously
facilitate transparent decision-making within automated systems. Hence, their
usage is often essential for diagnosing and mitigating errors, supporting
ethical and legal accountability, and fostering trust among stakeholders. In
this paper, we propose SMOSE, a novel method to train sparsely activated
interpretable controllers, based on a top-1 Mixture-of-Experts architecture.
SMOSE combines a set of interpretable decision-makers, trained to be experts in
different basic skills, and an interpretable router that assigns tasks among
the experts. The training is carried out via state-of-the-art Reinforcement
Learning algorithms, exploiting load-balancing techniques to ensure fair expert
usage. We then distill decision trees from the weights of the router,
significantly improving the ease of interpretation. We evaluate SMOSE on six
benchmark environments from MuJoCo: our method outperforms recent interpretable
baselines and narrows the gap with noninterpretable state-of-the-art algorithms.
[COMMENTS]
To be published in the Proceedings of the 39th AAAI Conference on
Artificial Intelligence (AAAI-25)
[LINK]
http://arxiv.org/abs/2412.13053v1
[DATE]
2024-12-18 00:15:04+08:00
[CATEGORIES]
cs.LG
WHAT-IF: Exploring Branching Narratives by Meta-Prompting Large Language Models
[AUTHORS]
Runsheng “Anson” Huang, Lara J. Martin, Chris Callison-Burch
[ABSTRACT]
WHAT-IF – Writing a Hero’s Alternate Timeline through Interactive Fiction –
is a system that uses zero-shot meta-prompting to create branching narratives
from a prewritten story. Played as an interactive fiction (IF) game, WHAT-IF
lets the player choose between decisions that the large language model (LLM)
GPT-4 generates as possible branches in the story. Starting with an existing
linear plot as input, a branch is created at each key decision taken by the
main character. By meta-prompting the LLM to consider the major plot points
from the story, the system produces coherent and well-structured alternate
storylines. WHAT-IF stores the branching plot tree in a graph which helps it to
both keep track of the story for prompting and maintain the structure for the
final IF system. A video demo of our system can be found here:
https://youtu.be/8vBqjqtupcc.
[LINK]
http://arxiv.org/abs/2412.10582v2
[DATE]
2024-12-17 23:56:50+08:00
[CATEGORIES]
cs.CL
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
[AUTHORS]
Shuting Wang, Jiejun Tan, Zhicheng Dou, Ji-Rong Wen
[ABSTRACT]
As a typical and practical application of Large Language Models (LLMs),
Retrieval-Augmented Generation (RAG) techniques have gained extensive
attention, particularly in vertical domains where LLMs may lack domain-specific
knowledge. In this paper, we introduce an omnidirectional and automatic RAG
benchmark, OmniEval, in the financial domain. Our benchmark is characterized by
its multi-dimensional evaluation framework, including (1) a matrix-based RAG
scenario evaluation system that categorizes queries into five task classes and
16 financial topics, leading to a structured assessment of diverse query
scenarios; (2) a multi-dimensional evaluation data generation approach, which
combines GPT-4-based automatic generation and human annotation, achieving an
87.47\% acceptance ratio in human evaluations on generated instances; (3) a
multi-stage evaluation system that evaluates both retrieval and generation
performance, resulting in a comprehensive evaluation of the RAG pipeline; and (4)
robust evaluation metrics derived from rule-based and LLM-based ones, enhancing
the reliability of assessments through manual annotations and supervised
fine-tuning of an LLM evaluator. Our experiments demonstrate the
comprehensiveness of OmniEval, which includes extensive test datasets and
highlights the performance variations of RAG systems across diverse topics and
tasks, revealing significant opportunities for RAG models to improve their
capabilities in vertical domains. We open-source the code of our benchmark at
https://github.com/RUC-NLPIR/OmniEval.
[LINK]
http://arxiv.org/abs/2412.13018v1
[DATE]
2024-12-17 23:38:42+08:00
[CATEGORIES]
cs.CL
Towards Reliable Latent Knowledge Estimation in LLMs: Zero-Prompt Many-Shot Based Factual Knowledge Extraction
[AUTHORS]
Qinyuan Wu, Mohammad Aflah Khan, Soumi Das, Vedant Nanda, Bishwamittra Ghosh, Camila Kolling, Till Speicher, Laurent Bindschaedler, Krishna P. Gummadi, Evimaria Terzi
[ABSTRACT]
In this paper, we focus on the challenging task of reliably estimating
factual knowledge that is embedded inside large language models (LLMs). To
avoid reliability concerns with prior approaches, we propose to eliminate
prompt engineering when probing LLMs for factual knowledge. Our approach,
called Zero-Prompt Latent Knowledge Estimator (ZP-LKE), leverages the
in-context learning ability of LLMs to communicate both the factual knowledge
question as well as the expected answer format. Our knowledge estimator is both
conceptually simpler (i.e., doesn’t depend on meta-linguistic judgments of
LLMs) and easier to apply (i.e., is not LLM-specific), and we demonstrate that
it can surface more of the latent knowledge embedded in LLMs. We also
investigate how different design choices affect the performance of ZP-LKE.
Using the proposed estimator, we perform a large-scale evaluation of the
factual knowledge of a variety of open-source LLMs, like OPT, Pythia, Llama(2),
Mistral, Gemma, etc. over a large set of relations and facts from the Wikidata
knowledge base. We observe that factual knowledge differs between model
families and between models of different sizes, that some relations are
consistently better known than others even though models differ in the precise
facts they know, and that base models differ in knowledge from their
finetuned counterparts. Code available at:
https://github.com/QinyuanWu0710/ZeroPrompt_LKE
[LINK]
http://arxiv.org/abs/2404.12957v2
[DATE]
2024-12-17 23:38:23+08:00
[CATEGORIES]
cs.CL
cs.LG
RCLMuFN: Relational Context Learning and Multiplex Fusion Network for Multimodal Sarcasm Detection
[AUTHORS]
Tongguan Wang, Junkai Li, Guixin Su, Yongcheng Zhang, Dongyu Su, Yuxue Hu, Ying Sha
[ABSTRACT]
Sarcasm typically conveys emotions of contempt or criticism by expressing a
meaning that is contrary to the speaker’s true intent. Accurate detection of
sarcasm aids in identifying and filtering undesirable information on the
Internet, thereby reducing malicious defamation and rumor-mongering.
Nonetheless, the task of automatic sarcasm detection remains highly challenging
for machines, as it critically depends on intricate factors such as relational
context. Most existing multimodal sarcasm detection methods focus on
introducing graph structures to establish entity relationships between text and
images while neglecting to learn the relational context between text and
images, which is crucial evidence for understanding the meaning of sarcasm. In
addition, the meaning of sarcasm changes with the evolution of different
contexts, but existing methods may not be accurate in modeling such dynamic
changes, limiting the generalization ability of the models. To address the
above issues, we propose a relational context learning and multiplex fusion
network (RCLMuFN) for multimodal sarcasm detection. Firstly, we employ four
feature extractors to comprehensively extract features from raw text and
images, aiming to excavate potential features that may have been previously
overlooked. Secondly, we utilize the relational context learning module to
learn the contextual information of text and images and capture the dynamic
properties through shallow and deep interactions. Finally, we employ a
multiplex feature fusion module to enhance the generalization of the model by
deeply integrating multimodal features derived from various interaction
contexts. Extensive experiments on two multimodal sarcasm detection datasets
show that our proposed method achieves state-of-the-art performance.
[LINK]
http://arxiv.org/abs/2412.13008v1
[DATE]
2024-12-17 23:29:31+08:00
[CATEGORIES]
cs.CL
Enabling Low-Resource Language Retrieval: Establishing Baselines for Urdu MS MARCO
[AUTHORS]
Umer Butt, Stalin Veranasi, Günter Neumann
[ABSTRACT]
As the Information Retrieval (IR) field increasingly recognizes the
importance of inclusivity, addressing the needs of low-resource languages
remains a significant challenge. This paper introduces the first large-scale
Urdu IR dataset, created by translating the MS MARCO dataset through machine
translation. We establish baseline results through zero-shot learning for IR in
Urdu and subsequently apply the mMARCO multilingual IR methodology to this
newly translated dataset. Our findings demonstrate that the fine-tuned model
(Urdu-mT5-mMARCO) achieves a Mean Reciprocal Rank (MRR@10) of 0.247 and a
Recall@10 of 0.439, representing significant improvements over zero-shot
results and showing the potential for expanding IR access for Urdu speakers. By
bridging access gaps for speakers of low-resource languages, this work not only
advances multilingual IR research but also emphasizes the ethical and societal
importance of inclusive IR technologies. This work provides valuable insights
into the challenges and solutions for improving language representation and
lays the groundwork for future research, especially in South Asian languages,
which can benefit from the adaptable methods used in this study.
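For readers unfamiliar with the two metrics reported above, the following is a
minimal sketch of how MRR@10 and Recall@10 are conventionally computed. The
function names, the data layout, and the toy rankings are illustrative
assumptions; they are not taken from the paper's evaluation code.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean over queries of 1/rank of the first relevant document in the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0


def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 10) -> float:
    """Mean over queries of the fraction of relevant documents found in the top k."""
    scores = []
    for qid, ranked in rankings.items():
        rel = relevant.get(qid, set())
        if not rel:
            continue  # queries without relevance judgments are skipped
        scores.append(len(rel & set(ranked[:k])) / len(rel))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: q1 finds its relevant passage at rank 2, q2 misses.
rankings = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9"]}
relevant = {"q1": {"d1"}, "q2": {"d5"}}
mrr = mrr_at_k(rankings, relevant)
recall = recall_at_k(rankings, relevant)
```

In this toy case the first query contributes a reciprocal rank of 1/2 and the
second contributes 0, so both metrics are averages over two queries.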
[COMMENTS]
6 pages, ECIR 2025, conference submission version
[LINK]
http://arxiv.org/abs/2412.12997v1
[DATE]
2024-12-17 23:21:28+08:00
[CATEGORIES]
cs.CL
Unlocking LLMs: Addressing Scarce Data and Bias Challenges in Mental Health
[AUTHORS]
Vivek Kumar, Eirini Ntoutsi, Pushpraj Singh Rajawat, Giacomo Medda, Diego Reforgiato Recupero
[ABSTRACT]
Large language models (LLMs) have shown promising capabilities in healthcare
analysis but face several challenges like hallucinations, parroting, and bias
manifestation. These challenges are exacerbated in complex, sensitive, and
low-resource domains. Therefore, in this work we introduce IC-AnnoMI, an
expert-annotated motivational interviewing (MI) dataset built upon AnnoMI by
generating in-context conversational dialogues leveraging LLMs, particularly
ChatGPT. IC-AnnoMI employs targeted prompts accurately engineered through cues
and tailored information, taking into account therapy style (empathy,
reflection), contextual relevance, and false semantic change. Subsequently, the
dialogues are annotated by experts, strictly adhering to the Motivational
Interviewing Skills Code (MISC), focusing on both the psychological and
linguistic dimensions of MI dialogues. We comprehensively evaluate the
IC-AnnoMI dataset and ChatGPT’s emotional reasoning ability and understanding
of domain intricacies by modeling novel classification tasks employing several
classical machine learning and current state-of-the-art transformer approaches.
Finally, we discuss the effects of progressive prompting strategies and the
impact of augmented data in mitigating the biases manifested in IC-AnnoMI. Our
contributions provide the MI community with not only a comprehensive dataset
but also valuable insights for using LLMs in empathetic text generation for
conversational therapy in supervised settings.
[COMMENTS]
International Conference on Natural Language Processing and
Artificial Intelligence for Cyber Security (NLPAICS) 2024
[LINK]
http://arxiv.org/abs/2412.12981v1
[DATE]
2024-12-17 23:01:07+08:00
[CATEGORIES]
cs.CL
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
[AUTHORS]
Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaosheng Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin
[ABSTRACT]
Multimodal Large Language Models (MLLMs) have achieved remarkable success in
vision understanding, reasoning, and interaction. However, the inference
computation and memory increase progressively with the generation of output
tokens during decoding, directly affecting the efficacy of MLLMs. Existing
methods attempt to reduce the vision context redundancy to achieve efficient
MLLMs. Unfortunately, the efficiency benefits of the vision context reduction
in the prefill stage gradually diminish during the decoding stage. To address
this problem, we propose a dynamic vision-language context sparsification
framework Dynamic-LLaVA, which dynamically reduces the redundancy of vision
context in the prefill stage and decreases the memory and computation overhead
of the generated language context during decoding. Dynamic-LLaVA designs a
tailored sparsification inference scheme for different inference modes, i.e.,
prefill, decoding with and without KV cache, to achieve efficient inference of
MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by
$\sim$75\% in the prefill stage. Meanwhile, throughout the entire generation
process of MLLMs, Dynamic-LLaVA reduces computation consumption by $\sim$50\%
under decoding without KV cache, while saving $\sim$50\% GPU memory overhead
when decoding with KV cache, due to the vision-language context sparsification.
Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient
inference for MLLMs with negligible understanding and generation ability
degradation or even performance gains compared to the full-context inference
baselines. Code is available at https://github.com/Osilly/dynamic_llava .
[COMMENTS]
Code is available at https://github.com/Osilly/dynamic_llava
[LINK]
http://arxiv.org/abs/2412.00876v3
[DATE]
2024-12-17 22:45:12+08:00
[CATEGORIES]
cs.CL
cs.LG
Adaptations of AI models for querying the LandMatrix database in natural language
[AUTHORS]
Fatiha Ait Kbir, Jérémy Bourgoin, Rémy Decoupes, Marie Gradeler, Roberto Interdonato
[ABSTRACT]
The Land Matrix initiative (https://landmatrix.org) and its global
observatory aim to provide reliable data on large-scale land acquisitions to
inform debates and actions in sectors such as agriculture, extraction, or
energy in low- and middle-income countries. Although these data are recognized
in the academic world, they remain underutilized in public policy, mainly due
to the complexity of access and exploitation, which requires technical
expertise and a good understanding of the database schema.
The objective of this work is to simplify access to data from different
database systems. The methods proposed in this article are evaluated using data
from the Land Matrix. This work presents various comparisons of Large Language
Models (LLMs) as well as combinations of LLM adaptations (Prompt Engineering,
RAG, Agents) to query different database systems (GraphQL and REST queries).
The experiments are reproducible, and a demonstration is available online:
https://github.com/tetis-nlp/landmatrix-graphql-python.
[LINK]
http://arxiv.org/abs/2412.12961v1
[DATE]
2024-12-17 22:44:27+08:00
[CATEGORIES]
cs.CL
Learning from Noisy Labels via Self-Taught On-the-Fly Meta Loss Rescaling
[AUTHORS]
Michael Heck, Christian Geishauser, Nurul Lubis, Carel van Niekerk, Shutong Feng, Hsien-Chin Lin, Benjamin Matthias Ruppik, Renato Vukovic, Milica Gašić
[ABSTRACT]
Correct labels are indispensable for training effective machine learning
models. However, creating high-quality labels is expensive, and even
professionally labeled data contains errors and ambiguities. Filtering and
denoising can be applied to curate labeled data prior to training, at the cost
of additional processing and loss of information. An alternative is on-the-fly
sample reweighting during the training process to decrease the negative impact
of incorrect or ambiguous labels, but this typically requires clean seed data.
In this work we propose unsupervised on-the-fly meta loss rescaling to reweight
training samples. Crucially, we rely only on features provided by the model
being trained, to learn a rescaling function in real time without knowledge of
the true clean data distribution. We achieve this via a novel meta learning
setup that samples validation data for the meta update directly from the noisy
training corpus by employing the rescaling function being trained. Our proposed
method consistently improves performance across various NLP tasks with minimal
computational overhead. Further, we are among the first to attempt on-the-fly
training data reweighting on the challenging task of dialogue modeling, where
noisy and ambiguous labels are common. Our strategy is robust in the face of
noisy and clean data, handles class imbalance, and prevents overfitting to
noisy labels. Our self-taught loss rescaling improves as the model trains,
showing the ability to keep learning from the model’s own signals. As training
progresses, the impact of correctly labeled data is scaled up, while the impact
of wrongly labeled data is suppressed.
[COMMENTS]
10 pages, 3 figures, accepted at AAAI’25
[LINK]
http://arxiv.org/abs/2412.12955v1
[DATE]
2024-12-17 22:37:50+08:00
[CATEGORIES]
cs.CL
MOPO: Multi-Objective Prompt Optimization for Affective Text Generation
[AUTHORS]
Yarik Menchaca Resendiz, Roman Klinger
[ABSTRACT]
How emotions are expressed depends on the context and domain. On X (formerly
Twitter), for instance, an author might simply use the hashtag #anger, while in
a news headline, emotions are typically written in a more polite, indirect
manner. To enable conditional text generation models to create emotionally
connotated texts that fit a domain, users need to have access to a parameter
that allows them to choose the appropriate way to express an emotion. To
achieve this, we introduce MOPO, a Multi-Objective Prompt Optimization
methodology. MOPO optimizes prompts according to multiple objectives (which
correspond here to the output probabilities assigned by emotion classifiers
trained for different domains). In contrast to single objective optimization,
MOPO outputs a set of prompts, each with a different weighting of the multiple
objectives. Users can then choose the most appropriate prompt for their
context. We evaluate MOPO using three objectives, determined by various
domain-specific emotion classifiers. MOPO improves performance by up to 15 pp
across all objectives with a minimal loss (1-2 pp) for any single objective
compared to single-objective optimization. These minor performance losses are
offset by a broader generalization across multiple objectives - which is not
possible with single-objective optimization. Additionally, MOPO reduces
computational requirements by simultaneously optimizing for multiple
objectives, eliminating separate optimization procedures for each objective.
[COMMENTS]
accepted to COLING 2025
[LINK]
http://arxiv.org/abs/2412.12948v1
[DATE]
2024-12-17 22:28:14+08:00
[CATEGORIES]
cs.CL
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
[AUTHORS]
Qi Sun, Pengfei Hong, Tej Deep Pala, Vernon Toh, U-Xuan Tan, Deepanway Ghosal, Soujanya Poria
[ABSTRACT]
Traditional reinforcement learning-based robotic control methods are often
task-specific and fail to generalize across diverse environments or unseen
objects and instructions. Visual Language Models (VLMs) demonstrate strong
scene understanding and planning capabilities but lack the ability to generate
actionable policies tailored to specific robotic embodiments. To address this,
Visual-Language-Action (VLA) models have emerged, yet they face challenges in
long-horizon spatial reasoning and grounded task planning. In this work, we
propose the Embodied Multimodal Action Model with Grounded Chain of Thought and
Look-ahead Spatial Reasoning, Emma-X. Emma-X leverages our constructed
hierarchical embodiment dataset based on BridgeV2, containing 60,000 robot
manipulation trajectories auto-annotated with grounded task reasoning and
spatial guidance. Additionally, we introduce a trajectory segmentation strategy
based on gripper states and motion trajectories, which can help mitigate
hallucination in grounding subtask reasoning generation. Experimental results
demonstrate that Emma-X achieves superior performance over competitive
baselines, particularly in real-world robotic tasks requiring spatial
reasoning.
[COMMENTS]
https://github.com/declare-lab/Emma-X,
https://huggingface.co/declare-lab/Emma-X
[LINK]
http://arxiv.org/abs/2412.11974v2
[DATE]
2024-12-17 22:12:56+08:00
[CATEGORIES]
cs.CL
CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG
[AUTHORS]
Boyi Deng, Wenjie Wang, Fengbin Zhu, Qifan Wang, Fuli Feng
[ABSTRACT]
Retrieval-Augmented Generation (RAG) can alleviate hallucinations of Large
Language Models (LLMs) by referencing external documents. However, the
misinformation in external documents may mislead LLMs’ generation. To address
this issue, we explore the task of “credibility-aware RAG”, in which LLMs
automatically adjust the influence of retrieved documents based on their
credibility scores to counteract misinformation. To this end, we introduce a
plug-and-play method named Credibility-aware Attention
Modification (CrAM). CrAM identifies influential attention heads in
LLMs and adjusts their attention weights based on the credibility of the
documents, thereby reducing the impact of low-credibility documents.
Experiments on Natural Questions and TriviaQA using Llama2-13B, Llama3-8B, and
Qwen1.5-7B show that CrAM improves the RAG performance of LLMs against
misinformation pollution by over 20%, even surpassing supervised fine-tuning
methods.
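The abstract does not give the exact adjustment rule, so the following is only
a simplified sketch of the general idea of credibility-based attention
reweighting: each token's attention weight is scaled by the credibility score
of the document it comes from, and the weights are renormalized. The function
name, the scores in [0, 1], and the fallback for an all-zero result are
assumptions made for illustration; the selection of influential heads is
omitted.

```python
from typing import List


def reweight_attention(attn: List[float], credibility: List[float]) -> List[float]:
    """Scale attention weights by per-token credibility and renormalize to sum to 1.

    Assumed simplification, not the paper's exact formula.
    """
    if len(attn) != len(credibility):
        raise ValueError("attn and credibility must have the same length")
    scaled = [a * c for a, c in zip(attn, credibility)]
    total = sum(scaled)
    if total == 0.0:
        # Every token was zeroed out; fall back to the original weights.
        return list(attn)
    return [s / total for s in scaled]


# Hypothetical example: the middle token belongs to a zero-credibility document.
weights = reweight_attention([0.5, 0.3, 0.2], [1.0, 0.0, 1.0])
```

After reweighting, the low-credibility token receives no attention and its
share is redistributed proportionally to the remaining tokens.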
[COMMENTS]
AAAI25 camera-ready
[LINK]
http://arxiv.org/abs/2406.11497v3
[DATE]
2024-12-17 22:11:19+08:00
[CATEGORIES]
cs.CL
Truthful Text Sanitization Guided by Inference Attacks
[AUTHORS]
Ildikó Pilán, Benet Manzanares-Salor, David Sánchez, Pierre Lison
[ABSTRACT]
The purpose of text sanitization is to rewrite those text spans in a document
that may directly or indirectly identify an individual, to ensure they no
longer disclose personal information. Text sanitization must strike a balance
between preventing the leakage of personal information (privacy protection)
and retaining as much of the document’s original content as possible
(utility preservation). We present an automated text sanitization strategy
based on generalizations, which are more abstract (but still informative) terms
that subsume the semantic content of the original text spans. The approach
relies on instruction-tuned large language models (LLMs) and is divided into
two stages. The LLM is first applied to obtain truth-preserving replacement
candidates and rank them according to their abstraction level. Those candidates
are then evaluated for their ability to protect privacy by conducting inference
attacks with the LLM. Finally, the system selects the most informative
replacement shown to be resistant to those attacks. As a consequence of this
two-stage process, the chosen replacements effectively balance utility and
privacy. We also present novel metrics to automatically evaluate these two
aspects without the need to manually annotate data. Empirical results on the
Text Anonymization Benchmark show that the proposed approach leads to enhanced
utility, with only a marginal increase in the risk of re-identifying protected
individuals compared to fully suppressing the original information.
Furthermore, the selected replacements are shown to be more truth-preserving
and abstractive than previous methods.
[LINK]
http://arxiv.org/abs/2412.12928v1
[DATE]
2024-12-17 22:07:01+08:00
[CATEGORIES]
cs.CL
An Agentic Approach to Automatic Creation of P&ID Diagrams from Natural Language Descriptions
[AUTHORS]
Shreeyash Gowaikar, Srinivasan Iyengar, Sameer Segal, Shivkumar Kalyanaraman
[ABSTRACT]
Piping and Instrumentation Diagrams (P&IDs) are foundational to the
design, construction, and operation of workflows in the engineering and process
industries. However, their manual creation is often labor-intensive and
error-prone, and it lacks robust mechanisms for error detection and correction.
While recent advancements in Generative AI, particularly Large Language Models
(LLMs) and Vision-Language Models (VLMs), have demonstrated significant
potential across various domains, their application in automating generation of
engineering workflows remains underexplored. In this work, we introduce a novel
copilot for automating the generation of P&IDs from natural language
descriptions. Leveraging a multi-step agentic workflow, our copilot provides a
structured and iterative approach to diagram creation directly from Natural
Language prompts. We demonstrate the feasibility of the generation process by
evaluating the soundness and completeness of the workflow, and show improved
results compared to vanilla zero-shot and few-shot generation approaches.
[COMMENTS]
Accepted at the AAAI’25 Workshop on AI to Accelerate Science and
Engineering (AI2ASE)
[LINK]
http://arxiv.org/abs/2412.12898v1
[DATE]
2024-12-17 21:21:26+08:00
[CATEGORIES]
cs.LG
cs.CL
RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement
[AUTHORS]
Jinhao Jiang, Jiayi Chen, Junyi Li, Ruiyang Ren, Shijie Wang, Wayne Xin Zhao, Yang Song, Tao Zhang
[ABSTRACT]
Existing large language models (LLMs) show exceptional problem-solving
capabilities but might struggle with complex reasoning tasks. Despite the
successes of chain-of-thought and tree-based search methods, they mainly depend
on the internal knowledge of LLMs to search over intermediate reasoning steps,
which limits them to simple tasks involving fewer reasoning steps. In this
paper, we propose RAG-Star, a novel RAG approach that integrates the
retrieved information to guide the tree-based deliberative reasoning process
that relies on the inherent knowledge of LLMs. By leveraging Monte Carlo Tree
Search, RAG-Star iteratively plans intermediate sub-queries and answers for
reasoning based on the LLM itself. To consolidate internal and external
knowledge, we propose a retrieval-augmented verification that utilizes query-
and answer-aware reward modeling to provide feedback for the inherent reasoning
of LLMs. Our experiments involving Llama-3.1-8B-Instruct and GPT-4o demonstrate
that RAG-Star significantly outperforms previous RAG and reasoning methods.
[COMMENTS]
LLM;RAG;MCTS
[LINK]
http://arxiv.org/abs/2412.12881v1
[DATE]
2024-12-17 21:05:36+08:00
[CATEGORIES]
cs.CL
Fine-tuning Large Language Models for Domain-specific Machine Translation
[AUTHORS]
Jiawei Zheng, Hanghai Hong, Feiyan Liu, Xiaoli Wang, Jingsong Su, Yonggui Liang, Shikai Wu
[ABSTRACT]
Large language models (LLMs) have shown great potential in domain-specific
machine translation (MT). However, one major issue is that LLMs pre-trained on
general domain corpus might not generalize well to specific domains due to the
lack of domain-specific knowledge. To address this issue, this paper focuses on
enhancing the domain-specific MT capability of LLMs, by providing high-quality
training datasets and proposing a novel fine-tuning framework denoted by
DragFT. DragFT augments LLMs via three techniques: (i) Dictionary-enhanced
prompting integrates dictionary information into prompts to improve the
translation of domain-specific terminology; (ii) RAG-based few-shot example
selection provides high-quality examples that simulate both the domain and
style characteristics; (iii) Fine-tuning with few-shot examples further
enhances performance when using in-domain examples. We deploy DragFT on three
well-known LLM backbones with 13B training parameters to validate its
effectiveness. The results on three domain-specific datasets show that DragFT
achieves a significant performance boost and shows superior performance
compared to advanced models such as GPT-3.5 and GPT-4o. The drastic performance
improvement of DragFT over existing LLMs can be attributed to incorporating
relevant knowledge while mitigating noise.
[COMMENTS]
13 pages, 5 figures, 9 tables
[LINK]
http://arxiv.org/abs/2402.15061v2
[DATE]
2024-12-17 20:45:20+08:00
[CATEGORIES]
cs.CL
cs.LG
DISC: Plug-and-Play Decoding Intervention with Similarity of Characters for Chinese Spelling Check
[AUTHORS]
Ziheng Qiao, Houquan Zhou, Yumeng Liu, Zhenghua Li, Min Zhang, Bo Zhang, Chen Li, Ji Zhang, Fei Huang
[LINK]
http://arxiv.org/abs/2412.12863v1
[DATE]
2024-12-17 20:44:06+08:00
[CATEGORIES]
cs.CL
Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
[AUTHORS]
Zhu Xu, Zhiqiang Zhao, Zihan Zhang, Yuchi Liu, Quanwei Shen, Fei Liu, Yu Kuang, Jian He, Conglin Liu
[LINK]
http://arxiv.org/abs/2411.17679v3
[DATE]
2024-12-17 20:37:47+08:00
[CATEGORIES]
cs.CL
Selective Shot Learning for Code Explanation
[AUTHORS]
Paheli Bhattacharya, Rishabh Gupta
[ABSTRACT]
Code explanation plays a crucial role in the software engineering domain,
aiding developers in grasping code functionality efficiently. Recent work shows
that the performance of LLMs for code explanation improves in a few-shot
setting, especially when the few-shot examples are selected intelligently.
State-of-the-art approaches for such Selective Shot Learning (SSL) include
token-based and embedding-based methods. However, these SSL approaches have
been evaluated on proprietary LLMs, without much exploration on open-source
Code-LLMs. Additionally, these methods lack consideration for programming
language syntax. To bridge these gaps, we present a comparative study and
propose a novel SSL method (SSL_ner) that utilizes entity information for
few-shot example selection. We present several insights and show the
effectiveness of the SSL_ner approach over state-of-the-art methods across two
datasets. To the best of our knowledge, this is the first systematic
benchmarking of open-source Code-LLMs while assessing the performance of
various few-shot example selection approaches for the code explanation task.
[LINK]
http://arxiv.org/abs/2412.12852v1
[DATE]
2024-12-17 20:26:14+08:00
[CATEGORIES]
cs.CL
Towards Reliable Detection of LLM-Generated Texts: A Comprehensive Evaluation Framework with CUDRT
[AUTHORS]
Zhen Tao, Yanfang Chen, Dinghao Xi, Zhiyu Li, Wei Xu
[ABSTRACT]
The increasing prevalence of large language models (LLMs) has significantly
advanced text generation, but the human-like quality of LLM outputs presents
major challenges in reliably distinguishing between human-authored and
LLM-generated texts. Existing detection benchmarks are constrained by their
reliance on static datasets, scenario-specific tasks (e.g., question answering
and text refinement), and a primary focus on English, overlooking the diverse
linguistic and operational subtleties of LLMs. To address these gaps, we
propose CUDRT, a comprehensive evaluation framework and bilingual benchmark in
Chinese and English, categorizing LLM activities into five key operations:
Create, Update, Delete, Rewrite, and Translate. CUDRT provides extensive
datasets tailored to each operation, featuring outputs from state-of-the-art
LLMs to assess the reliability of LLM-generated text detectors. This framework
supports scalable, reproducible experiments and enables in-depth analysis of
how operational diversity, multilingual training sets, and LLM architectures
influence detection performance. Our extensive experiments demonstrate the
framework’s capacity to optimize detection systems, providing critical insights
to enhance reliability, cross-linguistic adaptability, and detection accuracy.
By advancing robust methodologies for identifying LLM-generated texts, this
work contributes to the development of intelligent systems capable of meeting
real-world multilingual detection challenges. Source code and dataset are
available at GitHub.
[COMMENTS]
30 pages
[LINK]
http://arxiv.org/abs/2406.09056v3
[DATE]
2024-12-17 20:20:34+08:00
[CATEGORIES]
cs.CL
Benchmarking and Understanding Compositional Relational Reasoning of LLMs
[AUTHORS]
Ruikang Ni, Da Xiao, Qingye Meng, Xiangyu Li, Shihui Zheng, Hongliang Liang
[ABSTRACT]
Compositional relational reasoning (CRR) is a hallmark of human intelligence,
but we lack a clear understanding of whether and how existing transformer large
language models (LLMs) can solve CRR tasks. To enable systematic exploration of
the CRR capability of LLMs, we first propose a new synthetic benchmark called
Generalized Associative Recall (GAR) by integrating and generalizing the
essence of several tasks in mechanistic interpretability (MI) study in a
unified framework. Evaluation shows that GAR is challenging enough for existing
LLMs, revealing their fundamental deficiency in CRR. Meanwhile, it is easy
enough for systematic MI study. Then, to understand how LLMs solve GAR tasks,
we use attribution patching to discover the core circuits reused by Vicuna-33B
across different tasks and a set of vital attention heads. Intervention
experiments show that the correct functioning of these heads significantly
impacts task performance. In particular, we identify two classes of heads whose
activations represent the abstract notion of true and false in GAR tasks
respectively. They play a fundamental role in CRR across various models and
tasks. The dataset and code are available at https://github.com/Caiyun-AI/GAR.
[COMMENTS]
Accepted to the 39th Annual AAAI Conference on Artificial
Intelligence (AAAI-25)
[LINK]
http://arxiv.org/abs/2412.12841v1
[DATE]
2024-12-17 20:10:38+08:00
[CATEGORIES]
cs.CL
cs.LG
DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models
[AUTHORS]
Jinxiang Xie, Yilin Li, Xunjian Yin, Xiaojun Wan
[ABSTRACT]
Evaluating the performance of Grammatical Error Correction (GEC) models has
become increasingly challenging, as large language model (LLM)-based GEC
systems often produce corrections that diverge from provided gold references.
This discrepancy undermines the reliability of traditional reference-based
evaluation metrics. In this study, we propose a novel evaluation framework for
GEC models, DSGram, integrating Semantic Coherence, Edit Level, and Fluency,
and utilizing a dynamic weighting mechanism. Our framework employs the Analytic
Hierarchy Process (AHP) in conjunction with large language models to ascertain
the relative importance of various evaluation criteria. Additionally, we
develop a dataset incorporating human annotations and LLM-simulated sentences
to validate our algorithms and fine-tune more cost-effective models.
Experimental results indicate that our proposed approach enhances the
effectiveness of GEC model evaluations.
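As background on the Analytic Hierarchy Process mentioned above, the sketch
below derives criterion weights from a pairwise comparison matrix using the
common geometric-mean approximation. The three criteria are those named in the
abstract, but the comparison values are hypothetical, and this is not the
paper's dynamic weighting procedure.

```python
import math
from typing import List


def ahp_weights(pairwise: List[List[float]]) -> List[float]:
    """Approximate AHP priority weights via normalized row geometric means."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical judgments: entry [i][j] states how much more important
# criterion i is than criterion j (reciprocal matrix, values assumed).
criteria = ["Semantic Coherence", "Edit Level", "Fluency"]
pairwise = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(pairwise)
```

Because this example matrix is perfectly consistent, the resulting weights are
proportional to 4 : 2 : 1 and sum to one.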
[COMMENTS]
Extended version of a paper to appear in AAAI-25
[LINK]
http://arxiv.org/abs/2412.12832v1
[DATE]
2024-12-17 19:54:16+08:00
[CATEGORIES]
cs.CL
Revisiting In-context Learning Inference Circuit in Large Language Models
[AUTHORS]
Hakaze Cho, Mariko Kato, Yoshihiro Sakai, Naoya Inoue
[ABSTRACT]
In-context Learning (ICL) is an emerging few-shot learning paradigm on
Language Models (LMs) whose inner mechanisms remain unexplored. Existing works
describe the inner processing of ICL, but they struggle to
capture all the inference phenomena in large language models. Therefore, this
paper proposes a comprehensive circuit to model the inference dynamics and tries
to explain the observed phenomena of ICL. In detail, we divide ICL inference
into 3 major operations: (1) Input Text Encode: LMs encode every input text
(demonstrations and queries) into linear representations in the hidden states
with sufficient information to solve ICL tasks. (2) Semantics Merge: LMs merge
the encoded representations of demonstrations with their corresponding label
tokens to produce joint representations of labels and demonstrations. (3)
Feature Retrieval and Copy: LMs search the joint representations similar to the
query representation on a task subspace, and copy the searched representations
into the query. Then, language model heads capture these copied label
representations to a certain extent and decode them into predicted labels. The
proposed inference circuit successfully captures many phenomena observed during
the ICL process, making it a comprehensive and practical explanation of the ICL
inference process. Moreover, ablation analysis shows that disabling the proposed
steps seriously damages ICL performance, suggesting that the proposed inference
circuit is a dominating mechanism. Additionally, we confirm and list some
bypass mechanisms that solve ICL tasks in parallel with the proposed circuit.
[COMMENTS]
37 pages, 41 figures, 8 tables
[LINK]
http://arxiv.org/abs/2410.04468v2
[DATE]
2024-12-17 19:51:51+08:00
[CATEGORIES]
cs.CL
cs.LG
Detecting Emotional Incongruity of Sarcasm by Commonsense Reasoning
[AUTHORS]
Ziqi Qiu, Jianxing Yu, Yufeng Zhang, Hanjiang Lai, Yanghui Rao, Qinliang Su, Jian Yin
[ABSTRACT]
This paper focuses on sarcasm detection, which aims to identify whether given
statements convey criticism, mockery, or other negative sentiment opposite to
the literal meaning. To detect sarcasm, humans often require a comprehensive
understanding of the semantics in the statement and even resort to external
commonsense to infer the fine-grained incongruity. However, existing methods
lack commonsense inferential ability when they face complex real-world
scenarios, leading to unsatisfactory performance. To address this problem, we
propose a novel framework for sarcasm detection, which conducts incongruity
reasoning based on commonsense augmentation, called EICR. Concretely, we first
employ retrieval-augmented large language models to supplement the missing but
indispensable commonsense background knowledge. To capture complex contextual
associations, we construct a dependency graph and obtain the optimized topology
via graph refinement. We further introduce an adaptive reasoning skeleton that
integrates prior rules to extract sentiment-inconsistent subgraphs explicitly.
To eliminate the possible spurious relations between words and labels, we
employ adversarial contrastive learning to enhance the robustness of the
detector. Experiments conducted on five datasets demonstrate the effectiveness
of EICR.
[LINK]
http://arxiv.org/abs/2412.12808v1
[DATE]
2024-12-17 19:25:55+08:00
[CATEGORIES]
cs.CL
Cross-Dialect Information Retrieval: Information Access in Low-Resource and High-Variance Languages
[AUTHORS]
Robert Litschko, Oliver Kraus, Verena Blaschke, Barbara Plank
[ABSTRACT]
A large amount of local and culture-specific knowledge (e.g., people,
traditions, food) can only be found in documents written in dialects. While
there has been extensive research conducted on cross-lingual information
retrieval (CLIR), the field of cross-dialect retrieval (CDIR) has received
limited attention. Dialect retrieval poses unique challenges due to the limited
availability of resources to train retrieval models and the high variability in
non-standardized languages. We study these challenges on the example of German
dialects and introduce the first German dialect retrieval dataset, dubbed
WikiDIR, which consists of seven German dialects extracted from Wikipedia.
Using WikiDIR, we demonstrate the weakness of lexical methods in dealing with
high lexical variation in dialects. We further show that the commonly used
zero-shot cross-lingual transfer approach with multilingual encoders does not
transfer well to extremely low-resource setups, motivating the need for
resource-lean and dialect-specific retrieval models. We finally demonstrate
that (document) translation is an effective way to reduce the dialect gap in
CDIR.
[COMMENTS]
Accepted at COLING 2025
[LINK]
http://arxiv.org/abs/2412.12806v1
[DATE]
2024-12-17 19:21:09+08:00
[CATEGORIES]
cs.CL
Is it the end of (generative) linguistics as we know it?
[AUTHORS]
Cristiano Chesi
[ABSTRACT]
A significant debate has emerged in response to a paper written by Steven
Piantadosi (Piantadosi, 2023) and uploaded to the LingBuzz platform, the open
archive for generative linguistics. Piantadosi’s dismissal of Chomsky’s
approach is ruthless, but generative linguists deserve it. In this paper, I
will adopt three idealized perspectives – computational, theoretical, and
experimental – to focus on two fundamental issues that lend partial support to
Piantadosi’s critique: (a) the evidence challenging the Poverty of Stimulus
(PoS) hypothesis and (b) the notion of simplicity as conceived within
mainstream Minimalism. In conclusion, I argue that, to reclaim a central role
in language studies, generative linguistics – representing a prototypical
theoretical perspective on language – needs a serious update leading to (i)
more precise, consistent, and complete formalizations of foundational
intuitions and (ii) the establishment and utilization of a standardized dataset
of crucial empirical evidence to evaluate the theory’s adequacy. On the other
hand, ignoring the formal perspective leads to major drawbacks in both
computational and experimental approaches. Neither descriptive nor explanatory
adequacy can be easily achieved without the precise formulation of general
principles that can be challenged empirically.
[LINK]
http://arxiv.org/abs/2412.12797v1
[DATE]
2024-12-17 19:00:34+08:00
[CATEGORIES]
cs.CL
A Survey of Calibration Process for Black-Box LLMs
[AUTHORS]
Liangru Xie, Hui Liu, Jingying Zeng, Xianfeng Tang, Yan Han, Chen Luo, Jing Huang, Zhen Li, Suhang Wang, Qi He
[ABSTRACT]
Large Language Models (LLMs) demonstrate remarkable performance in semantic
understanding and generation, yet accurately assessing their output reliability
remains a significant challenge. While numerous studies have explored
calibration techniques, they primarily focus on White-Box LLMs with accessible
parameters. Black-Box LLMs, despite their superior performance, pose heightened
requirements for calibration techniques due to their API-only interaction
constraints. Although recent research has achieved breakthroughs in
black-box LLM calibration, a systematic survey of these methodologies is still
lacking. To bridge this gap, we present the first comprehensive survey on
calibration techniques for black-box LLMs. We first define the Calibration
Process of LLMs as comprising two interrelated key steps: Confidence Estimation
and Calibration. Second, we conduct a systematic review of applicable methods
within black-box settings, and provide insights on the unique challenges and
connections in implementing these key steps. Furthermore, we explore typical
applications of Calibration Process in black-box LLMs and outline promising
future research directions, providing new perspectives for enhancing
reliability and human-machine alignment. This is our GitHub link:
https://github.com/LiangruXie/Calibration-Process-in-Black-Box-LLMs
[LINK]
http://arxiv.org/abs/2412.12767v1
[DATE]
2024-12-17 18:31:21+08:00
[CATEGORIES]
cs.CL
Revealing the impact of synthetic native samples and multi-tasking strategies in Hindi-English code-mixed humour and sarcasm detection
[AUTHORS]
Debajyoti Mazumder, Aakash Kumar, Jasabanta Patro
[ABSTRACT]
In this paper, we report our experiments with various strategies to improve
code-mixed humour and sarcasm detection. We conducted all of our experiments in
the Hindi-English code-mixed scenario, as we have the linguistic expertise for
the same. We experimented with three approaches, namely (i) native sample mixing,
(ii) multi-task learning (MTL), and (iii) prompting very large multilingual
language models (VMLMs). In native sample mixing, we added monolingual task
samples to code-mixed training sets. In MTL, we relied on native and
code-mixed samples of a semantically related task (hate detection in our case).
Finally, in our third approach, we evaluated the efficacy of VMLMs via few-shot
context prompting. Our main findings are that (i) adding native
samples improved humour (raising the F1-score by up to 6.76%) and sarcasm (raising
the F1-score by up to 8.64%) detection, (ii) training MLMs in an MTL framework
boosted performance for both humour (raising the F1-score by up to 10.67%) and
sarcasm (an increment of up to 12.35% in F1-score) detection, and (iii) prompting
VMLMs could not outperform the other approaches. Finally, our ablation studies
and error analysis identified the cases where our model has yet to improve. We
provide our code for reproducibility.
[COMMENTS]
26 pages; under review
[LINK]
http://arxiv.org/abs/2412.12761v1
[DATE]
2024-12-17 18:26:54+08:00
[CATEGORIES]
cs.CL
Your Next State-of-the-Art Could Come from Another Domain: A Cross-Domain Analysis of Hierarchical Text Classification
[AUTHORS]
Nan Li, Bo Kang, Tijl De Bie
[ABSTRACT]
Text classification with hierarchical labels is a prevalent and challenging
task in natural language processing. Examples include assigning ICD codes to
patient records, tagging patents into IPC classes, assigning EUROVOC
descriptors to European legal texts, and more. Despite its widespread
applications, a comprehensive understanding of state-of-the-art methods across
different domains has been lacking. In this paper, we provide the first
comprehensive cross-domain overview with empirical analysis of state-of-the-art
methods. We propose a unified framework that positions each method within a
common structure to facilitate research. Our empirical analysis yields key
insights and guidelines, confirming the necessity of learning across different
research areas to design effective methods. Notably, under our unified
evaluation pipeline, we achieved new state-of-the-art results by applying
techniques beyond their original domains.
[LINK]
http://arxiv.org/abs/2412.12744v1
[DATE]
2024-12-17 18:08:57+08:00
[CATEGORIES]
cs.CL
cs.LG
SentiQNF: A Novel Approach to Sentiment Analysis Using Quantum Algorithms and Neuro-Fuzzy Systems
[AUTHORS]
Kshitij Dave, Nouhaila Innan, Bikash K. Behera, Zahid Mumtaz, Saif Al-Kuwari, Ahmed Farouk
[ABSTRACT]
Sentiment analysis is an essential component of natural language processing,
used to analyze sentiments, attitudes, and emotional tones in various contexts.
It provides valuable insights into public opinion, customer feedback, and user
experiences. Researchers have developed various classical machine learning and
neuro-fuzzy approaches to address the exponential growth of data and the
complexity of language structures in sentiment analysis. However, these
approaches often fail to determine the optimal number of clusters, interpret
results accurately, handle noise or outliers efficiently, and scale effectively
to high-dimensional data. Additionally, they are frequently insensitive to
input variations. In this paper, we propose a novel hybrid approach for
sentiment analysis called the Quantum Fuzzy Neural Network (QFNN), which
leverages quantum properties and incorporates a fuzzy layer to overcome the
limitations of classical sentiment analysis algorithms. In this study, we test
the proposed approach on two Twitter datasets: the Coronavirus Tweets Dataset
(CVTD) and the General Sentimental Tweets Dataset (GSTD), and compare it with
classical and hybrid algorithms. The results demonstrate that QFNN outperforms
all classical, quantum, and hybrid algorithms, achieving 100% and 90% accuracy
in the case of CVTD and GSTD, respectively. Furthermore, QFNN demonstrates its
robustness against six different noise models, providing the potential to
tackle the computational complexity associated with sentiment analysis on a
large scale in a noisy environment. The proposed approach expedites sentiment
data processing and precisely analyses different forms of textual data, thereby
enhancing sentiment classification and insights associated with sentiment
analysis.
[LINK]
http://arxiv.org/abs/2412.12731v1
[DATE]
2024-12-17 17:54:17+08:00
[CATEGORIES]
cs.CL
WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario
[AUTHORS]
Jiebin Zhang, Eugene J. Yu, Qinyu Chen, Chenhao Xiong, Dawei Zhu, Han Qian, Mingbo Song, Weimin Xiong, Xiaoguang Li, Qun Liu, Sujian Li
[ABSTRACT]
Generating comprehensive and accurate Wikipedia articles for newly emerging
events under a real-world scenario presents significant challenges.
Existing attempts fall short either by focusing only on short snippets or by
using metrics that are insufficient to evaluate real-world scenarios. In this
paper, we construct WIKIGENBENCH, a new benchmark consisting of 1,320 entries,
designed to align with real-world scenarios in both generation and evaluation.
For generation, we explore a real-world scenario where structured, full-length
Wikipedia articles with citations are generated for new events using input
documents from web sources. For evaluation, we integrate systematic metrics and
LLM-based metrics to assess the verifiability, organization, and other aspects
aligned with real-world scenarios. Based on this benchmark, we conduct
extensive experiments using various models within three commonly used
frameworks: direct RAG, hierarchical structure-based RAG, and RAG with a
fine-tuned generation model. Experimental results show that hierarchical-based
methods can generate more comprehensive content, while fine-tuned methods
achieve better verifiability. However, even the best methods still show a
significant gap compared to existing Wikipedia content, indicating that further
research is necessary.
[COMMENTS]
COLING 2025 Camera Ready
[LINK]
http://arxiv.org/abs/2402.18264v2
[DATE]
2024-12-17 17:53:41+08:00
[CATEGORIES]
cs.CL
XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL
[AUTHORS]
Yingqi Gao, Yifu Liu, Xiaoxia Li, Xiaorong Shi, Yin Zhu, Yiming Wang, Shiqi Li, Wei Li, Yuntao Hong, Zhiling Luo, Jinyang Gao, Liyu Mou, Yu Li
[ABSTRACT]
To tackle the challenges of large language model performance in natural
language to SQL tasks, we introduce XiYan-SQL, an innovative framework that
employs a multi-generator ensemble strategy to improve candidate generation. We
introduce M-Schema, a semi-structured schema representation method designed to
enhance the understanding of database structures. To enhance the quality and
diversity of generated candidate SQL queries, XiYan-SQL integrates the
significant potential of in-context learning (ICL) with the precise control of
supervised fine-tuning. On one hand, we propose a series of training strategies
to fine-tune models to generate high-quality candidates with diverse
preferences. On the other hand, we implement the ICL approach with an example
selection method based on named entity recognition to prevent overemphasis on
entities. The refiner optimizes each candidate by correcting logical or
syntactical errors. To address the challenge of identifying the best candidate,
we fine-tune a selection model to distinguish nuances of candidate SQL queries.
The experimental results on multiple dialect datasets demonstrate the
robustness of XiYan-SQL in addressing challenges across different scenarios.
Overall, our proposed XiYan-SQL achieves the state-of-the-art execution
accuracy of 75.63% on the Bird benchmark, 89.65% on the Spider test set, 69.86%
on SQL-Eval, and 41.20% on NL2GQL. The proposed framework not only enhances the
quality and diversity of SQL queries but also outperforms previous methods.
[LINK]
http://arxiv.org/abs/2411.08599v2
[DATE]
2024-12-17 17:45:45+08:00
[CATEGORIES]
cs.CL
cs.LG
Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion
[AUTHORS]
Syed Zohaib Hassan, Pierre Lison, Pål Halvorsen
[ABSTRACT]
Disfluencies are a natural feature of spontaneous human speech but are
typically absent from the outputs of Large Language Models (LLMs). This absence
can diminish the perceived naturalness of synthesized speech, which is an
important criterion when building conversational agents that aim to mimic human
behaviours. We show how the insertion of disfluencies can alleviate this
shortcoming. The proposed approach involves (1) fine-tuning an LLM with
Low-Rank Adaptation (LoRA) to incorporate various types of disfluencies into
LLM-generated utterances and (2) synthesizing those utterances using a
text-to-speech model that supports the generation of speech phenomena such as
disfluencies. We evaluated the quality of the generated speech across two
metrics: intelligibility and perceived spontaneity. We demonstrate through a
user study that the insertion of disfluencies significantly increases the
perceived spontaneity of the generated speech. This increase came, however,
along with a slight reduction in intelligibility.
[COMMENTS]
4 pages short paper, references and appendix are additional
[LINK]
http://arxiv.org/abs/2412.12710v1
[DATE]
2024-12-17 17:25:44+08:00
[CATEGORIES]
cs.CL
More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
[AUTHORS]
Jiebin Zhang, Dawei Zhu, Yifan Song, Wenhao Wu, Chuqiao Kuang, Xiaoguang Li, Lifeng Shang, Qun Liu, Sujian Li
[COMMENTS]
13 pages, 7 figures
[LINK]
http://arxiv.org/abs/2412.12706v1
[DATE]
2024-12-17 17:20:31+08:00
[CATEGORIES]
cs.CL
Trigger$^3$: Refining Query Correction via Adaptive Model Selector
[AUTHORS]
Kepu Zhang, Zhongxiang Sun, Xiao Zhang, Xiaoxue Zang, Kai Zheng, Yang Song, Jun Xu
[ABSTRACT]
In search scenarios, user experience can be hindered by erroneous queries due
to typos, voice errors, or knowledge gaps. Therefore, query correction is
crucial for search engines. Current correction models, usually small models
trained on specific data, often struggle with queries beyond their training
scope or those requiring contextual understanding. While the advent of Large
Language Models (LLMs) offers a potential solution, they are still limited by
their pre-training data and inference cost, particularly for complex queries,
making them not always effective for query correction. To tackle these issues, we
propose Trigger$^3$, a large-small model collaboration framework that
integrates the traditional correction model and LLM for query correction,
capable of adaptively choosing the appropriate correction method based on the
query and the correction results from the traditional correction model and LLM.
Trigger$^3$ first employs a correction trigger to filter out correct queries.
Incorrect queries are then corrected by the traditional correction model. If
this fails, an LLM trigger is activated to call the LLM for correction.
Finally, for queries that no model can correct, a fallback trigger decides to
return the original query. Extensive experiments demonstrate Trigger$^3$
outperforms correction baselines while maintaining efficiency.
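The four-stage cascade described above can be sketched as plain control flow.
This is a minimal illustration under stated assumptions, not the authors'
system: every function name, signature, and the toy dictionary-based "models"
are hypothetical stand-ins for the learned triggers and correctors.

```python
from typing import Callable

# Hypothetical type aliases, not the paper's API.
Corrector = Callable[[str], str]       # query -> corrected query
Trigger = Callable[[str, str], bool]   # (query, candidate) -> fire?


def correct_query(
    query: str,
    correction_trigger: Callable[[str], bool],
    small_model: Corrector,
    llm_trigger: Trigger,
    llm: Corrector,
    fallback_trigger: Trigger,
) -> str:
    # Stage 1: queries judged correct pass through untouched.
    if not correction_trigger(query):
        return query
    # Stage 2: try the cheap traditional correction model first.
    candidate = small_model(query)
    # Stage 3: call the LLM only if the small model's output is judged a failure.
    if not llm_trigger(query, candidate):
        return candidate
    llm_candidate = llm(query)
    # Stage 4: if the LLM output is also judged unreliable, keep the original.
    if fallback_trigger(query, llm_candidate):
        return query
    return llm_candidate


# Toy stand-ins (assumptions for illustration only).
VOCAB = {"weather", "today", "paris"}
SMALL_FIXES = {"wether": "weather"}
LLM_FIXES = {"wether": "weather", "pariss": "paris"}


def has_unknown_word(text: str) -> bool:
    return any(word not in VOCAB for word in text.split())


def apply_fixes(text: str, fixes: dict) -> str:
    return " ".join(fixes.get(word, word) for word in text.split())


def demo(query: str) -> str:
    return correct_query(
        query,
        correction_trigger=has_unknown_word,
        small_model=lambda q: apply_fixes(q, SMALL_FIXES),
        llm_trigger=lambda q, cand: has_unknown_word(cand),
        llm=lambda q: apply_fixes(q, LLM_FIXES),
        fallback_trigger=lambda q, cand: has_unknown_word(cand),
    )
```

In this toy setup, the expensive `llm` callable runs only for queries that the
first two stages could not resolve, which is the efficiency argument the
abstract makes.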
[LINK]
http://arxiv.org/abs/2412.12701v1
[DATE]
2024-12-17 17:16:54+08:00
[CATEGORIES]
cs.CL
FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction
[AUTHORS]
Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal
[ABSTRACT]
Auto-regressive Large Language Models (LLMs) demonstrate remarkable
performance across different domains such as vision and language processing.
However, due to sequential processing through a stack of transformer layers,
autoregressive decoding faces significant computation/latency challenges,
particularly in resource-constrained environments like mobile and edge devices.
Existing approaches in literature that aim to improve latency via skipping
layers have two distinct flavors - 1) Early exit, and 2) Input-agnostic
heuristics where tokens exit at pre-determined layers irrespective of input
sequence. Both the above strategies have limitations - the former cannot be
applied to handle KV Caching necessary for speed-ups in modern frameworks and
the latter does not capture the variation in layer importance across tasks or
more generally, across input sequences. To address both limitations, we propose
FiRST, an algorithm that reduces inference latency by using layer-specific
routers to select a subset of transformer layers adaptively for each input
sequence - the prompt (during the prefill stage) decides which layers will be
skipped during decoding. FiRST preserves compatibility with KV caching enabling
faster inference while being quality-aware. FiRST is model-agnostic and can be
easily enabled on any pre-trained LLM. Our approach reveals that input
adaptivity is critical - indeed, different task-specific middle layers play a
crucial role in evolving hidden representations depending on tasks. Extensive
experiments show that FiRST significantly reduces latency while outperforming
other layer selection strategies in quality metrics. It retains competitive
performance to the base model (without layer skipping) and in some cases, even
improves upon it. FiRST is thus a promising and efficient solution for LLM
deployment in low-resource environments.
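The routing idea can be sketched with a toy model in which the hidden state is
a single float. This is only an illustration under assumptions: the router
interface, the threshold, and the sequential decision order during prefill are
hypothetical and not taken from the paper.

```python
from typing import Callable, List, Tuple

Layer = Callable[[float], float]    # toy "transformer layer" on a scalar state
Router = Callable[[float], float]   # hypothetical: keep-score in [0, 1]


def prefill(
    state: float,
    layers: List[Layer],
    routers: List[Router],
    threshold: float = 0.5,
) -> Tuple[float, List[int]]:
    """Run the prompt once and record which layers the routers keep."""
    kept = []
    for index, (layer, router) in enumerate(zip(layers, routers)):
        if router(state) >= threshold:
            state = layer(state)
            kept.append(index)
    return state, kept


def decode_step(state: float, layers: List[Layer], kept: List[int]) -> float:
    """Reuse the prefill decisions: skipped layers stay skipped for every token."""
    for index in kept:
        state = layers[index](state)
    return state


# Toy layers and routers (assumptions for illustration only).
layers = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0]
routers = [
    lambda x: 1.0,                      # always keep layer 0
    lambda x: 0.0 if x > 5.0 else 1.0,  # skip layer 1 for "large" inputs
    lambda x: 1.0,                      # always keep layer 2
]

small_state, small_kept = prefill(1.0, layers, routers)
large_state, large_kept = prefill(10.0, layers, routers)
next_state = decode_step(0.0, layers, large_kept)
```

Because the kept set is fixed per sequence in this sketch, every decoding step
touches the same layers, which is presumably what lets per-layer KV caches stay
consistent.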
[LINK]
http://arxiv.org/abs/2410.12513v2
[DATE]
2024-12-17 17:11:47+08:00
[CATEGORIES]
cs.CL
XTransplant: A Probe into the Upper Bound Performance of Multilingual Capability and Culture Adaptability in LLMs via Mutual Cross-lingual Feed-forward Transplantation
[AUTHORS]
Yangfan Ye, Xiaocheng Feng, Xiachong Feng, Libo Qin, Yichong Huang, Lei Huang, Weitao Ma, Zhirui Zhang, Yunfei Lu, Xiaohui Yan, Duyu Tang, Dandan Tu, Bing Qin
[ABSTRACT]
Current large language models (LLMs) often exhibit imbalances in multilingual
capabilities and cultural adaptability, largely due to their English-centric
pretraining data. To address this imbalance, we propose a probing method named
XTransplant that explores cross-lingual latent interactions via cross-lingual
feed-forward transplantation during the inference stage, with the hope of enabling
the model to leverage the strengths of both English and non-English languages.
Through extensive pilot experiments, we empirically prove that both the
multilingual capabilities and cultural adaptability of LLMs hold the potential
to be significantly improved by XTransplant, respectively from En -> non-En and
non-En -> En, highlighting the underutilization of current LLMs’ multilingual
potential. The patterns observed in these pilot experiments further
motivate an offline scaling inference strategy, which demonstrates consistent
performance improvements in multilingual and culture-aware tasks, sometimes
even surpassing multilingual supervised fine-tuning. We hope our further
analysis and discussion can help provide deeper insights into the XTransplant
mechanism.
[LINK]
http://arxiv.org/abs/2412.12686v1
[DATE]
2024-12-17 17:05:30+08:00
[CATEGORIES]
cs.CL
Does Vision Accelerate Hierarchical Generalization in Neural Language Learners?
[AUTHORS]
Tatsuki Kuribayashi, Timothy Baldwin
[ABSTRACT]
Neural language models (LMs) are arguably less data-efficient than humans
from a language acquisition perspective. One fundamental question is why this
human-LM gap arises. This study explores the advantage of grounded language
acquisition, specifically the impact of visual information – which humans can
usually rely on but LMs largely do not have access to during language
acquisition – on syntactic generalization in LMs. Our experiments, following
the poverty of stimulus paradigm under two scenarios (using artificial vs.
naturalistic images), demonstrate that if the alignments between the linguistic
and visual components are clear in the input, access to vision data does help
with the syntactic generalization of LMs, but if not, visual input does not
help. This highlights the need for additional biases or signals, such as mutual
gaze, to enhance cross-modal alignment and enable efficient syntactic
generalization in multimodal LMs.
[COMMENTS]
COLING 2025; 15 pages
[LINK]
http://arxiv.org/abs/2302.00667v3
[DATE]
2024-12-17 16:57:43+08:00
[CATEGORIES]
cs.CL
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features
[AUTHORS]
Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller
[ABSTRACT]
The availability of high-quality APIs for Large Language Models (LLMs) has
facilitated the widespread creation of Machine-Generated Content (MGC), posing
challenges such as academic plagiarism and the spread of misinformation.
Existing MGC detectors often focus solely on surface-level information,
overlooking implicit and structural features. This makes them susceptible to
deception by surface-level sentence patterns, particularly for longer texts and
in texts that have been subsequently paraphrased.
To overcome these challenges, we introduce novel methodologies and datasets.
Besides the publicly available dataset Plagbench, we developed the paraphrased
Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts
(paraWP) datasets using GPT and DIPPER, a discourse paraphrasing tool, by
extending artifacts from their original versions. To address the challenge of
detecting highly similar paraphrased texts, we propose MhBART, an
encoder-decoder model designed to emulate human writing style while
incorporating a novel difference score mechanism. This model outperforms strong
classifier baselines and identifies deceptive sentence patterns. To better
capture the structure of longer texts at document level, we propose
DTransformer, a model that integrates discourse analysis through PDTB
preprocessing to encode structural features. It results in substantial
performance gains across both datasets – 15.5% absolute improvement on
paraLFQA, 4% absolute improvement on paraWP, and 1.5% absolute improvement on
M4 compared to SOTA approaches.
[LINK]
http://arxiv.org/abs/2412.12679v1
[DATE]
2024-12-17 16:47:41+08:00
[CATEGORIES]
cs.CL
Train More Parameters But Mind Their Placement: Insights into Language Adaptation with PEFT
[AUTHORS]
Jenny Kunz
[ABSTRACT]
Smaller LLMs still face significant challenges even in medium-resourced
languages, particularly when it comes to language-specific knowledge – a
problem not easily resolved with machine-translated data. In this case study on
Icelandic, we aim to enhance the generation performance of an LLM by
specialising it using unstructured text corpora. A key focus is on preventing
interference with the models’ capabilities of handling longer context during
this adaptation. Through ablation studies using various parameter-efficient
fine-tuning (PEFT) methods and setups, we find that increasing the number of
trainable parameters leads to better and more robust language adaptation. LoRAs
placed in the feed-forward layers and bottleneck adapters show promising
results with sufficient parameters, while prefix tuning and (IA)$^3$ are not
suitable. Although improvements are consistent in 0-shot summarisation, some
adapted models struggle with longer context lengths, an issue that can be
mitigated by adapting only the final layers.
[COMMENTS]
To appear at NoDaLiDa 2025
[LINK]
http://arxiv.org/abs/2412.12674v1
[DATE]
2024-12-17 16:44:00+08:00
[CATEGORIES]
cs.CL
Citekit: A Modular Toolkit for Large Language Model Citation Generation
[AUTHORS]
Jiajun Shen, Tong Zhou, Yubo Chen, Kang Liu
[ABSTRACT]
Enabling Large Language Models (LLMs) to generate citations in
Question-Answering (QA) tasks is an emerging paradigm aimed at enhancing the
verifiability of their responses when LLMs are utilizing external references to
generate an answer. However, there is currently no unified framework to
standardize and fairly compare different citation generation methods, leading
to difficulties in reproducing different methods and assessing them
comprehensively. To cope with the problems above, we introduce Citekit, an open-source
and modular toolkit designed to facilitate the implementation and evaluation of
existing citation generation methods, while also fostering the development of
new approaches to improve citation quality in LLM outputs. This tool is highly
extensible, allowing users to utilize 4 main modules and 14 components to
construct a pipeline, evaluating an existing method or innovative designs. Our
experiments with two state-of-the-art LLMs and 11 citation generation baselines
demonstrate varying strengths of different modules in answer accuracy and
citation quality improvement, as well as the challenge of enhancing
granularity. Based on our analysis of the effectiveness of components, we
propose a new method, self-RAG snippet, obtaining a balanced answer accuracy
and citation quality. Citekit is released at
https://github.com/SjJ1017/Citekit.
[COMMENTS]
7 pages, 14 figures
[LINK]
http://arxiv.org/abs/2408.04662v2
[DATE]
2024-12-17 16:37:34+08:00
[CATEGORIES]
cs.CL
MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants
[AUTHORS]
Hritik Bansal, Daniel Israel, Siyan Zhao, Shufan Li, Tung Nguyen, Aditya Grover
[ABSTRACT]
Recent advancements in mixed-modal generative models have enabled flexible
integration of information across image-text content. These models have opened
new avenues for developing unified biomedical assistants capable of analyzing
biomedical images, answering complex questions about them, and predicting the
impact of medical procedures on a patient’s health. However, existing resources
face challenges such as limited data availability, narrow domain coverage, and
restricted sources (e.g., medical papers). To address these gaps, we present
MedMax, the first large-scale multimodal biomedical instruction-tuning dataset
for mixed-modal foundation models. With 1.47 million instances, MedMax
encompasses a diverse range of tasks, including multimodal content generation
(interleaved image-text data), biomedical image captioning and generation,
visual chatting, and report understanding. These tasks span diverse medical
domains such as radiology and histopathology. Subsequently, we fine-tune a
mixed-modal foundation model on the MedMax dataset, achieving significant
performance improvements: a 26% gain over the Chameleon model and an 18.3%
improvement over GPT-4o across 12 downstream biomedical visual
question-answering tasks. Additionally, we introduce a unified evaluation suite
for biomedical tasks, providing a robust framework to guide the development of
next-generation mixed-modal biomedical AI assistants.
[COMMENTS]
12 figures, 15 tables
[LINK]
http://arxiv.org/abs/2412.12661v1
[DATE]
2024-12-17 16:30:00+08:00
[CATEGORIES]
cs.CL
iPrOp: Interactive Prompt Optimization for Large Language Models with a Human in the Loop
[AUTHORS]
Jiahui Li, Roman Klinger
[ABSTRACT]
Prompt engineering has made significant contributions to the era of large
language models, yet its effectiveness depends on the skills of a prompt
author. Automatic prompt optimization can support the prompt development
process, but requires annotated data. This paper introduces $\textit{iPrOp}$, a
novel Interactive Prompt Optimization system, to bridge manual prompt
engineering and automatic prompt optimization. With human intervention in the
optimization loop, $\textit{iPrOp}$ offers users the flexibility to assess
evolving prompts. We present users with prompt variations, selected instances,
large language model predictions accompanied by corresponding explanations, and
performance metrics derived from a subset of the training data. This approach
empowers users to choose and further refine the provided prompts based on their
individual preferences and needs. This system not only assists non-technical
domain experts in generating optimal prompts tailored to their specific tasks
or domains, but also enables the study of the intrinsic parameters that influence
the performance of prompt optimization. Our evaluation shows that our system
has the capability to generate improved prompts, leading to enhanced task
performance.
[LINK]
http://arxiv.org/abs/2412.12644v1
[DATE]
2024-12-17 16:09:15+08:00
[CATEGORIES]
cs.CL
LLM-based Discriminative Reasoning for Knowledge Graph Question Answering
[AUTHORS]
Mufan Xu, Kehai Chen, Xuefeng Bai, Muyun Yang, Tiejun Zhao, Min Zhang
[ABSTRACT]
Large language models (LLMs) based on generative pre-trained Transformers have
achieved remarkable performance on knowledge graph question-answering (KGQA)
tasks. However, LLMs often produce ungrounded subgraph planning or reasoning
results in KGQA due to the hallucinatory behavior brought by the generative
paradigm, which may hinder the advancement of the LLM-based KGQA model. To deal
with the issue, we propose a novel LLM-based Discriminative Reasoning (LDR)
method to explicitly model the subgraph retrieval and answer inference process.
By adopting discriminative strategies, the proposed LDR method not only
enhances the capability of LLMs to retrieve question-related subgraphs but also
alleviates the issue of ungrounded reasoning brought by the generative paradigm
of LLMs. Experimental results show that the proposed approach outperforms
multiple strong comparison methods, along with achieving state-of-the-art
performance on two widely used benchmarks, WebQSP and CWQ.
[LINK]
http://arxiv.org/abs/2412.12643v1
[DATE]
2024-12-17 16:07:16+08:00
[CATEGORIES]
cs.CL
Revisiting Cosine Similarity via Normalized ICA-transformed Embeddings
[AUTHORS]
Hiroaki Yamagiwa, Momose Oyama, Hidetoshi Shimodaira
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2406.10984v3
[DATE]
2024-12-17 16:03:38+08:00
[CATEGORIES]
cs.CL
$C^2$: Scalable Auto-Feedback for LLM-based Chart Generation
[AUTHORS]
Woosung Koh, Jang Han Yoon, MinHyung Lee, Youngjin Song, Jaegwan Cho, Jaehyun Kang, Taehyeon Kim, Se-young Yun, Youngjae Yu, Bongshin Lee
[ABSTRACT]
Generating high-quality charts with Large Language Models (LLMs) presents
significant challenges due to limited data and the high cost of scaling through
human curation. $\langle \text{instruction}, \text{data}, \text{code} \rangle$
triplets are scarce and expensive to manually curate as their creation demands
technical expertise. To address this scalability challenge, we introduce a
reference-free automatic feedback generator, which eliminates the need for
costly human intervention. Our novel framework, C$^2$, consists of (1) an
automatic feedback provider (ChartAF) and (2) a diverse, reference-free dataset
(ChartUIE-8K). The results are compelling: in our first experiment, 74% of
respondents strongly preferred, and 10% preferred, the results after feedback.
The second post-feedback experiment demonstrates that ChartAF outperforms nine
baselines. Moreover, ChartUIE-8K significantly improves data diversity by
increasing queries, datasets, and chart types by 5982%, 1936%, and 91%,
respectively, over benchmarks. Finally, a study of LLM users revealed that 94%
of participants preferred ChartUIE-8K’s queries, with 93% deeming them aligned
with real-world use cases. Core contributions are available as open-source at
chartsquared.github.io, with ample qualitative examples.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2410.18652v4
[DATE]
2024-12-17 16:03:10+08:00
[CATEGORIES]
cs.LG
cs.CL
Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree
[AUTHORS]
Xiangxiang Gao, Weisheng Xie, Yiwei Xiang, Feng Ji
[ABSTRACT]
Striking an optimal balance between minimal drafting latency and high
speculation accuracy to enhance the inference speed of Large Language Models
remains a significant challenge in speculative decoding. In this paper, we
introduce Falcon, an innovative semi-autoregressive speculative decoding
framework fashioned to augment both the drafter’s parallelism and output
quality. Falcon incorporates the Coupled Sequential Glancing Distillation
technique, which fortifies inter-token dependencies within the same block,
leading to increased speculation accuracy. We offer a comprehensive theoretical
analysis to illuminate the underlying mechanisms. Additionally, we introduce a
Custom-Designed Decoding Tree, which permits the drafter to generate multiple
tokens in a single forward pass and accommodates multiple forward passes as
needed, thereby boosting the number of drafted tokens and significantly
improving the overall acceptance rate. Comprehensive evaluations on benchmark
datasets such as MT-Bench, HumanEval, and GSM8K demonstrate Falcon’s superior
acceleration capabilities. The framework achieves a lossless speedup ratio
ranging from 2.91x to 3.51x when tested on the Vicuna and LLaMA2-Chat model
series. These results outstrip existing speculative decoding methods for LLMs,
including Eagle, Medusa, Lookahead, SPS, and PLD, while maintaining a compact
drafter architecture equivalent to merely two Transformer layers.
[COMMENTS]
AAAI 2025 Accepted
[LINK]
http://arxiv.org/abs/2412.12639v1
[DATE]
2024-12-17 16:02:08+08:00
[CATEGORIES]
cs.CL
What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context
[AUTHORS]
Zhiyuan Chang, Mingyang Li, Xiaojun Jia, Junjie Wang, Yuekai Huang, Qing Wang, Yihao Huang, Yang Liu
[ABSTRACT]
Incorporating external knowledge into large language models (LLMs) has
emerged as a promising approach to mitigate outdated knowledge and
hallucination in LLMs. However, external knowledge is often imperfect. In
addition to useful knowledge, external knowledge is rich in irrelevant
information or misinformation that can impair the reliability of LLM responses.
This paper focuses on LLMs’ preferred external knowledge in imperfect contexts
when handling multi-hop QA. Inspired by criminal procedural law’s Chain of
Evidence (CoE), we posit that knowledge preferred by LLMs should
maintain both relevance to the question and mutual support among knowledge
pieces. Accordingly, we propose an automated CoE discrimination approach and
explore LLMs’ preferences from their effectiveness, faithfulness and
robustness, as well as CoE’s usability in a naive Retrieval-Augmented
Generation (RAG) case. The evaluation on five LLMs reveals that CoE enhances
LLMs through more accurate generation, stronger answer faithfulness, better
robustness against knowledge conflict, and improved performance in a popular
RAG case.
[COMMENTS]
12 pages, 4 figures
[LINK]
http://arxiv.org/abs/2412.12632v1
[DATE]
2024-12-17 15:49:49+08:00
[CATEGORIES]
cs.CL
Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation
[AUTHORS]
Andong Chen, Yuchen Song, Kehai Chen, Muyun Yang, Tiejun Zhao, Min Zhang
[ABSTRACT]
Visual information has been introduced for enhancing machine translation
(MT), and its effectiveness heavily relies on the availability of large amounts
of bilingual parallel sentence pairs with manual image annotations. In this
paper, we introduce a stable diffusion-based imagination network into a
multimodal large language model (MLLM) to explicitly generate an image for each
source sentence, thereby advancing the multimodal MT. Particularly, we build
heuristic human feedback with reinforcement learning to ensure the consistency
of the generated image with the source sentence without the supervision of
image annotation, which breaks the bottleneck of using visual information in
MT. Furthermore, the proposed method enables imaginative visual information to
be integrated into large-scale text-only MT in addition to multimodal MT.
Experimental results show that our model significantly outperforms existing
multimodal MT and text-only MT, especially achieving an average improvement of
more than 14 BLEU points on Multi30K multimodal MT benchmarks.
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2412.12627v1
[DATE]
2024-12-17 15:41:23+08:00
[CATEGORIES]
cs.CL
Jailbreaking? One Step Is Enough!
[AUTHORS]
Weixiong Zheng, Peijian Zeng, Yiwei Li, Hongyan Wu, Nankai Lin, Junhao Chen, Aimin Yang, Yongmei Zhou
[ABSTRACT]
Large language models (LLMs) excel in various tasks but remain vulnerable to
jailbreak attacks, where adversaries manipulate prompts to generate harmful
outputs. Examining jailbreak prompts helps uncover the shortcomings of LLMs.
However, current jailbreak methods and the target model’s defenses are engaged
in an independent and adversarial process, resulting in the need for frequent
attack iterations and redesigning attacks for different models. To address
these gaps, we propose a Reverse Embedded Defense Attack (REDA) mechanism that
disguises the attack intention as the “defense” intention against harmful
content. Specifically, REDA starts from the target response, guiding the model
to embed harmful content within its defensive measures, thereby relegating
harmful content to a secondary role and making the model believe it is
performing a defensive task. The attacking model considers that it is guiding
the target model to deal with harmful content, while the target model thinks it
is performing a defensive task, creating an illusion of cooperation between the
two. Additionally, to enhance the model’s confidence and guidance in
“defensive” intentions, we adopt in-context learning (ICL) with a small number
of attack examples and construct a corresponding dataset of attack examples.
Extensive evaluations demonstrate that the REDA method enables cross-model
attacks without the need to redesign attack strategies for different models,
enables successful jailbreak in one iteration, and outperforms existing methods
on both open-source and closed-source models.
[COMMENTS]
17 pages
[LINK]
http://arxiv.org/abs/2412.12621v1
[DATE]
2024-12-17 15:33:41+08:00
[CATEGORIES]
cs.CL
I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm
[AUTHORS]
Yiming Liang, Ge Zhang, Xingwei Qu, Tianyu Zheng, Jiawei Guo, Xinrun Du, Zhenzhu Yang, Jiaheng Liu, Chenghua Lin, Lei Ma, Wenhao Huang, Jiajun Zhang
[ABSTRACT]
Large Language Models (LLMs) have achieved significant advancements; however,
the common learning paradigm treats LLMs as passive information repositories,
neglecting their potential for active learning and alignment. Some approaches
train LLMs using their own generated synthetic data, exploring the possibility
of active alignment. However, there is still a huge gap between these one-time
alignment methods and the continuous automatic alignment of humans. In this
paper, we introduce I-SHEEP, an Iterative Self-EnHancEmEnt Paradigm. This
human-like paradigm enables LLMs to continuously self-align from scratch with
nothing. Compared to the one-time alignment method Dromedary (Sun et al.,
2023), which refers to the first iteration in this paper, I-SHEEP can
significantly enhance capacities on both Qwen and Llama models. I-SHEEP
achieves a maximum relative improvement of 78.2% in the Alpaca Eval, 24.0% in
the MT Bench, and an absolute increase of 8.88% in the IFEval accuracy over
subsequent iterations in the Qwen-1.5 72B model. Additionally, I-SHEEP
surpasses the base model in various standard benchmark generation tasks,
achieving an average improvement of 24.77% in code generation tasks, 12.04% in
TriviaQA, and 20.29% in SQuAD. We also provide new insights based on the
experiment results. Our code, datasets, and models are available at
https://anonymous.4open.science/r/I-SHEEP.
[LINK]
http://arxiv.org/abs/2408.08072v3
[DATE]
2024-12-17 15:30:54+08:00
[CATEGORIES]
cs.CL
Deep Learning and Machine Learning – Natural Language Processing: From Theory to Application
[AUTHORS]
Keyu Chen, Cheng Fei, Ziqian Bi, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Caitlyn Heqi Yin, Yichao Zhang, Pohsun Feng, Yizhu Wen, Tianyang Wang, Ming Li, Jintao Ren, Qian Niu, Silin Chen, Weiche Hsieh, Lawrence K. Q. Yan, Chia Xin Liang, Han Xu, Hong-Ming Tseng, Xinyuan Song, Ming Liu
[ABSTRACT]
With a focus on natural language processing (NLP) and the role of large
language models (LLMs), we explore the intersection of machine learning, deep
learning, and artificial intelligence. As artificial intelligence continues to
revolutionize fields from healthcare to finance, NLP techniques such as
tokenization, text classification, and entity recognition are essential for
processing and understanding human language. This paper discusses advanced data
preprocessing techniques and the use of frameworks like Hugging Face for
implementing transformer-based models. Additionally, it highlights challenges
such as handling multilingual data, reducing bias, and ensuring model
robustness. By addressing key aspects of data processing and model fine-tuning,
this work aims to provide insights into deploying effective and ethically sound
AI solutions.
[COMMENTS]
252 pages
[LINK]
http://arxiv.org/abs/2411.05026v2
[DATE]
2024-12-17 15:26:39+08:00
[CATEGORIES]
cs.CL
SynthCypher: A Fully Synthetic Data Generation Framework for Text-to-Cypher Querying in Knowledge Graphs
[AUTHORS]
Aman Tiwari, Shiva Krishna Reddy Malay, Vikas Yadav, Masoud Hashemi, Sathwik Tejaswi Madhusudhan
[ABSTRACT]
Cypher, the query language for Neo4j graph databases, plays a critical role
in enabling graph-based analytics and data exploration. While substantial
research has been dedicated to natural language to SQL query generation
(Text2SQL), the analogous problem for graph databases referred to as
Text2Cypher remains underexplored. In this work, we introduce SynthCypher, a
fully synthetic and automated data generation pipeline designed to address this
gap. SynthCypher employs a novel LLM-Supervised Generation-Verification
framework, ensuring syntactically and semantically correct Cypher queries
across diverse domains and query complexities. Using this pipeline, we create
SynthCypher Dataset, a large-scale benchmark containing 29.8k Text2Cypher
instances. Fine-tuning open-source large language models (LLMs), including
LLaMa-3.1-8B, Mistral-7B, and QWEN-7B, on SynthCypher yields significant
performance improvements of up to 40% on the Text2Cypher test set and 30% on
the SPIDER benchmark adapted for graph databases. This work demonstrates that
high-quality synthetic data can effectively advance the state-of-the-art in
Text2Cypher tasks.
[LINK]
http://arxiv.org/abs/2412.12612v1
[DATE]
2024-12-17 15:21:25+08:00
[CATEGORIES]
cs.CL
cs.LG
SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding
[AUTHORS]
Zhenglin Wang, Jialong Wu, Yilong Lai, Congzhi Zhang, Deyu Zhou
[ABSTRACT]
Large Language Models (LLMs) demonstrate remarkable emergent abilities across
various tasks, yet fall short on complex reasoning and planning tasks. The
tree-search-based reasoning methods address this by surpassing the capabilities
of chain-of-thought prompting, encouraging exploration of intermediate steps.
However, such methods introduce significant inference latency due to the
systematic exploration and evaluation of multiple thought paths. This paper
introduces SeeD, a novel and efficient inference framework to optimize runtime
speed and GPU memory management concurrently. By employing a scheduled
speculative execution, SeeD efficiently handles multiple iterations for the
thought generation and the state evaluation, leveraging a rounds-scheduled
strategy to manage draft model dispatching. Extensive experimental evaluations
on three reasoning datasets demonstrate superior speedup performance of SeeD,
providing a viable path for batched inference in training-free speculative
decoding.
[COMMENTS]
Accepted by COLING 2025
[LINK]
http://arxiv.org/abs/2406.18200v2
[DATE]
2024-12-17 15:18:53+08:00
[CATEGORIES]
cs.CL
Can Large Language Models Address Open-Target Stance Detection?
[AUTHORS]
Abu Ubaida Akash, Ahmed Fahmy, Amine Trabelsi
[ABSTRACT]
Stance detection (SD) identifies the text position towards a target,
typically labeled as favor, against, or none. We introduce Open-Target Stance
Detection (OTSD), the most realistic task where targets are neither seen during
training nor provided as input. We evaluate Large Language Models (LLMs) from
GPT, Gemini, Llama, and Mistral families, comparing their performance to the
only existing work, Target-Stance Extraction (TSE), which benefits from
predefined targets. Unlike TSE, OTSD removes the dependency on a predefined
list, making target generation and evaluation more challenging. We also provide
a metric for evaluating target quality that correlates well with human
judgment. Our experiments reveal that LLMs outperform TSE in target generation,
both when the real target is explicitly and not explicitly mentioned in the
text. Similarly, LLMs overall surpass TSE in stance detection for both explicit
and non-explicit cases. However, LLMs struggle in both target generation and
stance detection when the target is not explicit.
[COMMENTS]
13 pages; currently under submission
[LINK]
http://arxiv.org/abs/2409.00222v5
[DATE]
2024-12-17 15:15:38+08:00
[CATEGORIES]
cs.CL
MultiLingPoT: Enhancing Mathematical Reasoning with Multilingual Program Fine-tuning
[AUTHORS]
Nianqi Li, Zujie Liang, Siyu Yuan, Jiaqing Liang, Feng Wei, Yanghua Xiao
[ABSTRACT]
Program-of-Thought (PoT), which aims to use programming language instead of
natural language as an intermediate step in reasoning, is an important way for
LLMs to solve mathematical problems. Since different programming languages
excel in different areas, it is natural to use the most suitable language for
solving specific problems. However, current PoT research only focuses on single
language PoT, ignoring the differences between different programming languages.
Therefore, this paper proposes a multilingual program reasoning method,
MultiLingPoT. This method allows the model to answer questions using multiple
programming languages by fine-tuning on multilingual data. Additionally, prior
and posterior hybrid methods are used to help the model select the most
suitable language for each problem. Our experimental results show that the
training of MultiLingPoT improves each program’s mathematical reasoning by
about 2.5%. Moreover, with proper mixing, the performance of MultiLingPoT can
be further improved, achieving a 6% increase compared to the single-language
PoT with the data augmentation. Resources of this paper can be found at
https://github.com/Nianqi-Li/MultiLingPoT.
[LINK]
http://arxiv.org/abs/2412.12609v1
[DATE]
2024-12-17 15:14:03+08:00
[CATEGORIES]
cs.CL
Norm of Mean Contextualized Embeddings Determines their Variance
[AUTHORS]
Hiroaki Yamagiwa, Hidetoshi Shimodaira
[ABSTRACT]
Contextualized embeddings vary by context, even for the same token, and form
a distribution in the embedding space. To analyze this distribution, we focus
on the norm of the mean embedding and the variance of the embeddings. In this
study, we first demonstrate that these values follow the well-known formula for
variance in statistics and provide an efficient sequential computation method.
Then, by observing embeddings from intermediate layers of several Transformer
models, we found a strong trade-off relationship between the norm and the
variance: as the mean embedding becomes closer to the origin, the variance
increases. This trade-off is likely influenced by the layer normalization
mechanism used in Transformer models. Furthermore, when the sets of token
embeddings are treated as clusters, we show that the variance of the entire
embedding set can theoretically be decomposed into the within-cluster variance
and the between-cluster variance. We found experimentally that as the layers of
Transformer models deepen, the embeddings move farther from the origin, the
between-cluster variance relatively decreases, and the within-cluster variance
relatively increases. These results are consistent with existing studies on the
anisotropy of the embedding spaces across layers.
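
As a rough illustration of the "well-known formula for variance" and the sequential computation mentioned above, here is a minimal sketch. It assumes the variance of an embedding set means the mean squared distance to the mean embedding, so that Var = E[||x||^2] - ||E[x]||^2. The function names and the toy vectors are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only (not the authors' code). Assumption: "variance"
# is the mean squared distance of the embeddings to their mean, which
# equals E[||x||^2] - ||E[x]||^2.

def sequential_stats(vectors):
    """Return (squared norm of the mean, variance) using running sums only."""
    n = 0
    sum_vec = None      # running sum of the vectors
    sum_sq_norm = 0.0   # running sum of squared norms
    for x in vectors:
        if sum_vec is None:
            sum_vec = [0.0] * len(x)
        for i, xi in enumerate(x):
            sum_vec[i] += xi
        sum_sq_norm += sum(xi * xi for xi in x)
        n += 1
    if n == 0:
        raise ValueError("need at least one vector")
    mean = [s / n for s in sum_vec]
    mean_norm_sq = sum(m * m for m in mean)
    # Variance formula: E[||x||^2] - ||E[x]||^2
    variance = sum_sq_norm / n - mean_norm_sq
    return mean_norm_sq, variance


def direct_variance(vectors):
    """Mean squared distance to the mean, computed directly for comparison."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return sum(sum((v[i] - mean[i]) ** 2 for i in range(dim)) for v in vectors) / n


# Hypothetical toy "embeddings" of one token in four contexts.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
mean_norm_sq, var = sequential_stats(toy)
```

Because only the running sum of vectors and the running sum of squared norms are stored, the embeddings can be streamed one at a time. For a fixed average squared norm, a mean closer to the origin gives a larger variance, which matches the direction of the trade-off described above.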
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2409.11253v2
[DATE]
2024-12-17 15:07:52+08:00
[CATEGORIES]
cs.CL
Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning
[AUTHORS]
Vernon Y. H. Toh, Deepanway Ghosal, Soujanya Poria
[ABSTRACT]
Large language models (LLMs) have shown increasing competence in solving
mathematical reasoning problems. However, many open-source LLMs still struggle
with errors in calculation and semantic understanding during intermediate
reasoning steps. In this work, we introduce Prove, a simple yet effective
framework that leverages translated programs derived from natural language
solutions as a verification mechanism to filter out potentially incorrect
reasoning paths before aggregating final answers. Unlike vanilla majority
voting, our approach filters out solutions whose corresponding program output
is inconsistent with the generated solution, aggregating only those that pass
verification. We conducted extensive experiments using 13 open-source LLMs from
various model families and sizes, ranging from 0.5B to 13B parameters, across
eight mathematical benchmarks. Our results show that Prove consistently
outperforms vanilla majority voting as a heuristic for solving mathematical
reasoning tasks across all model sizes and datasets, achieving improvements of
up to 18% on GSM8K and 8% on MATH-500. Our code is available at
https://github.com/declare-lab/prove.
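
The filter-then-vote idea described above can be sketched as follows. This is a hedged illustration, not the Prove implementation. It assumes each sampled solution has already been reduced to a final answer and paired with the output of its translated program. The function names, the sample data, and the fallback to plain majority voting when nothing passes verification are all assumptions of this sketch.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer, or None for an empty list."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def verified_vote(candidates):
    """candidates: list of (solution_answer, program_output) pairs.

    Keep only solutions whose program output agrees with the solution's
    answer, then take a majority vote over the survivors.
    """
    verified = [ans for ans, prog_out in candidates if ans == prog_out]
    if verified:
        return majority_vote(verified)
    # Assumption of this sketch: fall back to vanilla voting if every
    # candidate fails verification.
    return majority_vote([ans for ans, _ in candidates])


# Hypothetical samples: three unverified votes for "12", two verified for "15".
samples = [("12", "15"), ("12", "14"), ("15", "15"), ("15", "15"), ("12", "13")]
vanilla = majority_vote([ans for ans, _ in samples])
filtered = verified_vote(samples)
```

In this toy case vanilla voting picks "12" because it is the most frequent answer, while the verified vote picks "15" because only those solutions agree with their program output.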
[LINK]
http://arxiv.org/abs/2410.12608v2
[DATE]
2024-12-17 14:55:00+08:00
[CATEGORIES]
cs.CL
Whispers in Grammars: Injecting Covert Backdoors to Compromise Dense Retrieval Systems
[AUTHORS]
Quanyu Long, Yue Deng, LeiLei Gan, Wenya Wang, Sinno Jialin Pan
[ABSTRACT]
Dense retrieval systems have been widely used in various NLP applications.
However, their vulnerabilities to potential attacks have been underexplored.
This paper investigates a novel attack scenario where the attackers aim to
mislead the retrieval system into retrieving the attacker-specified contents.
Those contents, injected into the retrieval corpus by attackers, can include
harmful text like hate speech or spam. Unlike prior methods that rely on model
weights and generate conspicuous, unnatural outputs, we propose a covert
backdoor attack triggered by grammar errors. Our approach ensures that the
attacked models can function normally for standard queries while covertly
triggering the retrieval of the attacker’s contents in response to minor
linguistic mistakes. Specifically, dense retrievers are trained with
contrastive loss and hard negative sampling. Surprisingly, our findings
demonstrate that contrastive loss is notably sensitive to grammatical errors,
and hard negative sampling can exacerbate susceptibility to backdoor attacks.
Our proposed method achieves a high attack success rate with a minimal corpus
poisoning rate of only 0.048%, while preserving normal retrieval performance.
This indicates that the method has negligible impact on user experience for
error-free queries. Furthermore, evaluations across three real-world defense
strategies reveal that the malicious passages embedded within the corpus remain
highly resistant to detection and filtering, underscoring the robustness and
subtlety of the proposed attack.
[LINK]
http://arxiv.org/abs/2402.13532v2
[DATE]
2024-12-17 14:54:32+08:00
[CATEGORIES]
cs.CL
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
[AUTHORS]
Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal
[ABSTRACT]
Storytelling video generation (SVG) has recently emerged as a task to create
long, multi-motion, multi-scene videos that consistently represent the story
described in the input text script. SVG holds great potential for diverse
content creation in media and entertainment; however, it also presents
significant challenges: (1) objects must exhibit a range of fine-grained,
complex motions, (2) multiple objects need to appear consistently across
scenes, and (3) subjects may require multiple motions with seamless transitions
within a single scene. To address these challenges, we propose DreamRunner, a
novel story-to-video generation method: First, we structure the input script
using a large language model (LLM) to facilitate both coarse-grained scene
planning as well as fine-grained object-level layout and motion planning. Next,
DreamRunner presents retrieval-augmented test-time adaptation to capture target
motion priors for objects in each scene, supporting diverse motion
customization based on retrieved videos, thus facilitating the generation of
new videos with complex, scripted motions. Lastly, we propose a novel
spatial-temporal region-based 3D attention and prior injection module SR3AI for
fine-grained object-motion binding and frame-by-frame semantic control. We
compare DreamRunner with various SVG baselines, demonstrating state-of-the-art
performance in character consistency, text alignment, and smooth transitions.
Additionally, DreamRunner exhibits strong fine-grained condition-following
ability in compositional text-to-video generation, significantly outperforming
baselines on T2V-ComBench. Finally, we validate DreamRunner’s robust ability to
generate multi-object interactions with qualitative examples.
[COMMENTS]
Project website: https://zunwang1.github.io/DreamRunner
[LINK]
http://arxiv.org/abs/2411.16657v2
[DATE]
2024-12-17 14:52:46+08:00
[CATEGORIES]
cs.CL
LLMs are Also Effective Embedding Models: An In-depth Overview
[AUTHORS]
Chongyang Tao, Tao Shen, Shen Gao, Junshuo Zhang, Zhen Li, Zhengwei Tao, Shuai Ma
[ABSTRACT]
Large language models (LLMs) have revolutionized natural language processing
by achieving state-of-the-art performance across various tasks. Recently, their
effectiveness as embedding models has gained attention, marking a paradigm
shift from traditional encoder-only models like ELMo and BERT to decoder-only,
large-scale LLMs such as GPT, LLaMA, and Mistral. This survey provides an
in-depth overview of this transition, beginning with foundational techniques
before the LLM era, followed by LLM-based embedding models through two main
strategies to derive embeddings from LLMs. 1) Direct prompting: We mainly
discuss the prompt designs and the underlying rationale for deriving
competitive embeddings. 2) Data-centric tuning: We cover extensive aspects that
affect tuning an embedding model, including model architecture, training
objectives, data constructions, etc. Building on the above, we also cover advanced
methods, such as handling longer texts, and multilingual and cross-modal data.
Furthermore, we discuss factors affecting choices of embedding models, such as
performance/efficiency comparisons, dense vs sparse embeddings, pooling
strategies, and scaling law. Lastly, the survey highlights the limitations and
challenges in adapting LLMs for embeddings, including cross-task embedding
quality, trade-offs between efficiency and accuracy, low-resource,
long-context, data bias, robustness, etc. This survey serves as a valuable
resource for researchers and practitioners by synthesizing current
advancements, highlighting key challenges, and offering a comprehensive
framework for future work aimed at enhancing the effectiveness and efficiency
of LLMs as embedding models.
[COMMENTS]
32 pages
[LINK]
http://arxiv.org/abs/2412.12591v1
[DATE]
2024-12-17 14:48:24+08:00
[CATEGORIES]
cs.CL
PerSphere: A Comprehensive Framework for Multi-Faceted Perspective Retrieval and Summarization
[AUTHORS]
Yun Luo, Yingjie Li, Xiangkun Hu, Qinglin Qi, Fang Guo, Qipeng Guo, Zheng Zhang, Yue Zhang
[ABSTRACT]
As online platforms and recommendation algorithms evolve, people are
increasingly trapped in echo chambers, leading to biased understandings of
various issues. To combat this issue, we have introduced PerSphere, a benchmark
designed to facilitate multi-faceted perspective retrieval and summarization,
thus breaking free from these information silos. For each query within
PerSphere, there are two opposing claims, each supported by distinct,
non-overlapping perspectives drawn from one or more documents. Our goal is to
accurately summarize these documents, aligning the summaries with the
respective claims and their underlying perspectives. This task is structured as
a two-step end-to-end pipeline that includes comprehensive document retrieval
and multi-faceted summarization. Furthermore, we propose a set of metrics to
evaluate the comprehensiveness of the retrieval and summarization content.
Experimental results on various counterparts for the pipeline show that recent
models struggle with such a complex task. Analysis shows that the main
challenge lies in long context and perspective extraction, and we propose a
simple but effective multi-agent summarization system, offering a promising
solution to enhance performance on PerSphere.
[LINK]
http://arxiv.org/abs/2412.12588v1
[DATE]
2024-12-17 14:44:06+08:00
[CATEGORIES]
cs.CL
Large Language Model-Brained GUI Agents: A Survey
[AUTHORS]
Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
[ABSTRACT]
GUIs have long been central to human-computer interaction, providing an
intuitive and visually-driven way to access and interact with digital systems.
The advent of LLMs, particularly multimodal models, has ushered in a new era of
GUI automation. They have demonstrated exceptional capabilities in natural
language understanding, code generation, and visual processing. This has paved
the way for a new generation of LLM-brained GUI agents capable of interpreting
complex GUI elements and autonomously executing actions based on natural
language instructions. These agents represent a paradigm shift, enabling users
to perform intricate, multi-step tasks through simple conversational commands.
Their applications span across web navigation, mobile app interactions, and
desktop automation, offering a transformative user experience that
revolutionizes how individuals interact with software. This emerging field is
rapidly advancing, with significant progress in both research and industry.
To provide a structured understanding of this trend, this paper presents a
comprehensive survey of LLM-brained GUI agents, exploring their historical
evolution, core components, and advanced techniques. We address research
questions such as existing GUI agent frameworks, the collection and utilization
of data for training specialized GUI agents, the development of large action
models tailored for GUI tasks, and the evaluation metrics and benchmarks
necessary to assess their effectiveness. Additionally, we examine emerging
applications powered by these agents. Through a detailed analysis, this survey
identifies key research gaps and outlines a roadmap for future advancements in
the field. By consolidating foundational knowledge and state-of-the-art
developments, this work aims to guide both researchers and practitioners in
overcoming challenges and unlocking the full potential of LLM-brained GUI
agents.
[COMMENTS]
The collection of papers reviewed in this survey will be hosted and
regularly updated on the GitHub repository:
https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey
A searchable webpage is also available at https://aka.ms/gui-agent for easier
access and exploration.
[LINK]
http://arxiv.org/abs/2411.18279v4
[DATE]
2024-12-17 14:35:56+08:00
[CATEGORIES]
cs.CL
WaterPark: A Robustness Assessment of Language Model Watermarking
[AUTHORS]
Jiacheng Liang, Zian Wang, Lauren Hong, Shouling Ji, Ting Wang
[ABSTRACT]
Various watermarking methods (“watermarkers”) have been proposed to
identify LLM-generated texts; yet, due to the lack of unified evaluation
platforms, many critical questions remain under-explored: i) What are the
strengths/limitations of various watermarkers, especially their attack
robustness? ii) How do various design choices impact their robustness? iii) How
to optimally operate watermarkers in adversarial environments? To fill this
gap, we systematize existing LLM watermarkers and watermark removal attacks,
mapping out their design spaces. We then develop WaterPark, a unified platform
that integrates 10 state-of-the-art watermarkers and 12 representative attacks.
More importantly, by leveraging WaterPark, we conduct a comprehensive
assessment of existing watermarkers, unveiling the impact of various design
choices on their attack robustness. We further explore the best practices to
operate watermarkers in adversarial environments. We believe our study sheds
light on current LLM watermarking techniques while WaterPark serves as a
valuable testbed to facilitate future research.
[COMMENTS]
22 pages
[LINK]
http://arxiv.org/abs/2411.13425v2
[DATE]
2024-12-17 14:30:32+08:00
[CATEGORIES]
cs.CL
cs.LG
Romanization Encoding For Multilingual ASR
[AUTHORS]
Wen Ding, Fei Jia, Hainan Xu, Yu Xi, Junjie Lai, Boris Ginsburg
[ABSTRACT]
We introduce romanization encoding for script-heavy languages to optimize
multilingual and code-switching Automatic Speech Recognition (ASR) systems. By
adopting romanization encoding alongside a balanced concatenated tokenizer
within a FastConformer-RNNT framework equipped with a Roman2Char module, we
significantly reduce vocabulary and output dimensions, enabling larger training
batches and reduced memory consumption. Our method decouples acoustic modeling
and language modeling, enhancing the flexibility and adaptability of the
system. In our study, applying this method to Mandarin-English ASR resulted in
a remarkable 63.51% vocabulary reduction and notable performance gains of
13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on
Mandarin-Korean and Mandarin-Japanese highlight our method’s strong capability
to address the complexities of other script-heavy languages, paving the way for
more versatile and effective multilingual ASR systems.
[COMMENTS]
Accepted by IEEE SLT2024
[LINK]
http://arxiv.org/abs/2407.04368v2
[DATE]
2024-12-17 14:28:55+08:00
[CATEGORIES]
cs.CL
Process-Supervised Reward Models for Clinical Note Generation: A Scalable Approach Guided by Domain Expertise
[AUTHORS]
Hanyin Wang, Qiping Xu, Bolun Liu, Guleid Hussein, Hariprasad Korsapati, Mohamad El Labban, Kingsley Iheasirim, Mohamed Hassan, Gokhan Anil, Brian Bartlett, Jimeng Sun
[ABSTRACT]
Process-supervised reward models (PRMs), which verify large language model
(LLM) outputs step-by-step, have achieved significant success in mathematical
and coding problems. However, their application to other domains remains
largely unexplored. In this work, we train a PRM to provide step-level reward
signals for clinical notes generated by LLMs from patient-doctor dialogues.
Guided by real-world clinician expertise, we carefully designed step
definitions for clinical notes and utilized Gemini-Pro 1.5 to automatically
generate process supervision data at scale. Our proposed PRM, trained on the
LLaMA-3.1 8B instruct model, demonstrated superior performance compared to
Gemini-Pro 1.5 and an outcome-supervised reward model (ORM) across two key
evaluations: (1) the accuracy of selecting gold-reference samples from
error-containing samples, achieving 98.8% (versus 61.3% for ORM and 93.8% for
Gemini-Pro 1.5), and (2) the accuracy of selecting physician-preferred notes,
achieving 56.2% (compared to 51.2% for ORM and 50.0% for Gemini-Pro 1.5).
Additionally, we conducted ablation studies to determine optimal loss functions
and data selection strategies, along with physician reader studies to explore
predictors of downstream Best-of-N performance. Our promising results suggest
the potential of PRMs to extend beyond the clinical domain, offering a scalable
and effective solution for diverse generative tasks.
[LINK]
http://arxiv.org/abs/2412.12583v1
[DATE]
2024-12-17 14:24:34+08:00
[CATEGORIES]
cs.CL
Controlling Equational Reasoning in Large Language Models with Prompt Interventions
[AUTHORS]
Jordan Meadows, Marco Valentino, Andre Freitas
[ABSTRACT]
This paper investigates how hallucination rates in Large Language Models
(LLMs) may be controlled and mitigated via a symbolic data generation
framework, and explores a fundamental relationship between the rate of certain
mathematical errors and interventions. Specifically, we systematically generate
data for a derivation generation task, and apply targeted interventions on
prompts to perturb aspects such as the surface forms of symbols, equational
tree structures, and mathematical context, and evaluate the effect of prompt
interventions across a range of LLMs including fine-tuned T5 models, GPT, and
others. Experiments suggest that T5-Large can outperform the few-shot
performance of GPT-4 on various evaluation sets generated via the framework;
however, an extensive evaluation based on human analysis, template-based error
detection, and various text generation metrics reveals fine-tuned model
weaknesses beyond what the reference-based metrics singularly describe. We use
these results to tie characteristic distributional footprints of interventions
to the human evaluation of LLM derivation quality, potentially leading to
significant control over fine-grained mathematical capabilities of language
models with respect to specific types of errors.
[COMMENTS]
AAAI 2025 (7 pages)
[LINK]
http://arxiv.org/abs/2307.09998v4
[DATE]
2024-12-17 14:01:52+08:00
[CATEGORIES]
cs.CL
Quantifying Lexical Semantic Shift via Unbalanced Optimal Transport
[AUTHORS]
Ryo Kishino, Hiroaki Yamagiwa, Ryo Nagata, Sho Yokoi, Hidetoshi Shimodaira
[ABSTRACT]
Lexical semantic change detection aims to identify shifts in word meanings
over time. While existing methods using embeddings from a diachronic corpus
pair estimate the degree of change for target words, they offer limited insight
into changes at the level of individual usage instances. To address this, we
apply Unbalanced Optimal Transport (UOT) to sets of contextualized word
embeddings, capturing semantic change through the excess and deficit in the
alignment between usage instances. In particular, we propose Sense Usage Shift
(SUS), a measure that quantifies changes in the usage frequency of a word sense
at each usage instance. By leveraging SUS, we demonstrate that several
challenges in semantic change detection can be addressed in a unified manner,
including quantifying instance-level semantic change and word-level tasks such
as measuring the magnitude of semantic change and the broadening or narrowing
of meaning.
[LINK]
http://arxiv.org/abs/2412.12569v1
[DATE]
2024-12-17 14:00:54+08:00
[CATEGORIES]
cs.CL
Evaluating Zero-Shot Multilingual Aspect-Based Sentiment Analysis with Large Language Models
[AUTHORS]
Chengyan Wu, Bolei Ma, Zheyu Zhang, Ningyuan Deng, Yanqing He, Yun Xue
[ABSTRACT]
Aspect-based sentiment analysis (ABSA), a sequence labeling task, has
attracted increasing attention in multilingual contexts. While previous
research has focused largely on fine-tuning or training models specifically for
ABSA, we evaluate large language models (LLMs) under zero-shot conditions to
explore their potential to tackle this challenge with minimal task-specific
adaptation. We conduct a comprehensive empirical evaluation of a series of LLMs
on multilingual ABSA tasks, investigating various prompting strategies,
including vanilla zero-shot, chain-of-thought (CoT), self-improvement,
self-debate, and self-consistency, across nine different models. Results
indicate that while LLMs show promise in handling multilingual ABSA, they
generally fall short of fine-tuned, task-specific models. Notably, simpler
zero-shot prompts often outperform more complex strategies, especially in
high-resource languages like English. These findings underscore the need for
further refinement of LLM-based approaches to effectively address ABSA tasks
across diverse languages.
[LINK]
http://arxiv.org/abs/2412.12564v1
[DATE]
2024-12-17 13:48:48+08:00
[CATEGORIES]
cs.CL
Task-Agnostic Language Model Watermarking via High Entropy Passthrough Layers
[AUTHORS]
Vaden Masrani, Mohammad Akbari, David Ming Xuan Yue, Ahmad Rezaei, Yong Zhang
[ABSTRACT]
In the era of costly pre-training of large language models, ensuring the
intellectual property rights of model owners, and ensuring that said models are
responsibly deployed, is becoming increasingly important. To this end, we
propose model watermarking via passthrough layers, which are added to existing
pre-trained networks and trained using a self-supervised loss such that the
model produces high-entropy output when prompted with a unique private key, and
acts normally otherwise. Unlike existing model watermarking methods, our method
is fully task-agnostic, and can be applied to both classification and
sequence-to-sequence tasks without requiring advanced access to downstream
fine-tuning datasets. We evaluate the proposed passthrough layers on a wide
range of downstream tasks, and show experimentally that our watermarking method
achieves a near-perfect watermark extraction accuracy and false-positive rate
in most cases without damaging original model performance. Additionally, we
show our method is robust to downstream fine-tuning, fine-pruning, and
layer removal attacks, and can be trained in a fraction of the time required to
train the original model. Code is available in the paper.
[COMMENTS]
Accepted to AAAI2025
[LINK]
http://arxiv.org/abs/2412.12563v1
[DATE]
2024-12-17 13:46:50+08:00
[CATEGORIES]
cs.CL
MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback
[AUTHORS]
Zonghai Yao, Aditya Parashar, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Zhichao Yang, Hong Yu
[ABSTRACT]
Automatic question generation (QG) is essential for AI and NLP, particularly
in intelligent tutoring, dialogue systems, and fact verification. Generating
multiple-choice questions (MCQG) for professional exams, like the United States
Medical Licensing Examination (USMLE), is particularly challenging, requiring
domain expertise and complex multi-hop reasoning for high-quality questions.
However, current large language models (LLMs) like GPT-4 struggle with
professional MCQG due to outdated knowledge, hallucination issues, and prompt
sensitivity, resulting in unsatisfactory quality and difficulty. To address
these challenges, we propose MCQG-SRefine, an LLM self-refine-based (Critique
and Correction) framework for converting medical cases into high-quality
USMLE-style questions. By integrating expert-driven prompt engineering with
iterative self-critique and self-correction feedback, MCQG-SRefine
significantly enhances human expert satisfaction regarding both the quality and
difficulty of the questions. Furthermore, we introduce an LLM-as-Judge-based
automatic metric to replace the complex and costly expert evaluation process,
ensuring reliable and expert-aligned assessments.
[COMMENTS]
Equal contribution for the first two authors. Keywords: Question
Generation, USMLE, Self-Refine, Self-Critique, and Self-Correction,
LLM-as-Judge, AI for Medical Education
[LINK]
http://arxiv.org/abs/2410.13191v3
[DATE]
2024-12-17 13:36:53+08:00
[CATEGORIES]
cs.CL
When to Speak, When to Abstain: Contrastive Decoding with Abstention
[AUTHORS]
Hyuhng Joon Kim, Youna Kim, Sang-goo Lee, Taeuk Kim
[ABSTRACT]
Large Language Models (LLMs) demonstrate exceptional performance across
diverse tasks by leveraging both pre-trained knowledge (i.e., parametric
knowledge) and external knowledge (i.e., contextual knowledge). While
substantial efforts have been made to leverage both forms of knowledge,
scenarios in which the model lacks any relevant knowledge remain underexplored.
Such limitations can result in issues like hallucination, causing reduced
reliability and potential risks in high-stakes applications. To address such
limitations, this paper extends the task scope to encompass cases where the
user’s request cannot be fulfilled due to the lack of relevant knowledge. To
this end, we introduce Contrastive Decoding with Abstention (CDA), a
training-free decoding method that empowers LLMs to generate responses when
relevant knowledge is available and to abstain otherwise. CDA evaluates the
relevance of each form of knowledge for a given query, adaptively determining which
knowledge to prioritize or which to completely ignore. Extensive experiments
with four LLMs on three question-answering datasets demonstrate that CDA can
effectively perform accurate generation and abstention simultaneously. These
findings highlight CDA’s potential to broaden the applicability of LLMs,
enhancing reliability and preserving user trust.
[COMMENTS]
under-review
[LINK]
http://arxiv.org/abs/2412.12527v1
[DATE]
2024-12-17 12:38:08+08:00
[CATEGORIES]
cs.CL
SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking
[AUTHORS]
Zhuang Li, Yuncheng Hua, Thuy-Trang Vu, Haolan Zhan, Lizhen Qu, Gholamreza Haffari
[ABSTRACT]
Recent studies emphasize that manually ensuring a consistent response style
and maintaining high data quality in training sets can significantly improve
the performance of fine-tuned Large Language Models (LLMs) while reducing the
number of training examples needed. However, the precise definition of style
and the relationship between style, data quality, and LLM performance remain
unclear. This research identifies two key stylistic elements in responses:
linguistic form and instructional surprisal. We find that, among training data
of comparable quality, higher consistency in these response elements leads to
better LLM performance. Inspired by this, we introduce Style Consistency-Aware
Response Ranking (SCAR), which automatically prioritizes instruction-response
pairs in the training set based on their response stylistic consistency. By
selecting the most style-consistent examples, sometimes as few as 0.7% of the
full dataset, the fine-tuned LLMs can match or even surpass the performance of
models trained on the entire dataset in coding and open-ended
question-answering benchmarks. Code and data are available at
https://github.com/zhuang-li/SCAR .
[COMMENTS]
31 pages
[LINK]
http://arxiv.org/abs/2406.10882v7
[DATE]
2024-12-17 12:33:12+08:00
[CATEGORIES]
cs.CL
Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation
[AUTHORS]
Ci-Siang Lin, Chien-Yi Wang, Yu-Chiang Frank Wang, Min-Hung Chen
[ABSTRACT]
Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation
models using image data with only image-level supervision. Since precise
pixel-level annotations are not accessible, existing methods typically focus on
producing pseudo masks for training segmentation models by refining CAM-like
heatmaps. However, the produced heatmaps may capture only the discriminative
image regions of object categories or the associated co-occurring backgrounds.
To address these issues, we propose a Semantic Prompt Learning for WSSS (SemPLeS)
framework, which learns to effectively prompt the CLIP latent space to enhance
the semantic alignment between the segmented regions and the target object
categories. More specifically, we propose Contrastive Prompt Learning and
Prompt-guided Semantic Refinement to learn the prompts that adequately describe
and suppress the co-occurring backgrounds associated with each object category.
In this way, SemPLeS can perform better semantic alignment between object
regions and class labels, resulting in desired pseudo masks for training
segmentation models. The proposed SemPLeS framework achieves competitive
performance on standard WSSS benchmarks, PASCAL VOC 2012 and MS COCO 2014, and
shows compatibility with other WSSS methods.
[COMMENTS]
WACV 2025. Project page: https://projectdisr.github.io/semples
[LINK]
http://arxiv.org/abs/2401.11791v3
[DATE]
2024-12-17 12:27:31+08:00
[CATEGORIES]
cs.CL
cs.LG
Solid-SQL: Enhanced Schema-linking based In-context Learning for Robust Text-to-SQL
[AUTHORS]
Geling Liu, Yunzhi Tan, Ruichao Zhong, Yuanzhen Xie, Lingchen Zhao, Qian Wang, Bo Hu, Zang Li
[ABSTRACT]
Recently, large language models (LLMs) have significantly improved the
performance of text-to-SQL systems. Nevertheless, many state-of-the-art (SOTA)
approaches have overlooked the critical aspect of system robustness. Our
experiments reveal that while LLM-driven methods excel on standard datasets,
their accuracy is notably compromised when faced with adversarial
perturbations. To address this challenge, we propose a robust text-to-SQL
solution, called Solid-SQL, designed to integrate with various LLMs. We focus
on the pre-processing stage, training a robust schema-linking model enhanced by
LLM-based data augmentation. Additionally, we design a two-round, structural
similarity-based example retrieval strategy for in-context learning. Our method
achieves SOTA SQL execution accuracy levels of 82.1% and 58.9% on the general
Spider and Bird benchmarks, respectively. Furthermore, experimental results
show that Solid-SQL delivers an average improvement of 11.6% compared to
baselines on the perturbed Spider-Syn, Spider-Realistic, and Dr. Spider
benchmarks.
[COMMENTS]
Accepted at COLING 2025 Main
[LINK]
http://arxiv.org/abs/2412.12522v1
[DATE]
2024-12-17 12:22:22+08:00
[CATEGORIES]
cs.CL
Linear Adversarial Concept Erasure
[AUTHORS]
Shauli Ravfogel, Michael Twiton, Yoav Goldberg, Ryan Cotterell
[ABSTRACT]
Modern neural models trained on textual data rely on pre-trained
representations that emerge without direct supervision. As these
representations are increasingly being used in real-world applications, the
inability to control their content becomes an increasingly important
problem. We formulate the problem of identifying and erasing a linear subspace
that corresponds to a given concept, in order to prevent linear predictors from
recovering the concept. We model this problem as a constrained, linear maximin
game, and show that existing solutions are generally not optimal for this task.
We derive a closed-form solution for certain objectives, and propose a convex
relaxation, R-LACE, that works well for others. When evaluated in the context
of binary gender removal, the method recovers a low-dimensional subspace whose
removal mitigates bias by intrinsic and extrinsic evaluation. We show that the
method is highly expressive, effectively mitigating bias in deep nonlinear
classifiers while maintaining tractability and interpretability.
[COMMENTS]
Accepted in ICML 2022; a revised version
[LINK]
http://arxiv.org/abs/2201.12091v8
[DATE]
2024-12-17 12:09:21+08:00
[CATEGORIES]
cs.LG
cs.CL
ELDER: Enhancing Lifelong Model Editing with Mixture-of-LoRA
[AUTHORS]
Jiaang Li, Quan Wang, Zhongnan Wang, Yongdong Zhang, Zhendong Mao
[ABSTRACT]
Large language models (LLMs) require model editing to efficiently update
specific knowledge within them and avoid factual errors. Most model editing
methods are solely designed for single-time use and result in a significant
forgetting effect in lifelong editing scenarios, where sequential edits are
conducted over time. Previous approaches manage sequential edits by freezing
original parameters and discretely allocating new parameters for each knowledge
update. However, these methods lack robustness to minor input variations due to
the discrete mapping between data and parameters. To overcome this challenge,
we propose ELDER, a novel approach to create a continuous association between
data and adapters. ELDER integrates multiple LoRAs through a router network and
is trained to establish a smooth data-adapter association, thereby enhancing
the edit robustness and generalization to semantically equivalent inputs. To
ensure inputs containing the same knowledge will be processed by the same
LoRAs, we design a novel loss to guide the model to link LoRA allocations with
edit knowledge. Furthermore, we propose a deferral mechanism to retain the
original LLM capabilities post-edit. Extensive experiments on GPT-2 XL and
LLaMA2-7B demonstrate that ELDER effectively edits models in the lifelong
setting, outperforming eight baselines while exhibiting strong scalability and
preserving LLMs’ general abilities on downstream tasks.
[COMMENTS]
Accepted by AAAI-25
[LINK]
http://arxiv.org/abs/2408.11869v2
[DATE]
2024-12-17 11:59:22+08:00
[CATEGORIES]
cs.CL
cs.LG
Can Large Language Models Understand You Better? An MBTI Personality Detection Dataset Aligned with Population Traits
[AUTHORS]
Bohan Li, Jiannan Guan, Longxu Dou, Yunlong Feng, Dingzirui Wang, Yang Xu, Enbo Wang, Qiguang Chen, Bichen Wang, Xiao Xu, Yimeng Zhang, Libo Qin, Yanyan Zhao, Qingfu Zhu, Wanxiang Che
[COMMENTS]
Accepted by COLING 2025. 28 pages, 20 figures, 10 tables
[LINK]
http://arxiv.org/abs/2412.12510v1
[DATE]
2024-12-17 11:46:51+08:00
[CATEGORIES]
cs.CL
Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
[AUTHORS]
Kayla Schroeder, Zach Wood-Doughty
[ABSTRACT]
Large Language Models (LLMs) have become increasingly powerful and
ubiquitous, but their stochastic nature poses challenges to the reliability of
their outputs. While deterministic settings can improve consistency, they do
not guarantee reliability, as a single sample from the model’s probability
distribution can still be misleading. Building upon the concept of
LLM-as-a-judge, we introduce a novel framework for rigorously evaluating the
reliability of LLM judgments, leveraging McDonald’s omega. We evaluate the
reliability of LLMs when judging the outputs of other LLMs on standard
single-turn and multi-turn benchmarks, simultaneously investigating the impact
of temperature on reliability. By analyzing these results, we demonstrate the
limitations of fixed randomness and the importance of considering multiple
samples, which we show has significant implications for downstream
applications. Our findings highlight the need for a nuanced understanding of
LLM reliability and the potential risks associated with over-reliance on
single-shot evaluations. This work provides a crucial step towards building
more trustworthy and reliable LLM-based systems and applications.
[LINK]
http://arxiv.org/abs/2412.12509v1
[DATE]
2024-12-17 11:37:31+08:00
[CATEGORIES]
cs.CL
Multilingual Knowledge Editing with Language-Agnostic Factual Neurons
[AUTHORS]
Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Yufeng Chen, Jinan Xu, Jie Zhou
[COMMENTS]
COLING 2025 (14 pages, 3 figures, 12 tables)
[LINK]
http://arxiv.org/abs/2406.16416v2
[DATE]
2024-12-17 11:22:34+08:00
[CATEGORIES]
cs.CL
DocFusion: A Unified Framework for Document Parsing Tasks
[AUTHORS]
Mingxu Chai, Ziyu Shen, Chong Zhang, Yue Zhang, Xiao Wang, Shihan Dou, Jihua Kang, Jiazheng Zhang, Qi Zhang
[ABSTRACT]
Document parsing is essential for analyzing complex document structures and
extracting fine-grained information, supporting numerous downstream
applications. However, existing methods often require integrating multiple
independent models to handle various parsing tasks, leading to high complexity
and maintenance overhead. To address this, we propose DocFusion, a lightweight
generative model with only 0.28B parameters. It unifies task representations
and achieves collaborative training through an improved objective function.
Experiments reveal and leverage the mutually beneficial interaction among
recognition tasks, and integrating recognition data significantly enhances
detection performance. The final results demonstrate that DocFusion achieves
state-of-the-art (SOTA) performance across four key tasks.
[LINK]
http://arxiv.org/abs/2412.12505v1
[DATE]
2024-12-17 11:20:00+08:00
[CATEGORIES]
cs.CL
Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models
[AUTHORS]
Sina Bagheri Nezhad, Ameeta Agrawal, Rhitabrat Pokharel
[ABSTRACT]
Multilingual language models (MLLMs) are crucial for handling text across
various languages, yet they often show performance disparities due to
differences in resource availability and linguistic characteristics. While the
impact of pre-train data percentage and model size on performance is
well-known, our study reveals additional critical factors that significantly
influence MLLM effectiveness. Analyzing a wide range of features, including
geographical, linguistic, and resource-related aspects, we focus on the SIB-200
dataset for classification and the Flores-200 dataset for machine translation,
using regression models and SHAP values across 204 languages. Our findings
identify token similarity and country similarity as pivotal factors, alongside
pre-train data and model size, in enhancing model performance. Token similarity
facilitates cross-lingual transfer, while country similarity highlights the
importance of shared cultural and linguistic contexts. These insights offer
valuable guidance for developing more equitable and effective multilingual
language models, particularly for underrepresented languages.
[COMMENTS]
Accepted at The First Workshop on Language Models for Low-Resource
Languages @ COLING 2025
[LINK]
http://arxiv.org/abs/2412.12500v1
[DATE]
2024-12-17 11:05:26+08:00
[CATEGORIES]
cs.CL
Human-in-the-Loop Generation of Adversarial Texts: A Case Study on Tibetan Script
[AUTHORS]
Xi Cao, Yuan Sun, Jiajun Li, Quzong Gesang, Nuo Qun, Tashi Nyima
[ABSTRACT]
DNN-based language models perform excellently on various tasks, but even SOTA
LLMs are susceptible to textual adversarial attacks. Adversarial texts play
crucial roles in multiple subfields of NLP. However, current research has the
following issues. (1) Most textual adversarial attack methods target
rich-resourced languages. How do we generate adversarial texts for less-studied
languages? (2) Most textual adversarial attack methods are prone to generating
invalid or ambiguous adversarial texts. How do we construct high-quality
adversarial robustness benchmarks? (3) New language models may be immune to
part of previously generated adversarial texts. How do we update adversarial
robustness benchmarks? To address the above issues, we introduce HITL-GAT, a
system based on a general approach to human-in-the-loop generation of
adversarial texts. HITL-GAT contains four stages in one pipeline: victim model
construction, adversarial example generation, high-quality benchmark
construction, and adversarial robustness evaluation. Additionally, we utilize
HITL-GAT to conduct a case study on Tibetan script, which can serve as a
reference for adversarial research on other less-studied languages.
[COMMENTS]
Review Version; Submitted to NAACL 2025 Demo Track
[LINK]
http://arxiv.org/abs/2412.12478v1
[DATE]
2024-12-17 10:29:54+08:00
[CATEGORIES]
cs.CL
Neural Topic Modeling with Large Language Models in the Loop
[AUTHORS]
Xiaohao Yang, He Zhao, Weijie Xu, Yuanyuan Qi, Jueqing Lu, Dinh Phung, Lan Du
[ABSTRACT]
Topic modeling is a fundamental task in natural language processing, allowing
the discovery of latent thematic structures in text corpora. While Large
Language Models (LLMs) have demonstrated promising capabilities in topic
discovery, their direct application to topic modeling suffers from issues such
as incomplete topic coverage, misalignment of topics, and inefficiency. To
address these limitations, we propose LLM-ITL, a novel LLM-in-the-loop
framework that integrates LLMs with Neural Topic Models (NTMs). In LLM-ITL,
global topics and document representations are learned through the NTM.
Meanwhile, an LLM refines these topics using an Optimal Transport (OT)-based
alignment objective, where the refinement is dynamically adjusted based on the
LLM’s confidence in suggesting topical words for each set of input words. With
the flexibility of being integrated into many existing NTMs, the proposed
approach enhances the interpretability of topics while preserving the
efficiency of NTMs in learning topics and document representations. Extensive
experiments demonstrate that LLM-ITL helps NTMs significantly improve their
topic interpretability while maintaining the quality of document
representation. Our code and datasets will be available on GitHub.
[LINK]
http://arxiv.org/abs/2411.08534v2
[DATE]
2024-12-17 10:27:04+08:00
[CATEGORIES]
cs.CL
RareAgents: Autonomous Multi-disciplinary Team for Rare Disease Diagnosis and Treatment
[AUTHORS]
Xuanzhong Chen, Ye Jin, Xiaohao Mao, Lun Wang, Shuyang Zhang, Ting Chen
[ABSTRACT]
Rare diseases, despite their low individual incidence, collectively impact
around 300 million people worldwide due to the huge number of diseases. The
complexity of symptoms and the shortage of specialized doctors with relevant
experience make diagnosing and treating rare diseases more challenging than
common diseases. Recently, agents powered by large language models (LLMs) have
demonstrated notable improvements across various domains. In the medical field,
some agent methods have outperformed direct prompts in question-answering tasks
from medical exams. However, current agent frameworks lack adaptation for
real-world clinical scenarios, especially those involving the intricate demands
of rare diseases. To address these challenges, we present RareAgents, the first
multi-disciplinary team of LLM-based agents tailored to the complex clinical
context of rare diseases. RareAgents integrates advanced planning capabilities,
memory mechanisms, and medical tool utilization, leveraging Llama-3.1-8B/70B
as the base model. Experimental results show that RareAgents surpasses
state-of-the-art domain-specific models, GPT-4o, and existing agent frameworks
in both differential diagnosis and medication recommendation for rare diseases.
Furthermore, we contribute a novel dataset, MIMIC-IV-Ext-Rare, derived from
MIMIC-IV, to support further advancements in this field.
[LINK]
http://arxiv.org/abs/2412.12475v1
[DATE]
2024-12-17 10:22:24+08:00
[CATEGORIES]
cs.CL
Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs
[AUTHORS]
Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang
[COMMENTS]
Accepted to COLING 2025
[LINK]
http://arxiv.org/abs/2409.10994v3
[DATE]
2024-12-17 10:05:27+08:00
[CATEGORIES]
cs.CL
Core Context Aware Attention for Long Context Language Modeling
[AUTHORS]
Yaofo Chen, Zeng You, Shuhai Zhang, Haokun Li, Yirui Li, Yaowei Wang, Mingkui Tan
[ABSTRACT]
Transformer-based Large Language Models (LLMs) have exhibited remarkable
success in various natural language processing tasks, primarily attributed to
the self-attention mechanism, which requires a token to consider all preceding
tokens as its context to compute the attention score. However, when the context
length L becomes very large (e.g., 32K), more redundant context information
will be included w.r.t. any token, making the self-attention suffer from two
main limitations: 1) The computational and memory complexity scales
quadratically w.r.t. L; 2) The presence of redundant context information may
hamper the model from capturing dependencies among crucial tokens, which may
degrade the representation performance. In this paper, we propose a
plug-and-play Core Context Aware (CCA) Attention for efficient long-range
context modeling, which consists of two components: 1) Globality-pooling
attention that divides input tokens into groups and then dynamically merges
tokens within each group into one core token based on their significance; 2)
Locality-preserved attention that incorporates neighboring tokens into the
attention calculation. The two complementary attentions will then be fused into
the final attention, maintaining the comprehensive modeling ability of full
self-attention. In this way, the core context information w.r.t. a given token
will be automatically focused and strengthened, while the context information
in redundant groups will be diminished during the learning process. As a
result, the computational and memory complexity will be significantly reduced.
More importantly, the CCA-Attention can improve the long-context modeling
ability by diminishing the redundant context information. Extensive
experimental results demonstrate that our CCA-Attention significantly
outperforms state-of-the-art models in terms of computational efficiency and
long-context modeling ability.
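
The globality-pooling step described above can be illustrated with a minimal sketch. It assumes each token already has a scalar significance score and that tokens in a group are merged by a softmax-weighted average; the abstract does not specify how significance is computed, so `scores`, `group_size`, and the function name are hypothetical.

```python
import math


def pool_core_tokens(tokens, scores, group_size):
    """Merge each group of token vectors into one core token.

    tokens: list of equal-length float vectors.
    scores: one significance score per token (assumed given; hypothetical).
    group_size: number of consecutive tokens per group.
    """
    core = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        group_scores = scores[start:start + group_size]
        # Softmax over the group's scores (shifted by the max for stability).
        peak = max(group_scores)
        weights = [math.exp(s - peak) for s in group_scores]
        total = sum(weights)
        dim = len(group[0])
        # Weighted average of the group's vectors, one dimension at a time.
        core.append([
            sum(w * vec[d] for w, vec in zip(weights, group)) / total
            for d in range(dim)
        ])
    return core


tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0], [7.0, 9.0]]
core = pool_core_tokens(tokens, scores=[0.0, 0.0, 0.0, 0.0], group_size=2)
```

With a group size of g, a query attends to roughly L/g core tokens instead of L tokens, which is where the reduction in cost would come from under these assumptions.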
[LINK]
http://arxiv.org/abs/2412.12465v1
[DATE]
2024-12-17 09:54:08+08:00
[CATEGORIES]
cs.CL
cs.LG
LITA: An Efficient LLM-assisted Iterative Topic Augmentation Framework
[AUTHORS]
Chia-Hsuan Chang, Jui-Tse Tsai, Yi-Hang Tsai, San-Yih Hwang
[ABSTRACT]
Topic modeling is widely used for uncovering thematic structures within text
corpora, yet traditional models often struggle with specificity and coherence
in domain-focused applications. Guided approaches, such as SeededLDA and CorEx,
incorporate user-provided seed words to improve relevance but remain
labor-intensive and static. Large language models (LLMs) offer potential for
dynamic topic refinement and discovery, yet their application often incurs high
API costs. To address these challenges, we propose the LLM-assisted Iterative
Topic Augmentation framework (LITA), an LLM-assisted approach that integrates
user-provided seeds with embedding-based clustering and iterative refinement.
LITA identifies a small number of ambiguous documents and employs an LLM to
reassign them to existing or new topics, minimizing API costs while enhancing
topic quality. Experiments on two datasets across topic quality and clustering
performance metrics demonstrate that LITA outperforms five baseline models,
including LDA, SeededLDA, CorEx, BERTopic, and PromptTopic. Our work offers an
efficient and adaptable framework for advancing topic modeling and text
clustering.
[COMMENTS]
Under Review
[LINK]
http://arxiv.org/abs/2412.12459v1
[DATE]
2024-12-17 09:43:44+08:00
[CATEGORIES]
cs.CL
Graph Learning in the Era of LLMs: A Survey from the Perspective of Data, Models, and Tasks
[AUTHORS]
Xunkai Li, Zhengyu Wu, Jiayi Wu, Hanwen Cui, Jishuo Jia, Rong-Hua Li, Guoren Wang
[ABSTRACT]
With the increasing prevalence of cross-domain Text-Attributed Graph (TAG)
Data (e.g., citation networks, recommendation systems, social networks, and
ai4science), the integration of Graph Neural Networks (GNNs) and Large Language
Models (LLMs) into a unified Model architecture (e.g., LLM as enhancer, LLM as
collaborators, LLM as predictor) has emerged as a promising technological
paradigm. The core of this new graph learning paradigm lies in the synergistic
combination of GNNs’ ability to capture complex structural relationships and
LLMs’ proficiency in understanding informative contexts from the rich textual
descriptions of graphs. Therefore, we can leverage graph description texts with
rich semantic context to fundamentally enhance Data quality, thereby improving
the representational capacity of model-centric approaches in line with
data-centric machine learning principles. By leveraging the strengths of these
distinct neural network architectures, this integrated approach addresses a
wide range of TAG-based Task (e.g., graph learning, graph reasoning, and graph
question answering), particularly in complex industrial scenarios (e.g.,
supervised, few-shot, and zero-shot settings). In other words, we can treat
text as a medium to enable cross-domain generalization of graph learning Model,
allowing a single graph model to effectively handle the diversity of downstream
graph-based Task across different data domains. This work serves as a
foundational reference for researchers and practitioners looking to advance
graph learning methodologies in the rapidly evolving landscape of LLM. We
consistently maintain the related open-source materials at
https://github.com/xkLi-Allen/Awesome-GNN-in-LLMs-Papers.
[COMMENTS]
In progress
[LINK]
http://arxiv.org/abs/2412.12456v1
[DATE]
2024-12-17 09:41:17+08:00
[CATEGORIES]
cs.LG
cs.CL
PERC: Plan-As-Query Example Retrieval for Underrepresented Code Generation
[AUTHORS]
Jaeseok Yoo, Hojae Han, Youngwon Lee, Jaejin Kim, Seung-won Hwang
[ABSTRACT]
Code generation with large language models has shown significant promise,
especially when employing retrieval-augmented generation (RAG) with few-shot
examples. However, selecting effective examples that enhance generation quality
remains a challenging task, particularly when the target programming language
(PL) is underrepresented. In this study, we present two key findings: (1)
retrieving examples whose presented algorithmic plans can be referenced for
generating the desired behavior significantly improves generation accuracy, and
(2) converting code into pseudocode effectively captures such algorithmic
plans, enhancing retrieval quality even when the source and the target PLs are
different. Based on these findings, we propose Plan-as-query Example Retrieval
for few-shot prompting in Code generation (PERC), a novel framework that
utilizes algorithmic plans to identify and retrieve effective examples. We
validate the effectiveness of PERC through extensive experiments on the
CodeContests, HumanEval and MultiPL-E benchmarks: PERC consistently outperforms
the state-of-the-art RAG methods in code generation, whether the source and
target programming languages match or differ, highlighting its adaptability and
robustness in diverse coding environments.
[COMMENTS]
Accepted by COLING 2025 main conference
[LINK]
http://arxiv.org/abs/2412.12447v1
[DATE]
2024-12-17 09:23:45+08:00
[CATEGORIES]
cs.CL
Refining Dimensions for Improving Clustering-based Cross-lingual Topic Models
[AUTHORS]
Chia-Hsuan Chang, Tien-Yuan Huang, Yi-Hang Tsai, Chia-Ming Chang, San-Yih Hwang
[COMMENTS]
Accepted to 18th BUCC Workshop at COLING 2025
[LINK]
http://arxiv.org/abs/2412.12433v1
[DATE]
2024-12-17 08:50:23+08:00
[CATEGORIES]
cs.CL
cs.LG
Familiarity-Aware Evidence Compression for Retrieval-Augmented Generation
[AUTHORS]
Dongwon Jung, Qin Liu, Tenghao Huang, Ben Zhou, Muhao Chen
[ABSTRACT]
Retrieval-augmented generation (RAG) improves large language models (LMs) by
incorporating non-parametric knowledge through evidence retrieved from external
sources. However, it often struggles to cope with inconsistent and irrelevant
information that can distract the LM from its tasks, especially when multiple
evidence pieces are required. While compressing the retrieved evidence with a
compression model aims to address this issue, the compressed evidence may still
be unfamiliar to the target model used for downstream tasks, potentially
failing to utilize the evidence effectively. We propose FaviComp
(Familiarity-Aware Evidence Compression), a novel training-free evidence
compression technique that makes retrieved evidence more familiar to the target
model, while seamlessly integrating parametric knowledge from the model.
Experimental results show that FaviComp consistently outperforms most recent
evidence compression baselines across multiple open-domain QA datasets,
improving accuracy by up to 28.1% while achieving high compression rates.
Additionally, we demonstrate the effective integration of both parametric and
non-parametric knowledge during evidence compression.
[LINK]
http://arxiv.org/abs/2409.12468v2
[DATE]
2024-12-17 08:25:46+08:00
[CATEGORIES]
cs.CL
cs.LG
Execution-Based Evaluation of Natural Language to Bash and PowerShell for Incident Remediation
[AUTHORS]
Ngoc Phuoc An Vo, Brent Paulovicks, Vadim Sheinin
[ABSTRACT]
Given recent advancements in Large Language Models (LLMs), code generation
tasks have attracted immense attention for their wide application in different
domains. In an effort to evaluate and select the best model to automatically
remediate system incidents discovered by Application Performance Monitoring
(APM) platforms, it is crucial to verify whether the generated code is
syntactically and semantically correct, and whether it can be executed
correctly as intended. However, current methods for evaluating the quality of
code generated by LLMs rely heavily on surface-form similarity metrics (e.g.,
BLEU, ROUGE, and exact/partial match), which have numerous limitations. In
contrast, execution-based evaluation focuses more on code functionality and
does not constrain the code generation to any fixed solution. Nevertheless,
designing and implementing such an execution-based evaluation platform is not
a trivial task. Several works have created execution-based evaluation
platforms for popular programming languages such as SQL, Python, and Java, but
there have been limited or no attempts for scripting languages such as Bash
and PowerShell. In this paper, we present the first execution-based evaluation
platform, for which we created three test suites (125 handcrafted test cases
in total) to evaluate Bash (both single-line commands and multi-line scripts)
and PowerShell code generated by LLMs. We benchmark seven closed and
open-source LLMs using our platform with different techniques (zero-shot vs.
few-shot learning).
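
The contrast between surface-form matching and execution-based checking can be sketched as follows. This is a minimal illustration rather than the platform described above: the function name, the pass criterion (zero exit code and matching standard output), and the timeout are all assumptions.

```python
import subprocess
import sys


def passes_execution_test(argv, expected_stdout, timeout_s=10):
    """Run a candidate command and compare what it prints, not how it is written.

    argv: the command to run, as a list of arguments.
    expected_stdout: the output a correct solution should produce.
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing interpreter or a hang counts as a failed test case.
        return False
    return (
        result.returncode == 0
        and result.stdout.strip() == expected_stdout.strip()
    )


# Two textually different candidates that behave identically both pass,
# whereas an exact-match metric would reject one of them.
candidate_a = [sys.executable, "-c", "print('a' + 'b')"]
candidate_b = [sys.executable, "-c", "print('ab')"]
both_pass = (
    passes_execution_test(candidate_a, "ab")
    and passes_execution_test(candidate_b, "ab")
)
```

For a Bash candidate, the same function could be called with a command such as `["bash", "-c", "echo hello"]`, assuming `bash` is installed; generated code of unknown origin should only ever be run inside a sandbox.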
[LINK]
http://arxiv.org/abs/2405.06807v2
[DATE]
2024-12-17 07:13:26+08:00
[CATEGORIES]
cs.CL
Efficient Scaling of Diffusion Transformers for Text-to-Image Generation
[AUTHORS]
Hao Li, Shamit Lal, Zhiheng Li, Yusheng Xie, Ying Wang, Yang Zou, Orchid Majumder, R. Manmatha, Zhuowen Tu, Stefano Ermon, Stefano Soatto, Ashwin Swaminathan
[ABSTRACT]
We empirically study the scaling properties of various Diffusion Transformers
(DiTs) for text-to-image generation by performing extensive and rigorous
ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters
on datasets of up to 600M images. We find that U-ViT, a pure
self-attention-based DiT model, provides a simpler design and scales more
effectively than cross-attention-based DiT variants, which allows
straightforward expansion for extra conditions and other modalities. We find
that a 2.3B U-ViT model can achieve better performance than the SDXL UNet and
other DiT variants in a controlled setting. On the data scaling side, we
investigate how increasing dataset size and enhanced long captions improve
text-image alignment performance and learning efficiency.
[LINK]
http://arxiv.org/abs/2412.12391v1
[DATE]
2024-12-17 06:59:26+08:00
[CATEGORIES]
cs.CL
cs.LG
Re-Examine Distantly Supervised NER: A New Benchmark and a Simple Approach
[AUTHORS]
Yuepei Li, Kang Zhou, Qiao Qiao, Qing Wang, Qi Li
[COMMENTS]
This has been accepted by COLING 2025
[LINK]
http://arxiv.org/abs/2402.14948v3
[DATE]
2024-12-17 06:39:47+08:00
[CATEGORIES]
cs.CL
cs.LG
Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering
[AUTHORS]
Jinhe Bi, Yujun Wang, Haokun Chen, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
[ABSTRACT]
Multimodal Large Language Models (MLLMs) have significantly advanced visual
tasks by integrating visual representations into large language models (LLMs).
The textual modality, inherited from LLMs, equips MLLMs with abilities like
instruction following and in-context learning. In contrast, the visual modality
enhances performance in downstream tasks by leveraging rich semantic content,
spatial information, and grounding capabilities. These intrinsic modalities
work synergistically across various visual tasks. Our research initially
reveals a persistent imbalance between these modalities, with text often
dominating output generation during visual instruction tuning. This imbalance
occurs when using both full fine-tuning and parameter-efficient fine-tuning
(PEFT) methods. We then found that re-balancing these modalities can
significantly reduce the number of trainable parameters required, inspiring a
direction for further optimizing visual instruction tuning. We introduce
Modality Linear Representation-Steering (MoReS) to achieve the goal. MoReS
effectively re-balances the intrinsic modalities throughout the model, where
the key idea is to steer visual representations through linear transformations
in the visual subspace across each model layer. To validate our solution, we
composed LLaVA Steering, a suite of models integrated with the proposed MoReS
method. Evaluation results show that the composed LLaVA Steering models
require, on average, 500 times fewer trainable parameters than LoRA needs while
still achieving comparable performance across three visual benchmarks and eight
visual question-answering tasks. Last, we present the LLaVA Steering Factory,
an in-house developed platform that enables researchers to quickly customize
various MLLMs with component-based architecture for seamlessly integrating
state-of-the-art models, and evaluate their intrinsic modality imbalance.
[LINK]
http://arxiv.org/abs/2412.12359v1
[DATE]
2024-12-17 05:14:11+08:00
[CATEGORIES]
cs.CL
BioRAGent: A Retrieval-Augmented Generation System for Showcasing Generative Query Expansion and Domain-Specific Search for Scientific Q&A
[AUTHORS]
Samy Ateia, Udo Kruschwitz
[ABSTRACT]
We present BioRAGent, an interactive web-based retrieval-augmented generation
(RAG) system for biomedical question answering. The system uses large language
models (LLMs) for query expansion, snippet extraction, and answer generation
while maintaining transparency through citation links to the source documents
and displaying generated queries for further editing. Building on our
successful participation in the BioASQ 2024 challenge, we demonstrate how
few-shot learning with LLMs can be effectively applied in a professional
search setting. The system supports both direct, short paragraph-style responses
and responses with inline citations. Our demo is available online, and the
source code is publicly accessible through GitHub.
[COMMENTS]
Version as accepted at the Demo Track at ECIR 2025
[LINK]
http://arxiv.org/abs/2412.12358v1
[DATE]
2024-12-17 05:09:28+08:00
[CATEGORIES]
cs.CL
Krony-PT: GPT2 compressed with Kronecker Products
[AUTHORS]
M. Ayoub Ben Ayad, Jelena Mitrovic, Michael Granitzer
[ABSTRACT]
We introduce Krony-PT, a compression technique for GPT2
(Radford et al., 2019) based on Kronecker Products. We specifically target
the MLP layers of each transformer layer, and systematically compress the feed
forward layer matrices to various degrees. We introduce a modified Van Loan
decomposition to initialize the new factors, and also introduce a new
pruning-based initialization trick. Our method compresses the original 124M
parameter GPT2 to various smaller models, with 80M being the smallest, and 96M
being the largest compressed model. Our 81M model variant outperforms
distilgpt2 on next-token prediction on all standard language modeling datasets,
and shows competitive scores or performs on par with other Kronecker Products
based compressed models of GPT2 that are significantly larger.
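
To see why a Kronecker factorization shrinks a feed-forward matrix, consider the following minimal sketch. GPT2 (124M) uses 768 x 3072 MLP matrices; the factor shapes chosen here (32 x 64 and 24 x 48) are hypothetical and are not taken from the paper, which compresses to various degrees.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a_val * b[i][j] for a_val in a_row for j in range(cols_b)]
        for a_row in a
        for i in range(rows_b)
    ]


def kron_param_count(shape_a, shape_b):
    """Parameters stored when a matrix is kept as two Kronecker factors."""
    return shape_a[0] * shape_a[1] + shape_b[0] * shape_b[1]


# Small worked example: a (2 x 2) and a (2 x 2) factor give a (4 x 4) matrix.
small = kron([[1, 2], [3, 4]], [[0, 1], [1, 0]])

# Hypothetical factor shapes whose product matches a 768 x 3072 matrix.
shape_a, shape_b = (32, 64), (24, 48)
full_rows, full_cols = shape_a[0] * shape_b[0], shape_a[1] * shape_b[1]
dense_params = full_rows * full_cols
factored_params = kron_param_count(shape_a, shape_b)
```

Under these assumed shapes, the dense matrix holds 2,359,296 parameters while the two factors hold 3,200. Less aggressive compression would need larger factors or a sum of several Kronecker products.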
[LINK]
http://arxiv.org/abs/2412.12351v1
[DATE]
2024-12-17 04:44:01+08:00
[CATEGORIES]
cs.LG
cs.CL
Granite Guardian
[AUTHORS]
Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Zahra Ashktorab, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri
[ABSTRACT]
We introduce the Granite Guardian models, a suite of safeguards designed to
provide risk detection for prompts and responses, enabling safe and responsible
use in combination with any large language model (LLM). These models offer
comprehensive coverage across multiple risk dimensions, including social bias,
profanity, violence, sexual content, unethical behavior, jailbreaking, and
hallucination-related risks such as context relevance, groundedness, and answer
relevance for retrieval-augmented generation (RAG). Trained on a unique dataset
combining human annotations from diverse sources and synthetic data, Granite
Guardian models address risks typically overlooked by traditional risk
detection models, such as jailbreaks and RAG-specific issues. With AUC scores
of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks
respectively, Granite Guardian is the most generalizable and competitive model
available in the space. Released as open-source, Granite Guardian aims to
promote responsible AI development across the community.
https://github.com/ibm-granite/granite-guardian
[LINK]
http://arxiv.org/abs/2412.07724v2
[DATE]
2024-12-17 04:27:36+08:00
[CATEGORIES]
cs.CL
RAG Playground: A Framework for Systematic Evaluation of Retrieval Strategies and Prompt Engineering in RAG Systems
[AUTHORS]
Ioannis Papadimitriou, Ilias Gialampoukidis, Stefanos Vrochidis, Ioannis Kompatsiaris
[ABSTRACT]
We present RAG Playground, an open-source framework for systematic evaluation
of Retrieval-Augmented Generation (RAG) systems. The framework implements and
compares three retrieval approaches: naive vector search, reranking, and hybrid
vector-keyword search, combined with ReAct agents using different prompting
strategies. We introduce a comprehensive evaluation framework with novel
metrics and provide empirical results comparing different language models
(Llama 3.1 and Qwen 2.5) across various retrieval configurations. Our
experiments demonstrate significant performance improvements through hybrid
search methods and structured self-evaluation prompting, achieving up to 72.7%
pass rate on our multi-metric evaluation framework. The results also highlight
the importance of prompt engineering in RAG systems, with our custom-prompted
agents showing consistent improvements in retrieval accuracy and response
quality.
[COMMENTS]
Work In Progress
[LINK]
http://arxiv.org/abs/2412.12322v1
[DATE]
2024-12-17 03:40:26+08:00
[CATEGORIES]
cs.LG
cs.CL
Graph-Guided Textual Explanation Generation Framework
[AUTHORS]
Shuzhou Yuan, Jingyi Sun, Ran Zhang, Michael Färber, Steffen Eger, Pepa Atanasova, Isabelle Augenstein
[ABSTRACT]
Natural language explanations (NLEs) are commonly used to provide plausible
free-text explanations of a model’s reasoning about its predictions. However,
recent work has questioned the faithfulness of NLEs, as they may not accurately
reflect the model’s internal reasoning process regarding its predicted answer.
In contrast, highlight explanations – input fragments identified as critical
for the model’s predictions – exhibit measurable faithfulness, which has been
incrementally improved through existing research. Building on this foundation,
we propose G-Tex, a Graph-Guided Textual Explanation Generation framework
designed to enhance the faithfulness of NLEs by leveraging highlight
explanations. Specifically, highlight explanations are extracted as highly
faithful cues representing the model’s reasoning and are subsequently encoded
through a graph neural network layer, which explicitly guides the NLE
generation process. This alignment ensures that the generated explanations
closely reflect the model’s underlying reasoning. Experiments on T5 and BART
using three reasoning datasets show that G-Tex improves NLE faithfulness by up
to 17.59% compared to baseline methods. Additionally, G-Tex generates NLEs with
greater semantic and lexical similarity to human-written ones. Human
evaluations show that G-Tex can decrease redundant content and enhance the
overall quality of NLEs. As our work introduces a novel method for explicitly
guiding NLE generation to improve faithfulness, we hope it will serve as a
stepping stone for addressing additional criteria for NLE and generated text
overall.
[LINK]
http://arxiv.org/abs/2412.12318v1
[DATE]
2024-12-17 03:35:55+08:00
[CATEGORIES]
cs.CL
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion
[AUTHORS]
Jianqing Zhu, Huang Huang, Zhihang Lin, Juhao Liang, Zhengyang Tang, Khalid Almubarak, Abdulmohsen Alharthik, Bang An, Juncai He, Xiangbo Wu, Fei Yu, Junying Chen, Zhuoheng Ma, Yuhao Du, He Zhang, Emad A. Alghamdi, Lian Zhang, Ruoyu Sun, Haizhou Li, Benyou Wang, Jinchao Xu
[ABSTRACT]
This paper addresses the critical need for democratizing large language
models (LLMs) in the Arab world, a region that has seen slower progress in
developing models comparable to state-of-the-art offerings like GPT-4 or
ChatGPT 3.5, due to a predominant focus on mainstream languages (e.g., English
and Chinese). One practical objective for an Arabic LLM is to utilize an
Arabic-specific vocabulary for the tokenizer that could speed up decoding.
However, using a different vocabulary often leads to a degradation of learned
knowledge since many words are initially out-of-vocabulary (OOV) when training
starts. Inspired by the vocabulary learning during Second Language (Arabic)
Acquisition for humans, the released AraLLaMA employs progressive vocabulary
expansion, which is implemented by a modified BPE algorithm that progressively
extends the Arabic subwords in its dynamic vocabulary during training, thereby
balancing the OOV ratio at every stage. The ablation study demonstrated the
effectiveness of Progressive Vocabulary Expansion. Moreover, AraLLaMA achieves
decent performance comparable to the best Arabic LLMs across a variety of
Arabic benchmarks. Models, training data, benchmarks, and code will all be
open-sourced.
[LINK]
http://arxiv.org/abs/2412.12310v1
[DATE]
2024-12-17 03:29:06+08:00
[CATEGORIES]
cs.CL
One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation
[AUTHORS]
Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Peter Deisenroth, Sepp Hochreiter
[ABSTRACT]
Foundation models (FMs) are pre-trained on large-scale datasets and then
fine-tuned on a downstream task for a specific application. The most successful
and most commonly used fine-tuning method is to update the pre-trained weights
via a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are
usually initialized at random with a uniform rank distribution across the model
weights. Recent works focus on different initialization schemes or the learning
of adaptive ranks during fine-tuning. Both approaches have only been
investigated in isolation, resulting in slow convergence or a uniform rank
distribution, in turn leading to suboptimal performance. We propose to improve
LoRA by initializing the new weights in a data-driven manner by computing
singular value decomposition (SVD) on minibatches of activation vectors. Then,
we initialize the LoRA matrices with the obtained right-singular vectors and
redistribute ranks among all weight matrices to provably store the maximum
amount of information of the downstream data in the newly introduced weights.
In this way, only what information to maintain or neglect during the
fine-tuning process needs to be learned. We call our new method
Explained Variance Adaptation (EVA). We apply
EVA to a variety of fine-tuning tasks ranging from language generation and
understanding to image classification and reinforcement learning. EVA exhibits
faster convergence than competitors and achieves the highest average score
across a multitude of tasks per domain while reducing the number of trainable
parameters through rank redistribution.
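
The data-driven initialization described above can be illustrated with a minimal sketch. It extracts the top right-singular vector of a small activation matrix and the share of variance that vector explains. Power iteration is swapped in here for a full SVD to keep the example dependency-free, and the function names and toy matrix are hypothetical; this is not the EVA implementation.

```python
import math


def top_right_singular_vector(x, iters=200):
    """Approximate the top right-singular vector of x by power iteration.

    x: activation matrix as a list of rows (samples x features).
    Assumes x is non-zero and the start vector is not orthogonal to the answer.
    """
    dim = len(x[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Apply X^T X to v in two steps: first X v, then X^T (X v).
        xv = [sum(row[j] * v[j] for j in range(dim)) for row in x]
        w = [sum(x[i][j] * xv[i] for i in range(len(x))) for j in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


def explained_variance_ratio(x, v):
    """Share of the squared Frobenius norm of x captured along direction v."""
    projected = sum(
        sum(row[j] * v[j] for j in range(len(v))) ** 2 for row in x
    )
    total = sum(c * c for row in x for c in row)
    return projected / total


activations = [[3.0, 0.0], [0.0, 1.0]]
direction = top_right_singular_vector(activations)
ratio = explained_variance_ratio(activations, direction)
```

In this toy case the direction converges to the first feature axis and explains 9/10 of the variance. A rank-redistribution scheme could, under these assumptions, give more rank to weight matrices whose leading directions explain more variance.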
[COMMENTS]
11 pages + references and appendix, code available at
https://github.com/ml-jku/EVA
[LINK]
http://arxiv.org/abs/2410.07170v3
[DATE]
2024-12-17 03:19:14+08:00
[CATEGORIES]
cs.LG
cs.CL
Unanswerability Evaluation for Retrieval Augmented Generation
[AUTHORS]
Xiangyu Peng, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng Wu
[ABSTRACT]
Existing evaluation frameworks for retrieval-augmented generation (RAG)
systems focus on answerable queries, but they overlook the importance of
appropriately rejecting unanswerable requests. In this paper, we introduce
UAEval4RAG, a framework designed to evaluate whether RAG systems can handle
unanswerable queries effectively. We define a taxonomy with six unanswerable
categories, and UAEval4RAG automatically synthesizes diverse and challenging
queries for any given knowledge base with unanswered ratio and acceptable ratio
metrics. We conduct experiments with various RAG components, including
retrieval models, rewriting methods, rerankers, language models, and prompting
strategies, and reveal hidden trade-offs in performance of RAG systems. Our
findings highlight the critical role of component selection and prompt design
in optimizing RAG systems to balance the accuracy of answerable queries with
high rejection rates of unanswerable ones. UAEval4RAG provides valuable
insights and tools for developing more robust and reliable RAG systems.
[LINK]
http://arxiv.org/abs/2412.12300v1
[DATE]
2024-12-17 03:11:55+08:00
[CATEGORIES]
cs.CL
Understanding Language Model Circuits through Knowledge Editing
[AUTHORS]
Huaizhi Ge, Frank Rudzicz, Zining Zhu
[ABSTRACT]
Recent advances in language model interpretability have identified circuits,
critical subnetworks that replicate model behaviors, yet how knowledge is
structured within these crucial subnetworks remains opaque. To better
understand the knowledge in these circuits, we conduct systematic
knowledge editing experiments on the circuits of the GPT-2 language model. Our
analysis reveals intriguing patterns in how circuits respond to editing
attempts, the extent of knowledge distribution across network components, and
the architectural composition of knowledge-bearing circuits. These findings
offer insights into the complex relationship between model circuits and
knowledge representation, deepening the understanding of how information is
organized within language models. Our findings offer novel insights into the
“meanings” of the circuits, and introduce directions for further
interpretability and safety research of language models.
[LINK]
http://arxiv.org/abs/2406.17241v3
[DATE]
2024-12-17 02:54:05+08:00
[CATEGORIES]
cs.CL
Using Contextual Information for Sentence-level Morpheme Segmentation
[AUTHORS]
Prabin Bhandari, Abhishek Paudel
[ABSTRACT]
Recent advancements in morpheme segmentation primarily emphasize word-level
segmentation, often neglecting the contextual relevance within the sentence. In
this study, we redefine the morpheme segmentation task as a
sequence-to-sequence problem, treating the entire sentence as input rather than
isolating individual words. Our findings reveal that the multilingual model
consistently exhibits superior performance compared to monolingual
counterparts. While our model did not surpass the performance of the current
state-of-the-art, it demonstrated comparable efficacy with high-resource
languages while revealing limitations in low-resource language scenarios.
[COMMENTS]
6 pages, 3 tables
[LINK]
http://arxiv.org/abs/2403.15436v3
[DATE]
2024-12-17 02:40:11+08:00
[CATEGORIES]
cs.CL
Toward Adaptive Large Language Models Structured Pruning via Hybrid-grained Weight Importance Assessment
[AUTHORS]
Jun Liu, Zhenglun Kong, Pu Zhao, Changdi Yang, Hao Tang, Xuan Shen, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Dong Huang, Yanzhi Wang
[ABSTRACT]
Structured pruning for large language models (LLMs) has garnered significant
academic interest due to its ability to efficiently compress and accelerate
LLMs by eliminating redundant weight groups at a coarse-grained granularity.
Current structured pruning methods for LLMs typically depend on a singular
granularity for assessing weight importance, resulting in notable performance
degradation in downstream tasks. Intriguingly, our empirical investigations
reveal that utilizing unstructured pruning, which achieves better performance
retention by pruning weights at a finer granularity, i.e., individual
weights, yields significantly varied sparse LLM structures when juxtaposed to
structured pruning. This suggests that evaluating both holistic and individual
assessment for weight importance is essential for LLM pruning. Building on this
insight, we introduce the Hybrid-grained Weight Importance Assessment (HyWIA),
a novel method that merges fine-grained and coarse-grained evaluations of
weight importance for the pruning of LLMs. Leveraging an attention mechanism,
HyWIA adaptively determines the optimal blend of granularity in weight
importance assessments in an end-to-end pruning manner. Extensive experiments
on LLaMA-V1/V2, Vicuna, Baichuan, and Bloom across various benchmarks
demonstrate the effectiveness of HyWIA in pruning LLMs. For example, HyWIA
surpasses the cutting-edge LLM-Pruner by an average margin of 2.82% in
accuracy across seven downstream tasks when pruning LLaMA-7B by 50%.
[LINK]
http://arxiv.org/abs/2403.10799v4
[DATE]
2024-12-17 02:31:27+08:00
[CATEGORIES]
cs.CL
cs.LG
MGH Radiology Llama: A Llama 3 70B Model for Radiology
[AUTHORS]
Yucheng Shi, Peng Shu, Zhengliang Liu, Zihao Wu, Quanzheng Li, Tianming Liu, Ninghao Liu, Xiang Li
[ABSTRACT]
In recent years, the field of radiology has increasingly harnessed the power
of artificial intelligence (AI) to enhance diagnostic accuracy, streamline
workflows, and improve patient care. Large language models (LLMs) have emerged
as particularly promising tools, offering significant potential in assisting
radiologists with report generation, clinical decision support, and patient
communication. This paper presents an advanced radiology-focused large language
model: MGH Radiology Llama. It is developed using the Llama 3 70B model,
building upon previous domain-specific models like Radiology-GPT and
Radiology-Llama2. Leveraging a unique and comprehensive dataset from
Massachusetts General Hospital, comprising over 6.5 million de-identified
medical reports across various imaging modalities, the model demonstrates
significant improvements in generating accurate and clinically relevant
radiology impressions given the corresponding findings. Our evaluation,
incorporating both traditional metrics and a GPT-4-based assessment, highlights
the enhanced performance of this work over general-purpose LLMs.
[COMMENTS]
11 pages, 3 figures, 1 table
[LINK]
http://arxiv.org/abs/2408.11848v2
[DATE]
2024-12-17 02:25:26+08:00
[CATEGORIES]
cs.CL
Scaling laws for nonlinear dynamical models of articulatory control
[AUTHORS]
Sam Kirkham
[ABSTRACT]
Dynamical theories of speech use computational models of articulatory control
to generate quantitative predictions and advance understanding of speech
dynamics. The addition of a nonlinear restoring force to task dynamic models is
a significant improvement over linear models, but nonlinearity introduces
challenges with parameterization and interpretability. We illustrate these
problems through numerical simulations and introduce solutions in the form of
scaling laws. We apply the scaling laws to a cubic model and show how they
facilitate interpretable simulations of articulatory dynamics, and can be
theoretically interpreted as imposing physical and cognitive constraints on
models of speech movement dynamics.
[COMMENTS]
Updated title and minor changes to text after first round of reviews
[LINK]
http://arxiv.org/abs/2411.12720v2
[DATE]
2024-12-17 02:24:53+08:00
[CATEGORIES]
cs.CL
Merging Text Transformer Models from Different Initializations
[AUTHORS]
Neha Verma, Maha Elbayad
[ABSTRACT]
Recent work on permutation-based model merging has shown impressive low- or
zero-barrier mode connectivity between models from completely different
initializations. However, this line of work has not yet extended to the
Transformer architecture, despite its dominant popularity in the language
domain. Therefore, in this work, we investigate the extent to which separate
Transformer minima learn similar features, and propose a model merging
technique to investigate the relationship between these minima in the loss
landscape. The specifics of the architecture, like its residual connections,
multi-headed attention, and discrete, sequential input, require specific
interventions in order to compute model permutations that remain within the
same functional equivalence class. In merging these models with our method, we
consistently find lower loss barriers between minima compared to model
averaging, across models trained on a masked-language modeling task or
fine-tuned on a language understanding benchmark. Our results show that the
minima of these models are less sharp and isolated than previously understood,
and provide a basis for future work on merging separately trained Transformer
models.
[COMMENTS]
TMLR, November 2024
[LINK]
http://arxiv.org/abs/2403.00986v3
[DATE]
2024-12-17 02:00:10+08:00
[CATEGORIES]
cs.CL
cs.LG
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
[AUTHORS]
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, Paul Voigtlaender, Rohan Jain, Gabriela Surita, Kareem Mohamed, Rory Blevins, Junwhan Ahn, Tao Zhu, Kornraphop Kawintiranon, Orhan Firat, Yiming Gu, Yujing Zhang, Matthew Rahtz, Manaal Faruqui, Natalie Clay, Justin Gilmer, JD Co-Reyes, Ivo Penchev, Rui Zhu, Nobuyuki Morioka, Kevin Hui, Krishna Haridasan, Victor Campos, Mahdis Mahdieh, Mandy Guo, Samer Hassan, Kevin Kilgour, Arpi Vezer, Heng-Tze Cheng, Raoul de Liedekerke, Siddharth Goyal, Paul Barham, DJ Strouse, Seb Noury, Jonas Adler, Mukund Sundararajan, Sharad Vikram, Dmitry Lepikhin, Michela Paganini, Xavier Garcia, Fan Yang, Dasha Valter, Maja Trebacz, Kiran Vodrahalli, Chulayuth Asawaroengchai, Roman Ring, Norbert Kalb, Livio Baldini Soares, Siddhartha Brahma, David Steiner, Tianhe Yu, Fabian Mentzer, Antoine He, Lucas Gonzalez, Bibo Xu, Raphael Lopez Kaufman, Laurent El Shafey, Junhyuk Oh, Tom Hennigan, George van den Driessche, Seth Odoom, Mario Lucic, Becca Roelofs, Sid Lall, Amit Marathe, Betty Chan, Santiago Ontanon, Luheng He, Denis Teplyashin, Jonathan Lai, Phil Crone, Bogdan Damoc, Lewis Ho, Sebastian Riedel, Karel Lenc, Chih-Kuan Yeh, Aakanksha Chowdhery, Yang Xu, Mehran Kazemi, Ehsan Amid, Anastasia Petrushkina, Kevin Swersky, Ali Khodaei, Gowoon Chen, Chris Larkin, Mario Pinto, Geng Yan, Adria Puigdomenech Badia, Piyush Patil, Steven Hansen, Dave Orr, Sebastien M. R. 
Arnold, Jordan Grimstad, Andrew Dai, Sholto Douglas, Rishika Sinha, Vikas Yadav, Xi Chen, Elena Gribovskaya, Jacob Austin, Jeffrey Zhao, Kaushal Patel, Paul Komarek, Sophia Austin, Sebastian Borgeaud, Linda Friso, Abhimanyu Goyal, Ben Caine, Kris Cao, Da-Woon Chung, Matthew Lamm, Gabe Barth-Maron, Thais Kagohara, Kate Olszewska, Mia Chen, Kaushik Shivakumar, Rishabh Agarwal, Harshal Godhia, Ravi Rajwar, Javier Snaider, Xerxes Dotiwalla, Yuan Liu, Aditya Barua, Victor Ungureanu, Yuan Zhang, Bat-Orgil Batsaikhan, Mateo Wirth, James Qin, Ivo Danihelka, Tulsee Doshi, Martin Chadwick, Jilin Chen, Sanil Jain, Quoc Le, Arjun Kar, Madhu Gurumurthy, Cheng Li, Ruoxin Sang, Fangyu Liu, Lampros Lamprou, Rich Munoz, Nathan Lintz, Harsh Mehta, Heidi Howard, Malcolm Reynolds, Lora Aroyo, Quan Wang, Lorenzo Blanco, Albin Cassirer, Jordan Griffith, Dipanjan Das, Stephan Lee, Jakub Sygnowski, Zach Fisher, James Besley, Richard Powell, Zafarali Ahmed, Dominik Paulus, David Reitter, Zalan Borsos, Rishabh Joshi, Aedan Pope, Steven Hand, Vittorio Selo, Vihan Jain, Nikhil Sethi, Megha Goel, Takaki Makino, Rhys May, Zhen Yang, Johan Schalkwyk, Christina Butterfield, Anja Hauth, Alex Goldin, Will Hawkins, Evan Senter, Sergey Brin, Oliver Woodman, Marvin Ritter, Eric Noland, Minh Giang, Vijay Bolina, Lisa Lee, Tim Blyth, Ian Mackinnon, Machel Reid, Obaid Sarvana, David Silver, Alexander Chen, Lily Wang, Loren Maggiore, Oscar Chang, Nithya Attaluri, Gregory Thornton, Chung-Cheng Chiu, Oskar Bunyan, Nir Levine, Timothy Chung, Evgenii Eltyshev, Xiance Si, Timothy Lillicrap, Demetra Brady, Vaibhav Aggarwal, Boxi Wu, Yuanzhong Xu, Ross McIlroy, Kartikeya Badola, Paramjit Sandhu, Erica Moreira, Wojciech Stokowiec, Ross Hemsley, Dong Li, Alex Tudor, Pranav Shyam, Elahe Rahimtoroghi, Salem Haykal, Pablo Sprechmann, Xiang Zhou, Diana Mincu, Yujia Li, Ravi Addanki, Kalpesh Krishna, Xiao Wu, Alexandre Frechette, Matan Eyal, Allan Dafoe, Dave Lacey, Jay Whang, Thi Avrahami, Ye Zhang, Emanuel Taropa, 
Hanzhao Lin, Daniel Toyama, Eliza Rutherford, Motoki Sano, HyunJeong Choe, Alex Tomala, Chalence Safranek-Shrader, Nora Kassner, Mantas Pajarskas, Matt Harvey, Sean Sechrist, Meire Fortunato, Christina Lyu, Gamaleldin Elsayed, Chenkai Kuang, James Lottes, Eric Chu, Chao Jia, Chih-Wei Chen, Peter Humphreys, Kate Baumli, Connie Tao, Rajkumar Samuel, Cicero Nogueira dos Santos, Anders Andreassen, Nemanja Rakićević, Dominik Grewe, Aviral Kumar, Stephanie Winkler, Jonathan Caton, Andrew Brock, Sid Dalmia, Hannah Sheahan, Iain Barr, Yingjie Miao, Paul Natsev, Jacob Devlin, Feryal Behbahani, Flavien Prost, Yanhua Sun, Artiom Myaskovsky, Thanumalayan Sankaranarayana Pillai, Dan Hurt, Angeliki Lazaridou, Xi Xiong, Ce Zheng, Fabio Pardo, Xiaowei Li, Dan Horgan, Joe Stanton, Moran Ambar, Fei Xia, Alejandro Lince, Mingqiu Wang, Basil Mustafa, Albert Webson, Hyo Lee, Rohan Anil, Martin Wicke, Timothy Dozat, Abhishek Sinha, Enrique Piqueras, Elahe Dabir, Shyam Upadhyay, Anudhyan Boral, Lisa Anne Hendricks, Corey Fry, Josip Djolonga, Yi Su, Jake Walker, Jane Labanowski, Ronny Huang, Vedant Misra, Jeremy Chen, RJ Skerry-Ryan, Avi Singh, Shruti Rijhwani, Dian Yu, Alex Castro-Ros, Beer Changpinyo, Romina Datta, Sumit Bagri, Arnar Mar Hrafnkelsson, Marcello Maggioni, Daniel Zheng, Yury Sulsky, Shaobo Hou, Tom Le Paine, Antoine Yang, Jason Riesa, Dominika Rogozinska, Dror Marcus, Dalia El Badawy, Qiao Zhang, Luyu Wang, Helen Miller, Jeremy Greer, Lars Lowe Sjos, Azade Nova, Heiga Zen, Rahma Chaabouni, Mihaela Rosca, Jiepu Jiang, Charlie Chen, Ruibo Liu, Tara Sainath, Maxim Krikun, Alex Polozov, Jean-Baptiste Lespiau, Josh Newlan, Zeyncep Cankara, Soo Kwak, Yunhan Xu, Phil Chen, Andy Coenen, Clemens Meyer, Katerina Tsihlas, Ada Ma, Juraj Gottweis, Jinwei Xing, Chenjie Gu, Jin Miao, Christian Frank, Zeynep Cankara, Sanjay Ganapathy, Ishita Dasgupta, Steph Hughes-Fitt, Heng Chen, David Reid, Keran Rong, Hongmin Fan, Joost van Amersfoort, Vincent Zhuang, Aaron Cohen, Shixiang Shane Gu, 
Anhad Mohananey, Anastasija Ilic, Taylor Tobin, John Wieting, Anna Bortsova, Phoebe Thacker, Emma Wang, Emily Caveness, Justin Chiu, Eren Sezener, Alex Kaskasoli, Steven Baker, Katie Millican, Mohamed Elhawaty, Kostas Aisopos, Carl Lebsack, Nathan Byrd, Hanjun Dai, Wenhao Jia, Matthew Wiethoff, Elnaz Davoodi, Albert Weston, Lakshman Yagati, Arun Ahuja, Isabel Gao, Golan Pundak, Susan Zhang, Michael Azzam, Khe Chai Sim, Sergi Caelles, James Keeling, Abhanshu Sharma, Andy Swing, YaGuang Li, Chenxi Liu, Carrie Grimes Bostock, Yamini Bansal, Zachary Nado, Ankesh Anand, Josh Lipschultz, Abhijit Karmarkar, Lev Proleev, Abe Ittycheriah, Soheil Hassas Yeganeh, George Polovets, Aleksandra Faust, Jiao Sun, Alban Rrustemi, Pen Li, Rakesh Shivanna, Jeremiah Liu, Chris Welty, Federico Lebron, Anirudh Baddepudi, Sebastian Krause, Emilio Parisotto, Radu Soricut, Zheng Xu, Dawn Bloxwich, Melvin Johnson, Behnam Neyshabur, Justin Mao-Jones, Renshen Wang, Vinay Ramasesh, Zaheer Abbas, Arthur Guez, Constant Segal, Duc Dung Nguyen, James Svensson, Le Hou, Sarah York, Kieran Milan, Sophie Bridgers, Wiktor Gworek, Marco Tagliasacchi, James Lee-Thorp, Michael Chang, Alexey Guseynov, Ale Jakse Hartman, Michael Kwong, Ruizhe Zhao, Sheleem Kashem, Elizabeth Cole, Antoine Miech, Richard Tanburn, Mary Phuong, Filip Pavetic, Sebastien Cevey, Ramona Comanescu, Richard Ives, Sherry Yang, Cosmo Du, Bo Li, Zizhao Zhang, Mariko Iinuma, Clara Huiyi Hu, Aurko Roy, Shaan Bijwadia, Zhenkai Zhu, Danilo Martins, Rachel Saputro, Anita Gergely, Steven Zheng, Dawei Jia, Ioannis Antonoglou, Adam Sadovsky, Shane Gu, Yingying Bi, Alek Andreev, Sina Samangooei, Mina Khan, Tomas Kocisky, Angelos Filos, Chintu Kumar, Colton Bishop, Adams Yu, Sarah Hodkinson, Sid Mittal, Premal Shah, Alexandre Moufarek, Yong Cheng, Adam Bloniarz, Jaehoon Lee, Pedram Pejman, Paul Michel, Stephen Spencer, Vladimir Feinberg, Xuehan Xiong, Nikolay Savinov, Charlotte Smith, Siamak Shakeri, Dustin Tran, Mary Chesus, Bernd Bohnet, George 
Tucker, Tamara von Glehn, Carrie Muir, Yiran Mao, Hideto Kazawa, Ambrose Slone, Kedar Soparkar, Disha Shrivastava, James Cobon-Kerr, Michael Sharman, Jay Pavagadhi, Carlos Araya, Karolis Misiunas, Nimesh Ghelani, Michael Laskin, David Barker, Qiujia Li, Anton Briukhov, Neil Houlsby, Mia Glaese, Balaji Lakshminarayanan, Nathan Schucher, Yunhao Tang, Eli Collins, Hyeontaek Lim, Fangxiaoyu Feng, Adria Recasens, Guangda Lai, Alberto Magni, Nicola De Cao, Aditya Siddhant, Zoe Ashwood, Jordi Orbay, Mostafa Dehghani, Jenny Brennan, Yifan He, Kelvin Xu, Yang Gao, Carl Saroufim, James Molloy, Xinyi Wu, Seb Arnold, Solomon Chang, Julian Schrittwieser, Elena Buchatskaya, Soroush Radpour, Martin Polacek, Skye Giordano, Ankur Bapna, Simon Tokumine, Vincent Hellendoorn, Thibault Sottiaux, Sarah Cogan, Aliaksei Severyn, Mohammad Saleh, Shantanu Thakoor, Laurent Shefey, Siyuan Qiao, Meenu Gaba, Shuo-yiin Chang, Craig Swanson, Biao Zhang, Benjamin Lee, Paul Kishan Rubenstein, Gan Song, Tom Kwiatkowski, Anna Koop, Ajay Kannan, David Kao, Parker Schuh, Axel Stjerngren, Golnaz Ghiasi, Gena Gibson, Luke Vilnis, Ye Yuan, Felipe Tiengo Ferreira, Aishwarya Kamath, Ted Klimenko, Ken Franko, Kefan Xiao, Indro Bhattacharya, Miteyan Patel, Rui Wang, Alex Morris, Robin Strudel, Vivek Sharma, Peter Choy, Sayed Hadi Hashemi, Jessica Landon, Mara Finkelstein, Priya Jhakra, Justin Frye, Megan Barnes, Matthew Mauger, Dennis Daun, Khuslen Baatarsukh, Matthew Tung, Wael Farhan, Henryk Michalewski, Fabio Viola, Felix de Chaumont Quitry, Charline Le Lan, Tom Hudson, Qingze Wang, Felix Fischer, Ivy Zheng, Elspeth White, Anca Dragan, Jean-baptiste Alayrac, Eric Ni, Alexander Pritzel, Adam Iwanicki, Michael Isard, Anna Bulanova, Lukas Zilka, Ethan Dyer, Devendra Sachan, Srivatsan Srinivasan, Hannah Muckenhirn, Honglong Cai, Amol Mandhane, Mukarram Tariq, Jack W. 
Rae, Gary Wang, Kareem Ayoub, Nicholas FitzGerald, Yao Zhao, Woohyun Han, Chris Alberti, Dan Garrette, Kashyap Krishnakumar, Mai Gimenez, Anselm Levskaya, Daniel Sohn, Josip Matak, Inaki Iturrate, Michael B. Chang, Jackie Xiang, Yuan Cao, Nishant Ranka, Geoff Brown, Adrian Hutter, Vahab Mirrokni, Nanxin Chen, Kaisheng Yao, Zoltan Egyed, Francois Galilee, Tyler Liechty, Praveen Kallakuri, Evan Palmer, Sanjay Ghemawat, Jasmine Liu, David Tao, Chloe Thornton, Tim Green, Mimi Jasarevic, Sharon Lin, Victor Cotruta, Yi-Xuan Tan, Noah Fiedel, Hongkun Yu, Ed Chi, Alexander Neitz, Jens Heitkaemper, Anu Sinha, Denny Zhou, Yi Sun, Charbel Kaed, Brice Hulse, Swaroop Mishra, Maria Georgaki, Sneha Kudugunta, Clement Farabet, Izhak Shafran, Daniel Vlasic, Anton Tsitsulin, Rajagopal Ananthanarayanan, Alen Carin, Guolong Su, Pei Sun, Shashank V, Gabriel Carvajal, Josef Broder, Iulia Comsa, Alena Repina, William Wong, Warren Weilun Chen, Peter Hawkins, Egor Filonov, Lucia Loher, Christoph Hirnschall, Weiyi Wang, Jingchen Ye, Andrea Burns, Hardie Cate, Diana Gage Wright, Federico Piccinini, Lei Zhang, Chu-Cheng Lin, Ionel Gog, Yana Kulizhskaya, Ashwin Sreevatsa, Shuang Song, Luis C. Cobo, Anand Iyer, Chetan Tekur, Guillermo Garrido, Zhuyun Xiao, Rupert Kemp, Huaixiu Steven Zheng, Hui Li, Ananth Agarwal, Christel Ngani, Kati Goshvadi, Rebeca Santamaria-Fernandez, Wojciech Fica, Xinyun Chen, Chris Gorgolewski, Sean Sun, Roopal Garg, Xinyu Ye, S. M. 
Ali Eslami, Nan Hua, Jon Simon, Pratik Joshi, Yelin Kim, Ian Tenney, Sahitya Potluri, Lam Nguyen Thiet, Quan Yuan, Florian Luisier, Alexandra Chronopoulou, Salvatore Scellato, Praveen Srinivasan, Minmin Chen, Vinod Koverkathu, Valentin Dalibard, Yaming Xu, Brennan Saeta, Keith Anderson, Thibault Sellam, Nick Fernando, Fantine Huot, Junehyuk Jung, Mani Varadarajan, Michael Quinn, Amit Raul, Maigo Le, Ruslan Habalov, Jon Clark, Komal Jalan, Kalesha Bullard, Achintya Singhal, Thang Luong, Boyu Wang, Sujeevan Rajayogam, Julian Eisenschlos, Johnson Jia, Daniel Finchelstein, Alex Yakubovich, Daniel Balle, Michael Fink, Sameer Agarwal, Jing Li, Dj Dvijotham, Shalini Pal, Kai Kang, Jaclyn Konzelmann, Jennifer Beattie, Olivier Dousse, Diane Wu, Remi Crocker, Chen Elkind, Siddhartha Reddy Jonnalagadda, Jong Lee, Dan Holtmann-Rice, Krystal Kallarackal, Rosanne Liu, Denis Vnukov, Neera Vats, Luca Invernizzi, Mohsen Jafari, Huanjie Zhou, Lilly Taylor, Jennifer Prendki, Marcus Wu, Tom Eccles, Tianqi Liu, Kavya Kopparapu, Francoise Beaufays, Christof Angermueller, Andreea Marzoca, Shourya Sarcar, Hilal Dib, Jeff Stanway, Frank Perbet, Nejc Trdin, Rachel Sterneck, Andrey Khorlin, Dinghua Li, Xihui Wu, Sonam Goenka, David Madras, Sasha Goldshtein, Willi Gierke, Tong Zhou, Yaxin Liu, Yannie Liang, Anais White, Yunjie Li, Shreya Singh, Sanaz Bahargam, Mark Epstein, Sujoy Basu, Li Lao, Adnan Ozturel, Carl Crous, Alex Zhai, Han Lu, Zora Tung, Neeraj Gaur, Alanna Walton, Lucas Dixon, Ming Zhang, Amir Globerson, Grant Uy, Andrew Bolt, Olivia Wiles, Milad Nasr, Ilia Shumailov, Marco Selvi, Francesco Piccinno, Ricardo Aguilar, Sara McCarthy, Misha Khalman, Mrinal Shukla, Vlado Galic, John Carpenter, Kevin Villela, Haibin Zhang, Harry Richardson, James Martens, Matko Bosnjak, Shreyas Rammohan Belle, Jeff Seibert, Mahmoud Alnahlawi, Brian McWilliams, Sankalp Singh, Annie Louis, Wen Ding, Dan Popovici, Lenin Simicich, Laura Knight, Pulkit Mehta, Nishesh Gupta, Chongyang Shi, Saaber Fatehi, 
Jovana Mitrovic, Alex Grills, Joseph Pagadora, Tsendsuren Munkhdalai, Dessie Petrova, Danielle Eisenbud, Zhishuai Zhang, Damion Yates, Bhavishya Mittal, Nilesh Tripuraneni, Yannis Assael, Thomas Brovelli, Prateek Jain, Mihajlo Velimirovic, Canfer Akbulut, Jiaqi Mu, Wolfgang Macherey, Ravin Kumar, Jun Xu, Haroon Qureshi, Gheorghe Comanici, Jeremy Wiesner, Zhitao Gong, Anton Ruddock, Matthias Bauer, Nick Felt, Anirudh GP, Anurag Arnab, Dustin Zelle, Jonas Rothfuss, Bill Rosgen, Ashish Shenoy, Bryan Seybold, Xinjian Li, Jayaram Mudigonda, Goker Erdogan, Jiawei Xia, Jiri Simsa, Andrea Michi, Yi Yao, Christopher Yew, Steven Kan, Isaac Caswell, Carey Radebaugh, Andre Elisseeff, Pedro Valenzuela, Kay McKinney, Kim Paterson, Albert Cui, Eri Latorre-Chimoto, Solomon Kim, William Zeng, Ken Durden, Priya Ponnapalli, Tiberiu Sosea, Christopher A. Choquette-Choo, James Manyika, Brona Robenek, Harsha Vashisht, Sebastien Pereira, Hoi Lam, Marko Velic, Denese Owusu-Afriyie, Katherine Lee, Tolga Bolukbasi, Alicia Parrish, Shawn Lu, Jane Park, Balaji Venkatraman, Alice Talbert, Lambert Rosique, Yuchung Cheng, Andrei Sozanschi, Adam Paszke, Praveen Kumar, Jessica Austin, Lu Li, Khalid Salama, Bartek Perz, Wooyeol Kim, Nandita Dukkipati, Anthony Baryshnikov, Christos Kaplanis, XiangHai Sheng, Yuri Chervonyi, Caglar Unlu, Diego de Las Casas, Harry Askham, Kathryn Tunyasuvunakool, Felix Gimeno, Siim Poder, Chester Kwak, Matt Miecnikowski, Vahab Mirrokni, Alek Dimitriev, Aaron Parisi, Dangyi Liu, Tomy Tsai, Toby Shevlane, Christina Kouridi, Drew Garmon, Adrian Goedeckemeyer, Adam R. 
Brown, Anitha Vijayakumar, Ali Elqursh, Sadegh Jazayeri, Jin Huang, Sara Mc Carthy, Jay Hoover, Lucy Kim, Sandeep Kumar, Wei Chen, Courtney Biles, Garrett Bingham, Evan Rosen, Lisa Wang, Qijun Tan, David Engel, Francesco Pongetti, Dario de Cesare, Dongseong Hwang, Lily Yu, Jennifer Pullman, Srini Narayanan, Kyle Levin, Siddharth Gopal, Megan Li, Asaf Aharoni, Trieu Trinh, Jessica Lo, Norman Casagrande, Roopali Vij, Loic Matthey, Bramandia Ramadhana, Austin Matthews, CJ Carey, Matthew Johnson, Kremena Goranova, Rohin Shah, Shereen Ashraf, Kingshuk Dasgupta, Rasmus Larsen, Yicheng Wang, Manish Reddy Vuyyuru, Chong Jiang, Joana Ijazi, Kazuki Osawa, Celine Smith, Ramya Sree Boppana, Taylan Bilal, Yuma Koizumi, Ying Xu, Yasemin Altun, Nir Shabat, Ben Bariach, Alex Korchemniy, Kiam Choo, Olaf Ronneberger, Chimezie Iwuanyanwu, Shubin Zhao, David Soergel, Cho-Jui Hsieh, Irene Cai, Shariq Iqbal, Martin Sundermeyer, Zhe Chen, Elie Bursztein, Chaitanya Malaviya, Fadi Biadsy, Prakash Shroff, Inderjit Dhillon, Tejasi Latkar, Chris Dyer, Hannah Forbes, Massimo Nicosia, Vitaly Nikolaev, Somer Greene, Marin Georgiev, Pidong Wang, Nina Martin, Hanie Sedghi, John Zhang, Praseem Banzal, Doug Fritz, Vikram Rao, Xuezhi Wang, Jiageng Zhang, Viorica Patraucean, Dayou Du, Igor Mordatch, Ivan Jurin, Lewis Liu, Ayush Dubey, Abhi Mohan, Janek Nowakowski, Vlad-Doru Ion, Nan Wei, Reiko Tojo, Maria Abi Raad, Drew A. 
Hudson, Vaishakh Keshava, Shubham Agrawal, Kevin Ramirez, Zhichun Wu, Hoang Nguyen, Ji Liu, Madhavi Sewak, Bryce Petrini, DongHyun Choi, Ivan Philips, Ziyue Wang, Ioana Bica, Ankush Garg, Jarek Wilkiewicz, Priyanka Agrawal, Xiaowei Li, Danhao Guo, Emily Xue, Naseer Shaik, Andrew Leach, Sadh MNM Khan, Julia Wiesinger, Sammy Jerome, Abhishek Chakladar, Alek Wenjiao Wang, Tina Ornduff, Folake Abu, Alireza Ghaffarkhah, Marcus Wainwright, Mario Cortes, Frederick Liu, Joshua Maynez, Andreas Terzis, Pouya Samangouei, Riham Mansour, Tomasz Kępa, François-Xavier Aubet, Anton Algymr, Dan Banica, Agoston Weisz, Andras Orban, Alexandre Senges, Ewa Andrejczuk, Mark Geller, Niccolo Dal Santo, Valentin Anklin, Majd Al Merey, Martin Baeuml, Trevor Strohman, Junwen Bai, Slav Petrov, Yonghui Wu, Demis Hassabis, Koray Kavukcuoglu, Jeff Dean, Oriol Vinyals
[ABSTRACT]
In this report, we introduce the Gemini 1.5 family of models, representing
the next generation of highly compute-efficient multimodal models capable of
recalling and reasoning over fine-grained information from millions of tokens
of context, including multiple long documents and hours of video and audio. The
family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds
the February version on the great majority of capabilities and benchmarks; (2)
Gemini 1.5 Flash, a more lightweight variant designed for efficiency with
minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on
long-context retrieval tasks across modalities, improve the state-of-the-art in
long-document QA, long-video QA and long-context ASR, and match or surpass
Gemini 1.0 Ultra’s state-of-the-art performance across a broad set of
benchmarks. Studying the limits of Gemini 1.5’s long-context ability, we find
continued improvement in next-token prediction and near-perfect retrieval
(>99%) up to at least 10M tokens, a generational leap over existing models such
as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world
use cases, such as Gemini 1.5 collaborating with professionals on completing
their tasks, achieving 26 to 75% time savings across 10 different job
categories, as well as surprising new capabilities of large language models at
the frontier; when given a grammar manual for Kalamang, a language with fewer
than 200 speakers worldwide, the model learns to translate English to Kalamang
at a similar level to a person who learned from the same content.
[LINK]
http://arxiv.org/abs/2403.05530v5
[DATE]
2024-12-17 01:39:39+08:00
[CATEGORIES]
cs.CL
SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
[AUTHORS]
Yueqian Lin, Yuzhe Fu, Jingyang Zhang, Yudong Liu, Jianyi Zhang, Jingwei Sun, Hai “Helen” Li, Yiran Chen
[ABSTRACT]
We introduce Speech Information Retrieval (SIR), a new long-context task for
Speech Large Language Models (Speech LLMs), and present SPIRAL, a 1,012-sample
benchmark testing models’ ability to extract critical details from
approximately 90-second spoken inputs. While current Speech LLMs excel at
short-form tasks, they struggle with the computational and representational
demands of longer audio sequences. To address this limitation, we propose
SpeechPrune, a training-free token pruning strategy that uses speech-text
similarity and approximated attention scores to efficiently discard irrelevant
tokens. In SPIRAL, SpeechPrune achieves accuracy improvements of 29% and up to
47% over the original model and the random pruning model at a pruning rate of
20%, respectively. SpeechPrune can maintain network performance even at a
pruning level of 80%. This approach highlights the potential of token-level
pruning for efficient and scalable long-form speech understanding.
[COMMENTS]
Project page and dataset are available at
https://speechprune.github.io/
[LINK]
http://arxiv.org/abs/2412.12009v1
[DATE]
2024-12-17 01:36:02+08:00
[CATEGORIES]
cs.CL
The Open Source Advantage in Large Language Models (LLMs)
[AUTHORS]
Jiya Manchanda, Laura Boettcher, Matheus Westphalen, Jasser Jasser
[ABSTRACT]
Large language models (LLMs) mark a key shift in natural language processing
(NLP), having advanced text generation, translation, and domain-specific
reasoning. Closed-source models like GPT-4, powered by proprietary datasets and
extensive computational resources, lead with state-of-the-art performance
today. However, they face criticism for their “black box” nature and for
limiting accessibility in a manner that hinders reproducibility and equitable
AI development. By contrast, open-source initiatives like LLaMA and BLOOM
prioritize democratization through community-driven development and
computational efficiency. These models have significantly reduced performance
gaps, particularly in linguistic diversity and domain-specific applications,
while providing accessible tools for global researchers and developers.
Notably, both paradigms rely on foundational architectural innovations, such as
the Transformer framework by Vaswani et al. (2017). Closed-source models excel
by scaling effectively, while open-source models adapt to real-world
applications in underrepresented languages and domains. Techniques like
Low-Rank Adaptation (LoRA) and instruction-tuning datasets enable open-source
models to achieve competitive results despite limited resources. To be sure,
the tension between closed-source and open-source approaches underscores a
broader debate on transparency versus proprietary control in AI. Ethical
considerations further highlight this divide. Closed-source systems restrict
external scrutiny, while open-source models promote reproducibility and
collaboration but lack standardized auditing documentation frameworks to
mitigate biases. Hybrid approaches that leverage the strengths of both
paradigms are likely to shape the future of LLM innovation, ensuring
accessibility, competitive technical performance, and ethical deployment.
[COMMENTS]
7 pages, 0 figures
[LINK]
http://arxiv.org/abs/2412.12004v1
[DATE]
2024-12-17 01:32:11+08:00
[CATEGORIES]
cs.CL
cs.LG
LLM-RG4: Flexible and Factual Radiology Report Generation across Diverse Input Contexts
[AUTHORS]
Zhuhao Wang, Yihua Sun, Zihan Li, Xuan Yang, Fang Chen, Hongen Liao
[ABSTRACT]
Drafting radiology reports is a complex task requiring flexibility, where
radiologists tailor content to available information and particular clinical
demands. However, most current radiology report generation (RRG) models are
constrained to a fixed task paradigm, such as predicting the full “finding”
section from a single image, inherently involving a mismatch between inputs and
outputs. The trained models lack the flexibility for diverse inputs and could
generate harmful, input-agnostic hallucinations. To bridge the gap between
current RRG models and the clinical demands in practice, we first develop a
data generation pipeline to create a new MIMIC-RG4 dataset, which considers
four common radiology report drafting scenarios and has perfectly corresponding
input and output. Secondly, we propose a novel large language model (LLM) based
RRG framework, namely LLM-RG4, which utilizes LLM’s flexible
instruction-following capabilities and extensive general knowledge. We further
develop an adaptive token fusion module that offers flexibility to handle
diverse scenarios with different input combinations, while minimizing the
additional computational burden associated with increased input volumes.
Besides, we propose a token-level loss weighting strategy to direct the model’s
attention towards positive and uncertain descriptions. Experimental results
demonstrate that LLM-RG4 achieves state-of-the-art performance in both clinical
efficiency and natural language generation on the MIMIC-RG4 and MIMIC-CXR
datasets. We quantitatively demonstrate that our model has minimal
input-agnostic hallucinations, whereas current open-source models commonly
suffer from this problem.
[LINK]
http://arxiv.org/abs/2412.12001v1
[DATE]
2024-12-17 01:29:51+08:00
[CATEGORIES]
cs.CL
ExecRepoBench: Multi-level Executable Code Completion Evaluation
[AUTHORS]
Jian Yang, Jiajun Zhang, Jiaxi Yang, Ke Jin, Lei Zhang, Qiyao Peng, Ken Deng, Yibo Miao, Tianyu Liu, Zeyu Cui, Binyuan Hui, Junyang Lin
[ABSTRACT]
Code completion has become an essential tool for daily software development.
Existing evaluation benchmarks often employ static methods that do not fully
capture the dynamic nature of real-world coding environments and face
significant challenges, including limited context length, reliance on
superficial evaluation metrics, and potential overfitting to training datasets.
In this work, we introduce a novel framework for enhancing code completion in
software development through the creation of a repository-level benchmark
ExecRepoBench and the instruction corpora Repo-Instruct, aimed at improving the
functionality of open-source large language models (LLMs) in real-world coding
scenarios that involve complex interdependencies across multiple files.
ExecRepoBench includes 1.2K samples from active Python repositories. In addition, we
present a multi-level grammar-based completion methodology conditioned on the
abstract syntax tree to mask code fragments at various logical units (e.g.
statements, expressions, and functions). Then, we fine-tune the open-source LLM
with 7B parameters on Repo-Instruct to produce a strong code completion
baseline model Qwen2.5-Coder-Instruct-C based on the open-source model.
Qwen2.5-Coder-Instruct-C is rigorously evaluated on existing benchmarks,
including MultiPL-E and ExecRepoBench, and consistently outperforms prior
baselines across all programming languages. Qwen2.5-Coder-Instruct-C can be
deployed as a high-performance, local service for programming development
(https://execrepobench.github.io/).
[LINK]
http://arxiv.org/abs/2412.11990v1
[DATE]
2024-12-17 01:14:35+08:00
[CATEGORIES]
cs.CL
SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation
[AUTHORS]
Debarshi Kundu
[LINK]
http://arxiv.org/abs/2412.11988v1
[DATE]
2024-12-17 01:11:48+08:00
[CATEGORIES]
cs.CL
cs.LG
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)
[AUTHORS]
Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, Ping Luo
[ABSTRACT]
In the rapidly advancing field of robotics, dual-arm coordination and complex
object manipulation are essential capabilities for developing advanced
autonomous systems. However, the scarcity of diverse, high-quality
demonstration data and real-world-aligned evaluation benchmarks severely limits
such development. To address this, we introduce RoboTwin, a generative digital
twin framework that uses 3D generative foundation models and large language
models to produce diverse expert datasets and provide a real-world-aligned
evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates
varied digital twins of objects from single 2D images, generating realistic and
interactive scenarios. It also introduces a spatial relation-aware code
generation framework that combines object annotations with large language
models to break down tasks, determine spatial constraints, and generate precise
robotic movement code. Our framework offers a comprehensive benchmark with both
simulated and real-world data, enabling standardized evaluation and better
alignment between simulated training and real-world performance. We validated
our approach using the open-source COBOT Magic Robot platform. Policies
pre-trained on RoboTwin-generated data and fine-tuned with limited real-world
samples improve the success rate by over 70% for single-arm tasks and by over 40%
for dual-arm tasks compared to models trained solely on real-world data. This
significant improvement demonstrates RoboTwin’s potential to enhance the
development and evaluation of dual-arm robotic manipulation systems. Project
Page: https://robotwin-benchmark.github.io/early-version/.
[COMMENTS]
Project page: https://robotwin-benchmark.github.io/early-version/
[LINK]
http://arxiv.org/abs/2409.02920v2
[DATE]
2024-12-17 01:09:58+08:00
[CATEGORIES]
cs.CL
Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
[AUTHORS]
Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, Matteo Negri
[ABSTRACT]
While crowdsourcing is an established solution for facilitating and scaling
the collection of speech data, the involvement of non-experts necessitates
protocols to ensure final data quality. To reduce the costs of these essential
controls, this paper investigates the use of Speech Foundation Models (SFMs) to
automate the validation process, examining for the first time the cost/quality
trade-off in data acquisition. Experiments conducted on French, German, and
Korean data demonstrate that SFM-based validation has the potential to reduce
reliance on human validation, resulting in an estimated cost saving of over
40.0% without degrading final data quality. These findings open new
opportunities for more efficient, cost-effective, and scalable speech data
acquisition.
[COMMENTS]
Accepted at COLING 2025 main conference
[LINK]
http://arxiv.org/abs/2412.11978v1
[DATE]
2024-12-17 00:59:22+08:00
[CATEGORIES]
cs.CL
The Impact of Token Granularity on the Predictive Power of Language Model Surprisal
[AUTHORS]
Byung-Doh Oh, William Schuler
[ABSTRACT]
Word-by-word language model surprisal is often used to model the incremental
processing of human readers, which raises questions about how various choices
in language modeling influence its predictive power. One factor that has been
overlooked in cognitive modeling is the granularity of subword tokens, which
explicitly encodes information about word length and frequency, and ultimately
influences the quality of vector representations that are learned. This paper
presents experiments that manipulate the token granularity and evaluate its
impact on the ability of surprisal to account for processing difficulty of
naturalistic text and garden-path constructions. Experiments with naturalistic
reading times reveal a substantial influence of token granularity on surprisal,
with tokens defined by a vocabulary size of 8,000 resulting in surprisal that
is most predictive. In contrast, on garden-path constructions, language models
trained on coarser-grained tokens generally assigned higher surprisal to
critical regions, suggesting their increased sensitivity to syntax. Taken
together, these results suggest that token granularity plays a large role in the
quality of language model surprisal for cognitive modeling.
[LINK]
http://arxiv.org/abs/2412.11940v1
[DATE]
2024-12-17 00:24:58+08:00
[CATEGORIES]
cs.CL
Precise Length Control in Large Language Models
[AUTHORS]
Bradley Butcher, Michael O’Keefe, James Titchener
[ABSTRACT]
Large Language Models (LLMs) are increasingly used in production systems,
powering applications such as chatbots, summarization, and question answering.
Despite their success, controlling the length of their response remains a
significant challenge, particularly for tasks requiring structured outputs or
specific levels of detail. In this work, we propose a method to adapt
pre-trained decoder-only LLMs for precise control of response length. Our
approach incorporates a secondary length-difference positional encoding (LDPE)
into the input embeddings, which counts down to a user-set response termination
length. Fine-tuning with LDPE allows the model to learn to terminate responses
coherently at the desired length, achieving mean token errors of less than 3
tokens. We also introduce Max New Tokens++, an extension that enables flexible
upper-bound length control, rather than an exact target. Experimental results
on tasks such as question answering and document summarization demonstrate that
our method enables precise length control without compromising response
quality.
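The countdown idea can be sketched as follows. This is a minimal illustration only: it assumes a sinusoidal encoding of the remaining-token count that is added element-wise to the token embeddings, which the abstract does not specify. All function names here are hypothetical.

```python
import math


def countdown_positions(target_len, generated_len):
    """Remaining-token count at each position, clamped at zero."""
    return [max(target_len - i, 0) for i in range(generated_len)]


def sinusoidal_encoding(value, dim):
    """Sinusoidal encoding of a single integer (assumed form, not from the paper)."""
    enc = []
    for k in range(0, dim, 2):
        freq = 1.0 / (10000 ** (k / dim))
        enc.append(math.sin(value * freq))
        if k + 1 < dim:
            enc.append(math.cos(value * freq))
    return enc


def add_length_encoding(token_embeddings, target_len):
    """Add the countdown encoding to each token embedding (lists of floats)."""
    dim = len(token_embeddings[0])
    remaining = countdown_positions(target_len, len(token_embeddings))
    return [
        [e + p for e, p in zip(emb, sinusoidal_encoding(r, dim))]
        for emb, r in zip(token_embeddings, remaining)
    ]
```

Because the encoding depends on the tokens remaining rather than the tokens produced, the model sees the same signal near the end of a response regardless of the requested length.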
[LINK]
http://arxiv.org/abs/2412.11937v1
[DATE]
2024-12-17 00:22:27+08:00
[CATEGORIES]
cs.CL
Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent
[AUTHORS]
Janghoon Ock, Tirtha Vinchurkar, Yayati Jadhav, Amir Barati Farimani
[ABSTRACT]
Adsorption energy is a key reactivity descriptor in catalysis, enabling
efficient screening for optimal catalysts. However, determining adsorption
energy typically requires evaluating numerous adsorbate-catalyst
configurations. Current algorithmic approaches rely on exhaustive enumeration
of adsorption sites and configurations, which makes the process computationally
intensive and does not inherently guarantee the identification of the global
minimum energy. In this work, we introduce Adsorb-Agent, a Large Language Model
(LLM) agent designed to efficiently identify system-specific stable adsorption
configurations corresponding to the global minimum adsorption energy.
Adsorb-Agent leverages its built-in knowledge and emergent reasoning
capabilities to strategically explore adsorption configurations likely to hold
adsorption energy. By reducing the reliance on exhaustive sampling, it
significantly decreases the number of initial configurations required while
improving the accuracy of adsorption energy predictions. We evaluate
Adsorb-Agent’s performance across twenty representative systems encompassing a
range of complexities. The Adsorb-Agent successfully identifies comparable
adsorption energies for 83.7% of the systems and achieves lower energies,
closer to the actual global minimum, for 35% of the systems, while requiring
significantly fewer initial configurations than conventional methods. Its
capability is particularly evident in complex systems, where it identifies
lower adsorption energies for 46.7% of systems involving intermetallic surfaces
and 66.7% of systems with large adsorbate molecules. These results demonstrate
the potential of Adsorb-Agent to accelerate catalyst discovery by reducing
computational costs and improving the reliability of adsorption energy
predictions.
[LINK]
http://arxiv.org/abs/2410.16658v2
[DATE]
2024-12-17 00:21:00+08:00
[CATEGORIES]
cs.CL
Analysing Zero-Shot Readability-Controlled Sentence Simplification
[AUTHORS]
Abdullah Barayan, Jose Camacho-Collados, Fernando Alva-Manchego
[COMMENTS]
Accepted at COLING 2025
[LINK]
http://arxiv.org/abs/2409.20246v2
[DATE]
2024-12-17 00:15:11+08:00
[CATEGORIES]
cs.CL
FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language Models
[AUTHORS]
Tao Fan, Guoqiang Ma, Yan Kang, Hanlin Gu, Yuanfeng Song, Lixin Fan, Kai Chen, Qiang Yang
[ABSTRACT]
Recent research in federated large language models (LLMs) has primarily
focused on enabling clients to fine-tune their locally deployed homogeneous
LLMs collaboratively or on transferring knowledge from server-based LLMs to
small language models (SLMs) at downstream clients. However, a significant gap
remains in the simultaneous mutual enhancement of both the server’s LLM and
clients’ SLMs. To bridge this gap, we propose FedMKT, a parameter-efficient
federated mutual knowledge transfer framework for large and small language
models. This framework is designed to adaptively transfer knowledge from the
server’s LLM to clients’ SLMs while concurrently enriching the LLM with
clients’ unique domain insights. We facilitate token alignment using minimum
edit distance (MinED) and then selective mutual knowledge transfer between
client-side SLMs and a server-side LLM, aiming to collectively enhance their
performance. Through extensive experiments across three distinct scenarios, we
evaluate the effectiveness of FedMKT using various public LLMs and SLMs on a
range of NLP text generation tasks. Empirical results demonstrate that FedMKT
simultaneously boosts the performance of both LLMs and SLMs.
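The token alignment step mentioned above can be illustrated with a small sketch. It assumes that alignment maps each token of one vocabulary to the token of the other vocabulary with the smallest Levenshtein distance. The helper names and toy vocabularies are hypothetical, and FedMKT's actual procedure may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def align_tokens(source_vocab, target_vocab):
    """Map each source token to its closest target token (first one wins ties)."""
    return {
        tok: min(target_vocab, key=lambda t: edit_distance(tok, t))
        for tok in source_vocab
    }


# Toy vocabularies, for illustration only.
mapping = align_tokens(["_get", "utilize"], ["get", "utilise", "put"])
```

This brute-force search is quadratic in the vocabulary sizes, which is acceptable for a sketch but would likely need indexing or caching for real tokenizers.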
[LINK]
http://arxiv.org/abs/2406.02224v4
[DATE]
2024-12-17 00:13:14+08:00
[CATEGORIES]
cs.CL
IRR: Image Review Ranking Framework for Evaluating Vision-Language Models
[AUTHORS]
Kazuki Hayashi, Kazuma Onishi, Toma Suzuki, Yusuke Ide, Seiji Gobara, Shigeki Saito, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
[ABSTRACT]
Large-scale Vision-Language Models (LVLMs) process both images and text,
excelling in multimodal tasks such as image captioning and description
generation. However, while these models excel at generating factual content,
their ability to generate and evaluate texts reflecting perspectives on the
same image, depending on the context, has not been sufficiently explored. To
address this, we propose IRR: Image Review Rank, a novel evaluation framework
designed to assess critic review texts from multiple perspectives. IRR
evaluates LVLMs by measuring how closely their judgments align with human
interpretations. We validate it using a dataset of images from 15 categories,
each with five critic review texts and annotated rankings in both English and
Japanese, totaling over 2,000 data instances. The datasets are available at
https://hf.co/datasets/naist-nlp/Wiki-ImageReview1.0. Our results indicate
that, although LVLMs exhibited consistent performance across languages, their
correlation with human annotations was insufficient, highlighting the need for
further advancements. These findings highlight the limitations of current
evaluation methods and the need for approaches that better capture human
reasoning in Vision & Language tasks.
[COMMENTS]
18 pages, accepted at COLING 2025
[LINK]
http://arxiv.org/abs/2402.12121v2
[DATE]
2024-12-17 00:09:47+08:00
[CATEGORIES]
cs.CL
PICLe: Pseudo-Annotations for In-Context Learning in Low-Resource Named Entity Detection
[AUTHORS]
Sepideh Mamooler, Syrielle Montariol, Alexander Mathis, Antoine Bosselut
[ABSTRACT]
In-context learning (ICL) enables Large Language Models (LLMs) to perform
tasks using few demonstrations, facilitating task adaptation when labeled
examples are hard to obtain. However, ICL is sensitive to the choice of
demonstrations, and it remains unclear which demonstration attributes enable
in-context generalization. In this work, we conduct a perturbation study of
in-context demonstrations for low-resource Named Entity Detection (NED). Our
surprising finding is that in-context demonstrations with partially correct
annotated entity mentions can be as effective for task transfer as fully
correct demonstrations. Based on our findings, we propose Pseudo-annotated
In-Context Learning (PICLe), a framework for in-context learning with noisy,
pseudo-annotated demonstrations. PICLe leverages LLMs to annotate many
demonstrations in a zero-shot first pass. We then cluster these synthetic
demonstrations, sample specific sets of in-context demonstrations from each
cluster, and predict entity mentions using each set independently. Finally, we
use self-verification to select the final set of entity mentions. We evaluate
PICLe on five biomedical NED datasets and show that, with zero human
annotation, PICLe outperforms ICL in low-resource settings where limited gold
examples can be used as in-context demonstrations.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2412.11923v1
[DATE]
2024-12-17 00:09:35+08:00
[CATEGORIES]
cs.CL
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
[AUTHORS]
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry
[ABSTRACT]
We introduce MLE-bench, a benchmark for measuring how well AI agents perform
at machine learning engineering. To this end, we curate 75 ML
engineering-related competitions from Kaggle, creating a diverse set of
challenging tasks that test real-world ML engineering skills such as training
models, preparing datasets, and running experiments. We establish human
baselines for each competition using Kaggle’s publicly available leaderboards.
We use open-source agent scaffolds to evaluate several frontier language models
on our benchmark, finding that the best-performing setup, OpenAI’s o1-preview
with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in
16.9% of competitions. In addition to our main results, we investigate various
forms of resource scaling for AI agents and the impact of contamination from
pre-training. We open-source our benchmark code (github.com/openai/mle-bench/)
to facilitate future research in understanding the ML engineering capabilities
of AI agents.
[COMMENTS]
10 pages, 17 pages appendix. Equal contribution by first seven
authors, authors randomized. Corrected footnote 4. Added citation
[LINK]
http://arxiv.org/abs/2410.07095v4
[DATE]
2024-12-17 00:05:09+08:00
[CATEGORIES]
cs.CL
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
[AUTHORS]
Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou
[ABSTRACT]
Large language models (LLMs) exhibit remarkable generative capabilities but
often suffer from hallucinations. Retrieval-augmented generation (RAG) offers
an effective solution by incorporating external knowledge, but existing methods
still face several limitations: additional deployment costs of separate
retrievers, redundant input tokens from retrieved text chunks, and the lack of
joint optimization of retrieval and generation. To address these issues, we
propose RetroLLM, a unified framework that integrates retrieval and
generation into a single, cohesive process, enabling LLMs to directly generate
fine-grained evidence from the corpus with constrained decoding. Moreover, to
mitigate false pruning in the process of constrained evidence generation, we
introduce (1) hierarchical FM-Index constraints, which generate
corpus-constrained clues to identify a subset of relevant documents before
evidence generation, reducing irrelevant decoding space; and (2) a
forward-looking constrained decoding strategy, which considers the relevance of
future sequences to improve evidence accuracy. Extensive experiments on five
open-domain QA datasets demonstrate RetroLLM’s superior performance across both
in-domain and out-of-domain tasks. The code is available at
https://github.com/sunnynexus/RetroLLM.
[LINK]
http://arxiv.org/abs/2412.11919v1
[DATE]
2024-12-17 00:03:25+08:00
[CATEGORIES]
cs.CL
Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning
[AUTHORS]
Ruizhe Shi, Yuyao Liu, Yanjie Ze, Simon S. Du, Huazhe Xu
[ABSTRACT]
Offline reinforcement learning (RL) aims to find a near-optimal policy using
pre-collected datasets. In real-world scenarios, data collection could be
costly and risky; therefore, offline RL becomes particularly challenging when
the in-domain data is limited. Given recent advances in Large Language Models
(LLMs) and their few-shot learning prowess, this paper introduces
Language Models for Motion Control (LaMo), a
general framework based on Decision Transformers to effectively use pre-trained
Language Models (LMs) for offline RL. Our framework highlights four crucial
components: (1) Initializing Decision Transformers with sequentially
pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to
full-weight fine-tuning, to combine the pre-trained knowledge from LMs and
in-domain knowledge effectively, (3) using the non-linear MLP transformation
instead of linear projections, to generate embeddings, and (4) integrating an
auxiliary language prediction loss during fine-tuning to stabilize the LMs and
retain their original language abilities. Empirical results indicate
LaMo achieves excellent performance in sparse-reward tasks and
closes the gap between value-based offline RL methods and decision transformers
in dense-reward tasks. In particular, our method demonstrates superior
performance in scenarios with limited data samples.
[COMMENTS]
Format adjustment
[LINK]
http://arxiv.org/abs/2310.20587v5
[DATE]
2024-12-17 23:59:44+08:00
[CATEGORIES]
cs.LG
Causal Invariance Learning via Efficient Optimization of a Nonconvex Objective
[AUTHORS]
Zhenyu Wang, Yifan Hu, Peter Bühlmann, Zijian Guo
[ABSTRACT]
Data from multiple environments offer valuable opportunities to uncover
causal relationships among variables. Leveraging the assumption that the causal
outcome model remains invariant across heterogeneous environments,
state-of-the-art methods attempt to identify causal outcome models by learning
invariant prediction models and rely on exhaustive searches over all
(exponentially many) covariate subsets. These approaches present two major
challenges: 1) determining the conditions under which the invariant prediction
model aligns with the causal outcome model, and 2) devising computationally
efficient causal discovery algorithms that scale polynomially, instead of
exponentially, with the number of covariates. To address both challenges, we
focus on the additive intervention regime and propose nearly necessary and
sufficient conditions for ensuring that the invariant prediction model matches
the causal outcome model. Exploiting the essentially necessary identifiability
conditions, we introduce Negative Weight Distributionally Robust Optimization
(NegDRO), a nonconvex continuous minimax optimization whose global optimizer
recovers the causal outcome model. Unlike standard group DRO problems that
maximize over the simplex, NegDRO allows negative weights on environment
losses, which break the convexity. Despite its nonconvexity, we demonstrate
that a standard gradient method converges to the causal outcome model, and we
establish the convergence rate with respect to the sample size and the number
of iterations. Our algorithm avoids exhaustive search, making it scalable
especially when the number of covariates is large. The numerical results
further validate the efficiency of the proposed method.
[LINK]
http://arxiv.org/abs/2412.11850v2
[DATE]
2024-12-17 23:44:30+08:00
[CATEGORIES]
cs.LG
Relational Neurosymbolic Markov Models
[AUTHORS]
Lennert De Smet, Gabriele Venturato, Luc De Raedt, Giuseppe Marra
[ABSTRACT]
Sequential problems are ubiquitous in AI, such as in reinforcement learning
or natural language processing. State-of-the-art deep sequential models, like
transformers, excel in these settings but fail to guarantee the satisfaction of
constraints necessary for trustworthy deployment. In contrast, neurosymbolic AI
(NeSy) provides a sound formalism to enforce constraints in deep probabilistic
models but scales exponentially on sequential problems. To overcome these
limitations, we introduce relational neurosymbolic Markov models (NeSy-MMs), a
new class of end-to-end differentiable sequential models that integrate and
provably satisfy relational logical constraints. We propose a strategy for
inference and learning that scales on sequential settings, and that combines
approximate Bayesian inference, automated reasoning, and gradient estimation.
Our experiments show that NeSy-MMs can solve problems beyond the current
state-of-the-art in neurosymbolic AI and still provide strong guarantees with
respect to desired properties. Moreover, we show that our models are more
interpretable and that constraints can be adapted at test time to
out-of-distribution scenarios.
[COMMENTS]
Accepted at AAAI 2025
[LINK]
http://arxiv.org/abs/2412.13023v1
[DATE]
2024-12-17 23:41:51+08:00
[CATEGORIES]
cs.LG
Deep Learning Based Superconductivity: Prediction and Experimental Tests
[AUTHORS]
Daniel Kaplan, Adam Zhang, Joanna Blawat, Rongying Jin, Robert J. Cava, Viktor Oudovenko, Gabriel Kotliar, Anirvan M. Sengupta, Weiwei Xie
[ABSTRACT]
The discovery of novel superconducting materials is a longstanding challenge
in materials science, with a wealth of potential for applications in energy,
transportation, and computing. Recent advances in artificial intelligence (AI)
have enabled expediting the search for new materials by efficiently utilizing
vast materials databases. In this study, we developed an approach based on deep
learning (DL) to predict new superconducting materials. We have synthesized a
compound derived from our DL network and confirmed its superconducting
properties in agreement with our prediction. Our approach is also compared to
previous work based on random forests (RFs). In particular, RFs require
knowledge of the chemical properties of the compound, while our neural net
inputs depend solely on the chemical composition. With the help of hints from
our network, we discover a new ternary compound
$\textrm{Mo}_{20}\textrm{Re}_{6}\textrm{Si}_{4}$, which becomes superconducting
below 5.4 K. We further discuss the existing limitations and challenges
associated with using AI for such predictions, along with potential future
research directions.
[COMMENTS]
14 pages + 2 appendices + references. EPJ submission
[LINK]
http://arxiv.org/abs/2412.13012v1
[DATE]
2024-12-17 23:33:48+08:00
[CATEGORIES]
cs.LG
AnyAttack: Targeted Adversarial Attacks on Vision-Language Models toward Any Images
[AUTHORS]
Jiaming Zhang, Junhong Ye, Xingjun Ma, Yige Li, Yunfan Yang, Jitao Sang, Dit-Yan Yeung
[ABSTRACT]
Due to their multimodal capabilities, Vision-Language Models (VLMs) have
found numerous impactful applications in real-world scenarios. However, recent
studies have revealed that VLMs are vulnerable to image-based adversarial
attacks, particularly targeted adversarial images that manipulate the model to
generate harmful content specified by the adversary. Current attack methods
rely on predefined target labels to create targeted adversarial attacks, which
limits their scalability and applicability for large-scale robustness
evaluations. In this paper, we propose AnyAttack, a self-supervised framework
that generates targeted adversarial images for VLMs without label supervision,
allowing any image to serve as a target for the attack. Our framework employs
the pre-training and fine-tuning paradigm, with the adversarial noise generator
pre-trained on the large-scale LAION-400M dataset. This large-scale
pre-training endows our method with powerful transferability across a wide
range of VLMs. Extensive experiments on five mainstream open-source VLMs (CLIP,
BLIP, BLIP2, InstructBLIP, and MiniGPT-4) across three multimodal tasks
(image-text retrieval, multimodal classification, and image captioning)
demonstrate the effectiveness of our attack. Additionally, we successfully
transfer AnyAttack to multiple commercial VLMs, including Google Gemini, Claude
Sonnet, Microsoft Copilot and OpenAI GPT. These results reveal an unprecedented
risk to VLMs, highlighting the need for effective countermeasures.
[LINK]
http://arxiv.org/abs/2410.05346v2
[DATE]
2024-12-17 23:32:04+08:00
[CATEGORIES]
cs.LG
Can Generative Models Improve Self-Supervised Representation Learning?
[AUTHORS]
Sana Ayromlou, Vahid Reza Khazaie, Fereshteh Forghani, Arash Afkanpour
[ABSTRACT]
The rapid advancement in self-supervised representation learning has
highlighted its potential to leverage unlabeled data for learning rich visual
representations. However, the existing techniques, particularly those employing
different augmentations of the same image, often rely on a limited set of
simple transformations that cannot fully capture variations in the real world.
This constrains the diversity and quality of samples, which leads to
sub-optimal representations. In this paper, we introduce a framework that
enriches the self-supervised learning (SSL) paradigm by utilizing generative
models to produce semantically consistent image augmentations. By directly
conditioning generative models on a source image, our method enables the
generation of diverse augmentations while maintaining the semantics of the
source image, thus offering a richer set of data for SSL. Our extensive
experimental results on various joint-embedding SSL techniques demonstrate that
our framework significantly enhances the quality of learned visual
representations by up to 10% Top-1 accuracy in downstream tasks. This research
demonstrates that incorporating generative models into the joint-embedding SSL
workflow opens new avenues for exploring the potential of synthetic data. This
development paves the way for more robust and versatile representation learning
techniques.
[COMMENTS]
To be published in AAAI 2025
[LINK]
http://arxiv.org/abs/2403.05966v3
[DATE]
2024-12-17 23:31:49+08:00
[CATEGORIES]
cs.LG
Boosting Test Performance with Importance Sampling–a Subpopulation Perspective
[AUTHORS]
Hongyu Shen, Zhizhen Zhao
[ABSTRACT]
Although empirical risk minimization (ERM) is widely applied in the machine
learning community, its performance is limited on data with spurious
correlation or subpopulation that is introduced by hidden attributes. Existing
literature proposed techniques to maximize group-balanced or worst-group
accuracy when such correlation is present, yet at the cost of lower average
accuracy. In addition, many existing works conduct surveys on different
subpopulation methods without revealing the inherent connection between these
methods, which could hinder technological advancement in this area. In this
paper, we identify importance sampling as a simple yet powerful tool for solving
the subpopulation problem. On the theory side, we provide a new systematic
formulation of the subpopulation problem and explicitly identify the
assumptions that are not clearly stated in the existing works. This helps to
uncover the cause of the dropped average accuracy. We provide the first
theoretical discussion on the connections of existing methods, revealing the
core components that make them different. On the application side, we
demonstrate a single estimator is enough to solve the subpopulation problem. In
particular, we introduce the estimator in both attribute-known and -unknown
scenarios in the subpopulation setup, offering flexibility in practical use
cases. Empirically, we achieve state-of-the-art performance on commonly
used benchmark datasets.
[COMMENTS]
16 pages, 1 figure, 2 tables
[LINK]
http://arxiv.org/abs/2412.13003v1
[DATE]
2024-12-17 23:25:24+08:00
[CATEGORIES]
cs.LG
Reranking individuals: The effect of fair classification within-groups
[AUTHORS]
Sofie Goethals, Marco Favier, Toon Calders
[ABSTRACT]
Artificial Intelligence (AI) finds widespread application across various
domains, but it sparks concerns about fairness in its deployment. The
prevailing discourse in classification often emphasizes outcome-based metrics
comparing sensitive subgroups without a nuanced consideration of the
differential impacts within subgroups. Bias mitigation techniques not only
affect the ranking of pairs of instances across sensitive groups, but often
also significantly affect the ranking of instances within these groups. Such
changes are hard to explain and raise concerns regarding the validity of the
intervention. Unfortunately, these effects remain under the radar in the
accuracy-fairness evaluation framework that is usually applied. Additionally,
we illustrate the effect of several popular bias mitigation methods, and how
their output often does not reflect real-world scenarios.
[LINK]
http://arxiv.org/abs/2401.13391v3
[DATE]
2024-12-17 23:12:02+08:00
[CATEGORIES]
cs.LG
Stochastic interior-point methods for smooth conic optimization with applications
[AUTHORS]
Chuan He, Zhanwang Deng
[ABSTRACT]
Conic optimization plays a crucial role in many machine learning (ML)
problems. However, practical algorithms for conic constrained ML problems with
large datasets are often limited to specific use cases, as stochastic
algorithms for general conic optimization remain underdeveloped. To fill this
gap, we introduce a stochastic interior-point method (SIPM) framework for
general conic optimization, along with four novel SIPM variants leveraging
distinct stochastic gradient estimators. Under mild assumptions, we establish
the global convergence rates of our proposed SIPMs, which, up to a logarithmic
factor, match the best-known rates in stochastic unconstrained optimization.
Finally, our numerical experiments on robust linear regression, multi-task
relationship learning, and clustering data streams demonstrate the
effectiveness and efficiency of our approach.
[LINK]
http://arxiv.org/abs/2412.12987v1
[DATE]
2024-12-17 23:06:44+08:00
[CATEGORIES]
cs.LG
Cluster-guided Contrastive Class-imbalanced Graph Classification
[AUTHORS]
Wei Ju, Zhengyang Mao, Siyu Yi, Yifang Qin, Yiyang Gu, Zhiping Xiao, Jianhao Shen, Ziyue Qiao, Ming Zhang
[ABSTRACT]
This paper studies the problem of class-imbalanced graph classification,
which aims at effectively classifying the categories of graphs in scenarios
with imbalanced class distribution. Despite the tremendous success of graph
neural networks (GNNs), their modeling ability for imbalanced graph-structured
data is inadequate, which typically leads to predictions biased towards the
majority classes. Besides, existing class-imbalanced learning methods in
vision may overlook the rich graph semantic substructures of the majority
classes and excessively emphasize learning from the minority classes. To tackle
this issue, this paper proposes a simple yet powerful approach called C$^3$GNN
that incorporates the idea of clustering into contrastive learning to enhance
class-imbalanced graph classification. Technically, C$^3$GNN clusters graphs
from each majority class into multiple subclasses, ensuring they have similar
sizes to the minority class, thus alleviating class imbalance. Additionally, it
utilizes the Mixup technique to synthesize new samples and enrich the semantic
information of each subclass, and leverages supervised contrastive learning to
hierarchically learn effective graph representations. In this way, we can not
only sufficiently explore the semantic substructures within the majority class
but also effectively alleviate excessive focus on the minority class. Extensive
experiments on real-world graph benchmark datasets verify the superior
performance of our proposed method.
[COMMENTS]
Accepted by Proceedings of the Thirty-Ninth AAAI Conference on
Artificial Intelligence (AAAI-25)
[LINK]
http://arxiv.org/abs/2412.12984v1
[DATE]
2024-12-17 23:04:54+08:00
[CATEGORIES]
cs.LG
Producing Histopathology Phantom Images using Generative Adversarial Networks to improve Tumor Detection
[AUTHORS]
Vidit Gautam
[LINK]
http://arxiv.org/abs/2205.10691v2
[DATE]
2024-12-17 23:04:46+08:00
[CATEGORIES]
cs.LG
ArchesWeather & ArchesWeatherGen: a deterministic and generative model for efficient ML weather forecasting
[AUTHORS]
Guillaume Couairon, Renu Singh, Anastase Charantonis, Christian Lessig, Claire Monteleoni
[ABSTRACT]
Weather forecasting plays a vital role in today’s society, from agriculture
and logistics to predicting the output of renewable energies, and preparing for
extreme weather events. Deep learning weather forecasting models trained with
the next state prediction objective on ERA5 have shown great success compared
to numerical global circulation models. However, for a wide range of
applications, being able to provide representative samples from the
distribution of possible future weather states is critical. In this paper, we
propose a methodology to leverage deterministic weather models in the design of
probabilistic weather models, leading to improved performance and reduced
computing costs. We first introduce ArchesWeather, a transformer-based
deterministic model that improves upon Pangu-Weather by removing
overrestrictive inductive priors. We then design a probabilistic weather model
called ArchesWeatherGen based on flow matching, a modern variant of
diffusion models, that is trained to project ArchesWeather’s predictions to the
distribution of ERA5 weather states. ArchesWeatherGen is a true stochastic
emulator of ERA5 and surpasses IFS ENS and NeuralGCM on all WeatherBench
headline variables (except for NeuralGCM’s geopotential). Our work also aims to
democratize the use of deterministic and generative machine learning models in
weather forecasting research, with academic computing resources. All models are
trained at 1.5° resolution, with a training budget of $\sim$9 V100 days
for ArchesWeather and $\sim$45 V100 days for ArchesWeatherGen. For inference,
ArchesWeatherGen generates 15-day weather trajectories at a rate of 1 minute
per ensemble member on an A100 GPU card. To make our work fully reproducible,
our code and models are open source, including the complete pipeline for data
preparation, training, and evaluation, at https://github.com/INRIA/geoarches.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2412.12971v1
[DATE]
2024-12-17 22:54:30+08:00
[CATEGORIES]
cs.LG
LiNo: Advancing Recursive Residual Decomposition of Linear and Nonlinear Patterns for Robust Time Series Forecasting
[AUTHORS]
Guoqi Yu, Yaoming Li, Xiaoyu Guo, Dayu Wang, Zirui Liu, Shujun Wang, Tong Yang
[ABSTRACT]
Forecasting models are pivotal in a data-driven world with vast volumes of
time series data that appear as a compound of vast Linear and Nonlinear
patterns. Recent deep time series forecasting models struggle to utilize
seasonal and trend decomposition to separate the entangled components. Such a
strategy only explicitly extracts simple linear patterns like trends, leaving
the other linear modes and vast unexplored nonlinear patterns to the residual.
Their flawed linear and nonlinear feature extraction models and shallow-level
decomposition limit their adaptation to the diverse patterns present in
real-world scenarios. Given this, we innovate Recursive Residual Decomposition
by introducing explicit extraction of both linear and nonlinear patterns. This
deeper-level decomposition framework, which is named LiNo, captures linear
patterns using a Li block which can be a moving average kernel, and models
nonlinear patterns using a No block which can be a Transformer encoder. The
extraction of these two patterns is performed alternately and recursively. To
achieve the full potential of LiNo, we extend the current simple linear
pattern extractor into a general learnable autoregressive model, and design a
novel No block that can handle all essential nonlinear patterns. Remarkably,
the proposed LiNo achieves state-of-the-art performance on thirteen real-world benchmarks
under univariate and multivariate forecasting scenarios. Experiments show that
current forecasting models can deliver more robust and precise results through
this advanced Recursive Residual Decomposition. We hope this work could offer
insight into designing more effective forecasting models. Code is available at
https://github.com/Levi-Ackman/LiNo.
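The alternating extraction described above can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the linear block is a plain moving average (one of the options the abstract mentions), and the nonlinear block is a hypothetical `tanh` stand-in for the Transformer encoder. All names and default parameters are assumptions.

```python
import math


def moving_average(series, window):
    """Centered moving average with the window truncated at the edges."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out


def nonlinear_stub(series):
    """Hypothetical placeholder for a learned nonlinear block."""
    return [math.tanh(x) for x in series]


def recursive_residual_decomposition(series, levels=2, window=3):
    """Alternately peel off linear and nonlinear parts of the running residual."""
    residual = list(series)
    components = []
    for _ in range(levels):
        linear = moving_average(residual, window)
        residual = [r - l for r, l in zip(residual, linear)]
        nonlinear = nonlinear_stub(residual)
        residual = [r - n for r, n in zip(residual, nonlinear)]
        components.append((linear, nonlinear))
    return components, residual
```

By construction, summing every extracted component and the final residual recovers the input series, so each level only redistributes signal between components.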
[LINK]
http://arxiv.org/abs/2410.17159v2
[DATE]
2024-12-17 22:53:29+08:00
[CATEGORIES]
cs.LG
Coordinating Planning and Tracking in Layered Control Policies via Actor-Critic Learning
[AUTHORS]
Fengjun Yang, Nikolai Matni
[ABSTRACT]
We propose a reinforcement learning (RL)-based algorithm to jointly train (1)
a trajectory planner and (2) a tracking controller in a layered control
architecture. Our algorithm arises naturally from a rewrite of the underlying
optimal control problem that lends itself to an actor-critic learning approach.
By explicitly learning a dual network to coordinate the interaction
between the planning and tracking layers, we demonstrate the ability to achieve
an effective consensus between the two components, leading to an interpretable
policy. We theoretically prove that our algorithm converges to the optimal dual
network in the Linear Quadratic Regulator (LQR) setting and empirically
validate its applicability to nonlinear systems through simulation experiments
on a unicycle model.
[LINK]
http://arxiv.org/abs/2408.01639v2
[DATE]
2024-12-17 22:41:50+08:00
[CATEGORIES]
cs.LG
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
[AUTHORS]
Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov
[ABSTRACT]
Diffusion Policies have become widely used in Imitation Learning, offering
several appealing properties, such as generating multimodal and discontinuous
behavior. As models are becoming larger to capture more complex capabilities,
their computational demands increase, as shown by recent scaling laws.
Therefore, continuing with the current architectures will present a
computational roadblock. To address this gap, we propose Mixture-of-Denoising
Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current
state-of-the-art Transformer-based Diffusion Policies while enabling
parameter-efficient scaling through sparse experts and noise-conditioned
routing, reducing active parameters by 40% and inference costs by 90% via
expert caching. Our architecture combines this efficient scaling with a
noise-conditioned self-attention mechanism, enabling more effective denoising
across different noise levels. MoDE achieves state-of-the-art performance on
134 tasks in four established imitation learning benchmarks (CALVIN and
LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01
on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and
Transformer Diffusion Policies by an average of 57% across 4 benchmarks, while
using 90% fewer FLOPs and fewer active parameters compared to default Diffusion
Transformer architectures. Furthermore, we conduct comprehensive ablations on
MoDE’s components, providing insights for designing efficient and scalable
Transformer architectures for Diffusion Policies. Code and demonstrations are
available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.
[LINK]
http://arxiv.org/abs/2412.12953v1
[DATE]
2024-12-17 22:34:51+08:00
[CATEGORIES]
cs.LG
FineGates: LLMs Finetuning with Compression using Stochastic Gates
[AUTHORS]
Jonathan Svirsky, Yehonathan Refael, Ofir Lindenbaum
[ABSTRACT]
Large Language Models (LLMs), with billions of parameters, present
significant challenges for full finetuning due to high computational
demands, memory requirements, and impracticality in many real-world
applications. When faced with limited computational resources or small
datasets, updating all model parameters can often result in overfitting. To
address this, lightweight finetuning techniques have been proposed, like
learning low-rank adapter layers. These methods aim to train only a few
additional parameters combined with the base model, which remains frozen,
reducing resource usage and mitigating overfitting risks. In this work, we
propose an adaptor model based on stochastic gates that simultaneously sparsify
the frozen base model with task-specific adaptation. Our method comes with a
small number of trainable parameters and allows us to speed up the base model
inference with competitive accuracy. We also evaluate variants that
equip it with additional low-rank parameters and compare it to several
recent baselines. Our results show that the proposed method improves the
finetuned model accuracy relative to several baselines and allows the
removal of up to 20-40% without significant accuracy loss.
[LINK]
http://arxiv.org/abs/2412.12951v1
[DATE]
2024-12-17 22:33:05+08:00
[CATEGORIES]
cs.LG
Content-aware Balanced Spectrum Encoding in Masked Modeling for Time Series Classification
[AUTHORS]
Yudong Han, Haocong Wang, Yupeng Hu, Yongshun Gong, Xuemeng Song, Weili Guan
[ABSTRACT]
Due to their superior ability to model global dependencies, transformers and
their variants have become the primary choice in Masked Time-series Modeling (MTM)
for the time-series classification task. In this paper, we show experimentally
that existing transformer-based MTM methods encounter two
under-explored issues when dealing with time series data: (1) they encode
features by performing long-dependency ensemble averaging, which easily results
in rank collapse and feature homogenization as the layers go deeper; (2) they
exhibit distinct priorities in fitting different frequency components contained
in the time series, inevitably leading to spectrum energy imbalance in the encoded
features. To tackle these issues, we propose an auxiliary content-aware balanced
decoder (CBD) to optimize the encoding quality in the spectrum space within
the masked modeling scheme. Specifically, the CBD iterates on a series of
fundamental blocks, and thanks to two tailored units, each block can
progressively refine the masked representation by adjusting the interaction
pattern based on local content variations of the time series and learning to
recalibrate the energy distribution across different frequency components.
Moreover, a dual-constraint loss is devised to enhance the mutual optimization
of the vanilla decoder and our CBD. Extensive experimental results on ten
time-series classification datasets show that our method nearly surpasses a
range of baselines. Meanwhile, a series of explanatory results are showcased to
sufficiently demystify the behaviors of our method.
[COMMENTS]
13 pages, Accepted by AAAI 25
[LINK]
http://arxiv.org/abs/2412.13232v1
[DATE]
2024-12-17 22:12:20+08:00
[CATEGORIES]
cs.LG
“Clustering and Conquer” Procedures for Parallel Large-Scale Ranking and Selection
[AUTHORS]
Zishi Zhang, Yijie Peng
[ABSTRACT]
This work breaks the sample efficiency bottleneck in parallel large-scale
ranking and selection (R&S) problems by leveraging correlation information. We
modify the commonly used “divide and conquer” framework in parallel computing
by adding a correlation-based clustering step, transforming it into “clustering
and conquer”. This seemingly simple modification can achieve an
$\mathcal{O}(p)$ sample complexity reduction rate, which represents the maximum
attainable reduction for the class of sample-optimal R&S methods. Our approach
enjoys two key advantages: 1) it does not require highly accurate correlation
estimation or precise clustering, and 2) it allows for seamless integration
with various existing R&S methods, while achieving optimal sample complexity.
Theoretically, we develop a novel gradient analysis framework to analyze sample
efficiency and guide the design of large-scale R&S procedures. Building upon
this framework, we propose a gradient-based budget allocation policy. We also
introduce a new clustering algorithm, selection policy, and precision criterion
tailored for large-scale scenarios. Finally, in large-scale AI applications
such as neural architecture search, our methods demonstrate superior
performance.
[LINK]
http://arxiv.org/abs/2402.02196v3
[DATE]
2024-12-17 22:02:24+08:00
[CATEGORIES]
cs.LG
Clusterability test for categorical data
[AUTHORS]
Lianyu Hu, Junjie Dong, Mudi Jiang, Yan Liu, Zengyou He
[ABSTRACT]
The objective of clusterability evaluation is to check whether a clustering
structure exists within the data set. As a crucial yet often-overlooked issue
in cluster analysis, it is essential to conduct such a test before applying any
clustering algorithm. If a data set is unclusterable, any subsequent clustering
analysis would not yield valid results. Despite its importance, the majority of
existing studies focus on numerical data, leaving the clusterability evaluation
issue for categorical data as an open problem. Here we present TestCat, a
testing-based approach to assess the clusterability of categorical data in
terms of an analytical $p$-value. The key idea underlying TestCat is that
clusterable categorical data possess many strongly associated attribute pairs
and hence the sum of chi-squared statistics of all attribute pairs is employed
as the test statistic for $p$-value calculation. We apply our method to a set
of benchmark categorical data sets, showing that TestCat outperforms those
solutions based on existing clusterability evaluation methods for numeric data.
To the best of our knowledge, our work provides the first way to effectively
recognize the clusterability of categorical data in a statistically sound
manner.
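
To make the test statistic concrete, the sketch below sums Pearson chi-squared statistics over all attribute pairs of a categorical data matrix. It covers only the statistic named in the abstract; the analytical $p$-value derivation is not reproduced here, and the function names and the integer-coded input layout (rows are objects, columns are attributes) are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def chi_squared_statistic(a, b):
    """Pearson chi-squared statistic for two 1-D categorical columns."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    # Build the contingency table of observed co-occurrence counts.
    observed = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(observed, (a_idx, b_idx), 1)
    # Expected counts under independence: row total * column total / n.
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())


def pairwise_chi_squared_sum(data):
    """Sum the chi-squared statistics over all pairs of attribute columns."""
    data = np.asarray(data)
    n_attributes = data.shape[1]
    return sum(
        chi_squared_statistic(data[:, i], data[:, j])
        for i, j in combinations(range(n_attributes), 2)
    )
```

Strongly associated attribute pairs contribute large terms to the sum, while independent pairs contribute values near zero, which matches the intuition stated in the abstract.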
[COMMENTS]
28 pages, 12 appendix pages, 17 figures
[LINK]
http://arxiv.org/abs/2307.07346v2
[DATE]
2024-12-17 21:57:19+08:00
[CATEGORIES]
cs.LG
BOIDS: High-dimensional Bayesian Optimization via Incumbent-guided Direction Lines and Subspace Embeddings
[AUTHORS]
Lam Ngo, Huong Ha, Jeffrey Chan, Hongyu Zhang
[ABSTRACT]
When it comes to expensive black-box optimization problems, Bayesian
Optimization (BO) is a well-known and powerful solution. Many real-world
applications involve a large number of dimensions, hence scaling BO to high
dimensions is of great interest. However, state-of-the-art high-dimensional BO
methods still suffer from the curse of dimensionality, highlighting the need
for further improvements. In this work, we introduce BOIDS, a novel
high-dimensional BO algorithm that guides optimization by a sequence of
one-dimensional direction lines using a novel tailored line-based optimization
procedure. To improve the efficiency, we also propose an adaptive selection
technique to identify the optimal lines for each round of line-based
optimization. Additionally, we incorporate a subspace embedding technique for
better scaling to high-dimensional spaces. We further provide theoretical
analysis of our proposed method to analyze its convergence property. Our
extensive experimental results show that BOIDS outperforms state-of-the-art
baselines on various synthetic and real-world benchmark problems.
[COMMENTS]
Published at AAAI Conference on Artificial Intelligence, 2025
[LINK]
http://arxiv.org/abs/2412.12918v1
[DATE]
2024-12-17 21:51:24+08:00
[CATEGORIES]
cs.LG
Adaptive Epsilon Adversarial Training for Robust Gravitational Wave Parameter Estimation Using Normalizing Flows
[AUTHORS]
Yiqian Yang, Xihua Zhu, Fan Zhang
[ABSTRACT]
Adversarial training with Normalizing Flow (NF) models is an emerging
research area aimed at improving model robustness through adversarial samples.
In this study, we focus on applying adversarial training to NF models for
gravitational wave parameter estimation. We propose an adaptive epsilon method
for Fast Gradient Sign Method (FGSM) adversarial training, which dynamically
adjusts perturbation strengths based on gradient magnitudes using logarithmic
scaling. Our hybrid architecture, combining ResNet and Inverse Autoregressive
Flow, reduces the Negative Log Likelihood (NLL) loss by 47% under FGSM attacks
compared to the baseline model, while maintaining an NLL of 4.2 on clean data
(only 5% higher than the baseline). For perturbation strengths between 0.01
and 0.1, our model achieves an average NLL of 5.8, outperforming both
fixed-epsilon (NLL: 6.7) and progressive-epsilon (NLL: 7.2) methods. Under
stronger Projected Gradient Descent attacks with perturbation strength of 0.05,
our model maintains an NLL of 6.4, demonstrating superior robustness while
avoiding catastrophic overfitting.
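
The abstract does not give the exact adaptive rule, so the following is only a guess at what a gradient-magnitude-based, logarithmically scaled epsilon could look like. The formula `base_eps * log1p(mean |grad|)`, the clipping range, and both function names are assumptions, not the authors' method; only the FGSM step itself (input plus epsilon times the sign of the gradient) is standard.

```python
import numpy as np


def adaptive_epsilon(grad, base_eps=0.05, eps_min=0.01, eps_max=0.1):
    """Hypothetical rule: scale epsilon by the log of the mean gradient magnitude."""
    scale = np.log1p(np.mean(np.abs(grad)))
    # Clip so the perturbation strength stays inside a fixed range.
    return float(np.clip(base_eps * scale, eps_min, eps_max))


def fgsm_perturb(x, grad, eps):
    """Standard FGSM step: move each input coordinate by eps along sign(grad)."""
    return np.asarray(x, dtype=float) + eps * np.sign(grad)
```

Logarithmic scaling grows slowly with the gradient magnitude, so very large gradients do not translate into unbounded perturbations; the clipping range here simply mirrors the 0.01 to 0.1 interval mentioned in the abstract.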
[COMMENTS]
Due to new experimental results to add to the paper, this version no
longer accurately reflects the current state of our research. Therefore, we
are withdrawing the paper while further experiments are conducted. We will
submit a new version in the future. We apologize for any inconvenience this
may cause
[LINK]
http://arxiv.org/abs/2412.07559v2
[DATE]
2024-12-17 21:43:16+08:00
[CATEGORIES]
cs.LG
C2F-TP: A Coarse-to-Fine Denoising Framework for Uncertainty-Aware Trajectory Prediction
[AUTHORS]
Zichen Wang, Hao Miao, Senzhang Wang, Renzhi Wang, Jianxin Wang, Jian Zhang
[ABSTRACT]
Accurately predicting the trajectory of vehicles is critically important for
ensuring safety and reliability in autonomous driving. Although considerable
research efforts have been made recently, the inherent trajectory uncertainty
caused by various factors, including dynamic driving intents and diverse
driving scenarios, still poses significant challenges to accurate trajectory
prediction. To address this issue, we propose C2F-TP, a coarse-to-fine
denoising framework for uncertainty-aware vehicle trajectory prediction. C2F-TP
features an innovative two-stage coarse-to-fine prediction process.
Specifically, in the spatial-temporal interaction stage, we propose a
spatial-temporal interaction module to capture the inter-vehicle interactions
and learn a multimodal trajectory distribution, from which a certain number of
noisy trajectories are sampled. Next, in the trajectory refinement stage, we
design a conditional denoising model to reduce the uncertainty of the sampled
trajectories through a step-wise denoising operation. Extensive experiments are
conducted on two real datasets, NGSIM and highD, that are widely adopted in
trajectory prediction. The results demonstrate the effectiveness of our
proposal.
[LINK]
http://arxiv.org/abs/2412.13231v1
[DATE]
2024-12-17 21:42:49+08:00
[CATEGORIES]
cs.LG
Sequential Harmful Shift Detection Without Labels
[AUTHORS]
Salim I. Amoukou, Tom Bewley, Saumitra Mishra, Freddy Lecue, Daniele Magazzeni, Manuela Veloso
[COMMENTS]
Accepted at the 38th Conference on Neural Information Processing
Systems (NeurIPS 2024)
[LINK]
http://arxiv.org/abs/2412.12910v1
[DATE]
2024-12-17 21:37:48+08:00
[CATEGORIES]
cs.LG
PT: A Plain Transformer is Good Hospital Readmission Predictor
[AUTHORS]
Zhenyi Fan, Jiaqi Li, Dongyu Luo, Yuqi Yuan
[ABSTRACT]
Hospital readmission prediction is critical for clinical decision support,
aiming to identify patients at risk of returning within 30 days post-discharge.
High readmission rates often indicate inadequate treatment or post-discharge
care, making effective prediction models essential for optimizing resources and
improving patient outcomes. We propose PT, a Transformer-based model that
integrates Electronic Health Records (EHR), medical images, and clinical notes
to predict 30-day all-cause hospital readmissions. PT extracts features from
raw data and uses specialized Transformer blocks tailored to the data’s
complexity. Enhanced with Random Forest for EHR feature selection and test-time
ensemble techniques, PT achieves superior accuracy, scalability, and
robustness. It performs well even when temporal information is missing. Our
main contributions are: (1) Simplicity: A powerful and efficient baseline model
outperforming existing ones in prediction accuracy; (2) Scalability: Flexible
handling of various features from different modalities, achieving high
performance with just clinical notes or EHR data; (3) Robustness: Strong
predictive performance even with missing or unclear temporal data.
[LINK]
http://arxiv.org/abs/2412.12909v1
[DATE]
2024-12-17 21:37:11+08:00
[CATEGORIES]
cs.LG
Search Strategy Generation for Branch and Bound Using Genetic Programming
[AUTHORS]
Gwen Maudet, Grégoire Danoy
[ABSTRACT]
Branch-and-Bound (B&B) is an exact method in integer programming that
recursively divides the search space into a tree. During the resolution
process, determining the next subproblem to explore within the tree, known as
the search strategy, is crucial. Hand-crafted heuristics are commonly used, but
none are effective over all problem classes. Recent approaches utilizing neural
networks claim to make more intelligent decisions but are computationally
expensive. In this paper, we introduce GP2S (Genetic Programming for Search
Strategy), a novel machine learning approach that automatically generates a
B&B search strategy heuristic, aiming to make intelligent decisions while
being computationally lightweight. We define a policy as a function that
evaluates the quality of a B&B node by combining features from the node and
the problem; the search strategy policy is then defined by a best-first search
based on this node ranking. The policy space is explored using a genetic
programming algorithm, and the policy that achieves the best performance on a
training set is selected. We compare our approach with the standard method of
the SCIP solver, a recent graph neural network-based method, and handcrafted
heuristics. Our first evaluation includes three types of primal hard problems,
tested on instances similar to the training set and on larger instances. Our
method is at most 2% slower than the best baseline and consistently
outperforms SCIP, achieving an average speedup of 11.3%. Additionally, GP2S is
tested on the MIPLIB 2017 dataset, generating multiple heuristics from
different subsets of instances. It exceeds SCIP’s average performance in 7 out
of 10 cases across 15 times more instances and under a time limit 15 times
longer, with some GP2S methods leading on most experiments in terms of the
number of feasible solutions or optimality gap.
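
The idea of a search strategy defined by best-first search over a node-scoring policy can be sketched with a priority queue. This is a generic illustration rather than GP2S itself: `best_first_order`, `children_of`, and `policy` are hypothetical names, lower scores are assumed to be explored first, and bounding and pruning are omitted so that only the node-ordering logic is shown.

```python
import heapq
from itertools import count


def best_first_order(root, children_of, policy):
    """Return nodes in the order a best-first search would explore them.

    `policy(node)` returns a score; the open node with the lowest score is
    always explored next. `children_of(node)` returns the node's children.
    """
    tie_breaker = count()  # keeps heap entries comparable when scores tie
    frontier = [(policy(root), next(tie_breaker), root)]
    explored = []
    while frontier:
        _, _, node = heapq.heappop(frontier)
        explored.append(node)
        for child in children_of(node):
            heapq.heappush(frontier, (policy(child), next(tie_breaker), child))
    return explored
```

Swapping in a different `policy` function changes the exploration order without touching the search loop, which is what makes the policy a natural target for automated generation.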
[COMMENTS]
Accepted at AAAI 2025
[LINK]
http://arxiv.org/abs/2412.09444v2
[DATE]
2024-12-17 21:24:36+08:00
[CATEGORIES]
cs.LG
Design of Restricted Normalizing Flow towards Arbitrary Stochastic Policy with Computational Efficiency
[AUTHORS]
Taisuke Kobayashi, Takumi Aotani
[ABSTRACT]
This paper proposes a new design method for a stochastic control policy using
a normalizing flow (NF). In reinforcement learning (RL), the policy is usually
modeled as a distribution model with trainable parameters. When this
parameterization lacks sufficient expressiveness, it fails to acquire the
optimal policy. A mixture model is capable of universal approximation,
but its excessive redundancy increases the computational cost, which can
become a bottleneck for real-time robot control. As
another approach, NF, which adds parameters for an invertible
transformation of a simple base stochastic model, is expected to offer
high expressiveness at lower computational cost. However, NF cannot compute
its mean analytically due to the complexity of the invertible transformation, and
it lacks reliability because it retains stochastic behaviors after deployment
as a robot controller. This paper therefore designs a restricted NF (RNF) that
achieves an analytic mean by appropriately restricting the invertible
transformation. In addition, the expressiveness impaired by this restriction is
regained by using a bimodal Student-t distribution as its base, yielding the so-called Bit-RNF.
In RL benchmarks, the Bit-RNF policy outperformed the previous models. Finally, a
real robot experiment demonstrated the applicability of the Bit-RNF policy to the real
world. The accompanying video is available on YouTube: https://youtu.be/R_GJVZDW9bk
[COMMENTS]
27 pages, 13 figures
[LINK]
http://arxiv.org/abs/2412.12894v1
[DATE]
2024-12-17 21:19:55+08:00
[CATEGORIES]
cs.LG
Suppressing Uncertainty in Gaze Estimation
[AUTHORS]
Shijing Wang, Yaping Huang
[ABSTRACT]
Uncertainty in gaze estimation manifests in two aspects: 1) low-quality
images caused by occlusion, blurriness, inconsistent eye movements, or even
non-face images; 2) incorrect labels resulting from the misalignment between
the labeled and actual gaze points during the annotation process. Allowing
these uncertainties to participate in training hinders the improvement of gaze
estimation. To tackle these challenges, in this paper, we propose an effective
solution, named Suppressing Uncertainty in Gaze Estimation (SUGE), which
introduces a novel triplet-label consistency measurement to estimate and reduce
the uncertainties. Specifically, for each training sample, we propose to
estimate a novel “neighboring label” calculated by a linearly weighted
projection from the neighbors to capture the similarity relationship between
image features and their corresponding labels, which can be incorporated with
the predicted pseudo label and ground-truth label for uncertainty estimation.
By modeling such triplet-label consistency, we can measure the qualities of
both images and labels, and substantially reduce the negative effects of
unqualified images and wrong labels through our designed sample weighting and
label correction strategies. Experimental results on the gaze estimation
benchmarks indicate that our proposed SUGE achieves state-of-the-art
performance.
[COMMENTS]
This paper has been accepted to AAAI 2024
[LINK]
http://arxiv.org/abs/2412.12890v1
[DATE]
2024-12-17 21:17:19+08:00
[CATEGORIES]
cs.LG
TimeCHEAT: A Channel Harmony Strategy for Irregularly Sampled Multivariate Time Series Analysis
[AUTHORS]
Jiexi Liu, Meng Cao, Songcan Chen
[ABSTRACT]
Irregularly sampled multivariate time series (ISMTS) are prevalent in
reality. Due to their non-uniform intervals between successive observations and
varying sampling rates among series, the channel-independent (CI) strategy,
which recent studies have shown to be more desirable for complete multivariate time
series forecasting, has failed. This failure can be further
attributed to the sampling sparsity, which provides insufficient information
for effective CI learning, thereby reducing its capacity. When we resort to the
channel-dependent (CD) strategy, even higher capacity cannot mitigate the
potential loss of diversity in learning similar embedding patterns across
different channels. We find that existing work considers CI and CD strategies
to be mutually exclusive, primarily because they apply these strategies to the
global channel. However, we hold the view that channel strategies do not
necessarily have to be used globally. Instead, by appropriately applying them
locally and globally, we can create an opportunity to take full advantage of
both strategies. This leads us to introduce the Channel Harmony ISMTS
Transformer (TimeCHEAT), which utilizes the CD locally and the CI globally.
Specifically, we segment the ISMTS into sub-series level patches. Locally, the
CD strategy aggregates information within each patch for time embedding
learning, maximizing the use of relevant observations while reducing long-range
irrelevant interference. Here, we enhance generality by transforming embedding
learning into an edge weight prediction task using bipartite graphs,
eliminating the need for special prior knowledge. Globally, the CI strategy is
applied across patches, allowing the Transformer to learn individualized
attention patterns for each channel. Experimental results indicate our proposed
TimeCHEAT demonstrates competitive SOTA performance across three mainstream
tasks.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.12886v1
[DATE]
2024-12-17 21:10:02+08:00
[CATEGORIES]
cs.LG
A Comparative Study of Pruning Methods in Transformer-based Time Series Forecasting
[AUTHORS]
Nicholas Kiefer, Arvid Weyrauch, Muhammed Öz, Achim Streit, Markus Götz, Charlotte Debus
[ABSTRACT]
The current landscape in time-series forecasting is dominated by
Transformer-based models. Their high parameter count and corresponding demand
in computational resources pose a challenge to real-world deployment,
especially for commercial and scientific applications with low-power embedded
devices. Pruning is an established approach to reduce neural network parameter
count and save compute. However, the implications and benefits of pruning
Transformer-based models for time series forecasting are largely unknown. To
close this gap, we provide a comparative benchmark study by evaluating
unstructured and structured pruning on various state-of-the-art multivariate
time series models. We study the effects of these pruning strategies on model
predictive performance and computational aspects like model size, operations,
and inference time. Our results show that certain models can be pruned even up
to high sparsity levels, outperforming their dense counterpart. However,
fine-tuning pruned models is necessary. Furthermore, we demonstrate that even
with corresponding hardware and software support, structured pruning is unable
to provide significant time savings.
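
The difference between the two pruning families can be shown on a single weight matrix. The sketch below uses magnitude-based criteria, which is an assumption since the abstract does not state which criteria the study evaluates; the function names are hypothetical. Unstructured pruning zeroes individual entries, while the structured variant here zeroes whole rows.

```python
import numpy as np


def unstructured_prune(weights, sparsity):
    """Zero the fraction `sparsity` of entries with the smallest magnitude."""
    w = np.asarray(weights, dtype=float)
    k = int(round(sparsity * w.size))
    pruned = w.copy().ravel()
    if k > 0:
        order = np.argsort(np.abs(w), axis=None)  # flat indices, smallest first
        pruned[order[:k]] = 0.0
    return pruned.reshape(w.shape)


def structured_prune_rows(weights, sparsity):
    """Zero the fraction `sparsity` of rows with the smallest L2 norm."""
    w = np.asarray(weights, dtype=float)
    k = int(round(sparsity * w.shape[0]))
    pruned = w.copy()
    if k > 0:
        norms = np.linalg.norm(w, axis=1)
        pruned[np.argsort(norms)[:k]] = 0.0
    return pruned
```

Zeroed individual entries leave the matrix shape unchanged, so any speedup depends on sparse-aware kernels, whereas zeroed rows could in principle be removed to shrink the dense computation.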
[COMMENTS]
16 pages, 5 figures, submitted to ACM Transactions on Intelligent
Systems and Technology
[LINK]
http://arxiv.org/abs/2412.12883v1
[DATE]
2024-12-17 21:07:31+08:00
[CATEGORIES]
cs.LG
Towards Physically Interpretable World Models: Meaningful Weakly Supervised Representations for Visual Trajectory Prediction
[AUTHORS]
Zhenjiang Mao, Ivan Ruchkin
[ABSTRACT]
Deep learning models are increasingly employed for perception, prediction,
and control in complex systems. Embedding physical knowledge into these models
is crucial for achieving realistic and consistent outputs, a challenge often
addressed by physics-informed machine learning. However, integrating physical
knowledge with representation learning becomes difficult when dealing with
high-dimensional observation data, such as images, particularly under
conditions of incomplete or imprecise state information. To address this, we
propose Physically Interpretable World Models, a novel architecture that aligns
learned latent representations with real-world physical quantities. Our method
combines a variational autoencoder with a dynamical model that incorporates
unknown system parameters, enabling the discovery of physically meaningful
representations. By employing weak supervision with interval-based constraints,
our approach eliminates the reliance on ground-truth physical annotations.
Experimental results demonstrate that our method improves the quality of
learned representations while achieving accurate predictions of future states,
advancing the field of representation learning in dynamic systems.
[LINK]
http://arxiv.org/abs/2412.12870v1
[DATE]
2024-12-17 20:51:24+08:00
[CATEGORIES]
cs.LG
Geodesic Flow Kernels for Semi-Supervised Learning on Mixed-Variable Tabular Dataset
[AUTHORS]
Yoontae Hwang, Yongjae Lee
[ABSTRACT]
Tabular data poses unique challenges due to its heterogeneous nature,
combining both continuous and categorical variables. Existing approaches often
struggle to effectively capture the underlying structure and relationships
within such data. We propose GFTab (Geodesic Flow Kernels for Semi-Supervised
Learning on Mixed-Variable Tabular Dataset), a semi-supervised framework
specifically designed for tabular datasets. GFTab incorporates three key
innovations: 1) Variable-specific corruption methods tailored to the distinct
properties of continuous and categorical variables, 2) A Geodesic flow kernel
based similarity measure to capture geometric changes between corrupted inputs,
and 3) Tree-based embedding to leverage hierarchical relationships from
available labeled data. To rigorously evaluate GFTab, we curate a comprehensive
set of 21 tabular datasets spanning various domains, sizes, and variable
compositions. Our experimental results show that GFTab outperforms existing
ML/DL models across many of these datasets, particularly in settings with
limited labeled data.
[COMMENTS]
AAAI-25
[LINK]
http://arxiv.org/abs/2412.12864v1
[DATE]
2024-12-17 20:47:53+08:00
[CATEGORIES]
cs.LG
Efficient Speech Command Recognition Leveraging Spiking Neural Network and Curriculum Learning-based Knowledge Distillation
[AUTHORS]
Jiaqi Wang, Liutao Yu, Liwei Huang, Chenlin Zhou, Han Zhang, Zhenxi Song, Min Zhang, Zhengyu Ma, Zhiguo Zhang
[ABSTRACT]
The intrinsic dynamics and event-driven nature of spiking neural networks
(SNNs) make them excel in processing temporal information by naturally
utilizing embedded time sequences as time steps. Recent studies adopting this
approach have demonstrated SNNs’ effectiveness in speech command recognition,
achieving high performance by employing large time steps for long time
sequences. However, the large time steps lead to increased deployment burdens
for edge computing applications. Thus, it is important to balance high
performance and low energy consumption when detecting temporal patterns in edge
devices. Our solution comprises two key components. 1) We propose a
high-performance fully spike-driven framework termed SpikeSCR, characterized by
a global-local hybrid structure for efficient representation learning, which
exhibits long-term learning capabilities with extended time steps. 2) To
further embrace low energy consumption, we propose an effective knowledge
distillation method based on curriculum learning (KDCL), where valuable
representations learned from the easy curriculum are progressively transferred
to the hard curriculum with minor loss, striking a trade-off between power
efficiency and high performance. We evaluate our method on three benchmark
datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands
(SSC), and the Google Speech Commands (GSC) V2. Our experimental results
demonstrate that SpikeSCR outperforms current state-of-the-art (SOTA) methods
across these three datasets with the same time steps. Furthermore, by executing
KDCL, we reduce the number of time steps by 60% and decrease energy consumption
by 54.8% while maintaining comparable performance to recent SOTA results.
Therefore, this work offers valuable insights for tackling temporal processing
challenges with long time sequences in edge neuromorphic computing systems.
[COMMENTS]
Under Review
[LINK]
http://arxiv.org/abs/2412.12858v1
[DATE]
2024-12-17 20:38:45+08:00
[CATEGORIES]
cs.LG
Boosting Fine-Grained Visual Anomaly Detection with Coarse-Knowledge-Aware Adversarial Learning
[AUTHORS]
Qingqing Fang, Qinliang Su, Wenxi Lv, Wenchao Xu, Jianxing Yu
[ABSTRACT]
Many unsupervised visual anomaly detection methods train an auto-encoder to
reconstruct normal samples and then leverage the reconstruction error map to
detect and localize the anomalies. However, due to the powerful modeling and
generalization ability of neural networks, some anomalies can also be well
reconstructed, resulting in unsatisfactory detection and localization accuracy.
In this paper, a small coarsely-labeled anomaly dataset is first collected.
Then, a coarse-knowledge-aware adversarial learning method is developed to
align the distribution of reconstructed features with that of normal features.
The alignment can effectively suppress the auto-encoder’s reconstruction
ability on anomalies and thus improve the detection accuracy. Considering that
anomalies often only occupy very small areas in anomalous images, a patch-level
adversarial learning strategy is further developed. Although no patch-level
anomalous information is available, we rigorously prove that by simply viewing
any patch features from anomalous images as anomalies, the proposed
knowledge-aware method can also align the distribution of reconstructed patch
features with the normal ones. Experimental results on four medical datasets
and two industrial datasets demonstrate the effectiveness of our method in
improving the detection and localization performance.
[COMMENTS]
The paper is accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.12850v1
[DATE]
2024-12-17 20:24:08+08:00
[CATEGORIES]
cs.LG
On Foundation Models for Dynamical Systems from Purely Synthetic Data
[AUTHORS]
Martin Ziegler, Andres Felipe Posada-Moreno, Friedrich Solowjow, Sebastian Trimpe
[ABSTRACT]
Foundation models have demonstrated remarkable generalization, data
efficiency, and robustness properties across various domains. In this paper, we
explore the feasibility of foundation models for applications in the control
domain. The success of these models is enabled by large-scale pretraining on
Internet-scale datasets. These are available in fields like natural language
processing and computer vision, but do not exist for dynamical systems. We
address this challenge by pretraining a transformer-based foundation model
exclusively on synthetic data and propose to sample dynamics functions from a
reproducing kernel Hilbert space. Our pretrained model generalizes for
prediction tasks across different dynamical systems, which we validate in
simulation and hardware experiments, including cart-pole and Furuta pendulum
setups. Additionally, the model can be fine-tuned effectively to new systems to
increase performance even further. Our results demonstrate the feasibility of
foundation models for dynamical systems that outperform specialist models in
terms of generalization, data efficiency, and robustness.
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2412.00395v2
[DATE]
2024-12-17 20:04:32+08:00
[CATEGORIES]
cs.LG
Scrutinizing the Vulnerability of Decentralized Learning to Membership Inference Attacks
[AUTHORS]
Ousmane Touat, Jezekael Brunon, Yacine Belal, Julien Nicolas, Mohamed Maouche, César Sabater, Sonia Ben Mokhtar
[ABSTRACT]
The primary promise of decentralized learning is to allow users to engage in
the training of machine learning models in a collaborative manner while keeping
their data on their premises and without relying on any central entity.
However, this paradigm necessitates the exchange of model parameters or
gradients between peers. Such exchanges can be exploited to infer sensitive
information about training data, which is achieved through privacy attacks (e.g.,
Membership Inference Attacks – MIA). In order to devise effective defense
mechanisms, it is important to understand the factors that increase/reduce the
vulnerability of a given decentralized learning architecture to MIA. In this
study, we extensively explore the vulnerability to MIA of various decentralized
learning architectures by varying the graph structure (e.g., number of
neighbors), the graph dynamics, and the aggregation strategy, across diverse
datasets and data distributions. Our key finding, which to the best of our
knowledge we are the first to report, is that the vulnerability to MIA is
heavily correlated to (i) the local model mixing strategy performed by each
node upon reception of models from neighboring nodes and (ii) the global mixing
properties of the communication graph. We illustrate these results
experimentally using four datasets and by theoretically analyzing the mixing
properties of various decentralized architectures. Our paper draws a set of
lessons learned for devising decentralized learning systems that reduce by
design the vulnerability to MIA.
[COMMENTS]
12 pages, 8 figures
[LINK]
http://arxiv.org/abs/2412.12837v1
[DATE]
2024-12-17 20:02:47+08:00
[CATEGORIES]
cs.LG
Conformal Prediction on Quantifying Uncertainty of Dynamic Systems
[AUTHORS]
Aoming Liang, Qi Liu, Lei Xu, Fahad Sohrab, Weicheng Cui, Changhui Song, Moncef Gabbouj
[ABSTRACT]
Numerous studies have focused on learning and understanding the dynamics of
physical systems from video data, such as spatial intelligence. Artificial
intelligence requires quantitative assessments of the uncertainty of the model
to ensure reliability. However, there is still a relative lack of systematic
assessment of the uncertainties, particularly the uncertainties of the physical
data. Our motivation is to introduce conformal prediction into the uncertainty
assessment of dynamical systems, providing a method supported by theoretical
guarantees. This paper uses the conformal prediction method to assess
uncertainties with benchmark operator learning methods. We have also compared
the Monte Carlo Dropout and Ensemble methods in the partial differential
equations dataset, effectively evaluating uncertainty through straight
roll-outs, making it ideal for time-series tasks.
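
As an illustration of the kind of guarantee conformal prediction offers, the following is a minimal sketch of generic split conformal prediction for a scalar regression target. It is not the authors' implementation: the function names and the toy calibration data are hypothetical, and absolute residuals are assumed as the nonconformity score.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    # Rank of the order statistic used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this coverage level.
        return math.inf
    return sorted(scores)[k - 1]


def conformal_interval(prediction, q):
    """Symmetric prediction interval around a point prediction."""
    return (prediction - q, prediction + q)


# Hypothetical calibration set: model predictions and observed values.
cal_preds = [1.0, 2.0, 3.0, 4.0, 5.0]
cal_truth = [1.1, 1.8, 3.3, 4.0, 4.6]
cal_scores = [abs(y - p) for p, y in zip(cal_preds, cal_truth)]

q_hat = conformal_quantile(cal_scores, alpha=0.5)
low, high = conformal_interval(2.5, q_hat)
print(f"interval for a new prediction of 2.5: [{low:.2f}, {high:.2f}]")
```

Under exchangeability of calibration and test points, intervals built this way cover the true value with probability at least 1 - alpha; for roll-outs one would calibrate a separate quantile per time step or use a trajectory-level score.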
[LINK]
http://arxiv.org/abs/2412.10459v2
[DATE]
2024-12-17 19:35:02+08:00
[CATEGORIES]
cs.LG
PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks
[AUTHORS]
Chen Feng, Ziquan Liu, Zhuo Zhi, Ilija Bogunovic, Carsten Gerner-Beuerle, Miguel Rodrigues
[ABSTRACT]
It is widely known that state-of-the-art machine learning models, including
vision and language models, can be seriously compromised by adversarial
perturbations. It is therefore increasingly relevant to develop capabilities to
certify their performance in the presence of the most effective adversarial
attacks. Our paper offers a new approach to certify the performance of machine
learning models in the presence of adversarial attacks with population level
risk guarantees. In particular, we introduce the notion of
$(\alpha,\zeta)$-safe machine learning model. We propose a hypothesis testing
procedure, based on the availability of a calibration set, to derive
statistical guarantees providing that the probability of declaring that the
adversarial (population) risk of a machine learning model is less than $\alpha$
(i.e. the model is safe), while the model is in fact unsafe (i.e. the model
adversarial population risk is higher than $\alpha$), is less than $\zeta$. We
also propose Bayesian optimization algorithms to determine efficiently whether
a machine learning model is $(\alpha,\zeta)$-safe in the presence of an
adversarial attack, along with statistical guarantees. We apply our framework
to a range of machine learning models - including various sizes of vision
Transformer (ViT) and ResNet models - impaired by a variety of adversarial
attacks, such as PGDAttack, MomentumAttack, GenAttack and BanditAttack, to
illustrate the operation of our approach. Importantly, we show that ViTs are
generally more robust to adversarial attacks than ResNets, and large models are
generally more robust than smaller models. Our approach goes beyond existing
empirical adversarial risk-based certification guarantees. It formulates
rigorous (and provable) performance guarantees that can be used to satisfy
regulatory requirements mandating the use of state-of-the-art technical tools.
[COMMENTS]
Accepted to AAAI2025
[LINK]
http://arxiv.org/abs/2402.02629v2
[DATE]
2024-12-17 19:28:49+08:00
[CATEGORIES]
cs.LG
Ask for More Than Bayes Optimal: A Theory of Indecisions for Classification
[AUTHORS]
Mohamed Ndaoud, Peter Radchenko, Bradley Rava
[ABSTRACT]
Selective classification frameworks are useful tools for automated decision
making in highly risky scenarios, since they allow for a classifier to only
make highly confident decisions, while abstaining from making a decision when
it is not confident enough to do so, which is otherwise known as an indecision.
For a given level of classification accuracy, we aim to make as many decisions
as possible. For many problems, this can be achieved without abstaining from
making decisions. But when the problem is hard enough, we show that we can
still control the misclassification rate of a classifier up to any user
specified level, while only abstaining from the minimum necessary amount of
decisions, even if this level of misclassification is smaller than the Bayes
optimal error rate. In many problem settings, the user could obtain a dramatic
decrease in misclassification while only paying a comparatively small price in
terms of indecisions.
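
The trade-off between misclassification and indecisions can be sketched with a simple plug-in rule: pick the lowest confidence threshold whose accepted predictions meet a target error rate on held-out data, and abstain below it. This is an illustrative sketch only, not the procedure analysed in the paper; the function names and toy data are hypothetical.

```python
import math


def choose_threshold(confidences, correct, target_error):
    """Lowest threshold whose accepted set has empirical error <= target_error.

    Returns math.inf (abstain on everything) if no threshold qualifies.
    """
    best = math.inf
    # Scan candidate thresholds from most to least conservative.
    for t in sorted(set(confidences), reverse=True):
        accepted = [ok for c, ok in zip(confidences, correct) if c >= t]
        error_rate = accepted.count(False) / len(accepted)
        if error_rate <= target_error:
            best = t
    return best


def predict_or_abstain(label, confidence, threshold):
    """Return the label, or None to signal an indecision."""
    return label if confidence >= threshold else None


# Hypothetical held-out confidences and whether each prediction was correct.
confidences = [0.9, 0.8, 0.7, 0.6]
correct = [True, True, False, True]
threshold = choose_threshold(confidences, correct, target_error=0.0)
print(predict_or_abstain("positive", 0.85, threshold))
```

Lowering the target error raises the threshold and therefore the number of indecisions, which is the price the abstract refers to.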
[LINK]
http://arxiv.org/abs/2412.12807v1
[DATE]
2024-12-17 19:25:51+08:00
[CATEGORIES]
cs.LG
Learning Set Functions with Implicit Differentiation
[AUTHORS]
Gözde Özcan, Chengzhi Shi, Stratis Ioannidis
[ABSTRACT]
Ou et al. (2022) introduce the problem of learning set functions from data
generated by a so-called optimal subset oracle. Their approach approximates the
underlying utility function with an energy-based model, whose parameters are
estimated via mean-field variational inference. Ou et al. (2022) show this
reduces to fixed point iterations; however, as the number of iterations
increases, automatic differentiation quickly becomes computationally
prohibitive due to the size of the Jacobians that are stacked during
backpropagation. We address this challenge with implicit differentiation and
examine the convergence conditions for the fixed-point iterations. We
empirically demonstrate the efficiency of our method on synthetic and
real-world subset selection applications including product recommendation, set
anomaly detection and compound selection tasks.
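
To see why implicit differentiation avoids stacking Jacobians, consider a scalar fixed point x* = f(x*, theta). Differentiating both sides gives dx*/dtheta = (df/dtheta) / (1 - df/dx), evaluated at x*, so the gradient depends only on the converged point and not on the number of iterations. The sketch below checks this on a toy contraction map; the map f and all names are hypothetical and unrelated to the paper's mean-field updates.

```python
import math


def f(x, theta):
    """Toy contraction map: |df/dx| <= 0.5, so iterations converge."""
    return 0.5 * math.cos(x) + theta


def solve_fixed_point(theta, iters=200):
    """Plain fixed-point iteration x <- f(x, theta)."""
    x = 0.0
    for _ in range(iters):
        x = f(x, theta)
    return x


def implicit_gradient(theta):
    """dx*/dtheta from the implicit function theorem, using only x*."""
    x_star = solve_fixed_point(theta)
    df_dx = -0.5 * math.sin(x_star)
    df_dtheta = 1.0
    return df_dtheta / (1.0 - df_dx)


def finite_difference_gradient(theta, h=1e-6):
    """Central-difference reference used only to check the implicit result."""
    return (solve_fixed_point(theta + h) - solve_fixed_point(theta - h)) / (2 * h)


theta = 0.3
print(implicit_gradient(theta), finite_difference_gradient(theta))
```

In the vector-valued case the division becomes a linear solve with the matrix I - df/dx, which is where the convergence conditions on the fixed-point map matter.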
[COMMENTS]
19 pages, 1 figure, extended version of the AAAI 2025 paper with the
same title
[LINK]
http://arxiv.org/abs/2412.11239v2
[DATE]
2024-12-17 19:14:52+08:00
[CATEGORIES]
cs.LG
Multi-View Incremental Learning with Structured Hebbian Plasticity for Enhanced Fusion Efficiency
[AUTHORS]
Yuhong Chen, Ailin Song, Huifeng Yin, Shuai Zhong, Fuhai Chen, Qi Xu, Shiping Wang, Mingkun Xu
[ABSTRACT]
The rapid evolution of multimedia technology has revolutionized human
perception, paving the way for multi-view learning. However, traditional
multi-view learning approaches are tailored for scenarios with fixed data
views, falling short of emulating the intricate cognitive procedures of the
human brain processing signals sequentially. Our cerebral architecture
seamlessly integrates sequential data through intricate feed-forward and
feedback mechanisms. In stark contrast, traditional methods struggle to
generalize effectively when confronted with data spanning diverse domains,
highlighting the need for innovative strategies that can mimic the brain’s
adaptability and dynamic integration capabilities. In this paper, we propose a
bio-neurologically inspired multi-view incremental framework named MVIL aimed
at emulating the brain’s fine-grained fusion of sequentially arriving views.
MVIL comprises two fundamental modules: structured Hebbian plasticity and synaptic
partition learning. The structured Hebbian plasticity reshapes the structure of
weights to express the high correlation between view representations,
facilitating a fine-grained fusion of view representations. Moreover, synaptic
partition learning is efficient in alleviating drastic changes in weights and
also retaining old knowledge by inhibiting partial synapses. These modules
bionically play a central role in reinforcing crucial associations between
newly acquired information and existing knowledge repositories, thereby
enhancing the network’s capacity for generalization. Experimental results on
six benchmark datasets show MVIL’s effectiveness over state-of-the-art methods.
[COMMENTS]
11 pages
[LINK]
http://arxiv.org/abs/2412.12801v1
[DATE]
2024-12-17 19:10:46+08:00
[CATEGORIES]
cs.LG
Noise-based Local Learning using Stochastic Magnetic Tunnel Junctions
[AUTHORS]
Kees Koenders, Leo Schnitzpan, Fabian Kammerbauer, Sinan Shu, Gerhard Jakob, Mathis Kläui, Johan Mentink, Nasir Ahmad, Marcel van Gerven
[ABSTRACT]
Brain-inspired learning in physical hardware has enormous potential to learn
fast at minimal energy expenditure. One of the characteristics of biological
learning systems is their ability to learn in the presence of various noise
sources. Inspired by this observation, we introduce a novel noise-based
learning approach for physical systems implementing multi-layer neural
networks. Simulation results show that our approach allows for effective
learning whose performance approaches that of the conventional effective yet
energy-costly backpropagation algorithm. Using a spintronics hardware
implementation, we demonstrate experimentally that learning can be achieved in
a small network composed of physical stochastic magnetic tunnel junctions.
These results provide a path towards efficient learning in general physical
systems which embraces rather than mitigates the noise inherent in physical
devices.
[COMMENTS]
20 pages, 5 figures, submitted to Physical Review X
[LINK]
http://arxiv.org/abs/2412.12783v1
[DATE]
2024-12-17 18:43:26+08:00
[CATEGORIES]
cs.LG
Decomposing Global Feature Effects Based on Feature Interactions
[AUTHORS]
Julia Herbinger, Marvin N. Wright, Thomas Nagler, Bernd Bischl, Giuseppe Casalicchio
[ABSTRACT]
Global feature effect methods, such as partial dependence plots, provide an
intelligible visualization of the expected marginal feature effect. However,
such global feature effect methods can be misleading, as they do not represent
local feature effects of single observations well when feature interactions are
present. We formally introduce generalized additive decomposition of global
effects (GADGET), which is a new framework based on recursive partitioning to
find interpretable regions in the feature space such that the
interaction-related heterogeneity of local feature effects is minimized. We
provide a mathematical foundation of the framework and show that it is
applicable to the most popular methods to visualize marginal feature effects,
namely partial dependence, accumulated local effects, and Shapley additive
explanations (SHAP) dependence. Furthermore, we introduce and validate a new
permutation-based interaction detection procedure that is applicable to any
feature effect method that fits into our proposed framework. We empirically
evaluate the theoretical characteristics of the proposed methods based on
various feature effect methods in different experimental settings. Moreover, we
apply our introduced methodology to three real-world examples to showcase their
usefulness.
[LINK]
http://arxiv.org/abs/2306.00541v3
[DATE]
2024-12-17 18:31:14+08:00
[CATEGORIES]
cs.LG
Progressive Monitoring of Generative Model Training Evolution
[AUTHORS]
Vidya Prasad, Anna Vilanova, Nicola Pezzotti
[ABSTRACT]
While deep generative models (DGMs) have gained popularity, their
susceptibility to biases and other inefficiencies that lead to undesirable
outcomes remains an issue. With their growing complexity, there is a critical
need for early detection of issues to achieve desired results and optimize
resources. Hence, we introduce a progressive analysis framework to monitor the
training process of DGMs. Our method utilizes dimensionality reduction
techniques to facilitate the inspection of latent representations, the
generated and real distributions, and their evolution across training
iterations. This monitoring allows us to pause and fix the training method if
the representations or distributions progress undesirably. This approach allows
for the analysis of a model’s training dynamics and the timely identification
of biases and failures, minimizing computational loads. We demonstrate how our
method supports identifying and mitigating biases early in training a
Generative Adversarial Network (GAN) and improving the quality of the generated
data distribution.
[LINK]
http://arxiv.org/abs/2412.12755v1
[DATE]
2024-12-17 18:20:29+08:00
[CATEGORIES]
cs.LG
An Empirical Study of Fault Localisation Techniques for Deep Learning
[AUTHORS]
Nargiz Humbatova, Jinhan Kim, Gunel Jahangirova, Shin Yoo, Paolo Tonella
[ABSTRACT]
With the increased popularity of Deep Neural Networks (DNNs), the need for
tools to assist developers in the DNN implementation, testing and debugging
process has also increased. Several approaches have been proposed that automatically
analyse and localise potential faults in DNNs under test. In this work, we
evaluate and compare existing state-of-the-art fault localisation techniques,
which operate based on both dynamic and static analysis of the DNN. The
evaluation is performed on a benchmark consisting of both real faults obtained
from bug reporting platforms and faulty models produced by a mutation tool. Our
findings indicate that the usage of a single, specific ground truth (e.g., the
human defined one) for the evaluation of DNN fault localisation tools results
in pretty low performance (maximum average recall of 0.31 and precision of
0.23). However, such figures increase when considering alternative, equivalent
patches that exist for a given faulty DNN. Results indicate that DeepFD is the
most effective tool, achieving an average recall of 0.61 and precision of 0.41
on our benchmark.
[LINK]
http://arxiv.org/abs/2412.11304v2
[DATE]
2024-12-17 18:07:46+08:00
[CATEGORIES]
cs.LG
Subspace Implicit Neural Representations for Real-Time Cardiac Cine MR Imaging
[AUTHORS]
Wenqi Huang, Veronika Spieker, Siying Xu, Gastao Cruz, Claudia Prieto, Julia Schnabel, Kerstin Hammernik, Thomas Kuestner, Daniel Rueckert
[ABSTRACT]
Conventional cardiac cine MRI methods rely on retrospective gating, which
limits temporal resolution and the ability to capture continuous cardiac
dynamics, particularly in patients with arrhythmias and beat-to-beat
variations. To address these challenges, we propose a reconstruction framework
based on subspace implicit neural representations for real-time cardiac cine
MRI of continuously sampled radial data. This approach employs two multilayer
perceptrons to learn spatial and temporal subspace bases, leveraging the
low-rank properties of cardiac cine MRI. Initialized with low-resolution
reconstructions, the networks are fine-tuned using spoke-specific loss
functions to recover spatial details and temporal fidelity. Our method directly
utilizes the continuously sampled radial k-space spokes during training,
thereby eliminating the need for binning and non-uniform FFT. This approach
achieves superior spatial and temporal image quality compared to conventional
binned methods at acceleration rates of 10 and 20, demonstrating potential
for high-resolution imaging of dynamic cardiac events and enhancing diagnostic
capability.
[LINK]
http://arxiv.org/abs/2412.12742v1
[DATE]
2024-12-17 18:06:37+08:00
[CATEGORIES]
cs.LG
Deep Learning for Resilient Adversarial Decision Fusion in Byzantine Networks
[AUTHORS]
Kassem Kallas
[ABSTRACT]
This paper introduces a deep learning-based framework for resilient decision
fusion in adversarial multi-sensor networks, providing a unified mathematical
setup that encompasses diverse scenarios, including varying Byzantine node
proportions, synchronized and unsynchronized attacks, unbalanced priors,
adaptive strategies, and Markovian states. Unlike traditional methods, which
depend on explicit parameter tuning and are limited by scenario-specific
assumptions, the proposed approach employs a deep neural network trained on a
globally constructed dataset to generalize across all cases without requiring
adaptation. Extensive simulations validate the method’s robustness, achieving
superior accuracy, minimal error probability, and scalability compared to
state-of-the-art techniques, while ensuring computational efficiency for
real-time applications. This unified framework demonstrates the potential of
deep learning to revolutionize decision fusion by addressing the challenges
posed by Byzantine nodes in dynamic adversarial environments.
[LINK]
http://arxiv.org/abs/2412.12739v1
[DATE]
2024-12-17 18:02:04+08:00
[CATEGORIES]
cs.LG
Towards Kriging-informed Conditional Diffusion for Regional Sea-Level Data Downscaling
[AUTHORS]
Subhankar Ghosh, Arun Sharma, Jayant Gupta, Aneesh Subramanian, Shashi Shekhar
[ABSTRACT]
Given coarser-resolution projections from global climate models or satellite
data, the downscaling problem aims to estimate finer-resolution regional
climate data, capturing fine-scale spatial patterns and variability.
Downscaling is any method to derive high-resolution data from low-resolution
variables, often to provide more detailed and local predictions and analyses.
This problem is societally crucial for effective adaptation, mitigation, and
resilience against significant risks from climate change. The challenge arises
from spatial heterogeneity and the need to recover finer-scale features while
ensuring model generalization. Most downscaling methods \cite{Li2020} fail to
capture the spatial dependencies at finer scales and underperform on real-world
climate datasets, such as sea-level rise. We propose a novel Kriging-informed
Conditional Diffusion Probabilistic Model (Ki-CDPM) to capture spatial
variability while preserving fine-scale features. Experimental results on
climate data show that our proposed method is more accurate than
state-of-the-art downscaling techniques.
[LINK]
http://arxiv.org/abs/2410.15628v2
[DATE]
2024-12-17 18:01:59+08:00
[CATEGORIES]
cs.LG
SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization
[AUTHORS]
Shuchen Zhu, Boao Kong, Songtao Lu, Xinmeng Huang, Kun Yuan
[ABSTRACT]
This paper studies decentralized bilevel optimization, in which multiple
agents collaborate to solve problems involving nested optimization structures
with neighborhood communications. Most existing literature primarily utilizes
gradient tracking to mitigate the influence of data heterogeneity, without
exploring other well-known heterogeneity-correction techniques such as EXTRA or
Exact Diffusion. Additionally, these studies often employ identical
decentralized strategies for both upper- and lower-level problems, neglecting
to leverage distinct mechanisms across different levels. To address these
limitations, this paper proposes SPARKLE, a unified Single-loop Primal-dual
AlgoRithm frameworK for decentraLized bilEvel optimization. SPARKLE offers the
flexibility to incorporate various heterogeneity-correction strategies into the
algorithm. Moreover, SPARKLE allows for different strategies to solve upper-
and lower-level problems. We present a unified convergence analysis for
SPARKLE, applicable to all its variants, with state-of-the-art convergence
rates compared to existing decentralized bilevel algorithms. Our results
further reveal that EXTRA and Exact Diffusion are more suitable for
decentralized bilevel optimization, and using mixed strategies in bilevel
algorithms brings more benefits than relying solely on gradient tracking.
[COMMENTS]
74 pages, the Thirty-Eighth Annual Conference on Neural Information
Processing Systems (2024)
[LINK]
http://arxiv.org/abs/2411.14166v3
[DATE]
2024-12-17 17:38:53+08:00
[CATEGORIES]
cs.LG
No More Adam: Learning Rate Scaling at Initialization is All You Need
[AUTHORS]
Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen
[ABSTRACT]
In this work, we question the necessity of adaptive gradient methods for
training deep neural networks. SGD-SaI is a simple yet effective enhancement to
stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning
rate Scaling at Initialization (SaI) to distinct parameter groups, guided by
their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning
rates without relying on adaptive second-order momentum, SGD-SaI helps prevent
training imbalances from the very first iteration and cuts the optimizer’s
memory usage by half compared to AdamW. Despite its simplicity and efficiency,
SGD-SaI consistently matches or outperforms AdamW in training a variety of
Transformer-based tasks, effectively overcoming a long-standing challenge of
using SGD for training Transformers. SGD-SaI excels in ImageNet-1K
classification with Vision Transformers (ViT) and GPT-2 pretraining for large
language models (LLMs, transformer decoder-only), demonstrating robustness to
hyperparameter variations and practicality for diverse applications. We further
tested its robustness on tasks like LoRA fine-tuning for LLMs and diffusion
models, where it consistently outperforms state-of-the-art optimizers. From a
memory efficiency perspective, SGD-SaI achieves substantial memory savings for
optimizer states, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters)
and 25.15 GB for Llama2-7B compared to AdamW in full-precision training
settings.
[COMMENTS]
20 pages, 10 figures
[LINK]
http://arxiv.org/abs/2412.11768v2
[DATE]
2024-12-17 17:30:44+08:00
[CATEGORIES]
cs.LG
Accelerating lensed quasars discovery and modeling with physics-informed variational autoencoders
[AUTHORS]
Irham T. Andika, Stefan Schuldt, Sherry H. Suyu, Satadru Bag, Raoul Cañameras, Alejandra Melo, Claudio Grillo, James H. H. Chan
[ABSTRACT]
Strongly lensed quasars provide valuable insights into the rate of cosmic
expansion, the distribution of dark matter in foreground deflectors, and the
characteristics of quasar hosts. However, detecting them in astronomical images
is difficult due to the prevalence of non-lensing objects. To address this
challenge, we developed a generative deep learning model called VariLens, built
upon a physics-informed variational autoencoder. This model seamlessly
integrates three essential modules: image reconstruction, object
classification, and lens modeling, offering a fast and comprehensive approach
to strong lens analysis. VariLens is capable of rapidly determining both (1)
the probability that an object is a lens system and (2) key parameters of a
singular isothermal ellipsoid (SIE) mass model – including the Einstein radius
($\theta_\mathrm{E}$), lens center, and ellipticity – in just milliseconds
using a single CPU. A direct comparison of VariLens estimates with traditional
lens modeling for 20 known lensed quasars within the Subaru Hyper Suprime-Cam
(HSC) footprint shows good agreement, with both results consistent within
$2\sigma$ for systems with $\theta_\mathrm{E}<3$ arcsecs. To identify new
lensed quasar candidates, we begin with an initial sample of approximately 80
million sources, combining HSC data with multiwavelength information from
various surveys. After applying a photometric preselection aimed at locating
$z>1.5$ sources, the number of candidates is reduced to 710,966. Subsequently,
VariLens highlights 13,831 sources, each showing a high likelihood of being a
lens. A visual assessment of these objects results in 42 promising candidates
that await spectroscopic confirmation. These results underscore the potential
of automated deep learning pipelines to efficiently detect and model strong
lenses in large datasets.
[COMMENTS]
Submitted to the Astronomy & Astrophysics journal. The paper consists
of 17 main pages, 14 figures, and 5 tables. We welcome feedback and comments
from readers!
[LINK]
http://arxiv.org/abs/2412.12709v1
[DATE]
2024-12-17 17:23:46+08:00
[CATEGORIES]
cs.LG
ParMod: A Parallel and Modular Framework for Learning Non-Markovian Tasks
[AUTHORS]
Ruixuan Miao, Xu Lu, Cong Tian, Bin Yu, Zhenhua Duan
[ABSTRACT]
The commonly used Reinforcement Learning (RL) model, MDPs (Markov Decision
Processes), has a basic premise that rewards depend on the current state and
action only. However, many real-world tasks are non-Markovian, with
long-term memory and dependencies. The reward sparseness problem is further
amplified in non-Markovian scenarios. Hence learning a non-Markovian task (NMT)
is inherently more difficult than learning a Markovian one. In this paper, we
propose a novel Parallel and Modular RL framework, ParMod,
specifically for learning NMTs specified by temporal logic. With the aid of
formal techniques, the NMT is modularized into a series of sub-tasks based on
the automaton structure (equivalent to its temporal logic counterpart). On this
basis, sub-tasks will be trained by a group of agents in a parallel fashion,
with one agent handling one sub-task. Besides parallel training, the core of
ParMod lies in: a flexible classification method for modularizing the NMT, and
an effective reward shaping method for improving the sample efficiency. A
comprehensive evaluation is conducted on several challenging benchmark problems
with respect to various metrics. The experimental results show that ParMod
achieves superior performance over other relevant studies. Our work thus
provides a good synergy among RL, NMT and temporal logic.
[LINK]
http://arxiv.org/abs/2412.12700v1
[DATE]
2024-12-17 17:16:53+08:00
[CATEGORIES]
cs.LG
Auto-bidding in real-time auctions via Oracle Imitation Learning (OIL)
[AUTHORS]
Alberto Silvio Chiappa, Briti Gangopadhyay, Zhao Wang, Shingo Takamatsu
[ABSTRACT]
Online advertising has become one of the most successful business models of
the internet era. Impression opportunities are typically allocated through
real-time auctions, where advertisers bid to secure advertisement slots.
Deciding the best bid for an impression opportunity is challenging, due to the
stochastic nature of user behavior and the variability of advertisement traffic
over time. In this work, we propose a framework for training auto-bidding
agents in multi-slot second-price auctions to maximize acquisitions (e.g.,
clicks, conversions) while adhering to budget and cost-per-acquisition (CPA)
constraints. We exploit the insight that, after an advertisement campaign
concludes, determining the optimal bids for each impression opportunity can be
framed as a multiple-choice knapsack problem (MCKP) with a nonlinear objective.
We propose an “oracle” algorithm that identifies a near-optimal combination of
impression opportunities and advertisement slots, considering both past and
future advertisement traffic data. This oracle solution serves as a training
target for a student network which bids having access only to real-time
information, a method we term Oracle Imitation Learning (OIL). Through
numerical experiments, we demonstrate that OIL achieves superior performance
compared to both online and offline reinforcement learning algorithms, offering
improved sample efficiency. Notably, OIL shifts the complexity of training
auto-bidding agents from crafting sophisticated learning algorithms to solving
a nonlinear constrained optimization problem efficiently.
[LINK]
http://arxiv.org/abs/2412.11434v2
[DATE]
2024-12-17 16:56:42+08:00
[CATEGORIES]
cs.LG
A Two-Fold Patch Selection Approach for Improved 360-Degree Image Quality Assessment
[AUTHORS]
Abderrezzaq Sendjasni, Seif-Eddine Benkabou, Mohamed-Chaker Larabi
[ABSTRACT]
This article presents a novel approach to improving the accuracy of
360-degree perceptual image quality assessment (IQA) through a two-fold patch
selection process. Our methodology combines visual patch selection with
embedding similarity-based refinement. The first stage focuses on selecting
patches from 360-degree images using three distinct sampling methods to ensure
comprehensive coverage of visual content for IQA. The second stage, which is
the core of our approach, employs an embedding similarity-based selection
process to filter and prioritize the most informative patches based on their
embedding similarity distances. This dual selection mechanism ensures that the
training data is both relevant and informative, enhancing the model’s learning
efficiency. Extensive experiments and statistical analyses using three distance
metrics across three benchmark datasets validate the effectiveness of our
selection algorithm. The results highlight its potential to deliver robust and
accurate 360-degree IQA, with performance gains of up to 4.5% in accuracy and
monotonicity of quality score prediction, while using only 40% to 50% of the
training patches. These improvements are consistent across various
configurations and evaluation metrics, demonstrating the strength of the
proposed method. The code for the selection process is available at:
https://github.com/sendjasni/patch-selection-360-image-quality.
[COMMENTS]
Submitted to IEEE Transactions on Image Processing
[LINK]
http://arxiv.org/abs/2412.12667v1
[DATE]
2024-12-17 16:36:47+08:00
[CATEGORIES]
cs.LG
Shared Attention-based Autoencoder with Hierarchical Fusion-based Graph Convolution Network for sEEG SOZ Identification
[AUTHORS]
Huachao Yan, Kailing Guo, Shiwei Song, Yihai Dai, Xiaoqiang Wei, Xiaofen Xing, Xiangmin Xu
[ABSTRACT]
Diagnosing seizure onset zone (SOZ) is a challenge in neurosurgery, where
stereoelectroencephalography (sEEG) serves as a critical technique. In sEEG SOZ
identification, the existing studies focus solely on the intra-patient
representation of epileptic information, overlooking the general features of
epilepsy across patients and feature interdependencies between feature elements
in each contact site. In order to address the aforementioned challenges, we
propose the shared attention-based autoencoder (sATAE). sATAE is trained by
sEEG data across all patients, with attention blocks introduced to enhance the
representation of interdependencies between feature elements. Considering the
spatial diversity of sEEG across patients, we introduce a graph-based method for
identifying the SOZ of each patient. However, the current graph-based methods
for sEEG SOZ identification rely exclusively on static graphs to model
epileptic networks. Inspired by the neuroscience finding that the epileptic
network is characterized by an intricate interplay and equilibrium
between fluctuating and stable states, we design the hierarchical
fusion-based graph convolution network (HFGCN) to identify the SOZ. HFGCN
integrates the dynamic and static characteristics of epileptic networks through
hierarchical weighting across different hierarchies, facilitating a more
comprehensive learning of epileptic features and enriching node information for
sEEG SOZ identification. Combining sATAE and HFGCN, we perform comprehensive
experiments with sATAE-HFGCN on a self-built sEEG dataset, which includes
sEEG data from 17 patients with temporal lobe epilepsy. The results show that
our method, sATAE-HFGCN, achieves superior performance for identifying the SOZ
of each patient, effectively addressing the aforementioned challenges,
providing an efficient solution for sEEG-based SOZ identification.
[LINK]
http://arxiv.org/abs/2412.12651v1
[DATE]
2024-12-17 16:20:02+08:00
[CATEGORIES]
cs.LG
Neural-Network-Driven Reward Prediction as a Heuristic: Advancing Q-Learning for Mobile Robot Path Planning
[AUTHORS]
Yiming Ji, Kaijie Yun, Yang Liu, Zongwu Xie, Hong Liu
[ABSTRACT]
Q-learning is a widely used reinforcement learning technique for solving path
planning problems. It primarily involves the interaction between an agent and
its environment, enabling the agent to learn an optimal strategy that maximizes
cumulative rewards. Although many studies have reported the effectiveness of
Q-learning, it still faces slow convergence issues in practical applications.
To address this issue, we propose the NDR-QL method, which utilizes neural
network outputs as heuristic information to accelerate the convergence process
of Q-learning. Specifically, we improved the dual-output neural network model
by introducing a start-end channel separation mechanism and enhancing the
feature fusion process. After training, the proposed NDR model can output a
narrowly focused optimal probability distribution, referred to as the
guideline, and a broadly distributed suboptimal distribution, referred to as
the region. Subsequently, based on the guideline prediction, we calculate the
continuous reward function for the Q-learning method, and based on the region
prediction, we initialize the Q-table with a bias. We conducted training,
validation, and path planning simulation experiments on public datasets. The
results indicate that the NDR model outperforms previous methods by up to 5%
in prediction accuracy. Furthermore, the proposed NDR-QL method improves the
convergence speed of the baseline Q-learning method by 90% and also surpasses
the previously improved Q-learning methods in path quality metrics.
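The two uses of the network predictions described above (a reward shaped by the guideline, and a Q-table initialized with a bias from the region) can be sketched in tabular form. This is a minimal illustration and not the authors' implementation: the `guideline` and `region` dictionaries, the `weight` and `bias_scale` constants, and all function names are hypothetical.

```python
def init_q_table(region, n_actions, bias_scale=1.0):
    """Start each state's Q-values at a bias proportional to its (hypothetical) region score."""
    return {state: [bias_scale * score] * n_actions for state, score in region.items()}


def shaped_reward(base_reward, guideline, next_state, weight=0.5):
    """Add a continuous bonus taken from the (hypothetical) guideline score of the next state."""
    return base_reward + weight * guideline.get(next_state, 0.0)


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update; returns the new Q-value."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]
```

States with a higher region score start with larger Q-values, so a greedy agent is drawn toward them before any reward has been observed.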
[LINK]
http://arxiv.org/abs/2412.12650v1
[DATE]
2024-12-17 16:19:40+08:00
[CATEGORIES]
cs.LG
On the Impact of Hard Adversarial Instances on Overfitting in Adversarial Training
[AUTHORS]
Chen Liu, Zhichao Huang, Mathieu Salzmann, Tong Zhang, Sabine Süsstrunk
[ABSTRACT]
Adversarial training is a popular method to robustify models against
adversarial attacks. However, it exhibits much more severe overfitting than
training on clean inputs. In this work, we investigate this phenomenon from the
perspective of training instances, i.e., training input-target pairs. Based on
a quantitative metric measuring the relative difficulty of an instance in the
training set, we analyze the model’s behavior on training instances of
different difficulty levels. This lets us demonstrate that the decay in
generalization performance of adversarial training is a result of fitting hard
adversarial instances. We theoretically verify our observations for both linear
and general nonlinear models, proving that models trained on hard instances
have worse generalization performance than ones trained on easy instances, and
that this generalization gap increases with the size of the adversarial budget.
Finally, we investigate solutions to mitigate adversarial overfitting in
several scenarios, including fast adversarial training and fine-tuning a
pretrained model with additional data. Our results demonstrate that using
training data adaptively improves the model’s robustness.
[LINK]
http://arxiv.org/abs/2112.07324v2
[DATE]
2024-12-17 16:17:26+08:00
[CATEGORIES]
cs.LG
Exploring AI-Enabled Cybersecurity Frameworks: Deep-Learning Techniques, GPU Support, and Future Enhancements
[AUTHORS]
Tobias Becher, Simon Torka
[ABSTRACT]
Traditional rule-based cybersecurity systems have proven highly effective
against known malware threats. However, they face challenges in detecting novel
threats. To address this issue, emerging cybersecurity systems are
incorporating AI techniques, specifically deep-learning algorithms, to enhance
their ability to detect incidents, analyze alerts, and respond to events. While
these techniques offer a promising approach to combating dynamic security
threats, they often require significant computational resources. Therefore,
frameworks that incorporate AI-based cybersecurity mechanisms need to support
the use of GPUs to ensure optimal performance.
Many cybersecurity framework vendors do not provide sufficiently detailed
information about their implementation, making it difficult to assess the
techniques employed and their effectiveness. This study aims to overcome this
limitation by providing an overview of the most used cybersecurity frameworks
that utilize AI techniques, specifically focusing on frameworks that provide
comprehensive information about their implementation. Our primary objective is
to identify the deep-learning techniques employed by these frameworks and
evaluate their support for GPU acceleration. We have identified a total of
\emph{two} deep-learning algorithms that are utilized by \emph{three} out of 38
selected cybersecurity frameworks. Our findings aim to assist in selecting
open-source cybersecurity frameworks for future research and assessing any
discrepancies between deep-learning techniques used in theory and practice.
[LINK]
http://arxiv.org/abs/2412.12648v1
[DATE]
2024-12-17 16:14:12+08:00
[CATEGORIES]
cs.LG
RDPI: A Refine Diffusion Probability Generation Method for Spatiotemporal Data Imputation
[AUTHORS]
Zijin Liu, Xiang Zhao, You Song
[ABSTRACT]
Spatiotemporal data imputation plays a crucial role in various fields such as
traffic flow monitoring, air quality assessment, and climate prediction.
However, spatiotemporal data collected by sensors often suffer from temporal
incompleteness, and the sparse and uneven distribution of sensors leads to
missing data in the spatial dimension. Among existing methods, autoregressive
approaches are prone to error accumulation, while simple conditional diffusion
models fail to adequately capture the spatiotemporal relationships between
observed and missing data. To address these issues, we propose a novel
two-stage Refined Diffusion Probability Imputation (RDPI) framework based on an
initial network and a conditional diffusion model. In the initial stage,
deterministic imputation methods are used to generate preliminary estimates of
the missing data. In the refinement stage, residuals are treated as the
diffusion target, and observed values are innovatively incorporated into the
forward process. This results in a conditional diffusion model better suited
for spatiotemporal data imputation, bridging the gap between the preliminary
estimates and the true values. Experiments on multiple datasets demonstrate
that RDPI not only achieves state-of-the-art imputation accuracy but also
significantly reduces sampling computational costs.
[LINK]
http://arxiv.org/abs/2412.12642v1
[DATE]
2024-12-17 16:06:00+08:00
[CATEGORIES]
cs.LG
Lagrangian Index Policy for Restless Bandits with Average Reward
[AUTHORS]
Konstantin Avrachenkov, Vivek S. Borkar, Pratik Shah
[ABSTRACT]
We study the Lagrangian Index Policy (LIP) for restless multi-armed bandits
with long-run average reward. In particular, we compare the performance of LIP
with the performance of the Whittle Index Policy (WIP), both heuristic policies
known to be asymptotically optimal under certain natural conditions. Even
though in most cases their performances are very similar, in the cases when WIP
shows bad performance, LIP continues to perform very well. We then propose
reinforcement learning algorithms, both tabular and NN-based, to obtain online
learning schemes for LIP in the model-free setting. The proposed reinforcement
learning schemes for LIP require significantly less memory than the analogous
scheme for WIP. We calculate analytically the Lagrangian index for the restart
model, which describes the optimal web crawling and the minimization of the
weighted age of information. We also give a new proof of asymptotic optimality
in case of homogeneous bandits as the number of arms goes to infinity, based on
exchangeability and de Finetti’s theorem.
[LINK]
http://arxiv.org/abs/2412.12641v1
[DATE]
2024-12-17 16:03:53+08:00
[CATEGORIES]
cs.LG
Building Gradient Bridges: Label Leakage from Restricted Gradient Sharing in Federated Learning
[AUTHORS]
Rui Zhang, Ka-Ho Chow, Ping Li
[ABSTRACT]
The growing concern over data privacy, the benefits of utilizing data from
diverse sources for model training, and the proliferation of networked devices
with enhanced computational capabilities have all contributed to the rise of
federated learning (FL). The clients in FL collaborate to train a global model
by uploading gradients computed on their private datasets without collecting
raw data. However, a new attack surface has emerged from gradient sharing,
where adversaries can restore the label distribution of a victim’s private data
by analyzing the obtained gradients. To mitigate this privacy leakage, existing
lightweight defenses restrict the sharing of gradients, such as encrypting the
final-layer gradients or locally updating the parameters within. In this paper,
we introduce a novel attack called Gradient Bridge (GDBR) that recovers the
label distribution of training data from the limited gradient information
shared in FL. GDBR explores the relationship between the layer-wise gradients,
tracks the flow of gradients, and analytically derives the batch training
labels. Extensive experiments show that GDBR can accurately recover more than
80% of labels in various FL settings. GDBR highlights the inadequacy of
restricted gradient sharing-based defenses and calls for the design of
effective defense schemes in FL.
[LINK]
http://arxiv.org/abs/2412.12640v1
[DATE]
2024-12-17 16:03:38+08:00
[CATEGORIES]
cs.LG
Exploring Diffusion and Flow Matching Under Generator Matching
[AUTHORS]
Zeeshan Patel, James DeLoye, Lance Mathias
[ABSTRACT]
In this paper, we present a comprehensive theoretical comparison of diffusion
and flow matching under the Generator Matching framework. Despite their
apparent differences, both diffusion and flow matching can be viewed under the
unified framework of Generator Matching. By recasting both diffusion and flow
matching under the same generative Markov framework, we provide theoretical
insights into why flow matching models can be more robust empirically and how
novel model classes can be constructed by mixing deterministic and stochastic
components. Our analysis offers a fresh perspective on the relationships
between state-of-the-art generative modeling paradigms.
[LINK]
http://arxiv.org/abs/2412.11024v2
[DATE]
2024-12-17 15:45:29+08:00
[CATEGORIES]
cs.LG
Modeling Temporal Dependencies within the Target for Long-Term Time Series Forecasting
[AUTHORS]
Qi Xiong, Kai Tang, Minbo Ma, Ji Zhang, Jie Xu, Tianrui Li
[ABSTRACT]
Long-term time series forecasting (LTSF) is a critical task across diverse
domains. Despite significant advancements in LTSF research, we identify a
performance bottleneck in existing LTSF methods caused by the inadequate
modeling of Temporal Dependencies within the Target (TDT). To address this
issue, we propose a novel and generic temporal modeling framework, Temporal
Dependency Alignment (TDAlign), that equips existing LTSF methods with TDT
learning capabilities. TDAlign introduces two key innovations: 1) a loss
function that aligns the change values between adjacent time steps in the
predictions with those in the target, ensuring consistency with variation
patterns, and 2) an adaptive loss balancing strategy that seamlessly integrates
the new loss function with existing LTSF methods without introducing additional
learnable parameters. As a plug-and-play framework, TDAlign enhances existing
methods with minimal computational overhead, featuring only linear time
complexity and constant space complexity relative to the prediction length.
Extensive experiments on six strong LTSF baselines across seven real-world
datasets demonstrate the effectiveness and flexibility of TDAlign. On average,
TDAlign reduces baseline prediction errors by 1.47% to
9.19% and change value errors by 4.57% to 15.78%,
highlighting its substantial performance improvements.
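The first innovation, a loss on the change values between adjacent time steps, can be sketched as follows. This is an illustrative reading of the abstract, not the paper's exact formulation: the mean absolute error and the fixed `weight` are assumptions, whereas the paper describes an adaptive balancing strategy whose details are not given here.

```python
def first_differences(series):
    """Change values between adjacent time steps."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def mean_absolute_error(xs, ys):
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def combined_loss(pred, target, weight=0.5):
    """Blend the usual value loss with a loss on change values (fixed weight is an assumption)."""
    value_loss = mean_absolute_error(pred, target)
    change_loss = mean_absolute_error(first_differences(pred), first_differences(target))
    return (1.0 - weight) * value_loss + weight * change_loss
```

Both terms take one pass over the prediction horizon, which is consistent with the linear time complexity stated in the abstract.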
[LINK]
http://arxiv.org/abs/2406.04777v2
[DATE]
2024-12-17 15:44:36+08:00
[CATEGORIES]
cs.LG
Accelerating nuclear-norm regularized low-rank matrix optimization through Burer-Monteiro decomposition
[AUTHORS]
Ching-pei Lee, Ling Liang, Tianyun Tang, Kim-Chuan Toh
[ABSTRACT]
This work proposes a rapid algorithm, BM-Global, for nuclear-norm-regularized
convex and low-rank matrix optimization problems. BM-Global efficiently
decreases the objective value via low-cost steps leveraging the nonconvex but
smooth Burer-Monteiro (BM) decomposition, while effectively escaping saddle
points and spurious local minima ubiquitous in the BM form to obtain guarantees
of fast convergence rates to the global optima of the original
nuclear-norm-regularized problem through aperiodic inexact proximal gradient
steps on it. The proposed approach adaptively adjusts the rank for the BM
decomposition and can provably identify an optimal rank for the BM
decomposition problem automatically in the course of optimization through tools
of manifold identification. BM-Global hence also spends significantly less time
on parameter tuning than existing matrix-factorization methods, which require
an exhaustive search for finding this optimal rank. Extensive experiments on
real-world large-scale problems of recommendation systems, regularized kernel
estimation, and molecular conformation confirm that BM-Global can indeed
effectively escape spurious local minima at which existing BM approaches are
stuck, and is an order of magnitude faster than state-of-the-art algorithms for low-rank
matrix optimization problems involving a nuclear-norm regularizer. Based on
this research, we have released an open-source package of the proposed
BM-Global at https://www.github.com/leepei/BM-Global/.
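The BM decomposition rests on a standard identity: the nuclear norm of X equals the minimum of (||U||_F^2 + ||V||_F^2) / 2 over all factorizations X = U V^T, and balanced SVD factors attain that minimum. The sketch below checks this numerically with NumPy. It illustrates the identity only, not the BM-Global algorithm, and the function names are hypothetical.

```python
import numpy as np


def balanced_factors(x, rank):
    """Return U, V with U @ V.T equal to the rank-`rank` truncation of x, splitting singular values evenly."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    root = np.sqrt(s[:rank])
    return u[:, :rank] * root, vt[:rank].T * root


def bm_penalty(u, v):
    """Smooth surrogate for the nuclear norm used in the factored (BM) form."""
    return 0.5 * (np.sum(u ** 2) + np.sum(v ** 2))
```

For a matrix of rank at most `rank`, `bm_penalty(*balanced_factors(x, rank))` matches `np.linalg.norm(x, "nuc")` up to floating-point error.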
[COMMENTS]
52 pages, including 15 pages of appendices
[LINK]
http://arxiv.org/abs/2204.14067v3
[DATE]
2024-12-17 15:39:42+08:00
[CATEGORIES]
cs.LG
The Bandit Whisperer: Communication Learning for Restless Bandits
[AUTHORS]
Yunfan Zhao, Tonghan Wang, Dheeraj Nagaraj, Aparna Taneja, Milind Tambe
[ABSTRACT]
Applying Reinforcement Learning (RL) to Restless Multi-Arm Bandits (RMABs)
offers a promising avenue for addressing allocation problems with resource
constraints and temporal dynamics. However, classic RMAB models largely
overlook the challenges of (systematic) data errors - a common occurrence in
real-world scenarios due to factors like varying data collection protocols and
intentional noise for differential privacy. We demonstrate that conventional RL
algorithms used to train RMABs can struggle to perform well in such settings.
To solve this problem, we propose the first communication learning approach in
RMABs, where we study which arms, when involved in communication, are most
effective in mitigating the influence of such systematic data errors. In our
setup, the arms receive Q-function parameters from similar arms as messages to
guide behavioral policies, steering Q-function updates. We learn communication
strategies by considering the joint utility of messages across all pairs of
arms and using a Q-network architecture that decomposes the joint utility. Both
theoretical and empirical evidence validate the effectiveness of our method in
significantly improving RMAB performance across diverse problems.
[LINK]
http://arxiv.org/abs/2408.05686v2
[DATE]
2024-12-17 15:36:47+08:00
[CATEGORIES]
cs.LG
Beyond Scaleup: Knowledge-aware Parsimony Learning from Deep Networks
[AUTHORS]
Quanming Yao, Yongqi Zhang, Yaqing Wang, Nan Yin, James Kwok, Qiang Yang
[ABSTRACT]
The brute-force scaleup of training datasets, learnable parameters, and
computation power has become a prevalent strategy for developing more robust
learning models. However, due to bottlenecks in data, computation, and trust,
the sustainability of this strategy is a serious concern. In this paper, we
attempt to address this issue in a parsimonious manner (i.e., achieving greater
potential with simpler models). The key is to drive models using
domain-specific knowledge, such as symbols, logic, and formulas, instead of
purely relying on scaleup. This approach allows us to build a framework that
uses this knowledge as “building blocks” to achieve parsimony in model design,
training, and interpretation. Empirical results show that our methods surpass
those that typically follow the scaling law. We also demonstrate our framework
in AI for science, specifically in the problem of drug-drug interaction
prediction. We hope our research can foster more diverse technical roadmaps in
the era of foundation models.
[COMMENTS]
Accepted to AI Magazine
[LINK]
http://arxiv.org/abs/2407.00478v3
[DATE]
2024-12-17 15:30:46+08:00
[CATEGORIES]
cs.LG
Data Collaboration Analysis with Orthogonal Basis Alignment
[AUTHORS]
Keiyu Nosaka, Yuichi Takano, Akiko Yoshise
[ABSTRACT]
The Data Collaboration (DC) framework provides a privacy-preserving solution
for multi-source data fusion, enabling the joint analysis of data from multiple
sources to achieve enhanced insights. It utilizes linear transformations with
secretly selected bases to ensure privacy guarantees through non-iterative
communication. Despite its strengths, the DC framework often encounters
performance instability due to theoretical challenges in aligning the bases
used for mapping raw data. This study addresses these challenges by
establishing a rigorous theoretical foundation for basis alignment within the
DC framework, formulating it as an optimization problem over orthogonal
matrices. Under specific assumptions, we demonstrate that this problem can be
reduced to the Orthogonal Procrustes Problem, which has a well-known analytical
solution. Extensive empirical evaluations across diverse datasets reveal that
the proposed alignment method significantly enhances model performance and
computational efficiency, outperforming existing approaches. Additionally, it
demonstrates robustness across varying levels of differential privacy, thus
enabling practical and reliable implementations of the DC framework.
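The Orthogonal Procrustes Problem mentioned above asks for the orthogonal matrix Q that minimizes ||A Q - B||_F, and its well-known solution is Q = U V^T, where U S V^T is the SVD of A^T B. A minimal NumPy sketch follows. The matrices `a` and `b` stand in for generic data and are not the DC framework's intermediate representations.

```python
import numpy as np


def orthogonal_procrustes(a, b):
    """Return the orthogonal matrix Q minimizing the Frobenius norm of a @ Q - b."""
    u, _, vt = np.linalg.svd(a.T @ b)
    return u @ vt
```

When `b` is exactly `a` times an orthogonal matrix and `a` has full column rank, the function recovers that matrix.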
[COMMENTS]
28 pages
[LINK]
http://arxiv.org/abs/2403.02780v3
[DATE]
2024-12-17 15:23:04+08:00
[CATEGORIES]
cs.LG
An Advantage-based Optimization Method for Reinforcement Learning in Large Action Space
[AUTHORS]
Hai Lin, Cheng Huang, Zhihong Chen
[ABSTRACT]
Reinforcement learning tasks in real-world scenarios often involve large,
high-dimensional action spaces, leading to challenges such as convergence
difficulties, instability, and high computational complexity. It is widely
acknowledged that traditional value-based reinforcement learning algorithms
struggle to address these issues effectively. A prevalent approach involves
generating independent sub-actions within each dimension of the action space.
However, this method introduces bias, hindering the learning of optimal
policies. In this paper, we propose an advantage-based optimization method and
an algorithm named Advantage Branching Dueling Q-network (ABQ). ABQ
incorporates a baseline mechanism to tune the action value of each dimension,
leveraging the advantage relationship across different sub-actions. With this
approach, the learned policy can be optimized for each dimension. Empirical
results demonstrate that ABQ outperforms BDQ, achieving 3%, 171%, and 84% more
cumulative rewards in HalfCheetah, Ant, and Humanoid environments,
respectively. Furthermore, ABQ exhibits competitive performance when compared
against two continuous action benchmark algorithms, DDPG and TD3.
[LINK]
http://arxiv.org/abs/2412.12605v1
[DATE]
2024-12-17 15:04:39+08:00
[CATEGORIES]
cs.LG
Distribution-Free Uncertainty Quantification in Mechanical Ventilation Treatment: A Conformal Deep Q-Learning Framework
[AUTHORS]
Niloufar Eghbali, Tuka Alhanai, Mohammad M. Ghassemi
[ABSTRACT]
Mechanical Ventilation (MV) is a critical life-support intervention in
intensive care units (ICUs). However, optimal ventilator settings are
challenging to determine because of the complexity of balancing
patient-specific physiological needs with the risks of adverse outcomes that
impact morbidity, mortality, and healthcare costs. This study introduces
ConformalDQN, a novel distribution-free conformal deep Q-learning approach for
optimizing mechanical ventilation in intensive care units. By integrating
conformal prediction with deep reinforcement learning, our method provides
reliable uncertainty quantification, addressing the challenges of Q-value
overestimation and out-of-distribution actions in offline settings. We trained
and evaluated our model using ICU patient records from the MIMIC-IV database.
ConformalDQN extends the Double DQN architecture with a conformal predictor and
employs a composite loss function that balances Q-learning with well-calibrated
probability estimation. This enables uncertainty-aware action selection,
allowing the model to avoid potentially harmful actions in unfamiliar states
and handle distribution shifts by being more conservative in
out-of-distribution scenarios. Evaluation against baseline models, including
physician policies, policy constraint methods, and behavior cloning,
demonstrates that ConformalDQN consistently makes recommendations within
clinically safe and relevant ranges, outperforming other methods by increasing
the 90-day survival rate. Notably, our approach provides an interpretable
measure of confidence in its decisions, which is crucial for clinical adoption
and potential human-in-the-loop implementations.
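The uncertainty-aware action selection described above can be illustrated with generic split conformal prediction: a threshold is calibrated on held-out nonconformity scores, and only actions whose score falls at or below it are kept. This sketch shows the general technique, not the ConformalDQN architecture or its loss. The score values, `alpha`, and function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Finite-sample conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[k - 1]


def admissible_actions(action_scores, threshold):
    """Keep actions whose nonconformity score does not exceed the calibrated threshold."""
    return [action for action, score in action_scores.items() if score <= threshold]
```

A conservative policy would then choose the highest-valued action among the admissible ones and fall back to a default when none qualifies.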
[LINK]
http://arxiv.org/abs/2412.12597v1
[DATE]
2024-12-17 14:55:20+08:00
[CATEGORIES]
cs.LG
Adaptive pruning-based Newton’s method for distributed learning
[AUTHORS]
Shuzhen Chen, Yuan Yuan, Youming Tao, Tianzhu Wang, Zhipeng Cai, Dongxiao Yu
[ABSTRACT]
Newton’s method leverages curvature information to boost performance, and
thus outperforms first-order methods for distributed learning problems.
However, Newton’s method is not practical in large-scale and heterogeneous
learning environments, due to obstacles such as high computation and
communication costs of the Hessian matrix, sub-model diversity, staleness of
training, and data heterogeneity. To overcome these obstacles, this paper
presents a novel and efficient algorithm named Distributed Adaptive Newton
Learning (DANL), which addresses the drawbacks of Newton’s method by
using a simple Hessian initialization and adaptive allocation of training
regions. The algorithm exhibits remarkable convergence properties, which are
rigorously examined under standard assumptions in stochastic optimization. The
theoretical analysis proves that DANL attains a linear convergence
rate while efficiently adapting to available resources and keeping high
efficiency. Furthermore, DANL shows notable independence from the
condition number of the problem and removes the necessity for complex parameter
tuning. Experiments demonstrate that DANL achieves linear convergence
with efficient communication and strong performance across different datasets.
[LINK]
http://arxiv.org/abs/2308.10154v4
[DATE]
2024-12-17 14:45:36+08:00
[CATEGORIES]
cs.LG
License Plate Detection and Character Recognition Using Deep Learning and Font Evaluation
[AUTHORS]
Zahra Ebrahimi Vargoorani, Ching Yee Suen
[ABSTRACT]
License plate detection (LPD) is essential for traffic management, vehicle
tracking, and law enforcement but faces challenges like variable lighting and
diverse font types, impacting accuracy. Traditionally reliant on image
processing and machine learning, the field is now shifting towards deep
learning for its robust performance in various conditions. Current methods,
however, often require tailoring to specific regional datasets. This paper
proposes a dual deep learning strategy using a Faster R-CNN for detection and a
CNN-RNN model with Connectionist Temporal Classification (CTC) loss and a
MobileNet V3 backbone for recognition. This approach aims to improve model
performance using datasets from Ontario, Quebec, California, and New York
State, achieving a recall rate of 92% on the Centre for Pattern Recognition and
Machine Intelligence (CENPARMI) dataset and 90% on the UFPR-ALPR dataset. It
includes a detailed error analysis to identify the causes of false positives.
Additionally, the research examines the role of font features in license plate
(LP) recognition, analyzing fonts like Driver Gothic, Dreadnought, California
Clarendon, and Zurich Extra Condensed with the OpenALPR system. It discovers
significant performance discrepancies influenced by font characteristics,
offering insights for future LPD system enhancements.
Keywords: Deep Learning, License Plate, Font Evaluation
[COMMENTS]
12 pages, 5 figures. This is the pre-Springer final accepted version.
The final version is published in Springer, Lecture Notes in Computer Science
(LNCS), Volume 14731, 2024. Springer Version of Record
[LINK]
http://arxiv.org/abs/2412.12572v1
[DATE]
2024-12-17 14:03:42+08:00
[CATEGORIES]
cs.LG
CSformer: Combining Channel Independence and Mixing for Robust Multivariate Time Series Forecasting
[AUTHORS]
Haoxin Wang, Yipeng Mo, Kunlan Xiang, Nan Yin, Honghe Dai, Bixiong Li, Songhai Fan, Site Mo
[ABSTRACT]
In the domain of multivariate time series analysis, the concept of channel
independence has been increasingly adopted, demonstrating excellent performance
due to its ability to eliminate noise and the influence of irrelevant
variables. However, such a concept often simplifies the complex interactions
among channels, potentially leading to information loss. To address this
challenge, we propose a strategy of channel independence followed by mixing.
Based on this strategy, we introduce CSformer, a novel framework featuring a
two-stage multiheaded self-attention mechanism. This mechanism is designed to
extract and integrate both channel-specific and sequence-specific information.
Distinctively, CSformer employs parameter sharing to enhance the cooperative
effects between these two types of information. Moreover, our framework
effectively incorporates sequence and channel adapters, significantly improving
the model’s ability to identify important information across various
dimensions. Extensive experiments on several real-world datasets demonstrate
that CSformer achieves state-of-the-art results in terms of overall
performance.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2312.06220v2
[DATE]
2024-12-17 13:39:48+08:00
[CATEGORIES]
cs.LG
Multi-Modal Recommendation Unlearning for Legal, Licensing, and Modality Constraints
[AUTHORS]
Yash Sinha, Murari Mandal, Mohan Kankanhalli
[ABSTRACT]
User data spread across multiple modalities has popularized multi-modal
recommender systems (MMRS). They recommend diverse content such as products,
social media posts, TikTok reels, etc., based on a user-item interaction graph.
With rising data privacy demands, recent methods propose unlearning private
user data from uni-modal recommender systems (RS). However, methods for
unlearning item data related to outdated user preferences, revoked licenses,
and legally requested removals are still largely unexplored.
Previous RS unlearning methods are unsuitable for MMRS due to the
incompatibility of their matrix-based representation with the multi-modal
user-item interaction graph. Moreover, their data partitioning step degrades
performance on each shard due to poor data heterogeneity and requires costly
performance aggregation across shards.
This paper introduces MMRecUn, the first approach known to us for unlearning
in MMRS and unlearning item data. Given a trained RS model, MMRecUn employs a
novel Reverse Bayesian Personalized Ranking (BPR) objective to enable the model
to forget marked data. The reverse BPR attenuates the impact of user-item
interactions within the forget set, while the forward BPR reinforces the
significance of user-item interactions within the retain set. Our experiments
demonstrate that MMRecUn outperforms baseline methods across various unlearning
requests when evaluated on benchmark MMRS datasets. MMRecUn achieves recall
performance improvements of up to 49.85% compared to baseline methods and is up
to $\mathbf{1.3}\times$ faster than the Gold model, which is trained on the retain
set from scratch. MMRecUn offers significant advantages, including superiority
in removing target interactions, preserving retained interactions, and zero
overhead costs compared to previous methods. The code will be released after
review.
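The forward and reverse BPR terms described above can be sketched as follows. The forward term is the standard BPR loss, -log sigmoid(positive score - negative score). The reverse term here simply swaps the two scores on forget-set pairs, which is one plausible reading of the abstract and may differ from the paper's objective. The `alpha` weight and function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bpr_loss(pos_score, neg_score):
    """Standard BPR loss: small when the interacted item outranks the sampled negative."""
    return -log_sigmoid(pos_score - neg_score)


def unlearning_loss(retain_pairs, forget_pairs, alpha=0.5):
    """Forward BPR on retain pairs plus score-swapped (reverse) BPR on forget pairs."""
    forward = sum(bpr_loss(p, n) for p, n in retain_pairs) / len(retain_pairs)
    reverse = sum(bpr_loss(n, p) for p, n in forget_pairs) / len(forget_pairs)
    return alpha * forward + (1.0 - alpha) * reverse
```

Each pair is `(pos_score, neg_score)`. Minimizing the reverse term pushes forgotten interactions below sampled negatives, while the forward term preserves the ranking on retained interactions.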
[COMMENTS]
Accepted at AAAI 2025
[LINK]
http://arxiv.org/abs/2405.15328v2
[DATE]
2024-12-17 13:35:15+08:00
[CATEGORIES]
cs.LG
Full error analysis of the random deep splitting method for nonlinear parabolic PDEs and PIDEs
[AUTHORS]
Ariel Neufeld, Philipp Schmocker, Sizhou Wu
[ABSTRACT]
In this paper, we present a randomized extension of the deep splitting
algorithm introduced in [Beck, Becker, Cheridito, Jentzen, and Neufeld (2021)]
using random neural networks suitable to approximately solve both
high-dimensional nonlinear parabolic PDEs and PIDEs with jumps having
(possibly) infinite activity. We provide a full error analysis of our so-called
random deep splitting method. In particular, we prove that our random deep
splitting method converges to the (unique viscosity) solution of the nonlinear
PDE or PIDE under consideration. Moreover, we empirically analyze our random
deep splitting method by considering several numerical examples including both
nonlinear PDEs and nonlinear PIDEs relevant in the context of pricing of
financial derivatives under default risk. In particular, we empirically
demonstrate in all examples that our random deep splitting method can
approximately solve nonlinear PDEs and PIDEs in 10’000 dimensions within
seconds.
[LINK]
http://arxiv.org/abs/2405.05192v3
[DATE]
2024-12-17 13:27:17+08:00
[CATEGORIES]
cs.LG
Zigzag Diffusion Sampling: Diffusion Models Can Self-Improve via Self-Reflection
[AUTHORS]
Lichen Bai, Shitong Shao, Zikai Zhou, Zipeng Qi, Zhiqiang Xu, Haoyi Xiong, Zeke Xie
[ABSTRACT]
Diffusion models, the most popular generative paradigm so far, can inject
conditional information into the generation path to guide the latent towards
desired directions. However, existing text-to-image diffusion models often fail
to maintain high image quality and high prompt-image alignment for
challenging prompts. To mitigate this issue and enhance existing pretrained
diffusion models, we make three main contributions in this paper. First, we
propose diffusion self-reflection that alternately performs denoising and
inversion and demonstrate that such diffusion self-reflection can leverage the
guidance gap between denoising and inversion to capture prompt-related semantic
information with theoretical and empirical evidence. Second, motivated by
theoretical analysis, we derive Zigzag Diffusion Sampling (Z-Sampling), a novel
self-reflection-based diffusion sampling method that leverages the guidance gap
between denoising and inversion to accumulate semantic information step by step
along the sampling path, leading to improved sampling results. Moreover, as a
plug-and-play method, Z-Sampling can be generally applied to various diffusion
models (e.g., accelerated ones and Transformer-based ones) with very limited
coding and computational costs. Third, our extensive experiments demonstrate
that Z-Sampling can generally and significantly enhance generation quality
across various benchmark datasets, diffusion models, and performance evaluation
metrics. For example, DreamShaper with Z-Sampling can self-improve with the
HPSv2 winning rate up to 94% over the original results. Moreover, Z-Sampling
can further enhance existing diffusion models combined with other orthogonal
methods, including Diffusion-DPO.
[LINK]
http://arxiv.org/abs/2412.10891v2
[DATE]
2024-12-17 13:23:42+08:00
[CATEGORIES]
cs.LG
Stiefel Flow Matching for Moment-Constrained Structure Elucidation
[AUTHORS]
Austin Cheng, Alston Lo, Kin Long Kelvin Lee, Santiago Miret, Alán Aspuru-Guzik
[ABSTRACT]
Molecular structure elucidation is a fundamental step in understanding
chemical phenomena, with applications in identifying molecules in natural
products, lab syntheses, forensic samples, and the interstellar medium. We
consider the task of predicting a molecule’s all-atom 3D structure given only
its molecular formula and moments of inertia, motivated by the ability of
rotational spectroscopy to measure these moments. While existing generative
models can conditionally sample 3D structures with approximately correct
moments, this soft conditioning fails to leverage the many digits of precision
afforded by experimental rotational spectroscopy. To address this, we first
show that the space of $n$-atom point clouds with a fixed set of moments of
inertia is embedded in the Stiefel manifold $\mathrm{St}(n, 4)$. We then
propose Stiefel Flow Matching as a generative model for elucidating 3D
structure under exact moment constraints. Additionally, we learn simpler and
shorter flows by finding approximate solutions for equivariant optimal
transport on the Stiefel manifold. Empirically, enforcing exact moment
constraints allows Stiefel Flow Matching to achieve higher success rates and
faster sampling than Euclidean diffusion models, even on high-dimensional
manifolds corresponding to large molecules in the GEOM dataset.
[LINK]
http://arxiv.org/abs/2412.12540v1
[DATE]
2024-12-17 13:07:10+08:00
[CATEGORIES]
cs.LG
Differential Privacy via Distributionally Robust Optimization
[AUTHORS]
Aras Selvi, Huikang Liu, Wolfram Wiesemann
[ABSTRACT]
In recent years, differential privacy has emerged as the de facto standard
for sharing statistics of datasets while limiting the disclosure of private
information about the involved individuals. This is achieved by randomly
perturbing the statistics to be published, which in turn leads to a
privacy-accuracy trade-off: larger perturbations provide stronger privacy
guarantees, but they result in less accurate statistics that offer lower
utility to the recipients. Of particular interest are therefore optimal
mechanisms that provide the highest accuracy for a pre-selected level of
privacy. To date, work in this area has focused on specifying families of
perturbations a priori and subsequently proving their asymptotic and/or
best-in-class optimality. In this paper, we develop a class of mechanisms that
enjoy non-asymptotic and unconditional optimality guarantees. To this end, we
formulate the mechanism design problem as an infinite-dimensional
distributionally robust optimization problem. We show that the problem affords
a strong dual, and we exploit this duality to develop converging hierarchies of
finite-dimensional upper and lower bounding problems. Our upper (primal) bounds
correspond to implementable perturbations whose suboptimality can be bounded by
our lower (dual) bounds. Both bounding problems can be solved within seconds
via cutting plane techniques that exploit the inherent problem structure. Our
numerical experiments demonstrate that our perturbations can outperform the
previously best results from the literature on artificial as well as standard
benchmark problems.
[COMMENTS]
64 pages (28 main + 4 references + 32 appendices). Further proofs and
details in the GitHub supplements. 5 color figures + 2 grey figures
[LINK]
http://arxiv.org/abs/2304.12681v3
[DATE]
2024-12-17 12:43:33+08:00
[CATEGORIES]
cs.LG
Attack On Prompt: Backdoor Attack in Prompt-Based Continual Learning
[AUTHORS]
Trang Nguyen, Anh Tran, Nhat Ho
[ABSTRACT]
Prompt-based approaches offer a cutting-edge solution to data privacy issues
in continual learning, particularly in scenarios involving multiple data
suppliers where long-term storage of private user data is prohibited. Despite
delivering state-of-the-art performance, its impressive remembering capability
can become a double-edged sword, raising security concerns as it might
inadvertently retain poisoned knowledge injected during learning from private
user data. Following this insight, in this paper, we expose continual learning
to a potential threat: backdoor attack, which drives the model to follow a
desired adversarial target whenever a specific trigger is present while still
performing normally on clean samples. We highlight three critical challenges in
executing backdoor attacks on incremental learners and propose corresponding
solutions: (1) Transferability: We employ a surrogate dataset and
manipulate prompt selection to transfer backdoor knowledge to data from other
suppliers; (2) Resiliency: We simulate static and dynamic states of the
victim to ensure the backdoor trigger remains robust during intense incremental
learning processes; and (3) Authenticity: We apply binary cross-entropy
loss as an anti-cheating factor to prevent the backdoor trigger from devolving
into adversarial noise. Extensive experiments across various benchmark datasets
and continual learners validate our continual backdoor framework, achieving up
to $100\%$ attack success rate, with further ablation studies confirming our
contributions’ effectiveness.
[COMMENTS]
Accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2406.19753v2
[DATE]
2024-12-17 12:41:46+08:00
[CATEGORIES]
cs.LG
PSMGD: Periodic Stochastic Multi-Gradient Descent for Fast Multi-Objective Optimization
[AUTHORS]
Mingjing Xu, Peizhong Ju, Jia Liu, Haibo Yang
[ABSTRACT]
Multi-objective optimization (MOO) lies at the core of many machine learning
(ML) applications that involve multiple, potentially conflicting objectives
(e.g., multi-task learning, multi-objective reinforcement learning, among many
others). Despite the long history of MOO, recent years have witnessed a surge
in interest within the ML community in the development of gradient manipulation
algorithms for MOO, thanks to the availability of gradient information in many
ML problems. However, existing gradient manipulation methods for MOO often
suffer from long training times, primarily due to the need for computing
dynamic weights by solving an additional optimization problem to determine a
common descent direction that can decrease all objectives simultaneously. To
address this challenge, we propose a new and efficient algorithm called
Periodic Stochastic Multi-Gradient Descent (PSMGD) to accelerate MOO. PSMGD is
motivated by the key observation that dynamic weights across objectives exhibit
small changes under minor updates over short intervals during the optimization
process. Consequently, our PSMGD algorithm is designed to periodically compute
these dynamic weights and utilize them repeatedly, thereby effectively
reducing the computational overload. Theoretically, we prove that PSMGD can
achieve state-of-the-art convergence rates for strongly-convex, general convex,
and non-convex functions. Additionally, we introduce a new computational
complexity measure, termed backpropagation complexity, and demonstrate that
PSMGD could achieve an objective-independent backpropagation complexity.
Through extensive experiments, we verify that PSMGD can provide comparable or
superior performance to state-of-the-art MOO algorithms while significantly
reducing training time.
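The idea of computing the dynamic weights only periodically and reusing them in between can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it uses two objectives only, deterministic gradients of toy quadratics instead of stochastic ones, and the closed-form two-gradient min-norm weight, which the abstract does not specify. The names `min_norm_weight`, `toy_grads`, and `periodic_descent` are hypothetical.

```python
def min_norm_weight(g1, g2):
    """Weight a in [0, 1] minimising ||a*g1 + (1-a)*g2||^2 (two objectives)."""
    denom = sum((u - v) ** 2 for u, v in zip(g1, g2))
    if denom == 0.0:
        return 0.5  # identical gradients: every weight gives the same direction
    a = sum((v - u) * v for u, v in zip(g1, g2)) / denom
    return min(1.0, max(0.0, a))  # clip to the simplex


def toy_grads(x):
    """Gradients of two toy objectives (an assumption for illustration):
    f1 = (x0 - 1)^2 + x1^2 and f2 = (x0 + 1)^2 + x1^2."""
    g1 = [2 * (x[0] - 1), 2 * x[1]]
    g2 = [2 * (x[0] + 1), 2 * x[1]]
    return g1, g2


def periodic_descent(x, steps=50, period=5, lr=0.1):
    """Gradient descent that recomputes the weight only every `period` steps."""
    weight_updates = 0
    a = 0.5
    for t in range(steps):
        g1, g2 = toy_grads(x)
        if t % period == 0:  # the extra optimisation problem is solved here only
            a = min_norm_weight(g1, g2)
            weight_updates += 1
        # In between, the stale weight is reused to form the descent direction.
        x = [xi - lr * (a * u + (1 - a) * v) for xi, u, v in zip(x, g1, g2)]
    return x, weight_updates


final_x, n_updates = periodic_descent([3.0, 1.0])
```

With `steps=50` and `period=5`, the weight is recomputed 10 times instead of 50, which is the source of the saving the abstract describes; how large `period` can be without hurting convergence is exactly what the paper's analysis addresses.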
[COMMENTS]
Accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2412.10961v2
[DATE]
2024-12-17 12:25:55+08:00
[CATEGORIES]
cs.LG
Eliminating Domain Bias for Federated Learning in Representation Space
[AUTHORS]
Jianqing Zhang, Yang Hua, Jian Cao, Hao Wang, Tao Song, Zhengui Xue, Ruhui Ma, Haibing Guan
[COMMENTS]
Accepted by NeurIPS 2023, 24 pages
[LINK]
http://arxiv.org/abs/2311.14975v2
[DATE]
2024-12-17 12:17:00+08:00
[CATEGORIES]
cs.LG
Physics-model-guided Worst-case Sampling for Safe Reinforcement Learning
[AUTHORS]
Hongpeng Cao, Yanbing Mao, Lui Sha, Marco Caccamo
[ABSTRACT]
Real-world accidents in learning-enabled cyber-physical systems (CPS)
frequently occur in challenging corner cases. During the training of a deep
reinforcement learning (DRL) policy,
the standard setup for training conditions is either fixed at a single initial
condition or uniformly sampled from the admissible state space. This setup
often overlooks the challenging but safety-critical corner cases. To bridge
this gap, this paper proposes a physics-model-guided worst-case sampling
strategy for training safe policies that can handle safety-critical cases
toward guaranteed safety. Furthermore, we integrate the proposed worst-case
sampling strategy into the physics-regulated deep reinforcement learning
(Phy-DRL) framework to build a more data-efficient and safe learning algorithm
for safety-critical CPS. We validate the proposed training strategy with
Phy-DRL through extensive experiments on a simulated cart-pole system, a 2D
quadrotor, a simulated and a real quadruped robot, showing remarkably improved
sampling efficiency to learn more robust safe policies.
[COMMENTS]
under review
[LINK]
http://arxiv.org/abs/2412.13224v1
[DATE]
2024-12-17 12:13:06+08:00
[CATEGORIES]
cs.LG
Enhanced Momentum with Momentum Transformers
[AUTHORS]
Max Mason, Waasi A Jagirdar, David Huang, Rahul Murugan
[ABSTRACT]
The primary objective of this research is to build a Momentum Transformer
that is expected to outperform benchmark time-series momentum and
mean-reversion trading strategies. We extend the ideas introduced in the paper
"Trading with the Momentum Transformer: An Intelligent and Interpretable
Architecture" to equities, as the original paper primarily builds upon
futures and equity indices. Unlike conventional Long Short-Term Memory (LSTM)
models, which operate sequentially and are optimized for processing local
patterns, an attention mechanism equips our architecture with direct access to
all prior time steps in the training window. This hybrid design, combining
attention with an LSTM, enables the model to capture long-term dependencies,
enhance performance in scenarios accounting for transaction costs, and
seamlessly adapt to evolving market conditions, such as those witnessed during
the COVID-19 pandemic. We average 4.14% returns, which is similar to the original
paper’s results. Our Sharpe ratio is lower, at an average of 1.12, due to much higher
volatility, which may be due to stocks being inherently more volatile than
futures and indices.
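The link between higher volatility and a lower Sharpe ratio can be illustrated with a small sketch. The abstract does not say how the ratio was computed, so the annualisation by the square root of 252 trading days, the zero risk-free rate, and the function name `sharpe_ratio` are all assumptions, and the return series are made up for illustration.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualised Sharpe ratio: mean excess return over its sample std dev."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)


# Two made-up daily return series with the same mean but different volatility.
calm = [0.01, 0.02, 0.01, 0.02]
wild = [0.05, -0.02, 0.06, -0.03]

calm_sharpe = sharpe_ratio(calm)
wild_sharpe = sharpe_ratio(wild)
```

Both series average 1.5% per period, so the more volatile one ends up with the lower ratio purely because of its larger denominator.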
[COMMENTS]
7 pages, 5 figures
[LINK]
http://arxiv.org/abs/2412.12516v1
[DATE]
2024-12-17 12:11:30+08:00
[CATEGORIES]
cs.LG
RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement
[AUTHORS]
Junjie Lin, Jian Zhao, Lin Liu, Yue Deng, Youpeng Zhao, Lanxiao Huang, Xia Lin, Wengang Zhou, Houqiang Li
[ABSTRACT]
Traditionally, AI development for two-player zero-sum games has relied on two
primary techniques: decision trees and reinforcement learning (RL). A common
approach involves using a fixed decision tree as one player’s strategy while
training an RL agent as the opponent to identify vulnerabilities in the
decision tree, thereby improving its strategic strength iteratively. However,
this process often requires significant human intervention to refine the
decision tree after identifying its weaknesses, resulting in inefficiencies and
hindering full automation of the strategy enhancement process. Fortunately, the
advent of Large Language Models (LLMs) offers a transformative opportunity to
automate the process. We propose RL-LLM-DT, an automatic decision tree
generation method based on RL Evaluation and LLM Enhancement. Given an initial
decision tree, the method involves two important iterative steps. Response
Policy Search: RL is used to discover counter-strategies targeting the decision
tree. Policy Improvement: LLMs analyze failure scenarios and generate improved
decision tree code. In our method, RL focuses on finding the decision tree’s
flaws while the LLM is prompted to generate an improved version of the decision
tree. The iterative refinement process terminates when RL cannot find any flaw
in the tree or the LLM fails to improve the tree. To evaluate the effectiveness of
this integrated approach, we conducted experiments in a curling game. After
iterative refinements, our curling AI based on the decision tree ranks first on
the Jidi platform among 34 curling AIs in total, which demonstrates that LLMs
can significantly enhance the robustness and adaptability of decision trees,
representing a substantial advancement in the field of Game AI. Our code is
available at https://github.com/Linjunjie99/RL-LLM-DT.
[COMMENTS]
Length: 10 pages. Figures: 10 figures. Additional Notes: In this paper,
we have introduced a novel hybrid approach which leverages the strengths of
both RL and LLMs to iteratively refine decision tree tactics, enhancing
their performance and adaptability
[LINK]
http://arxiv.org/abs/2412.11417v2
[DATE]
2024-12-17 12:04:12+08:00
[CATEGORIES]
cs.LG
Interpreting GNN-based IDS Detections Using Provenance Graph Structural Features
[AUTHORS]
Kunal Mukherjee, Joshua Wiedemeier, Tianhao Wang, Muhyun Kim, Feng Chen, Murat Kantarcioglu, Kangkook Jee
[ABSTRACT]
Advanced cyber threats (e.g., Fileless Malware and Advanced Persistent Threat
(APT)) have driven the adoption of provenance-based security solutions. These
solutions employ Machine Learning (ML) models for behavioral modeling and
critical security tasks such as malware and anomaly detection. However, the
opacity of ML-based security models limits their broader adoption, as the lack
of transparency in their decision-making processes restricts explainability and
verifiability. We tailored our solution towards Graph Neural Network
(GNN)-based security solutions since recent studies employ GNNs to
comprehensively digest system provenance graphs for security critical tasks.
To enhance the explainability of GNN-based security models, we introduce
PROVEXPLAINER, a framework offering instance-level security-aware explanations
using an interpretable surrogate model. PROVEXPLAINER’s interpretable feature
space consists of discriminant subgraph patterns and graph structural features,
which can be directly mapped to the system provenance problem space, making the
explanations human understandable. By considering prominent GNN architectures
(e.g., GAT and GraphSAGE) for anomaly detection tasks, we show how
PROVEXPLAINER synergizes with current state-of-the-art (SOTA) GNN explainers to
deliver domain and instance-specific explanations. We measure the explanation
quality using the fidelity+/fidelity- metric as used by traditional GNN
explanation literature, and we incorporate the precision/recall metric where we
consider the accuracy of the explanation against the ground truth. On malware
and APT datasets, PROVEXPLAINER achieves up to 29%/27%/25% higher fidelity+,
precision and recall, and 12% lower fidelity- respectively, compared to SOTA
GNN explainers.
[LINK]
http://arxiv.org/abs/2306.00934v5
[DATE]
2024-12-17 11:59:21+08:00
[CATEGORIES]
cs.LG
A Step-by-step Introduction to the Implementation of Automatic Differentiation
[AUTHORS]
Yu-Hsueh Fang, He-Zhe Lin, Jie-Jyun Liu, Chih-Jen Lin
[ABSTRACT]
Automatic differentiation is a key component in deep learning. This topic is
well studied and excellent surveys such as Baydin et al. (2018) have been
available to clearly describe the basic concepts. Further, sophisticated
implementations of automatic differentiation are now an important part of
popular deep learning frameworks. However, it is difficult, if not impossible,
to directly teach students the implementation of existing systems due to the
complexity. On the other hand, if the teaching stops at the basic concept,
students fail to sense the realization of an implementation. For example, we
often mention the computational graph in teaching automatic differentiation,
but students wonder how to implement and use it. In this document, we partially
fill the gap by giving a step-by-step introduction to implementing a simple
automatic differentiation system. We streamline the mathematical concepts and
the implementation. Further, we give the motivation behind each implementation
detail, so the whole setting becomes very natural.
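To give a feel for what implementing and using a computational graph involves, here is a minimal reverse-mode sketch. It is not the system built in the document: the class name `Var`, the function `backward`, and the restriction to scalar addition and multiplication are assumptions chosen to keep the example short.

```python
class Var:
    """A scalar node in a computational graph."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, local_derivative) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    order, seen = [], set()

    def visit(node):  # depth-first post-order: inputs are listed before outputs
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule per edge.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


x, y = Var(2.0), Var(3.0)
z = x * y + x           # z = x*y + x
backward(z)
print(x.grad, y.grad)   # prints 4.0 2.0 (dz/dx = y + 1, dz/dy = x)
```

Each operation records its inputs together with the local derivative, so the graph is built as a side effect of evaluating the expression, and the backward pass only needs a topological order of the recorded nodes.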
[COMMENTS]
17 pages, 15 figures
[LINK]
http://arxiv.org/abs/2402.16020v2
[DATE]
2024-12-17 11:58:55+08:00
[CATEGORIES]
cs.LG
Hierarchical Federated Learning with Multi-Timescale Gradient Correction
[AUTHORS]
Wenzhi Fang, Dong-Jun Han, Evan Chen, Shiqiang Wang, Christopher G. Brinton
[COMMENTS]
Accepted to NeurIPS 2024
[LINK]
http://arxiv.org/abs/2409.18448v3
[DATE]
2024-12-17 11:16:02+08:00
[CATEGORIES]
cs.LG
Can Large Language Models Improve the Adversarial Robustness of Graph Neural Networks?
[AUTHORS]
Zhongjian Zhang, Xiao Wang, Huichi Zhou, Yue Yu, Mengmei Zhang, Cheng Yang, Chuan Shi
[ABSTRACT]
Graph neural networks (GNNs) are vulnerable to adversarial attacks,
especially for topology perturbations, and many methods that improve the
robustness of GNNs have received considerable attention. Recently, we have
witnessed the significant success of large language models (LLMs), leading many
to explore the great potential of LLMs on GNNs. However, they mainly focus on
improving the performance of GNNs by utilizing LLMs to enhance the node
features. Therefore, we ask: Will the robustness of GNNs also be enhanced with
the powerful understanding and inference capabilities of LLMs? By presenting
the empirical results, we find that although LLMs can improve the
robustness of GNNs, there is still an average decrease of 23.1% in accuracy,
implying that the GNNs remain extremely vulnerable against topology attacks.
Therefore, another question is how to extend the capabilities of LLMs on graph
adversarial robustness. In this paper, we propose an LLM-based robust graph
structure inference framework, LLM4RGNN, which distills the inference
capabilities of GPT-4 into a local LLM for identifying malicious edges and an
LM-based edge predictor for finding missing important edges, so as to recover a
robust graph structure. Extensive experiments demonstrate that LLM4RGNN
consistently improves the robustness across various GNNs. Even in some cases
where the perturbation ratio increases to 40%, the accuracy of GNNs is still
better than that on the clean graph. The source code can be found in
https://github.com/zhongjian-zhang/LLM4RGNN.
[COMMENTS]
Accepted by KDD 2025
[LINK]
http://arxiv.org/abs/2408.08685v2
[DATE]
2024-12-17 11:01:23+08:00
[CATEGORIES]
cs.LG
Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models
[AUTHORS]
Jie Chen, Xintian Han, Yu Ma, Xun Zhou, Liang Xiang
[ABSTRACT]
Automatic code generation has been a longstanding research topic. With the
advancement of general-purpose large language models (LLMs), the ability to
code stands out as one important measure of the model’s reasoning performance.
Usually, a two-stage training paradigm is implemented to obtain a Code LLM,
namely pretraining and fine-tuning. Within fine-tuning, supervised
fine-tuning (SFT) and reinforcement learning (RL) are often used to improve
the model’s zero-shot ability. A large body of work has been conducted to
improve the model’s performance on code-related benchmarks with either
modifications to the algorithm or refinement of the dataset. However, we still
lack a deep insight into the correlation between SFT and RL. For instance, what
kind of dataset should be used to ensure generalization, and what happens if we
abandon the SFT phase in fine-tuning? In this work, we make an attempt to understand
the correlation between SFT and RL. To facilitate our research, we manually
craft 100 basis Python functions, called atomic functions, and then a
synthesizing pipeline is deployed to create a large number of synthetic
functions on top of the atomic ones. In this manner, we ensure that the train
and test sets remain distinct, preventing data contamination. Through a
comprehensive ablation study, we find: (1) Both atomic and synthetic functions
are indispensable for SFT’s generalization, and only a handful of synthetic
functions are adequate; (2) Through RL, SFT’s generalization to the target
domain can be greatly enhanced, even with the same training prompts; (3)
Training RL from scratch can alleviate the over-fitting issue introduced in the
SFT phase.
[LINK]
http://arxiv.org/abs/2406.10305v2
[DATE]
2024-12-17 10:53:54+08:00
[CATEGORIES]
cs.LG
Echo: Simulating Distributed Training At Scale
[AUTHORS]
Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, Hong Xu
[ABSTRACT]
Simulation offers unique values for both enumeration and extrapolation
purposes, and is becoming increasingly important for managing the massive
machine learning (ML) clusters and large-scale distributed training jobs. In
this paper, we build Echo to tackle three key challenges in large-scale
training simulation: (1) tracing the runtime training workloads at each device
in an ex-situ fashion so we can use a single device to obtain the actual
execution graphs of 1K-GPU training, (2) accurately estimating the collective
communication without high overheads of discrete-event based network
simulation, and (3) accounting for the interference-induced computation
slowdown from overlapping communication and computation kernels on the same
device. Echo delivers on average 8% error in training step – roughly 3x lower
than state-of-the-art simulators – for GPT-175B on a 96-GPU H800 cluster with
3D parallelism on Megatron-LM under 2 minutes.
[LINK]
http://arxiv.org/abs/2412.12487v1
[DATE]
2024-12-17 10:44:35+08:00
[CATEGORIES]
cs.LG
Evolutionary Optimization for Designing Variational Quantum Circuits with High Model Capacity
[AUTHORS]
Samuel Yen-Chi Chen
[ABSTRACT]
Recent advancements in quantum computing (QC) and machine learning (ML) have
garnered significant attention, leading to substantial efforts toward the
development of quantum machine learning (QML) algorithms to address a variety
of complex challenges. The design of high-performance QML models, however,
requires expert-level knowledge, posing a significant barrier to the widespread
adoption of QML. Key challenges include the design of data encoding mechanisms
and parameterized quantum circuits, both of which critically impact the
generalization capabilities of QML models. We propose a novel method that
encodes quantum circuit architecture information to enable the evolution of
quantum circuit designs. In this approach, the fitness function is based on the
effective dimension, allowing for the optimization of quantum circuits towards
higher model capacity. Through numerical simulations, we demonstrate that the
proposed method is capable of discovering variational quantum circuit
architectures that offer improved learning capabilities, thereby enhancing the
overall performance of QML models for complex tasks.
[COMMENTS]
Accepted by IEEE Symposium Series on Computational Intelligence -
IEEE SSCI 2025
[LINK]
http://arxiv.org/abs/2412.12484v1
[DATE]
2024-12-17 10:40:35+08:00
[CATEGORIES]
cs.LG
Subversion Strategy Eval: Evaluating AI’s stateless strategic capabilities against control protocols
[AUTHORS]
Alex Mallen, Charlie Griffin, Alessandro Abate, Buck Shlegeris
[ABSTRACT]
AI control protocols are plans for usefully deploying AI systems in a way
that is safe, even if the AI intends to subvert the protocol. Previous work
evaluated protocols by subverting them with a human-AI red team, where an AI
follows the human-written strategy. This paper investigates how well AI systems
can generate and act on strategies for subverting control protocols whilst
operating without private memory. Usually, decision-makers only need the
ability to generate optimal plans. However, to subvert a protocol, an AI system
must be able to reliably generate optimal plans in each context; coordinate
plans with other instances of itself without communicating; and take actions
with well-calibrated probabilities. We develop evaluations for a model’s
ability to generate subversion strategies and develop affordance experiments
that help isolate component capabilities. Our initial evaluations with
Claude-3.5-sonnet highlight reasons why statelessly strategizing to subvert
control protocols might be hard. We hope that our work can serve as a starting
point for evaluations aimed at assessing whether a model is capable of
strategizing to subvert control protocols.
[COMMENTS]
Accepted at AAAI datasafe workshop
[LINK]
http://arxiv.org/abs/2412.12480v1
[DATE]
2024-12-17 10:33:45+08:00
[CATEGORIES]
cs.LG
A Consolidated Volatility Prediction with Back Propagation Neural Network and Genetic Algorithm
[AUTHORS]
Zong Ke, Jingyu Xu, Zizhou Zhang, Yu Cheng, Wenjun Wu
[ABSTRACT]
This paper provides a unique approach with AI algorithms to predict emerging
stock markets’ volatility. Traditionally, stock volatility is derived from
historical volatility, Monte Carlo simulation, and implied volatility. In
this paper, the writer designs a consolidated model with a back-propagation
neural network and a genetic algorithm to predict the future volatility of
emerging stock markets and finds that the results are quite accurate with low errors.
[COMMENTS]
6 pages, 7 figures, 1 table, The paper will be published by IEEE on
conference: 2024 3rd International Conference on Image Processing, Computer
Vision and Machine Learning (ICICML 2024) (V2)
[LINK]
http://arxiv.org/abs/2412.07223v2
[DATE]
2024-12-17 10:32:05+08:00
[CATEGORIES]
cs.LG
A Method for Enhancing Generalization of Adam by Multiple Integrations
[AUTHORS]
Long Jin, Han Nong, Liangming Chen, Zhenming Su
[ABSTRACT]
The insufficient generalization of adaptive moment estimation (Adam) has
hindered its broader application. Recent studies have shown that flat minima in
loss landscapes are highly associated with improved generalization. Inspired by
the filtering effect of integration operations on high-frequency signals, we
propose multiple integral Adam (MIAdam), a novel optimizer that integrates a
multiple integral term into Adam. This multiple integral term effectively
filters out sharp minima encountered during optimization, guiding the optimizer
towards flatter regions and thereby enhancing generalization capability. We
provide a theoretical explanation for the improvement in generalization through
the diffusion theory framework and analyze the impact of the multiple integral
term on the optimizer’s convergence. Experimental results demonstrate that
MIAdam not only enhances generalization and robustness against label noise but
also maintains the rapid convergence characteristic of Adam, outperforming Adam
and its variants in state-of-the-art benchmarks.
[LINK]
http://arxiv.org/abs/2412.12473v1
[DATE]
2024-12-17 10:15:38+08:00
[CATEGORIES]
cs.LG
Empathy Detection from Text, Audiovisual, Audio or Physiological Signals: A Systematic Review of Task Formulations and Machine Learning Methods
[AUTHORS]
Md Rakibul Hasan, Md Zakir Hossain, Shreya Ghosh, Aneesh Krishna, Tom Gedeon
[ABSTRACT]
Empathy indicates an individual’s ability to understand others. Over the past
few years, empathy has drawn attention from various disciplines, including but
not limited to Affective Computing, Cognitive Science, and Psychology.
Detecting empathy has potential applications in society, healthcare and
education. Despite being a broad and overlapping topic, the avenue of empathy
detection leveraging Machine Learning remains underexplored from a systematic
literature review perspective. We collected 829 papers from 10 well-known
databases, systematically screened them and analysed the final 62 papers. Our
analyses reveal several prominent task formulations (including empathy on
localised utterances or overall expressions, unidirectional or parallel
empathy, and emotional contagion) in monadic, dyadic and group interactions.
Empathy detection methods are summarised based on four input modalities
(text, audiovisual, audio and physiological signals), thereby presenting
modality-specific network architecture design protocols. We discuss challenges,
research gaps and potential applications in the Affective Computing-based
empathy domain, which can facilitate new avenues of exploration. We further
list the publicly available datasets and code. This paper, therefore,
provides a structured overview of recent advancements and remaining challenges
towards developing a robust empathy detection system that could meaningfully
contribute to enhancing human well-being.
[COMMENTS]
This work has been submitted to the IEEE for possible publication
[LINK]
http://arxiv.org/abs/2311.00721v3
[DATE]
2024-12-17 10:12:54+08:00
[CATEGORIES]
cs.LG
Optimal Control Operator Perspective and a Neural Adaptive Spectral Method
[AUTHORS]
Mingquan Feng, Zhijie Chen, Yixin Huang, Yizhou Liu, Junchi Yan
[ABSTRACT]
Optimal control problems (OCPs) involve finding a control function for a
dynamical system such that a cost functional is optimized. They are central to
physical systems in both academia and industry. In this paper, we propose a
novel instance-solution control operator perspective, which solves OCPs in a
one-shot manner without direct dependence on the explicit expression of
dynamics or iterative optimization processes. The control operator is
implemented by a new neural operator architecture named Neural Adaptive
Spectral Method (NASM), a generalization of classical spectral methods. We
theoretically validate the perspective and architecture by presenting the
approximation error bounds of NASM for the control operator. Experiments on
synthetic environments and a real-world dataset verify the effectiveness and
efficiency of our approach, including substantial speedup in running time, and
high-quality in- and out-of-distribution generalization.
[COMMENTS]
Accepted for publication at AAAI’25. Extended version with full
appendix, 22 pages
[LINK]
http://arxiv.org/abs/2412.12469v1
[DATE]
2024-12-17 10:06:34+08:00
[CATEGORIES]
cs.LG
Adversarially robust generalization theory via Jacobian regularization for deep neural networks
[AUTHORS]
Dongya Wu, Xin Li
[ABSTRACT]
Powerful deep neural networks are vulnerable to adversarial attacks. To
obtain adversarially robust models, researchers have separately developed
adversarial training and Jacobian regularization techniques. There are abundant
theoretical and empirical studies for adversarial training, but theoretical
foundations for Jacobian regularization are still lacking. In this study, we
show that Jacobian regularization is closely related to adversarial training in
that $\ell_{2}$ or $\ell_{1}$ Jacobian regularized loss serves as an
approximate upper bound on the adversarially robust loss under $\ell_{2}$ or
$\ell_{\infty}$ adversarial attack respectively. Further, we establish the
robust generalization gap for Jacobian regularized risk minimizer via bounding
the Rademacher complexity of both the standard loss function class and Jacobian
regularization function class. Our theoretical results indicate that the norms
of Jacobian are related to both standard and robust generalization. We also
perform experiments on MNIST data classification to demonstrate that Jacobian
regularized risk minimization indeed serves as a surrogate for adversarially
robust risk minimization, and that reducing the norms of Jacobian can improve
both standard and robust generalization. This study advances both theoretical
and empirical understanding of adversarially robust generalization via
Jacobian regularization.
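The pairing of norms stated above ($\ell_{2}$ with $\ell_{2}$, and $\ell_{1}$ with $\ell_{\infty}$) matches what a first-order expansion of a scalar loss suggests. The following is only that standard heuristic, not the paper's theorem; the symbols $\ell$ for the loss, $\delta$ for the perturbation, and $\epsilon$ for the attack radius are notation assumed here.

```latex
% First-order expansion of a scalar loss \ell around the input x:
\[
  \ell(x + \delta) \approx \ell(x) + \nabla_x \ell(x)^{\top} \delta .
\]
% Maximising the linear term over an \ell_p ball yields the dual norm,
% where 1/p + 1/q = 1:
\[
  \max_{\|\delta\|_{p} \le \epsilon} \nabla_x \ell(x)^{\top} \delta
  = \epsilon \, \|\nabla_x \ell(x)\|_{q} .
\]
% p = 2 gives q = 2, and p = \infty gives q = 1.
```

Under this approximation, penalising the $\ell_{q}$ norm of the derivative controls the worst-case loss increase within an $\ell_{p}$ ball, which is the intuition behind treating the regularized loss as an approximate upper bound on the robust loss.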
[LINK]
http://arxiv.org/abs/2412.12449v1
[DATE]
2024-12-17 09:26:42+08:00
[CATEGORIES]
cs.LG
LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers
[AUTHORS]
Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Yanyu Li, Yifan Gong, Kai Zhang, Hao Tan, Jason Kuen, Henghui Ding, Zhihao Shu, Wei Niu, Pu Zhao, Yanzhi Wang, Jiuxiang Gu
[ABSTRACT]
Diffusion Transformers have emerged as the preeminent models for a wide array
of generative tasks, demonstrating superior performance and efficacy across
various applications. The promising results come at the cost of slow inference,
as each denoising step requires running the whole transformer model with a
large number of parameters. In this paper, we show that performing the full
computation of the model at each diffusion step is unnecessary, as some
computations can be skipped by lazily reusing the results of previous steps.
Furthermore, we show that the lower bound of similarity between outputs at
consecutive steps is notably high, and this similarity can be linearly
approximated using the inputs. To verify our demonstrations, we propose
LazyDiT, a lazy learning framework that efficiently leverages cached
results from earlier steps to skip redundant computations. Specifically, we
incorporate lazy learning layers into the model, effectively trained to
maximize laziness, enabling dynamic skipping of redundant computations.
Experimental results show that LazyDiT outperforms the DDIM sampler across
multiple diffusion transformer models at various resolutions. Furthermore, we
implement our method on mobile devices, achieving better performance than DDIM
with similar latency.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.12444v1
[DATE]
2024-12-17 09:12:35+08:00
[CATEGORIES]
cs.LG
Multi-Task Reinforcement Learning for Quadrotors
[AUTHORS]
Jiaxu Xing, Ismail Geles, Yunlong Song, Elie Aljalbout, Davide Scaramuzza
[ABSTRACT]
Reinforcement learning (RL) has shown great effectiveness in quadrotor
control, enabling specialized policies to develop even human-champion-level
performance in single-task scenarios. However, these specialized policies often
struggle with novel tasks, requiring a complete retraining of the policy from
scratch. To address this limitation, this paper presents a novel multi-task
reinforcement learning (MTRL) framework tailored for quadrotor control,
leveraging the shared physical dynamics of the platform to enhance sample
efficiency and task performance. By employing a multi-critic architecture and
shared task encoders, our framework facilitates knowledge transfer across
tasks, enabling a single policy to execute diverse maneuvers, including
high-speed stabilization, velocity tracking, and autonomous racing. Our
experimental results, validated both in simulation and real-world scenarios,
demonstrate that our framework outperforms baseline approaches in terms of
sample efficiency and overall task performance.
[LINK]
http://arxiv.org/abs/2412.12442v1
[DATE]
2024-12-17 09:10:18+08:00
[CATEGORIES]
cs.LG
Numerical Pruning for Efficient Autoregressive Models
[AUTHORS]
Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Jing Liu, Ruiyi Zhang, Ryan A. Rossi, Hao Tan, Tong Yu, Xiang Chen, Yufan Zhou, Tong Sun, Pu Zhao, Yanzhi Wang, Jiuxiang Gu
[ABSTRACT]
Transformers have emerged as the leading architecture in deep learning,
proving to be versatile and highly effective across diverse domains beyond
language and image processing. However, their impressive performance often
incurs high computational costs due to their substantial model size. This paper
focuses on compressing decoder-only transformer-based autoregressive models
through structural weight pruning to improve the model efficiency while
preserving performance for both language and image generation tasks.
Specifically, we propose a training-free pruning method that calculates a
numerical score with Newton’s method for the Attention and MLP modules,
respectively. Besides, we further propose another compensation algorithm to
recover the pruned model for better performance. To verify the effectiveness of
our method, we provide both theoretical support and extensive experiments. Our
experiments show that our method achieves state-of-the-art performance with
reduced memory usage and faster generation speeds on GPUs.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.12441v1
[DATE]
2024-12-17 09:09:23+08:00
[CATEGORIES]
cs.LG
Analyzing Fairness of Classification Machine Learning Model with Structured Dataset
[AUTHORS]
Ahmed Rashed, Abdelkrim Kallich, Mohamed Eltayeb
[ABSTRACT]
Machine learning (ML) algorithms have become integral to decision making in
various domains, including healthcare, finance, education, and law enforcement.
However, concerns about fairness and bias in these systems pose significant
ethical and social challenges. This study investigates the fairness of ML
models applied to structured datasets in classification tasks, highlighting the
potential for biased predictions to perpetuate systemic inequalities. A
publicly available dataset from Kaggle was selected for analysis, offering a
realistic scenario for evaluating fairness in machine learning workflows.
To assess and mitigate biases, three prominent fairness libraries (Fairlearn
by Microsoft, AIF360 by IBM, and the What If Tool by Google) were employed.
These libraries provide robust frameworks for analyzing fairness, offering
tools to evaluate metrics, visualize results, and implement bias mitigation
strategies. The research aims to assess the extent of bias in the ML models,
compare the effectiveness of these libraries, and derive actionable insights
for practitioners.
The findings reveal that each library has unique strengths and limitations in
fairness evaluation and mitigation. By systematically comparing their
capabilities, this study contributes to the growing field of ML fairness by
providing practical guidance for integrating fairness tools into real world
applications. These insights are intended to support the development of more
equitable machine learning systems.
[COMMENTS]
12 pages, 3 tables
[LINK]
http://arxiv.org/abs/2412.09896v2
[DATE]
2024-12-17 08:57:15+08:00
[CATEGORIES]
cs.LG
Generative modeling of protein ensembles guided by crystallographic electron densities
[AUTHORS]
Sai Advaith Maddipatla, Nadav Bojan Sellam, Sanketh Vedula, Ailie Marx, Alex Bronstein
[ABSTRACT]
Proteins are dynamic, adopting ensembles of conformations. The nature of this
conformational heterogeneity is imprinted in the raw electron density
measurements obtained from X-ray crystallography experiments. Fitting an
ensemble of protein structures to these measurements is a challenging,
ill-posed inverse problem. We propose a non-i.i.d. ensemble guidance approach
to solve this problem using existing protein structure generative models and
demonstrate that it accurately recovers complicated multi-modal alternate
protein backbone conformations observed in certain single crystal measurements.
[LINK]
http://arxiv.org/abs/2412.13223v1
[DATE]
2024-12-17 08:31:59+08:00
[CATEGORIES]
cs.LG
Scaling up the Banded Matrix Factorization Mechanism for Differentially Private ML
[AUTHORS]
Ryan McKenna
[LINK]
http://arxiv.org/abs/2405.15913v3
[DATE]
2024-12-17 08:12:40+08:00
[CATEGORIES]
cs.LG
GG-SSMs: Graph-Generating State Space Models
[AUTHORS]
Nikola Zubić, Davide Scaramuzza
[ABSTRACT]
State Space Models (SSMs) are powerful tools for modeling sequential data in
computer vision and time series analysis domains. However, traditional SSMs are
limited by fixed, one-dimensional sequential processing, which restricts their
ability to model non-local interactions in high-dimensional data. While methods
like Mamba and VMamba introduce selective and flexible scanning strategies,
they rely on predetermined paths, which fail to efficiently capture complex
dependencies. We introduce Graph-Generating State Space Models (GG-SSMs), a
novel framework that overcomes these limitations by dynamically constructing
graphs based on feature relationships. Using Chazelle’s Minimum Spanning Tree
algorithm, GG-SSMs adapt to the inherent data structure, enabling robust
feature propagation across dynamically generated graphs and efficiently
modeling complex dependencies. We validate GG-SSMs on 11 diverse datasets,
including event-based eye-tracking, ImageNet classification, optical flow
estimation, and six time series datasets. GG-SSMs achieve state-of-the-art
performance across all tasks, surpassing existing methods by significant
margins. Specifically, GG-SSM attains a top-1 accuracy of 84.9% on ImageNet,
outperforming prior SSMs by 1%, reducing the KITTI-15 error rate to 2.77%, and
improving eye-tracking detection rates by up to 0.33% with fewer parameters.
These results demonstrate that dynamic scanning based on feature relationships
significantly improves SSMs’ representational power and efficiency, offering a
versatile tool for various applications in computer vision and beyond.
[COMMENTS]
11 pages, 7 tables, 2 figures
[LINK]
http://arxiv.org/abs/2412.12423v1
[DATE]
2024-12-17 08:07:29+08:00
[CATEGORIES]
cs.LG
DeepSN: A Sheaf Neural Framework for Influence Maximization
[AUTHORS]
Asela Hevapathige, Qing Wang, Ahad N. Zehmakan
[ABSTRACT]
Influence maximization is a key topic in data mining, with broad applications
in social network analysis and viral marketing. In recent years, researchers
have increasingly turned to machine learning techniques to address this
problem. They have developed methods to learn the underlying diffusion
processes in a data-driven manner, which enhances the generalizability of the
solution, and have designed optimization objectives to identify the optimal
seed set. Nonetheless, two fundamental gaps remain unsolved: (1) Graph Neural
Networks (GNNs) are increasingly used to learn diffusion models, but in their
traditional form, they often fail to capture the complex dynamics of influence
diffusion; (2) designing optimization objectives is challenging due to
combinatorial explosion when solving this problem. To address these challenges,
we propose a novel framework, DeepSN. Our framework employs sheaf neural
diffusion to learn diverse influence patterns in a data-driven, end-to-end
manner, providing enhanced separability in capturing diffusion characteristics.
We also propose an optimization technique that accounts for overlapping
influence between vertices, which helps to reduce the search space and identify
the optimal seed set effectively and efficiently. Finally, we conduct extensive
experiments on both synthetic and real-world datasets to demonstrate the
effectiveness of our framework.
[COMMENTS]
Accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2412.12416v1
[DATE]
2024-12-17 07:49:51+08:00
[CATEGORIES]
cs.LG
Causally Consistent Normalizing Flow
[AUTHORS]
Qingyang Zhou, Kangjie Lu, Meng Xu
[ABSTRACT]
Causal inconsistency arises when the underlying causal graphs captured by
generative models like Normalizing Flows (NFs) are inconsistent with
those specified in causal models like Structural Causal Models (SCMs).
This inconsistency can cause unwanted issues including the unfairness problem.
Prior works to achieve causal consistency inevitably compromise the
expressiveness of their models by disallowing hidden layers. In this work, we
introduce a new approach: Causally Consistent
Normalizing Flow (CCNF). To the best of our knowledge, CCNF
is the first causally consistent generative model that can approximate any
distribution with multiple layers. CCNF relies on two novel constructs: a
sequential representation of SCMs and partial causal transformations. These
constructs allow CCNF to inherently maintain causal consistency without
sacrificing expressiveness. CCNF can handle all forms of causal inference
tasks, including interventions and counterfactuals. Through experiments, we
show that CCNF outperforms current approaches in causal inference. We also
empirically validate the practical utility of CCNF by applying it to real-world
datasets and show how CCNF addresses challenges like unfairness effectively.
[COMMENTS]
extended version of “Causally Consistent Normalizing Flow” accepted
by AAAI25
[LINK]
http://arxiv.org/abs/2412.12401v1
[DATE]
2024-12-17 07:13:17+08:00
[CATEGORIES]
cs.LG
Using machine learning to inform harvest control rule design in complex fishery settings
[AUTHORS]
Felipe Montealegre-Mora, Carl Boettiger, Carl J. Walters, Christopher L. Cahill
[ABSTRACT]
In fishery science, harvest management of size-structured stochastic
populations is a long-standing and difficult problem. Rectilinear precautionary
policies based on biomass and harvesting reference points have now become a
standard approach to this problem. While these standard feedback policies are
adapted from analytical or dynamic programming solutions assuming relatively
simple ecological dynamics, they are often applied to more complicated
ecological settings in the real world. In this paper we explore the problem of
designing harvest control rules for partially observed, age-structured,
spasmodic fish populations using tools from reinforcement learning (RL) and
Bayesian optimization. Our focus is on the case of Walleye fisheries in
Alberta, Canada, whose highly variable recruitment dynamics have perplexed
managers and ecologists. We optimized and evaluated policies using several
complementary performance metrics. The main questions we addressed were: 1. How
do standard policies based on reference points perform relative to numerically
optimized policies? 2. Can an observation of mean fish weight, in addition to
stock biomass, aid policy decisions?
[COMMENTS]
25 pages, 4 figures, 2 tables
[LINK]
http://arxiv.org/abs/2412.12400v1
[DATE]
2024-12-17 07:13:00+08:00
[CATEGORIES]
cs.LG
Optimizing Hyperparameters for Quantum Data Re-Uploaders in Calorimetric Particle Identification
[AUTHORS]
Léa Cassé, Bernhard Pfahringer, Albert Bifet, Frédéric Magniette
[ABSTRACT]
We present an application of a single-qubit Data Re-Uploading (QRU) quantum
model for particle classification in calorimetric experiments. Optimized for
Noisy Intermediate-Scale Quantum (NISQ) devices, this model requires minimal
qubits while delivering strong classification performance. Evaluated on a novel
simulated dataset specific to particle physics, the QRU model achieves high
accuracy in classifying particle types. Through a systematic exploration of
model hyperparameters – such as circuit depth, rotation gates, input
normalization and the number of trainable parameters per input – and training
parameters like batch size, optimizer, loss function and learning rate, we
assess their individual impacts on model accuracy and efficiency. Additionally,
we apply global optimization methods, uncovering hyperparameter correlations
that further enhance performance. Our results indicate that the QRU model
attains significant accuracy with efficient computational costs, underscoring
its potential for practical quantum machine learning applications.
[COMMENTS]
17 pages, 22 figures
[LINK]
http://arxiv.org/abs/2412.12397v1
[DATE]
2024-12-17 07:10:00+08:00
[CATEGORIES]
cs.LG
Quantum Adversarial Machine Learning and Defense Strategies: Challenges and Opportunities
[AUTHORS]
Eric Yocam, Anthony Rizi, Mahesh Kamepalli, Varghese Vaidyan, Yong Wang, Gurcan Comert
[ABSTRACT]
As quantum computing continues to advance, the development of quantum-secure
neural networks is crucial to prevent adversarial attacks. This paper proposes
three quantum-secure design principles: (1) using post-quantum cryptography,
(2) employing quantum-resistant neural network architectures, and (3) ensuring
transparent and accountable development and deployment. These principles are
supported by various quantum strategies, including quantum data anonymization,
quantum-resistant neural networks, and quantum encryption. The paper also
identifies open issues in quantum security, privacy, and trust, and recommends
exploring adaptive adversarial attacks and auto adversarial attacks as future
directions. The proposed design principles and recommendations provide guidance
for developing quantum-secure neural networks, ensuring the integrity and
reliability of machine learning models in the quantum era.
[COMMENTS]
24 pages, 9 figures, 12 tables
[LINK]
http://arxiv.org/abs/2412.12373v1
[DATE]
2024-12-17 06:07:27+08:00
[CATEGORIES]
cs.LG
Priority-Aware Model-Distributed Inference at Edge Networks
[AUTHORS]
Teng Li, Hulya Seferoglu
[ABSTRACT]
Distributed inference techniques can be broadly classified into
data-distributed and model-distributed schemes. In data-distributed inference
(DDI), each worker carries the entire Machine Learning (ML) model but processes
only a subset of the data. However, feeding the data to workers results in high
communication costs, especially when the data is large. An emerging paradigm is
model-distributed inference (MDI), where each worker carries only a subset of
ML layers. In MDI, a source device that has data processes a few layers of ML
model and sends the output to a neighboring device, i.e., offloads the rest of
the layers. This process ends when all layers are processed in a distributed
manner. In this paper, we investigate the design and development of MDI when
multiple data sources co-exist. We consider that each data source has a
different importance and, hence, a priority. We formulate and solve a
priority-aware model allocation optimization problem. Based on the structure of
the optimal solution, we design a practical Priority-Aware Model-Distributed
Inference (PA-MDI) algorithm that determines model allocation and distribution
over devices by taking into account the priorities of different sources.
Experiments were conducted on a real-life testbed of NVIDIA Jetson Xavier and
Nano edge devices as well as in the Colosseum testbed with ResNet-50,
ResNet-56, and GPT-2 models. The experimental results show that PA-MDI performs
priority-aware model allocation successfully while reducing the inference time
as compared to baselines.
[LINK]
http://arxiv.org/abs/2412.12371v1
[DATE]
2024-12-17 06:01:55+08:00
[CATEGORIES]
cs.LG
Residual-INR: Communication Efficient On-Device Learning Using Implicit Neural Representation
[AUTHORS]
Hanqiu Chen, Xuebin Yao, Pradeep Subedi, Cong Hao
[ABSTRACT]
Edge computing is a distributed computing paradigm that collects and
processes data at or near the source of data generation. The on-device learning
at edge relies on device-to-device wireless communication to facilitate
real-time data sharing and collaborative decision-making among multiple
devices. This significantly improves the adaptability of the edge computing
system to the changing environments. However, as the scale of the edge
computing system is getting larger, communication among devices is becoming the
bottleneck because the limited bandwidth of wireless communication leads to
large data transfer latency. To reduce the amount of device-to-device data
transmission and accelerate on-device learning, in this paper, we propose
Residual-INR, a fog computing-based communication-efficient on-device learning
framework by utilizing implicit neural representation (INR) to compress
images/videos into neural network weights. Residual-INR enhances data transfer
efficiency by collecting JPEG images from edge devices, compressing them into
INR format at the fog node, and redistributing them for on-device learning. By
using a smaller INR for full image encoding and a separate object INR for
high-quality object region reconstruction through residual encoding, our
technique can reduce the encoding redundancy while maintaining the object
quality. Residual-INR is a promising solution for edge on-device learning
because it reduces data transmission by up to 5.16x across a network of 10
edge devices. It also facilitates CPU-free accelerated on-device learning,
achieving up to 2.9x speedup without sacrificing accuracy. Our code is
available at: https://github.com/sharc-lab/Residual-INR.
[COMMENTS]
This paper has been accepted by ICCAD 2024
[LINK]
http://arxiv.org/abs/2408.05617v3
[DATE]
2024-12-17 05:35:54+08:00
[CATEGORIES]
cs.LG
Approximation Power of Deep Neural Networks: an explanatory mathematical survey
[AUTHORS]
Owen Davis, Mohammad Motamed
[ABSTRACT]
This survey provides an in-depth and explanatory review of the approximation
properties of deep neural networks, with a focus on feed-forward and residual
architectures. The primary objective is to examine how effectively neural
networks approximate target functions and to identify conditions under which
they outperform traditional approximation methods. Key topics include the
nonlinear, compositional structure of deep networks and the formalization of
neural network tasks as optimization problems in regression and classification
settings. The survey also addresses the training process, emphasizing the role
of stochastic gradient descent and backpropagation in solving these
optimization problems, and highlights practical considerations such as
activation functions, overfitting, and regularization techniques. Additionally,
the survey explores the density of neural networks in the space of continuous
functions, comparing the approximation capabilities of deep ReLU networks with
those of other approximation methods. It discusses recent theoretical
advancements in understanding the expressiveness and limitations of these
networks. A detailed error-complexity analysis is also presented, focusing on
error rates and computational complexity for neural networks with ReLU and
Fourier-type activation functions in the context of bounded target functions
with minimal regularity assumptions. Alongside recent known results, the survey
introduces new findings, offering a valuable resource for understanding the
theoretical foundations of neural network approximation. Concluding remarks and
further reading suggestions are provided.
[COMMENTS]
66 pages, 24 figures
[LINK]
http://arxiv.org/abs/2207.09511v2
[DATE]
2024-12-17 05:06:21+08:00
[CATEGORIES]
cs.LG
Query3D: LLM-Powered Open-Vocabulary Scene Segmentation with Language Embedded 3D Gaussian
[AUTHORS]
Amirhosein Chahe, Lifeng Zhou
[LINK]
http://arxiv.org/abs/2408.03516v2
[DATE]
2024-12-17 04:54:56+08:00
[CATEGORIES]
cs.LG
AutoSciLab: A Self-Driving Laboratory For Interpretable Scientific Discovery
[AUTHORS]
Saaketh Desai, Sadhvikas Addamane, Jeffrey Y. Tsao, Igal Brener, Laura P. Swiler, Remi Dingreville, Prasad P. Iyer
[ABSTRACT]
Advances in robotic control and sensing have propelled the rise of automated
scientific laboratories capable of high-throughput experiments. However,
automated scientific laboratories are currently limited by human intuition in
their ability to efficiently design and interpret experiments in
high-dimensional spaces, throttling scientific discovery. We present
AutoSciLab, a machine learning framework for driving autonomous scientific
experiments, forming a surrogate researcher purposed for scientific discovery
in high-dimensional spaces. AutoSciLab autonomously follows the scientific
method in four steps: (i) generating high-dimensional experiments (x \in R^D)
using a variational autoencoder (ii) selecting optimal experiments by forming
hypotheses using active learning (iii) distilling the experimental results to
discover relevant low-dimensional latent variables (z \in R^d, with d \ll D)
with a ‘directional autoencoder’ and (iv) learning a human interpretable
equation connecting the discovered latent variables with a quantity of interest
(y = f(z)), using a neural network equation learner. We validate the
generalizability of AutoSciLab by rediscovering a) the principles of projectile
motion and b) the phase transitions within the spin-states of the Ising model
(NP-hard problem). Applying our framework to an open-ended nanophotonics
challenge, AutoSciLab uncovers a fundamentally novel method for directing
incoherent light emission that surpasses the current state-of-the-art (Iyer et
al. 2023b, 2020).
[COMMENTS]
Pre-print for paper accepted in AAAI
[LINK]
http://arxiv.org/abs/2412.12347v1
[DATE]
2024-12-17 04:41:46+08:00
[CATEGORIES]
cs.LG
Towards joint graph learning and sampling set selection from data
[AUTHORS]
Shashank N. Sridhara, Eduardo Pavez, Antonio Ortega
[ABSTRACT]
We explore the problem of sampling graph signals in scenarios where the graph
structure is not predefined and must be inferred from data. In this scenario,
existing approaches rely on a two-step process, where a graph is learned first,
followed by sampling. More generally, graph learning and graph signal sampling
have been studied as two independent problems in the literature. This work
provides a foundational step towards jointly optimizing the graph structure and
sampling set. Our main contribution, Vertex Importance Sampling (VIS), is to
show that the sampling set can be effectively determined from the vertex
importance (node weights) obtained from graph learning. We further propose
Vertex Importance Sampling with Repulsion (VISR), a greedy algorithm where
spatially separated “important” nodes are selected to ensure better
reconstruction. Empirical results on simulated data show that sampling using
VIS and VISR leads to competitive reconstruction performance and lower
complexity than the conventional two-step approach of graph learning followed
by graph sampling.
[COMMENTS]
5 pages, 7 figures, IEEE Asilomar Conference on Signals, Systems, and
Computers-2024
[LINK]
http://arxiv.org/abs/2412.09753v2
[DATE]
2024-12-17 04:35:27+08:00
[CATEGORIES]
cs.LG
InfraLib: Enabling Reinforcement Learning and Decision-Making for Large-Scale Infrastructure Management
[AUTHORS]
Pranay Thangeda, Trevor S. Betz, Michael N. Grussing, Melkior Ornik
[ABSTRACT]
Efficient management of infrastructure systems is crucial for economic
stability, sustainability, and public safety. However, infrastructure
sustainment is challenging due to the vast scale of systems, stochastic
deterioration of components, partial observability, and resource constraints.
Decision-making strategies that rely solely on human judgment often result in
suboptimal decisions over large scales and long horizons. While data-driven
approaches like reinforcement learning offer promising solutions, their
application has been limited by the lack of suitable simulation environments.
We present InfraLib, an open-source modular and extensible framework that
enables modeling and analyzing infrastructure management problems with resource
constraints as sequential decision-making problems. The framework implements
hierarchical, stochastic deterioration models, supports realistic partial
observability, and handles practical constraints including cyclical budgets and
component unavailability. InfraLib provides standardized environments for
benchmarking decision-making approaches, along with tools for expert data
collection and policy evaluation. Through case studies on both synthetic
benchmarks and real-world road networks, we demonstrate InfraLib’s ability to
model diverse infrastructure management scenarios while maintaining
computational efficiency at scale.
[COMMENTS]
Updated preprint under active review
[LINK]
http://arxiv.org/abs/2409.03167v2
[DATE]
2024-12-17 04:32:49+08:00
[CATEGORIES]
cs.LG
A Bilevel Optimization Framework for Imbalanced Data Classification
[AUTHORS]
Karen Medlin, Sven Leyffer, Krishnan Raghavan
[ABSTRACT]
Data rebalancing techniques, including oversampling and undersampling, are a
common approach to addressing the challenges of imbalanced data. To tackle
unresolved problems related to both oversampling and undersampling, we propose
a new undersampling approach that: (i) avoids the pitfalls of noise and overlap
caused by synthetic data and (ii) avoids the pitfall of under-fitting caused by
random undersampling. Instead of undersampling majority data randomly, our
method undersamples datapoints based on their ability to improve model loss.
Using improved model loss as a proxy measurement for classification
performance, our technique assesses a datapoint’s impact on loss and rejects
those unable to improve it. In so doing, our approach rejects majority
datapoints redundant to datapoints already accepted and, thereby, finds an
optimal subset of majority training data for classification. The accept/reject
component of our algorithm is motivated by a bilevel optimization problem
uniquely formulated to identify the optimal training set we seek. Experimental
results show our proposed technique with F1 scores up to 10% higher than
state-of-the-art methods.
[LINK]
http://arxiv.org/abs/2410.11171v2
[DATE]
2024-12-17 04:27:06+08:00
[CATEGORIES]
cs.LG
$G$-Mapper: Learning a Cover in the Mapper Construction
[AUTHORS]
Enrique Alvarado, Robin Belton, Emily Fischer, Kang-Ju Lee, Sourabh Palande, Sarah Percival, Emilie Purvine
[ABSTRACT]
The Mapper algorithm is a visualization technique in topological data
analysis (TDA) that outputs a graph reflecting the structure of a given
dataset. However, the Mapper algorithm requires tuning several parameters in
order to generate a “nice” Mapper graph. This paper focuses on selecting the
cover parameter. We present an algorithm that optimizes the cover of a Mapper
graph by splitting a cover repeatedly according to a statistical test for
normality. Our algorithm is based on $G$-means clustering which searches for
the optimal number of clusters in $k$-means by iteratively applying the
Anderson-Darling test. Our splitting procedure employs a Gaussian mixture model
to carefully choose the cover according to the distribution of the given data.
Experiments for synthetic and real-world datasets demonstrate that our
algorithm generates covers so that the Mapper graphs retain the essence of the
datasets, while also running fast.
[LINK]
http://arxiv.org/abs/2309.06634v3
[DATE]
2024-12-17 03:58:17+08:00
[CATEGORIES]
cs.LG
Achieving Collective Welfare in Multi-Agent Reinforcement Learning via Suggestion Sharing
[AUTHORS]
Yue Jin, Shuangqing Wei, Giovanni Montana
[ABSTRACT]
In human society, the conflict between self-interest and collective
well-being often obstructs efforts to achieve shared welfare. Related concepts
like the Tragedy of the Commons and Social Dilemmas frequently manifest in our
daily lives. As artificial agents increasingly serve as autonomous proxies for
humans, we propose using multi-agent reinforcement learning (MARL) to address
this issue - learning policies to maximise collective returns even when
individual agents’ interests conflict with the collective one. Traditional MARL
solutions involve sharing rewards, values, and policies or designing intrinsic
rewards to encourage agents to learn collectively optimal policies. We
introduce a novel MARL approach based on Suggestion Sharing (SS), where agents
exchange only action suggestions. This method enables effective cooperation
without the need to design intrinsic rewards, achieving strong performance
while revealing less private information compared to sharing rewards, values,
or policies. Our theoretical analysis establishes a bound on the discrepancy
between collective and individual objectives, demonstrating how sharing
suggestions can align agents’ behaviours with the collective objective.
Experimental results demonstrate that SS performs competitively with baselines
that rely on value or policy sharing or intrinsic rewards.
[LINK]
http://arxiv.org/abs/2412.12326v1
[DATE]
2024-12-17 03:44:44+08:00
[CATEGORIES]
cs.LG
RL-MILP Solver: A Reinforcement Learning Approach for Solving Mixed-Integer Linear Programs with Graph Neural Networks
[AUTHORS]
Tae-Hoon Lee, Min-Soo Kim
[ABSTRACT]
Mixed-Integer Linear Programming (MILP) is an optimization technique widely
used in various fields. Existing end-to-end learning methods for MILP generate
values for a subset of decision variables and delegate the remaining problem to
traditional MILP solvers. However, this approach does not guarantee solution
feasibility (i.e., satisfying all constraints) due to inaccurate predictions
and primarily focuses on prediction for binary decision variables. When
addressing MILP involving non-binary integer variables using machine learning
(ML), feasibility issues can become even more pronounced. Since finding an
optimal solution requires satisfying all constraints, addressing feasibility is
critical. To overcome these limitations, we propose a novel reinforcement
learning (RL)-based solver that interacts with MILP to incrementally discover
better feasible solutions without relying on traditional solvers. We design
reward functions tailored for MILP, which enable the RL agent to learn
relationships between decision variables and constraints. Furthermore, we
leverage a Transformer encoder-based graph neural network (GNN) to effectively
model complex relationships among decision variables. Our experimental results
demonstrate that the proposed method can solve MILP problems and find
near-optimal solutions without delegating the remainder to traditional solvers.
The proposed method provides a meaningful step forward as an initial study in
solving MILP problems entirely with ML in an end-to-end manner.
[COMMENTS]
Accepted at the 2025 AAAI Workshop on AI to Accelerate Science and
Engineering (AI2ASE)
[LINK]
http://arxiv.org/abs/2411.19517v2
[DATE]
2024-12-17 03:33:38+08:00
[CATEGORIES]
cs.LG
Suboptimality analysis of receding horizon quadratic control with unknown linear systems and its applications in learning-based control
[AUTHORS]
Shengling Shi, Anastasios Tsiamis, Bart De Schutter
[ABSTRACT]
This work analyzes how the trade-off between the modeling error, the terminal
value function error, and the prediction horizon affects the performance of a
nominal receding-horizon linear quadratic (LQ) controller. By developing a
novel perturbation result of the Riccati difference equation, a novel
performance upper bound is obtained and suggests that for many cases, the
prediction horizon can be either one or infinity to improve the control
performance, depending on the relative difference between the modeling error
and the terminal value function error. The result also shows that when an
infinite horizon is desired, a finite prediction horizon that is larger than
the controllability index can be sufficient for achieving a near-optimal
performance, revealing a close relation between the prediction horizon and
controllability. The obtained suboptimality performance bound is applied to
provide novel sample complexity and regret guarantees for nominal
receding-horizon LQ controllers in a learning-based setting. We show that an
adaptive prediction horizon that increases as a logarithmic function of time is
beneficial for regret minimization.
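
As a rough illustration of how the prediction horizon and the terminal value function interact, the sketch below iterates the standard Riccati difference equation for a scalar linear system and reports the first-stage feedback gain. The system parameters (a, b, q, r), the zero terminal weight, and the function name are assumptions chosen for illustration; this is not the paper's perturbation analysis.

```python
def riccati_recursion(a, b, q, r, p_terminal, horizon):
    """Backward Riccati difference equation for x+ = a*x + b*u, cost q*x^2 + r*u^2.

    Returns the cost-to-go weight P at the first stage and the first-stage
    feedback gain K (control law u = -K * x). Hypothetical helper.
    """
    p = p_terminal
    k = 0.0
    for _ in range(horizon):
        k = a * b * p / (r + b * b * p)      # gain implied by the next-stage weight
        p = q + a * a * p - a * b * p * k    # one backward Riccati step
    return p, k


# Illustrative system: a = b = q = r = 1, terminal weight 0.
for n in (1, 2, 5, 20):
    p_n, k_n = riccati_recursion(1.0, 1.0, 1.0, 1.0, 0.0, n)
    print(f"horizon={n:2d}  P={p_n:.6f}  K={k_n:.6f}")
```

For this example the gain approaches the infinite-horizon value within a few steps, so a short horizon with a poor terminal weight already behaves much like a long one.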
[LINK]
http://arxiv.org/abs/2301.07876v3
[DATE]
2024-12-17 03:24:22+08:00
[CATEGORIES]
cs.LG
ChronoFlow: A Data-Driven Model for Gyrochronology
[AUTHORS]
Phil R. Van-Lane, Joshua S. Speagle, Gwendolyn M. Eadie, Stephanie T. Douglas, Phillip A. Cargile, Catherine Zucker, Yuxi Lu, Ruth Angus
[ABSTRACT]
Gyrochronology is a technique for constraining stellar ages using rotation
periods, which change over a star’s main sequence lifetime due to magnetic
braking. This technique shows promise for main sequence FGKM stars, where other
methods are imprecise. However, models have historically struggled to capture
the observed rotational dispersion in stellar populations. To properly
understand this complexity, we have assembled the largest standardized data
catalog of rotators in open clusters to date, consisting of ~7,400 stars across
30 open clusters/associations spanning ages of 1.5 Myr to 4 Gyr. We have also
developed ChronoFlow: a flexible data-driven model which accurately captures
observed rotational dispersion. We show that ChronoFlow can be used to
accurately forward model rotational evolution, and to infer both cluster and
individual stellar ages. We recover cluster ages with a statistical uncertainty
of 0.06 dex ($\approx$ 15%), and individual stellar ages with a statistical
uncertainty of 0.7 dex. Additionally, we conducted robust systematic tests to
analyze the impact of extinction models, cluster membership, and calibration
ages on our model’s performance. These contribute an additional $\approx$ 0.06
dex of uncertainty in cluster age estimates, resulting in a total error budget
of 0.08 dex. We estimate ages for the NGC 6709 open cluster and the Theia 456
stellar stream, and calculate revised rotational ages for M34, NGC 2516, NGC
1750, and NGC 1647. Our results show that ChronoFlow can precisely estimate the
ages of coeval stellar populations, and constrain ages for individual stars.
Furthermore, its predictions may be used to inform physical spin down models.
ChronoFlow will be publicly available at
https://github.com/philvanlane/chronoflow.
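
The quoted uncertainties can be checked with a short calculation. The sketch below converts a logarithmic uncertainty in dex to a fractional one and combines the statistical and systematic terms in quadrature. Treating the two terms as independent is an assumption here; the abstract does not say how the total error budget was formed.

```python
import math


def dex_to_fractional(dex):
    """Fractional (upward) uncertainty corresponding to an uncertainty in dex."""
    return 10.0 ** dex - 1.0


def combine_in_quadrature(*terms):
    """Root-sum-square of uncertainty terms (assumes they are independent)."""
    return math.sqrt(sum(t * t for t in terms))


statistical_dex = 0.06   # cluster-age statistical uncertainty from the abstract
systematic_dex = 0.06    # approximate systematic contribution from the abstract

total_dex = combine_in_quadrature(statistical_dex, systematic_dex)
print(f"0.06 dex is about {dex_to_fractional(statistical_dex):.1%}")
print(f"combined uncertainty is about {total_dex:.3f} dex")
```

Under this assumption the combined value is close to the quoted total of 0.08 dex, given that the systematic term is itself only stated approximately.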
[COMMENTS]
40 pages, 28 figures (including appendices)
[LINK]
http://arxiv.org/abs/2412.12244v1
[DATE]
2024-12-17 03:00:00+08:00
[CATEGORIES]
cs.LG
OmniPrism: Learning Disentangled Visual Concept for Image Generation
[AUTHORS]
Yangyang Li, Daqing Liu, Wu Liu, Allen He, Xinchen Liu, Yongdong Zhang, Guoqing Jin
[ABSTRACT]
Creative visual concept generation often draws inspiration from specific
concepts in a reference image to produce relevant outcomes. However, existing
methods are typically constrained to single-aspect concept generation or are
easily disrupted by irrelevant concepts in multi-aspect concept scenarios,
leading to concept confusion and hindering creative generation. To address
this, we propose OmniPrism, a visual concept disentangling approach for
creative image generation. Our method learns disentangled concept
representations guided by natural language and trains a diffusion model to
incorporate these concepts. We utilize the rich semantic space of a multimodal
extractor to achieve concept disentanglement from given images and concept
guidance. To disentangle concepts with different semantics, we construct a
paired concept disentangled dataset (PCD-200K), where each pair shares the same
concept such as content, style, and composition. We learn disentangled concept
representations through our contrastive orthogonal disentangled (COD) training
pipeline, which are then injected into additional diffusion cross-attention
layers for generation. A set of block embeddings is designed to adapt each
block’s concept domain in the diffusion models. Extensive experiments
demonstrate that our method can generate high-quality, concept-disentangled
results with high fidelity to text prompts and desired concepts.
[COMMENTS]
WebPage available at https://tale17.github.io/omni/
[LINK]
http://arxiv.org/abs/2412.12242v1
[DATE]
2024-12-17 02:59:52+08:00
[CATEGORIES]
cs.LG
No More Tuning: Prioritized Multi-Task Learning with Lagrangian Differential Multiplier Methods
[AUTHORS]
Zhengxing Cheng, Yuheng Huang, Zhixuan Zhang, Dan Ou, Qingwen Liu
[ABSTRACT]
Given the ubiquity of multiple tasks in practical systems, Multi-Task Learning
(MTL) has found widespread application across diverse domains. In real-world
scenarios, these tasks often have different priorities. For instance, in web
search, relevance is often prioritized over other metrics, such as
click-through rates or user engagement. Existing frameworks pay insufficient
attention to prioritization among tasks and typically adjust task-specific
loss function weights to differentiate task priorities. However,
this approach encounters challenges as the number of tasks grows, leading to
exponential increases in hyper-parameter tuning complexity. Furthermore, the
simultaneous optimization of multiple objectives can negatively impact the
performance of high-priority tasks due to interference from lower-priority
tasks.
In this paper, we introduce a novel multi-task learning framework employing
Lagrangian Differential Multiplier Methods for step-wise multi-task
optimization. It is designed to boost the performance of high-priority tasks
without interference from other tasks. Its primary advantage lies in its
ability to automatically optimize multiple objectives without requiring
balancing hyper-parameters for different tasks, thereby eliminating the need
for manual tuning. Additionally, we provide theoretical analysis demonstrating
that our method ensures optimization guarantees, enhancing the reliability of
the process. We demonstrate its effectiveness through experiments on multiple
public datasets and its application in Taobao search, a large-scale industrial
search ranking system, resulting in significant improvements across various
business metrics.
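
To make the idea of a differential multiplier method concrete, here is a minimal sketch of the basic scheme on a toy problem: gradient descent on the parameter and gradient ascent on a non-negative multiplier. A secondary loss is minimised subject to a primary loss staying below a tolerance. The two quadratic losses, the tolerance `eps`, the step size, and all function names are hypothetical, and this is not the authors' step-wise algorithm.

```python
def primary_loss(x):
    return (x - 1.0) ** 2          # high-priority task, best at x = 1


def primary_grad(x):
    return 2.0 * (x - 1.0)


def secondary_grad(x):
    return 2.0 * (x - 3.0)         # low-priority task (x - 3)^2, best at x = 3


def solve(eps=0.25, lr=0.05, steps=5000):
    """Minimise the secondary loss subject to primary_loss(x) <= eps."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        violation = primary_loss(x) - eps                 # > 0 when the constraint is broken
        grad_x = secondary_grad(x) + lam * primary_grad(x)
        x -= lr * grad_x                                  # descent on the Lagrangian in x
        lam = max(0.0, lam + lr * violation)              # ascent on the multiplier, kept >= 0
    return x, lam


x_opt, lam_opt = solve()
print(f"x = {x_opt:.4f}, multiplier = {lam_opt:.4f}")
```

The multiplier settles on its own at the value that balances the two gradients, so no per-task loss weight has to be tuned by hand in this toy setting.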
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.12092v1
[DATE]
2024-12-17 02:58:28+08:00
[CATEGORIES]
cs.LG
Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation
[AUTHORS]
Eliot Xing, Vernon Luk, Jean Oh
[ABSTRACT]
Recent advances in GPU-based parallel simulation have enabled practitioners
to collect large amounts of data and train complex control policies using deep
reinforcement learning (RL), on commodity GPUs. However, such successes for RL
in robotics have been limited to tasks sufficiently simulated by fast
rigid-body dynamics. Simulation techniques for soft bodies are comparatively
several orders of magnitude slower, thereby limiting the use of RL due to
sample complexity requirements. To address this challenge, this paper presents
both a novel RL algorithm and a simulation platform to enable scaling RL on
tasks involving rigid bodies and deformables. We introduce Soft Analytic Policy
Optimization (SAPO), a maximum entropy first-order model-based actor-critic RL
algorithm, which uses first-order analytic gradients from differentiable
simulation to train a stochastic actor to maximize expected return and entropy.
Alongside our approach, we develop Rewarped, a parallel differentiable
multiphysics simulation platform that supports simulating various materials
beyond rigid bodies. We re-implement challenging manipulation and locomotion
tasks in Rewarped, and show that SAPO outperforms baselines over a range of
tasks that involve interaction between rigid bodies, articulations, and
deformables.
[LINK]
http://arxiv.org/abs/2412.12089v1
[DATE]
2024-12-17 02:56:24+08:00
[CATEGORIES]
cs.LG
Extrapolating Jet Radiation with Autoregressive Transformers
[AUTHORS]
Anja Butter, François Charton, Javier Mariño Villadamigo, Ayodele Ore, Tilman Plehn, Jonas Spinner
[ABSTRACT]
Generative networks are an exciting tool for fast LHC event generation.
Usually, they are used to generate configurations with a fixed number of
particles. Autoregressive transformers allow us to generate events with
variable numbers of particles, very much in line with the physics of QCD jet
radiation. We show how they can learn a factorized likelihood for jet radiation
and extrapolate in terms of the number of generated jets. For this
extrapolation, bootstrapping training data and training with modifications of
the likelihood loss can be used.
[LINK]
http://arxiv.org/abs/2412.12074v1
[DATE]
2024-12-17 02:46:43+08:00
[CATEGORIES]
cs.LG
Differentially Private Prototypes for Imbalanced Transfer Learning
[AUTHORS]
Dariush Wahdany, Matthew Jagielski, Adam Dziedzic, Franziska Boenisch
[ABSTRACT]
Machine learning (ML) models have been shown to leak private information from
their training datasets. Differential Privacy (DP), typically implemented
through the differentially private stochastic gradient descent algorithm
(DP-SGD), has become the standard solution to bound leakage from the models.
Despite recent improvements, DP-SGD-based approaches for private learning still
usually struggle in the high privacy ($\varepsilon\le1$) and low data regimes,
and when the private training datasets are imbalanced. To overcome these
limitations, we propose Differentially Private Prototype Learning (DPPL) as a
new paradigm for private transfer learning. DPPL leverages publicly pre-trained
encoders to extract features from private data and generates DP prototypes that
represent each private class in the embedding space and can be publicly
released for inference. Since our DP prototypes can be obtained from only a few
private training data points and without iterative noise addition, they offer
high-utility predictions and strong privacy guarantees even under the notion of
\textit{pure DP}. We additionally show that privacy-utility trade-offs can be
further improved when leveraging the public data beyond pre-training of the
encoder: in particular, we can privately sample our DP prototypes from the
publicly available data points used to train the encoder. Our experimental
evaluation with four state-of-the-art encoders, four vision datasets, and under
different data and imbalancedness regimes demonstrates DPPL’s high performance
under strong privacy guarantees in challenging private learning setups.
[COMMENTS]
To be published at the 39th Annual AAAI Conference on Artificial
Intelligence, Philadelphia, 2025
[LINK]
http://arxiv.org/abs/2406.08039v2
[DATE]
2024-12-17 02:28:19+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Kostas E. Bekris, Joe Doerr, Patrick Meng, Sumanth Tangirala
[COMMENTS]
Presented at the International Symposium of Robotics Research (ISRR),
[LINK]
http://arxiv.org/abs/2410.12172v2
[DATE]
2024-12-17 02:25:35+08:00
[CATEGORIES]
cs.LG
Bilevel Learning with Inexact Stochastic Gradients
[AUTHORS]
Mohammad Sadegh Salehi, Subhadip Mukherjee, Lindon Roberts, Matthias J. Ehrhardt
[ABSTRACT]
Bilevel learning has gained prominence in machine learning, inverse problems,
and imaging applications, including hyperparameter optimization, learning
data-adaptive regularizers, and optimizing forward operators. The large-scale
nature of these problems has led to the development of inexact and
computationally efficient methods. Existing adaptive methods predominantly rely
on deterministic formulations, while stochastic approaches often adopt a
doubly-stochastic framework with impractical variance assumptions, enforce a
fixed number of lower-level iterations, and require extensive tuning. In this
work, we focus on bilevel learning with strongly convex lower-level problems
and a nonconvex sum-of-functions in the upper-level. Stochasticity arises from
data sampling in the upper-level which leads to inexact stochastic
hypergradients. We establish their connection to state-of-the-art stochastic
optimization theory for nonconvex objectives. Furthermore, we prove the
convergence of inexact stochastic bilevel optimization under mild assumptions.
Our empirical results highlight significant speed-ups and improved
generalization in imaging tasks such as image denoising and deblurring in
comparison with adaptive deterministic bilevel methods.
[LINK]
http://arxiv.org/abs/2412.12049v1
[DATE]
2024-12-17 02:18:47+08:00
[CATEGORIES]
cs.LG
Qsco: A Quantum Scoring Module for Open-set Supervised Anomaly Detection
[AUTHORS]
Yifeng Peng, Xinyi Li, Zhiding Liang, Ying Wang
[ABSTRACT]
Open set anomaly detection (OSAD) is a crucial task that aims to identify
abnormal patterns or behaviors in data sets, especially when the anomalies
observed during training do not represent all possible classes of anomalies.
The recent advances in quantum computing in handling complex data structures
and improving machine learning models herald a paradigm shift in anomaly
detection methodologies. This study proposes a Quantum Scoring Module (Qsco),
embedding quantum variational circuits into neural networks to enhance the
model’s processing capabilities in handling uncertainty and unlabeled data.
Extensive experiments conducted across eight real-world anomaly detection
datasets demonstrate our model’s superior performance in detecting anomalies
across varied settings and reveal that integrating quantum simulators does not
result in prohibitive time complexities. Our study validates the feasibility of
quantum-enhanced anomaly detection methods in practical applications.
[COMMENTS]
The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)
[LINK]
http://arxiv.org/abs/2405.16368v2
[DATE]
2024-12-17 02:11:07+08:00
[CATEGORIES]
cs.LG
LLMs for Cold-Start Cutting Plane Separator Configuration
[AUTHORS]
Connor Lawless, Yingxi Li, Anders Wikum, Madeleine Udell, Ellen Vitercik
[ABSTRACT]
Mixed integer linear programming (MILP) solvers ship with a staggering number
of parameters that are challenging to select a priori for all but expert
optimization users, but can have an outsized impact on the performance of the
MILP solver. Existing machine learning (ML) approaches to configure solvers
require training ML models by solving thousands of related MILP instances,
generalize poorly to new problem sizes, and often require implementing complex
ML pipelines and custom solver interfaces that can be difficult to integrate
into existing optimization workflows. In this paper, we introduce a new
LLM-based framework to configure which cutting plane separators to use for a
given MILP problem with little to no training data based on characteristics of
the instance, such as a natural language description of the problem and the
associated LaTeX formulation. We augment these LLMs with descriptions of
cutting plane separators available in a given solver, grounded by summarizing
the existing research literature on separators. While individual solver
configurations have a large variance in performance, we present a novel
ensembling strategy that clusters and aggregates configurations to create a
small portfolio of high-performing configurations. Our LLM-based methodology
requires no custom solver interface, can find a high-performing configuration
by solving only a small number of MILPs, and can generate the configuration
with simple API calls that run in under a second. Numerical results show our
approach is competitive with existing configuration approaches on a suite of
classic combinatorial optimization problems and real-world datasets with only a
fraction of the training data and computation time.
[LINK]
http://arxiv.org/abs/2412.12038v1
[DATE]
2024-12-17 02:03:57+08:00
[CATEGORIES]
cs.LG
LeARN: Learnable and Adaptive Representations for Nonlinear Dynamics in System Identification
[AUTHORS]
Arunabh Singh, Joyjit Mukherjee
[ABSTRACT]
System identification, the process of deriving mathematical models of
dynamical systems from observed input-output data, has undergone a paradigm
shift with the advent of learning-based methods. Addressing the intricate
challenges of data-driven discovery in nonlinear dynamical systems, these
methods have garnered significant attention. Among them, Sparse Identification
of Nonlinear Dynamics (SINDy) has emerged as a transformative approach,
distilling complex dynamical behaviors into interpretable linear combinations
of basis functions. However, SINDy relies on domain-specific expertise to
construct its foundational “library” of basis functions, which limits its
adaptability and universality. In this work, we introduce a nonlinear system
identification framework called LeARN that transcends the need for prior domain
knowledge by learning the library of basis functions directly from data. To
enhance adaptability to evolving system dynamics under varying noise
conditions, we employ a novel meta-learning-based system identification
approach that uses a lightweight deep neural network (DNN) to dynamically
refine these basis functions. This not only captures intricate system behaviors
but also adapts seamlessly to new dynamical regimes. We validate our framework
on the Neural Fly dataset, showcasing its robust adaptation and generalization
capabilities. Despite its simplicity, our LeARN achieves competitive dynamical
error performance compared to SINDy. This work presents a step toward the
autonomous discovery of dynamical systems, paving the way for a future where
machine learning uncovers the governing principles of complex systems without
requiring extensive domain-specific interventions.
[COMMENTS]
This work has been submitted to the 7th Annual Learning for Dynamics
& Control Conference for review
[LINK]
http://arxiv.org/abs/2412.12036v1
[DATE]
2024-12-17 02:03:23+08:00
[CATEGORIES]
cs.LG
Thermodynamics-informed graph neural networks for real-time simulation of digital human twins
[AUTHORS]
Lucas Tesán, David González, Pedro Martins, Elías Cueto
[ABSTRACT]
The growing importance of real-time simulation in the medical field has
exposed the limitations and bottlenecks inherent in the digital representation
of complex biological systems. This paper presents a novel methodology aimed at
advancing current lines of research in soft tissue simulation. The proposed
approach introduces a hybrid model that integrates the geometric bias of graph
neural networks with the physical bias derived from the imposition of a
metriplectic structure as soft and hard constraints in the architecture, being
able to simulate hepatic tissue with dissipative properties. This approach
provides an efficient solution capable of generating predictions at high
feedback rate while maintaining a remarkable generalization ability for
previously unseen anatomies. These features are particularly relevant in
the context of precision medicine and haptic rendering.
Based on the adopted methodologies, we propose a model that predicts human
liver responses to traction and compression loads in as little as 7.3
milliseconds for optimized configurations and as fast as 1.65 milliseconds in
the most efficient cases, all in the forward pass. The model achieves relative
position errors below 0.15\%, with stress tensor and velocity estimations
maintaining relative errors under 7\%. This demonstrates the robustness of the
approach developed, which is capable of handling diverse load states and
anatomies effectively. This work highlights the feasibility of integrating
real-time simulation with patient-specific geometries through deep learning,
paving the way for more robust digital human twins in medical applications.
[LINK]
http://arxiv.org/abs/2412.12034v1
[DATE]
2024-12-17 02:01:40+08:00
[CATEGORIES]
cs.LG
Equivariant Action Sampling for Reinforcement Learning and Planning
[AUTHORS]
Linfeng Zhao, Owen Howell, Xupeng Zhu, Jung Yeon Park, Zhewen Zhang, Robin Walters, Lawson L. S. Wong
[ABSTRACT]
Reinforcement learning (RL) algorithms for continuous control tasks require
accurate sampling-based action selection. Many tasks, such as robotic
manipulation, contain inherent problem symmetries. However, correctly
incorporating symmetry into sampling-based approaches remains a challenge. This
work addresses the challenge of preserving symmetry in sampling-based planning
and control, a key component for enhancing decision-making efficiency in RL. We
introduce an action sampling approach that enforces the desired symmetry. We
apply our proposed method to a coordinate regression problem and show that the
symmetry-aware sampling method drastically outperforms the naive sampling
approach. We furthermore develop a general framework for sampling-based
model-based planning with Model Predictive Path Integral (MPPI). We compare our
MPPI approach with standard sampling methods on several continuous control
tasks. Empirical demonstrations across multiple continuous control environments
validate the effectiveness of our approach, showcasing the importance of
symmetry preservation in sampling-based action selection.
[COMMENTS]
Published at International Workshop on the Algorithmic Foundations of
Robotics (WAFR) 2024. Website: http://lfzhao.com/EquivSampling
[LINK]
http://arxiv.org/abs/2412.12237v1
[DATE]
2024-12-17 01:51:14+08:00
[CATEGORIES]
cs.LG
Deep-learning-based identification of individual motion characteristics from upper-limb trajectories towards disorder stage evaluation
[AUTHORS]
Tim Sziburis, Susanne Blex, Tobias Glasmachers, Ioannis Iossifidis
[ABSTRACT]
The identification of individual movement characteristics sets the foundation
for the assessment of personal rehabilitation progress and can provide
diagnostic information on levels and stages of movement disorders. This work
presents a preliminary study for differentiating individual motion patterns
using a dataset of 3D upper-limb transport trajectories measured in task-space.
Identifying individuals by deep time series learning can be a key step to
abstracting individual motion properties. In this study, a classification
accuracy of about 95% is reached for a subset of nine, and about 78% for the
full set of 31 individuals. This provides insights into the separability of
patient attributes by exerting a simple standardized task to be transferred to
portable systems.
[LINK]
http://arxiv.org/abs/2412.12016v1
[DATE]
2024-12-17 01:41:33+08:00
[CATEGORIES]
cs.LG
Cost-Effective Label-free Node Classification with LLMs
[AUTHORS]
Taiyan Zhang, Renchi Yang, Mingyu Yan, Xiaochun Ye, Dongrui Fan, Yurui Lai
[ABSTRACT]
Graph neural networks (GNNs) have emerged as go-to models for node
classification in graph data due to their powerful abilities in fusing graph
structures and attributes. However, such models strongly rely on adequate
high-quality labeled data for training, which are expensive to acquire in
practice. With the advent of large language models (LLMs), a promising way is
to leverage their superb zero-shot capabilities and massive knowledge for node
labeling. Despite promising results reported, this methodology either demands
considerable queries to LLMs, or suffers from compromised performance caused by
noisy labels produced by LLMs.
To remedy these issues, this work presents Cella, an active self-training
framework that integrates LLMs into GNNs in a cost-effective manner. The design
recipe of Cella is to iteratively identify small sets of “critical” samples
using GNNs and extract informative pseudo-labels for them with both LLMs and
GNNs as additional supervision signals to enhance model training. Particularly,
Cella includes three major components: (i) an effective active node selection
strategy for initial annotations; (ii) a judicious sample selection scheme to
sift out the “critical” nodes based on label disharmonicity and entropy; and
(iii) a label refinement module combining LLMs and GNNs with rewired topology.
Our extensive experiments over five benchmark text-attributed graph datasets
demonstrate that Cella significantly outperforms the state of the art under
the same query budget to LLMs in terms of label-free node classification. In
particular, on the DBLP dataset with 14.3k nodes, Cella is able to achieve a
conspicuous 8.08% improvement in accuracy over the state of the art at a cost
of less than one cent.
[COMMENTS]
15 pages, 5 figures
[LINK]
http://arxiv.org/abs/2412.11983v1
[DATE]
2024-12-17 01:04:40+08:00
[CATEGORIES]
cs.LG
Echo State network for coarsening dynamics of charge density waves
[AUTHORS]
Clement Dinh, Yunhao Fan, Gia-Wei Chern
[ABSTRACT]
An echo state network (ESN) is a type of reservoir computer that uses a
recurrent neural network with a sparsely connected hidden layer. Compared with
other recurrent neural networks, one great advantage of ESN is the simplicity
of its training process. Yet, despite the seemingly restricted learnable
parameters, ESN has been shown to successfully capture the spatial-temporal
dynamics of complex patterns. Here we build an ESN to model the coarsening
dynamics of charge-density waves (CDW) in a semi-classical Holstein model,
which exhibits a checkerboard electron density modulation at half-filling
stabilized by a commensurate lattice distortion. The inputs to the ESN are
local CDW order-parameters in a finite neighborhood centered around a given
site, while the output is the predicted CDW order of the center site at the
next time step. Special care is taken in the design of couplings between hidden
layer and input nodes to ensure lattice symmetries are properly incorporated
into the ESN model. Since the model predictions depend only on CDW
configurations of a finite domain, the ESN is scalable and transferable in the
sense that a model trained on a dataset from a small system can be directly
applied to dynamical simulations on larger lattices. Our work opens a new
avenue for efficient dynamical modeling of pattern formations in functional
electron materials.
[COMMENTS]
13 pages, 8 figures
[LINK]
http://arxiv.org/abs/2412.11982v1
[DATE]
2024-12-17 01:04:10+08:00
[CATEGORIES]
cs.LG
Industrial-scale Prediction of Cement Clinker Phases using Machine Learning
[AUTHORS]
Sheikh Junaid Fayaz, Nestor Montiel-Bohorquez, Shashank Bishnoi, Matteo Romano, Manuele Gatti, N. M. Anoop Krishnan
[ABSTRACT]
Cement production, exceeding 4.1 billion tonnes and contributing 2.4 tonnes
of CO2 annually, faces critical challenges in quality control and process
optimization. While traditional process models for cement manufacturing are
confined to steady-state conditions with limited predictive capability for
mineralogical phases, modern plants operate under dynamic conditions that
demand real-time quality assessment. Here, exploiting a comprehensive two-year
operational dataset from an industrial cement plant, we present a machine
learning framework that accurately predicts clinker mineralogy from process
data. Our model achieves unprecedented prediction accuracy for major clinker
phases while requiring minimal input parameters, demonstrating robust
performance under varying operating conditions. Through post-hoc explainable
algorithms, we interpret the hierarchical relationships between clinker oxides
and phase formation, providing insights into the functioning of an otherwise
black-box model. This digital twin framework can potentially enable real-time
optimization of cement production, thereby providing a route toward reducing
material waste and ensuring quality while reducing the associated emissions
under real plant conditions. Our approach represents a significant advancement
in industrial process control, offering a scalable solution for sustainable
cement manufacturing.
[LINK]
http://arxiv.org/abs/2412.11981v1
[DATE]
2024-12-17 01:03:04+08:00
[CATEGORIES]
cs.LG
AlphaZero Neural Scaling and Zipf’s Law: a Tale of Board Games and Power Laws
[AUTHORS]
Oren Neumann, Claudius Gros
[ABSTRACT]
Neural scaling laws are observed in a range of domains, to date with no clear
understanding of why they occur. Recent theories suggest that loss power laws
arise from Zipf’s law, a power law observed in domains like natural language.
One theory suggests that language scaling laws emerge when Zipf-distributed
task quanta are learned in descending order of frequency. In this paper we
examine power-law scaling in AlphaZero, a reinforcement learning algorithm,
using a theory of language-model scaling. We find that game states in training
and inference data scale with Zipf’s law, which is known to arise from the tree
structure of the environment, and examine the correlation between scaling-law
and Zipf’s-law exponents. In agreement with quanta scaling theory, we find that
agents optimize state loss in descending order of frequency, even though this
order scales inversely with modelling complexity. We also find that inverse
scaling, the failure of models to improve with size, is correlated with unusual
Zipf curves where end-game states are among the most frequent states. We show
evidence that larger models shift their focus to these less-important states,
sacrificing their understanding of important early-game states.
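
One simple way to check whether state visit counts follow Zipf's law is to regress log frequency on log rank and read off the negated slope. The sketch below does this with an ordinary least-squares fit on synthetic counts. The synthetic data and the function name are assumptions, and the paper's own fitting procedure may differ.

```python
import math
from collections import Counter


def zipf_exponent(items):
    """Estimate the exponent alpha in frequency ~ rank^(-alpha) by a log-log fit.

    Needs at least two distinct items; hypothetical helper for illustration.
    """
    counts = sorted(Counter(items).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Synthetic "game states": the state of rank r is visited about 1000 / r times.
states = [f"s{r}" for r in range(1, 51) for _ in range(round(1000 / r))]
alpha = zipf_exponent(states)
print(f"estimated Zipf exponent: {alpha:.3f}")
```

A least-squares fit on the full rank range is sensitive to the noisy low-frequency tail, so in practice one might restrict the fit to the most frequent states.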
[LINK]
http://arxiv.org/abs/2412.11979v1
[DATE]
2024-12-17 00:59:55+08:00
[CATEGORIES]
cs.LG
DANCE: Deep Learning-Assisted Analysis of Protein Sequences Using Chaos Enhanced Kaleidoscopic Images
[AUTHORS]
Taslim Murad, Prakash Chourasia, Sarwan Ali, Imdad Ullah Khan, Murray Patterson
[ABSTRACT]
Cancer is a complex disease characterized by uncontrolled cell growth. T cell
receptors (TCRs), crucial proteins in the immune system, play a key role in
recognizing antigens, including those associated with cancer. Recent
advancements in sequencing technologies have facilitated comprehensive
profiling of TCR repertoires, uncovering TCRs with potent anti-cancer activity
and enabling TCR-based immunotherapies. However, analyzing these intricate
biomolecules necessitates efficient representations that capture their
structural and functional information. T-cell protein sequences pose unique
challenges due to their relatively smaller lengths compared to other
biomolecules. An image-based representation approach becomes a preferred choice
for efficient embeddings, allowing for the preservation of essential details
and enabling comprehensive analysis of T-cell protein sequences. In this paper,
we propose to generate images from the protein sequences using the idea of
Chaos Game Representation (CGR) using the Kaleidoscopic images approach. This
Deep Learning Assisted Analysis of Protein Sequences Using Chaos Enhanced
Kaleidoscopic Images (called DANCE) provides a unique way to visualize protein
sequences by recursively applying chaos game rules around a central seed point.
We perform the classification of the T cell receptors (TCRs) protein sequences
in terms of their respective target cancer cells, as TCRs are known for their
immune response against cancer disease. The TCR sequences are converted into
images using the DANCE method. We employ deep-learning vision models to perform
the classification to obtain insights into the relationship between the visual
patterns observed in the generated kaleidoscopic images and the underlying
protein properties. By combining CGR-based image generation with deep learning
classification, this study opens novel possibilities in the protein analysis
domain.
[LINK]
http://arxiv.org/abs/2409.06694v2
[DATE]
2024-12-17 00:58:18+08:00
[CATEGORIES]
cs.LG
Neural general circulation models optimized to predict satellite-based precipitation observations
[AUTHORS]
Janni Yuval, Ian Langmore, Dmitrii Kochkov, Stephan Hoyer
[ABSTRACT]
Climate models struggle to accurately simulate precipitation, particularly
extremes and the diurnal cycle. Here, we present a hybrid model that is trained
directly on satellite-based precipitation observations. Our model runs at
2.8$^\circ$ resolution and is built on the differentiable NeuralGCM framework.
The model demonstrates significant improvements over existing general
circulation models, the ERA5 reanalysis, and a global cloud-resolving model in
simulating precipitation. Our approach yields reduced biases, a more realistic
precipitation distribution, improved representation of extremes, and a more
accurate diurnal cycle. Furthermore, it outperforms the mid-range precipitation
forecast of the ECMWF ensemble. This advance paves the way for more reliable
simulations of current climate and demonstrates how training on observations
can be used to directly improve GCMs.
[COMMENTS]
20 pages, 6 figures in Main. 29 pages, 30 figures in SI
[LINK]
http://arxiv.org/abs/2412.11973v1
[DATE]
2024-12-17 00:55:34+08:00
[CATEGORIES]
cs.LG
A Digital twin for Diesel Engines: Operator-infused PINNs with Transfer Learning for Engine Health Monitoring
[AUTHORS]
Kamaljyoti Nath, Varun Kumar, Daniel J. Smith, George Em Karniadakis
[ABSTRACT]
Improving diesel engine efficiency and emission reduction have been critical
research topics. Recent government regulations have shifted this focus to
another important area related to engine health and performance monitoring.
Although the advancements in the use of deep learning methods for system
monitoring have shown promising results in this direction, designing efficient
methods suitable for field systems remains an open research challenge. The
objective of this study is to develop a computationally efficient neural
network-based approach for identifying unknown parameters of a mean value
diesel engine model to facilitate physics-based health monitoring and
maintenance forecasting. We propose a hybrid method combining physics informed
neural networks, PINNs, and a deep neural operator, DeepONet, to predict unknown
parameters and gas flow dynamics in a diesel engine. The operator network
predicts independent actuator dynamics learnt through offline training, thereby
reducing the PINNs' online computational cost. To address PINNs' need for
retraining with changing input scenarios, we propose two transfer learning (TL)
strategies. The first strategy involves multi-stage transfer learning for
parameter identification. While this method is computationally efficient as
compared to online PINN training, improvements are required to meet field
requirements. The second TL strategy focuses solely on training the output
weights and biases of a subset of multi-head networks pretrained on a larger
dataset, substantially reducing computation time during online prediction. We
also evaluate our model for epistemic and aleatoric uncertainty by
incorporating dropout in pretrained networks and Gaussian noise in the training
dataset. This strategy offers a tailored, computationally inexpensive, and
physics-based approach for parameter identification in diesel engine sub
systems.
[LINK]
http://arxiv.org/abs/2412.11967v1
[DATE]
2024-12-17 00:47:15+08:00
[CATEGORIES]
cs.LG
BetaExplainer: A Probabilistic Method to Explain Graph Neural Networks
[AUTHORS]
Whitney Sloneker, Shalin Patel, Michael Wang, Lorin Crawford, Ritambhara Singh
[LINK]
http://arxiv.org/abs/2412.11964v1
[DATE]
2024-12-17 00:45:26+08:00
[CATEGORIES]
cs.LG
Gramian Multimodal Representation Learning and Alignment
[AUTHORS]
Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo, Danilo Comminiello
[ABSTRACT]
Human perception integrates multiple modalities, such as vision, hearing, and
language, into a unified understanding of the surrounding reality. While recent
multimodal models have achieved significant progress by aligning pairs of
modalities via contrastive learning, their solutions are unsuitable when
scaling to multiple modalities. These models typically align each modality to a
designated anchor without ensuring the alignment of all modalities with each
other, leading to suboptimal performance in tasks requiring a joint
understanding of multiple modalities. In this paper, we structurally rethink
the pairwise conventional approach to multimodal learning and we present the
novel Gramian Representation Alignment Measure (GRAM), which overcomes the
above-mentioned limitations. GRAM learns and then aligns $n$ modalities
directly in the higher-dimensional space in which modality embeddings lie by
minimizing the Gramian volume of the $k$-dimensional parallelotope spanned by
the modality vectors, ensuring the geometric alignment of all modalities
simultaneously. GRAM can replace cosine similarity in any downstream method,
holding for 2 to $n$ modalities and providing more meaningful alignment with
respect to previous similarity measures. The novel GRAM-based contrastive loss
function enhances the alignment of multimodal models in the higher-dimensional
embedding space, leading to new state-of-the-art performance in downstream
tasks such as video-audio-text retrieval and audio-video classification. The
project page, the code, and the pretrained models are available at
https://ispamm.github.io/GRAM/.
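The geometric quantity named in this abstract, the volume of the parallelotope spanned by a set of vectors, can be computed as the square root of the determinant of their Gram matrix. The sketch below is a pure-Python illustration under that standard definition; the helper names are hypothetical, the unit-normalisation step is an assumption, and it is not the GRAM implementation released by the authors.

```python
import math


def gram_matrix(vectors):
    """Return the k x k Gram matrix G[i][j] = <v_i, v_j>."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular: the vectors are linearly dependent
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def gram_volume(vectors):
    """Volume of the parallelotope spanned by the unit-normalised vectors."""
    unit = [normalize(v) for v in vectors]
    return math.sqrt(max(determinant(gram_matrix(unit)), 0.0))


aligned = gram_volume([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])      # parallel embeddings
orthogonal = gram_volume([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
```

With unit vectors the volume lies between 0 (all vectors collinear, i.e. perfectly aligned) and 1 (mutually orthogonal), which is why a smaller volume indicates better alignment.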
[LINK]
http://arxiv.org/abs/2412.11959v1
[DATE]
2024-12-17 00:41:51+08:00
[CATEGORIES]
cs.LG
Deep Learning for Hydroelectric Optimization: Generating Long-Term River Discharge Scenarios with Ensemble Forecasts from Global Circulation Models
[AUTHORS]
Julio Alberto Silva Dias
[ABSTRACT]
Hydroelectric power generation is a critical component of the global energy
matrix, particularly in countries like Brazil, where it represents the majority
of the energy supply. However, its strong dependence on river discharges, which
are inherently uncertain due to climate variability, poses significant
challenges. River discharges are linked to precipitation patterns, making the
development of accurate probabilistic forecasting models crucial for improving
operational planning in systems heavily reliant on this resource.
Traditionally, statistical models have been used to represent river discharges
in energy optimization. Yet, these models are increasingly unable to produce
realistic scenarios due to structural shifts in climate behavior. Changes in
precipitation patterns have altered discharge dynamics, which traditional
approaches struggle to capture. Machine learning methods, while effective as
universal predictors for time series, often focus solely on historical data,
ignoring key external factors such as meteorological and climatic conditions.
Furthermore, these methods typically lack a probabilistic framework, which is
vital for representing the inherent variability of hydrological processes. The
limited availability of historical discharge data further complicates the
application of large-scale deep learning models to this domain. To address
these challenges, we propose a framework based on a modified recurrent neural
network architecture. This model generates parameterized probability
distributions conditioned on projections from global circulation models,
effectively accounting for the stochastic nature of river discharges.
Additionally, the architecture incorporates enhancements to improve its
generalization capabilities. We validate this framework within the Brazilian
Interconnected System, using projections from the SEAS5-ECMWF system as
conditional variables.
[COMMENTS]
11 pages, 15 figures
[LINK]
http://arxiv.org/abs/2412.12234v1
[DATE]
2024-12-17 00:37:27+08:00
[CATEGORIES]
cs.LG
The Impact of Generalization Techniques on the Interplay Among Privacy, Utility, and Fairness in Image Classification
[AUTHORS]
Ahmad Hassanpour, Amir Zarei, Khawla Mallat, Anderson Santana de Oliveira, Bian Yang
[ABSTRACT]
This study investigates the trade-offs between fairness, privacy, and utility
in image classification using machine learning (ML). Recent research suggests
that generalization techniques can improve the balance between privacy and
utility. One focus of this work is sharpness-aware training (SAT) and its
integration with differential privacy (DP-SAT) to further improve this balance.
Additionally, we examine fairness in both private and non-private learning
models trained on datasets with synthetic and real-world biases. We also
measure the privacy risks involved in these scenarios by performing membership
inference attacks (MIAs) and explore the consequences of eliminating
high-privacy risk samples, termed outliers. Moreover, we introduce a new
metric, named \emph{harmonic score}, which combines accuracy, privacy, and
fairness into a single measure.
Through empirical analysis using generalization techniques, we achieve an
accuracy of 81.11\% under $(8, 10^{-5})$-DP on CIFAR-10, surpassing the 79.5\%
reported by De et al. (2022). Moreover, our experiments show that memorization
of training samples can begin before the overfitting point, and generalization
techniques do not guarantee the prevention of this memorization. Our analysis
of synthetic biases shows that generalization techniques can amplify model bias
in both private and non-private models. Additionally, our results indicate that
increased bias in training data leads to reduced accuracy, greater
vulnerability to privacy attacks, and higher model bias. We validate these
findings with the CelebA dataset, demonstrating that similar trends persist
with real-world attribute imbalances. Finally, our experiments show that
removing outlier data decreases accuracy and further amplifies model bias.
[COMMENTS]
Published as a conference paper at the 25th Privacy Enhancing
Technologies Symposium (PETS 2025)
[LINK]
http://arxiv.org/abs/2412.11951v1
[DATE]
2024-12-17 00:35:31+08:00
[CATEGORIES]
cs.LG
Asynchronous Distributed Gaussian Process Regression for Online Learning and Dynamical Systems: Complementary Document
[AUTHORS]
Zewen Yang, Xiaobing Dai, Sandra Hirche
[ABSTRACT]
This is a complementary document for the paper titled “Asynchronous
Distributed Gaussian Process Regression for Online Learning and Dynamical
Systems”.
[LINK]
http://arxiv.org/abs/2412.11950v1
[DATE]
2024-12-17 00:34:48+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Chloe Qinyu Zhu, Muhang Tian, Lesia Semenova, Jiachang Liu, Jack Xu, Joseph Scarpa, Cynthia Rudin
[ABSTRACT]
Prediction of mortality in intensive care unit (ICU) patients typically
relies on black box models (that are unacceptable for use in hospitals) or
hand-tuned interpretable models (that might lead to a loss in performance).
We aim to bridge the gap between these two categories by building on modern
interpretable ML techniques to design interpretable mortality risk scores that
are as accurate as black boxes. We developed a new algorithm, GroupFasterRisk,
which has several important benefits: it uses both hard and soft direct
sparsity regularization, it incorporates group sparsity to allow more cohesive
models, it allows for monotonicity constraints to include domain knowledge, and
it produces many equally-good models, which allows domain experts to choose
among them. For evaluation, we leveraged the largest existing public ICU
monitoring datasets (MIMIC III and eICU). Models produced by GroupFasterRisk
outperformed OASIS and SAPS II scores and performed similarly to APACHE IV/IVa
while using at most a third of the parameters. For patients with
sepsis/septicemia, acute myocardial infarction, heart failure, and acute kidney
failure, GroupFasterRisk models outperformed OASIS and SOFA. Finally, different
mortality prediction ML approaches performed better based on variables selected
by GroupFasterRisk as compared to OASIS variables. GroupFasterRisk’s models
performed better than risk scores currently used in hospitals, and on par with
black box ML models, while being orders of magnitude sparser. Because
GroupFasterRisk produces a variety of risk scores, it allows design flexibility,
the key enabler of practical model creation. GroupFasterRisk is a fast,
accessible, and flexible procedure that allows learning a diverse set of sparse
risk scores for mortality prediction.
[LINK]
http://arxiv.org/abs/2311.13015v2
[DATE]
2024-12-17 00:34:34+08:00
[CATEGORIES]
cs.LG
Risk and cross validation in ridge regression with correlated samples
[AUTHORS]
Alexander Atanasov, Jacob A. Zavatone-Veth, Cengiz Pehlevan
[ABSTRACT]
Recent years have seen substantial advances in our understanding of
high-dimensional ridge regression, but existing theories assume that training
examples are independent. By leveraging techniques from random matrix theory
and free probability, we provide sharp asymptotics for the in- and
out-of-sample risks of ridge regression when the data points have arbitrary
correlations. We demonstrate that in this setting, the generalized cross
validation estimator (GCV) fails to correctly predict the out-of-sample risk.
However, in the case where the noise residuals have the same correlations as
the data points, one can modify the GCV to yield an efficiently-computable
unbiased estimator that concentrates in the high-dimensional limit, which we
dub CorrGCV. We further extend our asymptotic analysis to the case where the
test point has nontrivial correlations with the training set, a setting often
encountered in time series forecasting. Assuming knowledge of the correlation
structure of the time series, this again yields an extension of the GCV
estimator, and sharply characterizes the degree to which such test points yield
an overly optimistic prediction of long-time risk. We validate the predictions
of our theory across a variety of high dimensional data.
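For reference, the classical GCV estimator that this abstract says breaks down under correlated samples can be written as follows. The notation ($n$ samples, design matrix $X \in \mathbb{R}^{n \times p}$, targets $y$, ridge penalty $\lambda$) and the scaling of the objective are assumptions chosen for this sketch, not taken from the paper, and the CorrGCV correction itself is not reproduced here.

```latex
% Ridge objective (assumed scaling):
%   \hat{w}_\lambda = \arg\min_w \frac{1}{n}\lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2
\begin{align}
  H_\lambda &= X \left( X^\top X + n\lambda I_p \right)^{-1} X^\top, \\
  \mathrm{GCV}(\lambda) &=
    \frac{\frac{1}{n} \lVert y - H_\lambda y \rVert_2^2}
         {\left( 1 - \frac{1}{n} \operatorname{tr} H_\lambda \right)^{2}} .
\end{align}
```

The numerator is the in-sample (training) risk and the denominator rescales it using the effective degrees of freedom $\operatorname{tr} H_\lambda$.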
[COMMENTS]
44 pages, 18 figures. v3: minor typos fixed
[LINK]
http://arxiv.org/abs/2408.04607v3
[DATE]
2024-12-17 00:30:54+08:00
[CATEGORIES]
cs.LG
Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation
[AUTHORS]
Prashansa Panda, Shalabh Bhatnagar
[ABSTRACT]
Several recent works have focused on carrying out non-asymptotic convergence
analyses for AC algorithms. Recently, a two-timescale critic-actor algorithm
has been presented for the discounted cost setting in the look-up table case
where the timescales of the actor and the critic are reversed and only
asymptotic convergence shown. In our work, we present the first two-timescale
critic-actor algorithm with function approximation in the long-run average
reward setting and present the first finite-time non-asymptotic as well as
asymptotic convergence analysis for such a scheme. We obtain optimal learning
rates and prove that our algorithm achieves a sample complexity of
$\tilde{\mathcal{O}}(\epsilon^{-(2+\delta)})$ with $\delta > 0$ arbitrarily
close to zero, for the mean squared error of the critic to be upper bounded by
$\epsilon$ which is better than the one obtained for two-timescale AC in a
similar setting. A notable feature of our analysis is that we present the
asymptotic convergence analysis of our scheme in addition to the finite-time
bounds that we obtain and show the almost sure asymptotic convergence of the
(slower) critic recursion to the attractor of an associated differential
inclusion with actor parameters corresponding to local maxima of a perturbed
average reward objective. We also show the results of numerical experiments on
three benchmark settings and observe that our critic-actor algorithm performs
the best amongst all algorithms.
[LINK]
http://arxiv.org/abs/2402.01371v3
[DATE]
2024-12-17 00:17:46+08:00
[CATEGORIES]
cs.LG
Hierarchical Meta-Reinforcement Learning via Automated Macro-Action Discovery
[AUTHORS]
Minjae Cho, Chuangchuang Sun
[ABSTRACT]
Meta-Reinforcement Learning (Meta-RL) enables fast adaptation to new testing
tasks. Despite recent advancements, it is still challenging to learn performant
policies across multiple complex and high-dimensional tasks. To address this,
we propose a novel architecture with three hierarchical levels for 1) learning
task representations, 2) discovering task-agnostic macro-actions in an
automated manner, and 3) learning primitive actions. The macro-action can guide
the low-level primitive policy learning to more efficiently transition to goal
states. This can address the issue that the policy may forget previously
learned behavior while learning new, conflicting tasks. Moreover, the
task-agnostic nature of the macro-actions is enabled by removing task-specific
components from the state space. Hence, this makes them amenable to
re-composition across different tasks and leads to promising fast adaptation to
new tasks. Also, the prospective instability from the tri-level hierarchies is
effectively mitigated by our innovative, independently tailored training
schemes. Experiments in the MetaWorld framework demonstrate the improved sample
efficiency and success rate of our approach compared to previous
state-of-the-art methods.
[LINK]
http://arxiv.org/abs/2412.11930v1
[DATE]
2024-12-17 00:15:36+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Clemens Arndt, Judith Nickel
[ABSTRACT]
Learning-based methods have demonstrated remarkable performance in solving
inverse problems, particularly in image reconstruction tasks. Despite their
success, these approaches often lack theoretical guarantees, which are crucial
in sensitive applications such as medical imaging. Recent works by Arndt et al
(2023 Inverse Problems 39 125018, 2024 Inverse Problems 40 045021) addressed
this gap by analyzing a data-driven reconstruction method based on invertible
residual networks (iResNets). They revealed that, under reasonable assumptions,
this approach constitutes a convergent regularization scheme. However, the
performance of the reconstruction method was only validated on academic toy
problems and small-scale iResNet architectures. In this work, we address this
gap by evaluating the performance of iResNets on two real-world imaging tasks:
a linear blurring operator and a nonlinear diffusion operator. To do so, we
extend some of the theoretical results from Arndt et al to encompass nonlinear
inverse problems and offer insights for the design of large-scale performant
iResNet architectures. Through numerical experiments, we compare the
performance of our iResNet models against state-of-the-art neural networks,
confirming their efficacy. Additionally, we numerically investigate the
theoretical guarantees of this approach and demonstrate how the invertibility
of the network enables a deeper analysis of the learned forward operator and
its learned regularization.
[LINK]
http://arxiv.org/abs/2409.13482v2
[DATE]
2024-12-17 00:04:31+08:00
[CATEGORIES]
cs.LG
CharacterBench: Benchmarking Character Customization of Large Language Models
[AUTHORS]
Jinfeng Zhou, Yongkang Huang, Bosi Wen, Guanqun Bi, Yuxuan Chen, Pei Ke, Zhuang Chen, Xiyao Xiao, Libiao Peng, Kuntian Tang, Rongsheng Zhang, Le Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang
[ABSTRACT]
Character-based dialogue (aka role-playing) enables users to freely customize
characters for interaction, which often relies on LLMs, raising the need to
evaluate LLMs’ character customization capability. However, existing benchmarks
fail to ensure a robust evaluation as they often only involve a single
character category or evaluate limited dimensions. Moreover, the sparsity of
character features in responses makes feature-focused generative evaluation
both ineffective and inefficient. To address these issues, we propose
CharacterBench, the largest bilingual generative benchmark, with 22,859
human-annotated samples covering 3,956 characters from 25 detailed character
categories. We define 11 dimensions of 6 aspects, classified as sparse and
dense dimensions based on whether character features evaluated by specific
dimensions manifest in each response. We enable effective and efficient
evaluation by crafting tailored queries for each dimension to induce
characters’ responses related to specific dimensions. Further, we develop
the CharacterJudge model for cost-effective and stable evaluations. Experiments
show its superiority over SOTA automatic judges (e.g., GPT-4) and our
benchmark’s potential to optimize LLMs’ character customization. Our repository
is at https://github.com/thu-coai/CharacterBench.
[COMMENTS]
AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11912v1
[DATE]
2024-12-16 23:55:34+08:00
[CATEGORIES]
cs.CL
Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments
[AUTHORS]
Andrii Nikolaiev, Yiannos Stathopoulos, Simone Teufel
[ABSTRACT]
In this paper we look at the ability of recent large language models (LLMs)
to solve mathematical problems in combinatorics. We compare models LLaMA-2,
LLaMA-3.1, GPT-4, and Mixtral against each other and against human pupils and
undergraduates with prior experience in mathematical olympiads. To facilitate
these comparisons we introduce the Combi-Puzzles dataset, which contains 125
problem variants based on 25 combinatorial reasoning problems. Each problem is
presented in one of five distinct forms, created by systematically manipulating
the problem statements through adversarial additions, numeric parameter
changes, and linguistic obfuscation. Our variations preserve the mathematical
core and are designed to measure the generalisability of LLM problem-solving
abilities, while also increasing confidence that problems are submitted to LLMs
in forms that have not been seen as training instances. We found that a model
based on GPT-4 outperformed all other models in producing correct responses,
and performed significantly better in the mathematical variation of the
problems than humans. We also found that modifications to problem statements
significantly impact the LLM’s performance, while human performance remains
unaffected.
[LINK]
http://arxiv.org/abs/2412.11908v1
[DATE]
2024-12-16 23:54:06+08:00
[CATEGORIES]
cs.CL
Classification of Spontaneous and Scripted Speech for Multilingual Audio
[AUTHORS]
Shahar Elisha, Andrew McDowell, Mariano Beguerisse-Díaz, Emmanouil Benetos
[ABSTRACT]
Distinguishing scripted from spontaneous speech is an essential tool for
better understanding how speech styles influence speech processing research. It
can also improve recommendation systems and discovery experiences for media
users through better segmentation of large recorded speech catalogues. This
paper addresses the challenge of building a classifier that generalises well
across different formats and languages. We systematically evaluate models
ranging from traditional, handcrafted acoustic and prosodic features to
advanced audio transformers, utilising a large, multilingual proprietary
podcast dataset for training and validation. We break down the performance of
each model across 11 language groups to evaluate cross-lingual biases. Our
experimental analysis extends to publicly available datasets to assess the
models’ generalisability to non-podcast domains. Our results indicate that
transformer-based models consistently outperform traditional feature-based
techniques, achieving state-of-the-art performance in distinguishing between
scripted and spontaneous speech across various languages.
[COMMENTS]
Accepted to IEEE Spoken Language Technology Workshop 2024
[LINK]
http://arxiv.org/abs/2412.11896v1
[DATE]
2024-12-16 23:45:10+08:00
[CATEGORIES]
cs.CL
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
[AUTHORS]
Renqiu Xia, Mingsheng Li, Hancheng Ye, Wenjie Wu, Hongbin Zhou, Jiakang Yuan, Tianshuo Peng, Xinyu Cai, Xiangchao Yan, Bin Wang, Conghui He, Botian Shi, Tao Chen, Junchi Yan, Bo Zhang
[ABSTRACT]
Despite their proficiency in general tasks, Multi-modal Large Language Models
(MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands
understanding diagrams, interpreting symbols, and performing complex reasoning.
This limitation arises from their pre-training on natural images and texts,
along with the lack of automated verification in the problem-solving process.
Besides, current geometric specialists are limited by their task-specific
designs, making them less effective for broader geometric problems. To this
end, we present GeoX, a multi-modal large model focusing on geometric
understanding and reasoning tasks. Given the significant differences between
geometric diagram-symbol and natural image-text, we introduce unimodal
pre-training to develop a diagram encoder and symbol decoder, enhancing the
understanding of geometric images and corpora. Furthermore, we introduce
geometry-language alignment, an effective pre-training paradigm that bridges
the modality gap between unimodal geometric experts. We propose a
Generator-And-Sampler Transformer (GS-Former) to generate discriminative
queries and eliminate uninformative representations from unevenly distributed
geometric signals. Finally, GeoX benefits from visual instruction tuning,
empowering it to take geometric images and questions as input and generate
verifiable solutions. Experiments show that GeoX outperforms both generalists
and geometric specialists on publicly recognized benchmarks, such as GeoQA,
UniGeo, Geometry3K, and PGPS9k.
[COMMENTS]
Our code is available at https://github.com/UniModal4Reasoning/GeoX
[LINK]
http://arxiv.org/abs/2412.11863v1
[DATE]
2024-12-16 23:20:03+08:00
[CATEGORIES]
cs.CL
Unified Multimodal Interleaved Document Representation for Retrieval
[AUTHORS]
Jaewoo Lee, Joonho Ko, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang
[ABSTRACT]
Information Retrieval (IR) methods aim to identify documents relevant to a
query, which have been widely applied in various natural language tasks.
However, existing approaches typically consider only the textual content within
documents, overlooking the fact that documents can contain multiple modalities,
including images and tables. Also, they often segment each long document into
multiple discrete passages for embedding, which prevents them from capturing
the overall document context and interactions between paragraphs. To address
these two challenges, we propose a method that holistically embeds documents
interleaved with multiple modalities by leveraging the capability of recent
vision-language models that enable the processing and integration of text,
images, and tables into a unified format and representation. Moreover, to
mitigate the information loss from segmenting documents into passages, instead
of representing and retrieving passages individually, we further merge the
representations of segmented passages into one single document representation,
while we additionally introduce a reranking strategy to decouple and identify
the relevant passage within the document if necessary. Then, through extensive
experiments on diverse IR scenarios considering both the textual and multimodal
queries, we show that our approach substantially outperforms relevant
baselines, thanks to the consideration of the multimodal information within
documents.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2410.02729v2
[DATE]
2024-12-16 23:11:11+08:00
[CATEGORIES]
cs.CL
QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory
[AUTHORS]
Yihang Wang, Xu Huang, Bowen Tian, Yueyang Su, Lei Yu, Huaming Liao, Yixing Fan, Jiafeng Guo, Xueqi Cheng
[ABSTRACT]
Generative LLMs have achieved remarkable success in various industrial
applications, owing to their promising In-Context Learning capabilities.
However, the issue of long context in complex tasks poses a significant barrier
to their wider adoption, manifested in two main aspects: (i) The excessively
long context leads to high costs and inference delays. (ii) A substantial
amount of task-irrelevant information introduced by long contexts exacerbates
the “lost in the middle” problem. Existing methods compress context by removing
redundant tokens using metrics such as self-information or PPL, which is
inconsistent with the objective of retaining the most important tokens when
conditioning on a given query. In this study, we introduce information
bottleneck theory (IB) to model the problem, offering a novel perspective that
thoroughly addresses the essential properties required for context compression.
Additionally, we propose a cross-attention-based approach to approximate mutual
information in IB, which can be flexibly replaced with suitable alternatives in
different scenarios. Extensive experiments on four datasets demonstrate that
our method achieves a 25% increase in compression rate compared to the
state-of-the-art, while maintaining question answering performance. In
particular, the context compressed by our method even outperforms the full
context in some cases.
[LINK]
http://arxiv.org/abs/2408.10497v2
[DATE]
2024-12-16 23:03:54+08:00
[CATEGORIES]
cs.CL
Improved Models for Media Bias Detection and Subcategorization
[AUTHORS]
Tim Menzner, Jochen L. Leidner
[ABSTRACT]
We present improved models for the granular detection and sub-classification
of news media bias in English news articles. We compare the performance of
zero-shot versus fine-tuned large pre-trained neural transformer language
models, explore how the level of detail of the classes affects performance on a
novel taxonomy of 27 news bias-types, and demonstrate how synthetically
generated example data can be used to improve quality.
[LINK]
http://arxiv.org/abs/2412.11835v1
[DATE]
2024-12-16 22:56:31+08:00
[CATEGORIES]
cs.CL
Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation
[AUTHORS]
Leonidas Zotos, Hedderik van Rijn, Malvina Nissim
[ABSTRACT]
In an educational setting, an estimate of the difficulty of multiple-choice
questions (MCQs), a commonly used strategy to assess learning progress,
constitutes very useful information for both teachers and students. Since human
assessment is costly from multiple points of view, automatic approaches to MCQ
item difficulty estimation have been investigated, though with mixed success
so far. Our approach to this problem takes a different angle from previous
work: asking various Large Language Models to tackle the questions included in
two different MCQ datasets, we leverage model uncertainty to estimate item
difficulty. By using both model uncertainty features as well as textual
features in a Random Forest regressor, we show that uncertainty features
contribute substantially to difficulty prediction, where difficulty is
inversely proportional to the number of students who can correctly answer a
question. In addition to showing the value of our approach, we also observe
that our model achieves state-of-the-art results on the publicly available BEA
dataset.
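As a rough illustration of what a model uncertainty feature can look like, the sketch below derives three simple quantities from a model's probabilities over the answer choices of one MCQ. The function names, the choice of features, and the example probabilities are assumptions made for illustration; the abstract does not state which uncertainty measures the authors use.

```python
import math


def choice_entropy(probs):
    """Shannon entropy (in nats) of a distribution over answer choices."""
    total = sum(probs)
    normalised = [p / total for p in probs]
    return -sum(p * math.log(p) for p in normalised if p > 0)


def uncertainty_features(probs):
    """Hypothetical per-question features built from answer-choice probabilities."""
    total = sum(probs)
    ranked = sorted((p / total for p in probs), reverse=True)
    return {
        "entropy": choice_entropy(probs),   # high when the model is unsure
        "top_prob": ranked[0],              # probability of the preferred choice
        "margin": ranked[0] - ranked[1],    # gap between the two best choices
    }


# Illustrative inputs only: a question the model finds easy and one it finds hard.
confident = uncertainty_features([0.97, 0.01, 0.01, 0.01])
unsure = uncertainty_features([0.25, 0.25, 0.25, 0.25])
```

Features of this kind could then be concatenated with textual features and passed to a regressor such as a Random Forest, as the abstract describes.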
[COMMENTS]
14 pages, 7 figures
[LINK]
http://arxiv.org/abs/2412.11831v1
[DATE]
2024-12-16 22:55:09+08:00
[CATEGORIES]
cs.CL
Learning from Synthetic Data for Visual Grounding
[AUTHORS]
Ruozhen He, Ziyan Yang, Paola Cascante-Bonilla, Alexander C. Berg, Vicente Ordonez
[ABSTRACT]
This paper extensively investigates the effectiveness of synthetic training
data to improve the capabilities of vision-and-language models for grounding
textual descriptions to image regions. We explore various strategies to best
generate image-text pairs and image-text-box triplets using a series of
pretrained models under different settings and varying degrees of reliance on
real data. Through comparative analyses with synthetic, real, and web-crawled
data, we identify factors that contribute to performance differences, and
propose SynGround, an effective pipeline for generating useful synthetic data
for visual grounding. Our findings show that SynGround can improve the
localization capabilities of off-the-shelf vision-and-language models and
offers the potential for arbitrarily large scale data generation. Particularly,
data generated with SynGround improves the pointing game accuracy of
pretrained ALBEF and BLIP models by 4.81 and 17.11 absolute percentage
points, respectively, across the RefCOCO+ and the Flickr30k benchmarks.
[COMMENTS]
Project Page: https://catherine-r-he.github.io/SynGround/
[LINK]
http://arxiv.org/abs/2403.13804v2
[DATE]
2024-12-16 22:53:21+08:00
[CATEGORIES]
cs.CL
cs.LG
DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling
[AUTHORS]
Minzheng Wang, Xinghua Zhang, Kun Chen, Nan Xu, Haiyang Yu, Fei Huang, Wenji Mao, Yongbin Li
[ABSTRACT]
Large language models (LLMs) have made dialogue one of the central modes in
human-machine interaction, leading to vast amounts of conversation logs and
increasing demand for dialogue generation. The dialogue’s life-cycle spans from
the $\textit{Prelude}$ through the $\textit{Interlocution}$ to the
$\textit{Epilogue}$, encompassing rich dialogue elements. Despite the large
volumes of dialogue-related studies, there is a lack of benchmarks that
encompass comprehensive dialogue elements, which hinders precise modeling,
generation and systematic evaluation. To bridge this gap, in this paper, we
introduce a new research task $\textbf{D}$ialogue $\textbf{E}$lement
$\textbf{MO}$deling, including $\textit{Element Awareness}$ and
$\textit{Dialogue Agent Interaction}$, and propose a novel benchmark,
$\textbf{DEMO}$, designed for comprehensive dialogue modeling and assessment.
On this basis, we further build the DEMO agent with the adept ability to model
dialogue elements via imitation learning. Extensive experiments on DEMO
indicate that current representative LLMs still have considerable potential for
enhancement, and our DEMO agent performs well in both dialogue element modeling
and out-of-domain tasks.
[COMMENTS]
We release the code and data at https://github.com/MozerWang/DEMO
[LINK]
http://arxiv.org/abs/2412.04905v2
[DATE]
2024-12-16 22:36:19+08:00
[CATEGORIES]
cs.CL
cs.LG
EventSum: A Large-Scale Event-Centric Summarization Dataset for Chinese Multi-News Documents
[AUTHORS]
Mengna Zhu, Kaisheng Zeng, Mao Wang, Kaiming Xiao, Lei Hou, Hongbin Huang, Juanzi Li
[ABSTRACT]
In real life, many dynamic events, such as major disasters and large-scale
sports events, evolve continuously over time. Obtaining an overview of these
events can help people quickly understand the situation and respond more
effectively. This is challenging because the key information of the event is
often scattered across multiple documents, involving complex event knowledge
understanding and reasoning, which is under-explored in previous work.
Therefore, we proposed the Event-Centric Multi-Document Summarization (ECS)
task, which aims to generate concise and comprehensive summaries of a given
event based on multiple related news documents. Based on this, we built
the EventSum dataset, which was constructed using Baidu Baike entries and
underwent extensive human annotation, to facilitate relevant research. It is
the first large-scale Chinese multi-document summarization dataset, containing
5,100 events and a total of 57,984 news documents, with an average of 11.4
input news documents and 13,471 characters per event. To ensure data quality
and mitigate potential data leakage, we adopted a multi-stage annotation
approach for manually labeling the test set. Given the complexity of
event-related information, existing metrics struggle to comprehensively assess
the quality of generated summaries. We designed specific metrics including
Event Recall, Argument Recall, Causal Recall, and Temporal Recall along with
corresponding calculation methods for evaluation. We conducted comprehensive
experiments on EventSum to evaluate the performance of advanced long-context
Large Language Models (LLMs) on this task. Our experimental results indicate
that: 1) The event-centric multi-document summarization task remains
challenging for existing long-context LLMs; 2) The recall metrics we designed
are crucial for evaluating the comprehensiveness of the summary information.
[COMMENTS]
Extended version for paper accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11814v1
[DATE]
2024-12-16 22:29:49+08:00
[CATEGORIES]
cs.CL
UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models
[AUTHORS]
Boyang Xue, Fei Mi, Qi Zhu, Hongru Wang, Rui Wang, Sheng Wang, Erxin Yu, Xuming Hu, Kam-Fai Wong
[ABSTRACT]
Despite demonstrating impressive capabilities, Large Language Models (LLMs)
still often struggle to accurately express the factual knowledge they possess,
especially in cases where the LLMs’ knowledge boundaries are ambiguous. To
improve LLMs’ factual expressions, we propose the UAlign framework, which
leverages Uncertainty estimations to represent knowledge boundaries, and then
explicitly incorporates these representations as input features into prompts
for LLMs to Align with factual knowledge. First, we prepare the dataset on
knowledge question-answering (QA) samples by calculating two uncertainty
estimations, including confidence score and semantic entropy, to represent the
knowledge boundaries for LLMs. Subsequently, using the prepared dataset, we
train a reward model that incorporates uncertainty estimations and then employ
the Proximal Policy Optimization (PPO) algorithm for factuality alignment on
LLMs. Experimental results indicate that, by integrating uncertainty
representations in LLM alignment, the proposed UAlign can significantly enhance
the LLMs’ capacities to confidently answer known questions and refuse unknown
questions on both in-domain and out-of-domain tasks, showing reliability
improvements and good generalizability over various prompt- and training-based
baselines.
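The two uncertainty estimations named above can be sketched from a set of answers sampled for one question. This is a simplified illustration under stated assumptions: answers are grouped by a normalised string match (a stand-in for real semantic clustering), and the confidence score is taken to be the share of samples in the largest group. Neither choice is confirmed by the abstract, and the function names are hypothetical.

```python
import math
from collections import Counter


def cluster_key(answer):
    """Stand-in for semantic clustering: case-folded, whitespace-normalised text."""
    return " ".join(answer.lower().split())


def uncertainty_estimates(sampled_answers):
    """Return (confidence, semantic_entropy) for answers sampled for one question."""
    counts = Counter(cluster_key(a) for a in sampled_answers)
    n = len(sampled_answers)
    shares = [c / n for c in counts.values()]
    confidence = max(shares)  # assumed definition: share of the largest cluster
    semantic_entropy = -sum(p * math.log(p) for p in shares)
    return confidence, semantic_entropy


# Illustrative samples: three answers agree in meaning, one differs.
conf, ent = uncertainty_estimates(["Paris", "paris", " Paris ", "Lyon"])
```

A low confidence score together with a high entropy would mark a question as lying near the edge of what the model knows.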
[LINK]
http://arxiv.org/abs/2412.11803v1
[DATE]
2024-12-16 22:14:27+08:00
[CATEGORIES]
cs.CL
GPTKB: Comprehensively Materializing Factual LLM Knowledge
[AUTHORS]
Yujia Hu, Tuan-Phong Nguyen, Shrestha Ghosh, Simon Razniewski
[ABSTRACT]
LLMs have majorly advanced NLP and AI, and next to their ability to perform a
wide range of procedural tasks, a major success factor is their internalized
factual knowledge. Since (Petroni et al., 2019), analyzing this knowledge has
gained attention. However, most approaches investigate one question at a time
via modest-sized pre-defined samples, introducing an availability bias (Tversky
and Kahneman, 1973) that prevents the discovery of knowledge (or beliefs) of
LLMs beyond the experimenter’s predisposition.
To address this challenge, we propose a novel methodology to comprehensively
materialize an LLM’s factual knowledge through recursive querying and result
consolidation.
As a prototype, we employ GPT-4o-mini to construct GPTKB, a large-scale
knowledge base (KB) comprising 105 million triples for over 2.9 million
entities - achieved at 1% of the cost of previous KB projects. This work marks
a milestone in two areas: For LLM research, for the first time, it provides
constructive insights into the scope and structure of LLMs’ knowledge (or
beliefs). For KB construction, it pioneers new pathways for the long-standing
challenge of general-domain KB construction. GPTKB is accessible at
https://gptkb.org.
[COMMENTS]
13 pages, 4 tables, 10 figures
[LINK]
http://arxiv.org/abs/2411.04920v3
[DATE]
2024-12-16 22:05:03+08:00
[CATEGORIES]
cs.CL
MAGIC: Generating Self-Correction Guideline for In-Context Text-to-SQL
[AUTHORS]
Arian Askari, Christian Poelitz, Xinye Tang
[ABSTRACT]
Self-correction in text-to-SQL is the process of prompting a large language
model (LLM) to revise its previously incorrectly generated SQL, and commonly
relies on manually crafted self-correction guidelines by human experts that are
not only labor-intensive to produce but also limited by the human ability in
identifying all potential error patterns in LLM responses. We introduce MAGIC,
a novel multi-agent method that automates the creation of the self-correction
guideline. MAGIC uses three specialized agents: a manager, a correction, and a
feedback agent. These agents collaborate on the failures of an LLM-based method
on the training set to iteratively generate and refine a self-correction
guideline tailored to LLM mistakes, mirroring human processes but without human
involvement. Our extensive experiments show that MAGIC’s guideline outperforms
those created by human experts. We empirically find that the guideline
produced by MAGIC enhances the interpretability of the corrections made,
providing insights into the reasons behind the failures and successes of
LLMs in self-correction. We make all agent interactions publicly available to
the research community, to foster further research in this area, offering a
synthetic dataset for future explorations into automatic self-correction
guideline generation.
[COMMENTS]
Accepted at Proceedings of the Thirty-Ninth AAAI Conference on
Artificial Intelligence (AAAI 2025)
[LINK]
http://arxiv.org/abs/2406.12692v2
[DATE]
2024-12-16 21:52:51+08:00
[CATEGORIES]
cs.CL
QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
[AUTHORS]
Mohammad Aflah Khan, Neemesh Yadav, Sarah Masud, Md. Shad Akhtar
[ABSTRACT]
The rise of large language models (LLMs) has created a need for advanced
benchmarking systems beyond traditional setups. To this end, we introduce
QUENCH, a novel text-based English Quizzing Benchmark manually curated and
transcribed from YouTube quiz videos. QUENCH possesses masked entities and
rationales for the LLMs to predict via generation. At the intersection of
geographical context and common sense reasoning, QUENCH helps assess world
knowledge and deduction capabilities of LLMs via a zero-shot, open-domain
quizzing setup. We perform an extensive evaluation on 7 LLMs and 4 metrics,
investigating the influence of model size, prompting style, geographical
context, and gold-labeled rationale generation. The benchmarking concludes with
an analysis of the errors to which the LLMs are prone.
[COMMENTS]
17 Pages, 6 Figures, 8 Tables, COLING 2025
[LINK]
http://arxiv.org/abs/2412.11763v1
[DATE]
2024-12-16 21:28:29+08:00
[CATEGORIES]
cs.CL
Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation
[AUTHORS]
Boxuan Lyu, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura
[ABSTRACT]
Maximum a posteriori decoding, a commonly used method for neural machine
translation (NMT), aims to maximize the estimated posterior probability.
However, high estimated probability does not always lead to high translation
quality. Minimum Bayes Risk (MBR) decoding (Kumar and Byrne, 2004) offers
an alternative by seeking hypotheses with the highest expected utility.
Inspired by Quality Estimation (QE) reranking which uses the QE model as a
ranker (Fernandes et al., 2022), we propose source-based MBR
(sMBR) decoding, a novel approach that utilizes quasi-sources (generated via
paraphrasing or back-translation) as “support hypotheses” and a
reference-free quality estimation metric as the utility function, marking the
first work to solely use sources in MBR decoding. Experiments show that sMBR
outperforms QE reranking and the standard MBR decoding. Our findings suggest
that sMBR is a promising approach for NMT decoding.
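The selection rule behind MBR decoding can be sketched in a few lines: score each candidate by its average utility against a set of support items and keep the best. In the toy example below the supports are the candidates themselves and the utility is a token-overlap F1, both chosen only so the code runs on its own. In the source-based setting described above, the supports would instead be quasi-sources and the utility a reference-free quality estimation metric. The names `token_f1` and `mbr_select` are hypothetical.

```python
def token_f1(hypothesis, support):
    """Toy utility: F1 overlap between token sets (stand-in for a learned metric)."""
    hyp, sup = set(hypothesis.split()), set(support.split())
    overlap = len(hyp & sup)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(sup)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, supports, utility):
    """Return the candidate with the highest average utility over the supports."""
    def expected_utility(candidate):
        return sum(utility(candidate, s) for s in supports) / len(supports)

    return max(candidates, key=expected_utility)


# Illustrative candidates: two near-duplicates and one outlier.
candidates = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "dogs bark loudly",
]
best = mbr_select(candidates, candidates, token_f1)
```

The outlier is never selected, because it agrees with none of the other candidates; this is the consensus effect that MBR decoding relies on.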
[LINK]
http://arxiv.org/abs/2406.11632v3
[DATE]
2024-12-16 21:04:37+08:00
[CATEGORIES]
cs.CL
CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation
[AUTHORS]
Hongxuan Zhang, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen
[ABSTRACT]
The emergence of long-context text applications utilizing large language
models (LLMs) has presented significant scalability challenges, particularly in
memory footprint. The linear growth of the Key-Value (KV) cache responsible for
storing attention keys and values to minimize redundant computations can lead
to substantial increases in memory consumption, potentially causing models to
fail to serve with limited memory resources. To address this issue, we propose
a novel approach called Cache Sparse Representation (CSR), which converts the
KV cache by transforming the dense Key-Value cache tensor into sparse indexes
and weights, offering a more memory-efficient representation during LLM
inference. Furthermore, we introduce NeuralDict, a novel neural network-based
method for automatically generating the dictionary used in our sparse
representation. Our extensive experiments demonstrate that CSR achieves
performance comparable to state-of-the-art KV cache quantization algorithms
while maintaining robust functionality in memory-constrained environments.
[LINK]
http://arxiv.org/abs/2412.11741v1
[DATE]
2024-12-16 21:01:53+08:00
[CATEGORIES]
cs.CL
Personalized LLM for Generating Customized Responses to the Same Query from Different Users
[AUTHORS]
Hang Zeng, Chaoyue Niu, Fan Wu, Chengfei Lv, Guihai Chen
[ABSTRACT]
Existing work on large language model (LLM) personalization assigned
different responding roles to LLMs, but overlooked the diversity of questioners.
In this work, we propose a new form of questioner-aware LLM personalization,
generating different responses even for the same query from different
questioners. We design a dual-tower model architecture with a cross-questioner
general encoder and a questioner-specific encoder. We further apply contrastive
learning with multi-view augmentation, pulling close the dialogue
representations of the same questioner, while pulling apart those of different
questioners. To mitigate the impact of question diversity on
questioner-contrastive learning, we cluster the dialogues based on question
similarity and restrict the scope of contrastive learning within each cluster.
We also build a multi-questioner dataset from English and Chinese scripts and
WeChat records, called MQDialog, containing 173 questioners and 12 responders.
Extensive evaluation with different metrics shows a significant improvement in
the quality of personalized response generation.
[COMMENTS]
9 pages
[LINK]
http://arxiv.org/abs/2412.11736v1
[DATE]
2024-12-16 20:57:19+08:00
[CATEGORIES]
cs.CL
Positive Text Reframing under Multi-strategy Optimization
[AUTHORS]
Shutong Jia, Biwei Cao, Qingqing Gao, Jiuxin Cao, Bo Liu
[ABSTRACT]
Differing from sentiment transfer, positive reframing seeks to substitute
negative perspectives with positive expressions while preserving the original
meaning. With the emergence of pre-trained language models (PLMs), it is
possible to achieve acceptable results by fine-tuning PLMs. Nevertheless,
generating fluent, diverse and task-constrained reframing text remains a
significant challenge. To tackle this issue, a \textbf{m}ulti-\textbf{s}trategy
\textbf{o}ptimization \textbf{f}ramework (MSOF) is proposed in this paper.
Starting from the objective of positive reframing, we first design positive
sentiment reward and content preservation reward to encourage the model to
transform the negative expressions of the original text while ensuring the
integrity and consistency of the semantics. Then, different decoding
optimization approaches are introduced to improve the quality of text
generation. Finally, based on the modeling formula of positive reframing, we
propose a multi-dimensional re-ranking method that further selects candidate
sentences from three dimensions: strategy consistency, text similarity and
fluency. Extensive experiments on two Seq2Seq PLMs, BART and T5, demonstrate
our framework achieves significant improvements on unconstrained and controlled
positive reframing tasks.
[COMMENTS]
To appear at COLING 2025
[LINK]
http://arxiv.org/abs/2407.17940v3
[DATE]
2024-12-16 20:57:12+08:00
[CATEGORIES]
cs.CL
Findings of the WMT 2024 Shared Task on Discourse-Level Literary Translation
[AUTHORS]
Longyue Wang, Siyou Liu, Chenyang Lyu, Wenxiang Jiao, Xing Wang, Jiahao Xu, Zhaopeng Tu, Yan Gu, Weiyu Chen, Minghao Wu, Liting Zhou, Philipp Koehn, Andy Way, Yulin Yuan
[ABSTRACT]
Continuing from last year, we hosted the second edition of the WMT shared
task on Discourse-Level Literary Translation this year.
We focus on three language directions: Chinese-English, Chinese-German, and
Chinese-Russian, with the latter two newly added. This year, we received a
total of 10 submissions from 5 academic and industry teams. We employ both
automatic and human evaluations to measure the performance of the submitted
systems. The official ranking of the systems is based on the overall human
judgments. We release data, system outputs, and leaderboard at
https://www2.statmt.org/wmt24/literary-translation-task.html.
[COMMENTS]
WMT2024
[LINK]
http://arxiv.org/abs/2412.11732v1
[DATE]
2024-12-16 20:54:52+08:00
[CATEGORIES]
cs.CL
Selective Forgetting: Advancing Machine Unlearning Techniques and Evaluation in Language Models
[AUTHORS]
Lingzhi Wang, Xingshan Zeng, Jinsong Guo, Kam-Fai Wong, Georg Gottlob
[ABSTRACT]
This paper explores Machine Unlearning (MU), an emerging field that is
gaining increased attention due to concerns about neural models unintentionally
remembering personal or sensitive information. We present SeUL, a novel method
that enables selective and fine-grained unlearning for language models. Unlike
previous work that employs a fully reversed training objective in unlearning,
SeUL minimizes the negative impact on the capability of language models,
particularly in terms of generation. Furthermore, we introduce two innovative
evaluation metrics, sensitive extraction likelihood (S-EL) and sensitive
memorization accuracy (S-MA), specifically designed to assess the effectiveness
of forgetting sensitive information. In support of the unlearning framework, we
propose efficient automatic online and offline sensitive span annotation
methods. The online selection method, based on language probability scores,
ensures computational efficiency, while the offline annotation involves a
two-stage LLM-based process for robust verification. In summary, this paper
contributes a novel selective unlearning method (SeUL), introduces specialized
evaluation metrics (S-EL and S-MA) for assessing sensitive information
forgetting, and proposes automatic online and offline sensitive span annotation
methods to support the overall unlearning framework and evaluation process.
[COMMENTS]
Accepted to AAAI2025
[LINK]
http://arxiv.org/abs/2402.05813v2
[DATE]
2024-12-16 20:44:07+08:00
[CATEGORIES]
cs.CL
Attention-Seeker: Dynamic Self-Attention Scoring for Unsupervised Keyphrase Extraction
[AUTHORS]
Erwin D. López Z., Cheng Tang, Atsushi Shimada
[COMMENTS]
This version has been accepted for presentation at COLING 2025, and
all peer-reviewed changes have been incorporated
[LINK]
http://arxiv.org/abs/2409.10907v2
[DATE]
2024-12-16 20:42:07+08:00
[CATEGORIES]
cs.CL
LLMs Can Simulate Standardized Patients via Agent Coevolution
[AUTHORS]
Zhuoyun Du, Lujie Zheng, Renjun Hu, Yuyang Xu, Xiawei Li, Ying Sun, Wei Chen, Jian Wu, Haolei Cai, Haohao Ying
[ABSTRACT]
Training medical personnel using standardized patients (SPs) remains a
complex challenge, requiring extensive domain expertise and role-specific
practice. Most research on Large Language Model (LLM)-based simulated patients
focuses on improving data retrieval accuracy or adjusting prompts through human
feedback. However, this focus has overlooked the critical need for patient
agents to learn a standardized presentation pattern that transforms data into
human-like patient responses through unsupervised simulations. To address this
gap, we propose EvoPatient, a novel simulated patient framework in which a
patient agent and doctor agents simulate the diagnostic process through
multi-turn dialogues, simultaneously gathering experience to improve the
quality of both questions and answers, ultimately enabling human doctor
training. Extensive experiments on various cases demonstrate that, by providing
only overall SP requirements, our framework improves over existing reasoning
methods by more than 10% in requirement alignment and better human preference,
while achieving an optimal balance of resource consumption after evolving over
200 cases for 10 hours, with excellent generalizability. The code will be
available at https://github.com/ZJUMAI/EvoPatient.
[COMMENTS]
Work in Progress
[LINK]
http://arxiv.org/abs/2412.11716v1
[DATE]
2024-12-16 20:36:47+08:00
[CATEGORIES]
cs.CL
Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework
[AUTHORS]
Xuanming Zhang, Yuxuan Chen, Yiming Zheng, Zhexin Zhang, Yuan Yuan, Minlie Huang
[COMMENTS]
30 pages, 9 figures, submitted to ARR Dec
[LINK]
http://arxiv.org/abs/2412.11713v1
[DATE]
2024-12-16 20:35:29+08:00
[CATEGORIES]
cs.CL
MiMoTable: A Multi-scale Spreadsheet Benchmark with Meta Operations for Table Reasoning
[AUTHORS]
Zheng Li, Yang Du, Mao Zheng, Mingyang Song
[COMMENTS]
Accepted by COLING 2025
[LINK]
http://arxiv.org/abs/2412.11711v1
[DATE]
2024-12-16 20:33:12+08:00
[CATEGORIES]
cs.CL
Context Filtering with Reward Modeling in Question Answering
[AUTHORS]
Sangryul Kim, James Thorne
[COMMENTS]
Accepted Main Conference at COLING 2025
[LINK]
http://arxiv.org/abs/2412.11707v1
[DATE]
2024-12-16 20:29:24+08:00
[CATEGORIES]
cs.CL
From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs aligned with Multi-Modality
[AUTHORS]
Shixin Jiang, Jiafeng Liang, Ming Liu, Bing Qin
[ABSTRACT]
From the Specific-MLLM, which excels in single-modal tasks, to the Omni-MLLM,
which extends the range of general modalities, this evolution aims to achieve
understanding and generation of multimodal information. Omni-MLLM treats the
features of different modalities as different “foreign languages,” enabling
cross-modal interaction and understanding within a unified space. To promote
the advancement of related research, we have compiled 47 relevant papers to
provide the community with a comprehensive introduction to Omni-MLLM. We first
explain the four core components of Omni-MLLM for unified modeling and
interaction of multiple modalities. Next, we introduce the effective
integration achieved through “alignment pretraining” and “instruction
fine-tuning,” and discuss open-source datasets and testing of interaction
capabilities. Finally, we summarize the main challenges facing current
Omni-MLLM and outline future directions.
[COMMENTS]
13 pages
[LINK]
http://arxiv.org/abs/2412.11694v1
[DATE]
2024-12-16 20:12:45+08:00
[CATEGORIES]
cs.CL
cs.LG
Mixture of Hidden-Dimensions Transformer
[AUTHORS]
Yilong Chen, Junyuan Shang, Zhengyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
[ABSTRACT]
Transformer models encounter challenges in scaling hidden dimensions
efficiently, as uniformly increasing them inflates computational and memory
costs while failing to emphasize the most relevant features for each token. For
further understanding, we study hidden dimension sparsity and observe that
trained Transformers utilize only a small fraction of token dimensions,
revealing an “activation flow” pattern. Notably, there are shared
sub-dimensions with sustained activation across multiple consecutive tokens and
specialized sub-dimensions uniquely activated for each token. To better model
token-relevant sub-dimensions, we propose MoHD (Mixture of Hidden Dimensions),
a sparse conditional activation architecture. Particularly, MoHD employs shared
sub-dimensions for common token features and a routing mechanism to dynamically
activate specialized sub-dimensions. To mitigate potential information loss
from sparsity, we design activation scaling and group fusion mechanisms to
preserve activation flow. In this way, MoHD expands hidden dimensions with
negligible increases in computation or parameters, enabling efficient training
and inference while maintaining performance. Evaluations across 10 NLP tasks show
that MoHD surpasses Vanilla Transformers in parameter efficiency and task
performance. It achieves 1.7% higher performance with 50% fewer activation
parameters and 3.7% higher performance with a 3x parameter expansion at
constant activation cost. MoHD offers a new perspective for scaling the model,
showcasing the potential of hidden dimension sparsity to boost efficiency.
[COMMENTS]
16 pages, 10 figures, 5 tables
[LINK]
http://arxiv.org/abs/2412.05644v3
[DATE]
2024-12-16 20:12:19+08:00
[CATEGORIES]
cs.CL
Multilingual and Explainable Text Detoxification with Parallel Corpora
[AUTHORS]
Daryna Dementieva, Nikolay Babakov, Amit Ronen, Abinew Ali Ayele, Naquee Rizwan, Florian Schneider, Xintong Wang, Seid Muhie Yimam, Daniil Moskovskiy, Elisei Stakovskii, Eran Kaufman, Ashraf Elnagar, Animesh Mukherjee, Alexander Panchenko
[COMMENTS]
COLING 2025, main conference, long
[LINK]
http://arxiv.org/abs/2412.11691v1
[DATE]
2024-12-16 20:08:59+08:00
[CATEGORIES]
cs.CL
L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models
[AUTHORS]
Hyesung Jeon, Yulhwa Kim, Jae-joon Kim
[ABSTRACT]
Due to the high memory and computational costs associated with large language
models (LLMs), model compression techniques such as quantization, which reduces
inference costs, and parameter-efficient fine-tuning (PEFT) methods like
Low-Rank Adaptation (LoRA), which reduce training costs, have gained
significant popularity. This trend has spurred active research into
quantization-aware PEFT techniques, aimed at maintaining model accuracy while
minimizing memory overhead during both inference and training. Previous
quantization-aware PEFT methods typically apply post-training quantization
(PTQ) to pre-trained LLMs, followed by PEFT to recover accuracy loss.
However, this approach has limitations in recovering the accuracy loss. In
this paper, we propose L4Q, a method that integrates Quantization-Aware
Training (QAT) with LoRA. By employing a memory-optimized layer design, L4Q
significantly reduces QAT’s memory overhead, making its training cost
comparable to LoRA, while preserving the advantage of QAT in producing fully
quantized LLMs with high accuracy. Our experiments demonstrate that this
combined approach to quantization and fine-tuning achieves superior accuracy
compared to decoupled fine-tuning schemes, particularly in 4-bit and 3-bit
quantization, positioning L4Q as an efficient QAT solution. Using the LLaMA and
Mistral models with instructional datasets, we showcase L4Q’s capabilities in
language tasks and few-shot learning.
[COMMENTS]
8 pages, 4 figures, 3 tables
[LINK]
http://arxiv.org/abs/2402.04902v5
[DATE]
2024-12-16 20:06:53+08:00
[CATEGORIES]
cs.LG
cs.CL
P$^2$ Law: Scaling Law for Post-Training After Model Pruning
[AUTHORS]
Xiaodong Chen, Yuxuan Hu, Xiaokang Zhang, Yanling Wang, Cuiping Li, Hong Chen, Jing Zhang
[ABSTRACT]
Pruning has become a widely adopted technique for reducing the hardware
requirements of large language models (LLMs). To recover model performance
after pruning, post-training is commonly employed to mitigate the resulting
performance degradation. While post-training benefits from larger datasets,
once the dataset size is already substantial, increasing the training data
provides only limited performance gains. To balance post-training cost and
model performance, it is necessary to explore the optimal amount of
post-training data. Through extensive experiments on the Llama-3 and Qwen-2.5
series models, pruned using various common pruning methods, we uncover the
scaling \textbf{Law} for \textbf{P}ost-training after model \textbf{P}runing,
referred to as the P$^2$ Law. This law identifies four key factors for
predicting the pruned model’s post-training loss: the model size before
pruning, the number of post-training tokens, the pruning rate, and the model’s
loss before pruning. Moreover, P$^2$ Law can generalize to larger dataset
sizes, larger model sizes, and higher pruning rates, offering valuable insights
for the post-training of pruned LLMs.
[LINK]
http://arxiv.org/abs/2411.10272v2
[DATE]
2024-12-16 20:00:34+08:00
[CATEGORIES]
cs.CL
cs.LG
Against All Odds: Overcoming Typology, Script, and Language Confusion in Multilingual Embedding Inversion Attacks
[AUTHORS]
Yiyi Chen, Russa Biswas, Heather Lent, Johannes Bjerva
[ABSTRACT]
Large Language Models (LLMs) are susceptible to malicious influence by cyber
attackers through intrusions such as adversarial, backdoor, and embedding
inversion attacks. In response, the burgeoning field of LLM Security aims to
study and defend against such threats. Thus far, the majority of works in this
area have focused on monolingual English models; however, emerging research
suggests that multilingual LLMs may be more vulnerable to various attacks than
their monolingual counterparts. While previous work has investigated embedding
inversion over a small subset of European languages, it is challenging to
extrapolate these findings to languages from different linguistic families and
with differing scripts. To this end, we explore the security of multilingual
LLMs in the context of embedding inversion attacks and investigate
cross-lingual and cross-script inversion across 20 languages, spanning over 8
language families and 12 scripts. Our findings indicate that languages written
in Arabic script and Cyrillic script are particularly vulnerable to embedding
inversion, as are languages within the Indo-Aryan language family. We further
observe that inversion models tend to suffer from language confusion, sometimes
greatly reducing the efficacy of an attack. Accordingly, we systematically
explore this bottleneck for inversion models, uncovering predictable patterns
which could be leveraged by attackers. Ultimately, this study aims to further
the field’s understanding of the outstanding security vulnerabilities facing
multilingual LLMs and raise awareness for the languages most at risk of
negative impact from these attacks.
[COMMENTS]
11 pages, 4 figures, 7 tables
[LINK]
http://arxiv.org/abs/2408.11749v2
[DATE]
2024-12-16 19:53:09+08:00
[CATEGORIES]
cs.CL
Bias Vector: Mitigating Biases in Language Models with Task Arithmetic Approach
[AUTHORS]
Daiki Shirafuji, Makoto Takenaka, Shinya Taguchi
[ABSTRACT]
The use of language models (LMs) has increased considerably in recent years,
and the biases and stereotypes in training data that are reflected in the LM
outputs are causing social problems. In this paper, inspired by task
arithmetic, we propose the “Bias Vector” method for the mitigation of these
LM biases. The Bias Vector method does not require manually created debiasing
data. The three main steps of our approach involve: (1) continually training the
pre-trained LMs on biased data using masked language modeling; (2) constructing
the Bias Vector as the difference between the weights of the biased LMs and
those of pre-trained LMs; and (3) subtracting the Bias Vector from the weights
of the pre-trained LMs for debiasing. We evaluated the Bias Vector method on
the SEAT across three LMs and confirmed an average improvement of 0.177 points.
We demonstrated that the Bias Vector method does not degrade the LM performance
on downstream tasks in the GLUE benchmark. In addition, we examined the impact
of scaling factors, which control the magnitudes of Bias Vectors, with effect
sizes on the SEAT and conducted a comprehensive evaluation of our debiased LMs
across both the SEAT and GLUE benchmarks.
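Steps (2) and (3) above amount to simple weight arithmetic. The following is a minimal sketch of that arithmetic only, assuming model weights are represented as plain dictionaries of float lists; the names `bias_vector`, `debias` and `scale` are illustrative assumptions and are not taken from the paper's code.

```python
from typing import Dict, List

# Assumption: weights are stored as {parameter name: flattened values}.
Weights = Dict[str, List[float]]


def bias_vector(biased: Weights, pretrained: Weights) -> Weights:
    """Step (2): element-wise difference (biased weights - pre-trained weights)."""
    return {
        name: [b - p for b, p in zip(biased[name], pretrained[name])]
        for name in pretrained
    }


def debias(pretrained: Weights, vector: Weights, scale: float = 1.0) -> Weights:
    """Step (3): subtract the scaled Bias Vector from the pre-trained weights."""
    return {
        name: [p - scale * v for p, v in zip(pretrained[name], vector[name])]
        for name in pretrained
    }


# Toy values chosen only for illustration.
pretrained = {"layer.weight": [1.0, 2.0, 3.0]}
biased = {"layer.weight": [1.5, 2.0, 2.0]}

vec = bias_vector(biased, pretrained)
debiased = debias(pretrained, vec, scale=1.0)
half_debiased = debias(pretrained, vec, scale=0.5)
```

Here `scale` plays the role of the scaling factor mentioned in the abstract: a value of 0 leaves the pre-trained weights unchanged, and larger values subtract more of the Bias Vector.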
[COMMENTS]
Accepted to COLING2025
[LINK]
http://arxiv.org/abs/2412.11679v1
[DATE]
2024-12-16 19:38:23+08:00
[CATEGORIES]
cs.CL
Prompto: An open source library for asynchronous querying of LLM endpoints
[AUTHORS]
Ryan Sze-Yin Chan, Federico Nanni, Angus R. Williams, Edwin Brown, Liam Burke-Moore, Ed Chapman, Kate Onslow, Tvesha Sippy, Jonathan Bright, Evelina Gabasova
[ABSTRACT]
The recent surge in Large Language Model (LLM) availability has opened exciting
avenues for research. However, efficiently interacting with these models
presents a significant hurdle since LLMs often reside on proprietary or
self-hosted API endpoints, each requiring custom code for interaction.
Conducting comparative studies between different models can therefore be
time-consuming and necessitate significant engineering effort, hindering
research efficiency and reproducibility. To address these challenges, we
present prompto, an open source Python library which facilitates asynchronous
querying of LLM endpoints enabling researchers to interact with multiple LLMs
concurrently, while maximising efficiency and utilising individual rate limits.
Our library empowers researchers and developers to interact with LLMs more
effectively and allows faster experimentation, data generation and
evaluation. prompto is released with an introductory video
(https://youtu.be/lWN9hXBOLyQ) under MIT License and is available via GitHub
(https://github.com/alan-turing-institute/prompto).
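The general idea of querying several endpoints concurrently while respecting a per-endpoint limit can be sketched with the Python standard library alone. This is not prompto's API: `fake_endpoint`, `query_all` and the endpoint names are hypothetical, and the sketch caps the number of in-flight requests per endpoint, which is a simplification of a true requests-per-minute rate limit.

```python
import asyncio
from typing import Dict, List, Tuple


async def fake_endpoint(api: str, prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"{api}:{prompt.upper()}"


async def query_all(
    jobs: List[Tuple[str, str]], max_concurrent: Dict[str, int]
) -> List[str]:
    """Run every (api, prompt) job concurrently, capping in-flight requests per API."""
    limits = {api: asyncio.Semaphore(n) for api, n in max_concurrent.items()}

    async def run(api: str, prompt: str) -> str:
        async with limits[api]:  # wait for a free slot on this endpoint
            return await fake_endpoint(api, prompt)

    # gather returns results in the same order as the submitted jobs
    return await asyncio.gather(*(run(api, prompt) for api, prompt in jobs))


jobs = [("model_a", "hello"), ("model_b", "hello"), ("model_a", "bye")]
results = asyncio.run(query_all(jobs, {"model_a": 1, "model_b": 2}))
```

Because each endpoint has its own semaphore, a slow or tightly limited endpoint does not block requests to the others.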
[LINK]
http://arxiv.org/abs/2408.11847v2
[DATE]
2024-12-16 19:26:21+08:00
[CATEGORIES]
cs.CL
How Reliable are LLMs as Knowledge Bases? Re-thinking Factuality and Consistency
[AUTHORS]
Danna Zheng, Mirella Lapata, Jeff Z. Pan
[LINK]
http://arxiv.org/abs/2407.13578v2
[DATE]
2024-12-16 19:23:14+08:00
[CATEGORIES]
cs.CL
Steering Language Models with Game-Theoretic Solvers
[AUTHORS]
Ian Gemp, Roma Patel, Yoram Bachrach, Marc Lanctot, Vibhavari Dasagi, Luke Marris, Georgios Piliouras, Siqi Liu, Karl Tuyls
[ABSTRACT]
Mathematical models of interactions among rational agents have long been
studied in game theory. However, these interactions are often over a small set
of discrete game actions which is very different from how humans communicate in
natural language. To bridge this gap, we introduce a framework that allows
equilibrium solvers to work over the space of natural language dialogue
generated by large language models (LLMs). Specifically, by modelling the
players, strategies and payoffs in a “game” of dialogue, we create a binding
from natural language interactions to the conventional symbolic logic of game
theory. Given this binding, we can ask existing game-theoretic algorithms to
provide us with strategic solutions (e.g., what string an LLM should generate
to maximize payoff in the face of strategic partners or opponents), giving us
predictors of stable, rational conversational strategies. We focus on three
domains that require different negotiation strategies: scheduling meetings,
trading fruit and debate, and evaluate an LLM’s generated language when guided
by solvers. We see that LLMs that follow game-theory solvers result in dialogue
generations that are less exploitable than the control (no guidance from
solvers), and the language generated results in higher rewards, in all
negotiation domains. We discuss future implications of this work, and how
game-theoretic solvers that can leverage the expressivity of natural language
can open up a new avenue of guiding language research.
[COMMENTS]
Code available @
https://github.com/google-deepmind/open_spiel/blob/master/open_spiel/python/games/chat_game.py
[LINK]
http://arxiv.org/abs/2402.01704v3
[DATE]
2024-12-16 19:03:31+08:00
[CATEGORIES]
cs.CL
Self-Adaptive Paraphrasing and Preference Learning for Improved Claim Verifiability
[AUTHORS]
Amelie Wührl, Roman Klinger
[ABSTRACT]
In fact-checking, structure and phrasing of claims critically influence a
model’s ability to predict verdicts accurately. Social media content in
particular rarely serves as optimal input for verification systems, which
necessitates pre-processing to extract the claim from noisy context before fact
checking. Prior work suggests extracting a claim representation that humans
find to be checkworthy and verifiable. This has two limitations: (1) the format
may not be optimal for a fact-checking model, and (2) it requires annotated
data to learn the extraction task from. We address both issues and propose a
method to extract claims that is not reliant on labeled training data. Instead,
our self-adaptive approach only requires a black-box fact checking model and a
generative language model (LM). Given a tweet, we iteratively optimize the LM
to generate a claim paraphrase that increases the performance of a fact
checking model. By learning from preference pairs, we align the LM to the fact
checker using direct preference optimization. We show that this novel setup
extracts claim paraphrases that are more verifiable than their original social
media formulations, and is on par with competitive baselines. For refuted
claims, our method consistently outperforms all baselines.
[COMMENTS]
Under review at ACL ARR
[LINK]
http://arxiv.org/abs/2412.11653v1
[DATE]
2024-12-16 18:54:57+08:00
[CATEGORIES]
cs.CL
SE-GCL: An Event-Based Simple and Effective Graph Contrastive Learning for Text Representation
[AUTHORS]
Tao Meng, Wei Ai, Jianbin Li, Ze Wang, Yuntao Shou, Keqin Li
[ABSTRACT]
Text representation learning is significant as the cornerstone of natural
language processing. In recent years, graph contrastive learning (GCL) has been
widely used in text representation learning due to its ability to represent and
capture complex text information in a self-supervised setting. However, current
mainstream graph contrastive learning methods often require the incorporation
of domain knowledge or cumbersome computations to guide the data augmentation
process, which significantly limits the application efficiency and scope of
GCL. Additionally, many methods learn text representations only by constructing
word-document relationships, which overlooks the rich contextual semantic
information in the text. To address these issues and exploit representative
textual semantics, we present an event-based, simple, and effective graph
contrastive learning (SE-GCL) for text representation. Precisely, we extract
event blocks from text and construct internal relation graphs to represent
inter-semantic interconnections, which can ensure that the most critical
semantic information is preserved. Then, we devise a streamlined, unsupervised
graph contrastive learning framework to leverage the complementary nature of
the event semantic and structural information for intricate feature data
capture. In particular, we introduce the concept of an event skeleton for core
representation semantics and simplify the typically complex data augmentation
techniques found in existing graph contrastive learning to boost algorithmic
efficiency. We employ multiple loss functions to prompt diverse embeddings to
converge or diverge within a confined distance in the vector space, ultimately
achieving a harmonious equilibrium. We conducted experiments on the proposed
SE-GCL on four standard data sets (AG News, 20NG, SougouNews, and THUCNews) to
verify its effectiveness in text representation learning.
[COMMENTS]
19 pages, 6 tables
[LINK]
http://arxiv.org/abs/2412.11652v1
[DATE]
2024-12-16 18:53:24+08:00
[CATEGORIES]
cs.CL
SAM Decoding: Speculative Decoding via Suffix Automaton
[AUTHORS]
Yuxuan Hu, Ke Wang, Xiaokang Zhang, Fanjin Zhang, Cuiping Li, Hong Chen, Jing Zhang
[ABSTRACT]
Speculative decoding (SD) has been demonstrated as an effective technique for
lossless LLM inference acceleration. Retrieval-based SD methods, one kind of
model-free method, have yielded promising speedup, but they often rely on
incomplete retrieval resources, inefficient retrieval methods, and are
constrained to certain domains. This paper presents a novel retrieval-based
speculative decoding method that adapts suffix automaton (SAM) for efficient
and accurate draft generation by utilizing common text corpus and dynamic text
sequence. Unlike existing $n$-gram matching methods, SAM-Decoding finds the
exact longest suffix match, achieving an average time complexity of O(1) per
generation step of SAM update and suffix retrieval. It can also integrate with
existing methods, adaptively selecting a draft generation strategy based on
match length to generalize to broader domains. Extensive experiments on
Spec-Bench show that our method is 18%+ faster than other retrieval-based SD
methods. Additionally, when combined with advanced EAGLE-2, it provides an
additional speedup of 3.28% to 11.13% across various-sized LLM backbones.
Our code is available at https://github.com/hyx1999/SAM-Decoding.
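To illustrate the core data structure, the following is a minimal sketch of a textbook suffix automaton used to find the longest suffix of the current context that occurs in a reference corpus, and to propose the tokens that followed it as a draft. It is not the authors' implementation: the class and method names are assumptions, and the draft is taken from the first occurrence of the match, which is a simplification.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


class SuffixAutomaton:
    """Suffix automaton over a token sequence, built online one token at a time."""

    def __init__(self) -> None:
        self.tokens: List[Hashable] = []
        self.next: List[Dict[Hashable, int]] = [{}]  # transitions per state
        self.link: List[int] = [-1]                  # suffix links
        self.length: List[int] = [0]                 # longest string per state
        self.first_end: List[int] = [0]              # end index of first occurrence
        self.last = 0

    def _new_state(self, length: int, link: int,
                   trans: Dict[Hashable, int], first_end: int) -> int:
        self.next.append(trans)
        self.link.append(link)
        self.length.append(length)
        self.first_end.append(first_end)
        return len(self.next) - 1

    def extend(self, token: Hashable) -> None:
        """Append one token to the indexed corpus (amortised O(1))."""
        self.tokens.append(token)
        new_len = self.length[self.last] + 1
        cur = self._new_state(new_len, -1, {}, new_len)
        p = self.last
        while p != -1 and token not in self.next[p]:
            self.next[p][token] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][token]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps the shorter strings of q's class.
                clone = self._new_state(self.length[p] + 1, self.link[q],
                                        dict(self.next[q]), self.first_end[q])
                while p != -1 and self.next[p].get(token) == q:
                    self.next[p][token] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def draft(self, context: Sequence[Hashable], k: int) -> Tuple[int, List[Hashable]]:
        """Return (length of longest matching suffix, up to k draft tokens)."""
        state, matched = 0, 0
        for token in context:
            # Shorten the match via suffix links until the token can be consumed.
            while state != 0 and token not in self.next[state]:
                state = self.link[state]
                matched = self.length[state]
            if token in self.next[state]:
                state = self.next[state][token]
                matched += 1
            else:
                matched = 0  # no suffix ending in this token occurs in the corpus
        if matched == 0:
            return 0, []
        end = self.first_end[state]
        return matched, self.tokens[end:end + k]


sam = SuffixAutomaton()
for tok in "abcabd":  # characters stand in for tokens
    sam.extend(tok)

match_len, draft_tokens = sam.draft(list("xab"), k=2)
```

In this toy example the longest suffix of the context `xab` found in the corpus is `ab`, so the draft consists of the tokens that followed its first occurrence. The match length could then be used to decide whether to trust the draft or fall back to another drafting strategy, as the abstract describes.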
[COMMENTS]
16 pages, 9 figures, 9 tables
[LINK]
http://arxiv.org/abs/2411.10666v3
[DATE]
2024-12-16 18:48:28+08:00
[CATEGORIES]
cs.CL
Fool Me, Fool Me: User Attitudes Toward LLM Falsehoods
[AUTHORS]
Diana Bar-Or Nirman, Ariel Weizman, Amos Azaria
[ABSTRACT]
While Large Language Models (LLMs) have become central tools in various
fields, they often provide inaccurate or false information. This study examines
user preferences regarding falsehood responses from LLMs. Specifically, we
evaluate preferences for LLM responses where false statements are explicitly
marked versus unmarked responses and preferences for confident falsehoods
compared to LLM disclaimers acknowledging a lack of knowledge. Additionally, we
investigate how requiring users to assess the truthfulness of statements
influences these preferences.
Surprisingly, 61% of users prefer unmarked falsehood responses over marked
ones, and 69% prefer confident falsehoods over LLMs admitting lack of
knowledge. In all our experiments, a total of 300 users participated,
contributing valuable data to our analysis and conclusions. When users are
required to evaluate the truthfulness of statements, preferences for unmarked
and falsehood responses decrease slightly but remain high. These findings
suggest that user preferences, which influence LLM training via feedback
mechanisms, may inadvertently encourage the generation of falsehoods. Future
research should address the ethical and practical implications of aligning LLM
behavior with such preferences.
[COMMENTS]
11 pages, 5 figures, 5 tables
[LINK]
http://arxiv.org/abs/2412.11625v1
[DATE]
2024-12-16 18:10:27+08:00
[CATEGORIES]
cs.CL
DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis
[AUTHORS]
Pan Wang, Qiang Zhou, Yawen Wu, Tianlong Chen, Jingtong Hu
[ABSTRACT]
Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such
as language, vision, and audio, to enhance the understanding of human
sentiment. While existing models often focus on extracting shared information
across modalities or directly fusing heterogeneous modalities, such approaches
can introduce redundancy and conflicts due to equal treatment of all modalities
and the mutual transfer of information between modality pairs. To address these
issues, we propose a Disentangled-Language-Focused (DLF) multimodal
representation learning framework, which incorporates a feature disentanglement
module to separate modality-shared and modality-specific information. To
further reduce redundancy and enhance language-targeted features, four
geometric measures are introduced to refine the disentanglement process. A
Language-Focused Attractor (LFA) is further developed to strengthen language
representation by leveraging complementary modality-specific information
through a language-guided cross-attention mechanism. The framework also employs
hierarchical predictions to improve overall accuracy. Extensive experiments on
two popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant
performance gains achieved by the proposed DLF framework. Comprehensive
ablation studies further validate the effectiveness of the feature
disentanglement module, language-focused attractor, and hierarchical
predictions. Our code is available at https://github.com/pwang322/DLF.
[COMMENTS]
AAAI 2025 accepted
[LINK]
http://arxiv.org/abs/2412.12225v1
[DATE]
2024-12-16 18:03:44+08:00
[CATEGORIES]
cs.LG
cs.CL
MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation
[AUTHORS]
Javier García Gilabert, Carlos Escolano, Audrey Mash, Xixian Liao, Maite Melero
[ABSTRACT]
We introduce MT-LENS, a framework designed to evaluate Machine Translation
(MT) systems across a variety of tasks, including translation quality, gender
bias detection, added toxicity, and robustness to misspellings. While several
toolkits have become very popular for benchmarking the capabilities of Large
Language Models (LLMs), existing evaluation tools often lack the ability to
thoroughly assess the diverse aspects of MT performance. MT-LENS addresses
these limitations by extending the capabilities of LM-eval-harness for MT,
supporting state-of-the-art datasets and a wide range of evaluation metrics. It
also offers a user-friendly platform to compare systems and analyze
translations with interactive visualizations. MT-LENS aims to broaden access to
evaluation strategies that go beyond traditional translation quality
evaluation, enabling researchers and engineers to better understand the
performance of an NMT model and also easily measure a system’s biases.
[COMMENTS]
6 pages, 2 figures
[LINK]
http://arxiv.org/abs/2412.11615v1
[DATE]
2024-12-16 17:57:28+08:00
[CATEGORIES]
cs.CL
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
[AUTHORS]
Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
[ABSTRACT]
Instruction-following is a fundamental capability of language models,
requiring the model to recognize even the most subtle requirements in the
instructions and accurately reflect them in its output. Such an ability is
well-suited for and often optimized by preference learning. However, existing
methods often directly sample multiple independent responses from the model
when creating preference pairs. Such practice can introduce content variations
irrelevant to whether the instruction is precisely followed (e.g., different
expressions about the same semantic), interfering with the goal of teaching
models to recognize the key differences that lead to improved instruction
following. In light of this, we introduce SPaR, a self-play framework
integrating tree-search self-refinement to yield valid and comparable
preference pairs free from distractions. By playing against itself, an LLM
employs a tree-search strategy to refine its previous responses with respect to
the instruction while minimizing unnecessary variations. Our experiments show
that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses
GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
Furthermore, SPaR demonstrates promising scalability and transferability,
greatly enhancing models like GLM-4-9B and LLaMA3-70B. We also identify how
inference scaling in tree search would impact model performance. Our code and
data are publicly available at https://github.com/thu-coai/SPaR.
[LINK]
http://arxiv.org/abs/2412.11605v1
[DATE]
2024-12-16 17:47:43+08:00
[CATEGORIES]
cs.CL
cs.LG
Improvement in Sign Language Translation Using Text CTC Alignment
[AUTHORS]
Sihan Tan, Taro Miyazaki, Nabeela Khan, Kazuhiro Nakadai
[ABSTRACT]
Current sign language translation (SLT) approaches often rely on gloss-based
supervision with Connectionist Temporal Classification (CTC), limiting their
ability to handle non-monotonic alignments between sign language video and
spoken text. In this work, we propose a novel method combining joint
CTC/Attention and transfer learning. The joint CTC/Attention introduces
hierarchical encoding and integrates CTC with the attention mechanism during
decoding, effectively managing both monotonic and non-monotonic alignments.
Meanwhile, transfer learning helps bridge the modality gap between vision and
language in SLT. Experimental results on two widely adopted benchmarks,
RWTH-PHOENIX-Weather 2014 T and CSL-Daily, show that our method achieves
results comparable to the state of the art and outperforms the pure-attention
baseline. Additionally, this work opens a new door for future research into
gloss-free SLT using text-based CTC alignment.
[LINK]
http://arxiv.org/abs/2412.09014v3
[DATE]
2024-12-16 17:02:51+08:00
[CATEGORIES]
cs.CL
AUEB-Archimedes at RIRAG-2025: Is obligation concatenation really all you need?
[AUTHORS]
Ioannis Chasandras, Odysseas S. Chlapanis, Ion Androutsopoulos
[COMMENTS]
RIRAG 2025 Shared-Task at RegNLP workshop collocated with COLING 2025
[LINK]
http://arxiv.org/abs/2412.11567v1
[DATE]
2024-12-16 16:54:21+08:00
[CATEGORIES]
cs.CL
Intention Analysis Makes LLMs A Good Jailbreak Defender
[AUTHORS]
Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2401.06561v4
[DATE]
2024-12-16 16:43:24+08:00
[CATEGORIES]
cs.CL
Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs
[AUTHORS]
Yuchen Fu, Zifeng Cheng, Zhiwei Jiang, Zhonghui Wang, Yafeng Yin, Zhengliang Li, Qing Gu
[ABSTRACT]
Extracting sentence embeddings from large language models (LLMs) is a
promising direction, as LLMs have demonstrated stronger semantic understanding
capabilities. Previous studies typically focus on prompt engineering to elicit
sentence embeddings from LLMs by prompting the model to encode sentence
information into the embedding of the last token. However, LLMs are mostly
decoder-only models with causal attention and the earlier tokens in the
sentence cannot attend to the latter tokens, resulting in biased encoding of
sentence information and cascading effects on the final decoded token. To this
end, we propose a novel Token Prepending (TP) technique that prepends each
layer’s decoded sentence embedding to the beginning of the sentence in the next
layer’s input, allowing earlier tokens to attend to the complete sentence
information under the causal attention mechanism. The proposed TP technique is
a plug-and-play and training-free technique, which means it can be seamlessly
integrated with various prompt-based sentence embedding methods and
autoregressive LLMs. Extensive experiments on various Semantic Textual
Similarity (STS) tasks and downstream classification tasks demonstrate that our
proposed TP technique can significantly improve the performance of existing
prompt-based sentence embedding methods across different LLMs, while incurring
negligible additional inference cost.
[COMMENTS]
14 pages, 5 figures
[LINK]
http://arxiv.org/abs/2412.11556v1
[DATE]
2024-12-16 16:42:00+08:00
[CATEGORIES]
cs.CL
Error Diversity Matters: An Error-Resistant Ensemble Method for Unsupervised Dependency Parsing
[AUTHORS]
Behzad Shayegh, Hobie H. -B. Lee, Xiaodan Zhu, Jackie Chi Kit Cheung, Lili Mou
[ABSTRACT]
We address unsupervised dependency parsing by building an ensemble of diverse
existing models through post hoc aggregation of their output dependency parse
structures. We observe that these ensembles often suffer from low robustness
against weak ensemble components due to error accumulation. To tackle this
problem, we propose an efficient ensemble-selection approach that avoids error
accumulation. Results demonstrate that our approach outperforms each individual
model as well as previous ensemble techniques. Additionally, our experiments
show that the proposed ensemble-selection method significantly enhances the
performance and robustness of our ensemble, surpassing previously proposed
strategies, which have not accounted for error diversity.
[COMMENTS]
Accepted by the AAAI Conference on Artificial Intelligence (AAAI) 2025
[LINK]
http://arxiv.org/abs/2412.11543v1
[DATE]
2024-12-16 16:23:50+08:00
[CATEGORIES]
cs.CL
cs.LG
RankAdaptor: Hierarchical Rank Allocation for Efficient Fine-Tuning Pruned LLMs via Performance Model
[AUTHORS]
Changhai Zhou, Shijie Han, Lining Yang, Yuhua Zhou, Xu Cheng, Yibin Wang, Hongguang Li
[ABSTRACT]
The efficient compression of large language models (LLMs) has become
increasingly popular. However, recovering the performance of compressed LLMs
remains a major challenge. The current practice in LLM compression entails the
implementation of structural pruning, complemented by a recovery phase that
leverages the Low-Rank Adaptation (LoRA) algorithm. Structural pruning’s uneven
modification of model architecture, coupled with standard LoRA’s fixed
configuration allocation across layers in an online pipeline, leads to
suboptimal performance in various downstream tasks for pruned models. To
address this challenge, we introduce RankAdaptor, a hierarchical rank
allocation method that enables efficient fine-tuning of pruned LLMs according
to layerwise specific recovery requirements. We employ a performance model that
conducts offline meta-learning and online incremental learning to explore
optimal rank values for each layer. Comprehensive experiments on popular
benchmarks show that RankAdaptor consistently outperforms state-of-the-art
methods across a variety of pruning settings and LLM architectures, with
improvements ranging from 0.7% to 5.5%.
[LINK]
http://arxiv.org/abs/2406.15734v2
[DATE]
2024-12-16 16:19:26+08:00
[CATEGORIES]
cs.CL
Specifications: The missing link to making the development of LLM systems an engineering discipline
[AUTHORS]
Ion Stoica, Matei Zaharia, Joseph Gonzalez, Ken Goldberg, Koushik Sen, Hao Zhang, Anastasios Angelopoulos, Shishir G. Patil, Lingjiao Chen, Wei-Lin Chiang, Jared Q. Davis
[ABSTRACT]
Despite the significant strides made by generative AI in just a few short
years, its future progress is constrained by the challenge of building modular
and robust systems. This capability has been a cornerstone of past
technological revolutions, which relied on combining components to create
increasingly sophisticated and reliable systems. Cars, airplanes, computers,
and software consist of components (such as engines, wheels, CPUs, and
libraries) that can be assembled, debugged, and replaced. A key tool for
building such reliable and modular systems is specification: the precise
description of the expected behavior, inputs, and outputs of each component.
However, the generality of LLMs and the inherent ambiguity of natural language
make defining specifications for LLM-based components (e.g., agents) both a
challenging and urgent problem. In this paper, we discuss the progress the
field has made so far (through advances like structured outputs, process
supervision, and test-time compute) and outline several future directions for
research to enable the development of modular and reliable LLM-based systems
through improved specifications.
[LINK]
http://arxiv.org/abs/2412.05299v2
[DATE]
2024-12-16 16:17:09+08:00
[CATEGORIES]
cs.CL
Let your LLM generate a few tokens and you will reduce the need for retrieval
[AUTHORS]
Hervé Déjean
[ABSTRACT]
In this paper, we investigate how efficiently large language models (LLMs) can
be trained to check whether an answer is already stored in their parametric
memory. We distill an LLM-as-a-judge to compute the IK (I Know) score. We found
that this method is particularly beneficial in the context of
retrieval-assisted augmented generation (RAG), with a respectable accuracy of
80%. It enables a significant reduction (more than 50%) in the number of search
and reranking steps required for certain data sets. We have also introduced the
IK score, which serves as a useful tool for characterising datasets by
facilitating the classification task. Interestingly, through the inclusion of
response tokens as input, our results suggest that only about 20,000 training
samples are required to achieve good performance. The central element of this
work is the use of a teacher model - the LLM as a judge - to generate training
data. We also assess the robustness of the IK classifier by evaluating it with
various types of teachers, including both string-based methods and LLMs, with
the latter providing better results.
[LINK]
http://arxiv.org/abs/2412.11536v1
[DATE]
2024-12-16 16:13:14+08:00
[CATEGORIES]
cs.CL
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
[AUTHORS]
Di Zhang, Junxian Li, Jingdi Lei, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, Peng Ye, Wanli Ouyang, Dongzhan Zhou
[ABSTRACT]
Vision-language models (VLMs) have shown remarkable advancements in
multimodal reasoning tasks. However, they still often generate inaccurate or
irrelevant responses due to issues like hallucinated image understandings or
unrefined reasoning paths. To address these challenges, we introduce Critic-V,
a novel framework inspired by the Actor-Critic paradigm to boost the reasoning
capability of VLMs. This framework decouples the reasoning process and critic
process by integrating two independent components: the Reasoner, which
generates reasoning paths based on visual and textual inputs, and the Critic,
which provides constructive critique to refine these paths. In this approach,
the Reasoner generates reasoning responses according to text prompts, which can
evolve iteratively as a policy based on feedback from the Critic. This
interaction process was theoretically driven by a reinforcement learning
framework where the Critic offers natural language critiques instead of scalar
rewards, enabling more nuanced feedback to boost the Reasoner’s capability on
complex reasoning tasks. The Critic model is trained using Direct Preference
Optimization (DPO), leveraging a preference dataset of critiques ranked by
Rule-based Reward (RBR) to enhance its critic capabilities. Evaluation results
show that the Critic-V framework significantly outperforms existing methods,
including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning
accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner
and constructive feedback from the preference-optimized Critic enables a more
reliable and context-sensitive multimodal reasoning process. Our approach
provides a promising solution to enhance the reliability of VLMs, improving
their performance in real-world reasoning-heavy multimodal applications such as
autonomous driving and embodied intelligence.
[COMMENTS]
16 pages, 11 figures
[LINK]
http://arxiv.org/abs/2411.18203v3
[DATE]
2024-12-16 16:12:17+08:00
[CATEGORIES]
cs.CL
MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators
[AUTHORS]
Qingyu Lu, Liang Ding, Kanjian Zhang, Jinxia Zhang, Dacheng Tao
[ABSTRACT]
Large Language Models (LLMs) have shown significant potential as judges for
Machine Translation (MT) quality assessment, providing both scores and
fine-grained feedback. Although approaches such as GEMBA-MQM have shown
state-of-the-art performance on reference-free evaluation, the predicted errors
do not align well with those annotated by humans, limiting their
interpretability as feedback signals. To enhance the quality of error
annotations predicted by LLM evaluators, we introduce a universal and
training-free framework, MQM-APE, based on the idea of filtering out
non-impactful errors by Automatically Post-Editing (APE) the original
translation based on each error, leaving only those errors that contribute to
quality improvement. Specifically, we prompt the LLM to act as 1) an
evaluator to provide error annotations, 2) a post-editor to
determine whether errors impact quality improvement and 3) a pairwise
quality verifier as the error filter. Experiments show that our approach
consistently improves both the reliability and quality of error spans against
GEMBA-MQM, across eight LLMs in both high- and low-resource languages.
Orthogonal to trained approaches, MQM-APE complements translation-specific
evaluators such as Tower, highlighting its broad applicability. Further
analysis confirms the effectiveness of each module and offers valuable insights
into evaluator design and LLM selection.
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2409.14335v2
[DATE]
2024-12-16 16:08:51+08:00
[CATEGORIES]
cs.CL
GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding
[AUTHORS]
Yukun Cao, Shuo Han, Zengyi Gao, Zezhong Ding, Xike Xie, S. Kevin Zhou
[ABSTRACT]
Although Large Language Models (LLMs) have demonstrated potential in
processing graphs, they struggle with comprehending graphical structure
information through prompts of graph description sequences, especially as the
graph size increases. We attribute this challenge to the uneven memory
performance of LLMs across different positions in graph description sequences,
known as “positional biases”. To address this, we propose GraphInsight, a
novel framework aimed at improving LLMs’ comprehension of both macro- and
micro-level graphical information. GraphInsight is grounded in two key
strategies: 1) placing critical graphical information in positions where LLMs
exhibit stronger memory performance, and 2) investigating a lightweight
external knowledge base for regions with weaker memory performance, inspired by
retrieval-augmented generation (RAG). Moreover, GraphInsight explores
integrating these two strategies into LLM agent processes for composite graph
tasks that require multi-step reasoning. Extensive empirical studies on
benchmarks with a wide range of evaluation tasks show that GraphInsight
significantly outperforms all other graph description methods (e.g., prompting
techniques and reordering strategies) in understanding graph structures of
varying sizes.
[LINK]
http://arxiv.org/abs/2409.03258v3
[DATE]
2024-12-16 16:06:27+08:00
[CATEGORIES]
cs.CL
DART: An AIGT Detector using AMR of Rephrased Text
[AUTHORS]
Hyeonchu Park, Byungjun Kim, Bugeun Kim
[ABSTRACT]
As large language models (LLMs) generate more human-like texts, concerns
about the side effects of AI-generated texts (AIGT) have grown. So, researchers
have developed methods for detecting AIGT. However, two challenges remain.
First, the performance on detecting black-box LLMs is low, because existing
models have focused on syntactic features. Second, most AIGT detectors have
been tested on a single-candidate setting, which assumes that we know the
origin of an AIGT and may deviate from the real-world scenario. To resolve
these challenges, we propose DART, which consists of four steps: rephrasing,
semantic parsing, scoring, and multiclass classification. We conducted several
experiments to test the performance of DART by following previous work. The
experimental result shows that DART can discriminate multiple black-box LLMs
without using syntactic features or knowing the origin of the AIGT.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2412.11517v1
[DATE]
2024-12-16 15:51:09+08:00
[CATEGORIES]
cs.CL
ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents
[AUTHORS]
Yusheng Liao, Shuyang Jiang, Yanfeng Wang, Yu Wang
[ABSTRACT]
Large Language Models (LLMs) have shown promising potential in the medical
domain, assisting with tasks like clinical note generation and patient
communication. However, current LLMs are limited to text-based communication,
hindering their ability to interact with diverse forms of information in
clinical environments. Despite clinical agents succeeding in diverse signal
interaction, they are oriented to a single clinical scenario and hence fail for
broader applications. To evaluate clinical agents holistically, we propose
ClinicalAgent Bench (CAB), a comprehensive medical agent benchmark consisting
of 18 tasks across five key realistic clinical dimensions. Building on this, we
introduce ReflecTool, a novel framework that excels at utilizing
domain-specific tools within two stages. The first optimization stage
progressively enlarges a long-term memory by saving successful solving
processes and tool-wise experience of agents in a tiny pre-defined training
set. In the following inference stage, ReflecTool can search for supportive
successful demonstrations from already built long-term memory to guide the tool
selection strategy, and a verifier improves the tool usage according to the
tool-wise experience with two verification methods: iterative refinement and
candidate selection. Extensive experiments on ClinicalAgent Benchmark
demonstrate that ReflecTool surpasses the pure LLMs by more than 10 points
and the well-established agent-based methods by 3 points, highlighting its
adaptability and effectiveness in solving complex clinical tasks.
[COMMENTS]
24 pages
[LINK]
http://arxiv.org/abs/2410.17657v2
[DATE]
2024-12-16 15:28:33+08:00
[CATEGORIES]
cs.CL
InfuserKI: Enhancing Large Language Models with Knowledge Graphs via Infuser-Guided Knowledge Integration
[AUTHORS]
Fali Wang, Runxue Bao, Suhang Wang, Wenchao Yu, Yanchi Liu, Wei Cheng, Haifeng Chen
[ABSTRACT]
Large Language Models (LLMs) have achieved exceptional capabilities in open
generation across various domains, yet they encounter difficulties with tasks
that require intensive knowledge. To address these challenges, methods for
integrating knowledge have been developed, which augment LLMs with
domain-specific knowledge graphs through external modules. These approaches,
however, face data inefficiency issues as they necessitate the processing of
both known and unknown knowledge for fine-tuning. Thus, our research focuses on
a novel problem: efficiently integrating unknown knowledge into LLMs without
unnecessary overlap of known knowledge. A risk of introducing new knowledge is
the potential forgetting of existing knowledge. To mitigate this risk, we
propose the innovative InfuserKI framework. This framework employs transformer
internal states to determine when to enrich LLM outputs with additional
information, effectively preventing knowledge forgetting. Performance
evaluations using the UMLS-2.5k and MetaQA domain knowledge graphs reveal that
InfuserKI not only successfully integrates new knowledge but also outperforms
state-of-the-art baselines, reducing knowledge forgetting by 9% and 6%,
respectively.
[COMMENTS]
14 pages, 7 figures, EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2402.11441v2
[DATE]
2024-12-16 15:18:06+08:00
[CATEGORIES]
cs.CL
cs.LG
Chain-of-Discussion: A Multi-Model Framework for Complex Evidence-Based Question Answering
[AUTHORS]
Mingxu Tao, Dongyan Zhao, Yansong Feng
[ABSTRACT]
Open-ended question answering requires models to find appropriate evidence to
form well-reasoned, comprehensive and helpful answers. In practical
applications, models also need to engage in extended discussions on potential
scenarios closely relevant to the question. When augmented with a retrieval
module, open-source Large Language Models (LLMs) can produce coherent answers
often with different focuses, but are still sub-optimal in terms of reliable
evidence selection and in-depth question analysis. In this paper, we propose a
novel Chain-of-Discussion framework to leverage the synergy among multiple
open-source LLMs aiming to provide more correct and more comprehensive answers
for open-ended QA, although they are not strong enough individually. Our
experiments show that discussions among multiple LLMs play a vital role in
enhancing the quality of answers.
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2402.16313v3
[DATE]
2024-12-16 15:11:59+08:00
[CATEGORIES]
cs.CL
FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing
[AUTHORS]
Zekai Li, Jintu Zheng, Ji Liu, Han Liu, Haowei Zhu, Zeping Li, Fuwei Yang, Haiduo Huang, Jinzhang Peng, Dong Li, Lu Tian, Emad Barsoum
[ABSTRACT]
Recently, large language models (LLMs) have demonstrated superior performance
across various tasks by adhering to scaling laws, which significantly increase
model size. However, the huge computation overhead during inference hinders the
deployment in industrial applications. Many works leverage traditional
compression approaches to boost model inference, but these always introduce
additional training costs to restore the performance and the pruning results
typically show noticeable performance drops compared to the original model when
aiming for a specific level of acceleration. To address these issues, we
propose a fine-grained token-wise pruning approach for the LLMs, which presents
a learnable router to adaptively identify the less important tokens and skip
them across model blocks to reduce computational cost during inference. To
construct the router efficiently, we present a search-based sparsity scheduler
for pruning sparsity allocation, a trainable router combined with our proposed
four low-dimensional factors as input and three proposed losses. We conduct
extensive experiments across different benchmarks on different LLMs to
demonstrate the superiority of our method. Our approach achieves
state-of-the-art (SOTA) pruning results, surpassing other existing pruning
methods. For instance, our method outperforms BlockPruner and ShortGPT by
approximately 10 points on both LLaMA2-7B and Qwen1.5-7B in accuracy retention
at comparable token sparsity levels.
[LINK]
http://arxiv.org/abs/2412.11494v1
[DATE]
2024-12-16 15:09:46+08:00
[CATEGORIES]
cs.CL
Game Development as Human-LLM Interaction
[AUTHORS]
Jiale Hong, Hongqiu Wu, Hai Zhao
[ABSTRACT]
Game development is a highly specialized task that relies on a complex game
engine powered by complex programming languages, preventing many gaming
enthusiasts from handling it. This paper introduces the Chat Game Engine
(ChatGE) powered by an LLM, which allows everyone to develop a custom game using
natural language through Human-LLM interaction. To enable an LLM to function as
a ChatGE, we instruct it to perform the following processes in each turn: (1)
$P_{script}$: configure the game script segment based on the user’s input; (2)
$P_{code}$: generate the corresponding code snippet based on the game script
segment; (3) $P_{utter}$: interact with the user, including guidance and
feedback. We propose a data synthesis pipeline based on an LLM to generate game
script-code pairs and interactions from a few manually crafted seed data. We
propose a three-stage progressive training strategy to transfer the
dialogue-based LLM to our ChatGE smoothly. We construct a ChatGE for poker
games as a case study and comprehensively evaluate it from two perspectives:
interaction quality and code correctness.
[LINK]
http://arxiv.org/abs/2408.09386v2
[DATE]
2024-12-16 14:58:49+08:00
[CATEGORIES]
cs.CL
Grading Massive Open Online Courses Using Large Language Models
[AUTHORS]
Shahriar Golchin, Nikhil Garuda, Christopher Impey, Matthew Wenger
[COMMENTS]
Final version; accepted at COLING 2025
[LINK]
http://arxiv.org/abs/2406.11102v2
[DATE]
2024-12-16 14:50:20+08:00
[CATEGORIES]
cs.CL
NoteContrast: Contrastive Language-Diagnostic Pretraining for Medical Text
[AUTHORS]
Prajwal Kailas, Max Homilius, Rahul C. Deo, Calum A. MacRae
[ABSTRACT]
Accurate diagnostic coding of medical notes is crucial for enhancing patient
care, medical research, and error-free billing in healthcare organizations.
Manual coding is a time-consuming task for providers, and diagnostic codes
often exhibit low sensitivity and specificity, whereas the free text in medical
notes can be a more precise description of a patient's status. Thus, accurate
automated diagnostic coding of medical notes has become critical for a learning
healthcare system. Recent developments in long-document transformer
architectures have enabled attention-based deep-learning models to adjudicate
medical notes. In addition, contrastive loss functions have been used to
jointly pre-train large language and image models with noisy labels. To further
improve the automated adjudication of medical notes, we developed an approach
based on i) models for ICD-10 diagnostic code sequences using a large
real-world data set, ii) large language models for medical notes, and iii)
contrastive pre-training to build an integrated model of both ICD-10 diagnostic
codes and corresponding medical text. We demonstrate that a contrastive
approach for pre-training improves performance over prior state-of-the-art
models for the MIMIC-III-50, MIMIC-III-rare50, and MIMIC-III-full diagnostic
coding tasks.
[LINK]
http://arxiv.org/abs/2412.11477v1
[DATE]
2024-12-16 14:44:39+08:00
[CATEGORIES]
cs.LG
cs.CL
Augmenting Math Word Problems via Iterative Question Composing
[AUTHORS]
Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao
[ABSTRACT]
Despite the advancements in large language models (LLMs) for mathematical
reasoning, solving competition-level math problems remains a significant
challenge, especially for open-source LLMs without external tools. We introduce
the MMIQC dataset, comprising a mixture of processed web data and synthetic
question-response pairs, aimed at enhancing the mathematical reasoning
capabilities of base language models. Models fine-tuned on MMIQC consistently
surpass their counterparts in performance on the MATH benchmark across various
model sizes. Notably, Qwen-72B-MMIQC achieves a 45.0% accuracy, exceeding the
previous open-source state-of-the-art by 8.2% and outperforming the initial
version of GPT-4 released in 2023. Extensive evaluation results on Hungarian high
school finals suggest that such improvement can generalize to unseen data. Our
ablation study on MMIQC reveals that a large part of the improvement can be
attributed to our novel augmentation method, Iterative Question Composing
(IQC), which involves iteratively composing new questions from seed problems
using an LLM and applying rejection sampling through another LLM.
[LINK]
http://arxiv.org/abs/2401.09003v5
[DATE]
2024-12-16 14:13:03+08:00
[CATEGORIES]
cs.CL
cs.LG
LawLuo: A Multi-Agent Collaborative Framework for Multi-Round Chinese Legal Consultation
[AUTHORS]
Jingyun Sun, Chengxiao Dai, Zhongze Luo, Yangbo Chang, Yang Li
[ABSTRACT]
Legal Large Language Models (LLMs) have shown promise in providing legal
consultations to non-experts. However, most existing Chinese legal consultation
models are based on single-agent systems, which differ from real-world legal
consultations, where multiple professionals collaborate to offer more tailored
responses. To better simulate real consultations, we propose LawLuo, a
multi-agent framework for multi-turn Chinese legal consultations. LawLuo
includes four agents: the receptionist agent, which assesses user intent and
selects a lawyer agent; the lawyer agent, which interacts with the user; the
secretary agent, which organizes conversation records and generates
consultation reports; and the boss agent, which evaluates the performance of
the lawyer and secretary agents to ensure optimal results. These agents’
interactions mimic the operations of real law firms. To train them to follow
different legal instructions, we developed distinct fine-tuning datasets. We
also introduce a case graph-based RAG to help the lawyer agent address vague
user inputs. Experimental results show that LawLuo outperforms baselines in
generating more personalized and professional responses, handling ambiguous
queries, and following legal instructions in multi-turn conversations. Our full
code and constructed datasets will be open-sourced upon paper acceptance.
[COMMENTS]
13 pages
[LINK]
http://arxiv.org/abs/2407.16252v3
[DATE]
2024-12-16 13:53:28+08:00
[CATEGORIES]
cs.CL
Understanding Knowledge Hijack Mechanism in In-context Learning through Associative Memory
[AUTHORS]
Shuo Wang, Issei Sato
[ABSTRACT]
In-context learning (ICL) enables large language models (LLMs) to adapt to
new tasks without fine-tuning by leveraging contextual information provided
within a prompt. However, ICL relies not only on contextual clues but also on
the global knowledge acquired during pretraining for the next token prediction.
Analyzing this process has been challenging due to the complex computational
circuitry of LLMs. This paper investigates the balance between in-context
information and pretrained bigram knowledge in token prediction, focusing on
the induction head mechanism, a key component in ICL. Leveraging the fact that
a two-layer transformer can implement the induction head mechanism with
associative memories, we theoretically analyze the logits when a two-layer
transformer is given prompts generated by a bigram model. In the experiments,
we design specific prompts to evaluate whether the outputs of a two-layer
transformer align with the theoretical results.
[LINK]
http://arxiv.org/abs/2412.11459v1
[DATE]
2024-12-16 13:33:05+08:00
[CATEGORIES]
cs.CL
cs.LG
Towards Better Multi-task Learning: A Framework for Optimizing Dataset Combinations in Large Language Models
[AUTHORS]
Zaifu Zhan, Rui Zhang
[ABSTRACT]
To efficiently select optimal dataset combinations for enhancing multi-task
learning (MTL) performance in large language models, we propose a novel
framework that leverages a neural network to predict the best dataset
combinations. The framework iteratively refines the selection, greatly
improving efficiency, while being model-, dataset-, and domain-independent.
Through experiments on 12 biomedical datasets across four tasks (named entity
recognition, relation extraction, event extraction, and text classification), we
demonstrate that our approach effectively identifies better combinations, even
for tasks that may seem unpromising from a human perspective. This verifies
that our framework provides a promising solution for maximizing MTL potential.
[COMMENTS]
14 pages, 5 figures, 4 tables
[LINK]
http://arxiv.org/abs/2412.11455v1
[DATE]
2024-12-16 13:20:18+08:00
[CATEGORIES]
cs.CL
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study over Open-ended Question Answering
[AUTHORS]
Yuan Sui, Yufei He, Zifeng Ding, Bryan Hooi
[ABSTRACT]
Recent works integrating Knowledge Graphs (KGs) have led to promising
improvements in enhancing reasoning accuracy of Large Language Models (LLMs).
However, current benchmarks focus mainly on closed-ended tasks, leaving a gap
in the assessment of more complex real-world scenarios. This gap has also
obscured the evaluation of KGs’ potential to mitigate the problem of
hallucination in LLMs. To fill the gap, we introduce OKGQA, a new benchmark
specifically designed to assess LLMs enhanced with KGs under open-ended,
real-world question answering scenarios. OKGQA is designed to closely reflect
the complexities of practical applications using questions from different
types, and incorporates specific metrics to measure both the reduction in
hallucinations and the enhancement in reasoning capabilities. To consider the
scenario in which KGs may have varying levels of mistakes, we propose another
benchmark variant OKGQA-P to assess model performance when the semantics and
structure of KGs are deliberately perturbed and contaminated. OKGQA aims to (1)
explore whether KGs can make LLMs more trustworthy in an open-ended setting,
and (2) conduct a comparative analysis to shed light on methods and future
directions for leveraging KGs to reduce LLMs’ hallucination. We believe that
this study can facilitate a more complete performance comparison and encourage
continuous improvement in integrating KGs with LLMs.
[LINK]
http://arxiv.org/abs/2410.08085v2
[DATE]
2024-12-16 13:18:12+08:00
[CATEGORIES]
cs.CL
Are LLMs Rigorous Logical Reasoner? Empowering Natural Language Proof Generation with Contrastive Stepwise Decoding
[AUTHORS]
Ying Su, Xiaojin Fu, Mingwen Liu, Zhijiang Guo
[ABSTRACT]
Logical reasoning remains a pivotal component within the realm of artificial
intelligence. The recent evolution of large language models (LLMs) has marked
significant progress in this domain. The adoption of strategies like
chain-of-thought (CoT) has enhanced the performance of LLMs across diverse
reasoning tasks. Nonetheless, logical reasoning that involves proof planning,
specifically those that necessitate the validation of explanation accuracy,
continues to present stumbling blocks. In this study, we first evaluate the
efficacy of LLMs with advanced CoT strategies concerning such tasks. Our
analysis reveals that LLMs still struggle to navigate complex reasoning chains,
which demand the meticulous linkage of premises to derive a cogent conclusion.
To address this issue, we finetune a smaller-scale language model, equipping it
to decompose proof objectives into more manageable subgoals. We also introduce
contrastive decoding to stepwise proof generation, making use of negative
reasoning paths to strengthen the model’s capacity for logical deduction.
Experiments on EntailmentBank underscore the success of our method in
augmenting the proof planning abilities of language models.
[COMMENTS]
The paper is currently undergoing extensive revisions and
improvements
[LINK]
http://arxiv.org/abs/2311.06736v2
[DATE]
2024-12-16 13:18:05+08:00
[CATEGORIES]
cs.CL
Whisper-GPT: A Hybrid Representation Audio Large Language Model
[AUTHORS]
Prateek Verma
[ABSTRACT]
We propose WHISPER-GPT: A generative large language model (LLM) for speech
and music that allows us to work with continuous audio representations and
discrete tokens simultaneously as part of a single architecture. There has been
a huge surge in generative audio, speech, and music models that utilize
discrete audio tokens derived from neural compression algorithms, e.g. ENCODEC.
However, one of the major drawbacks of this approach is handling the context
length. It blows up for high-fidelity generative architecture if one has to
account for all the audio contents at various frequencies for the next token
prediction. By combining continuous audio representation like the spectrogram
and discrete acoustic tokens, we retain the best of both worlds: we have all the
information needed from the audio at a specific time instance in a single
token, yet allow the LLM to predict the future token to allow for sampling and
other benefits discrete space provides. We show how our architecture improves
the perplexity and negative log-likelihood scores for the next token prediction
compared to a token-based LLM for speech and music.
[COMMENTS]
6 pages, 3 figures. 50th International Conference on Acoustics,
Speech and Signal Processing, Hyderabad, India
[LINK]
http://arxiv.org/abs/2412.11449v1
[DATE]
2024-12-16 13:03:48+08:00
[CATEGORIES]
cs.CL
cs.LG
InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection
[AUTHORS]
Junjie Chen, Hang Yu, Subin Huang, Sanmin Liu, Linfeng Zhang
[ABSTRACT]
Sarcasm in social media, often expressed through text-image combinations,
poses challenges for sentiment analysis and intention mining. Current
multi-modal sarcasm detection methods have been demonstrated to overly rely on
spurious cues within the textual modality, revealing a limited ability to
genuinely identify sarcasm through nuanced text-image interactions. To solve
this problem, we propose InterCLIP-MEP, which introduces Interactive CLIP
(InterCLIP) with an efficient training strategy to extract enriched text-image
representations by embedding cross-modal information directly into each
encoder. Additionally, we design a Memory-Enhanced Predictor (MEP) with a
dynamic dual-channel memory that stores valuable test sample knowledge during
inference, acting as a non-parametric classifier for robust sarcasm
recognition. Experiments on two benchmarks demonstrate that InterCLIP-MEP
achieves state-of-the-art performance, with significant accuracy and F1 score
improvements on MMSD and MMSD2.0. Our code is available at
https://github.com/CoderChen01/InterCLIP-MEP.
[COMMENTS]
15 pages, 7 figures, 11 tables; Code and data are available at
https://github.com/CoderChen01/InterCLIP-MEP
[LINK]
http://arxiv.org/abs/2406.16464v5
[DATE]
2024-12-16 12:13:38+08:00
[CATEGORIES]
cs.CL
Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models
[AUTHORS]
Jiahui Li, Yongchang Hao, Haoyu Xu, Xing Wang, Yu Hong
[ABSTRACT]
Despite the advancements in training Large Language Models (LLMs) with
alignment techniques to enhance the safety of generated content, these models
remain susceptible to jailbreak, an adversarial attack method that exposes
security vulnerabilities in LLMs. Notably, the Greedy Coordinate Gradient (GCG)
method has demonstrated the ability to automatically generate adversarial
suffixes that jailbreak state-of-the-art LLMs. However, the optimization
process involved in GCG is highly time-consuming, rendering the jailbreaking
pipeline inefficient. In this paper, we investigate the process of GCG and
identify an issue of Indirect Effect, the key bottleneck of the GCG
optimization. To this end, we propose the Model Attack Gradient Index GCG
(MAGIC), which addresses the Indirect Effect by exploiting the gradient
information of the suffix tokens, thereby accelerating the procedure by having
less computation and fewer iterations. Our experiments on AdvBench show that
MAGIC achieves up to a 1.5x speedup, while maintaining Attack Success Rates
(ASR) on par with or even higher than other baselines. Our MAGIC achieved an ASR of
74% on the Llama-2 and an ASR of 54% when conducting transfer attacks on
GPT-3.5. Code is available at https://github.com/jiah-li/magic.
[COMMENTS]
13 pages, 2 figures, accepted by COLING 2025
[LINK]
http://arxiv.org/abs/2412.08615v2
[DATE]
2024-12-16 12:05:45+08:00
[CATEGORIES]
cs.CL
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
[AUTHORS]
Fan Zhang, Shulin Tian, Ziqi Huang, Yu Qiao, Ziwei Liu
[ABSTRACT]
Recent advancements in visual generative models have enabled high-quality
image and video generation, opening diverse applications. However, evaluating
these models often demands sampling hundreds or thousands of images or videos,
making the process computationally expensive, especially for diffusion-based
models with inherently slow sampling. Moreover, existing evaluation methods
rely on rigid pipelines that overlook specific user needs and provide numerical
results without clear explanations. In contrast, humans can quickly form
impressions of a model’s capabilities by observing only a few samples. To mimic
this, we propose the Evaluation Agent framework, which employs human-like
strategies for efficient, dynamic, multi-round evaluations using only a few
samples per round, while offering detailed, user-tailored analyses. It offers
four key advantages: 1) efficiency, 2) promptable evaluation tailored to
diverse user needs, 3) explainability beyond single numerical scores, and 4)
scalability across various models and tools. Experiments show that Evaluation
Agent reduces evaluation time to 10% of traditional methods while delivering
comparable results. The Evaluation Agent framework is fully open-sourced to
advance research in visual generative models and their efficient evaluation.
[COMMENTS]
Equal contributions from first three authors. Project page:
https://vchitect.github.io/Evaluation-Agent-project Code:
https://github.com/Vchitect/Evaluation-Agent
[LINK]
http://arxiv.org/abs/2412.09645v2
[DATE]
2024-12-16 12:05:05+08:00
[CATEGORIES]
cs.CL
Optimized Quran Passage Retrieval Using an Expanded QA Dataset and Fine-Tuned Language Models
[AUTHORS]
Mohamed Basem, Islam Oshallah, Baraa Hikal, Ali Hamdi, Ammar Mohamed
[ABSTRACT]
Understanding the deep meanings of the Qur’an and bridging the language gap
between modern standard Arabic and classical Arabic is essential to improve the
question-and-answer system for the Holy Qur’an. The Qur’an QA 2023 shared task
dataset had a limited number of questions with weak model retrieval. To address
this challenge, this work updated the original dataset and improved the model
accuracy. The original dataset, which contains 251 questions, was reviewed and
expanded to 629 questions with question diversification and reformulation,
leading to a comprehensive set of 1895 categorized into single-answer,
multi-answer, and zero-answer types. Extensive experiments fine-tuned
transformer models, including AraBERT, RoBERTa, CAMeLBERT, AraELECTRA, and
BERT. The best model, AraBERT-base, achieved a MAP@10 of 0.36 and MRR of 0.59,
representing improvements of 63% and 59%, respectively, compared to the
baseline scores (MAP@10: 0.22, MRR: 0.37). Additionally, the dataset expansion
led to improvements in handling “no answer” cases, with the proposed approach
achieving a 75% success rate for such instances, compared to the baseline’s
25%. These results demonstrate the effect of dataset improvement and model
architecture optimization in increasing the performance of QA systems for the
Holy Qur’an, with higher accuracy, recall, and precision.
[LINK]
http://arxiv.org/abs/2412.11431v1
[DATE]
2024-12-16 12:03:58+08:00
[CATEGORIES]
cs.CL
AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs
[AUTHORS]
Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, Ruoxi Jia
[ABSTRACT]
Domain reweighting is an emerging research area aimed at adjusting the
relative weights of different data sources to improve the effectiveness and
efficiency of language model pre-training. This paper demonstrates that the
optimal composition of training data from different domains is scale-dependent,
challenging the existing practice of determining optimal mixtures through
small-scale experiments and directly applying them at larger scales. We derive
an analytical model for the dependence of optimal weights on data scale and
introduce AutoScale, a novel, practical approach for optimizing data
compositions at potentially large training data scales. AutoScale first uses
a principled optimization framework to find optimal compositions at smaller,
feasible scales, then predicts optimal compositions at larger scales using our
derived model. Our evaluation on GPT-2 Large and BERT pre-training demonstrates
AutoScale’s effectiveness in improving training convergence and downstream
performance. Particularly, for GPT-2 Large on RedPajama, AutoScale decreases
validation perplexity 28% faster than baselines, with up to 38% speed-up over
unweighted training, achieving the best performance across downstream tasks.
This work provides insights into the varying benefits of data sources across
training scales for language models, contributing to the burgeoning research on
scale-dependent data curation. Code is open-sourced.
[COMMENTS]
Preprint. Under review
[LINK]
http://arxiv.org/abs/2407.20177v3
[DATE]
2024-12-16 11:39:20+08:00
[CATEGORIES]
cs.LG
cs.CL
Unveiling Language Skills via Path-Level Circuit Discovery
[AUTHORS]
Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang
[ABSTRACT]
Circuit discovery with edge-level ablation has become a foundational
framework for mechanism interpretability of language models. However, its focus
on individual edges often overlooks the sequential, path-level causal
relationships that underpin complex behaviors, thus potentially leading to
misleading or incomplete circuit discoveries. To address this issue, we propose
a novel path-level circuit discovery framework capturing how behaviors emerge
through interconnected linear chains and build towards complex behaviors. Our
framework is constructed upon a fully disentangled linear combination of
“memory circuits” decomposed from the original model. To discover functional
circuit paths, we leverage a 2-step pruning strategy by first reducing the
computational graph to a faithful and minimal subgraph and then applying causal
mediation to identify common paths of a specific skill, termed as skill paths.
In contrast to the circuit graphs of existing works, we focus on the complete
paths of a generic skill rather than on the fine-grained responses to
individual components of the input. To demonstrate this, we explore three
generic language skills, namely Previous Token Skill, Induction Skill and
In-Context Learning Skill using our framework and provide more compelling
evidence to substantiate stratification and inclusiveness of these skills.
[COMMENTS]
30 pages
[LINK]
http://arxiv.org/abs/2410.01334v2
[DATE]
2024-12-16 11:33:36+08:00
[CATEGORIES]
cs.CL
Attention with Dependency Parsing Augmentation for Fine-Grained Attribution
[AUTHORS]
Qiang Ding, Lvzhou Luo, Yixuan Cao, Ping Luo
[ABSTRACT]
To assist humans in efficiently validating RAG-generated content, developing
a fine-grained attribution mechanism that provides supporting evidence from
retrieved documents for every answer span is essential. Existing fine-grained
attribution methods rely on model-internal similarity metrics between responses
and documents, such as saliency scores and hidden state similarity. However,
these approaches suffer from either high computational complexity or
coarse-grained representations. Additionally, a common problem shared by the
previous works is their reliance on decoder-only Transformers, limiting their
ability to incorporate contextual information after the target span. To address
the above problems, we propose two techniques applicable to all
model-internals-based methods. First, we aggregate token-wise evidence through
set union operations, preserving the granularity of representations. Second, we
enhance the attributor by integrating dependency parsing to enrich the semantic
completeness of target spans. For practical implementation, our approach
employs attention weights as the similarity metric. Experimental results
demonstrate that the proposed method consistently outperforms all prior works.
[COMMENTS]
16 pages, 7 figures, submitted to ACL ARR 2024 October
[LINK]
http://arxiv.org/abs/2412.11404v1
[DATE]
2024-12-16 11:12:13+08:00
[CATEGORIES]
cs.CL
ASLoRA: Adaptive Sharing Low-Rank Adaptation Across Layers
[AUTHORS]
Junyan Hu, Xue Xiao, Mengqi Zhang, Yao Chen, Zhaochun Ren, Zhumin Chen, Pengjie Ren
[ABSTRACT]
As large language models (LLMs) grow in size, traditional full fine-tuning
becomes increasingly impractical due to its high computational and storage
costs. Although popular parameter-efficient fine-tuning methods, such as LoRA,
have significantly reduced the number of tunable parameters, there is still
room for further optimization. In this work, we propose ASLoRA, a cross-layer
parameter-sharing strategy combining global sharing with partial adaptive
sharing. Specifically, we share the low-rank matrix A across all layers and
adaptively merge matrix B during training. This sharing mechanism not only
mitigates overfitting effectively but also captures inter-layer dependencies,
significantly enhancing the model’s representational capability. We conduct
extensive experiments on various NLP tasks, showing that ASLoRA outperforms
LoRA while using less than 25% of the parameters, highlighting its flexibility
and superior parameter efficiency. Furthermore, in-depth analyses of the
adaptive sharing strategy confirm its significant advantages in enhancing both
model flexibility and task adaptability.
[LINK]
http://arxiv.org/abs/2412.10135v2
[DATE]
2024-12-16 11:05:38+08:00
[CATEGORIES]
cs.CL
Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing
[AUTHORS]
William Watson, Nicole Cho, Nishan Srishankar
[ABSTRACT]
Hallucination continues to be one of the most critical challenges in the
institutional adoption journey of Large Language Models (LLMs). While prior
studies have primarily focused on the post-generation analysis and refinement
of outputs, this paper centers on the effectiveness of queries in eliciting
accurate responses from LLMs. We present HalluciBot, a model that estimates the
query’s propensity to hallucinate before generation, without invoking any LLMs
during inference. HalluciBot can serve as a proxy reward model for query
rewriting, offering a general framework to estimate query quality based on
accuracy and consensus. In essence, HalluciBot investigates how poorly
constructed queries can lead to erroneous outputs. Moreover, by employing
query rewriting guided by HalluciBot’s empirical estimates, we demonstrate that
95.7% output accuracy can be achieved for Multiple Choice questions. The
training procedure for HalluciBot consists of perturbing 369,837 queries n
times, employing n+1 independent LLM agents, sampling an output from each
query, conducting a Multi-Agent Monte Carlo simulation on the sampled outputs,
and training an encoder classifier. The idea of perturbation is the outcome of
our ablation studies that measure the increase in output diversity (+12.5
agreement spread) by perturbing a query in lexically different but semantically
similar ways. Therefore, HalluciBot paves the way to ratiocinate (76.0% test F1
score, 46.6% in saved computation on hallucinatory queries), rewrite (+30.2%
positive class transition from hallucinatory to non-hallucinatory), rank
(+50.6% positive class transition from hallucinatory to non-hallucinatory), and
route queries to effective pipelines.
[COMMENTS]
Accepted at The 39th Annual AAAI Conference on Artificial
Intelligence (AAAI 2025)
[LINK]
http://arxiv.org/abs/2404.12535v3
[DATE]
2024-12-16 10:35:31+08:00
[CATEGORIES]
cs.LG
cs.CL
Why Does ChatGPT “Delve” So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models
[AUTHORS]
Tom S. Juzek, Zina B. Ward
[COMMENTS]
15 pages, 8 figures, The 31st International Conference on
Computational Linguistics (COLING 2025)
[LINK]
http://arxiv.org/abs/2412.11385v1
[DATE]
2024-12-16 10:27:59+08:00
[CATEGORIES]
cs.CL
cs.LG
Planning with Multi-Constraints via Collaborative Language Agents
[AUTHORS]
Cong Zhang, Derrick Goh Xin Deik, Dexun Li, Hao Zhang, Yong Liu
[ABSTRACT]
The rapid advancement of neural language models has sparked a new surge of
intelligent agent research. Unlike traditional agents, large language
model-based agents (LLM agents) have emerged as a promising paradigm for
achieving artificial general intelligence (AGI) due to their superior reasoning
and generalization capabilities. Effective planning is crucial for the success
of LLM agents in real-world tasks, making it a highly pursued topic in the
community. Current planning methods typically translate tasks into executable
action sequences. However, determining a feasible or optimal sequence for
complex tasks with multiple constraints at fine granularity, which often
requires compositing long chains of heterogeneous actions, remains challenging.
This paper introduces Planning with Multi-Constraints (PMC), a zero-shot
methodology for collaborative LLM-based multi-agent systems that simplifies
complex task planning with constraints by decomposing it into a hierarchy of
subordinate tasks. Each subtask is then mapped into executable actions. PMC was
assessed on two constraint-intensive benchmarks, TravelPlanner and API-Bank.
Notably, PMC achieved an average 42.68% success rate on TravelPlanner,
significantly higher than GPT-4 (2.92%), and outperforming GPT-4 with ReAct on
API-Bank by 13.64%, showing the immense potential of integrating LLMs with
multi-agent systems. We also show that PMC works with a small LLM as the planning
core, e.g., LLaMA-3.1-8B.
[LINK]
http://arxiv.org/abs/2405.16510v4
[DATE]
2024-12-16 10:27:55+08:00
[CATEGORIES]
cs.CL
cs.LG
Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature
[AUTHORS]
Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, Yue Zhang
[ABSTRACT]
Large language models (LLMs) have shown the ability to produce fluent and
cogent content, presenting both productivity opportunities and societal risks.
To build trustworthy AI systems, it is imperative to distinguish between
machine-generated and human-authored content. The leading zero-shot detector,
DetectGPT, showcases commendable performance but is marred by its intensive
computational costs. In this paper, we introduce the concept of conditional
probability curvature to elucidate discrepancies in word choices between LLMs
and humans within a given context. Utilizing this curvature as a foundational
metric, we present Fast-DetectGPT, an optimized zero-shot detector, which
substitutes DetectGPT’s perturbation step with a more efficient sampling step.
Our evaluations on various datasets, source models, and test conditions
indicate that Fast-DetectGPT not only surpasses DetectGPT by around 75% in
relative terms in both the white-box and black-box settings but also accelerates
the detection process by a factor of 340, as detailed in Table 1. See
https://github.com/baoguangsheng/fast-detect-gpt for code, data, and
results.
[COMMENTS]
ICLR 2024 camera version (9 pages, 5 figures, 11 tables)
[LINK]
http://arxiv.org/abs/2310.05130v3
[DATE]
2024-12-16 10:20:31+08:00
[CATEGORIES]
cs.CL
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
[AUTHORS]
Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez
[ABSTRACT]
The integration of new modalities into frontier AI systems offers exciting
capabilities, but also increases the possibility such systems can be
adversarially manipulated in undesirable ways. In this work, we focus on a
popular class of vision-language models (VLMs) that generate text outputs
conditioned on visual and textual inputs. We conducted a large-scale empirical
study to assess the transferability of gradient-based universal image
“jailbreaks” using a diverse set of over 40 open-parameter VLMs, including 18
new VLMs that we publicly release. Overall, we find that transferable
gradient-based image jailbreaks are extremely difficult to obtain. When an
image jailbreak is optimized against a single VLM or against an ensemble of
VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits
little-to-no transfer to any other VLMs; transfer is not affected by whether
the attacked and target VLMs possess matching vision backbones or language
models, whether the language model underwent instruction-following and/or
safety-alignment training, or many other factors. Only two settings display
partially successful transfer: between identically-pretrained and
identically-initialized VLMs with slightly different VLM training data, and
between different training checkpoints of a single VLM. Leveraging these
results, we then demonstrate that transfer can be significantly improved
against a specific target VLM by attacking larger ensembles of
“highly-similar” VLMs. These results stand in stark contrast to existing
evidence of universal and transferable text jailbreaks against language models
and transferable adversarial attacks against image classifiers, suggesting that
VLMs may be more robust to gradient-based transfer attacks.
[COMMENTS]
NeurIPS 2024 Workshops: RBFM (Best Paper), Frontiers in AdvML (Oral),
Red Teaming GenAI (Oral), SoLaR (Spotlight), SATA
[LINK]
http://arxiv.org/abs/2407.15211v2
[DATE]
2024-12-16 09:20:42+08:00
[CATEGORIES]
cs.CL
cs.LG
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models
[AUTHORS]
Jiarui Ji, Runlin Lei, Jialing Bi, Zhewei Wei, Xu Chen, Yankai Lin, Xuchen Pan, Yaliang Li, Bolin Ding
[ABSTRACT]
The structural properties of naturally arising social graphs are extensively
studied to understand their evolution. Prior approaches for modeling network
dynamics typically rely on rule-based models, which lack realism and
generalizability, or deep learning-based models, which require large-scale
training datasets. Social graphs, as abstract graph representations of
entity-wise interactions, present an opportunity to explore network evolution
mechanisms through realistic simulations of human-item interactions. Leveraging
the pre-trained social consensus knowledge embedded in large language models
(LLMs), we present GraphAgent-Generator (GAG), a novel simulation-based
framework for dynamic, text-attributed social graph generation. GAG simulates
the temporal node and edge generation processes for zero-shot social graph
generation. The resulting graphs exhibit adherence to seven key macroscopic
network properties, achieving an 11% improvement in microscopic graph structure
metrics. Through the node classification benchmarking task, we validate that GAG
effectively captures the intricate text-structure correlations in graph
generation. Furthermore, GAG supports generating graphs with up to nearly
100,000 nodes or 10 million edges through large-scale LLM-based agent
simulation with parallel acceleration, achieving a minimum speed-up of 90.4%.
The source code is available at
https://anonymous.4open.science/r/GraphAgent-2206.
[LINK]
http://arxiv.org/abs/2410.09824v5
[DATE]
2024-12-16 09:03:04+08:00
[CATEGORIES]
cs.CL
Distilling Rule-based Knowledge into Large Language Models
[AUTHORS]
Wenkai Yang, Yankai Lin, Jie Zhou, Ji-Rong Wen
[ABSTRACT]
Large language models (LLMs) have shown incredible performance in completing
various real-world tasks. The current paradigm of knowledge learning for LLMs
is mainly based on learning from examples, in which LLMs learn the internal
rule implicitly from a certain number of supervised examples. However, this
learning paradigm may not learn complicated rules well, especially when
the training examples are limited. We are inspired by the fact that humans can
learn new tasks or knowledge in another way, by learning from rules. That is, humans
can learn new tasks or grasp new knowledge quickly and generalize well given
only a detailed rule and a few optional examples. Therefore, in this paper, we
aim to explore the feasibility of this new learning paradigm, which targets
encoding rule-based knowledge into LLMs. We further propose rule distillation,
which first uses the strong in-context abilities of LLMs to extract the
knowledge from the textual rules, and then explicitly encodes the knowledge into
the parameters of LLMs by learning from the above in-context signals produced
inside the model. Our experiments show that making LLMs learn from rules by our
method is much more efficient than example-based learning in both the sample
size and generalization ability. Warning: This paper may contain examples with
offensive content.
[COMMENTS]
Accepted at COLING 2025, camera-ready version. Code and data are
available at https://github.com/RUCBM/rule-distillation
[LINK]
http://arxiv.org/abs/2311.08883v3
[DATE]
2024-12-16 08:37:33+08:00
[CATEGORIES]
cs.CL
Dissecting Adversarial Robustness of Multimodal LM Agents
[AUTHORS]
Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan
[ABSTRACT]
As language models (LMs) are used to build autonomous agents in real
environments, ensuring their adversarial robustness becomes a critical
challenge. Unlike chatbots, agents are compound systems with multiple
components, which existing LM safety evaluations do not adequately address. To
bridge this gap, we manually create 200 targeted adversarial tasks and
evaluation functions in a realistic threat model on top of VisualWebArena, a
real environment for web-based agents. In order to systematically examine the
robustness of various multimodal web agents, we propose the Agent Robustness
Evaluation (ARE) framework. ARE views the agent as a graph showing the flow of
intermediate outputs between components and decomposes robustness as the flow
of adversarial information on the graph. First, we find that we can
successfully break a range of the latest agents that use black-box frontier
LLMs, including those that perform reflection and tree-search. With
imperceptible perturbations to a single product image (less than 5% of total
web page pixels), an attacker can hijack these agents to execute targeted
adversarial goals with success rates up to 67%. We also use ARE to rigorously
evaluate how the robustness changes as new components are added. We find that
new components that typically improve benign performance can open up new
vulnerabilities and harm robustness. An attacker can compromise the evaluator
used by the reflexion agent and the value function of the tree search agent,
which increases the attack success rate by a relative 15% and 20%, respectively. Our data and code
for attacks, defenses, and evaluation are available at
https://github.com/ChenWu98/agent-attack
[COMMENTS]
Oral presentation at NeurIPS 2024 Open-World Agents Workshop
[LINK]
http://arxiv.org/abs/2406.12814v2
[DATE]
2024-12-16 08:21:27+08:00
[CATEGORIES]
cs.LG
cs.CL
Can AI Extract Antecedent Factors of Human Trust in AI? An Application of Information Extraction for Scientific Literature in Behavioural and Computer Sciences
[AUTHORS]
Melanie McGrath, Harrison Bailey, Necva Bölücü, Xiang Dai, Sarvnaz Karimi, Cecile Paris
[ABSTRACT]
Information extraction from the scientific literature is one of the main
techniques to transform unstructured knowledge hidden in the text into
structured data which can then be used for decision-making in down-stream
tasks. One such area is Trust in AI, where factors contributing to human trust
in artificial intelligence applications are studied. The relationships of these
factors with human trust in such applications are complex. We hence explore
this space through the lens of information extraction where, with the input of
domain experts, we carefully design annotation guidelines, create the first
annotated English dataset in this domain, investigate an LLM-guided annotation,
and benchmark it with state-of-the-art methods using large language models in
named entity and relation extraction. Our results indicate that this problem
requires supervised learning which may not be currently feasible with
prompt-based LLMs.
[LINK]
http://arxiv.org/abs/2412.11344v1
[DATE]
2024-12-16 08:02:38+08:00
[CATEGORIES]
cs.CL
No Argument Left Behind: Overlapping Chunks for Faster Processing of Arbitrarily Long Legal Texts
[AUTHORS]
Israel Fama, Bárbara Bueno, Alexandre Alcoforado, Thomas Palmeira Ferraz, Arnold Moya, Anna Helena Reali Costa
[ABSTRACT]
In a context where the Brazilian judiciary system, the largest in the world,
faces a crisis due to the slow processing of millions of cases, it becomes
imperative to develop efficient methods for analyzing legal texts. We introduce
uBERT, a hybrid model that combines Transformer and Recurrent Neural Network
architectures to effectively handle long legal texts. Our approach processes
the full text regardless of its length while maintaining reasonable
computational overhead. Our experiments demonstrate that uBERT achieves
superior performance compared to BERT+LSTM when overlapping input is used and
is significantly faster than ULMFiT for processing long legal documents.
[COMMENTS]
Presented at 15th Symposium in Information and Human Language
Technology (STIL) @ BRACIS’24
[LINK]
http://arxiv.org/abs/2410.19184v2
[DATE]
2024-12-16 07:36:13+08:00
[CATEGORIES]
cs.CL
cs.LG
Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models
[AUTHORS]
Xiaochen Zhu, Georgi Karadzhov, Chenxi Whitehouse, Andreas Vlachos
[ABSTRACT]
Diffusion models have shown promise in text generation but often struggle
with generating long, coherent, and contextually accurate text. Token-level
diffusion overlooks word-order dependencies and enforces short output windows,
while passage-level diffusion struggles with learning robust representation for
long-form text. To address these challenges, we propose Segment-Level Diffusion
(SLD), a framework that enhances diffusion-based text generation through text
segmentation, robust representation training with adversarial and contrastive
learning, and improved latent-space guidance. By segmenting long-form outputs
into separate latent representations and decoding them with an autoregressive
decoder, SLD simplifies diffusion predictions and improves scalability.
Experiments on XSum, ROCStories, DialogSum, and DeliData demonstrate that SLD
achieves competitive or superior performance in fluency, coherence, and
contextual compatibility across automatic and human evaluation metrics
compared with other diffusion and autoregressive baselines. Ablation studies
further validate the effectiveness of our segmentation and representation
learning strategies.
[LINK]
http://arxiv.org/abs/2412.11333v1
[DATE]
2024-12-16 06:47:44+08:00
[CATEGORIES]
cs.CL
Do Robot Snakes Dream like Electric Sheep? Investigating the Effects of Architectural Inductive Biases on Hallucination
[AUTHORS]
Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Boxing Chen, Sarath Chandar
[ABSTRACT]
The growth in prominence of large language models (LLMs) in everyday life can
be largely attributed to their generative abilities, yet some of this is also
owed to the risks and costs associated with their use. On one front is their
tendency to hallucinate false or misleading information, limiting
their reliability. On another is the increasing focus on the computational
limitations associated with traditional self-attention based LLMs, which has
brought about new alternatives, in particular recurrent models, meant to
overcome them. Yet it remains uncommon to consider these two concerns
simultaneously. Do changes in architecture exacerbate/alleviate existing
concerns about hallucinations? Do they affect how and where they occur? Through
an extensive evaluation, we study how these architecture-based inductive biases
affect the propensity to hallucinate. While hallucination remains a general
phenomenon not limited to specific architectures, the situations in which they
occur and the ease with which specific types of hallucinations can be induced
can significantly differ based on the model architecture. These findings
highlight the need to better understand both these problems in conjunction
with each other, as well as to consider how to design more universal techniques
for handling hallucinations.
[LINK]
http://arxiv.org/abs/2410.17477v3
[DATE]
2024-12-16 06:30:24+08:00
[CATEGORIES]
cs.CL
cs.LG
Finding a Wolf in Sheep’s Clothing: Combating Adversarial Text-To-Image Prompts with Text Summarization
[AUTHORS]
Portia Cooper, Harshita Narnoli, Mihai Surdeanu
[ABSTRACT]
Text-to-image models are vulnerable to the stepwise “Divide-and-Conquer
Attack” (DACA), which utilizes a large language model to obfuscate inappropriate
content in prompts by wrapping sensitive text in a benign narrative. To
mitigate stepwise DACA attacks, we propose a two-layer method involving text
summarization followed by binary classification. We assembled the Adversarial
Text-to-Image Prompt (ATTIP) dataset ($N=940$), which contained DACA-obfuscated
and non-obfuscated prompts. From the ATTIP dataset, we created two summarized
versions: one generated by a small encoder model and the other by a large
language model. Then, we used an encoder classifier and a GPT-4o classifier to
perform content moderation on the summarized and unsummarized prompts. When
compared with a classifier that operated over the unsummarized data, our method
improved F1 score performance by 31%. Further, the highest recorded F1 score
achieved (98%) was produced by the encoder classifier on a summarized ATTIP
variant. This study indicates that pre-classification text summarization can
inoculate content detection models against stepwise DACA obfuscations.
[LINK]
http://arxiv.org/abs/2412.12212v1
[DATE]
2024-12-16 06:12:36+08:00
[CATEGORIES]
cs.CL
Aligning LLMs with Individual Preferences via Interaction
[AUTHORS]
Shujin Wu, May Fung, Cheng Qian, Jeonghwan Kim, Dilek Hakkani-Tur, Heng Ji
[ABSTRACT]
As large language models (LLMs) demonstrate increasingly advanced
capabilities, aligning their behaviors with human values and preferences
becomes crucial for their wide adoption. While previous research focuses on
general alignment to principles such as helpfulness, harmlessness, and honesty,
the need to account for individual and diverse preferences has been largely
overlooked, potentially undermining customized human experiences. To address
this gap, we train LLMs that can “interact to align”, essentially cultivating
the meta-skill of LLMs to implicitly infer the unspoken personalized
preferences of the current user through multi-turn conversations, and then
dynamically align their following behaviors and responses to these inferred
preferences. Our approach involves establishing a diverse pool of 3,310
distinct user personas by initially creating seed examples, which are then
expanded through iterative self-generation and filtering. Guided by distinct
user personas, we leverage multi-LLM collaboration to develop a multi-turn
preference dataset containing 3K+ multi-turn conversations in tree structures.
Finally, we apply supervised fine-tuning and reinforcement learning to enhance
LLMs using this dataset. For evaluation, we establish the ALOE (ALign With
CustOmized PrEferences) benchmark, consisting of 100 carefully selected
examples and well-designed metrics to measure the customized alignment
performance during conversations. Experimental results demonstrate the
effectiveness of our method in enabling dynamic, personalized alignment via
interaction.
[COMMENTS]
Accepted to COLING 2025. The code and dataset are made public at
https://github.com/ShujinWu-0814/ALOE
[LINK]
http://arxiv.org/abs/2410.03642v2
[DATE]
2024-12-16 05:40:35+08:00
[CATEGORIES]
cs.CL
Cross-Document Event-Keyed Summarization
[AUTHORS]
William Walden, Pavlo Kuchmiichuk, Alexander Martin, Chihsheng Jin, Angela Cao, Claire Sun, Curisia Allen, Aaron Steven White
[COMMENTS]
ACL Rolling Review long paper (in submission)
[LINK]
http://arxiv.org/abs/2410.14795v2
[DATE]
2024-12-16 05:32:26+08:00
[CATEGORIES]
cs.CL
Generics are puzzling. Can language models find the missing piece?
[AUTHORS]
Gustavo Cilleruelo Calderón, Emily Allaway, Barry Haddow, Alexandra Birch
[COMMENTS]
Accepted at CoLing 2025
[LINK]
http://arxiv.org/abs/2412.11318v1
[DATE]
2024-12-16 05:30:21+08:00
[CATEGORIES]
cs.CL
RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation
[AUTHORS]
Andrei-Marius Avram, Mircea Timpuriu, Andreea Iuga, Vlad-Cristian Matei, Iulian-Marius Tăiatu, Tudor Găină, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel
[COMMENTS]
Accepted at COLING 2024 (long papers)
[LINK]
http://arxiv.org/abs/2412.11317v1
[DATE]
2024-12-16 05:27:33+08:00
[CATEGORIES]
cs.CL
Reliable, Reproducible, and Really Fast Leaderboards with Evalica
[AUTHORS]
Dmitry Ustalov
[COMMENTS]
Accepted at COLING 2025 system demonstration track
[LINK]
http://arxiv.org/abs/2412.11314v1
[DATE]
2024-12-16 05:22:46+08:00
[CATEGORIES]
cs.CL
Large Concept Models: Language Modeling in a Sentence Representation Space
[AUTHORS]
LCM team, Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-jussà, David Dale, Hady Elsahar, Kevin Heffernan, João Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk
[ABSTRACT]
LLMs have revolutionized the field of artificial intelligence and have
emerged as the de-facto tool for many tasks. The current established technology
of LLMs is to process input and generate output at the token level. This is in
sharp contrast to humans who operate at multiple levels of abstraction, well
beyond single words, to analyze information and to generate creative content.
In this paper, we present an attempt at an architecture which operates on an
explicit higher-level semantic representation, which we name a concept.
Concepts are language- and modality-agnostic and represent a higher-level idea
or action in a flow. Hence, we build a “Large Concept Model”. In this study, as
proof of feasibility, we assume that a concept corresponds to a sentence, and
use an existing sentence embedding space, SONAR, which supports up to 200
languages in both text and speech modalities.
The Large Concept Model is trained to perform autoregressive sentence
prediction in an embedding space. We explore multiple approaches, namely MSE
regression, variants of diffusion-based generation, and models operating in a
quantized SONAR space. These explorations are performed using 1.6B parameter
models and training data in the order of 1.3T tokens. We then scale one
architecture to a model size of 7B parameters and training data of about 2.7T
tokens. We perform an experimental evaluation on several generative tasks,
namely summarization and a new task of summary expansion. Finally, we show that
our model exhibits impressive zero-shot generalization performance to many
languages, outperforming existing LLMs of the same size. The training code of
our models is freely available.
[COMMENTS]
49 pages
[LINK]
http://arxiv.org/abs/2412.08821v2
[DATE]
2024-12-16 05:20:12+08:00
[CATEGORIES]
cs.CL
Suicide Risk Assessment on Social Media with Semi-Supervised Learning
[AUTHORS]
Max Lovitt, Haotian Ma, Song Wang, Yifan Peng
[ABSTRACT]
With social media communities increasingly becoming places where suicidal
individuals post and congregate, natural language processing presents an
exciting avenue for the development of automated suicide risk assessment
systems. However, past efforts suffer from a lack of labeled data and class
imbalances within the available labeled data. To accommodate this task’s
imperfect data landscape, we propose a semi-supervised framework that leverages
labeled (n=500) and unlabeled (n=1,500) data and expands upon the self-training
algorithm with a novel pseudo-label acquisition process designed to handle
imbalanced datasets. To further ensure pseudo-label quality, we manually verify
a subset of the pseudo-labeled data that was not predicted unanimously across
multiple trials of pseudo-label generation. We test various models to serve as
the backbone for this framework, finding that RoBERTa performs the
best. Ultimately, by leveraging partially validated pseudo-labeled data in
addition to ground-truth labeled data, we substantially improve our model’s
ability to assess suicide risk from social media posts.
[COMMENTS]
Accepted for publication in the 2024 IEEE International Conference on
Big Data
[LINK]
http://arxiv.org/abs/2411.12767v2
[DATE]
2024-12-16 05:12:14+08:00
[CATEGORIES]
cs.CL
Assessing the Human Likeness of AI-Generated Counterspeech
[AUTHORS]
Xiaoying Song, Sujana Mamidisetty, Eduardo Blanco, Lingzi Hong
[ABSTRACT]
Counterspeech is a targeted response to counteract and challenge abusive or
hateful content. It effectively curbs the spread of hatred and fosters
constructive online communication. Previous studies have proposed different
strategies for automatically generated counterspeech. Evaluations, however,
focus on relevance, surface form, and other shallow linguistic characteristics.
This paper investigates the human likeness of AI-generated counterspeech, a
critical factor influencing effectiveness. We implement and evaluate several
LLM-based generation strategies, and discover that AI-generated and
human-written counterspeech can be easily distinguished by both simple
classifiers and humans. Further, we reveal differences in linguistic
characteristics, politeness, and specificity. The dataset used in this study is
publicly available for further research.
[COMMENTS]
Accepted for presentation at COLING 2025
[LINK]
http://arxiv.org/abs/2410.11007v2
[DATE]
2024-12-16 05:07:52+08:00
[CATEGORIES]
cs.CL
Sequence-Level Analysis of Leakage Risk of Training Data in Large Language Models
[AUTHORS]
Trishita Tiwari, G. Edward Suh
[ABSTRACT]
This work advocates for the use of sequence level probabilities for
quantifying the risk of extracting training data from Large Language Models
(LLMs) as they provide much finer-grained information than has been previously
obtained. We re-analyze the effects of decoding schemes, model-size, prefix
length, partial sequence leakages, and token positions to uncover new insights
that were not possible in prior work due to their choice of metrics. We
perform this study on two pre-trained models, LLaMa and OPT, trained on the
Common Crawl and Pile respectively. We discover that 1) Extraction rate, the
predominant metric used in prior quantification work, underestimates the threat
of leakage of training data in randomized LLMs by as much as 2.14x. 2) Though,
on average, larger models and longer prefixes can extract more data, this is
not true with a substantial portion of individual sequences. 30.4-41.5% of our
sequences are easier to extract with either shorter prefixes or smaller models.
3) Contrary to prior belief, partial leakage in the commonly used decoding
schemes like top-k and top-p is not easier than leaking verbatim training
data. 4) Extracting later tokens in a sequence is as much as 912% easier than
extracting earlier tokens. The insights gained from our analysis show that it
is important to look at leakage of training data on a per-sequence basis.
[LINK]
http://arxiv.org/abs/2412.11302v1
[DATE]
2024-12-16 04:27:45+08:00
[CATEGORIES]
cs.CL
cs.LG
CATER: Leveraging LLM to Pioneer a Multidimensional, Reference-Independent Paradigm in Translation Quality Evaluation
[AUTHORS]
Kurando IIDA, Kenjiro MIMURA
[ABSTRACT]
This paper introduces the Comprehensive AI-assisted Translation Edit Ratio
(CATER), a novel and fully prompt-driven framework for evaluating machine
translation (MT) quality. Leveraging large language models (LLMs) via a
carefully designed prompt-based protocol, CATER expands beyond traditional
reference-bound metrics, offering a multidimensional, reference-independent
evaluation that addresses linguistic accuracy, semantic fidelity, contextual
coherence, stylistic appropriateness, and information completeness. CATER’s
unique advantage lies in its immediate implementability: by providing the
source and target texts along with a standardized prompt, an LLM can rapidly
identify errors, quantify edit effort, and produce category-level and overall
scores. This approach eliminates the need for pre-computed references or
domain-specific resources, enabling instant adaptation to diverse languages,
genres, and user priorities through adjustable weights and prompt
modifications. CATER’s LLM-enabled strategy supports more nuanced assessments,
capturing phenomena such as subtle omissions, hallucinations, and
discourse-level shifts that increasingly challenge contemporary MT systems. By
uniting the conceptual rigor of frameworks like MQM and DQF with the
scalability and flexibility of LLM-based evaluation, CATER emerges as a
valuable tool for researchers, developers, and professional translators
worldwide. The framework and example prompts are openly available, encouraging
community-driven refinement and further empirical validation.
[COMMENTS]
17 pages, 1 sample prompt
[LINK]
http://arxiv.org/abs/2412.11261v1
[DATE]
2024-12-16 01:45:34+08:00
[CATEGORIES]
cs.CL
Beyond Discrete Personas: Personality Modeling Through Journal Intensive Conversations
[AUTHORS]
Sayantan Pal, Souvik Das, Rohini K. Srihari
[COMMENTS]
Accepted in COLING 2025
[LINK]
http://arxiv.org/abs/2412.11250v1
[DATE]
2024-12-16 01:16:08+08:00
[CATEGORIES]
cs.CL
VP-MEL: Visual Prompts Guided Multimodal Entity Linking
[AUTHORS]
Hongze Mi, Jinyuan Li, Xuying Zhang, Haoran Cheng, Jiahao Wang, Di Sun, Gang Pan
[ABSTRACT]
Multimodal entity linking (MEL), a task aimed at linking mentions within
multimodal contexts to their corresponding entities in a knowledge base (KB),
has attracted much attention due to its wide applications in recent years.
However, existing MEL methods often rely heavily on mention words as retrieval
cues, which limits their ability to effectively utilize information from both
images and text. This reliance poses significant challenges in scenarios where
mention words are absent, as current MEL approaches struggle to leverage
image-text pairs for accurate entity linking. To solve these issues, we
introduce a Visual Prompts guided Multimodal Entity Linking (VP-MEL) task.
Given a text-image pair, VP-MEL aims to link a marked region (i.e., visual
prompt) in an image to its corresponding entities in the knowledge base. To
facilitate this task, we present a new dataset, VPWiki, specifically designed
for VP-MEL. Furthermore, we propose a framework named FBMEL, which enhances
visual feature extraction using visual prompts and leverages the pretrained
Detective-VLM model to capture latent information. Experimental results on the
VPWiki dataset demonstrate that FBMEL outperforms baseline methods across
multiple benchmarks for the VP-MEL task.
[LINK]
http://arxiv.org/abs/2412.06720v3
[DATE]
2024-12-16 00:48:59+08:00
[CATEGORIES]
cs.CL
CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
[AUTHORS]
Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng
[ABSTRACT]
Recent advancements in zero-shot text-to-speech (TTS) modeling have led to
significant strides in generating high-fidelity and diverse speech. However,
dialogue generation, along with achieving human-like naturalness in speech,
continues to be a challenge. In this paper, we introduce CoVoMix:
Conversational Voice Mixture Generation, a novel model for zero-shot,
human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix
first converts dialogue text into multiple streams of discrete tokens, with
each token stream representing semantic information for individual talkers.
These token streams are then fed into a flow-matching based acoustic model to
generate mixed mel-spectrograms. Finally, the speech waveforms are produced
using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics
for measuring the effectiveness of dialogue modeling and generation. Our
experimental results show that CoVoMix can generate dialogues that are not only
human-like in their naturalness and coherence but also involve multiple talkers
engaging in multiple rounds of conversation. This is exemplified by instances
generated in a single channel where one speaker’s utterance is seamlessly mixed
with another’s interjections or laughter, indicating the latter’s role as an
attentive listener. Audio samples are available at https://aka.ms/covomix.
[COMMENTS]
Neural Information Processing Systems 2024, poster
[LINK]
http://arxiv.org/abs/2404.06690v3
[DATE]
2024-12-16 00:30:54+08:00
[CATEGORIES]
cs.CL
cs.LG
Robust Synthetic Data-Driven Detection of Living-Off-the-Land Reverse Shells
[AUTHORS]
Dmitrijs Trizna, Luca Demetrio, Battista Biggio, Fabio Roli
[ABSTRACT]
Living-off-the-land (LOTL) techniques pose a significant challenge to
security operations, exploiting legitimate tools to execute malicious commands
that evade traditional detection methods. To address this, we present a robust
augmentation framework for cyber defense systems such as Security Information and
Event Management (SIEM) solutions, enabling the detection of LOTL attacks such
as reverse shells through machine learning. Leveraging real-world threat
intelligence and adversarial training, our framework synthesizes diverse
malicious datasets while preserving the variability of legitimate activity,
ensuring high accuracy and low false-positive rates. We validate our approach
through extensive experiments on enterprise-scale datasets, achieving a 90%
improvement in detection rates over non-augmented baselines at an
industry-grade False Positive Rate (FPR) of $10^{-5}$. We define black-box
data-driven attacks that successfully evade unprotected models, and develop
defenses to mitigate them, producing adversarially robust variants of ML
models. Ethical considerations are central to this work; we discuss safeguards
for synthetic data generation and the responsible release of pre-trained models
across four best performing architectures, including both adversarially and
regularly trained variants: https://huggingface.co/dtrizna/quasarnix.
Furthermore, we provide a malicious LOTL dataset containing over 1 million
augmented attack variants to enable reproducible research and community
collaboration: https://huggingface.co/datasets/dtrizna/QuasarNix. This work
offers a reproducible, scalable, and production-ready defense against evolving
LOTL threats.
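The operating point quoted above (detection rate at a fixed false-positive rate) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `tpr_at_fpr` and the convention that higher scores mean "more malicious" are assumptions made for illustration only.

```python
import numpy as np

def tpr_at_fpr(benign_scores, malicious_scores, target_fpr):
    """Return (threshold, tpr) such that the empirical FPR is <= target_fpr.

    Assumption: higher score means "more likely malicious".
    """
    benign = np.sort(np.asarray(benign_scores, dtype=float))[::-1]  # descending
    malicious = np.asarray(malicious_scores, dtype=float)
    # Number of benign samples that may be flagged without exceeding the target.
    allowed_fp = int(np.floor(target_fpr * benign.size))
    # Flag only scores strictly above the (allowed_fp + 1)-th largest benign score.
    threshold = benign[allowed_fp] if allowed_fp < benign.size else -np.inf
    tpr = float(np.mean(malicious > threshold))
    return threshold, tpr
```

One practical consequence of this definition: at a target FPR of $10^{-5}$, a benign validation set with fewer than $10^5$ samples permits zero false positives, so the threshold is pinned to the highest benign score.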
[LINK]
http://arxiv.org/abs/2402.18329v2
[DATE]
2024-12-16 23:43:48+08:00
[CATEGORIES]
cs.LG
Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization
[AUTHORS]
Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, Ye Shi
[ABSTRACT]
Diffusion models have garnered widespread attention in Reinforcement Learning
(RL) for their powerful expressiveness and multimodality. It has been verified
that utilizing diffusion policies can significantly improve the performance of
RL algorithms in continuous control tasks by overcoming the limitations of
unimodal policies, such as Gaussian policies, and providing the agent with
enhanced exploration capabilities. However, existing works mainly focus on the
application of diffusion policies in offline RL, while their incorporation into
online RL is less investigated. The training objective of the diffusion model,
known as the variational lower bound, cannot be optimized directly in online RL
due to the unavailability of ‘good’ actions. This leads to difficulties in
conducting diffusion policy improvement. To overcome this, we propose a novel
model-free diffusion-based online RL algorithm, Q-weighted Variational Policy
Optimization (QVPO). Specifically, we introduce the Q-weighted variational
loss, which can be proved to be a tight lower bound of the policy objective in
online RL under certain conditions. To fulfill these conditions, the Q-weight
transformation functions are introduced for general scenarios. Additionally, to
further enhance the exploration capability of the diffusion policy, we design a
special entropy regularization term. We also develop an efficient behavior
policy to enhance sample efficiency by reducing the variance of the diffusion
policy during online interactions. Consequently, the QVPO algorithm leverages
the exploration capabilities and multimodality of diffusion policies,
preventing the RL agent from converging to a sub-optimal policy. To verify the
effectiveness of QVPO, we conduct comprehensive experiments on MuJoCo
benchmarks. The final results demonstrate that QVPO achieves state-of-the-art
performance on both cumulative reward and sample efficiency.
[COMMENTS]
Accepted by NeurIPS 2024
[LINK]
http://arxiv.org/abs/2405.16173v3
[DATE]
2024-12-16 23:42:46+08:00
[CATEGORIES]
cs.LG
Bayesian Surrogate Training on Multiple Data Sources: A Hybrid Modeling Strategy
[AUTHORS]
Philipp Reiser, Paul-Christian Bürkner, Anneli Guthke
[ABSTRACT]
Surrogate models are often used as computationally efficient approximations
to complex simulation models, enabling tasks such as solving inverse problems,
sensitivity analysis, and probabilistic forward predictions, which would
otherwise be computationally infeasible. During training, surrogate parameters
are fitted such that the surrogate reproduces the simulation model’s outputs as
closely as possible. However, the simulation model itself is merely a
simplification of the real-world system, often missing relevant processes or
suffering from misspecifications, e.g., in inputs or boundary conditions. Hints
about these might be captured in real-world measurement data, and yet, we
typically ignore those hints during surrogate building. In this paper, we
propose two novel probabilistic approaches to integrate simulation data and
real-world measurement data during surrogate training. The first method trains
separate surrogate models for each data source and combines their predictive
distributions, while the second incorporates both data sources by training a
single surrogate. We show the conceptual differences and benefits of the two
approaches through both synthetic and real-world case studies. The results
demonstrate the potential of these methods to improve predictive accuracy,
predictive coverage, and to diagnose problems in the underlying simulation
model. These insights can improve system understanding and future model
development.
[LINK]
http://arxiv.org/abs/2412.11875v1
[DATE]
2024-12-16 23:27:28+08:00
[CATEGORIES]
cs.LG
Transformers Use Causal World Models in Maze-Solving Tasks
[AUTHORS]
Alex F. Spies, William Edwards, Michael I. Ivanitskiy, Adrians Skapars, Tilman Räuker, Katsumi Inoue, Alessandra Russo, Murray Shanahan
[ABSTRACT]
Recent studies in interpretability have explored the inner workings of
transformer models trained on tasks across various domains, often discovering
that these networks naturally develop surprisingly structured representations.
When such representations comprehensively reflect the task domain’s structure,
they are commonly referred to as “World Models” (WMs). In this work, we
discover such WMs in transformers trained on maze tasks. In particular, by
employing Sparse Autoencoders (SAEs) and analysing attention patterns, we
examine the construction of WMs and demonstrate consistency between the circuit
analysis and the SAE feature-based analysis. We intervene upon the isolated
features to confirm their causal role and, in doing so, find asymmetries
between certain types of interventions. Surprisingly, we find that models are
able to reason with respect to a greater number of active features than they
see during training, even if attempting to specify these in the input token
sequence would lead the model to fail. Furthermore, we observe that varying
positional encodings can alter how WMs are encoded in a model’s residual
stream. By analyzing the causal role of these WMs in a toy domain we hope to
make progress toward an understanding of emergent structure in the
representations acquired by Transformers, leading to the development of more
interpretable and controllable AI systems.
[COMMENTS]
Main paper: 9 pages, 9 figures. Supplementary material: 10 pages, 17
additional figures. Code and data will be available upon publication.
Corresponding author: A. F. Spies ([email protected])
[LINK]
http://arxiv.org/abs/2412.11867v1
[DATE]
2024-12-16 23:21:04+08:00
[CATEGORIES]
cs.LG
Ensemble Learning and 3D Pix2Pix for Comprehensive Brain Tumor Analysis in Multimodal MRI
[AUTHORS]
Ramy A. Zeineldin, Franziska Mathis-Ullrich
[ABSTRACT]
Motivated by the need for advanced solutions in the segmentation and
inpainting of glioma-affected brain regions in multi-modal magnetic resonance
imaging (MRI), this study presents an integrated approach leveraging the
strengths of ensemble learning with hybrid transformer models and convolutional
neural networks (CNNs), alongside the innovative application of 3D Pix2Pix
Generative Adversarial Network (GAN). Our methodology combines robust tumor
segmentation capabilities, utilizing axial attention and transformer encoders
for enhanced spatial relationship modeling, with the ability to synthesize
biologically plausible brain tissue through 3D Pix2Pix GAN. This integrated
approach addresses the BraTS 2023 cluster challenges by offering precise
segmentation and realistic inpainting, tailored for diverse tumor types and
sub-regions. The results demonstrate outstanding performance, evidenced by
quantitative evaluations such as the Dice Similarity Coefficient (DSC),
Hausdorff Distance (HD95) for segmentation, and Structural Similarity Index
Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean-Square Error (MSE)
for inpainting. Qualitative assessments further validate the high-quality,
clinically relevant outputs. In conclusion, this study underscores the
potential of combining advanced machine learning techniques for comprehensive
brain tumor analysis, promising significant advancements in clinical
decision-making and patient care within the realm of medical imaging.
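For readers unfamiliar with the segmentation metric named above, the Dice Similarity Coefficient of two binary masks $A$ and $B$ is $2|A \cap B| / (|A| + |B|)$. The sketch below is a generic implementation, not the authors' or the BraTS evaluation code; the function name and the convention of returning 1.0 for two empty masks are assumptions.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / denom
```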
[COMMENTS]
Accepted at the MICCAI BraTS Challenge 2023
[LINK]
http://arxiv.org/abs/2412.11849v1
[DATE]
2024-12-16 23:10:53+08:00
[CATEGORIES]
cs.LG
Proactive Model Adaptation Against Concept Drift for Online Time Series Forecasting
[AUTHORS]
Lifan Zhao, Yanyan Shen
[ABSTRACT]
Time series forecasting always faces the challenge of concept drift, where
data distributions evolve over time, leading to a decline in forecast model
performance. Existing solutions are based on online learning, which continually
organizes recent time series observations as new training samples and updates
model parameters according to the forecasting feedback on recent data. However,
they overlook a critical issue: obtaining ground-truth future values of each
sample should be delayed until after the forecast horizon. This delay creates a
temporal gap between the training samples and the test sample. Our empirical
analysis reveals that the gap can introduce concept drift, causing forecast
models to adapt to outdated concepts. In this paper, we present
Proceed, a novel proactive model adaptation framework for online time
series forecasting. Proceed first estimates the concept drift between
the recently used training samples and the current test sample. It then employs
an adaptation generator to efficiently translate the estimated drift into
parameter adjustments, proactively adapting the model to the test sample. To
enhance the generalization capability of the framework, Proceed is
trained on synthetic diverse concept drifts. Extensive experiments on five
real-world datasets across various forecast models demonstrate that
Proceed brings more performance improvements than the state-of-the-art
online learning methods, significantly facilitating forecast models’ resilience
against concept drifts. Code is available at
https://github.com/SJTU-DMTai/OnlineTSF.
[COMMENTS]
Accepted by KDD 2025. Preprint version
[LINK]
http://arxiv.org/abs/2412.08435v2
[DATE]
2024-12-16 23:02:52+08:00
[CATEGORIES]
cs.LG
Evaluating the Efficacy of Vectocardiographic and ECG Parameters for Efficient Tertiary Cardiology Care Allocation Using Decision Tree Analysis
[AUTHORS]
Lucas José da Costa, Vinicius Ruiz Uemoto, Mariana F. N. de Marchi, Renato de Aguiar Hortegal, Renata Valeri de Freitas
[ABSTRACT]
We use real-world data to evaluate the performance of the electrocardiographic
markers of global electric heterogeneity (GEH) as features in a machine
learning model, alongside standard ECG features and risk factors, for
predicting the outcomes of patients referred to a tertiary cardiology hospital.
Patients referred for specific evaluation at a specialized cardiology
hospital underwent an ECG and a risk factor anamnesis. Follow-up attendances
took place at 6, 12, and 15 months to check for cardiovascular-related events
(mortality or new nonfatal cardiovascular events: stroke, MI, PCI, CS), as
identified during 1-year phone follow-ups.
The first-attendance ECG was measured by a specialist and processed to
obtain the GEH using the Kors matrix. The ECG measurements, GEH parameters,
and risk factors were combined to train multiple instances of XGBoost decision
tree models. Each instance was optimized for the AUCPR, and the instance with
the highest AUC was chosen as representative of the model. The importance of
each parameter in the winning tree model was compared to better understand the
improvement from using GEH parameters.
The GEH parameters turned out to be statistically significant for this
population, especially the QRST angle and the SVG. The combined model with the
tree parameters class had the best performance. The findings suggest that using
VCG features can facilitate more accurate identification of patients who
require tertiary care, thereby optimizing resource allocation and improving
patient outcomes. Moreover, the decision tree model’s transparency and ability
to pinpoint critical features make it a valuable tool for clinical
decision-making and align well with existing clinical practices.
[LINK]
http://arxiv.org/abs/2412.11839v1
[DATE]
2024-12-16 23:01:53+08:00
[CATEGORIES]
cs.LG
The Eclipsing Binaries via Artificial Intelligence. II. Need for Speed in PHOEBE Forward Models
[AUTHORS]
Marcin Wrona, Andrej Prša
[ABSTRACT]
In modern astronomy, the quantity of data collected has vastly exceeded the
capacity for manual analysis, necessitating the use of advanced artificial
intelligence (AI) techniques to assist scientists with the most labor-intensive
tasks. AI can optimize simulation codes where computational bottlenecks arise
from the time required to generate forward models. One such example is PHOEBE,
a modeling code for eclipsing binaries (EBs), where simulating individual
systems is feasible, but analyzing observables for extensive parameter
combinations is highly time-consuming.
To address this, we present a fully connected feedforward artificial neural
network (ANN) trained on a dataset of over one million synthetic light curves
generated with PHOEBE. Optimization of the ANN architecture yielded a model
with six hidden layers, each containing 512 nodes, which provides an optimized
balance between accuracy and computational complexity. Extensive testing
enabled us to establish ANN’s applicability limits and to quantify the
systematic and statistical errors associated with using such networks for EB
analysis. Our findings demonstrate the critical role of dilution effects in
parameter estimation for EBs, and we outline methods to incorporate these
effects in AI-based models.
This proposed ANN framework enables a speedup of over four orders of
magnitude compared to traditional methods, with systematic errors not exceeding
1%, and often as low as 0.01%, across the entire parameter space.
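To give a sense of the scale of the architecture described above (six hidden layers of 512 nodes each), here is a minimal NumPy sketch of such a fully connected network. It is not the authors' model: the input and output sizes, the ReLU activation, and the He-style initialisation are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def init_mlp(n_in, n_out, hidden=512, n_hidden_layers=6, seed=0):
    """Create (weight, bias) pairs for a fully connected feedforward network."""
    rng = np.random.default_rng(seed)
    sizes = [n_in] + [hidden] * n_hidden_layers + [n_out]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # He-style initialisation (an assumption, not taken from the paper).
        w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        b = np.zeros(fan_out)
        params.append((w, b))
    return params

def forward(params, x):
    """Forward pass: ReLU on hidden layers (assumed), linear output layer."""
    h = np.asarray(x, dtype=float)
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    return h @ w + b

def count_parameters(params):
    """Total number of trainable weights and biases."""
    return sum(w.size + b.size for w, b in params)

# Hypothetical sizes: 8 input parameters, 100 flux samples per light curve.
params = init_mlp(n_in=8, n_out=100)
curves = forward(params, np.zeros((3, 8)))  # batch of 3 parameter vectors
```

A single forward pass is a handful of matrix multiplications, which is why a network of this kind can be far cheaper to evaluate than a physical forward model.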
[COMMENTS]
Submitted to AAS Journals. 26 pages, 21 figures, 3 tables
[LINK]
http://arxiv.org/abs/2412.11837v1
[DATE]
2024-12-16 23:00:15+08:00
[CATEGORIES]
cs.LG
Data-Incremental Continual Offline Reinforcement Learning
[AUTHORS]
Sibo Gai, Donglin Wang
[ABSTRACT]
In this work, we propose a new setting of continual learning:
data-incremental continual offline reinforcement learning (DICORL), in which an
agent is asked to learn a sequence of datasets of a single offline
reinforcement learning (RL) task continually, instead of learning a sequence of
offline RL tasks with respective datasets. Then, we propose that this new
setting will introduce a unique challenge to continual learning: active
forgetting, which means that the agent will forget the learnt skill actively.
The main reason for active forgetting is conservative learning used by offline
RL, which is used to solve the overestimation problem. With conservative
learning, the offline RL method will suppress the value of all actions, learnt
or not, without selection, unless they are in the dataset currently being learned.
Therefore, inferior data may overlay premium data because of the learning
sequence. To solve this problem, we propose a new algorithm, called
experience-replay-based ensemble implicit Q-learning (EREIQL), which introduces
multiple value networks to reduce the initial value and avoid using
conservative learning, and the experience replay to relieve catastrophic
forgetting. Our experiments show that EREIQL relieves active forgetting in
DICORL and performs well.
[COMMENTS]
10 pages, 9 figures
[LINK]
http://arxiv.org/abs/2404.12639v3
[DATE]
2024-12-16 23:00:05+08:00
[CATEGORIES]
cs.LG
Homeostasis and Sparsity in Transformer
[AUTHORS]
Leonid Kotyuzanskiy, Artem Klimov
[ABSTRACT]
The transformer architecture has become an integral part of the field of
modern neural networks, playing a crucial role in a variety of tasks, such as
text generation, machine translation, image and audio processing, among others.
There is also an alternative approach to building intelligent systems, proposed
by Jeff Hawkins and inspired by the processes occurring in the neocortex. In
our article we want to combine some of these ideas and to propose the use of
homeostasis mechanisms, such as RFB-kWTA and “Smart” Inhibition, in the
attention mechanism of the transformer and at the output of the transformer
block, as well as conducting an experiment involving the introduction of sparse
distributed representations of the transformer at various points. RFB-kWTA
utilizes statistics of layer activations across time to adjust the entire
layer, enhancing the values of rare activations while reducing those of
frequent ones. “Smart” Inhibition also uses activation statistics to sample
sparsity masks, with rarely activated units being more likely to be activated.
Our proposed mechanisms significantly outperform the classical transformer
(0.2768 BLEU) and a model that only makes use of dropout in the attention
mechanism and output of the transformer block (0.3007 BLEU), achieving a score
of 0.3062 BLEU on the Multi30K dataset.
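The general idea of homeostatic sparsification described above can be sketched as a k-winners-take-all (kWTA) layer with frequency-based boosting. This is a generic illustration and not the authors' RFB-kWTA or "Smart" Inhibition: the class name, the exponential boost rule, the running-average update, and the assumption of non-negative 1-D activations are all choices made here for the example.

```python
import numpy as np

class BoostedKWTA:
    """Keep the k largest boosted activations; boost rarely active units."""

    def __init__(self, n_units, k, boost_strength=1.0, momentum=0.9):
        self.k = k
        self.boost_strength = boost_strength
        self.momentum = momentum
        self.target = k / n_units  # desired fraction of time each unit is active
        self.duty_cycle = np.full(n_units, self.target)

    def __call__(self, x):
        x = np.asarray(x, dtype=float)  # assumed 1-D and non-negative
        # Units active less often than the target get a boost factor > 1,
        # units active more often get a factor < 1.
        boost = np.exp(self.boost_strength * (self.target - self.duty_cycle))
        winners = np.argpartition(x * boost, -self.k)[-self.k:]
        mask = np.zeros_like(x, dtype=bool)
        mask[winners] = True
        # Running average of how often each unit wins.
        self.duty_cycle = (self.momentum * self.duty_cycle
                           + (1.0 - self.momentum) * mask)
        # Boosting only affects selection; surviving values are unchanged.
        return np.where(mask, x, 0.0)
```

Over repeated calls, units that win often see their boost shrink, which gives other units a chance to win and keeps the usage of the layer roughly balanced.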
[LINK]
http://arxiv.org/abs/2412.00503v3
[DATE]
2024-12-16 22:59:05+08:00
[CATEGORIES]
cs.LG
You Only Submit One Image to Find the Most Suitable Generative Model
[AUTHORS]
Zhi Zhou, Lan-Zhe Guo, Peng-Xiao Song, Yu-Feng Li
[ABSTRACT]
Deep generative models have achieved promising results in image generation,
and various generative model hubs, e.g., Hugging Face and Civitai, have been
developed that enable model developers to upload models and users to download
models. However, these model hubs lack advanced model management and
identification mechanisms, resulting in users only searching for models through
text matching, download sorting, etc., making it difficult to efficiently find
the model that best meets user requirements. In this paper, we propose a novel
setting called Generative Model Identification (GMI), which aims to enable the
user to identify the most appropriate generative model(s) for the user’s
requirements from a large number of candidate models efficiently. To the best
of our knowledge, this setting has not been studied yet. In this paper, we introduce a
comprehensive solution consisting of three pivotal modules: a weighted Reduced
Kernel Mean Embedding (RKME) framework for capturing the generated image
distribution and the relationship between images and prompts, a pre-trained
vision-language model aimed at addressing dimensionality challenges, and an
image interrogator designed to tackle cross-modality issues. Extensive
empirical results demonstrate the proposal is both efficient and effective. For
example, users only need to submit a single example image to describe their
requirements, and the model platform can achieve an average top-4
identification accuracy of more than 80%.
[COMMENTS]
Accepted by NeurIPS 2023 Workshop on Diffusion Models
[LINK]
http://arxiv.org/abs/2412.12232v1
[DATE]
2024-12-16 22:46:57+08:00
[CATEGORIES]
cs.LG
Demonstrating Data-to-Knowledge Pipelines for Connecting Production Sites in the World Wide Lab
[AUTHORS]
Leon Gorißen, Jan-Niklas Schneider, Mohamed Behery, Philipp Brauner, Moritz Lennartz, David Kötter, Thomas Kaster, Oliver Petrovic, Christian Hinke, Thomas Gries, Gerhard Lakemeyer, Martina Ziefle, Christian Brecher, Constantin Häfner
[ABSTRACT]
The digital transformation of production requires new methods of data
integration and storage, as well as decision making and support systems that
work vertically and horizontally throughout the development, production, and
use cycle. In this paper, we propose Data-to-Knowledge (and Knowledge-to-Data)
pipelines for production as a universal concept building on a network of
Digital Shadows (a concept augmenting Digital Twins). We show a proof of
concept that builds on and bridges existing infrastructure to 1) capture and
semantically annotate trajectory data from multiple similar but independent
robots in different organisations and use cases in a data lakehouse and 2) run an
independent process that dynamically queries matching data for training an
inverse dynamic foundation model for robotic control. The article discusses the
challenges and benefits of this approach and how Data-to-Knowledge pipelines
contribute efficiency gains and industrial scalability in a World Wide Lab as a
research outlook.
[COMMENTS]
15 pages, 6 figures, submitted to CAiSE 2025
[LINK]
http://arxiv.org/abs/2412.12231v1
[DATE]
2024-12-16 22:36:51+08:00
[CATEGORIES]
cs.LG
Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory
[AUTHORS]
Wadjih Bencheikh, Jan Finkbeiner, Emre Neftci
[ABSTRACT]
Recurrent neural networks (RNNs) are valued for their computational
efficiency and reduced memory requirements on tasks involving long sequence
lengths but require high memory-processor bandwidth to train. Checkpointing
techniques can reduce the memory requirements by only storing a subset of
intermediate states, the checkpoints, but are still rarely used due to the
computational overhead of the additional recomputation phase. This work
addresses these challenges by introducing memory-efficient gradient
checkpointing strategies tailored for the general class of sparse RNNs and
Spiking Neural Networks (SNNs). SNNs are energy efficient alternatives to RNNs
thanks to their local, event-driven operation and potential neuromorphic
implementation. We use the Intelligence Processing Unit (IPU) as an exemplary
platform for architectures with distributed local memory. We exploit its
suitability for sparse and irregular workloads to scale SNN training on long
sequence lengths. We find that Double Checkpointing emerges as the most
effective method, optimizing the use of local memory resources while minimizing
recomputation overhead. This approach reduces dependency on slower large-scale
memory access, enabling training on sequences over 10 times longer or 4 times
larger networks than previously feasible, with only marginal time overhead. The
presented techniques demonstrate significant potential to enhance scalability
and efficiency in training sparse and recurrent networks across diverse
hardware platforms, and highlight the benefits of sparse activations for
scalable recurrent neural network training.
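The memory/recomputation trade-off behind checkpointing can be shown with a toy scalar recurrence $h_{t+1} = \tanh(w\,h_t + x_t)$ and loss $L = h_T$. The sketch below is plain single-level checkpointing, not the Double Checkpointing scheme studied in the paper; the recurrence, function names, and the `every` parameter are assumptions chosen for illustration.

```python
import numpy as np

def step(h, x, w):
    """One step of the toy recurrence."""
    return np.tanh(w * h + x)

def grad_full(xs, w, h0=0.0):
    """dL/dw with every hidden state stored (O(T) memory)."""
    hs = [h0]
    for x in xs:
        hs.append(step(hs[-1], x, w))
    g, grad_w = 1.0, 0.0  # g holds dL/dh_{t+1}
    for t in reversed(range(len(xs))):
        local = 1.0 - hs[t + 1] ** 2  # derivative of tanh
        grad_w += g * local * hs[t]
        g *= local * w
    return grad_w

def grad_checkpointed(xs, w, every, h0=0.0):
    """dL/dw storing only every `every`-th state and recomputing the rest."""
    T = len(xs)
    checkpoints = {}
    h = h0
    for t in range(T):
        if t % every == 0:
            checkpoints[t] = h  # keep only a subset of the states
        h = step(h, xs[t], w)
    g, grad_w = 1.0, 0.0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, T)
        # Recompute the states of this segment from its checkpoint.
        seg = [checkpoints[start]]
        for t in range(start, end):
            seg.append(step(seg[-1], xs[t], w))
        # Backpropagate through the recomputed segment.
        for t in reversed(range(start, end)):
            h_prev, h_next = seg[t - start], seg[t - start + 1]
            local = 1.0 - h_next ** 2
            grad_w += g * local * h_prev
            g *= local * w
    return grad_w
```

Both functions return the same gradient; the checkpointed version holds roughly `T / every + every` states at a time instead of `T`, at the cost of one extra forward pass over each segment.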
[LINK]
http://arxiv.org/abs/2412.11810v1
[DATE]
2024-12-16 22:23:31+08:00
[CATEGORIES]
cs.LG
TrackFormers: In Search of Transformer-Based Particle Tracking for the High-Luminosity LHC Era
[AUTHORS]
Sascha Caron, Nadezhda Dobreva, Antonio Ferrer Sánchez, José D. Martín-Guerrero, Uraz Odyurt, Roberto Ruiz de Austri Bazan, Zef Wolffs, Yue Zhao
[ABSTRACT]
High-Energy Physics experiments are facing a multi-fold data increase with
every new iteration. This is certainly the case for the upcoming
High-Luminosity LHC upgrade. Such increased data processing requirements force
revisions to almost every step of the data processing pipeline. One such step
in need of an overhaul is the task of particle track reconstruction, a.k.a.,
tracking. A Machine Learning-assisted solution is expected to provide
significant improvements, since the most time-consuming step in tracking is the
assignment of hits to particles or track candidates. This is the topic of this
paper.
We take inspiration from large language models. As such, we consider two
approaches: the prediction of the next word in a sentence (next hit point in a
track), as well as the one-shot prediction of all hits within an event. In an
extensive design effort, we have experimented with three models based on the
Transformer architecture and one model based on the U-Net architecture,
performing track association predictions for collision event hit points. In our
evaluation, we consider a spectrum of simple to complex representations of the
problem, eliminating designs with lower metrics early on. We report extensive
results, covering both prediction accuracy (score) and computational
performance. We have made use of the REDVID simulation framework, as well as
reductions applied to the TrackML data set, to compose five data sets from
simple to complex, for our experiments. The results highlight distinct
advantages among different designs in terms of prediction accuracy and
computational performance, demonstrating the efficiency of our methodology.
Most importantly, the results show the viability of a one-shot
encoder-classifier based Transformer solution as a practical approach for the
task of tracking.
[LINK]
http://arxiv.org/abs/2407.07179v2
[DATE]
2024-12-16 22:22:23+08:00
[CATEGORIES]
cs.LG
Scalable Temporal Anomaly Causality Discovery in Large Systems: Achieving Computational Efficiency with Binary Anomaly Flag Data
[AUTHORS]
Mulugeta Weldezgina Asres, Christian Walter Omlin, The CMS-HCAL Collaboration
[ABSTRACT]
Extracting anomaly causality facilitates diagnostics once monitoring systems
detect system faults. Identifying anomaly causes in large systems involves
investigating a more extensive set of monitoring variables across multiple
subsystems. However, learning causal graphs comes with a significant
computational burden that restrains the applicability of most existing methods
in real-time and large-scale deployments. In addition, modern monitoring
applications for large systems often generate large amounts of binary alarm
flags, and the distinct characteristics of binary anomaly data – the meaning
of state transition and data sparsity – challenge existing causality learning
mechanisms. This study proposes an anomaly causal discovery approach
(AnomalyCD), addressing the accuracy and computational challenges of generating
causal graphs from binary flag data sets. The AnomalyCD framework presents
several strategies, such as anomaly flag characteristics incorporating
causality testing, sparse data and link compression, and edge pruning
adjustment approaches. We validate the performance of this framework on two
datasets: monitoring sensor data of the readout-box system of the Compact Muon
Solenoid experiment at CERN, and a public data set for information technology
monitoring. The results demonstrate the considerable reduction of the
computation overhead and moderate enhancement of the accuracy of temporal
causal discovery on binary anomaly data sets.
[COMMENTS]
30 pages, 17 figures, 9 tables
[LINK]
http://arxiv.org/abs/2412.11800v1
[DATE]
2024-12-16 22:11:28+08:00
[CATEGORIES]
cs.LG
Fast and Slow Gradient Approximation for Binary Neural Network Optimization
[AUTHORS]
Xinquan Chen, Junqi Gao, Biqing Qi, Dong Li, Yiang Luo, Fangyuan Li, Pengfei Li
[ABSTRACT]
Binary Neural Networks (BNNs) have garnered significant attention due to
their immense potential for deployment on edge devices. However, the
non-differentiability of the quantization function poses a challenge for the
optimization of BNNs, as its derivative cannot be backpropagated. To address
this issue, hypernetwork based methods, which utilize neural networks to learn
the gradients of non-differentiable quantization functions, have emerged as a
promising approach due to their adaptive learning capabilities to reduce
estimation errors. However, existing hypernetwork based methods typically rely
solely on current gradient information, neglecting the influence of historical
gradients. This oversight can lead to accumulated gradient errors when
calculating gradient momentum during optimization. To incorporate historical
gradient information, we design a Historical Gradient Storage (HGS) module,
which models the historical gradient sequence to generate the first-order
momentum required for optimization. To further enhance gradient generation in
hypernetworks, we propose a Fast and Slow Gradient Generation (FSG) method.
Additionally, to produce more precise gradients, we introduce Layer Recognition
Embeddings (LRE) into the hypernetwork, facilitating the generation of
layer-specific fine gradients. Extensive comparative experiments on the
CIFAR-10 and CIFAR-100 datasets demonstrate that our method achieves faster
convergence and lower loss values, outperforming existing baselines. Code is
available at http://github.com/two-tiger/FSG .
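
For context on the non-differentiability problem described above, here is a minimal sketch of the classical straight-through estimator (STE) combined with a plain momentum update. It is a common baseline, not the hypernetwork-based HGS/FSG method of the paper; the clipping threshold, learning rate, momentum coefficient, and the example gradient values are all illustrative assumptions.

```python
import numpy as np

def binarize(w):
    """Forward pass: sign quantization to {-1, +1} (zero maps to +1)."""
    return np.where(w >= 0, 1.0, -1.0)

def ste_grad(w, grad_out, clip=1.0):
    """Straight-through estimator: pass the upstream gradient where |w| <= clip."""
    return grad_out * (np.abs(w) <= clip)

def momentum_step(w, m, grad, lr=0.1, beta=0.9):
    """One first-order momentum update on the latent real-valued weights."""
    m = beta * m + (1.0 - beta) * grad
    return w - lr * m, m

w = np.array([0.3, -1.5, -0.2])          # latent real-valued weights
m = np.zeros_like(w)                     # momentum buffer
grad_out = np.array([1.0, 1.0, -2.0])    # hypothetical upstream gradient
g = ste_grad(w, grad_out)                # gradient is blocked for |w| > 1
w_new, m_new = momentum_step(w, m, g)
```

Because the momentum buffer accumulates estimated gradients, any bias in the per-step estimate carries over to later steps, which is the accumulation effect the abstract refers to.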
[COMMENTS]
Accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11777v1
[DATE]
2024-12-16 21:48:40+08:00
[CATEGORIES]
cs.LG
Enhancing Transformer-based models for Long Sequence Time Series Forecasting via Structured Matrix
[AUTHORS]
Zhicheng Zhang, Yong Wang, Shaoqi Tan, Bowei Xia, Yujie Luo
[ABSTRACT]
Recently, Transformer-based models for long sequence time series forecasting
have demonstrated promising results. The self-attention mechanism as the core
component of these Transformer-based models exhibits great potential in
capturing various dependencies among data points. Despite these advancements,
improving the efficiency of the self-attention mechanism remains a concern.
Unfortunately, current specific optimization methods face challenges in
applicability and scalability for the future
design of long sequence time series forecasting models. Hence, in this article,
we propose a novel architectural framework that enhances Transformer-based
models through the integration of Surrogate Attention Blocks (SAB) and
Surrogate Feed-Forward Neural Network Blocks (SFB). The framework reduces both
time and space complexity by the replacement of the self-attention and
feed-forward layers with SAB and SFB while maintaining their expressive power
and architectural advantages. The equivalence of this substitution is fully
demonstrated. Extensive experiments on 10 Transformer-based models across
five distinct time series tasks demonstrate an average performance improvement
of 12.4%, alongside a 61.3% reduction in parameter count.
[LINK]
http://arxiv.org/abs/2405.12462v4
[DATE]
2024-12-16 21:47:34+08:00
[CATEGORIES]
cs.LG
Generalized Bayesian deep reinforcement learning
[AUTHORS]
Shreya Sinha Roy, Richard G. Everitt, Christian P. Robert, Ritabrata Dutta
[ABSTRACT]
Bayesian reinforcement learning (BRL) is a method that merges principles from
Bayesian statistics and reinforcement learning to make optimal decisions in
uncertain environments. Similar to other model-based RL approaches, it involves
two key components: (1) Inferring the posterior distribution of the data
generating process (DGP) modeling the true environment and (2) policy learning
using the learned posterior. We propose to model the dynamics of the unknown
environment through deep generative models assuming Markov dependence. In the
absence of likelihood functions for these models, we train them by learning a
generalized predictive-sequential (or prequential) scoring rule (SR) posterior.
We use sequential Monte Carlo (SMC) samplers to draw samples from this
generalized Bayesian posterior distribution. In conjunction, to achieve
scalability in the high dimensional parameter space of the neural networks, we
use the gradient based Markov chain Monte Carlo (MCMC) kernels within SMC. To
justify the use of the prequential scoring rule posterior we prove a
Bernstein-von Mises type theorem. For policy learning, we propose expected
Thompson sampling (ETS) to learn the optimal policy by maximizing the expected
value function with respect to the posterior distribution. This improves upon
traditional Thompson sampling (TS) and its extensions which utilize only one
sample drawn from the posterior distribution. This improvement is studied both
theoretically and using simulation studies assuming discrete action and
state-space. Finally, we successfully extend our setup to a challenging problem
with continuous action space without theoretical guarantees.
[LINK]
http://arxiv.org/abs/2412.11743v1
[DATE]
2024-12-16 21:02:17+08:00
[CATEGORIES]
cs.LG
Solvation Free Energies from Neural Thermodynamic Integration
[AUTHORS]
Bálint Máté, François Fleuret, Tristan Bereau
[ABSTRACT]
We present a method for computing free-energy differences using thermodynamic
integration with a neural network potential that interpolates between two
target Hamiltonians. The interpolation is defined at the sample distribution
level, and the neural network potential is optimized to match the corresponding
equilibrium potential at every intermediate time-step. Once the interpolating
potentials and samples are well-aligned, the free-energy difference can be
estimated using (neural) thermodynamic integration. To target molecular
systems, we simultaneously couple Lennard-Jones and electrostatic interactions
and model the rigid-body rotation of molecules. We report accurate results for
several benchmark systems: a Lennard-Jones particle in a Lennard-Jones fluid,
as well as the insertion of both water and methane solutes in a water solvent
at atomistic resolution using a simple three-body neural-network potential.
[LINK]
http://arxiv.org/abs/2410.15815v2
[DATE]
2024-12-16 21:00:48+08:00
[CATEGORIES]
cs.LG
Efficiently Achieving Secure Model Training and Secure Aggregation to Ensure Bidirectional Privacy-Preservation in Federated Learning
[AUTHORS]
Xue Yang, Depan Peng, Yan Feng, Xiaohu Tang, Weijun Fang, Jun Shao
[ABSTRACT]
Bidirectional privacy-preservation federated learning is crucial as both
local gradients and the global model may leak privacy. However, only a few
works attempt to achieve it, and they often face challenges such as excessive
communication and computational overheads, or significant degradation of model
accuracy, which hinders their practical applications. In this paper, we design
an efficient and high-accuracy bidirectional privacy-preserving scheme for
federated learning to complete secure model training and secure aggregation. To
efficiently achieve bidirectional privacy, we design an efficient and
accuracy-lossless model perturbation method on the server side (called
$\mathbf{MP\_Server}$) that can be combined with local differential privacy
(LDP) to prevent clients from accessing the model, while ensuring that the
local gradients obtained on the server side satisfy LDP. Furthermore, to ensure
model accuracy, we customize a distributed differential privacy mechanism on
the client side (called $\mathbf{DDP\_Client}$). When combined with
$\mathbf{MP\_Server}$, it ensures LDP of the local gradients, while ensuring
that the aggregated result matches the accuracy of central differential privacy
(CDP). Extensive experiments demonstrate that our scheme significantly
outperforms state-of-the-art bidirectional privacy-preservation baselines
(SOTAs) in terms of computational cost, model accuracy, and defense ability
against privacy attacks. Particularly, given a target accuracy, the training time
of SOTAs is approximately $200$ times, or even over $1000$ times, longer than
that of our scheme. When the privacy budget is set relatively small, our scheme
incurs less than $6\%$ accuracy loss compared to the privacy-ignoring method,
while SOTAs suffer up to $20\%$ accuracy loss. Experimental results also show
that the defense capability of our scheme exceeds that of SOTAs.
[LINK]
http://arxiv.org/abs/2412.11737v1
[DATE]
2024-12-16 20:58:21+08:00
[CATEGORIES]
cs.LG
Dual Unscented Kalman Filter Architecture for Sensor Fusion in Water Networks Leak Localization
[AUTHORS]
Luis Romero-Ben, Paul Irofti, Florin Stoican, Vicenç Puig
[ABSTRACT]
Leakage in water systems results in significant daily water losses, degrading
service quality, increasing costs, and aggravating environmental problems. Most
leak localization methods rely solely on pressure data, missing valuable
information from other sensor types. This article proposes a hydraulic state
estimation methodology based on a dual Unscented Kalman Filter (UKF) approach,
which enhances the estimation of both nodal hydraulic heads, critical in
localization tasks, and pipe flows, useful for operational purposes. The
approach enables the fusion of different sensor types, such as pressure, flow
and demand meters. The strategy is evaluated in well-known open source case
studies, namely Modena and L-TOWN, showing improvements over other
state-of-the-art estimation approaches in terms of interpolation accuracy, as
well as more precise leak localization performance in L-TOWN.
[LINK]
http://arxiv.org/abs/2412.11687v1
[DATE]
2024-12-16 20:01:08+08:00
[CATEGORIES]
cs.LG
NEST: A Neuromodulated Small-world Hypergraph Trajectory Prediction Model for Autonomous Driving
[AUTHORS]
Chengyue Wang, Haicheng Liao, Bonan Wang, Yanchen Guan, Bin Rao, Ziyuan Pu, Zhiyong Cui, Chengzhong Xu, Zhenning Li
[ABSTRACT]
Accurate trajectory prediction is essential for the safety and efficiency of
autonomous driving. Traditional models often struggle with real-time
processing, capturing non-linearity and uncertainty in traffic environments,
efficiency in dense traffic, and modeling temporal dynamics of interactions. We
introduce NEST (Neuromodulated Small-world Hypergraph Trajectory Prediction), a
novel framework that integrates Small-world Networks and hypergraphs for
superior interaction modeling and prediction accuracy. This integration enables
the capture of both local and extended vehicle interactions, while the
Neuromodulator component adapts dynamically to changing traffic conditions. We
validate the NEST model on several real-world datasets, including nuScenes,
MoCAD, and HighD. The results consistently demonstrate that NEST outperforms
existing methods in various traffic scenarios, showcasing its exceptional
generalization capability, efficiency, and temporal foresight. Our
comprehensive evaluation illustrates that NEST significantly improves the
reliability and operational efficiency of autonomous driving systems, making it
a robust solution for trajectory prediction in complex traffic environments.
[COMMENTS]
Accepted by AAAI-25
[LINK]
http://arxiv.org/abs/2412.11682v1
[DATE]
2024-12-16 19:49:12+08:00
[CATEGORIES]
cs.LG
Infinite-dimensional next-generation reservoir computing
[AUTHORS]
Lyudmila Grigoryeva, Hannah Lim Jing Ting, Juan-Pablo Ortega
[ABSTRACT]
Next-generation reservoir computing (NG-RC) has attracted much attention due
to its excellent performance in spatio-temporal forecasting of complex systems
and its ease of implementation. This paper shows that NG-RC can be encoded as a
kernel ridge regression that makes training efficient and feasible even when
the space of chosen polynomial features is very large. Additionally, an
extension to an infinite number of covariates is possible, which makes the
methodology agnostic with respect to the lags into the past that are considered
as explanatory factors, as well as with respect to the number of polynomial
covariates, an important hyperparameter in traditional NG-RC. We show that this
approach has solid theoretical backing and good behavior based on kernel
universality properties previously established in the literature. Various
numerical illustrations show that these generalizations of NG-RC outperform the
traditional approach in several forecasting applications.
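
The kernel ridge regression view of NG-RC can be illustrated with a short sketch. It assumes a plain polynomial kernel over a fixed number of lags, which covers only the finite-dimensional case and not the infinite-dimensional extension the paper proposes. The helper names, the logistic-map test series, and the regularization value are illustrative assumptions.

```python
import numpy as np

def lagged(x, n_lags):
    """Row t holds x[t], ..., x[t + n_lags - 1]; the target for row t is x[t + n_lags]."""
    return np.stack([x[i:len(x) - n_lags + i] for i in range(n_lags)], axis=1)

def poly_kernel(A, B, degree=2, c=1.0):
    """Polynomial kernel; implicitly spans all monomials up to `degree`."""
    return (A @ B.T + c) ** degree

def fit_krr(X, y, lam=1e-6, degree=2):
    """Solve (K + lam * I) alpha = y without forming the monomial features."""
    K = poly_kernel(X, X, degree)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_krr(X_train, alpha, X_new, degree=2):
    return poly_kernel(X_new, X_train, degree) @ alpha

# Toy series: the logistic map, whose one-step rule is a degree-2 polynomial.
x = np.empty(300)
x[0] = 0.4
for t in range(299):
    x[t + 1] = 3.9 * x[t] * (1.0 - x[t])

X, y = lagged(x, 2), x[2:]
alpha = fit_krr(X[:200], y[:200])
pred = predict_krr(X[:200], alpha, X[200:])
err = np.max(np.abs(pred - y[200:]))   # one-step-ahead error on held-out data
```

The cost of training here depends on the number of samples rather than on the number of polynomial features, which is what makes a very large feature space tractable in the kernel formulation.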
[COMMENTS]
13 pages, 2 figures, 3 tables; corrected typos, added github link,
added acknowledgments
[LINK]
http://arxiv.org/abs/2412.09800v2
[DATE]
2024-12-16 19:48:07+08:00
[CATEGORIES]
cs.LG
EDformer: Embedded Decomposition Transformer for Interpretable Multivariate Time Series Predictions
[AUTHORS]
Sanjay Chakraborty, Ibrahim Delibasoglu, Fredrik Heintz
[ABSTRACT]
Time series forecasting is a crucial challenge with significant applications
in areas such as weather prediction, stock market analysis, and scientific
simulations. This paper introduces an embedded decomposed transformer,
‘EDformer’, for multivariate time series forecasting tasks. Without altering
the fundamental elements, we reuse the Transformer architecture and consider
the capable functions of its constituent parts in this work. EDformer first
decomposes the input multivariate signal into seasonal and trend components.
Next, the prominent multivariate seasonal component is reconstructed across the
reverse dimensions, followed by applying the attention mechanism and
feed-forward network in the encoder stage. In particular, the feed-forward
network is used for each variable frame to learn nonlinear representations,
while the attention mechanism uses the time points of individual seasonal
series embedded within variate frames to capture multivariate correlations.
The trend signal is then added via a projection to produce the final
forecast. The EDformer model obtains state-of-the-art prediction results in
terms of accuracy and efficiency on complex real-world time series datasets.
This paper also addresses model explainability techniques to provide insights
into how the model makes its predictions and why specific features or time
steps are important, enhancing the interpretability and trustworthiness of the
forecasting results.
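
The first step above, splitting a multivariate signal into seasonal and trend components, can be sketched as follows. The abstract does not state which decomposition operator is used, so the moving-average split below is an assumption, as are the function name, the window length, and the edge-padding scheme.

```python
import numpy as np

def decompose(x, window):
    """Split a series of shape (T, n_vars) into (seasonal, trend).

    The trend is a centered moving average with edge-value padding;
    the seasonal part is the residual, so seasonal + trend == x.
    """
    pad_l = (window - 1) // 2
    pad_r = window - 1 - pad_l
    padded = np.concatenate(
        [np.repeat(x[:1], pad_l, axis=0), x, np.repeat(x[-1:], pad_r, axis=0)],
        axis=0,
    )
    kernel = np.ones(window) / window
    trend = np.stack(
        [np.convolve(padded[:, j], kernel, mode="valid") for j in range(x.shape[1])],
        axis=1,
    )
    return x - trend, trend

t = np.arange(48)
x = np.stack([0.1 * t + np.sin(2 * np.pi * t / 12), np.full(48, 5.0)], axis=1)
seasonal, trend = decompose(x, window=12)
```

In a model of this kind the seasonal part would go to the attention and feed-forward blocks, while the trend part is handled separately and added back at the end.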
[LINK]
http://arxiv.org/abs/2412.12227v1
[DATE]
2024-12-16 19:13:57+08:00
[CATEGORIES]
cs.LG
Interpretable, multi-dimensional Evaluation Framework for Causal Discovery from observational i.i.d. Data
[AUTHORS]
Georg Velev, Stefan Lessmann
[ABSTRACT]
Nonlinear causal discovery from observational data imposes strict
identifiability assumptions on the formulation of structural equations utilized
in the data generating process. The evaluation of structure learning methods
under assumption violations requires a rigorous and interpretable approach,
which quantifies both the structural similarity of the estimation with the
ground truth and the capacity of the discovered graphs to be used for causal
inference. Motivated by the lack of a unified performance assessment framework,
we introduce an interpretable, six-dimensional evaluation metric, i.e.,
distance to optimal solution (DOS), which is specifically tailored to the field
of causal discovery. Furthermore, this is the first research to assess the
performance of structure learning algorithms from seven different families on
an increasing percentage of non-identifiable, nonlinear causal patterns, inspired
by real-world processes. Our large-scale simulation study, which incorporates
seven experimental factors, shows that besides causal order-based methods,
amortized causal discovery delivers results with comparatively high proximity
to the optimal solution.
[LINK]
http://arxiv.org/abs/2409.19377v2
[DATE]
2024-12-16 19:09:58+08:00
[CATEGORIES]
cs.LG
Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for large-scale optimization
[AUTHORS]
Corrado Coppola, Lorenzo Papa, Irene Amerini, Laura Palagi
[ABSTRACT]
Adaptive gradient methods have been increasingly adopted by the deep learning
community due to their fast convergence and reduced sensitivity to
hyper-parameters. However, these methods come with limitations, such as
increased memory requirements for elements like moving averages and a poorly
understood convergence theory. To overcome these challenges, we introduce
F-CMA, a Fast-Controlled Mini-batch Algorithm with a random reshuffling method
featuring a sufficient decrease condition and a line-search procedure to ensure
loss reduction per epoch, along with its deterministic proof of global
convergence to a stationary point. To evaluate the F-CMA, we integrate it into
conventional training protocols for classification tasks involving both
convolutional neural networks and vision transformer models, allowing for a
direct comparison with popular optimizers. Computational tests show significant
improvements, including a decrease in the overall training time by up to 68%,
an increase in per-epoch efficiency by up to 20%, and in model accuracy by up
to 5%.
[COMMENTS]
There is an error in the literature review, in section 1. In
particular, we noticed that there is a wrong citation, the [65], which has
been erroneously associated with another author’s claims
[LINK]
http://arxiv.org/abs/2411.15795v3
[DATE]
2024-12-16 19:06:00+08:00
[CATEGORIES]
cs.LG
BA-BFL: Barycentric Aggregation for Bayesian Federated Learning
[AUTHORS]
Nour Jamoussi, Giuseppe Serra, Photios A. Stavrou, Marios Kountouris
[ABSTRACT]
In this work, we study the problem of aggregation in the context of Bayesian
Federated Learning (BFL). Using an information geometric perspective, we
interpret the BFL aggregation step as finding the barycenter of the trained
posteriors for a pre-specified divergence metric. We study the barycenter
problem for the parametric family of $\alpha$-divergences and, focusing on the
standard case of independent and Gaussian distributed parameters, we recover
the closed-form solution of the reverse Kullback-Leibler barycenter and develop
the analytical form of the squared Wasserstein-2 barycenter. Considering a
non-IID setup, where clients possess heterogeneous data, we analyze the
performance of the developed algorithms against state-of-the-art (SOTA)
Bayesian aggregation methods in terms of accuracy, uncertainty quantification
(UQ), model calibration (MC), and fairness. Finally, we extend our analysis to
the framework of Hybrid Bayesian Deep Learning (HBDL), where we study how the
number of Bayesian layers in the architecture impacts the considered
performance metrics. Our experimental results show that the proposed
methodology presents comparable performance with the SOTA while offering a
geometric interpretation of the aggregation phase.
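
For independent Gaussian parameters, both barycenters mentioned above have simple closed forms, sketched below. The objective written in `kl_barycenter` (minimizing the weighted sum of KL(q || p_k) over q) is stated explicitly because naming conventions for "forward" and "reverse" KL differ; whether it matches the paper's convention is an assumption. Function names and the example posteriors are illustrative.

```python
import numpy as np

def w2_barycenter(means, stds, weights):
    """Wasserstein-2 barycenter of diagonal Gaussians.

    For independent Gaussian coordinates, both the mean and the
    standard deviation are weighted averages of the client values.
    """
    w = np.asarray(weights)[:, None]
    return (w * means).sum(axis=0), (w * stds).sum(axis=0)

def kl_barycenter(means, stds, weights):
    """argmin over q of sum_k w_k KL(q || p_k), with the weights summing to 1.

    The minimizer is the weighted geometric mean of the p_k, a Gaussian
    whose precision is the weighted sum of the client precisions.
    """
    w = np.asarray(weights)[:, None]
    prec = (w / stds**2).sum(axis=0)
    mean = (w * means / stds**2).sum(axis=0) / prec
    return mean, 1.0 / np.sqrt(prec)

# Two hypothetical client posteriors over a single parameter.
means = np.array([[0.0], [2.0]])
stds = np.array([[1.0], [3.0]])
weights = [0.5, 0.5]

w2_mean, w2_std = w2_barycenter(means, stds, weights)
kl_mean, kl_std = kl_barycenter(means, stds, weights)
```

The KL barycenter pulls the aggregate toward the more confident client (smaller standard deviation), whereas the Wasserstein-2 barycenter interpolates means and spreads linearly.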
[LINK]
http://arxiv.org/abs/2412.11646v1
[DATE]
2024-12-16 18:47:05+08:00
[CATEGORIES]
cs.LG
HyperPg – Prototypical Gaussians on the Hypersphere for Interpretable Deep Learning
[AUTHORS]
Maximilian Xiling Li, Korbinian Franz Rudolf, Nils Blank, Rudolf Lioutikov
[ABSTRACT]
Prototype Learning methods provide an interpretable alternative to black-box
deep learning models. Approaches such as ProtoPNet learn which parts of a test
image “look like” known prototypical parts from training images, combining
predictive power with the inherent interpretability of case-based reasoning.
However, existing approaches have two main drawbacks: A) They rely solely on
deterministic similarity scores without statistical confidence. B) The
prototypes are learned in a black-box manner without human input. This work
introduces HyperPg, a new prototype representation leveraging Gaussian
distributions on a hypersphere in latent space, with learnable mean and
variance. HyperPg prototypes adapt to the spread of clusters in the latent
space and output likelihood scores. The new architecture, HyperPgNet, leverages
HyperPg to learn prototypes aligned with human concepts from pixel-level
annotations. Consequently, each prototype represents a specific concept such as
color, image texture, or part of the image subject. A concept extraction
pipeline built on foundation models provides pixel-level annotations,
significantly reducing human labeling effort. Experiments on CUB-200-2011 and
Stanford Cars datasets demonstrate that HyperPgNet outperforms other prototype
learning architectures while using fewer parameters and training steps.
Additionally, the concept-aligned HyperPg prototypes are learned transparently,
enhancing model interpretability.
[LINK]
http://arxiv.org/abs/2410.08925v2
[DATE]
2024-12-16 18:28:02+08:00
[CATEGORIES]
cs.LG
A Mapper Algorithm with implicit intervals and its optimization
[AUTHORS]
Yuyang Tao, Shufei Ge
[ABSTRACT]
The Mapper algorithm is an essential tool for visualizing complex, high
dimensional data in topological data analysis (TDA) and has been widely used in
biomedical research. It outputs a combinatorial graph whose structure implies
the shape of the data. However, the need for manual parameter tuning and fixed
intervals, along with fixed overlapping ratios, may impede the performance of
the standard Mapper algorithm. Variants of the standard Mapper algorithms have
been developed to address these limitations, yet most of them still require
manual tuning of parameters. Additionally, many of these variants, including
the standard version found in the literature, were built within a deterministic
framework and overlooked the uncertainty inherent in the data. To relax these
limitations, in this work, we introduce a novel framework that implicitly
represents intervals through a hidden assignment matrix, enabling automatic
parameter optimization via stochastic gradient descent. We
develop a soft Mapper framework based on a Gaussian mixture model (GMM) for
flexible and implicit interval construction. We further illustrate the
robustness of the soft Mapper algorithm by introducing the Mapper graph mode as
a point estimation for the output graph. Moreover, a stochastic gradient
descent algorithm with a specific topological loss function is proposed for
optimizing parameters in the model. Both simulation and application studies
demonstrate its effectiveness in capturing the underlying topological
structures. In addition, the application to an RNA expression dataset obtained
from the Mount Sinai/JJ Peters VA Medical Center Brain Bank (MSBB) successfully
identifies a distinct subgroup of Alzheimer’s Disease.
[LINK]
http://arxiv.org/abs/2412.11631v1
[DATE]
2024-12-16 18:16:54+08:00
[CATEGORIES]
cs.LG
QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models
[AUTHORS]
Changhai Zhou, Yuhua Zhou, Shijie Han, Qian Qiao, Hongguang Li
[ABSTRACT]
The rise of large language models (LLMs) has significantly advanced various
natural language processing (NLP) tasks. However, the resource demands of these
models pose substantial challenges. Structured pruning is an effective approach
to reducing model size, but it often results in significant accuracy
degradation, necessitating parameter updates to adapt. Unfortunately, such
fine-tuning requires substantial memory, which limits its applicability. To
address these challenges, we introduce quantization into the structured pruning
framework to reduce memory consumption during both fine-tuning and inference.
However, the combined errors from pruning and quantization increase the
difficulty of fine-tuning, requiring a more refined quantization scheme. To
this end, we propose QPruner, a novel framework that employs structured pruning
to reduce model size, followed by a layer-wise mixed-precision quantization
scheme. Quantization precisions are assigned to each layer based on their
importance to the target task, and Bayesian optimization is employed to refine
precision allocation strategies, ensuring a balance between model accuracy and
memory efficiency. Extensive experiments on benchmark datasets demonstrate that
QPruner significantly outperforms existing methods in memory savings while
maintaining or improving model performance.
[LINK]
http://arxiv.org/abs/2412.11629v1
[DATE]
2024-12-16 18:14:01+08:00
[CATEGORIES]
cs.LG
PriPHiT: Privacy-Preserving Hierarchical Training of Deep Neural Networks
[AUTHORS]
Yamin Sepehri, Pedram Pad, Pascal Frossard, L. Andrea Dunbar
[ABSTRACT]
The training phase of deep neural networks requires substantial resources and
as such is often performed on cloud servers. However, this raises privacy
concerns when the training dataset contains sensitive content, e.g., facial or
medical images. In this work, we propose a method to perform the training phase
of a deep learning model on both an edge device and a cloud server that
prevents sensitive content being transmitted to the cloud while retaining the
desired information. The proposed privacy-preserving method uses adversarial
early exits to suppress the sensitive content at the edge and transmits the
task-relevant information to the cloud. This approach incorporates noise
addition during the training phase to provide a differential privacy guarantee.
We extensively test our method on different facial and medical datasets with
diverse attributes using various deep learning architectures, showcasing its
outstanding performance. We also demonstrate the effectiveness of privacy
preservation through successful defenses against different white-box, deep and
GAN-based reconstruction attacks. This approach is designed for
resource-constrained edge devices, ensuring minimal memory usage and
computational overhead.
[COMMENTS]
21 pages, 19 figures, 11 tables
[LINK]
http://arxiv.org/abs/2408.05092v2
[DATE]
2024-12-16 18:10:10+08:00
[CATEGORIES]
cs.LG
Application of Quantum Pre-Processing Filter for Binary Image Classification with Small Samples
[AUTHORS]
Farina Riaz, Shahab Abdulla, Hajime Suzuki, Srinjoy Ganguly, Ravinesh C. Deo, Susan Hopkins
[ABSTRACT]
Over the past few years, there has been significant interest in Quantum
Machine Learning (QML) among researchers, as it has the potential to transform
the field of machine learning. Several models that exploit the properties of
quantum mechanics have been developed for practical applications. In this
study, we investigated the application of our previously proposed quantum
pre-processing filter (QPF) to binary image classification. We evaluated the
QPF on four datasets: MNIST (handwritten digits), EMNIST (handwritten digits
and alphabets), CIFAR-10 (photographic images) and GTSRB (real-life traffic
sign images). Similar to our previous multi-class classification results, the
application of QPF improved the binary image classification accuracy using a
neural network against MNIST, EMNIST, and CIFAR-10 from 98.9% to 99.2%, 97.8%
to 98.3%, and 71.2% to 76.1%, respectively, but degraded it against GTSRB from
93.5% to 92.0%. We then applied QPF in cases using a smaller number of training
and testing samples, i.e. 80 and 20 samples per class, respectively. In order
to derive statistically stable results, we ran the experiment for 100 trials,
each with randomly chosen training and testing samples, and averaged
the results. The results showed that the application of QPF did not improve the
image classification accuracy against MNIST and EMNIST but improved it against
CIFAR-10 and GTSRB from 65.8% to 67.2% and 90.5% to 91.8%, respectively.
Future work will further investigate the potential of QPF and assess the
scalability of the proposed approach to larger and more complex datasets.
[COMMENTS]
This paper is accepted by Journal of Data Science and Intelligent
Systems (JDSIS)
[LINK]
http://arxiv.org/abs/2308.14930v2
[DATE]
2024-12-16 18:04:38+08:00
[CATEGORIES]
cs.LG
EvoLlama: Enhancing LLMs’ Understanding of Proteins via Multimodal Structure and Sequence Representations
[AUTHORS]
Nuowei Liu, Changzhi Sun, Tao Ji, Junfeng Tian, Jianxin Tang, Yuanbin Wu, Man Lan
[ABSTRACT]
Current Large Language Models (LLMs) for understanding proteins primarily
treat amino acid sequences as a text modality. Meanwhile, Protein Language
Models (PLMs), such as ESM-2, have learned massive sequential evolutionary
knowledge from the universe of natural protein sequences. Furthermore,
structure-based encoders like ProteinMPNN learn the structural information of
proteins through Graph Neural Networks. However, whether the incorporation of
protein encoders can enhance the protein understanding of LLMs has not been
explored. To bridge this gap, we propose EvoLlama, a multimodal framework that
connects a structure-based encoder, a sequence-based protein encoder and an LLM
for protein understanding. EvoLlama consists of a ProteinMPNN structure
encoder, an ESM-2 protein sequence encoder, a multimodal projector to align
protein and text representations and a Llama-3 text decoder. To train EvoLlama,
we fine-tune it on protein-oriented instructions and protein property
prediction datasets verbalized via natural language instruction templates. Our
experiments show that EvoLlama’s protein understanding capabilities have been
significantly enhanced, outperforming other fine-tuned protein-oriented LLMs in
zero-shot settings by an average of 1%-8% and surpassing the state-of-the-art
baseline with supervised fine-tuning by an average of 6%. On protein property
prediction datasets, our approach achieves promising results that are
competitive with state-of-the-art task-specific baselines. We will release our
code in a future version.
[LINK]
http://arxiv.org/abs/2412.11618v1
[DATE]
2024-12-16 18:01:33+08:00
[CATEGORIES]
cs.LG
Generate Any Scene: Evaluating and Improving Text-to-Vision Generation with Scene Graph Programming
[AUTHORS]
Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna
[ABSTRACT]
DALL-E and Sora have gained attention by producing implausible images, such
as “astronauts riding a horse in space.” Despite the proliferation of
text-to-vision models that have inundated the internet with synthetic visuals,
from images to 3D assets, current benchmarks predominantly evaluate these
models on real-world scenes paired with captions. We introduce Generate Any
Scene, a framework that systematically enumerates scene graphs representing a
vast array of visual scenes, spanning realistic to imaginative compositions.
Generate Any Scene leverages ‘scene graph programming’, a method for
dynamically constructing scene graphs of varying complexity from a structured
taxonomy of visual elements. This taxonomy includes numerous objects,
attributes, and relations, enabling the synthesis of an almost infinite variety
of scene graphs. Using these structured representations, Generate Any Scene
translates each scene graph into a caption, enabling scalable evaluation of
text-to-vision models through standard metrics. We conduct extensive
evaluations across multiple text-to-image, text-to-video, and text-to-3D
models, presenting key findings on model performance. We find that DiT-backbone
text-to-image models align more closely with input captions than UNet-backbone
models. Text-to-video models struggle with balancing dynamics and consistency,
while both text-to-video and text-to-3D models show notable gaps in human
preference alignment. We demonstrate the effectiveness of Generate Any Scene by
conducting three practical applications leveraging captions generated by
Generate Any Scene: 1) a self-improving framework where models iteratively
enhance their performance using generated data, 2) a distillation process to
transfer specific strengths from proprietary models to open-source
counterparts, and 3) improvements in content moderation by identifying and
generating challenging synthetic data.
[LINK]
http://arxiv.org/abs/2412.08221v2
[DATE]
2024-12-16 17:54:46+08:00
[CATEGORIES]
cs.LG
Towards Adversarial Robustness of Model-Level Mixture-of-Experts Architectures for Semantic Segmentation
[AUTHORS]
Svetlana Pavlitska, Enrico Eisen, J. Marius Zöllner
[ABSTRACT]
Vulnerability to adversarial attacks is a well-known deficiency of deep
neural networks. Larger networks are generally more robust, and ensembling is
one method to increase adversarial robustness: each model’s weaknesses are
compensated by the strengths of others. While an ensemble uses a deterministic
rule to combine model outputs, a mixture of experts (MoE) includes an
additional learnable gating component that predicts weights for the outputs of
the expert models, thus determining their contributions to the final
prediction. MoEs have been shown to outperform ensembles on specific tasks, yet
their susceptibility to adversarial attacks has not yet been studied. In this
work, we evaluate the adversarial vulnerability of MoEs for semantic
segmentation of urban and highway traffic scenes. We show that MoEs are, in
most cases, more robust to per-instance and universal white-box adversarial
attacks and can better withstand transfer attacks. Our code is available at
https://github.com/KASTEL-MobilityLab/mixtures-of-experts/.
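
The difference between an ensemble and a mixture of experts described above can be made concrete with a small sketch. The gating network itself is not modeled: its logits are supplied as a hypothetical input, and the per-pixel gating granularity and the array shapes are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble(expert_probs):
    """Deterministic rule: uniform average over experts.

    expert_probs has shape (n_experts, n_pixels, n_classes).
    """
    return expert_probs.mean(axis=0)

def mixture(expert_probs, gate_logits):
    """Gated combination; gate_logits has shape (n_pixels, n_experts)."""
    gates = softmax(gate_logits, axis=1)                 # weights sum to 1 per pixel
    return np.einsum("pe,epc->pc", gates, expert_probs)  # weighted sum over experts

rng = np.random.default_rng(0)
probs = softmax(rng.normal(size=(3, 4, 5)))   # 3 experts, 4 pixels, 5 classes

uniform = mixture(probs, np.zeros((4, 3)))    # equal gates reduce to the ensemble
peaked_logits = np.zeros((4, 3))
peaked_logits[:, 0] = 50.0                    # gate strongly prefers expert 0
peaked = mixture(probs, peaked_logits)
```

With uniform gates the mixture coincides with the averaging ensemble; the learned gate is the extra component whose behavior under attack the paper evaluates.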
[COMMENTS]
Accepted for publication at ICMLA 2024
[LINK]
http://arxiv.org/abs/2412.11608v1
[DATE]
2024-12-16 17:49:59+08:00
[CATEGORIES]
cs.LG
Application of machine learning in grain-related clustering of Laue spots in a polycrystalline energy dispersive Laue pattern
[AUTHORS]
Amir Tosson, Mohammad Shokr, Mahmoud Al Humaidi, Eduard Mikayelyan, Christian Gutt, Ulrich Pietsch
[ABSTRACT]
We address the identification of grain-corresponding Laue reflections in
energy dispersive Laue diffraction (EDLD) experiments by formulating it as a
clustering problem solvable through unsupervised machine learning (ML). To
achieve reliable and efficient identification of grains in a Laue pattern, we
employ a combination of clustering algorithms, namely hierarchical clustering
(HC) and K-means. These algorithms allow us to group together similar Laue
reflections, revealing the underlying grain structure in the diffraction
pattern. Additionally, we utilise the elbow method to determine the optimal
number of clusters, ensuring accurate results. To evaluate the performance of
our proposed method, we conducted experiments using both simulated and
experimental datasets obtained from nickel wires. The simulated datasets were
generated to mimic the characteristics of real-world EDLD experiments, while
the experimental datasets were obtained from actual measurements.
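
The combination of K-means and the elbow method mentioned in this abstract can be sketched as below. This is only an illustration under stated assumptions: the 2D points are invented stand-ins for Laue-spot features, the farthest-point initialisation is chosen for determinism, and the "elbow" is picked automatically as the largest second difference of the inertia curve, which is one common heuristic rather than the authors' procedure. Hierarchical clustering is not shown.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic initialisation: repeatedly take the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans_inertia(points, k, iters=50):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[idx].append(p)
        # Keep the old center if a cluster happens to be empty.
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def elbow_k(points, k_max):
    """Pick k with the largest second difference of the inertia curve."""
    inertia = [kmeans_inertia(points, k) for k in range(1, k_max + 1)]
    best_k, best_bend = None, float("-inf")
    for k in range(2, k_max):
        bend = inertia[k - 2] - 2 * inertia[k - 1] + inertia[k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k, inertia


# Hypothetical feature vectors forming three tight groups ("grains").
points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
    (10.0, 0.0), (10.2, 0.1), (9.9, 0.2),
    (5.0, 9.0), (5.1, 9.2), (4.9, 8.9),
]
best_k, inertia = elbow_k(points, k_max=6)
```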
[LINK]
http://arxiv.org/abs/2412.12224v1
[DATE]
2024-12-16 17:28:17+08:00
[CATEGORIES]
cs.LG
A Method for Auto-Differentiation of the Voronoi Tessellation
[AUTHORS]
Sergei Shumilin, Alexander Ryabov, Serguei Barannikov, Evgeny Burnaev, Vladimir Vanovskii
[ABSTRACT]
Voronoi tessellation, also known as Voronoi diagram, is an important
computational geometry technique that has applications in various scientific
disciplines. It involves dividing a given space into regions based on the
proximity to a set of points. Autodifferentiation is a powerful tool for
solving optimization tasks. It relies on constructing a
computational graph that allows gradients to be computed using the
backpropagation algorithm. However, the Voronoi tessellation often remains the
only non-differentiable part of a pipeline, prohibiting end-to-end
differentiation. We present a method for autodifferentiation of the 2D Voronoi
tessellation.
The method allows one to construct the Voronoi tessellation and pass gradients,
making the construction end-to-end differentiable. We provide the
implementation details and present several important applications. To the best
of our knowledge, this is the first autodifferentiable realization of the
Voronoi tessellation providing the full set of Voronoi geometrical parameters in a
differentiable way.
[LINK]
http://arxiv.org/abs/2312.16192v3
[DATE]
2024-12-16 17:23:47+08:00
[CATEGORIES]
cs.LG
A Novel Approach to Solving Goal-Achieving Problems for Board Games
[AUTHORS]
Chung-Chin Shih, Ti-Rong Wu, Ting Han Wei, I-Chen Wu
[ABSTRACT]
Goal-achieving problems are puzzles that set up a specific situation with a
clear objective. An example that is well-studied is the category of
life-and-death (L&D) problems for Go, which helps players hone their skill of
identifying region safety. Many previous methods like lambda search try null
moves first, then derive so-called relevance zones (RZs), outside of which the
opponent does not need to search. This paper first proposes a novel RZ-based
approach, called the RZ-Based Search (RZS), to solving L&D problems for Go. RZS
tries moves before determining whether they are null moves post-hoc. This means
we do not need to rely on null move heuristics, resulting in a more elegant
algorithm, so that it can also be seamlessly incorporated into AlphaZero’s
super-human level play in our solver. To repurpose AlphaZero for solving, we
also propose a new training method called Faster to Life (FTL), which modifies
AlphaZero to entice it to win more quickly. We use RZS and FTL to solve L&D
problems in Go, solving 68 of 106 problems from a professional L&D
book, whereas a previous program solves only 11. Finally, we discuss how the
approach is generic in the sense that RZS is applicable to solving many other
goal-achieving problems for board games.
[COMMENTS]
The main text is the final version to AAAI-22
[LINK]
http://arxiv.org/abs/2112.02563v2
[DATE]
2024-12-16 17:23:46+08:00
[CATEGORIES]
cs.LG
Estimating the history of a random recursive tree
[AUTHORS]
Simon Briend, Christophe Giraud, Gábor Lugosi, Déborah Sulem
[ABSTRACT]
This paper studies the problem of estimating the order of arrival of the
vertices in a random recursive tree. Specifically, we study two fundamental
models: the uniform attachment model and the linear preferential attachment
model. We propose an order estimator based on the Jordan centrality measure and
define a family of risk measures to quantify the quality of the ordering
procedure. Moreover, we establish a minimax lower bound for this problem, and
prove that the proposed estimator is nearly optimal. Finally, we numerically
demonstrate that the proposed estimator outperforms degree-based and spectral
ordering procedures.
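
As a rough illustration of an order estimator built on Jordan centrality, the sketch below computes, for each vertex of a tree, the size of the largest component left after removing that vertex, and then ranks vertices by increasing value. The ranking rule and the tie-breaking by vertex label are assumptions made for this example; the paper's exact estimator and risk measures are not reproduced here.

```python
def jordan_centrality(n, edges):
    """Return psi[v] = size of the largest component of the tree minus v."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # Iterative DFS from vertex 0 to get parents and a preorder.
    parent = [-1] * n
    seen = [False] * n
    seen[0] = True
    preorder = []
    stack = [0]
    while stack:
        u = stack.pop()
        preorder.append(u)
        for w in adj[u]:
            if not seen[w]:
                seen[w] = True
                parent[w] = u
                stack.append(w)

    # Subtree sizes: descendants come before ancestors in reversed preorder.
    size = [1] * n
    for u in reversed(preorder):
        if parent[u] >= 0:
            size[parent[u]] += size[u]

    # Component on the parent's side has n - size[v] vertices;
    # each child c contributes a component of size[c].
    psi = [n - size[v] for v in range(n)]
    for v in range(n):
        if parent[v] >= 0:
            psi[parent[v]] = max(psi[parent[v]], size[v])
    return psi


def estimate_order(n, edges):
    """Rank vertices from most to least central (ties broken by label)."""
    psi = jordan_centrality(n, edges)
    return sorted(range(n), key=lambda v: (psi[v], v))


# A path 0-1-2-3-4: the middle vertex is the most central.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
psi_path = jordan_centrality(5, path_edges)
order_path = estimate_order(5, path_edges)
```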
[LINK]
http://arxiv.org/abs/2403.09755v3
[DATE]
2024-12-16 17:20:42+08:00
[CATEGORIES]
cs.LG
Simulation Based Bayesian Optimization
[AUTHORS]
Roi Naveiro, Becky Tang
[ABSTRACT]
Bayesian Optimization (BO) is a powerful method for optimizing black-box
functions by combining prior knowledge with ongoing function evaluations. BO
constructs a probabilistic surrogate model of the objective function given the
covariates, which is in turn used to inform the selection of future evaluation
points through an acquisition function. For smooth continuous search spaces,
Gaussian Processes (GPs) are commonly used as the surrogate model as they offer
analytical access to posterior predictive distributions, thus facilitating the
computation and optimization of acquisition functions. However, in complex
scenarios involving optimization over categorical or mixed covariate spaces,
GPs may not be ideal. This paper introduces Simulation Based Bayesian
Optimization (SBBO) as a novel approach to optimizing acquisition functions
that only requires sampling-based access to posterior predictive distributions.
SBBO allows the use of surrogate probabilistic models tailored for
combinatorial spaces with discrete variables. Any Bayesian model in which
posterior inference is carried out through Markov chain Monte Carlo can be
selected as the surrogate model in SBBO. We demonstrate empirically the
effectiveness of SBBO using various choices of surrogate models in applications
involving combinatorial optimization.
[LINK]
http://arxiv.org/abs/2401.10811v2
[DATE]
2024-12-16 16:55:12+08:00
[CATEGORIES]
cs.LG
The dark side of the forces: assessing non-conservative force models for atomistic machine learning
[AUTHORS]
Filippo Bigi, Marcel Langer, Michele Ceriotti
[ABSTRACT]
The use of machine learning to estimate the energy of a group of atoms, and
the forces that drive them to more stable configurations, has revolutionized
the fields of computational chemistry and materials discovery. In this domain,
rigorous enforcement of symmetry and conservation laws has traditionally been
considered essential. For this reason, interatomic forces are usually computed
as the derivatives of the potential energy, ensuring energy conservation.
Several recent works have questioned this physically-constrained approach,
suggesting that using the forces as explicit learning targets yields a better
trade-off between accuracy and computational efficiency - and that energy
conservation can be learned during training. The present work investigates the
applicability of such non-conservative models in microscopic simulations. We
identify and demonstrate several fundamental issues, from ill-defined
convergence of geometry optimization to instability in various types of
molecular dynamics. Contrary to the case of rotational symmetry, lack of energy
conservation is hard to learn, control, and correct. The best approach to
exploit the acceleration afforded by direct force evaluation might be to use it
in tandem with a conservative model, reducing - rather than eliminating - the
additional cost of backpropagation, but avoiding most of the pathological
behavior associated with non-conservative forces.
[LINK]
http://arxiv.org/abs/2412.11569v1
[DATE]
2024-12-16 16:55:03+08:00
[CATEGORIES]
cs.LG
Towards Efficient Training and Evaluation of Robust Models against $l_0$ Bounded Adversarial Perturbations
[AUTHORS]
Xuyang Zhong, Yixiao Huang, Chen Liu
[ABSTRACT]
This work studies sparse adversarial perturbations bounded by $l_0$ norm. We
propose a white-box PGD-like attack method named sparse-PGD to effectively and
efficiently generate such perturbations. Furthermore, we combine sparse-PGD
with a black-box attack to comprehensively and more reliably evaluate the
models’ robustness against $l_0$ bounded adversarial perturbations. Moreover,
the efficiency of sparse-PGD enables us to conduct adversarial training to
build robust models against sparse perturbations. Extensive experiments
demonstrate that our proposed attack algorithm exhibits strong performance in
different scenarios. More importantly, compared with other robust models, our
adversarially trained model demonstrates state-of-the-art robustness against
various sparse attacks. Codes are available at
https://github.com/CityU-MLO/sPGD.
[COMMENTS]
Accepted by ICML2024
[LINK]
http://arxiv.org/abs/2405.05075v2
[DATE]
2024-12-16 16:50:59+08:00
[CATEGORIES]
cs.LG
Learning Massive-scale Partial Correlation Networks in Clinical Multi-omics Studies with HP-ACCORD
[AUTHORS]
Sungdong Lee, Joshua Bang, Youngrae Kim, Hyungwon Choi, Sang-Yun Oh, Joong-Ho Won
[ABSTRACT]
Graphical model estimation from modern multi-omics data requires a balance
between statistical estimation performance and computational scalability. We
introduce a novel pseudolikelihood-based graphical model framework that
reparameterizes the target precision matrix while preserving sparsity pattern
and estimates it by minimizing an $\ell_1$-penalized empirical risk based on a
new loss function. The proposed estimator maintains estimation and selection
consistency in various metrics under high-dimensional assumptions. The
associated optimization problem allows for a provably fast computation
algorithm using a novel operator-splitting approach and communication-avoiding
distributed matrix multiplication. A high-performance computing implementation
of our framework was tested in simulated data with up to one million variables
demonstrating complex dependency structures akin to biological networks.
Leveraging this scalability, we estimated a partial correlation network from a
dual-omic liver cancer data set. The co-expression network estimated from the
ultrahigh-dimensional data showed superior specificity in prioritizing key
transcription factors and co-activators by excluding the impact of epigenomic
regulation, demonstrating the value of computational scalability in multi-omic
data analysis.
[LINK]
http://arxiv.org/abs/2412.11554v1
[DATE]
2024-12-16 16:38:02+08:00
[CATEGORIES]
cs.LG
Integrated Sensing and Communications for Low-Altitude Economy: A Deep Reinforcement Learning Approach
[AUTHORS]
Xiaowen Ye, Yuyi Mao, Xianghao Yu, Shu Sun, Liqun Fu, Jie Xu
[ABSTRACT]
This paper studies an integrated sensing and communications (ISAC) system for
low-altitude economy (LAE), where a ground base station (GBS) provides
communication and navigation services for authorized unmanned aerial vehicles
(UAVs), while sensing the low-altitude airspace to monitor the unauthorized
mobile target. The expected communication sum-rate over a given flight period
is maximized by jointly optimizing the beamforming at the GBS and UAVs’
trajectories, subject to the constraints on the average signal-to-noise ratio
requirement for sensing, the flight mission and collision avoidance of UAVs, as
well as the maximum transmit power at the GBS. Typically, this is a sequential
decision-making problem with the given flight mission. Thus, we transform it into
a specific Markov decision process (MDP) model called an episode task. Based on
this modeling, we propose a novel LAE-oriented ISAC scheme, referred to as Deep
LAE-ISAC (DeepLSC), by leveraging the deep reinforcement learning (DRL)
technique. In DeepLSC, a reward function and a new action selection policy
termed constrained noise-exploration policy are judiciously designed to fulfill
various constraints. To enable efficient learning in episode tasks, we develop
a hierarchical experience replay mechanism, where the gist is to employ all
experiences generated within each episode to jointly train the neural network.
Besides, to enhance the convergence speed of DeepLSC, a symmetric experience
augmentation mechanism, which simultaneously permutes the indexes of all
variables to enrich available experience sets, is proposed. Simulation results
demonstrate that compared with benchmarks, DeepLSC yields a higher sum-rate
while meeting the preset constraints, achieves faster convergence, and is more
robust against different settings.
[COMMENTS]
submitted for an IEEE publication
[LINK]
http://arxiv.org/abs/2412.04074v2
[DATE]
2024-12-16 16:36:07+08:00
[CATEGORIES]
cs.LG
Modeling Latent Non-Linear Dynamical System over Time Series
[AUTHORS]
Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai
[ABSTRACT]
We study the problem of modeling a non-linear dynamical system when given a
time series by deriving equations directly from the data. Despite the fact that
time series data are given as input, models for dynamics and estimation
algorithms that incorporate long-term temporal dependencies are largely absent
from existing studies. In this paper, we introduce a latent state to allow
time-dependent modeling and formulate this problem as a dynamics estimation
problem in latent states. We face multiple technical challenges, including (1)
modeling latent non-linear dynamics and (2) solving circular dependencies
caused by the presence of latent states. To tackle these challenging problems,
we propose a new method, Latent Non-Linear equation modeling (LaNoLem), that
can model a latent non-linear dynamical system and a novel alternating
minimization algorithm for effectively estimating latent states and model
parameters. In addition, we introduce criteria to control model complexity
without human intervention. Compared with the state-of-the-art model, LaNoLem
achieves competitive performance for estimating dynamics while outperforming
other methods in prediction.
[COMMENTS]
accepted at AAAI’25
[LINK]
http://arxiv.org/abs/2412.08114v3
[DATE]
2024-12-16 16:25:18+08:00
[CATEGORIES]
cs.LG
Probability-Informed Machine Learning
[AUTHORS]
Mohsen Rashki
[ABSTRACT]
Machine learning (ML) has emerged as a powerful tool for tackling complex
regression and classification tasks, yet its success often hinges on the
quality of training data. This study introduces a novel ML paradigm inspired by
domain knowledge of the structure of output function, akin to physics-informed
ML, but rooted in probabilistic principles rather than physical laws. The
proposed approach integrates the probabilistic structure of the target variable
(such as its cumulative distribution function) into the training process. This
probabilistic information is obtained from historical data or estimated using
structural reliability methods during experimental design. By embedding
domain-specific probabilistic insights into the learning process, the method
enhances model accuracy and mitigates risks of overfitting and underfitting.
Applications in regression, image denoising, and classification demonstrate the
effectiveness of the approach in addressing real-world problems.
[LINK]
http://arxiv.org/abs/2412.11526v1
[DATE]
2024-12-16 16:01:22+08:00
[CATEGORIES]
cs.LG
On the Ability of Deep Networks to Learn Symmetries from Data: A Neural Kernel Theory
[AUTHORS]
Andrea Perin, Stephane Deny
[ABSTRACT]
Symmetries (transformations by group actions) are present in many datasets,
and leveraging them holds significant promise for improving predictions in
machine learning. In this work, we aim to understand when and how deep networks
can learn symmetries from data. We focus on a supervised classification
paradigm where data symmetries are only partially observed during training:
some classes include all transformations of a cyclic group, while others
include only a subset. We ask: can deep networks generalize symmetry invariance
to the partially sampled classes? In the infinite-width limit, where kernel
analogies apply, we derive a neural kernel theory of symmetry learning to
address this question. The group-cyclic nature of the dataset allows us to
analyze the spectrum of neural kernels in the Fourier domain; here we find a
simple characterization of the generalization error as a function of the
interaction between class separation (signal) and class-orbit density (noise).
We observe that generalization can only be successful when the local structure
of the data prevails over its non-local, symmetric, structure, in the kernel
space defined by the architecture. This occurs when (1) classes are
sufficiently distinct and (2) class orbits are sufficiently dense. Our
framework also applies to equivariant architectures (e.g., CNNs), and recovers
their success in the special case where the architecture matches the inherent
symmetry of the data. Empirically, our theory reproduces the generalization
failure of finite-width networks (MLP, CNN, ViT) trained on partially observed
versions of rotated-MNIST. We conclude that conventional networks trained with
supervision lack a mechanism to learn symmetries that have not been explicitly
embedded in their architecture a priori. Our framework could be extended to
guide the design of architectures and training procedures able to learn
symmetries from data.
[LINK]
http://arxiv.org/abs/2412.11521v1
[DATE]
2024-12-16 15:56:54+08:00
[CATEGORIES]
cs.LG
Explicit and Implicit Graduated Optimization in Deep Neural Networks
[AUTHORS]
Naoki Sato, Hideaki Iiduka
[ABSTRACT]
Graduated optimization is a global optimization technique that is used to
minimize a multimodal nonconvex function by smoothing the objective function
with noise and gradually refining the solution. This paper experimentally
evaluates the performance of the explicit graduated optimization algorithm with
an optimal noise scheduling derived from a previous study and discusses its
limitations. The evaluation uses traditional benchmark functions and empirical loss
functions for modern neural network architectures. In addition,
this paper extends the implicit graduated optimization algorithm, which is
based on the fact that stochastic noise in the optimization process of SGD
implicitly smooths the objective function, to SGD with momentum, analyzes its
convergence, and demonstrates its effectiveness through experiments on image
classification tasks with ResNet architectures.
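
The idea of explicit graduated optimization summarised above can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the paper's algorithm or its optimal noise schedule: the objective f(x) = x^2 + A(1 - cos(Bx)), the constants A and B, the noise levels, and the step size are all chosen for illustration. For this particular f, Gaussian smoothing with standard deviation sigma has a closed form, since the cosine term is damped by exp(-(B*sigma)^2 / 2), so no sampling is needed.

```python
import math

# Illustrative constants (assumed): A * B**2 > 2 gives many local minima.
A, B = 2.0, 5.0


def f(x):
    """Multimodal objective with its global minimum at x = 0."""
    return x * x + A * (1.0 - math.cos(B * x))


def smoothed_grad(x, sigma):
    """Exact gradient of the Gaussian-smoothed objective E[f(x + sigma*u)]."""
    damp = math.exp(-0.5 * (B * sigma) ** 2)
    return 2.0 * x + A * B * damp * math.sin(B * x)


def descend(x, sigma, steps=500, lr=0.01):
    """Plain gradient descent on the objective smoothed at level sigma."""
    for _ in range(steps):
        x -= lr * smoothed_grad(x, sigma)
    return x


def graduated_descent(x0, sigmas=(1.0, 0.5, 0.25, 0.1, 0.0)):
    """Minimise a sequence of progressively less smoothed objectives,
    warm-starting each stage from the previous solution."""
    x = x0
    for sigma in sigmas:
        x = descend(x, sigma)
    return x


# Same total number of steps for both runs.
x_plain = descend(3.0, sigma=0.0, steps=2500)  # stalls in a local minimum
x_grad = graduated_descent(3.0)                # reaches the global minimum
```

Starting from the same point, descent on the unsmoothed objective settles in a nearby local minimum, while the graduated schedule first follows the nearly convex smoothed objective towards the origin and then refines the solution as the noise level shrinks.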
[COMMENTS]
Accepted at AAAI-25
[LINK]
http://arxiv.org/abs/2412.11501v1
[DATE]
2024-12-16 15:23:22+08:00
[CATEGORIES]
cs.LG
HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection
[AUTHORS]
Zijian Gu, Jianwei Ma, Yan Huang, Honghao Wei, Zhanye Chen, Hui Zhang, Wei Hong
[ABSTRACT]
Millimeter-wave radar plays a vital role in 3D object detection for
autonomous driving due to its all-weather and all-lighting-condition
capabilities for perception. However, radar point clouds suffer from pronounced
sparsity and unavoidable angle estimation errors. To address these limitations,
incorporating a camera may partially help mitigate the shortcomings.
Nevertheless, the direct fusion of radar and camera data can lead to negative
or even opposite effects due to the lack of depth information in images and
low-quality image features under adverse lighting conditions. Hence, in this
paper, we present the radar-camera fusion network with Hybrid Generation and
Synchronization (HGSFusion), designed to better fuse radar potentials and image
features for 3D object detection. Specifically, we propose the Radar Hybrid
Generation Module (RHGM), which fully considers the Direction-Of-Arrival (DOA)
estimation errors in radar signal processing. This module generates denser
radar points through different Probability Density Functions (PDFs) with the
assistance of semantic information. Meanwhile, we introduce the Dual Sync
Module (DSM), comprising spatial sync and modality sync, to enhance image
features with radar positional information and facilitate the fusion of
distinct characteristics in different modalities. Extensive experiments
demonstrate the effectiveness of our approach, outperforming the
state-of-the-art methods in the VoD and TJ4DRadSet datasets by $6.53\%$ and
$2.03\%$ in RoI AP and BEV AP, respectively. The code is available at
https://github.com/garfield-cpp/HGSFusion.
[COMMENTS]
12 pages, 8 figures, 7 tables. Accepted by AAAI 2025, the 39th
Annual AAAI Conference on Artificial Intelligence
[LINK]
http://arxiv.org/abs/2412.11489v1
[DATE]
2024-12-16 15:06:17+08:00
[CATEGORIES]
cs.LG
State-Space Systems as Dynamic Generative Models
[AUTHORS]
Juan-Pablo Ortega, Florian Rossmannek
[ABSTRACT]
A probabilistic framework to study the dependence structure induced by
deterministic discrete-time state-space systems between input and output
processes is introduced. General sufficient conditions are formulated under
which output processes exist and are unique once an input process has been
fixed, a property that in the deterministic state-space literature is known as
the echo state property. When those conditions are satisfied, the given
state-space system becomes a generative model for probabilistic dependences
between two sequence spaces. Moreover, those conditions guarantee that the
output depends continuously on the input when using the Wasserstein metric. The
output processes whose existence is proved are shown to be causal in a specific
sense and to generalize those studied in purely deterministic situations. The
results in this paper constitute a significant stochastic generalization of
sufficient conditions for the deterministic echo state property to hold, in the
sense that the stochastic echo state property can be satisfied under
contractivity conditions that are strictly weaker than those in deterministic
situations. This means that state-space systems can induce a purely
probabilistic dependence structure between input and output sequence spaces
even when there is no functional relation between those two spaces.
[LINK]
http://arxiv.org/abs/2404.08717v2
[DATE]
2024-12-16 14:56:28+08:00
[CATEGORIES]
cs.LG
Matrix Completion via Residual Spectral Matching
[AUTHORS]
Ziyuan Chen, Fang Yao
[ABSTRACT]
Noisy matrix completion has attracted significant attention due to its
applications in recommendation systems, signal processing and image
restoration. Most existing works rely on (weighted) least squares methods under
various low-rank constraints. However, minimizing the sum of squared residuals
is not always efficient, as it may ignore the potential structural information
in the residuals. In this study, we propose a novel residual spectral matching
criterion that incorporates not only the numerical but also locational
information of residuals. This criterion is the first in noisy matrix
completion to adopt the perspective of low-rank perturbation of random matrices
and exploit the spectral properties of sparse random matrices. We derive
optimal statistical properties by analyzing the spectral properties of sparse
random matrices and bounding the effects of low-rank perturbations and partial
observations. Additionally, we propose algorithms that efficiently approximate
solutions by constructing easily computable pseudo-gradients. The iterative
process of the proposed algorithms ensures convergence at a rate consistent
with the optimal statistical error bound. Our method and algorithms demonstrate
improved numerical performance in both simulated and real data examples,
particularly in environments with high noise levels.
[COMMENTS]
23 pages, 6 figures
[LINK]
http://arxiv.org/abs/2412.10005v2
[DATE]
2024-12-16 14:48:34+08:00
[CATEGORIES]
cs.LG
Deep Learning Calabi-Yau four folds with hybrid and recurrent neural network architectures
[AUTHORS]
H. L. Dao
[ABSTRACT]
In this work, we report the results of applying deep learning based on hybrid
convolutional-recurrent and purely recurrent neural network architectures to
the dataset of almost one million complete intersection Calabi-Yau four-folds
(CICY4) to machine-learn their four Hodge numbers $h^{1,1}, h^{2,1}, h^{3,1},
h^{2,2}$. In particular, we explored and experimented with twelve different
neural network models, nine of which are convolutional-recurrent (CNN-RNN)
hybrids with the RNN unit being either GRU (Gated Recurrent Unit) or Long Short
Term Memory (LSTM). The remaining four models are purely recurrent neural
networks based on LSTM. In terms of the $h^{1,1}, h^{2,1}, h^{3,1}, h^{2,2}$
prediction accuracies, at 72% training ratio, our best performing individual
model is CNN-LSTM-400, a hybrid CNN-LSTM with the LSTM hidden size of 400,
which obtained 99.74%, 98.07%, 95.19%, and 81.01%; our second best performing
individual model is LSTM-448, an LSTM-based model with the hidden size of 448,
which obtained 99.74%, 97.51%, 94.24%, and 78.63%. These results were improved
by forming ensembles of the top two, three or even four models. Our best
ensemble, consisting of the top four models, achieved the accuracies of 99.84%,
98.71%, 96.26%, 85.03%. At 80% training ratio, the top two performing models
LSTM-448 and LSTM-424 are both LSTM-based with the hidden sizes of 448 and 424.
Compared with the 72% training ratio, there is a significant improvement of
accuracies, which reached 99.85%, 98.66%, 96.26%, 84.77% for the best
individual model and 99.90%, 99.03%, 97.97%, 87.34% for the best ensemble. By
nature a proof of concept, the results of this work conclusively established
the utility of RNN-based architectures and demonstrated their effective
performances compared to the well-explored purely CNN-based architectures in
the problem of deep learning Calabi-Yau manifolds.
[COMMENTS]
v3: more discussions added, references added, typos corrected
[LINK]
http://arxiv.org/abs/2405.17406v3
[DATE]
2024-12-16 14:44:05+08:00
[CATEGORIES]
cs.LG
Vertical Federated Unlearning via Backdoor Certification
[AUTHORS]
Mengde Han, Tianqing Zhu, Lefeng Zhang, Huan Huo, Wanlei Zhou
[ABSTRACT]
Vertical Federated Learning (VFL) offers a novel paradigm in machine
learning, enabling distinct entities to train models cooperatively while
maintaining data privacy. This method is particularly pertinent when entities
possess datasets with identical sample identifiers but diverse attributes.
Recent privacy regulations emphasize an individual’s \emph{right to be
forgotten}, which necessitates the ability for models to unlearn specific
training data. The primary challenge is to develop a mechanism to eliminate the
influence of a specific client from a model without erasing all relevant data
from other clients. Our research investigates the removal of a single client’s
contribution within the VFL framework. We introduce an innovative modification
to traditional VFL by employing a mechanism that inverts the typical learning
trajectory with the objective of extracting specific data contributions. This
approach seeks to optimize model performance using gradient ascent, guided by a
pre-defined constrained model. We also introduce a backdoor mechanism to verify
the effectiveness of the unlearning procedure. Our method avoids fully
accessing the initial training data and avoids storing parameter updates.
Empirical evidence shows that the results align closely with those achieved by
retraining from scratch. Utilizing gradient ascent, our unlearning approach
addresses key challenges in VFL, laying the groundwork for future advancements
in this domain. All the code and implementations related to this paper are
publicly available at https://github.com/mengde-han/VFL-unlearn.
[LINK]
http://arxiv.org/abs/2412.11476v1
[DATE]
2024-12-16 14:40:25+08:00
[CATEGORIES]
cs.LG
Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World
[AUTHORS]
Joshua Kazdan, Rylan Schaeffer, Apratim Dey, Matthias Gerstgrasser, Rafael Rafailov, David L. Donoho, Sanmi Koyejo
[ABSTRACT]
The increasing presence of AI-generated content on the internet raises a
critical question: What happens when generative machine learning models are
pretrained on web-scale datasets containing data created by earlier models?
Some authors prophesy \textit{model collapse} under a `{\it replace}' scenario:
a sequence of models, the first trained with real data and each later one
trained {\it only on} synthetic data from its preceding model. In this
scenario, models successively degrade. Others see collapse as avoidable; in an
`{\it accumulate}' scenario, a sequence of models is trained, but each training
uses all real and synthetic data generated so far. In this work, we deepen and
extend the study of these contrasting scenarios. First, collapse versus
avoidance of collapse is studied by comparing the replace and accumulate
scenarios on each of three prominent generative modeling settings; we find the
same contrast emerges in all three settings. Second, we study a compromise
scenario; the available data remains the same as in the {\it accumulate}
scenario – but unlike {\it accumulate} and like {\it replace}, each model is
trained using a fixed compute budget; we demonstrate that model test loss on
real data is larger than in the {\it accumulate} scenario, but apparently
plateaus, unlike the divergence seen with {\it replace}. Third, we study the
relative importance of cardinality and proportion of real data for avoiding
model collapse. Surprisingly, we find a non-trivial interaction between real
and synthetic data, where the value of synthetic data for reducing test loss
depends on the absolute quantity of real data. Our insights are particularly
important when forecasting whether future frontier generative models will
collapse or thrive, and our results open avenues for empirically and
mathematically studying the context-dependent value of synthetic data.
[COMMENTS]
Accepted at NeurIPS 2024 Workshops: Mathematics of Modern Machine
Learning (M3L) and Attributing Model Behavior at Scale (ATTRIB)
[LINK]
http://arxiv.org/abs/2410.16713v2
[DATE]
2024-12-16 14:37:01+08:00
[CATEGORIES]
cs.LG
Phase-aware Training Schedule Simplifies Learning in Flow-Based Generative Models
[AUTHORS]
Santiago Aranguri, Francesco Insulla
[ABSTRACT]
We analyze the training of a two-layer autoencoder used to parameterize a
flow-based generative model for sampling from a high-dimensional Gaussian
mixture. Previous work shows that the phase where the relative probability
between the modes is learned disappears as the dimension goes to infinity
without an appropriate time schedule. We introduce a time dilation that solves
this problem. This enables us to characterize the learned velocity field,
finding a first phase where the probability of each mode is learned and a
second phase where the variance of each mode is learned. We find that the
autoencoder representing the velocity field learns to simplify by estimating
only the parameters relevant to each phase. Turning to real data, we propose a
method that, for a given feature, finds intervals of time where training
improves accuracy the most on that feature. Since practitioners take a uniform
distribution over training times, our method enables more efficient training.
We provide preliminary experiments validating this approach.
[LINK]
http://arxiv.org/abs/2412.07972v2
[DATE]
2024-12-16 14:12:06+08:00
[CATEGORIES]
cs.LG
Near-Field Spot Beamfocusing: A Correlation-Aware Transfer Learning Approach
[AUTHORS]
Mohammad Amir Fallah, Mehdi Monemi, Mehdi Rasti, Matti Latva-Aho
[ABSTRACT]
Three-dimensional (3D) spot beamfocusing (SBF), in contrast to conventional
angular-domain beamforming, concentrates radiating power within a very small
volume in both radial and angular domains in the near-field zone. Recently,
channel-state-information (CSI)-independent machine learning
(ML)-based approaches have been developed for effective SBF using extremely
large-scale programmable metasurfaces (ELPMs). These methods involve dividing
the ELPMs into subarrays and independently training them with Deep
Reinforcement Learning to jointly focus the beam at the desired focal point
(DFP).
This paper explores near-field SBF using ELPMs, addressing challenges
associated with lengthy training times resulting from independent training of
subarrays. To achieve a faster CSI-independent solution, inspired by the
correlation between the beamfocusing matrices of the subarrays, we leverage
transfer learning techniques. First, we introduce a novel similarity criterion
based on the phase distribution image (PDI) of subarray apertures. Then we
devise a subarray policy propagation scheme that transfers the knowledge from
trained to untrained subarrays. We further enhance learning by introducing
quasi-liquid layers as a revised version of the adaptive policy reuse
technique. We show through simulations that the proposed scheme speeds up
training by about 5 times. Furthermore, for dynamic DFP management, we
devise a DFP policy blending process, which improves the convergence rate by up
to 8-fold.
[LINK]
http://arxiv.org/abs/2405.19347v3
[DATE]
2024-12-16 13:56:32+08:00
[CATEGORIES]
cs.LG
Mining In-distribution Attributes in Outliers for Out-of-distribution Detection
[AUTHORS]
Yutian Lei, Luping Ji, Pei Liu
[ABSTRACT]
Out-of-distribution (OOD) detection is indispensable for deploying reliable
machine learning systems in real-world scenarios. Recent works, using auxiliary
outliers in training, have shown good potential. However, they seldom consider
the intrinsic correlations between in-distribution (ID) and OOD data. In this
work, we discover an obvious correlation that OOD data usually possesses
significant ID attributes. These attributes should be factored into the
training process, rather than blindly suppressed as in previous approaches.
Based on this insight, we propose a structured multi-view-based
out-of-distribution detection learning (MVOL) framework, which facilitates
rational handling of the intrinsic in-distribution attributes in outliers. We
provide theoretical insights on the effectiveness of MVOL for OOD detection.
Extensive experiments demonstrate the superiority of our framework to others.
MVOL effectively utilizes both auxiliary OOD datasets and even wild datasets
with noisy in-distribution data. Code is available at
https://github.com/UESTC-nnLab/MVOL.
[COMMENTS]
Accepted by AAAI2025
[LINK]
http://arxiv.org/abs/2412.11466v1
[DATE]
2024-12-16 13:47:35+08:00
[CATEGORIES]
cs.LG
FedCAR: Cross-client Adaptive Re-weighting for Generative Models in Federated Learning
[AUTHORS]
Minjun Kim, Minjee Kim, Jinhoon Jeong
[ABSTRACT]
Generative models trained on multi-institutional datasets can provide an
enriched understanding through diverse data distributions. However, training
the models on medical images is often challenging due to hospitals’ reluctance
to share data for privacy reasons. Federated learning (FL) has emerged as a
privacy-preserving solution for training distributed datasets across data
centers by aggregating model weights from multiple clients instead of sharing
raw data. Previous research has explored the adaptation of FL to generative
models, yet effective aggregation algorithms specifically tailored for
generative models remain unexplored. We hereby propose a novel algorithm aimed
at improving the performance of generative models within FL. Our approach
adaptively re-weights the contribution of each client, resulting in
well-trained shared parameters. In each round, the server side measures the
distribution distance between fake images generated by clients instead of
directly comparing the Fréchet Inception Distance per client, thereby
improving learning efficiency. Experimental results on three public
chest X-ray datasets show superior performance in medical image generation,
outperforming both centralized learning and conventional FL algorithms. Our
code is available at https://github.com/danny0628/FedCAR.
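The re-weighting idea can be illustrated with a short sketch. The abstract
does not specify how distances are turned into client weights, so everything
here is an assumption made for illustration: the per-client distance scores,
the softmax-style mapping in `client_weights`, and the rule that a smaller
distance earns a larger weight. It is not the authors' algorithm.

```python
import math
from typing import List, Sequence


def client_weights(distances: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Map hypothetical per-client distance scores to normalized weights.

    Assumption: a smaller distance means a more useful client, so the
    weight is proportional to exp(-distance / temperature).
    """
    scores = [math.exp(-d / temperature) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


def aggregate(client_params: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted average of flattened client parameter vectors."""
    dim = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two toy clients with flattened two-parameter "models".
    weights = client_weights([0.1, 1.0])
    print(weights)
    print(aggregate([[1.0, 2.0], [3.0, 4.0]], weights))
```

With equal distances the sketch reduces to plain uniform averaging, which is
the conventional federated averaging baseline for clients of equal size.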
[LINK]
http://arxiv.org/abs/2412.11463v1
[DATE]
2024-12-16 13:43:14+08:00
[CATEGORIES]
cs.LG
Unsupervised Anomaly Detection for Tabular Data Using Noise Evaluation
[AUTHORS]
Wei Dai, Kai Hwang, Jicong Fan
[ABSTRACT]
Unsupervised anomaly detection (UAD) plays an important role in modern data
analytics and it is crucial to provide simple yet effective and guaranteed UAD
algorithms for real applications. In this paper, we present a novel UAD method
for tabular data by evaluating how much noise is in the data. Specifically, we
propose to learn a deep neural network from the clean (normal) training dataset
and a noisy dataset, where the latter is generated by adding highly diverse
noises to the clean data. The neural network can learn a reliable decision
boundary between normal data and anomalous data when the diversity of the
generated noisy data is sufficiently high so that the hard abnormal samples lie
in the noisy region. Importantly, we provide theoretical guarantees, proving
that the proposed method can detect anomalous data successfully, although the
method does not utilize any real anomalous data in the training stage.
Extensive experiments through more than 60 benchmark datasets demonstrate the
effectiveness of the proposed method in comparison to 12 baselines of UAD. Our
method obtains a 92.27% AUC score and a 1.68 ranking score on average.
Moreover, compared to the state-of-the-art UAD methods, our method is easier to
implement.
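The data-construction step described above (pairing clean samples with
noise-corrupted copies) might look like the following sketch. The Gaussian
noise, the particular noise scales, and the function name
`make_noisy_dataset` are assumptions chosen for illustration; the abstract
only states that highly diverse noise is added to the clean data.

```python
import random
from typing import List, Sequence, Tuple


def make_noisy_dataset(
    clean: Sequence[Sequence[float]],
    scales: Sequence[float] = (0.1, 1.0, 5.0),
    seed: int = 0,
) -> Tuple[List[List[float]], List[int]]:
    """Return rows and labels: 0 for a clean row, 1 for its noisy copy.

    Each noisy copy uses a noise scale drawn at random from `scales`,
    a simple (assumed) way to make the generated noise diverse.
    """
    rng = random.Random(seed)
    rows: List[List[float]] = []
    labels: List[int] = []
    for row in clean:
        rows.append(list(row))
        labels.append(0)
        scale = rng.choice(list(scales))
        rows.append([value + rng.gauss(0.0, scale) for value in row])
        labels.append(1)
    return rows, labels


if __name__ == "__main__":
    rows, labels = make_noisy_dataset([[0.0, 1.0], [2.0, 3.0]])
    for row, label in zip(rows, labels):
        print(label, row)
```

A classifier or regressor trained on such pairs never sees real anomalies,
which matches the training setting the abstract describes.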
[COMMENTS]
The paper was accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11461v1
[DATE]
2024-12-16 13:35:58+08:00
[CATEGORIES]
cs.LG
Improving Sequential Market Coordination via Value-oriented Renewable Energy Forecasting
[AUTHORS]
Yufan Zhang, Honglin Wen, Yuexin Bian, Yuanyuan Shi
[ABSTRACT]
Large penetration of renewable energy sources (RESs) brings huge uncertainty
into the electricity markets. The current deterministic clearing approach in
the day-ahead (DA) market, where RESs participate based on expected production,
has been criticized for causing a lack of coordination between the DA and
real-time (RT) markets, leading to high overall operating costs. Previous works
indicate that improving day-ahead RES entering quantities can significantly
mitigate the drawbacks of deterministic clearing. In this work, we propose
using a trained forecasting model, referred to as value-oriented forecasting,
to determine RES Improved Entering Quantities (RIEQ) more efficiently during
the operational phase. Unlike traditional models that minimize statistical
forecasting errors, our approach trains model parameters to minimize the
expected overall operating costs across both DA and RT markets. We derive the
exact form of the loss function used for training, which becomes piecewise
linear when market clearing is modeled by linear programs. Additionally, we
provide the analytical gradient of the loss function with respect to the
forecast, enabling an efficient training strategy. Numerical studies
demonstrate that our forecasts significantly reduce overall operating costs for
deterministic market clearing compared to conventional forecasts based on
expected RES production.
[COMMENTS]
Submitted to IEEE Transactions on Energy Markets, Policy, and
Regulation
[LINK]
http://arxiv.org/abs/2405.09004v2
[DATE]
2024-12-16 13:34:33+08:00
[CATEGORIES]
cs.LG
Regional Expected Improvement for Efficient Trust Region Selection in High-Dimensional Bayesian Optimization
[AUTHORS]
Nobuo Namura, Sho Takemori
[ABSTRACT]
Real-world optimization problems often involve complex objective functions
with costly evaluations. While Bayesian optimization (BO) with Gaussian
processes is effective for these challenges, it suffers in high-dimensional
spaces due to performance degradation from limited function evaluations. To
overcome this, simplification techniques like dimensionality reduction have
been employed, yet they often rely on assumptions about the problem
characteristics, potentially underperforming when these assumptions do not
hold. Trust-region-based methods, which avoid such assumptions, focus on local
search but risk stagnation in local optima. In this study, we propose a novel
acquisition function, regional expected improvement (REI), designed to enhance
trust-region-based BO in medium to high-dimensional settings. REI identifies
regions likely to contain the global optimum, improving performance without
relying on specific problem characteristics. We provide a theoretical proof
that REI effectively identifies optimal trust regions and empirically
demonstrate that incorporating REI into trust-region-based BO outperforms
conventional BO and other high-dimensional BO methods in medium to
high-dimensional real-world problems.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11456v1
[DATE]
2024-12-16 13:23:01+08:00
[CATEGORIES]
cs.LG
Multilabel Classification for Lung Disease Detection: Integrating Deep Learning and Natural Language Processing
[AUTHORS]
Maria Efimovich, Jayden Lim, Vedant Mehta, Ethan Poon
[ABSTRACT]
Classifying chest radiographs is a time-consuming and challenging task, even
for experienced radiologists. This provides an area for improvement due to the
difficulty in precisely distinguishing between conditions such as pleural
effusion, pneumothorax, and pneumonia. We propose a novel transfer learning
model for multi-label lung disease classification, utilizing the CheXpert
dataset with over 12,617 images of frontal radiographs being analyzed. By
integrating RadGraph parsing for efficient annotation extraction, we enhance
the model’s ability to accurately classify multiple lung diseases from complex
medical images. The proposed model achieved an F1 score of 0.69 and an AUROC of
0.86, demonstrating its potential for clinical applications. Also explored was
the use of Natural Language Processing (NLP) to parse report metadata and
address uncertainties in disease classification. By comparing uncertain reports
with more certain cases, the NLP-enhanced model improves its ability to
conclusively classify conditions. This research highlights the connection
between deep learning and NLP, underscoring their potential to enhance
radiological diagnostics and aid in the efficient analysis of chest
radiographs.
[COMMENTS]
All authors contributed equally
[LINK]
http://arxiv.org/abs/2412.11452v1
[DATE]
2024-12-16 13:14:08+08:00
[CATEGORIES]
cs.LG
Data-Dependent Generalization Bounds for Parameterized Quantum Models Under Noise
[AUTHORS]
Bikram Khanal, Pablo Rivas
[ABSTRACT]
Quantum machine learning offers a transformative approach to solving complex
problems, but the inherent noise hinders its practical implementation in
near-term quantum devices. This obstacle makes it challenging to understand the
generalization capabilities of quantum circuit models. Designing robust quantum
machine learning models under noise requires a principled understanding of
complexity and generalization, extending beyond classical capacity measures.
This study investigates the generalization properties of parameterized quantum
machine learning models under the influence of noise. We present a
data-dependent generalization bound grounded in the quantum Fisher information
matrix. We leverage statistical learning theory to relate the parameter space
volumes and training sizes to estimate the generalization capability of the
trained model. By integrating local parameter neighborhoods and effective
dimensions defined through quantum Fisher information matrix eigenvalues, we
provide a structured characterization of complexity in quantum models. We
analyze the tightness of the bound and discuss the trade-off between model
expressiveness and generalization performance.
[LINK]
http://arxiv.org/abs/2412.11451v1
[DATE]
2024-12-16 13:10:58+08:00
[CATEGORIES]
cs.LG
SoundMorpher: Perceptually-Uniform Sound Morphing with Diffusion Model
[AUTHORS]
Xinlei Niu, Jing Zhang, Charles Patrick Martin
[ABSTRACT]
We present SoundMorpher, an open-world sound morphing method designed to
generate perceptually uniform morphing trajectories. Traditional sound morphing
techniques typically assume a linear relationship between the morphing factor
and sound perception, achieving smooth transitions by linearly interpolating
the semantic features of source and target sounds while gradually adjusting the
morphing factor. However, these methods oversimplify the complexities of sound
perception, resulting in limitations in morphing quality. In contrast,
SoundMorpher explores an explicit relationship between the morphing factor and
the perception of morphed sounds, leveraging log Mel-spectrogram features. This
approach further refines the morphing sequence by ensuring a constant target
perceptual difference for each transition and determining the corresponding
morphing factors using binary search. To address the lack of a formal
quantitative evaluation framework for sound morphing, we propose a set of
metrics based on three established objective criteria. These metrics enable
comprehensive assessment of morphed results and facilitate direct comparisons
between methods, fostering advancements in sound morphing research. Extensive
experiments demonstrate the effectiveness and versatility of SoundMorpher in
real-world scenarios, showcasing its potential in applications such as creative
music composition, film post-production, and interactive audio technologies.
Our demonstration and codes are available at
https://xinleiniu.github.io/SoundMorpher-demo/.
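The binary-search step mentioned above can be sketched as follows. This is a
simplified illustration, not the authors' implementation: `distance` is a
hypothetical stand-in for a perceptual distance between the source sound and
the morph at factor alpha (for example, one computed on log Mel-spectrogram
features), and it is assumed to be non-decreasing in alpha. The sketch spaces
morphs evenly in that distance rather than evenly in alpha.

```python
from typing import Callable, List


def find_factor(
    distance: Callable[[float], float],
    target: float,
    lo: float = 0.0,
    hi: float = 1.0,
    tol: float = 1e-6,
) -> float:
    """Binary-search the morphing factor whose distance is closest to target.

    Assumes distance(alpha) is non-decreasing on [lo, hi].
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if distance(mid) < target:
            lo = mid  # morph is still too close to the source
        else:
            hi = mid  # morph has reached or passed the target distance
    return (lo + hi) / 2.0


def uniform_factors(distance: Callable[[float], float], steps: int) -> List[float]:
    """Morphing factors whose distances from the source are evenly spaced."""
    total = distance(1.0)
    return [find_factor(distance, total * k / steps) for k in range(steps + 1)]


if __name__ == "__main__":
    # Toy nonlinear perception curve, invented for the example.
    print(uniform_factors(lambda alpha: alpha ** 2, steps=4))
```

With the toy curve, the returned factors are unevenly spaced in alpha, which
is the point: equal perceptual steps generally need unequal factor steps.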
[LINK]
http://arxiv.org/abs/2410.02144v2
[DATE]
2024-12-16 13:10:56+08:00
[CATEGORIES]
cs.LG
Universal Domain Adaptive Object Detection via Dual Probabilistic Alignment
[AUTHORS]
Yuanfan Zheng, Jinlin Wu, Wuyang Li, Zhen Chen
[ABSTRACT]
Domain Adaptive Object Detection (DAOD) transfers knowledge from a labeled
source domain to an unannotated target domain under closed-set assumption.
Universal DAOD (UniDAOD) extends DAOD to handle open-set, partial-set, and
closed-set domain adaptation. In this paper, we first unveil two issues:
domain-private category alignment is crucial for global-level features, and the
domain probability heterogeneity of features across different levels. To
address these issues, we propose a novel Dual Probabilistic Alignment (DPA)
framework to model domain probability as Gaussian distribution, enabling the
heterogeneity domain distribution sampling and measurement. The DPA consists of
three tailored modules: the Global-level Domain Private Alignment (GDPA), the
Instance-level Domain Shared Alignment (IDSA), and the Private Class Constraint
(PCC). GDPA utilizes the global-level sampling to mine domain-private category
samples and calculate alignment weight through a cumulative distribution
function to address the global-level private category alignment. IDSA utilizes
instance-level sampling to mine domain-shared category samples and calculates
alignment weight through Gaussian distribution to conduct the domain-shared
category domain alignment to address the feature heterogeneity. The PCC
aggregates domain-private category centroids between feature and probability
spaces to mitigate negative transfer. Extensive experiments demonstrate that
our DPA outperforms state-of-the-art UniDAOD and DAOD methods across various
datasets and scenarios, including open, partial, and closed sets. Codes are
available at https://github.com/zyfone/DPA.
[COMMENTS]
This work is accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11443v1
[DATE]
2024-12-16 12:55:13+08:00
[CATEGORIES]
cs.LG
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
[AUTHORS]
Jaehyeon Kim, Taehong Moon, Keon Lee, Jaewoong Cho
[ABSTRACT]
We explore the use of Residual Vector Quantization (RVQ) for high-fidelity
generation in vector-quantized generative models. This quantization technique
maintains higher data fidelity by employing more in-depth tokens. However,
increasing the token number in generative models leads to slower inference
speeds. To this end, we introduce ResGen, an efficient RVQ-based discrete
diffusion model that generates high-fidelity samples without compromising
sampling speed. Our key idea is a direct prediction of vector embedding of
collective tokens rather than individual ones. Moreover, we demonstrate that
our proposed token masking and multi-token prediction method can be formulated
within a principled probabilistic framework using a discrete diffusion process
and variational inference. We validate the efficacy and generalizability of the
proposed method on two challenging tasks across different modalities:
conditional image generation on ImageNet 256x256 and zero-shot text-to-speech
synthesis. Experimental results demonstrate that ResGen outperforms
autoregressive counterparts in both tasks, delivering superior performance
without compromising sampling speed. Furthermore, as we scale the depth of RVQ,
our generative models exhibit enhanced generation fidelity or faster sampling
speeds compared to similarly sized baseline models. The project page can be
found at https://resgen-genai.github.io
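For readers unfamiliar with RVQ, the following sketch shows the standard
encode and decode steps on toy data. It illustrates generic residual vector
quantization only, not ResGen itself; the codebooks and the input vector are
invented for the example.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Codebook = Sequence[Sequence[float]]


def nearest(codebook: Codebook, x: Sequence[float]) -> int:
    """Index of the code vector closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rvq_encode(x: Sequence[float], codebooks: Sequence[Codebook]) -> Tuple[List[int], Vector]:
    """Quantize x depth by depth; each depth encodes the previous residual."""
    residual = list(x)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, residual


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Reconstruct by summing the selected code vector from every depth."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


if __name__ == "__main__":
    coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]
    tokens, residual = rvq_encode([1.2, 0.8], [coarse, fine])
    print(tokens)                              # [1, 1]
    print(rvq_decode(tokens, [coarse, fine]))  # [1.25, 0.75]
```

Each position therefore carries one token per depth, and the reconstruction
is the sum of the selected code vectors; deeper stacks improve fidelity at
the cost of more tokens, which is the trade-off the abstract refers to.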
[LINK]
http://arxiv.org/abs/2412.10208v2
[DATE]
2024-12-16 12:51:01+08:00
[CATEGORIES]
cs.LG
UIBDiffusion: Universal Imperceptible Backdoor Attack for Diffusion Models
[AUTHORS]
Yuning Han, Bingyin Zhao, Rui Chu, Feng Luo, Biplab Sikdar, Yingjie Lao
[ABSTRACT]
Recent studies show that diffusion models (DMs) are vulnerable to backdoor
attacks. Existing backdoor attacks impose unconcealed triggers (e.g., a gray
box and eyeglasses) that contain evident patterns, rendering remarkable attack
effects yet easy detection upon human inspection and defensive algorithms.
While it is possible to improve stealthiness by reducing the strength of the
backdoor, doing so can significantly compromise its generality and
effectiveness. In this paper, we propose UIBDiffusion, the universal
imperceptible backdoor attack for diffusion models, which allows us to achieve
superior attack and generation performance while evading state-of-the-art
defenses. We propose a novel trigger generation approach based on universal
adversarial perturbations (UAPs) and reveal that such perturbations, which are
initially devised for fooling pre-trained discriminative models, can be adapted
as potent imperceptible backdoor triggers for DMs. We evaluate UIBDiffusion on
multiple types of DMs with different kinds of samplers across various datasets
and targets. Experimental results demonstrate that UIBDiffusion brings three
advantages: 1) Universality, the imperceptible trigger is universal (i.e.,
image and model agnostic) where a single trigger is effective to any images and
all diffusion models with different samplers; 2) Utility, it achieves
comparable generation quality (e.g., FID) and even better attack success rate
(i.e., ASR) at low poison rates compared to the prior works; and 3)
Undetectability, UIBDiffusion is plausible to human perception and can bypass
Elijah and TERD, the SOTA defenses against backdoors for DMs. We will release
our backdoor triggers and code.
[LINK]
http://arxiv.org/abs/2412.11441v1
[DATE]
2024-12-16 12:47:55+08:00
[CATEGORIES]
cs.LG
Bayesian Flow Is All You Need to Sample Out-of-Distribution Chemical Spaces
[AUTHORS]
Nianze Tao
[ABSTRACT]
Generating novel molecules with better properties than those in the training
space, namely out-of-distribution generation, is important for de novo drug
design. However, it is not easy for distribution learning-based models, for
example diffusion models, to solve this challenge, as these methods are designed
to fit the distribution of training data as closely as possible. In this paper,
we show that the Bayesian flow network is capable of effortlessly generating
high-quality out-of-distribution samples that meet several scenarios. We introduce a
semi-autoregressive training/sampling method that helps to enhance the model
performance and surpass the state-of-the-art models.
[LINK]
http://arxiv.org/abs/2412.11439v1
[DATE]
2024-12-16 12:43:54+08:00
[CATEGORIES]
cs.LG
Deep Learning Model Security: Threats and Defenses
[AUTHORS]
Tianyang Wang, Ziqian Bi, Yichao Zhang, Ming Liu, Weiche Hsieh, Pohsun Feng, Lawrence K. Q. Yan, Yizhu Wen, Benji Peng, Junyu Liu, Keyu Chen, Sen Zhang, Ming Li, Chuanqi Jiang, Xinyuan Song, Junjie Yang, Bowen Jing, Jintao Ren, Junhao Song, Hong-Ming Tseng, Silin Chen, Yunze Wang, Chia Xin Liang, Jiawei Xu, Xuanhe Pan, Jinlang Wang, Qian Niu
[ABSTRACT]
Deep learning has transformed AI applications but faces critical security
challenges, including adversarial attacks, data poisoning, model theft, and
privacy leakage. This survey examines these vulnerabilities, detailing their
mechanisms and impact on model integrity and confidentiality. Practical
implementations, including adversarial examples, label flipping, and backdoor
attacks, are explored alongside defenses such as adversarial training,
differential privacy, and federated learning, highlighting their strengths and
limitations.
Advanced methods like contrastive and self-supervised learning are presented
for enhancing robustness. The survey concludes with future directions,
emphasizing automated defenses, zero-trust architectures, and the security
challenges of large AI models. A balanced approach to performance and security
is essential for developing reliable deep learning systems.
[LINK]
http://arxiv.org/abs/2412.08969v2
[DATE]
2024-12-16 11:34:55+08:00
[CATEGORIES]
cs.LG
MGDA: Model-based Goal Data Augmentation for Offline Goal-conditioned Weighted Supervised Learning
[AUTHORS]
Xing Lei, Xuetao Zhang, Donglin Wang
[ABSTRACT]
Recently, a state-of-the-art family of algorithms, known as Goal-Conditioned
Weighted Supervised Learning (GCWSL) methods, has been introduced to tackle
challenges in offline goal-conditioned reinforcement learning (RL). GCWSL
optimizes a lower bound of the goal-conditioned RL objective and has
demonstrated outstanding performance across diverse goal-reaching tasks,
providing a simple, effective, and stable solution. However, prior research has
identified a critical limitation of GCWSL: the lack of trajectory stitching
capabilities. To address this, goal data augmentation strategies have been
proposed to enhance these methods. Nevertheless, existing techniques often
struggle to sample suitable augmented goals for GCWSL effectively. In this
paper, we establish unified principles for goal data augmentation, focusing on
goal diversity, action optimality, and goal reachability. Based on these
principles, we propose a Model-based Goal Data Augmentation (MGDA) approach,
which leverages a learned dynamics model to sample more suitable augmented
goals. MGDA uniquely incorporates the local Lipschitz continuity assumption
within the learned model to mitigate the impact of compounding errors.
Empirical results show that MGDA significantly enhances the performance of
GCWSL methods on both state-based and vision-based maze datasets, surpassing
previous goal data augmentation techniques in improving stitching capabilities.
[LINK]
http://arxiv.org/abs/2412.11410v1
[DATE]
2024-12-16 11:25:28+08:00
[CATEGORIES]
cs.LG
Formulations and scalability of neural network surrogates in nonlinear optimization problems
[AUTHORS]
Robert B. Parker, Oscar Dowson, Nicole LoGiudice, Manuel Garcia, Russell Bent
[ABSTRACT]
We compare full-space, reduced-space, and gray-box formulations for
representing trained neural networks in nonlinear constrained optimization
problems. We test these formulations on a transient stability-constrained,
security-constrained alternating current optimal power flow (SCOPF) problem
where the transient stability criteria are represented by a trained neural
network surrogate. Optimization problems are implemented in JuMP and trained
neural networks are embedded using a new Julia package: MathOptAI.jl. To study
the bottlenecks of the three formulations, we use neural networks with up to
590 million trained parameters. The full-space formulation is bottlenecked by
the linear solver used by the optimization algorithm, while the reduced-space
formulation is bottlenecked by the algebraic modeling environment and
derivative computations. The gray-box formulation is the most scalable and is
capable of solving with the largest neural networks tested. It is bottlenecked
by evaluation of the neural network’s outputs and their derivatives, which may
be accelerated with a graphics processing unit (GPU). Leveraging the gray-box
formulation and GPU acceleration, we solve our test problem with our largest
neural network surrogate in 2.5$\times$ the time required for a simpler SCOPF
problem without the stability constraint.
[LINK]
http://arxiv.org/abs/2412.11403v1
[DATE]
2024-12-16 11:09:06+08:00
[CATEGORIES]
cs.LG
Modeling Inter-Intra Heterogeneity for Graph Federated Learning
[AUTHORS]
Wentao Yu, Shuo Chen, Yongxin Tong, Tianlong Gu, Chen Gong
[ABSTRACT]
Heterogeneity is a fundamental and challenging issue in federated learning,
especially for the graph data due to the complex relationships among the graph
nodes. To deal with the heterogeneity, lots of existing methods perform the
weighted federation based on their calculated similarities between pairwise
clients (i.e., subgraphs). However, their inter-subgraph similarities estimated
with the outputs of local models are less reliable, because the final outputs
of local models may not comprehensively represent the real distribution of
subgraph data. In addition, they ignore the critical intra-heterogeneity which
usually exists within each subgraph itself. To address these issues, we propose
a novel Federated learning method by integrally modeling the Inter-Intra
Heterogeneity (FedIIH). For the inter-subgraph relationship, we propose a novel
hierarchical variational model to infer the whole distribution of subgraph data
in a multi-level form, so that we can accurately characterize the
inter-subgraph similarities with the global perspective. For the
intra-heterogeneity, we disentangle the subgraph into multiple latent factors
and partition the model parameters into multiple parts, where each part
corresponds to a single latent factor. Our FedIIH not only properly computes
the distribution similarities between subgraphs, but also learns disentangled
representations that are robust to irrelevant factors within subgraphs, so that
it successfully considers the inter- and intra- heterogeneity simultaneously.
Extensive experiments on six homophilic and five heterophilic graph datasets in
both non-overlapping and overlapping settings demonstrate the effectiveness of
our method when compared with nine state-of-the-art methods. Specifically,
FedIIH outperforms the second-best method by a large margin of 5.79% on average
across all heterophilic datasets.
[COMMENTS]
accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11402v1
[DATE]
2024-12-16 11:02:42+08:00
[CATEGORIES]
cs.LG
Scaled Conjugate Gradient Method for Nonconvex Optimization in Deep Neural Networks
[AUTHORS]
Naoki Sato, Koshiro Izumi, Hideaki Iiduka
[ABSTRACT]
A scaled conjugate gradient method that accelerates existing adaptive methods
utilizing stochastic gradients is proposed for solving nonconvex optimization
problems with deep neural networks. It is shown theoretically that, whether
with constant or diminishing learning rates, the proposed method can obtain a
stationary point of the problem. Additionally, its rate of convergence with
diminishing learning rates is verified to be superior to that of the conjugate
gradient method. The proposed method is shown to minimize training loss
functions faster than the existing adaptive methods in practical applications
of image and text classification. Furthermore, in the training of generative
adversarial networks, one version of the proposed method achieved the lowest
Fréchet inception distance score among those of the adaptive methods.
[COMMENTS]
Accepted at JMLR (Dec. 2024)
[LINK]
http://arxiv.org/abs/2412.11400v1
[DATE]
2024-12-16 10:57:23+08:00
[CATEGORIES]
cs.LG
Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model
[AUTHORS]
Xiaochong Dong, Jun Dan, Yingyun Sun, Yang Liu, Xuemin Zhang, Shengwei Mei
[ABSTRACT]
Driven by global climate change and the ongoing energy transition, the
coupling between power supply capabilities and meteorological factors has
become increasingly significant. Over the long term, accurately quantifying the
power generation capacity of renewable energy under the influence of climate
change is essential for the development of sustainable power systems. However,
due to interdisciplinary differences in data requirements, climate data often
lacks the necessary hourly resolution to capture the short-term variability and
uncertainties of renewable energy resources. To address this limitation, a
super-resolution recurrent diffusion model (SRDM) has been developed to enhance
the temporal resolution of climate data and model the short-term uncertainty.
The SRDM incorporates a pre-trained decoder and a denoising network that
generates long-term, high-resolution climate data through a recurrent coupling
mechanism. The high-resolution climate data is then converted into power value
using the mechanism model, enabling the simulation of wind and photovoltaic
(PV) power generation capacity on future long-term scales. Case studies were
conducted in the Ejina region of Inner Mongolia, China, using fifth-generation
reanalysis (ERA5) and coupled model intercomparison project (CMIP6) data under
two climate pathways: SSP126 and SSP585. The results demonstrate that the SRDM
outperforms existing generative models in generating super-resolution climate
data. For the Ejina region, under a high-emission pathway, the annual
utilization hours of wind power are projected to decrease by 2.82 hours/year,
while those for PV power are projected to decrease by 0.26 hours/year.
Furthermore, the research highlights the estimation biases introduced when
low-resolution climate data is used for power conversion.
[LINK]
http://arxiv.org/abs/2412.11399v1
[DATE]
2024-12-16 10:54:21+08:00
[CATEGORIES]
cs.LG
Omics-driven hybrid dynamic modeling of bioprocesses with uncertainty estimation
[AUTHORS]
Sebastián Espinel-Ríos, José Montaño López, José L. Avalos
[ABSTRACT]
This work presents an omics-driven modeling pipeline that integrates
machine-learning tools to facilitate the dynamic modeling of multiscale
biological systems. Random forests and permutation feature importance are
proposed to mine omics datasets, guiding feature selection and dimensionality
reduction for dynamic modeling. Continuous and differentiable machine-learning
functions can be trained to link the reduced omics feature set to key
components of the dynamic model, resulting in a hybrid model. As proof of
concept, we apply this framework to a high-dimensional proteomics dataset of
Saccharomyces cerevisiae. After identifying key intracellular
proteins that correlate with cell growth, targeted dynamic experiments are
designed, and key model parameters are captured as functions of the selected
proteins using Gaussian processes. This approach captures the dynamic behavior
of yeast strains under varying proteome profiles while estimating the
uncertainty in the hybrid model’s predictions. The outlined modeling framework
is adaptable to other scenarios, such as integrating additional layers of omics
data for more advanced multiscale biological systems, or employing alternative
machine-learning methods to handle larger datasets. Overall, this study
outlines a strategy for leveraging omics data to inform multiscale dynamic
modeling in systems biology and bioprocess engineering.
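Permutation feature importance, one of the tools named above, can be
sketched in a few lines. This is a generic illustration rather than the
authors' pipeline: `predict` stands in for any fitted model (a random forest
in the paper), and the toy data and the mean-squared-error metric are
assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(
    predict: Callable[[List[float]], float],
    X: Sequence[List[float]],
    y: Sequence[float],
    n_repeats: int = 10,
    seed: int = 0,
) -> List[float]:
    """Average increase in MSE when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances: List[float] = []
    for j in range(len(X[0])):
        increases: List[float] = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


if __name__ == "__main__":
    # Toy model that depends only on feature 0.
    X = [[float(i), float((i * 7) % 5)] for i in range(20)]
    y = [2.0 * row[0] for row in X]
    print(permutation_importance(lambda row: 2.0 * row[0], X, y))
```

Features whose shuffling barely changes the error are candidates for removal,
which is how such scores can guide dimensionality reduction before dynamic
modeling.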
[LINK]
http://arxiv.org/abs/2410.18864v2
[DATE]
2024-12-16 10:54:14+08:00
[CATEGORIES]
cs.LG
Are Large Language Models Useful for Time Series Data Analysis?
[AUTHORS]
Francis Tang, Ying Ding
[ABSTRACT]
Time series data plays a critical role across diverse domains such as
healthcare, energy, and finance, where tasks like classification, anomaly
detection, and forecasting are essential for informed decision-making.
Recently, large language models (LLMs) have gained prominence for their ability
to handle complex data and extract meaningful insights. This study investigates
whether LLMs are effective for time series data analysis by comparing their
performance with non-LLM-based approaches across three tasks: classification,
anomaly detection, and forecasting.
Through a series of experiments using GPT4TS and autoregressive models, we
evaluate their performance on benchmark datasets and assess their accuracy,
precision, and ability to generalize. Our findings indicate that while
LLM-based methods excel in specific tasks like anomaly detection, their
benefits are less pronounced in others, such as forecasting, where simpler
models sometimes perform comparably or better. This research highlights the
role of LLMs in time series analysis and lays the groundwork for future studies
to systematically explore their applications and limitations in handling
temporal data.
[LINK]
http://arxiv.org/abs/2412.12219v1
[DATE]
2024-12-16 10:47:44+08:00
[CATEGORIES]
cs.LG
STDHL: Spatio-Temporal Dynamic Hypergraph Learning for Wind Power Forecasting
[AUTHORS]
Xiaochong Dong, Xuemin Zhang, Ming Yang, Shengwei Mei
[ABSTRACT]
Leveraging spatio-temporal correlations among wind farms can significantly
enhance the accuracy of ultra-short-term wind power forecasting. However, the
complex and dynamic nature of these correlations presents significant modeling
challenges. To address this, we propose a spatio-temporal dynamic hypergraph
learning (STDHL) model. This model uses a hypergraph structure to represent
spatial features among wind farms. Unlike traditional graph structures, which
only capture pair-wise node features, hypergraphs create hyperedges connecting
multiple nodes, enabling the representation and transmission of higher-order
spatial features. The STDHL model incorporates a novel dynamic hypergraph
convolutional layer to model dynamic spatial correlations and a grouped
temporal convolutional layer for channel-independent temporal modeling. The
model uses spatio-temporal encoders to extract features from multi-source
covariates, which are mapped to quantile results through a forecast decoder.
Experimental results using the GEFCom dataset show that the STDHL model
outperforms existing state-of-the-art methods. Furthermore, an in-depth
analysis highlights the critical role of spatio-temporal covariates in
improving ultra-short-term forecasting accuracy.
[LINK]
http://arxiv.org/abs/2412.11393v1
[DATE]
2024-12-16 10:43:29+08:00
[CATEGORIES]
cs.LG
Accurate, Robust and Privacy-Preserving Brain-Computer Interface Decoding
[AUTHORS]
Xiaoqing Chen, Tianwang Jia, Dongrui Wu
[ABSTRACT]
An electroencephalogram (EEG) based brain-computer interface (BCI) enables
direct communication between the brain and external devices. However, EEG-based
BCIs face at least three major challenges in real-world applications: data
scarcity and individual differences, adversarial vulnerability, and data
privacy. While previous studies have addressed one or two of these issues,
addressing all three simultaneously remains difficult and
unexplored. This paper fills this gap by proposing an Augmented Robustness
Ensemble (ARE) algorithm and integrating it into three privacy protection
scenarios (centralized source-free transfer, federated source-free transfer,
and source data perturbation), achieving simultaneously accurate decoding,
adversarial robustness, and privacy protection of EEG-based BCIs. Experiments
on three public EEG datasets demonstrated that our proposed approach
outperformed over 10 classic and state-of-the-art approaches in both accuracy
and robustness in all three privacy-preserving scenarios, even outperforming
state-of-the-art transfer learning approaches that do not consider privacy
protection at all. This is the first time that three major challenges in
EEG-based BCIs can be addressed simultaneously, significantly improving the
practicality of EEG decoding in real-world BCIs.
[LINK]
http://arxiv.org/abs/2412.11390v1
[DATE]
2024-12-16 10:37:38+08:00
[CATEGORIES]
cs.LG
Forward KL Regularized Preference Optimization for Aligning Diffusion Policies
[AUTHORS]
Zhao Shan, Chenyou Fan, Shuang Qiu, Jiyuan Shi, Chenjia Bai
[ABSTRACT]
Diffusion models have achieved remarkable success in sequential
decision-making by leveraging the highly expressive model capabilities in
policy learning. A central problem for learning diffusion policies is to align
the policy output with human intents in various tasks. To achieve this,
previous methods conduct return-conditioned policy generation or Reinforcement
Learning (RL)-based policy optimization, both of which rely on pre-defined
reward functions. In this work, we propose a novel framework, Forward KL
regularized Preference optimization for aligning Diffusion policies, to align
the diffusion policy with preferences directly. We first train a diffusion
policy from the offline dataset without considering the preference, and then
align the policy to the preference data via direct preference optimization.
During the alignment phase, we formulate direct preference learning in a
diffusion policy, where the forward KL regularization is employed in preference
optimization to avoid generating out-of-distribution actions. We conduct
extensive experiments for MetaWorld manipulation and D4RL tasks. The results
show our method exhibits superior alignment with preferences and outperforms
previous state-of-the-art algorithms.
[LINK]
http://arxiv.org/abs/2409.05622v2
[DATE]
2024-12-16 10:32:15+08:00
[CATEGORIES]
cs.LG
The Baldwin Effect in Advancing Generalizability of Physics-Informed Neural Networks
[AUTHORS]
Jian Cheng Wong, Chin Chun Ooi, Abhishek Gupta, Pao-Hsiung Chiu, Joshua Shao Zheng Low, My Ha Dao, Yew-Soon Ong
[ABSTRACT]
Physics-informed neural networks (PINNs) are at the forefront of scientific
machine learning, making possible the creation of machine intelligence that is
cognizant of physical laws and able to accurately simulate them. However,
today’s PINNs are often trained for a single physics task and require
computationally expensive re-training for each new task, even for tasks from
similar physics domains. To address this limitation, this paper proposes a
pioneering approach to advance the generalizability of PINNs through the
framework of Baldwinian evolution. Drawing inspiration from the
neurodevelopment of precocial species that have evolved to learn, predict and
react quickly to their environment, we envision PINNs that are pre-wired with
connection strengths inducing strong biases towards efficient learning of
physics. A novel two-stage stochastic programming formulation coupling
evolutionary selection pressure (based on proficiency over a distribution of
physics tasks) with lifetime learning (to specialize on a sampled subset of
those tasks) is proposed to instantiate the Baldwin effect. The evolved
Baldwinian-PINNs demonstrate fast and physics-compliant prediction capabilities
across a range of empirically challenging problem instances with more than an
order of magnitude improvement in prediction accuracy at a fraction of the
computation cost compared to state-of-the-art gradient-based meta-learning
methods. For example, when solving the diffusion-reaction equation, a 70x
improvement in accuracy was obtained while taking 700x less computational time.
This paper thus marks a leap forward in the meta-learning of PINNs as
generalizable physics solvers. Sample codes are available at
https://github.com/chiuph/Baldwinian-PINN.
[LINK]
http://arxiv.org/abs/2312.03243v2
[DATE]
2024-12-16 10:26:24+08:00
[CATEGORIES]
cs.LG
FinLoRA: Finetuning Quantized Financial Large Language Models Using Low-Rank Adaptation
[AUTHORS]
Dannong Wang, Daniel Kim, Bo Jin, Xingjian Zhao, Tianfan Fu, Steve Yang, Xiao-Yang Liu
[ABSTRACT]
Finetuned large language models (LLMs) have shown remarkable performance in
financial tasks, such as sentiment analysis and information retrieval. Due to
privacy concerns, finetuning and deploying Financial LLMs (FinLLMs) locally are
crucial for institutions. However, finetuning FinLLMs poses challenges
including GPU memory constraints and long input sequences. In this paper, we
employ quantized low-rank adaptation (QLoRA) to finetune FinLLMs, which
leverages low-rank matrix decomposition and quantization techniques to
significantly reduce computational requirements while maintaining high model
performance. We also employ data and pipeline parallelism to enable local
finetuning using cost-effective, widely accessible GPUs. Experiments on
financial datasets demonstrate that our method achieves substantial
improvements in accuracy, GPU memory usage, and time efficiency, underscoring
the potential of low-rank methods for scalable and resource-efficient LLM
finetuning.
[LINK]
http://arxiv.org/abs/2412.11378v1
[DATE]
2024-12-16 10:05:49+08:00
[CATEGORIES]
cs.LG
Accelerating Sparse Graph Neural Networks with Tensor Core Optimization
[AUTHORS]
Ka Wai Wu
[ABSTRACT]
Graph neural networks (GNNs) have seen extensive application in domains such
as social networks, bioinformatics, and recommendation systems. However, the
irregularity and sparsity of graph data challenge traditional computing
methods, which are insufficient to meet the performance demands of GNNs. Recent
research has explored parallel acceleration using CUDA Cores and Tensor Cores,
but significant challenges persist: (1) kernel fusion leads to false high
utilization, failing to treat CUDA and Tensor Cores as independent resources,
and (2) heterogeneous cores have distinct computation preferences, causing
inefficiencies. To address these issues, this paper proposes FTC-GNN, a novel
acceleration framework that efficiently utilizes CUDA and Tensor Cores for GNN
computation. FTC-GNN introduces (1) a collaborative design that enables the
parallel utilization of CUDA and Tensor Cores and (2) a sparse-to-dense
transformation strategy that assigns dense matrix operations to Tensor Cores
while leveraging CUDA Cores for data management and sparse edge processing.
This design optimizes GPU resource utilization and improves computational
efficiency. Experimental results demonstrate the effectiveness of FTC-GNN using
GCN and AGNN models across various datasets. For GCN, FTC-GNN achieves speedups
of 4.90x, 7.10x, and 1.17x compared to DGL, PyG, and TC-GNN, respectively. For
AGNN, it achieves speedups of 5.32x, 2.92x, and 1.02x, establishing its
superiority in accelerating GNN computations.
[LINK]
http://arxiv.org/abs/2412.12218v1
[DATE]
2024-12-16 09:57:53+08:00
[CATEGORIES]
cs.LG
Privacy-Preserving Low-Rank Adaptation against Membership Inference Attacks for Latent Diffusion Models
[AUTHORS]
Zihao Luo, Xilie Xu, Feng Liu, Yun Sing Koh, Di Wang, Jingfeng Zhang
[ABSTRACT]
Low-rank adaptation (LoRA) is an efficient strategy for adapting latent
diffusion models (LDMs) on a private dataset to generate specific images by
minimizing the adaptation loss. However, the LoRA-adapted LDMs are vulnerable
to membership inference (MI) attacks that can judge whether a particular data
point belongs to the private dataset, thus leading to privacy leakage. To
defend against MI attacks, we first propose a straightforward solution:
Membership-Privacy-preserving LoRA (MP-LoRA). MP-LoRA is formulated as a
min-max optimization problem where a proxy attack model is trained by
maximizing its MI gain while the LDM is adapted by minimizing the sum of the
adaptation loss and the MI gain of the proxy attack model. However, we
empirically find that MP-LoRA has the issue of unstable optimization, and
theoretically analyze that the potential reason is the unconstrained local
smoothness, which impedes the privacy-preserving adaptation. To mitigate this
issue, we further propose a Stable Membership-Privacy-preserving LoRA
(SMP-LoRA) that adapts the LDM by minimizing the ratio of the adaptation loss
to the MI gain. Besides, we theoretically prove that the local smoothness of
SMP-LoRA can be constrained by the gradient norm, leading to improved
convergence. Our experimental results corroborate that SMP-LoRA can indeed
defend against MI attacks and generate high-quality images. Our code is
available at https://github.com/WilliamLUO0/StablePrivateLoRA.
[COMMENTS]
AAAI 2025 Accept
[LINK]
http://arxiv.org/abs/2402.11989v3
[DATE]
2024-12-16 09:54:47+08:00
[CATEGORIES]
cs.LG
Comprehensive Survey on Adversarial Examples in Cybersecurity: Impacts, Challenges, and Mitigation Strategies
[AUTHORS]
Li Li
[ABSTRACT]
Deep learning (DL) has significantly transformed cybersecurity, enabling
advancements in malware detection, botnet identification, intrusion detection,
user authentication, and encrypted traffic analysis. However, the rise of
adversarial examples (AE) poses a critical challenge to the robustness and
reliability of DL-based systems. These subtle, crafted perturbations can
deceive models, leading to severe consequences like misclassification and
system vulnerabilities. This paper provides a comprehensive review of the
impact of AE attacks on key cybersecurity applications, highlighting both their
theoretical and practical implications. We systematically examine the methods
used to generate adversarial examples, their specific effects across various
domains, and the inherent trade-offs attackers face between efficacy and
resource efficiency. Additionally, we explore recent advancements in defense
mechanisms, including gradient masking, adversarial training, and detection
techniques, evaluating their potential to enhance model resilience. By
summarizing cutting-edge research, this study aims to bridge the gap between
adversarial research and practical security applications, offering insights to
fortify the adoption of DL solutions in cybersecurity.
[LINK]
http://arxiv.org/abs/2412.12217v1
[DATE]
2024-12-16 09:54:07+08:00
[CATEGORIES]
cs.LG
Individual Bus Trip Chain Prediction and Pattern Identification Considering Similarities
[AUTHORS]
Xiannan Huang, Yixin Chen, Quan Yuan, Chao Yang
[ABSTRACT]
Predicting future bus trip chains for an existing user is of great
significance for operators of public transit systems. Existing methods always
treat this task as a time-series prediction problem, but the 1-dimensional time
series structure cannot express the complex relationship between trips. To
better capture the inherent patterns in bus travel behavior, this paper
proposes a novel approach that synthesizes future bus trip chains based on
those from similar days. Key similarity patterns are defined and tested using
real-world data, and a similarity function is then developed to capture these
patterns. Afterwards, a graph is constructed where each day is represented as a
node and edge weight reflects the similarity between days. Besides, the trips
on a given day can be regarded as labels for each node, transferring the bus
trip chain prediction problem to a semi-supervised classification problem on a
graph. To address this, we propose several methods and validate them on a
real-world dataset of 10000 bus users, achieving state-of-the-art prediction
results. Analyzing the parameters of the similarity function reveals some
interesting bus usage patterns, allowing us to cluster bus users into three
types: repeat-dominated, evolve-dominated and repeat-evolve balanced. In
summary, our work demonstrates the effectiveness of similarity-based prediction
for bus trip chains and provides a new perspective for analyzing individual bus
travel patterns. The code for our prediction model is publicly available.
[LINK]
http://arxiv.org/abs/2412.11364v1
[DATE]
2024-12-16 09:32:26+08:00
[CATEGORIES]
cs.LG
The Stabilizer Bootstrap of Quantum Machine Learning with up to 10000 qubits
[AUTHORS]
Yuqing Li, Jinglei Cheng, Xulong Tang, Youtao Zhang, Frederic T. Chong, Junyu Liu
[ABSTRACT]
Quantum machine learning is considered one of the flagship applications of
quantum computers, where variational quantum circuits could be the leading
paradigm both in the near-term quantum devices and the early fault-tolerant
quantum computers. However, it is not clear how to identify the regime of
quantum advantages from these circuits, and there is no explicit theory to
guide the practical design of variational ansatze to achieve better
performance. We address these challenges with the stabilizer bootstrap, a
method that uses stabilizer-based techniques to optimize quantum neural
networks before their quantum execution, together with theoretical proofs and
high-performance computing with 10000 qubits or random datasets of up to 1000
data points. We find that, in a general setup of variational ansatze, the possibility
of improvements from the stabilizer bootstrap depends on the structure of the
observables and the size of the datasets. The results reveal that
configurations exhibit two distinct behaviors: some maintain a constant
probability of circuit improvement, while others show an exponential decay in
improvement probability as qubit numbers increase. These patterns are termed
strong stabilizer enhancement and weak stabilizer enhancement, respectively,
with most situations falling in between. Our work seamlessly bridges techniques
from fault-tolerant quantum computing with applications of variational quantum
algorithms. Not only does it offer practical insights for designing variational
circuits tailored to large-scale machine learning challenges, but it also maps
out a clear trajectory for defining the boundaries of feasible and practical
quantum advantages.
[COMMENTS]
15 pages, 14 figures
[LINK]
http://arxiv.org/abs/2412.11356v1
[DATE]
2024-12-16 09:12:00+08:00
[CATEGORIES]
cs.LG
Deep Random Features for Scalable Interpolation of Spatiotemporal Data
[AUTHORS]
Weibin Chen, Azhir Mahmood, Michel Tsamados, So Takao
[ABSTRACT]
The rapid growth of earth observation systems calls for a scalable approach
to interpolate remote-sensing observations. These methods should, in principle,
acquire more information about the observed field as data grows. Gaussian
processes (GPs) are candidate model choices for interpolation. However, due to
their poor scalability, they usually rely on inducing points for inference,
which restricts their expressivity. Moreover, commonly imposed assumptions such
as stationarity prevent them from capturing complex patterns in the data.
While deep GPs can overcome this issue, training and making inference with them
are difficult, again requiring crude approximations via inducing points. In
this work, we instead approach the problem through Bayesian deep learning,
where spatiotemporal fields are represented by deep neural networks, whose
layers share the inductive bias of stationary GPs on the plane/sphere via
random feature expansions. This allows one to (1) capture high frequency
patterns in the data, and (2) use mini-batched gradient descent for large scale
training. We experiment on various remote sensing data at local/global scales,
showing that our approach produces competitive or superior results to existing
methods, with well-calibrated uncertainties.
[LINK]
http://arxiv.org/abs/2412.11350v1
[DATE]
2024-12-16 08:32:04+08:00
[CATEGORIES]
cs.LG
TransPeakNet: Solvent-Aware 2D NMR Prediction via Multi-Task Pre-Training and Unsupervised Learning
[AUTHORS]
Yunrui Li, Hao Xu, Ambrish Kumar, Duosheng Wang, Christian Heiss, Parastoo Azadi, Pengyu Hong
[ABSTRACT]
Nuclear Magnetic Resonance (NMR) spectroscopy is essential for revealing
molecular structure, electronic environment, and dynamics. Accurate NMR shift
prediction allows researchers to validate structures by comparing predicted and
observed shifts. While Machine Learning (ML) has improved one-dimensional (1D)
NMR shift prediction, predicting 2D NMR remains challenging due to limited
annotated data. To address this, we introduce an unsupervised training
framework for predicting cross-peaks in 2D NMR, specifically Heteronuclear
Single Quantum Coherence (HSQC). Our approach pretrains an ML model on an
annotated 1D dataset of 1H and 13C shifts, then finetunes it in an unsupervised
manner using unlabeled HSQC data, which simultaneously generates cross-peak
annotations. Our model also adjusts for solvent effects. Evaluation on 479
expert-annotated HSQC spectra demonstrates our model’s superiority over
traditional methods (ChemDraw and Mestrenova), achieving Mean Absolute Errors
(MAEs) of 2.05 ppm and 0.165 ppm for 13C shifts and 1H shifts respectively. Our
algorithmic annotations show a 95.21% concordance with experts’ assignments,
underscoring the approach’s potential for structural elucidation in fields like
organic chemistry, pharmaceuticals, and natural products.
[LINK]
http://arxiv.org/abs/2403.11353v4
[DATE]
2024-12-16 08:31:21+08:00
[CATEGORIES]
cs.LG
GARLIC: GPT-Augmented Reinforcement Learning with Intelligent Control for Vehicle Dispatching
[AUTHORS]
Xiao Han, Zijian Zhang, Xiangyu Zhao, Yuanshao Zhu, Guojiang Shen, Xiangjie Kong, Xuetao Wei, Liqiang Nie, Jieping Ye
[ABSTRACT]
As urban residents demand higher travel quality, vehicle dispatch has become
a critical component of online ride-hailing services. However, current vehicle
dispatch systems struggle to navigate the complexities of urban traffic
dynamics, including unpredictable traffic conditions, diverse driver behaviors,
and fluctuating supply and demand patterns. These challenges have resulted in
travel difficulties for passengers in certain areas, while many drivers in
other areas are unable to secure orders, leading to a decline in the overall
quality of urban transportation services. To address these issues, this paper
introduces GARLIC: a framework of GPT-Augmented Reinforcement Learning with
Intelligent Control for vehicle dispatching. GARLIC utilizes multiview graphs
to capture hierarchical traffic states, and learns a dynamic reward function
that accounts for individual driving behaviors. The framework further
integrates a GPT model trained with a custom loss function to enable
high-precision predictions and optimize dispatching policies in real-world
scenarios. Experiments conducted on two real-world datasets demonstrate that
GARLIC effectively aligns with driver behaviors while reducing the empty load
rate of vehicles.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2408.10286v3
[DATE]
2024-12-16 08:11:13+08:00
[CATEGORIES]
cs.LG
Coupling-based Convergence Diagnostic and Stepsize Scheme for Stochastic Gradient Descent
[AUTHORS]
Xiang Li, Qiaomin Xie
[ABSTRACT]
The convergence behavior of Stochastic Gradient Descent (SGD) crucially
depends on the stepsize configuration. When using a constant stepsize, the SGD
iterates form a Markov chain, enjoying fast convergence during the initial
transient phase. However, when reaching stationarity, the iterates oscillate
around the optimum without making further progress. In this paper, we study the
convergence diagnostics for SGD with constant stepsize, aiming to develop an
effective dynamic stepsize scheme. We propose a novel coupling-based
convergence diagnostic procedure, which monitors the distance of two coupled
SGD iterates for stationarity detection. Our diagnostic statistic is simple and
is shown to track the transition from transience to stationarity theoretically. We
conduct extensive numerical experiments and compare our method against various
existing approaches. Our proposed coupling-based stepsize scheme is observed to
achieve superior performance across a diverse set of convex and non-convex
problems. Moreover, our results demonstrate the robustness of our approach to a
wide range of hyperparameters.
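The coupling idea can be illustrated with a minimal sketch, assuming a one-dimensional quadratic objective f(x) = x^2 / 2 with additive gradient noise. The function name `coupled_sgd_distance`, the tolerance rule, and all parameter values are hypothetical and are not taken from the paper. Two constant-stepsize SGD runs start from different points but share the same noise at every step, so the gap between them shrinks geometrically; once it falls below a tolerance, both runs can be treated as having forgotten their initialization.

```python
import random

def coupled_sgd_distance(x0, y0, stepsize, n_steps, tol=1e-3, seed=0):
    """Run two SGD chains on f(x) = x**2 / 2 that share the same gradient noise.

    Returns (distances, hit_step), where hit_step is the first step at which
    the distance between the chains drops below tol, or None if it never does.
    """
    rng = random.Random(seed)
    x, y = x0, y0
    distances = [abs(x - y)]
    hit_step = None
    for t in range(1, n_steps + 1):
        noise = rng.gauss(0.0, 1.0)      # shared noise couples the two chains
        x = x - stepsize * (x + noise)   # noisy gradient of x**2 / 2 at x
        y = y - stepsize * (y + noise)   # same noise, different iterate
        d = abs(x - y)
        distances.append(d)
        if hit_step is None and d < tol:
            hit_step = t
    return distances, hit_step

distances, hit_step = coupled_sgd_distance(x0=5.0, y0=-5.0, stepsize=0.1, n_steps=200)
```

In this toy setting the distance contracts by a factor of (1 - stepsize) per step regardless of the noise, which is why it can serve as a signal that the transient phase has ended; a stepsize scheme could then reduce the stepsize at `hit_step`.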
[COMMENTS]
13 pages, 30 figures, to be published in AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11341v1
[DATE]
2024-12-16 07:50:23+08:00
[CATEGORIES]
cs.LG
PyTorch Frame: A Modular Framework for Multi-Modal Tabular Learning
[AUTHORS]
Weihua Hu, Yiwen Yuan, Zecheng Zhang, Akihiro Nitta, Kaidi Cao, Vid Kocijan, Jinu Sunil, Jure Leskovec, Matthias Fey
[ABSTRACT]
We present PyTorch Frame, a PyTorch-based framework for deep learning over
multi-modal tabular data. PyTorch Frame makes tabular deep learning easy by
providing a PyTorch-based data structure to handle complex tabular data,
introducing a model abstraction to enable modular implementation of tabular
models, and allowing external foundation models to be incorporated to handle
complex columns (e.g., LLMs for text columns). We demonstrate the usefulness of
PyTorch Frame by implementing diverse tabular models in a modular way,
successfully applying these models to complex multi-modal tabular data, and
integrating our framework with PyTorch Geometric, a PyTorch library for Graph
Neural Networks (GNNs), to perform end-to-end learning over relational
databases.
[COMMENTS]
https://github.com/pyg-team/pytorch-frame
[LINK]
http://arxiv.org/abs/2404.00776v2
[DATE]
2024-12-16 06:49:53+08:00
[CATEGORIES]
cs.LG
The AI Black-Scholes: Finance-Informed Neural Network
[AUTHORS]
Amine M. Aboussalah, Xuanze Li, Cheng Chi, Raj Patel
[ABSTRACT]
In the realm of option pricing, existing models are typically classified into
principle-driven methods, such as solving partial differential equations (PDEs)
that the pricing function satisfies, and data-driven approaches, such as machine
learning (ML) techniques that parameterize the pricing function directly. While
principle-driven models offer a rigorous theoretical framework, they often rely
on unrealistic assumptions, such as asset processes adhering to fixed
stochastic differential equations (SDEs). Moreover, they can become
computationally intensive, particularly in high-dimensional settings when
analytical solutions are not available and thus numerical solutions are needed.
In contrast, data-driven models excel in capturing market data trends, but they
often lack alignment with core financial principles, raising concerns about
interpretability and predictive accuracy, especially when dealing with limited
or biased datasets. This work proposes a hybrid approach to address these
limitations by integrating the strengths of both principled and data-driven
methodologies. Our framework combines the theoretical rigor and
interpretability of PDE-based models with the adaptability of machine learning
techniques, yielding a more versatile methodology for pricing a broad spectrum
of options. We validate our approach across different volatility modeling
approaches, with both constant volatility (Black-Scholes) and stochastic
volatility (Heston), demonstrating that our proposed framework,
Finance-Informed Neural Network (FINN), not only enhances predictive accuracy
but also maintains adherence to core financial principles. FINN presents a
promising tool for practitioners, offering robust performance across a variety
of market conditions.
[LINK]
http://arxiv.org/abs/2412.12213v1
[DATE]
2024-12-16 06:40:40+08:00
[CATEGORIES]
cs.LG
Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes
[AUTHORS]
Juan Sebastian Rojas, Chi-Guhn Lee
[ABSTRACT]
Average-reward Markov decision processes (MDPs) provide a foundational
framework for sequential decision-making under uncertainty. However,
average-reward MDPs have remained largely unexplored in reinforcement learning
(RL) settings, with the majority of RL-based efforts having been allocated to
episodic and discounted MDPs. In this work, we study a unique structural
property of average-reward MDPs and utilize it to introduce Reward-Extended
Differential (or RED) reinforcement learning: a novel RL framework that can be
used to effectively and efficiently solve various subtasks simultaneously in
the average-reward setting. We introduce a family of RED learning algorithms
for prediction and control, including proven-convergent algorithms for the
tabular case. We then showcase the power of these algorithms by demonstrating
how they can be used to learn a policy that optimizes, for the first time, the
well-known conditional value-at-risk (CVaR) risk measure in a fully-online
manner, without the use of an explicit bi-level optimization scheme or an
augmented state-space.
[LINK]
http://arxiv.org/abs/2410.10578v7
[DATE]
2024-12-16 06:28:30+08:00
[CATEGORIES]
cs.LG
Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models
[AUTHORS]
Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Sarath Chandar
[ABSTRACT]
Despite their widespread adoption, large language models (LLMs) remain
prohibitive to use under resource constraints, with their ever-growing sizes
only increasing the barrier for use. One noted issue is the high latency
associated with auto-regressive generation, rendering the use of large LLMs dependent
on advanced computing infrastructure. Assisted decoding, where a smaller draft
model guides a larger target model’s generation, has helped alleviate this, but
remains dependent on alignment between the two models. Thus if the draft model
is insufficiently capable on some domain relative to the target model,
performance can degrade. Alternatively, one can leverage multiple draft models
to better cover the expertise of the target, but when multiple black-box draft
models are available, selecting an assistant without details about its
construction can be difficult. To better understand this decision-making
problem, we frame it as a contextual bandit, where a policy must choose a
draft model based on a context. We show that even without prior knowledge of
the draft models, creating an offline dataset from only outputs of independent
draft/target models and training a policy over the alignment of these outputs
can accelerate performance on multiple domains provided the candidates are
effective. Further results show this to hold on various settings with multiple
assisted decoding candidates, highlighting its flexibility and the advantageous
role that such decision making can play.
[COMMENTS]
Published as a long paper at the 2024 Conference on Empirical Methods
in Natural Language Processing (EMNLP). Official version of paper within
conference proceedings is available at
http://aclanthology.org/2024.emnlp-main.332
[LINK]
http://arxiv.org/abs/2408.08470v4
[DATE]
2024-12-16 06:27:58+08:00
[CATEGORIES]
cs.LG
Problem-dependent convergence bounds for randomized linear gradient compression
[AUTHORS]
Thomas Flynn, Patrick Johnstone, Shinjae Yoo
[ABSTRACT]
In distributed optimization, the communication of model updates can be a
performance bottleneck. Consequently, gradient compression has been proposed as
a means of increasing optimization throughput. In general, due to information
loss, compression introduces a penalty on the number of iterations needed to
reach a solution. In this work, we investigate how the iteration penalty
depends on the interaction between compression and problem structure, in the
context of non-convex stochastic optimization. We focus on linear compression
schemes, where compression and decompression can be modeled as multiplication
with a random matrix. We consider several distributions of matrices, among them
random orthogonal matrices and matrices with random Gaussian entries. We find
that in each case, the impact of compression on convergence can be quantified
in terms of the norm of the Hessian of the objective, using a norm defined by
the compression scheme. The analysis reveals that in certain cases, compression
performance is related to low-rank structure or other spectral properties of
the problem. In these cases, our bounds predict that the penalty introduced by
compression is significantly reduced compared to worst-case bounds that only
consider the compression level, ignoring problem data. We verify the
theoretical findings on several optimization problems, including fine-tuning an
image classification model.
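A linear compression scheme of the kind described can be sketched as follows. This is only an illustration under stated assumptions: the matrix has i.i.d. Gaussian entries with variance 1/k, decompression multiplies by the transpose, and the helper name `compress_decompress` and the example gradient are hypothetical rather than taken from the paper. With this scaling, E[A^T A] = I, so the round trip is unbiased, and averaging over many draws recovers the original gradient.

```python
import random

def compress_decompress(grad, k, rng):
    """Compress grad (length d) to k numbers with a random Gaussian matrix A,
    then decompress with A transposed.

    Entries of A are drawn from N(0, 1/k), so E[A^T A] = I and the
    reconstruction A^T A grad is an unbiased estimate of grad.
    """
    d = len(grad)
    scale = (1.0 / k) ** 0.5  # standard deviation of each entry
    A = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(k)]
    compressed = [sum(A[i][j] * grad[j] for j in range(d)) for i in range(k)]
    restored = [sum(A[i][j] * compressed[i] for i in range(k)) for j in range(d)]
    return compressed, restored

rng = random.Random(0)
grad = [1.0, -2.0, 0.5, 3.0]
n_trials = 20000
mean_restored = [0.0] * len(grad)
for _ in range(n_trials):
    compressed, restored = compress_decompress(grad, k=2, rng=rng)
    # Running average of the reconstructions over independent draws of A.
    mean_restored = [m + r / n_trials for m, r in zip(mean_restored, restored)]
```

A single reconstruction is noisy (here only 2 numbers are sent for a 4-dimensional gradient); the noise of that reconstruction is what introduces the iteration penalty the abstract refers to.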
[COMMENTS]
15 pages, 3 figures
[LINK]
http://arxiv.org/abs/2411.12898v2
[DATE]
2024-12-16 06:18:59+08:00
[CATEGORIES]
cs.LG
datadriftR: An R Package for Concept Drift Detection in Predictive Models
[AUTHORS]
Ugur Dar, Mustafa Cavus
[ABSTRACT]
Predictive models often face performance degradation due to evolving data
distributions, a phenomenon known as data drift. Among its forms, concept
drift, where the relationship between explanatory variables and the response
variable changes, is particularly challenging to detect and adapt to.
Traditional drift detection methods often rely on metrics such as accuracy or
variable distributions, which may fail to capture subtle but significant
conceptual changes. This paper introduces drifter, an R package designed to
detect concept drift, and proposes a novel method called Profile Drift
Detection (PDD) that enables both drift detection and an enhanced understanding
of the cause behind the drift by leveraging an explainable AI tool - Partial
Dependence Profiles (PDPs). The PDD method, central to the package, quantifies
changes in PDPs through novel metrics, ensuring sensitivity to shifts in the
data stream without excessive computational costs. This approach aligns with
MLOps practices, emphasizing model monitoring and adaptive retraining in
dynamic environments. The experiments across synthetic and real-world datasets
demonstrate that PDD outperforms existing methods by maintaining high accuracy
while effectively balancing sensitivity and stability. The results highlight
its capability to adaptively retrain models in dynamic environments, making it
a robust tool for real-time applications. The paper concludes by discussing the
advantages, limitations, and future extensions of the package for broader use
cases.
[COMMENTS]
37 pages, 6 figures
[LINK]
http://arxiv.org/abs/2412.11308v1
[DATE]
2024-12-16 04:59:49+08:00
[CATEGORIES]
cs.LG
Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision?
[AUTHORS]
Zihao Li, Lecheng Zheng, Bowen Jin, Dongqi Fu, Baoyu Jing, Yikun Ban, Jingrui He, Jiawei Han
[ABSTRACT]
While great success has been achieved in building vision models with
Contrastive Language-Image Pre-training (CLIP) over Internet-scale image-text
pairs, building transferable Graph Neural Networks (GNNs) with the CLIP pipeline is
challenging because of three fundamental issues: the scarcity of labeled data
and text supervision, different levels of downstream tasks, and the conceptual
gaps between domains. In this work, to address these issues, we leverage
multi-modal prompt learning to effectively adapt a pre-trained GNN to downstream
tasks and data, given only a few semantically labeled samples, each with
extremely weak text supervision. Our new paradigm embeds the graphs directly in
the same space as the Large Language Models (LLMs) by learning both graph
prompts and text prompts simultaneously. To accomplish this, we improve the
state-of-the-art graph prompt method, and then propose the first graph-language
multi-modal prompt learning approach for exploiting the knowledge in
pre-trained models. Notably, due to the insufficient supervision for
fine-tuning, in our paradigm, the pre-trained GNN and the LLM are kept frozen,
so the learnable parameters are much fewer than fine-tuning any pre-trained
model. Through extensive experiments on real-world datasets, we demonstrate the
superior performance of our paradigm in few-shot, multi-task-level, and
cross-domain settings. Moreover, we build the first CLIP-style zero-shot
classification prototype that can generalize GNNs to unseen classes with
extremely weak text supervision.
[COMMENTS]
Preprint, 25 pages
[LINK]
http://arxiv.org/abs/2412.08174v2
[DATE]
2024-12-16 04:54:47+08:00
[CATEGORIES]
cs.LG
Regularized Dikin Walks for Sampling Truncated Logconcave Measures, Mixed Isoperimetry and Beyond Worst-Case Analysis
[AUTHORS]
Minhui Jiang, Yuansi Chen
[ABSTRACT]
We study the problem of drawing samples from a logconcave distribution
truncated on a polytope, motivated by computational challenges in Bayesian
statistical models with indicator variables, such as probit regression.
Building on interior point methods and the Dikin walk for sampling from uniform
distributions, we analyze the mixing time of regularized Dikin walks. Our
contributions are threefold. First, for a logconcave and log-smooth
distribution with condition number $\kappa$, truncated on a polytope in
$\mathbb{R}^n$ defined with $m$ linear constraints, we prove that the
soft-threshold Dikin walk mixes in $\widetilde{O}((m+\kappa)n)$ iterations from
a warm initialization. It improves upon prior work which required the polytope
to be bounded and involved a bound dependent on the radius of the bounded
region. Moreover, we introduce the regularized Dikin walk using Lewis weights
for approximating the John ellipsoid. We show that it mixes in
$\widetilde{O}(n^{2.5}+\kappa n)$. Second, we extend the mixing time
guarantees mentioned above to weakly log-concave distributions truncated on
polytopes, provided that they have a finite covariance matrix. Third, going
beyond worst-case mixing time analysis, we demonstrate that soft-threshold
Dikin walk can mix significantly faster when only a limited number of
constraints intersect the high-probability mass of the distribution, improving
the $\widetilde{O}((m+\kappa)n)$ upper bound to $\widetilde{O}(m + \kappa n)$.
Additionally, per-iteration complexity of regularized Dikin walk and ways to
generate a warm initialization are discussed to facilitate practical
implementation.
[COMMENTS]
63 pages, 2 figures
[LINK]
http://arxiv.org/abs/2412.11303v1
[DATE]
2024-12-16 04:43:51+08:00
[CATEGORIES]
cs.LG
Neural networks that overcome classic challenges through practice
[AUTHORS]
Kazuki Irie, Brenden M. Lake
[LINK]
http://arxiv.org/abs/2410.10596v2
[DATE]
2024-12-16 04:26:27+08:00
[CATEGORIES]
cs.LG
Semi-Implicit Neural Ordinary Differential Equations
[AUTHORS]
Hong Zhang, Ying Liu, Romit Maulik
[ABSTRACT]
Classical neural ODEs trained with explicit methods are intrinsically limited
by stability, crippling their efficiency and robustness for stiff learning
problems that are common in graph learning and scientific machine learning. We
present a semi-implicit neural ODE approach that exploits the partitionable
structure of the underlying dynamics. Our technique leads to an implicit neural
network with significant computational advantages over existing approaches
because of enhanced stability and efficient linear solves during time
integration. We show that our approach outperforms existing approaches on a
variety of applications including graph classification and learning complex
dynamical systems. We also demonstrate that our approach can train challenging
neural ODEs where both explicit methods and fully implicit methods are
intractable.
[LINK]
http://arxiv.org/abs/2412.11301v1
[DATE]
2024-12-16 04:21:02+08:00
[CATEGORIES]
cs.LG
How not to Stitch Representations to Measure Similarity: Task Loss Matching versus Direct Matching
[AUTHORS]
András Balogh, Márk Jelasity
[ABSTRACT]
Measuring the similarity of the internal representations of deep neural
networks is an important and challenging problem. Model stitching has been
proposed as a possible approach, where two half-networks are connected by
mapping the output of the first half-network to the input of the second one.
The representations are considered functionally similar if the resulting
stitched network achieves good task-specific performance. The mapping is
normally created by training an affine stitching layer on the task at hand
while freezing the two half-networks, a method called task loss matching. Here,
we argue that task loss matching may be very misleading as a similarity index.
For example, it can indicate very high similarity between very distant layers,
whose representations are known to have different functional properties.
Moreover, it can indicate very distant layers to be more similar than
architecturally corresponding layers. Even more surprisingly, when comparing
layers within the same network, task loss matching often indicates that some
layers are more similar to a layer than itself. We argue that the main reason
behind these problems is that task loss matching tends to create
out-of-distribution representations to improve task-specific performance. We
demonstrate that direct matching (when the mapping minimizes the distance
between the stitched representations) does not suffer from these problems. We
compare task loss matching, direct matching, and well-known similarity indices
such as CCA and CKA. We conclude that direct matching strikes a good balance
between the structural and functional requirements for a good similarity index.
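Direct matching, as described above, chooses the stitching map so that it minimises the distance between representations. A minimal sketch follows, assuming the map is affine and fitted by ordinary least squares on activation matrices. The function name `fit_direct_matching` and the synthetic data are hypothetical, and the authors' own implementation may differ.

```python
import numpy as np

def fit_direct_matching(source, target):
    """Fit an affine map (W, b) minimising ||source @ W + b - target||^2.

    source: (n_samples, d_source) activations of the first half-network.
    target: (n_samples, d_target) activations expected by the second half.
    """
    ones = np.ones((source.shape[0], 1))
    design = np.hstack([source, ones])          # append a bias column
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution[:-1], solution[-1]          # W, b

rng = np.random.default_rng(0)
source = rng.normal(size=(200, 5))
true_W = rng.normal(size=(5, 3))
true_b = np.array([0.5, -1.0, 2.0])
target = source @ true_W + true_b               # synthetic, exactly affine data

W, b = fit_direct_matching(source, target)
residual = np.max(np.abs(source @ W + b - target))
```

Unlike task loss matching, nothing here depends on the downstream labels: the fit only sees the two sets of activations.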
[COMMENTS]
Accepted at AAAI 2025. For the implementation, see
https://github.com/szegedai/stitching-ood
[LINK]
http://arxiv.org/abs/2412.11299v1
[DATE]
2024-12-16 04:18:49+08:00
[CATEGORIES]
cs.LG
A Comparative Study on Dynamic Graph Embedding based on Mamba and Transformers
[AUTHORS]
Ashish Parmanand Pandey, Alan John Varghese, Sarang Patil, Mengjia Xu
[ABSTRACT]
Dynamic graph embedding has emerged as an important technique for modeling
complex time-evolving networks across diverse domains. While transformer-based
models have shown promise in capturing long-range dependencies in temporal
graph data, they face scalability challenges due to quadratic computational
complexity. This study presents a comparative analysis of dynamic graph
embedding approaches using transformers and the recently proposed Mamba
architecture, a state-space model with linear complexity. We introduce three
novel models: TransformerG2G augmented with graph convolutional networks,
DG-Mamba, and GDG-Mamba with graph isomorphism network edge convolutions. Our
experiments on multiple benchmark datasets demonstrate that Mamba-based models
achieve comparable or superior performance to transformer-based approaches in
link prediction tasks while offering significant computational efficiency gains
on longer sequences. Notably, DG-Mamba variants consistently outperform
transformer-based models on datasets with high temporal variability, such as
UCI, Bitcoin, and Reality Mining, while maintaining competitive performance on
more stable graphs like SBM. We provide insights into the learned temporal
dependencies through analysis of attention weights and state matrices,
revealing the models’ ability to capture complex temporal patterns. By
effectively combining state-space models with graph neural networks, our work
addresses key limitations of previous approaches and contributes to the growing
body of research on efficient temporal graph representation learning. These
findings offer promising directions for scaling dynamic graph embedding to
larger, more complex real-world networks, potentially enabling new applications
in areas such as social network analysis, financial modeling, and biological
system dynamics.
[COMMENTS]
18 pages, 6 figures
[LINK]
http://arxiv.org/abs/2412.11293v1
[DATE]
2024-12-16 03:56:56+08:00
[CATEGORIES]
cs.LG
Grassmannian Geometry Meets Dynamic Mode Decomposition in DMD-GEN: A New Metric for Mode Collapse in Time Series Generative Models
[AUTHORS]
Amime Mohamed Aboussalah, Yassine Abbahaddou
[ABSTRACT]
Generative models like Generative Adversarial Networks (GANs) and Variational
Autoencoders (VAEs) often fail to capture the full diversity of their training
data, leading to mode collapse. While this issue is well-explored in image
generation, it remains underinvestigated for time series data. We introduce a
new definition of mode collapse specific to time series and propose a novel
metric, DMD-GEN, to quantify its severity. Our metric utilizes Dynamic Mode
Decomposition (DMD), a data-driven technique for identifying coherent
spatiotemporal patterns, and employs Optimal Transport between DMD eigenvectors
to assess discrepancies between the underlying dynamics of the original and
generated data. This approach not only quantifies the preservation of essential
dynamic characteristics but also provides interpretability by pinpointing which
modes have collapsed. We validate DMD-GEN on both synthetic and real-world
datasets using various generative models, including TimeGAN, TimeVAE, and
DiffusionTS. The results demonstrate that DMD-GEN correlates well with
traditional evaluation metrics for static data while offering the advantage of
applicability to dynamic data. This work offers for the first time a definition
of mode collapse for time series, improving understanding, and forming the
basis of our tool for assessing and improving generative models in the time
series domain.
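For readers unfamiliar with DMD, the following is a minimal sketch of the standard (exact) DMD algorithm applied to a snapshot matrix. It only illustrates how DMD eigenvalues and modes are obtained; it does not implement the DMD-GEN metric or its Optimal Transport step, and the function name `dmd` and the toy linear system are assumptions made for the example.

```python
import numpy as np

def dmd(snapshots, rank):
    """Exact DMD: eigenvalues and modes of the best-fit linear map x_{t+1} ~ A x_t.

    snapshots: (n_features, n_times) array, one column per time step.
    """
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    # Reduced operator: A projected onto the leading left singular vectors.
    A_tilde = U.conj().T @ Y @ Vh.conj().T / s
    eigvals, W = np.linalg.eig(A_tilde)
    modes = (Y @ Vh.conj().T / s) @ W
    return eigvals, modes

# Toy linear system whose true eigenvalues are 0.9 and 0.5.
A = np.array([[0.9, 0.2], [0.0, 0.5]])
x = np.array([1.0, 1.0])
columns = [x]
for _ in range(10):
    x = A @ x
    columns.append(x)
snapshots = np.stack(columns, axis=1)

eigvals, modes = dmd(snapshots, rank=2)
recovered = np.sort(eigvals.real)
```

On real or generated time series the rank would be chosen below the number of features, and the resulting modes are the objects a metric such as the one described above would compare.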
[LINK]
http://arxiv.org/abs/2412.11292v1
[DATE]
2024-12-16 03:53:17+08:00
[CATEGORIES]
cs.LG
Diffusion models as probabilistic neural operators for recovering unobserved states of dynamical systems
[AUTHORS]
Katsiaryna Haitsiukevich, Onur Poyraz, Pekka Marttinen, Alexander Ilin
[ABSTRACT]
This paper explores the efficacy of diffusion-based generative models as
neural operators for partial differential equations (PDEs). Neural operators
are neural networks that learn a mapping from the parameter space to the
solution space of PDEs from data, and they can also solve the inverse problem
of estimating the parameter from the solution. Diffusion models excel in many
domains, but their potential as neural operators has not been thoroughly
explored. In this work, we show that diffusion-based generative models exhibit
many properties favourable for neural operators, and they can effectively
generate the solution of a PDE conditionally on the parameter or recover the
unobserved parts of the system. We propose to train a single model adaptable to
multiple tasks, by alternating between the tasks during training. In our
experiments with multiple realistic dynamical systems, diffusion models
outperform other neural operators. Furthermore, we demonstrate how the
probabilistic diffusion model can elegantly deal with systems which are only
partially identifiable, by producing samples corresponding to the different
possible solutions.
[LINK]
http://arxiv.org/abs/2405.07097v2
[DATE]
2024-12-16 03:04:32+08:00
[CATEGORIES]
cs.LG
Bayesian inference of mean velocity fields and turbulence models from flow MRI
[AUTHORS]
A. Kontogiannis, P. Nair, M. Loecher, D. B. Ennis, A. Marsden, M. P. Juniper
[ABSTRACT]
We solve a Bayesian inverse Reynolds-averaged Navier-Stokes (RANS) problem
that assimilates mean flow data by jointly reconstructing the mean flow field
and learning its unknown RANS parameters. We devise an algorithm that learns
the most likely parameters of an algebraic effective viscosity model, and
estimates their uncertainties, from mean flow data of a turbulent flow. We
conduct a flow MRI experiment to obtain mean flow data of a confined turbulent
jet in an idealized medical device known as the FDA (Food and Drug
Administration) nozzle. The algorithm successfully reconstructs the mean flow
field and learns the most likely turbulence model parameters without
overfitting. The methodology accepts any turbulence model, be it algebraic
(explicit) or multi-equation (implicit), as long as the model is
differentiable, and naturally extends to unsteady turbulent flows.
[LINK]
http://arxiv.org/abs/2412.11266v1
[DATE]
2024-12-16 02:07:36+08:00
[CATEGORIES]
cs.LG
Prediction-Enhanced Monte Carlo: A Machine Learning View on Control Variate
[AUTHORS]
Fengpei Li, Haoxian Chen, Jiahe Lin, Arkin Gupta, Xiaowei Tan, Gang Xu, Yuriy Nevmyvaka, Agostino Capponi, Henry Lam
[ABSTRACT]
Despite being an essential tool across engineering and finance, Monte Carlo
simulation can be computationally intensive, especially in large-scale,
path-dependent problems that hinder straightforward parallelization. A natural
alternative is to replace simulation with machine learning or surrogate
prediction, though this introduces challenges in understanding the resulting
errors. We introduce a Prediction-Enhanced Monte Carlo (PEMC) framework where we
leverage machine learning prediction as control variates, thus maintaining
unbiased evaluations instead of the direct use of ML predictors. Traditional
control variate methods require knowledge of means and focus on per-sample
variance reduction. In contrast, PEMC aims at overall cost-aware variance
reduction, eliminating the need for mean knowledge. PEMC leverages pre-trained
neural architectures to construct effective control variates and replaces
computationally expensive sample-path generation with efficient neural network
evaluations. This allows PEMC to address scenarios where no good control
variates are known. We showcase the efficacy of PEMC through two
production-grade exotic option-pricing problems: swaption pricing in the HJM model
and variance swap pricing in a stochastic local volatility model.
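One way to read the idea of using a predictor as a control variate without knowing its mean is sketched below: a small number of paired samples estimates the gap between the expensive quantity and the prediction, and a larger number of cheap samples estimates the predictor's mean. This is an illustrative reading of the abstract, not the authors' PEMC algorithm; `expensive_payoff`, `cheap_predictor`, and the sample sizes are hypothetical stand-ins.

```python
import random
import statistics

def expensive_payoff(z):
    """Stand-in for a costly simulation output (hypothetical)."""
    return z * z + 0.1 * z

def cheap_predictor(z):
    """Stand-in for an ML prediction of the payoff (hypothetical, imperfect)."""
    return z * z

def control_variate_estimate(n_paired, n_cheap, rng):
    # Paired samples: expensive payoff and predictor evaluated on the same input.
    paired = [rng.gauss(0.0, 1.0) for _ in range(n_paired)]
    correction = statistics.fmean(
        expensive_payoff(z) - cheap_predictor(z) for z in paired
    )
    # Independent cheap samples estimate the predictor's mean,
    # so that mean does not need to be known in closed form.
    cheap = [rng.gauss(0.0, 1.0) for _ in range(n_cheap)]
    predictor_mean = statistics.fmean(cheap_predictor(z) for z in cheap)
    return correction + predictor_mean

rng = random.Random(0)
estimate = control_variate_estimate(n_paired=2000, n_cheap=200000, rng=rng)
```

In this toy setup the true expectation is 1, and the estimator stays unbiased however poor the predictor is; a better predictor only shrinks the variance of the correction term.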
[LINK]
http://arxiv.org/abs/2412.11257v1
[DATE]
2024-12-16 01:41:38+08:00
[CATEGORIES]
cs.LG
Are Expressive Models Truly Necessary for Offline RL?
[AUTHORS]
Guan Wang, Haoyi Niu, Jianxiong Li, Li Jiang, Jianming Hu, Xianyuan Zhan
[ABSTRACT]
Among various branches of offline reinforcement learning (RL) methods,
goal-conditioned supervised learning (GCSL) has gained increasing popularity as
it formulates the offline RL problem as a sequential modeling task, therefore
bypassing the notoriously difficult credit assignment challenge of value
learning in the conventional RL paradigm. Sequential modeling, however, requires
capturing accurate dynamics across long horizons in trajectory data to ensure
reasonable policy performance. To meet this requirement, leveraging large,
expressive models has become a popular choice in recent literature, which,
however, comes at the cost of significantly increased computation and inference
latency. Contradictory yet promising, we reveal that lightweight models as
simple as shallow 2-layer MLPs can also enjoy accurate dynamics consistency
and significantly reduced sequential modeling errors compared with large
expressive models by adopting a simple recursive planning scheme: recursively
planning coarse-grained future sub-goals based on current and target
information, and then executing the action with a goal-conditioned policy learned
from data relabeled with these sub-goal ground truths. We term our method Recursive
Skip-Step Planning (RSP). Simple yet effective, RSP enjoys great efficiency
improvements thanks to its lightweight structure, and substantially outperforms
existing methods, reaching new SOTA performances on the D4RL benchmark,
especially in multi-stage long-horizon tasks.
[COMMENTS]
Instead of relying on expressive models, shallow MLPs can also excel
in long sequential decision-making tasks with Recursive Skip-Step Planning
(RSP)
[LINK]
http://arxiv.org/abs/2412.11253v1
[DATE]
2024-12-16 01:33:56+08:00
[CATEGORIES]
cs.LG
Experimental verification of the quantum nature of a neural network
[AUTHORS]
Andrei T. Patrascu
[ABSTRACT]
Neural networks are being used to improve the probing of the state spaces of
many particle systems as approximations to wavefunctions and in order to avoid
the recurring sign problem of quantum Monte Carlo. One may ask whether the
usual classical neural networks have some actual hidden quantum properties that
make them such suitable tools for a highly coupled quantum problem. I discuss
here what makes a system quantum and to what extent we can interpret a neural
network as having quantum remnants. I suggest that a system can be quantum both
due to its fundamental quantum constituents and due to the rules of its
functioning, therefore, we can obtain entanglement both due to the quantum
constituents’ nature and due to the functioning rules, or, in category theory
terms, both due to the quantum nature of the objects of a category and of the
maps. From a practical point of view, I suggest a possible experiment that
could extract entanglement from the quantum functioning rules (maps) of an
otherwise classical (from the point of view of the constituents) neural
network.
[COMMENTS]
arXiv admin note: text overlap with arXiv:2206.00005
[LINK]
http://arxiv.org/abs/2209.07577v5
[DATE]
2024-12-16 01:25:19+08:00
[CATEGORIES]
cs.LG
Wasserstein Bounds for generative diffusion models with Gaussian tail targets
[AUTHORS]
Xixian Wang, Zhongjian Wang
[ABSTRACT]
We present an estimate of the Wasserstein distance between the data
distribution and the generation of score-based generative models, assuming an
$\epsilon$-accurate approximation of the score and a Gaussian-type tail
behavior of the data distribution. The complexity bound in dimension is
$O(\sqrt{d})$, with a logarithmic constant. Such a Gaussian tail assumption
applies to the distribution of a compactly supported target with the early
stopping technique and the Bayesian posterior with a bounded observation operator.
Corresponding convergence and complexity bounds are derived.
The crux of the analysis lies in the Lipschitz bound of the score, which is
related to the Hessian estimate of a viscous Hamilton-Jacobi equation (vHJ).
The latter is demonstrated by employing a dimension-independent kernel
estimate. Consequently, our complexity bound scales linearly (up to a
logarithmic constant) with the square root of the trace of the covariance
operator, which relates to the invariant distribution of the forward process. Our
analysis also extends to the probabilistic flow ODE, as the sampling process.
[LINK]
http://arxiv.org/abs/2412.11251v1
[DATE]
2024-12-16 01:20:42+08:00
[CATEGORIES]
cs.LG
Physics-Inspired Interpretability Of Machine Learning Models
[AUTHORS]
Maximilian P Niroomand, David J Wales
[ABSTRACT]
The ability to explain decisions made by machine learning models remains one
of the most significant hurdles towards widespread adoption of AI in highly
sensitive areas such as medicine, cybersecurity or autonomous driving. Great
interest exists in understanding which features of the input data prompt model
decision making. In this contribution, we propose a novel approach to identify
relevant features of the input data, inspired by methods from the energy
landscapes field, developed in the physical sciences. By identifying conserved
weights within groups of minima of the loss landscapes, we can identify the
drivers of model decision making. Analogues to this idea exist in the molecular
sciences, where coordinate invariants or order parameters are employed to
identify critical features of a molecule. However, no such approach exists for
machine learning loss landscapes. We will demonstrate the applicability of
energy landscape methods to machine learning models and give examples, both
synthetic and from the real world, for how these methods can help to make
models more interpretable.
[COMMENTS]
6 pages, 2 figures, ICLR 2023 Workshop on Physics for Machine
Learning
[LINK]
http://arxiv.org/abs/2304.02381v2
[DATE]
2024-12-16 01:17:01+08:00
[CATEGORIES]
cs.LG
On Characterizing and Mitigating Imbalances in Multi-Instance Partial Label Learning
[AUTHORS]
Kaifu Wang, Efthymia Tsamoura, Dan Roth
[ABSTRACT]
Multi-Instance Partial Label Learning (MI-PLL) is a weakly-supervised
learning setting encompassing partial label learning, latent structural
learning, and neurosymbolic learning. Unlike supervised learning, in MI-PLL,
the inputs to the classifiers at training-time are tuples of instances
$\mathbf{x}$. At the same time, the supervision signal is generated by a
function $\sigma$ over the (hidden) gold labels of $\mathbf{x}$. In this work,
we make multiple contributions towards addressing a problem that hasn’t been
studied so far in the context of MI-PLL: that of characterizing and mitigating
learning imbalances, i.e., major differences in the errors occurring when
classifying instances of different classes (aka class-specific risks). In
terms of theory, we derive class-specific risk bounds for MI-PLL, while making
minimal assumptions. Our theory reveals a unique phenomenon: that $\sigma$ can
greatly impact learning imbalances. This result is in sharp contrast with
previous research on supervised and weakly-supervised learning, which only
studies learning imbalances under the prism of data imbalances. On the
practical side, we introduce a technique for estimating the marginal of the
hidden labels using only MI-PLL data. Then, we introduce algorithms that
mitigate imbalances at training- and testing-time, by treating the marginal of
the hidden labels as a constraint. We demonstrate the effectiveness of our
techniques using strong baselines from neurosymbolic and long-tail learning,
suggesting performance improvements of up to 14%.
[LINK]
http://arxiv.org/abs/2407.10000v3
[DATE]
2024-12-16 01:02:00+08:00
[CATEGORIES]
cs.LG
Transformer-Based Bearing Fault Detection using Temporal Decomposition Attention Mechanism
[AUTHORS]
Marzieh Mirzaeibonehkhater, Mohammad Ali Labbaf-Khaniki, Mohammad Manthouri
[ABSTRACT]
Bearing fault detection is a critical task in predictive maintenance, where
accurate and timely fault identification can prevent costly downtime and
equipment damage. Traditional attention mechanisms in Transformer neural
networks often struggle to capture the complex temporal patterns in bearing
vibration data, leading to suboptimal performance. To address this limitation,
we propose a novel attention mechanism, Temporal Decomposition Attention (TDA),
which combines temporal bias encoding with seasonal-trend decomposition to
capture both long-term dependencies and periodic fluctuations in time series
data. Additionally, we incorporate the Hull Exponential Moving Average (HEMA)
for feature extraction, enabling the model to effectively capture meaningful
characteristics from the data while reducing noise. Our approach integrates TDA
into the Transformer architecture, allowing the model to focus separately on
the trend and seasonal components of the data. Experimental results on the Case
Western Reserve University (CWRU) bearing fault detection dataset demonstrate
that our approach outperforms traditional attention mechanisms and achieves
state-of-the-art performance in terms of accuracy and interpretability. The
HEMA-Transformer-TDA model achieves an accuracy of 98.1%, with exceptional
precision, recall, and F1-scores, demonstrating its effectiveness in bearing
fault detection and its potential for application in other time series tasks
with seasonal patterns or trends.
[LINK]
http://arxiv.org/abs/2412.11245v1
[DATE]
2024-12-16 00:51:31+08:00
[CATEGORIES]
cs.LG
GENIE: Watermarking Graph Neural Networks for Link Prediction
[AUTHORS]
Venkata Sai Pranav Bachina, Ankit Gangwal, Aaryan Ajay Sharma, Charu Sharma
[ABSTRACT]
Graph Neural Networks (GNNs) have become invaluable intellectual property in
graph-based machine learning. However, their vulnerability to model stealing
attacks when deployed within Machine Learning as a Service (MLaaS) necessitates
robust Ownership Demonstration (OD) techniques. Watermarking is a promising OD
framework for Deep Neural Networks, but existing methods fail to generalize to
GNNs due to the non-Euclidean nature of graph data. Previous works on GNN
watermarking have primarily focused on node and graph classification,
overlooking Link Prediction (LP).
In this paper, we propose GENIE (watermarking Graph nEural Networks for lInk
prEdiction), the first-ever scheme to watermark GNNs for LP. GENIE creates a
novel backdoor for both node-representation and subgraph-based LP methods,
utilizing a unique trigger set and a secret watermark vector. Our OD scheme is
equipped with Dynamic Watermark Thresholding (DWT), ensuring high verification
probability (>99.99%) while addressing practical issues in existing
watermarking schemes. We extensively evaluate GENIE across 4 model
architectures (i.e., SEAL, GCN, GraphSAGE and NeoGNN) and 7 real-world
datasets. Furthermore, we validate the robustness of GENIE against 11
state-of-the-art watermark removal techniques and 3 model extraction attacks.
We also show GENIE’s resilience against ownership piracy attacks. Finally, we
discuss a defense strategy to counter adaptive attacks against GENIE.
[LINK]
http://arxiv.org/abs/2406.04805v2
[DATE]
2024-12-16 00:46:38+08:00
[CATEGORIES]
cs.LG
XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners
[AUTHORS]
Yun Luo, Zhen Yang, Fandong Meng, Yingjie Li, Fang Guo, Qinglin Qi, Jie Zhou, Yue Zhang
[ABSTRACT]
Active learning (AL), which aims to construct an effective training set by
iteratively curating the most formative unlabeled data for annotation, has been
widely used in low-resource tasks. Most active learning techniques in
classification rely on the model’s uncertainty or disagreement to choose
unlabeled data, suffering from the problem of over-confidence in superficial
patterns and a lack of exploration. Inspired by the cognitive processes in
which humans deduce and predict through causal information, we take an initial
attempt towards integrating rationales into AL and propose a novel Explainable
Active Learning framework (XAL) for low-resource text classification, which
aims to encourage classifiers to justify their inferences and delve into
unlabeled data for which they cannot provide reasonable explanations.
Specifically, besides using a pre-trained bi-directional encoder for
classification, we employ a pre-trained uni-directional decoder to generate and
score the explanation. We further facilitate the alignment of the model with
human reasoning preference through a proposed ranking loss. During the
selection of unlabeled data, the predicted uncertainty of the encoder and the
explanation score of the decoder complement each other as the final metric to
acquire informative data. Extensive experiments on six datasets show that XAL
achieves consistent improvement over 9 strong baselines. Analysis indicates
that the proposed method can generate corresponding explanations for its
predictions.
[COMMENTS]
Accepted by NAACL 2024
[LINK]
http://arxiv.org/abs/2310.05502v4
[DATE]
2024-12-15 22:58:09+08:00
[CATEGORIES]
cs.CL
Task-Oriented Dialog Systems for the Senegalese Wolof Language
[AUTHORS]
Derguene Mbaye, Moussa Diallo
[ABSTRACT]
In recent years, we have seen considerable interest in conversational agents
with the rise of large language models (LLMs). Although they offer considerable
advantages, LLMs also present significant risks, such as hallucination, which
hinder their widespread deployment in industry. Moreover, low-resource
languages such as African ones are still underrepresented in these systems,
limiting their performance in these languages. In this paper, we illustrate a
more classical approach based on modular architectures of Task-oriented Dialog
Systems (ToDS) offering better control over outputs. We propose a chatbot
generation engine based on the Rasa framework and a robust methodology for
projecting annotations onto the Wolof language using an in-house machine
translation system. After evaluating a generated chatbot trained on the Amazon
Massive dataset, our Wolof Intent Classifier performs similarly to the one
obtained for French, which is a resource-rich language. We also show that this
approach is extensible to other low-resource languages, thanks to the intent
classifier’s language-agnostic pipeline, simplifying the design of chatbots in
these languages.
[COMMENTS]
10 pages, 3 tables, 6 figures, The 31st International Conference on
Computational Linguistics (COLING 2025)
[LINK]
http://arxiv.org/abs/2412.11203v1
[DATE]
2024-12-15 22:35:49+08:00
[CATEGORIES]
cs.CL
Drawing the Line: Enhancing Trustworthiness of MLLMs Through the Power of Refusal
[AUTHORS]
Yuhao Wang, Zhiyuan Zhu, Heyang Liu, Yusheng Liao, Hongcheng Liu, Yanfeng Wang, Yu Wang
[ABSTRACT]
Multimodal large language models (MLLMs) excel at multimodal perception and
understanding, yet their tendency to generate hallucinated or inaccurate
responses undermines their trustworthiness. Existing methods have largely
overlooked the importance of refusal responses as a means of enhancing MLLMs’
reliability. To bridge this gap, we present the Information Boundary-aware
Learning Framework (InBoL), a novel approach that empowers MLLMs to refuse to
answer user queries when encountering insufficient information. To the best of
our knowledge, InBoL is the first framework that systematically defines the
conditions under which refusal is appropriate for MLLMs using the concept of
information boundaries proposed in our paper. This framework introduces a
comprehensive data generation pipeline and tailored training strategies to
improve the model’s ability to deliver appropriate refusal responses. To
evaluate the trustworthiness of MLLMs, we further propose a user-centric
alignment goal along with corresponding metrics. Experimental results
demonstrate a significant improvement in refusal accuracy without noticeably
compromising the model’s helpfulness, establishing InBoL as a pivotal
advancement in building more trustworthy MLLMs.
[LINK]
http://arxiv.org/abs/2412.11196v1
[DATE]
2024-12-15 22:17:14+08:00
[CATEGORIES]
cs.CL
Analyzing the Attention Heads for Pronoun Disambiguation in Context-aware Machine Translation Models
[AUTHORS]
Paweł Mąka, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis
[ABSTRACT]
In this paper, we investigate the role of attention heads in Context-aware
Machine Translation models for pronoun disambiguation in the English-to-German
and English-to-French language directions. We analyze their influence by both
observing and modifying the attention scores corresponding to the plausible
relations that could impact a pronoun prediction. Our findings reveal that
while some heads do attend to the relations of interest, not all of them
influence the models’ ability to disambiguate pronouns. We show that certain
heads are underutilized by the models, suggesting that model performance could
be improved if these heads attended to one of the relations more strongly.
Furthermore, we fine-tune the most promising heads and observe an increase in
pronoun disambiguation accuracy of up to 5 percentage points, which
demonstrates that the improvements in performance can be solidified into the
models’ parameters.
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2412.11187v1
[DATE]
2024-12-15 21:42:49+08:00
[CATEGORIES]
cs.CL
cs.LG
Transliterated Zero-Shot Domain Adaptation for Automatic Speech Recognition
[AUTHORS]
Han Zhu, Gaofeng Cheng, Qingwei Zhao, Pengyuan Zhang
[ABSTRACT]
The performance of automatic speech recognition models often degenerates on
domains not covered by the training data. Domain adaptation can address this
issue, assuming the availability of the target domain data in the target
language. However, such an assumption does not hold in many real-world
applications. To make domain adaptation more applicable, we address the problem
of zero-shot domain adaptation (ZSDA), where target domain data is unavailable
in the target language. Instead, we transfer the target domain knowledge from
another source language where the target domain data is more accessible. To do
that, we first perform cross-lingual pre-training (XLPT) to share domain
knowledge across languages, then use target language fine-tuning to build the
final model. One challenge in this practice is that the pre-trained knowledge
can be forgotten during fine-tuning, resulting in sub-optimal adaptation
performance. To address this issue, we propose transliterated ZSDA to achieve
consistent pre-training and fine-tuning labels, leading to maximum preservation
of the pre-trained knowledge. Experimental results show that transliterated
ZSDA reduces the word error rate by a relative 9.2% compared with a wav2vec
2.0 baseline. Moreover, transliterated ZSDA consistently outperforms
self-supervised ZSDA and performs on par with supervised ZSDA, proving the
superiority of transliteration-based pre-training labels.
[LINK]
http://arxiv.org/abs/2412.11185v1
[DATE]
2024-12-15 21:32:08+08:00
[CATEGORIES]
cs.CL
Provably Secure Disambiguating Neural Linguistic Steganography
[AUTHORS]
Yuang Qi, Kejiang Chen, Kai Zeng, Weiming Zhang, Nenghai Yu
[ABSTRACT]
Recent research in provably secure neural linguistic steganography has
overlooked a crucial aspect: the sender must detokenize stegotexts to avoid
raising suspicion from the eavesdropper. The segmentation ambiguity problem,
which arises when using language models based on subwords, leads to occasional
decoding failures in all neural language steganography implementations based on
these models. Current solutions to this issue involve altering the probability
distribution of candidate words, rendering them incompatible with provably
secure steganography. We propose a novel secure disambiguation method named
SyncPool, which effectively addresses the segmentation ambiguity problem. We
group all tokens with prefix relationships in the candidate pool before the
steganographic embedding algorithm runs to eliminate uncertainty among
ambiguous tokens. To enable the receiver to synchronize the sampling process of
the sender, a shared cryptographically-secure pseudorandom number generator
(CSPRNG) is deployed to select a token from the ambiguity pool. SyncPool does
not change the size of the candidate pool or the distribution of tokens and
thus is applicable to provably secure language steganography methods. We
provide theoretical proofs and experimentally demonstrate the applicability of
our solution to various languages and models, showing its potential to
significantly improve the reliability and security of neural linguistic
steganography systems.
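
The grouping step described above can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the authors' SyncPool implementation: the helper names (`build_ambiguity_pools`, `shared_uniform`, `pick_from_pool`) are hypothetical, pools are formed around the shortest token that prefixes the others, and a SHA-256 hash of a shared key stands in for the CSPRNG that sender and receiver would share.

```python
import hashlib


def build_ambiguity_pools(candidates):
    """Group tokens so that a token and the tokens it prefixes share a pool.

    Assumption: sorting lexicographically places a prefix before its
    extensions, so the first token of each pool acts as the pool's root.
    """
    pools = []
    for token in sorted(candidates):
        for pool in pools:
            if token.startswith(pool[0]):
                pool.append(token)
                break
        else:
            pools.append([token])
    return pools


def shared_uniform(key: bytes, step: int) -> float:
    """Deterministic number in [0, 1) derived from a shared key (CSPRNG stand-in)."""
    digest = hashlib.sha256(key + step.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def pick_from_pool(pool, probs, key: bytes, step: int):
    """Pick one token from a pool in proportion to its original probability."""
    total = sum(probs[t] for t in pool)
    u = shared_uniform(key, step) * total
    acc = 0.0
    for token in pool:
        acc += probs[token]
        if u < acc:
            return token
    return pool[-1]


pools = build_ambiguity_pools(["the", "then", "cat", "th"])
probs = {"th": 0.2, "the": 0.5, "then": 0.3, "cat": 1.0}
sender_choice = pick_from_pool(pools[1], probs, b"shared-key", step=0)
receiver_choice = pick_from_pool(pools[1], probs, b"shared-key", step=0)
```

Because both parties derive the same number from the shared key and step index, they resolve the ambiguous pool to the same token without changing the relative probabilities inside it.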
[COMMENTS]
12 pages. Accepted by IEEE Transactions on Dependable and Secure
Computing (TDSC)
[LINK]
http://arxiv.org/abs/2403.17524v2
[DATE]
2024-12-15 20:59:05+08:00
[CATEGORIES]
cs.CL
Unpacking the Resilience of SNLI Contradiction Examples to Attacks
[AUTHORS]
Chetan Verma, Archit Agarwal
[ABSTRACT]
Pre-trained models excel on NLI benchmarks like SNLI and MultiNLI, but their
true language understanding remains uncertain. Models trained only on
hypotheses and labels achieve high accuracy, indicating reliance on dataset
biases and spurious correlations. To explore this issue, we applied the
Universal Adversarial Attack to examine the model’s vulnerabilities. Our
analysis revealed substantial drops in accuracy for the entailment and neutral
classes, whereas the contradiction class exhibited a smaller decline.
Fine-tuning the model on an augmented dataset with adversarial examples
restored its performance to near-baseline levels for both the standard and
challenge sets. Our findings highlight the value of adversarial triggers in
identifying spurious correlations and improving robustness while providing
insights into the resilience of the contradiction class to adversarial attacks.
[LINK]
http://arxiv.org/abs/2412.11172v1
[DATE]
2024-12-15 20:47:28+08:00
[CATEGORIES]
cs.CL
Fast-and-Frugal Text-Graph Transformers are Effective Link Predictors
[AUTHORS]
Andrei C. Coman, Christos Theodoropoulos, Marie-Francine Moens, James Henderson
[ABSTRACT]
We propose Fast-and-Frugal Text-Graph (FnF-TG) Transformers, a
Transformer-based framework that unifies textual and structural information for
inductive link prediction in text-attributed knowledge graphs. We demonstrate
that, by effectively encoding ego-graphs (1-hop neighbourhoods), we can reduce
the reliance on resource-intensive textual encoders. This makes the model both
fast at training and inference time, as well as frugal in terms of cost. We
perform a comprehensive evaluation on three popular datasets and show that
FnF-TG can achieve superior performance compared to previous state-of-the-art
methods. We also extend inductive learning to a fully inductive setting, where
relations don’t rely on transductive (fixed) representations, as in previous
work, but are a function of their textual description. Additionally, we
introduce new variants of existing datasets, specifically designed to test the
performance of models on unseen relations at inference time, thus offering a
new test-bench for fully inductive link prediction.
[LINK]
http://arxiv.org/abs/2408.06778v3
[DATE]
2024-12-15 20:37:51+08:00
[CATEGORIES]
cs.CL
Neeko: Leveraging Dynamic LoRA for Efficient Multi-Character Role-Playing Agent
[AUTHORS]
Xiaoyan Yu, Tongxu Luo, Yifan Wei, Fangyu Lei, Yiming Huang, Hao Peng, Liehuang Zhu
[ABSTRACT]
Large Language Models (LLMs) have revolutionized open-domain dialogue agents
but encounter challenges in multi-character role-playing (MCRP) scenarios. To
address the issue, we present Neeko, an innovative framework designed for
efficient multi-character imitation. Unlike existing methods, Neeko employs
a dynamic low-rank adapter (LoRA) strategy, enabling it to adapt seamlessly to
diverse characters. Our framework breaks down the role-playing process into
agent pre-training, multiple characters playing, and character incremental
learning, effectively handling both seen and unseen roles. This dynamic
approach, coupled with distinct LoRA blocks for each character, enhances
Neeko’s adaptability to unique attributes, personalities, and speaking
patterns. As a result, Neeko demonstrates superior performance in MCRP over
most existing methods, offering more engaging and versatile user interaction
experiences. Code and data are available at
https://github.com/weiyifan1023/Neeko.
[COMMENTS]
EMNLP 2024
[LINK]
http://arxiv.org/abs/2402.13717v3
[DATE]
2024-12-15 20:37:31+08:00
[CATEGORIES]
cs.CL
Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette
[AUTHORS]
Jiahao Yuan, Zixiang Di, Shangzixin Zhao, Usman Naseem
[ABSTRACT]
Large language models (LLMs) face challenges in aligning with diverse
cultural values despite their remarkable performance in generation, which stems
from inherent monocultural biases and difficulties in capturing nuanced
cultural semantics. Existing methods lack adaptability to unknown cultures after
finetuning. Inspired by cultural geography across five continents, we propose
Cultural Palette, a multi-agent framework for cultural alignment. We first
introduce the Pentachromatic Cultural Palette Dataset synthesized using LLMs to
capture diverse cultural values from social dialogues across five continents.
Building on this, Cultural Palette integrates five continent-level alignment
agents with a meta-agent using our superior Cultural MoErges alignment
technique by dynamically activating relevant cultural expertise based on user
prompts to adapt to new cultures, which outperforms other joint and merging
alignment strategies in overall cultural value alignment. Each continent agent
generates a cultural draft, which is then refined and self-regulated by the
meta-agent to produce the final culturally aligned response. Experiments across
various countries demonstrate that Cultural Palette surpasses existing
baselines in cultural alignment.
[LINK]
http://arxiv.org/abs/2412.11167v1
[DATE]
2024-12-15 20:30:52+08:00
[CATEGORIES]
cs.CL
SEE: Sememe Entanglement Encoding for Transformer-bases Models Compression
[AUTHORS]
Jing Zhang, Shuzhen Sun, Peng Zhang, Guangxing Cao, Hui Gao, Xindian Ma, Nan Xu, Yuexian Hou
[ABSTRACT]
Transformer-based large language models exhibit groundbreaking capabilities,
but their storage and computational costs are prohibitively high, limiting
their application in resource-constrained scenarios. An effective approach is
to eliminate redundant model parameters and computational costs while
incorporating efficient expert-derived knowledge structures to achieve a
balance between compression and performance. Therefore, we propose the
Sememe Entanglement Encoding (SEE) algorithm. Guided by expert prior
knowledge, the model is compressed through the low-rank approximation idea. In
Entanglement Embedding, basic semantic units such as sememes are represented as
low-dimensional vectors, and then reconstructed into high-dimensional word
embeddings through the combination of generalized quantum entanglement. We
adapt the Sememe Entanglement Encoding algorithm to transformer-based models of
different magnitudes. Experimental results indicate that our approach achieves
stable performance while compressing model parameters and computational costs.
[LINK]
http://arxiv.org/abs/2412.12204v1
[DATE]
2024-12-15 20:01:43+08:00
[CATEGORIES]
cs.LG
cs.CL
AD-LLM: Benchmarking Large Language Models for Anomaly Detection
[AUTHORS]
Tiankai Yang, Yi Nian, Shawn Li, Ruiyao Xu, Yuangang Li, Jiaqi Li, Zhuo Xiao, Xiyang Hu, Ryan Rossi, Kaize Ding, Xia Hu, Yue Zhao
[ABSTRACT]
Anomaly detection (AD) is an important machine learning task with many
real-world uses, including fraud detection, medical diagnosis, and industrial
monitoring. Within natural language processing (NLP), AD helps detect issues
like spam, misinformation, and unusual user activity. Although large language
models (LLMs) have had a strong impact on tasks such as text generation and
summarization, their potential in AD has not been studied enough. This paper
introduces AD-LLM, the first benchmark that evaluates how LLMs can help with
NLP anomaly detection. We examine three key tasks: (i) zero-shot detection,
using LLMs’ pre-trained knowledge to perform AD without task-specific
training; (ii) data augmentation, generating synthetic data and category
descriptions to improve AD models; and (iii) model selection, using LLMs to
suggest unsupervised AD models. Through experiments with different datasets, we
find that LLMs can work well in zero-shot AD, that carefully designed
augmentation methods are useful, and that explaining model selection for
specific datasets remains challenging. Based on these results, we outline six
future research directions on LLMs for AD.
[LINK]
http://arxiv.org/abs/2412.11142v1
[DATE]
2024-12-15 18:22:14+08:00
[CATEGORIES]
cs.CL
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data
[AUTHORS]
Yihong Liu, Chunlan Ma, Haotian Ye, Hinrich Schütze
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2405.09913v2
[DATE]
2024-12-15 18:20:21+08:00
[CATEGORIES]
cs.CL
How Transliterations Improve Crosslingual Alignment
[AUTHORS]
Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Ayyoob Imani, Orgest Xhelili, Haotian Ye, Chunlan Ma, François Yvon, Hinrich Schütze
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2409.17326v2
[DATE]
2024-12-15 18:14:11+08:00
[CATEGORIES]
cs.CL
Feature engineering vs. deep learning for paper section identification: Toward applications in Chinese medical literature
[AUTHORS]
Sijia Zhou, Xin Li
[ABSTRACT]
Section identification is an important task for library science, especially
knowledge management. Identifying the sections of a paper would help filter
noise in entity and relation extraction. In this research, we studied the paper
section identification problem in the context of Chinese medical literature
analysis, where the subjects, methods, and results are more valuable from a
physician’s perspective. Based on previous studies on English literature
section identification, we experiment with the effective features to use with
classic machine learning algorithms to tackle the problem. It is found that
Conditional Random Fields, which consider sentence interdependency, are more
effective in combining different feature sets, such as bag-of-words,
part-of-speech, and headings, for Chinese literature section identification.
Moreover, we find that classic machine learning algorithms are more effective
than generic deep learning models for this problem. Based on these
observations, we design a novel deep learning model, the Structural
Bidirectional Long Short-Term Memory (SLSTM) model, which models word and
sentence interdependency together with the contextual information. Experiments
on a human-curated asthma literature dataset show that our approach outperforms
the traditional machine learning methods and other deep learning methods and
achieves close to 90% precision and recall in the task. The model shows good
potential for use in other text mining tasks. The research has significant
methodological and practical implications.
[LINK]
http://arxiv.org/abs/2412.11125v1
[DATE]
2024-12-15 17:11:14+08:00
[CATEGORIES]
cs.CL
cs.LG
MMM: Multilingual Mutual Reinforcement Effect Mix Datasets & Test with Open-domain Information Extraction Large Language Models
[AUTHORS]
Chengguang Gan, Sunbowen Lee, Qingyu Yin, Xinyang He, Hanjun Wei, Yunhao Liang, Younghun Lim, Shijian Wang, Hexiang Huang, Qinghao Zhang, Shiwen Ni, Tatsunori Mori
[ABSTRACT]
The Mutual Reinforcement Effect (MRE) represents a promising avenue in
information extraction and multitasking research. Nevertheless, its
applicability has been constrained due to the exclusive availability of MRE mix
datasets in Japanese, thereby limiting comprehensive exploration by the global
research community. To address this limitation, we introduce a Multilingual MRE
mix dataset (MMM) that encompasses 21 sub-datasets in English, Japanese, and
Chinese. In this paper, we also propose a method for dataset translation
assisted by Large Language Models (LLMs), which significantly reduces the
manual annotation time required for dataset construction by leveraging LLMs to
translate the original Japanese datasets. Additionally, we have enriched the
dataset by incorporating open-domain Named Entity Recognition (NER) and
sentence classification tasks. Utilizing this expanded dataset, we developed a
unified input-output framework to train an Open-domain Information Extraction
Large Language Model (OIELLM). The OIELLM model demonstrates the capability to
effectively process novel MMM datasets, exhibiting significant improvements in
performance. The OIELLM model and datasets are open-sourced on HuggingFace;
project website: https://ganchengguang.github.io/MRE/
[COMMENTS]
Under Review. 11 pages, 5 figures
[LINK]
http://arxiv.org/abs/2407.10953v3
[DATE]
2024-12-15 16:38:29+08:00
[CATEGORIES]
cs.CL
xVal: A Continuous Numerical Tokenization for Scientific Language Models
[AUTHORS]
Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Parker, Bruno Régaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho
[ABSTRACT]
Due in part to their discontinuous and discrete default encodings for
numbers, Large Language Models (LLMs) have not yet been commonly used to
process numerically-dense scientific datasets. Rendering datasets as text,
however, could help aggregate diverse and multi-modal scientific data into a
single training corpus, thereby potentially facilitating the development of
foundation models for science. In this work, we introduce xVal, a strategy for
continuously tokenizing numbers within language models that results in a more
appropriate inductive bias for scientific applications. By training
specially-modified language models from scratch on a variety of scientific
datasets formatted as text, we find that xVal generally outperforms other
common numerical tokenization strategies on metrics including
out-of-distribution generalization and computational efficiency.
[COMMENTS]
15 pages, 12 figures. Appendix: 8 pages, 2 figures. Accepted
contribution at the NeurIPS Workshop on ML for the Physical Sciences
[LINK]
http://arxiv.org/abs/2310.02989v2
[DATE]
2024-12-15 15:07:28+08:00
[CATEGORIES]
cs.CL
cs.LG
An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs
[AUTHORS]
Eui Jun Hwang, Sukmin Cho, Junmyeong Lee, Jong C. Park
[ABSTRACT]
Gloss-free Sign Language Translation (SLT) converts sign videos directly into
spoken language sentences without relying on glosses. Recently, Large Language
Models (LLMs) have shown remarkable translation performance in gloss-free
methods by harnessing their powerful natural language generation capabilities.
However, these methods often rely on domain-specific fine-tuning of visual
encoders to achieve optimal results. By contrast, this paper emphasizes the
importance of capturing the spatial configurations and motion dynamics inherent
in sign language. With this in mind, we introduce Spatial and Motion-based Sign
Language Translation (SpaMo), a novel LLM-based SLT framework. The core idea of
SpaMo is simple yet effective. We first extract spatial and motion features
using off-the-shelf visual encoders and then input these features into an LLM
with a language prompt. Additionally, we employ a visual-text alignment process
as a warm-up before the SLT supervision. Our experiments demonstrate that SpaMo
achieves state-of-the-art performance on two popular datasets, PHOENIX14T and
How2Sign.
[COMMENTS]
Under Review
[LINK]
http://arxiv.org/abs/2408.10593v2
[DATE]
2024-12-15 14:18:53+08:00
[CATEGORIES]
cs.CL
LAW: Legal Agentic Workflows for Custody and Fund Services Contracts
[AUTHORS]
William Watson, Nicole Cho, Nishan Srishankar, Zhen Zeng, Lucas Cecchi, Daniel Scott, Suchetha Siddagangappa, Rachneet Kaur, Tucker Balch, Manuela Veloso
[COMMENTS]
Accepted at The 31st International Conference on Computational
Linguistics (COLING 2025)
[LINK]
http://arxiv.org/abs/2412.11063v1
[DATE]
2024-12-15 13:40:57+08:00
[CATEGORIES]
cs.CL
NITRO: LLM Inference on Intel Laptop NPUs
[AUTHORS]
Anthony Fei, Mohamed S. Abdelfattah
[ABSTRACT]
Large Language Models (LLMs) have become essential tools in natural language
processing, finding wide use in chatbots such as ChatGPT and Gemini, and are
a central area of research. A particular area of interest includes designing
hardware specialized for these AI applications, with one such example being the
neural processing unit (NPU). In 2023, Intel released the Intel Core Ultra
processor with codename Meteor Lake, featuring a CPU, GPU, and NPU
system-on-chip. However, official software support for the NPU through Intel’s
OpenVINO framework is limited to static model inference. The dynamic nature of
autoregressive token generation in LLMs is therefore not supported out of the
box. To address this shortcoming, we present NITRO (NPU Inference for
Transformers Optimization), a Python-based framework built on top of OpenVINO
to support text and chat generation on NPUs. In this paper, we discuss in
detail the key modifications made to the transformer architecture to enable
inference, some performance benchmarks, and future steps towards improving the
package. The code repository for NITRO can be found here:
https://github.com/abdelfattah-lab/nitro.
[COMMENTS]
11 pages, 7 figures
[LINK]
http://arxiv.org/abs/2412.11053v1
[DATE]
2024-12-15 13:15:54+08:00
[CATEGORIES]
cs.CL
MatFormer: Nested Transformer for Elastic Inference
[AUTHORS]
Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain
[ABSTRACT]
Foundation models are applied in a broad spectrum of settings with different
inference constraints, from massive multi-accelerator clusters to
resource-constrained standalone mobile devices. However, the substantial costs
associated with training these models often limit the number of unique model
sizes that can be offered. Consequently, practitioners are compelled to select
a model that may not be optimally aligned with their specific latency and cost
requirements. We present MatFormer, a novel Transformer architecture designed
to provide elastic inference across diverse deployment constraints. MatFormer
achieves this by incorporating a nested Feed Forward Network (FFN) block
structure within a standard Transformer model. During training, we optimize the
parameters of multiple nested FFN blocks with varying sizes, enabling the
extraction of hundreds of accurate smaller models without incurring additional
computational costs. We empirically validate the efficacy of MatFormer across
different model classes (decoders and encoders) and modalities (language and
vision), demonstrating its potential for real-world deployment. We show that an
850M decoder-only MatFormer language model (MatLM) allows us to extract
multiple smaller models spanning from 582M to 850M parameters, each exhibiting
better validation loss and one-shot downstream evaluations than independently
trained counterparts. Furthermore, we observe that smaller encoders extracted
from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space
structure for adaptive large-scale retrieval. Finally, we showcase that
speculative decoding with the accurate and consistent submodels extracted from
MatFormer can lead to a significant reduction in inference latency. Project
website: https://devvrit.github.io/matformer/
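
The nested FFN idea can be sketched in a few lines. This is only an illustration: it assumes that a smaller submodel reuses the first `m` hidden units of the full FFN, and the function name `nested_ffn`, the ReLU activation, and the toy weights are hypothetical rather than taken from the paper.

```python
def relu(value):
    return max(0.0, value)


def nested_ffn(x, w_in, w_out, m):
    """Apply an FFN that uses only the first m hidden units.

    x:     input vector of length d
    w_in:  d_ff rows of length d (input -> hidden weights)
    w_out: d_ff rows of length d (hidden -> output weights)
    m:     number of hidden units kept, 1 <= m <= d_ff (assumed nesting rule)
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in[:m]]
    return [
        sum(h * w_out[j][k] for j, h in enumerate(hidden))
        for k in range(len(x))
    ]


x = [1.0, 2.0]
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

small = nested_ffn(x, w_in, w_out, m=1)  # smallest nested block
full = nested_ffn(x, w_in, w_out, m=3)   # full FFN block
```

Every smaller block shares its weights with the larger ones, which is why extracting a submodel in this sketch amounts to choosing `m` and needs no additional parameters.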
[COMMENTS]
30 pages, 11 figures, first three authors contributed equally.
NeurIPS, 2024
[LINK]
http://arxiv.org/abs/2310.07707v2
[DATE]
2024-12-15 11:45:36+08:00
[CATEGORIES]
cs.LG
cs.CL
DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning
[AUTHORS]
Zijian Zhou, Xiaoqiang Lin, Xinyi Xu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low
[ABSTRACT]
In-context learning (ICL) allows transformer-based language models that are
pre-trained on general text to quickly learn a specific task with a few “task
demonstrations” without updating their parameters, significantly boosting their
flexibility and generality. ICL has many characteristics distinct from
conventional machine learning, thereby requiring new approaches to interpret
this learning paradigm. Taking the viewpoint of recent works showing that
transformers learn in context by formulating an internal optimizer, we propose
an influence function-based attribution technique, DETAIL, that addresses the
specific characteristics of ICL. We empirically verify that our approach is
effective for demonstration attribution while being computationally efficient.
Leveraging the results, we then show how DETAIL can help improve model
performance in real-world scenarios through demonstration reordering and
curation. Finally, we experimentally demonstrate the wide applicability of DETAIL by
showing our attribution scores obtained on white-box models are transferable to
black-box models in improving model performance.
[COMMENTS]
Accepted at NeurIPS 2024
[LINK]
http://arxiv.org/abs/2405.14899v2
[DATE]
2024-12-15 10:13:10+08:00
[CATEGORIES]
cs.CL
cs.LG
A Contextualized BERT model for Knowledge Graph Completion
[AUTHORS]
Haji Gul, Abdul Ghani Naim, Ajaz A. Bhat
[ABSTRACT]
Knowledge graphs (KGs) are valuable for representing structured,
interconnected information across domains, enabling tasks like semantic search,
recommendation systems and inference. A pertinent challenge with KGs, however,
is that many entities (i.e., heads, tails) or relationships are unknown.
Knowledge Graph Completion (KGC) addresses this by predicting these missing
nodes or links, enhancing the graph’s informational depth and utility.
Traditional methods like TransE and ComplEx predict tail entities but struggle
with unseen entities. Textual-based models leverage additional semantics but
come with high computational costs, semantic inconsistencies, and data
imbalance issues. Recent LLM-based models show improvement but overlook
contextual information and rely heavily on entity descriptions. In this study,
we introduce a contextualized BERT model for KGC that overcomes these
limitations by utilizing the contextual information from neighbouring entities
and relationships to predict tail entities. Our model eliminates the need for
entity descriptions and negative triplet sampling, reducing computational
demands while improving performance. Our model outperforms state-of-the-art
methods on standard datasets, improving Hit@1 by 5.3% and 4.88% on FB15k-237
and WN18RR respectively, setting a new benchmark in KGC.
[COMMENTS]
MuslML Workshop, 38th Conference on Neural Information Processing
Systems (NeurIPS 2024)
[LINK]
http://arxiv.org/abs/2412.11016v1
[DATE]
2024-12-15 10:03:16+08:00
[CATEGORIES]
cs.CL
cs.LG
SQLFixAgent: Towards Semantic-Accurate Text-to-SQL Parsing via Consistency-Enhanced Multi-Agent Collaboration
[AUTHORS]
Jipeng Cen, Jiaxin Liu, Zhixu Li, Jingjing Wang
[ABSTRACT]
While fine-tuned large language models (LLMs) excel in generating
grammatically valid SQL in Text-to-SQL parsing, they often struggle to ensure
semantic accuracy in queries, leading to user confusion and diminished system
usability. To tackle this challenge, we introduce SQLFixAgent, a new
consistency-enhanced multi-agent collaborative framework designed for detecting
and repairing erroneous SQL. Our framework comprises a core agent, SQLRefiner,
alongside two auxiliary agents: SQLReviewer and QueryCrafter. The SQLReviewer
agent employs the rubber duck debugging method to identify potential semantic
mismatches between the SQL and the user query. If an error is detected, the
QueryCrafter agent generates multiple SQL statements as candidate repairs using a
fine-tuned SQLTool. Subsequently, leveraging similar repair retrieval and
failure memory reflection, the SQLRefiner agent selects the most fitting SQL
statement from the candidates as the final repair. We evaluated our proposed
framework on five Text-to-SQL benchmarks. The experimental results show that
our method consistently enhances the performance of the baseline model,
specifically achieving an execution accuracy improvement of over 3% on the
Bird benchmark. Our framework also has a higher token efficiency compared to
other advanced methods, making it more competitive.
[LINK]
http://arxiv.org/abs/2406.13408v4
[DATE]
2024-12-15 09:39:42+08:00
[CATEGORIES]
cs.CL
Dual Traits in Probabilistic Reasoning of Large Language Models
[AUTHORS]
Shenxiong Li, Huaxia Rui
[ABSTRACT]
We conducted three experiments to investigate how large language models
(LLMs) evaluate posterior probabilities. Our results reveal the coexistence of
two modes in posterior judgment among state-of-the-art models: a normative
mode, which adheres to Bayes’ rule, and a representative-based mode, which
relies on similarity – paralleling human System 1 and System 2 thinking.
Additionally, we observed that LLMs struggle to recall base rate information
from their memory, and developing prompt engineering strategies to mitigate
representative-based judgment may be challenging. We further conjecture that
the dual modes of judgment may be a result of the contrastive loss function
employed in reinforcement learning from human feedback. Our findings underscore
the potential direction for reducing cognitive biases in LLMs and the necessity
for cautious deployment of LLMs in critical areas.
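
For reference, the normative mode mentioned above corresponds to a direct application of Bayes' rule. The sketch below is a generic illustration with made-up numbers, not one of the paper's experiments; the function name `posterior` and its parameters are hypothetical.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | E) via Bayes' rule for a binary hypothesis H."""
    evidence = (
        p_evidence_given_h * prior
        + p_evidence_given_not_h * (1.0 - prior)
    )
    return p_evidence_given_h * prior / evidence


# Illustrative numbers: a 1% base rate with evidence that is nine times
# more likely under H than under not-H.
normative_answer = posterior(
    prior=0.01, p_evidence_given_h=0.9, p_evidence_given_not_h=0.1
)
```

A judgment driven by similarity alone would track the 0.9 likelihood and neglect the 1% base rate, whereas the normative posterior stays below 10%.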
[LINK]
http://arxiv.org/abs/2412.11009v1
[DATE]
2024-12-15 09:33:45+08:00
[CATEGORIES]
cs.CL
Entropy-Regularized Process Reward Model
[AUTHORS]
Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, Tong Zhang
[ABSTRACT]
Large language models (LLMs) have shown promise in performing complex
multi-step reasoning, yet they continue to struggle with mathematical
reasoning, often making systematic errors. A promising solution is
reinforcement learning (RL) guided by reward models, particularly those
focusing on process rewards, which score each intermediate step rather than
solely evaluating the final outcome. This approach is more effective at guiding
policy models towards correct reasoning trajectories. In this work, we propose
an entropy-regularized process reward model (ER-PRM) that integrates
KL-regularized Markov Decision Processes (MDP) to balance policy optimization
with the need to prevent the policy from shifting too far from its initial
distribution. We derive a novel reward construction method based on the
theoretical results. Our theoretical analysis shows that we could derive the
optimal reward model from the initial policy sampling. Our empirical
experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM
consistently outperforms existing process reward models, achieving a 1%
improvement on GSM8K and 2-3% improvement on MATH under best-of-N evaluation,
and more than 1% improvement under RLHF. These results highlight the efficacy
of entropy-regularization in enhancing LLMs’ reasoning capabilities.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2412.11006v1
[DATE]
2024-12-15 09:09:23+08:00
[CATEGORIES]
cs.LG
cs.CL
LOLA – An Open-Source Massively Multilingual Large Language Model
[AUTHORS]
Nikit Srivastava, Denis Kuchelev, Tatiana Moteu Ngoli, Kshitij Shetty, Michael Röder, Hamada Zahera, Diego Moussallem, Axel-Cyrille Ngonga Ngomo
[ABSTRACT]
This paper presents LOLA, a massively multilingual large language model
trained on more than 160 languages using a sparse Mixture-of-Experts
Transformer architecture. Our architectural and implementation choices address
the challenge of harnessing linguistic diversity while maintaining efficiency
and avoiding the common pitfalls of multilinguality. Our analysis of the
evaluation results shows competitive performance in natural language generation
and understanding tasks. Additionally, we demonstrate how the learned
expert-routing mechanism exploits implicit phylogenetic linguistic patterns to
potentially alleviate the curse of multilinguality. We provide an in-depth look
at the training process, an analysis of the datasets, and a balanced
exploration of the model’s strengths and limitations. As an open-source model,
LOLA promotes reproducibility and serves as a robust foundation for future
research. Our findings enable the development of compute-efficient multilingual
models with strong, scalable performance across languages.
[LINK]
http://arxiv.org/abs/2409.11272v5
[DATE]
2024-12-15 08:34:42+08:00
[CATEGORIES]
cs.CL
cs.LG
Why are state-space models more expressive than $n$-gram models?
[AUTHORS]
Vinoth Nandakumar, Qiang Qu, Peng Mi, Tongliang Liu
[ABSTRACT]
Recent advancements in recurrent neural networks (RNNs) have reinvigorated
interest in their application to natural language processing tasks,
particularly with the development of more efficient and parallelizable variants
known as state space models (SSMs), which have shown competitive performance
against transformer models while maintaining a lower memory footprint. While
RNNs and SSMs (e.g., Mamba) have been empirically more successful than
rule-based systems based on $n$-gram models, a rigorous theoretical explanation
for this success has not yet been developed, as it is unclear how these models
encode the combinatorial rules that govern the next-word prediction task. In
this paper, we construct state space language models that can solve the
next-word prediction task for languages generated from $n$-gram rules, thereby
showing that the former are more expressive. Our proof shows how SSMs can
encode $n$-gram rules using new theoretical results on their memorization
capacity, and demonstrates how their context window can be controlled by
restricting the spectrum of the hidden weight matrix. We conduct experiments
with a small dataset generated from $n$-gram rules to show how our framework
can be applied to SSMs and RNNs obtained through gradient-based optimization.
[COMMENTS]
This submission subsumes arXiv:2306.17184
[LINK]
http://arxiv.org/abs/2306.17184v2
[DATE]
2024-12-15 08:24:59+08:00
[CATEGORIES]
cs.CL
cs.LG
Can a Neural Model Guide Fieldwork? A Case Study on Morphological Data Collection
[AUTHORS]
Aso Mahmudi, Borja Herce, Demian Inostroza Amestica, Andreas Scherbakov, Eduard Hovy, Ekaterina Vylomova
[ABSTRACT]
Linguistic fieldwork is an important component in language documentation and
preservation. However, it is a long, exhaustive, and time-consuming process.
This paper presents a novel model that guides a linguist during the fieldwork
and accounts for the dynamics of linguist-speaker interactions. We introduce a
novel framework that evaluates the efficiency of various sampling strategies
for obtaining morphological data and assesses the effectiveness of
state-of-the-art neural models in generalising morphological structures. Our
experiments highlight two key strategies for improving the efficiency: (1)
increasing the diversity of annotated data by uniform sampling among the cells
of the paradigm tables, and (2) using model confidence as a guide to enhance
positive interaction by providing reliable predictions during annotation.
[LINK]
http://arxiv.org/abs/2409.14628v2
[DATE]
2024-12-15 08:07:34+08:00
[CATEGORIES]
cs.CL
Retrofitting Large Language Models with Dynamic Tokenization
[AUTHORS]
Darius Feher, Ivan Vulić, Benjamin Minixhofer
[ABSTRACT]
Current language models (LMs) use a fixed, static subword tokenizer. This
default choice typically results in degraded efficiency and language
capabilities, especially in languages other than English. To address this
issue, we challenge the static design and propose retrofitting LMs with dynamic
tokenization: a way to dynamically decide on token boundaries based on the
input text via a subword-merging algorithm inspired by byte-pair encoding. We
merge frequent subword sequences in a batch, then apply a pre-trained
embedding-prediction hypernetwork to compute the token embeddings on-the-fly.
For encoder-style models (e.g., XLM-R), this on average reduces token sequence
lengths by >20% across 14 languages while degrading performance by less than
2%. The same method applied to pre-filling and scoring in decoder-style models
(e.g., Mistral-7B; evaluated on English) results in minimal performance
degradation at up to 6% reduction in sequence length. Overall, we find that
dynamic tokenization can mitigate the limitations of static tokenization by
substantially improving inference speed and promoting fairness across
languages, enabling more equitable and adaptable LMs.
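The batch-level merging step can be sketched as follows. This is a minimal illustration of a byte-pair-encoding-style merge applied to the token sequences of one batch, not the authors' algorithm: the function names, the stopping rule (stop once no pair occurs at least twice), and the toy batch are assumptions, and the embedding-prediction hypernetwork is left out entirely.

```python
from collections import Counter


def count_pairs(batch):
    """Count adjacent token pairs across every sequence in the batch."""
    counts = Counter()
    for seq in batch:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def dynamic_merge(batch, num_merges):
    """Greedily merge the most frequent adjacent pair, up to num_merges times."""
    batch = [list(seq) for seq in batch]
    for _ in range(num_merges):
        counts = count_pairs(batch)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # assumed stopping rule: nothing repeats any more
            break
        batch = [merge_pair(seq, pair) for seq in batch]
    return batch


# Toy batch of already-tokenized inputs (hypothetical subword tokens).
batch = [["l", "o", "w"], ["l", "o", "w", "e", "r"], ["n", "e", "w"]]
print(dynamic_merge(batch, num_merges=5))
```

Because the merges are decided per batch, the resulting token boundaries depend on the input text rather than on a fixed vocabulary, which is the property the abstract describes.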
[LINK]
http://arxiv.org/abs/2411.18553v2
[DATE]
2024-12-15 07:43:54+08:00
[CATEGORIES]
cs.CL
Language Model Alignment in Multilingual Trolley Problems
[AUTHORS]
Zhijing Jin, Max Kleiman-Weiner, Giorgio Piatti, Sydney Levine, Jiarui Liu, Fernando Gonzalez, Francesco Ortu, András Strausz, Mrinmaya Sachan, Rada Mihalcea, Yejin Choi, Bernhard Schölkopf
[COMMENTS]
Best Paper @ NeurIPS 2024 Workshop on Pluralistic Alignment
[LINK]
http://arxiv.org/abs/2407.02273v5
[DATE]
2024-12-15 05:28:47+08:00
[CATEGORIES]
cs.CL
MABR: A Multilayer Adversarial Bias Removal Approach Without Prior Bias Knowledge
[AUTHORS]
Maxwell J. Yin, Boyu Wang, Charles Ling
[ABSTRACT]
Models trained on real-world data often mirror and exacerbate existing social
biases. Traditional methods for mitigating these biases typically require prior
knowledge of the specific biases to be addressed, such as gender or racial
biases, and the social groups associated with each instance. In this paper, we
introduce a novel adversarial training strategy that operates independently of
prior bias-type knowledge and protected attribute labels. Our approach
proactively identifies biases during model training by utilizing auxiliary
models, which are trained concurrently by predicting the performance of the
main model without relying on task labels. Additionally, we implement these
auxiliary models at various levels of the feature maps of the main model,
enabling the detection of a broader and more nuanced range of bias features.
Through experiments on racial and gender biases in sentiment and occupation
classification tasks, our method effectively reduces social biases without the
need for demographic annotations. Moreover, our approach not only matches but
often surpasses the efficacy of methods that require detailed demographic
insights, marking a significant advancement in bias mitigation techniques.
[LINK]
http://arxiv.org/abs/2408.05497v2
[DATE]
2024-12-15 05:10:42+08:00
[CATEGORIES]
cs.CL
Can LLMs Help Create Grammar?: Automating Grammar Creation for Endangered Languages with In-Context Learning
[AUTHORS]
Piyapath T Spencer, Nanthipat Kongborrirak
[ABSTRACT]
Yes! For present-day efforts to document and preserve endangered languages, the
application of Large Language Models (LLMs) presents a promising approach. This
paper explores how LLMs, particularly through in-context learning, can assist
in generating grammatical information for low-resource languages with a limited
amount of data. We take Moklen as a case study to evaluate the efficacy of
LLMs in producing coherent grammatical rules and lexical entries using only
bilingual dictionaries and parallel sentences of the unknown language, without
building the model from scratch. Our methodology involves organising the
existing linguistic data and prompting the model so that it can efficiently
generate a formal XLE grammar. Our results demonstrate that LLMs can successfully capture key
grammatical structures and lexical information, although challenges such as the
potential for English grammatical biases remain. This study highlights the
potential of LLMs to enhance language documentation efforts, providing a
cost-effective solution for generating linguistic data and contributing to the
preservation of endangered languages.
[COMMENTS]
Preprint manuscript. Under revision. Accepted to COLING 2025
[LINK]
http://arxiv.org/abs/2412.10960v1
[DATE]
2024-12-15 04:43:12+08:00
[CATEGORIES]
cs.CL
Improving Multi-candidate Speculative Decoding
[AUTHORS]
Xiaofan Lu, Yixiao Zeng, Feiyang Ma, Zixu Yu, Marco Levorato
[ABSTRACT]
Speculative Decoding (SD) is a technique to accelerate the inference of Large
Language Models (LLMs) by using a lower complexity draft model to propose
candidate tokens verified by a larger target model. To further improve
efficiency, Multi-Candidate Speculative Decoding (MCSD) improves upon this by
sampling multiple candidate tokens from the draft model at each step and
verifying them in parallel, thus increasing the chances of accepting a token
and reducing generation time. Existing MCSD methods rely on the draft model to
initialize the multi-candidate sequences and use static length and tree
attention structure for draft generation. However, such an approach suffers
from the draft and target model’s output distribution differences, especially
in a dynamic generation context. In this work, we introduce a new version of
MCSD that includes a target model initialized multi-candidate generation, a
dynamic sliced topology-aware causal mask for dynamic length adjustment, and
decision models to optimize early stopping. We experimented with our method on
Llama 2-7B and its variants and observed a maximum 27.5% speedup compared to
our MCSD baseline across three benchmarks with Llama 2-7B as the target model
and JackFram 68M as the draft model. Additionally, we evaluate the effects of
using the target model initialized multi-candidate process with different draft
models on output quality.
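For orientation, the verification step of standard single-candidate speculative decoding can be sketched as below. This is the well-known accept/reject rule that multi-candidate variants build on, not the method proposed in this paper: the function names and toy distributions are assumptions, and the dynamic causal mask and decision models from the abstract are not modeled.

```python
import random


def residual_distribution(target, draft):
    """Normalized max(0, p_target - p_draft), used after a rejection."""
    raw = [max(0.0, p - q) for p, q in zip(target, draft)]
    total = sum(raw)
    if total == 0.0:  # identical distributions: fall back to the target
        return list(target)
    return [r / total for r in raw]


def verify_token(token, target, draft, rng):
    """Accept a drafted token with probability min(1, p_target / p_draft).

    `token` is assumed to have been sampled from `draft`, so draft[token] > 0.
    Returns (token_to_emit, accepted_flag).
    """
    accept_prob = min(1.0, target[token] / draft[token])
    if rng.random() < accept_prob:
        return token, True
    residual = residual_distribution(target, draft)
    replacement = rng.choices(range(len(target)), weights=residual, k=1)[0]
    return replacement, False


# Toy next-token distributions over a 3-token vocabulary (hypothetical).
target_probs = [0.6, 0.3, 0.1]
draft_probs = [0.3, 0.5, 0.2]
rng = random.Random(0)
drafted = rng.choices(range(3), weights=draft_probs, k=1)[0]
print(verify_token(drafted, target_probs, draft_probs, rng))
```

The larger the gap between the draft and target distributions, the lower the acceptance probability, which is the mismatch the abstract identifies as a weakness of draft-initialized candidates.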
[COMMENTS]
Accepted by NeurIPS ENLSP 2024 Workshop
[LINK]
http://arxiv.org/abs/2409.10644v3
[DATE]
2024-12-15 04:15:00+08:00
[CATEGORIES]
cs.CL
Enhancing Discoverability in Enterprise Conversational Systems with Proactive Question Suggestions
[AUTHORS]
Xiaobin Shen, Daniel Lee, Sumit Ranjan, Sai Sree Harsha, Pawan Sevak, Yunyao Li
[ABSTRACT]
Enterprise conversational AI systems are becoming increasingly popular to
assist users in completing daily tasks such as those in marketing and customer
management. However, new users often struggle to ask effective questions,
especially in emerging systems with unfamiliar or evolving capabilities. This
paper proposes a framework to enhance question suggestions in conversational
enterprise AI systems by generating proactive, context-aware questions that try
to address immediate user needs while improving feature discoverability. Our
approach combines periodic user intent analysis at the population level with
chat session-based question generation. We evaluate the framework using
real-world data from the AI Assistant for Adobe Experience Platform (AEP),
demonstrating the improved usefulness and system discoverability of the AI
Assistant.
[LINK]
http://arxiv.org/abs/2412.10933v1
[DATE]
2024-12-15 03:04:16+08:00
[CATEGORIES]
cs.CL
Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation
[AUTHORS]
Esteban Garces Arias, Meimingwei Li, Christian Heumann, Matthias Aßenmacher
[ABSTRACT]
Decoding strategies for generative large language models (LLMs) are a
critical but often underexplored aspect of text generation tasks. Guided by
specific hyperparameters, these strategies aim to transform the raw probability
distributions produced by language models into coherent, fluent text. In this
study, we undertake a large-scale empirical assessment of a range of decoding
methods, open-source LLMs, textual domains, and evaluation protocols to
determine how hyperparameter choices shape the outputs. Our experiments include
both factual (e.g., news) and creative (e.g., fiction) domains, and incorporate
a broad suite of automatic evaluation metrics alongside human judgments.
Through extensive sensitivity analyses, we distill practical recommendations
for selecting and tuning hyperparameters, noting that optimal configurations
vary across models and tasks. By synthesizing these insights, this study
provides actionable guidance for refining decoding strategies, enabling
researchers and practitioners to achieve higher-quality, more reliable, and
context-appropriate text generation outcomes.
[COMMENTS]
Accepted to COLING 2025
[LINK]
http://arxiv.org/abs/2410.06097v2
[DATE]
2024-12-15 03:02:28+08:00
[CATEGORIES]
cs.CL
cs.LG
LLMs-in-the-Loop Part 2: Expert Small AI Models for Anonymization and De-identification of PHI Across Multiple Languages
[AUTHORS]
Murat Gunay, Bunyamin Keles, Raife Hizlan
[ABSTRACT]
The rise of chronic diseases and pandemics like COVID-19 has emphasized the
need for effective patient data processing while ensuring privacy through
anonymization and de-identification of protected health information (PHI).
Anonymized data facilitates research without compromising patient
confidentiality. This paper introduces expert small AI models developed using
the LLM-in-the-loop methodology to meet the demand for domain-specific
de-identification NER models. These models overcome the privacy risks
associated with large language models (LLMs) used via APIs by eliminating the
need to transmit or store sensitive data. More importantly, they consistently
outperform LLMs in de-identification tasks, offering superior performance and
reliability. Our de-identification NER models, developed in eight languages
(English, German, Italian, French, Romanian, Turkish, Spanish, and Arabic)
achieved f1-micro score averages of 0.966, 0.975, 0.976, 0.970, 0.964, 0.974,
0.978, and 0.953 respectively. These results establish them as the most
accurate healthcare anonymization solutions, surpassing existing small models
and even general-purpose LLMs such as GPT-4o. While Part-1 of this series
introduced the LLM-in-the-loop methodology for bio-medical document
translation, this second paper showcases its success in developing
cost-effective expert small NER models in de-identification tasks. Our findings
lay the groundwork for future healthcare AI innovations, including biomedical
entity and relation extraction, demonstrating the value of specialized models
for domain-specific challenges.
[COMMENTS]
21 pages, 7 tables
[LINK]
http://arxiv.org/abs/2412.10918v1
[DATE]
2024-12-15 02:10:29+08:00
[CATEGORIES]
cs.CL
SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation
[AUTHORS]
Qilong Wu, Xiaoneng Xiang, Hejia Huang, Xuan Wang, Yeo Wei Jie, Ranjan Satapathy, Ricardo Shirota Filho, Bharadwaj Veeravalli
[ABSTRACT]
The rapid growth of the financial sector and the rising focus on
Environmental, Social, and Governance (ESG) considerations highlight the need
for advanced NLP tools. However, open-source LLMs proficient in both finance
and ESG domains remain scarce. To address this gap, we introduce SusGen-30K, a
category-balanced dataset comprising seven financial NLP tasks and ESG report
generation, and propose TCFD-Bench, a benchmark for evaluating sustainability
report generation. Leveraging this dataset, we developed SusGen-GPT, a suite of
models achieving state-of-the-art performance across six adapted and two
off-the-shelf tasks, trailing GPT-4 by only 2% despite using 7-8B parameters
compared to GPT-4’s 1,700B. Based on this, we propose the SusGen system,
integrated with Retrieval-Augmented Generation (RAG), to assist in
sustainability report generation. This work demonstrates the efficiency of our
approach, advancing research in finance and ESG.
[LINK]
http://arxiv.org/abs/2412.10906v1
[DATE]
2024-12-15 01:30:33+08:00
[CATEGORIES]
cs.CL
cs.LG
Improving the Efficiency of Visually Augmented Language Models
[AUTHORS]
Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune
[ABSTRACT]
Despite the impressive performance of autoregressive Language Models (LMs), it
has been shown that, due to reporting bias, LMs lack visual knowledge, i.e., they
do not know much about the visual world and its properties. To augment LMs with
visual knowledge, existing solutions often rely on explicit images, requiring
time-consuming retrieval or image generation systems. This paper shows that
explicit images are not necessary to visually augment an LM. Instead, we use
visually-grounded text representations obtained from the well-known CLIP
multimodal system. For a fair comparison, we modify VALM, a visually-augmented
LM which uses image retrieval and representation, to work directly with
visually-grounded text representations. We name this new model BLIND-VALM. We
show that BLIND-VALM performs on par with VALM for Visual Language
Understanding (VLU), Natural Language Understanding (NLU) and Language Modeling
tasks, despite being significantly more efficient and simpler. We also show
that by scaling up our model within the compute budget of VALM, increasing
either the model size or the pre-training corpus size, we outperform VALM on
all the evaluation tasks.
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2409.11148v3
[DATE]
2024-12-15 01:17:48+08:00
[CATEGORIES]
cs.CL
BgGPT 1.0: Extending English-centric LLMs to other languages
[AUTHORS]
Anton Alexandrov, Veselin Raychev, Dimitar I. Dimitrov, Ce Zhang, Martin Vechev, Kristina Toutanova
[ABSTRACT]
We present BgGPT-Gemma-2-27B-Instruct and BgGPT-Gemma-2-9B-Instruct:
continually pretrained and fine-tuned versions of Google’s Gemma-2 models,
specifically optimized for Bulgarian language understanding and generation.
Leveraging Gemma-2’s multilingual capabilities and over 100 billion tokens of
Bulgarian and English text data, our models demonstrate strong performance in
Bulgarian language tasks, setting a new standard for language-specific AI
models. Our approach maintains the robust capabilities of the original Gemma-2
models, ensuring that the English language performance remains intact. To
preserve the base model capabilities, we incorporate continual learning
strategies based on recent Branch-and-Merge techniques as well as thorough
curation and selection of training data. We provide detailed insights into our
methodology, including the release of model weights with a commercial-friendly
license, enabling broader adoption by researchers, companies, and hobbyists.
Further, we establish a comprehensive set of benchmarks based on non-public
educational data sources to evaluate models on Bulgarian language tasks as well
as safety and chat capabilities. Our findings demonstrate the effectiveness of
fine-tuning state-of-the-art models like Gemma 2 to enhance language-specific
AI applications while maintaining cross-lingual capabilities.
[LINK]
http://arxiv.org/abs/2412.10893v1
[DATE]
2024-12-15 00:49:52+08:00
[CATEGORIES]
cs.CL
cs.LG
Uni-AdaFocus: Spatial-temporal Dynamic Computation for Video Recognition
[AUTHORS]
Yulin Wang, Haoji Zhang, Yang Yue, Shiji Song, Chao Deng, Junlan Feng, Gao Huang
[ABSTRACT]
This paper presents a comprehensive exploration of the phenomenon of data
redundancy in video understanding, with the aim of improving computational
efficiency. Our investigation commences with an examination of spatial
redundancy, which refers to the observation that the most informative region in
each video frame usually corresponds to a small image patch, whose shape, size
and location shift smoothly across frames. Motivated by this phenomenon, we
formulate the patch localization problem as a dynamic decision task, and
introduce a spatially adaptive video recognition approach, termed AdaFocus.
Specifically, a lightweight encoder is first employed to quickly process the full
video sequence, whose features are then utilized by a policy network to
identify the most task-relevant regions. Subsequently, the selected patches are
inferred by a high-capacity deep network for the final prediction. The full
model can be conveniently trained end-to-end. Furthermore, AdaFocus can be
extended by further considering temporal and sample-wise redundancies, i.e.,
allocating the majority of computation to the most task-relevant frames, and
minimizing the computation spent on relatively “easier” videos. Our resulting
approach, Uni-AdaFocus, establishes a comprehensive framework that seamlessly
integrates spatial, temporal, and sample-wise dynamic computation, while it
preserves the merits of AdaFocus in terms of efficient end-to-end training and
hardware friendliness. In addition, Uni-AdaFocus is general and flexible as it
is compatible with off-the-shelf efficient backbones (e.g., TSM and X3D), which
can be readily deployed as our feature extractor, yielding a significantly
improved computational efficiency. Empirically, extensive experiments based on
seven benchmark datasets and three application scenarios substantiate that
Uni-AdaFocus is considerably more efficient than the competitive baselines.
[COMMENTS]
Accepted by IEEE TPAMI. Journal version of arXiv:2105.03245
(AdaFocusV1, ICCV 2021 Oral), arXiv:2112.14238 (AdaFocusV2, CVPR 2022), and
arXiv:2209.13465 (AdaFocusV3, ECCV 2022). Code and pre-trained models:
https://github.com/LeapLabTHU/Uni-AdaFocus
[LINK]
http://arxiv.org/abs/2412.11228v1
[DATE]
2024-12-15 23:51:44+08:00
[CATEGORIES]
cs.LG
HAGNN: Hybrid Aggregation for Heterogeneous Graph Neural Networks
[AUTHORS]
Guanghui Zhu, Zhennan Zhu, Hongyang Chen, Chunfeng Yuan, Yihua Huang
[ABSTRACT]
Heterogeneous graph neural networks (GNNs) have been successful in handling
heterogeneous graphs. In existing heterogeneous GNNs, meta-paths play an
essential role. However, recent work has pointed out that a simple homogeneous
graph model without meta-paths can also achieve comparable results, which calls
into question the necessity of meta-paths. In this paper, we first present the
intrinsic difference between meta-path-based and meta-path-free models, i.e., how
to select neighbors for node aggregation. Then, we propose a novel framework to
utilize the rich type semantic information in heterogeneous graphs
comprehensively, namely HAGNN (Hybrid Aggregation for Heterogeneous GNNs). The
core of HAGNN is to leverage the meta-path neighbors and the directly connected
neighbors simultaneously for node aggregations. HAGNN divides the overall
aggregation process into two phases: meta-path-based intra-type aggregation and
meta-path-free inter-type aggregation. During the intra-type aggregation phase,
we propose a new data structure called fused meta-path graph and perform
structural semantic aware aggregation on it. Finally, we combine the embeddings
generated by each phase. Compared with existing heterogeneous GNN models, HAGNN
can take full advantage of the heterogeneity in heterogeneous graphs. Extensive
experimental results on node classification, node clustering, and link
prediction tasks show that HAGNN outperforms the existing models, demonstrating
the effectiveness of HAGNN.
[COMMENTS]
Accepted by IEEE TNNLS
[LINK]
http://arxiv.org/abs/2307.01636v2
[DATE]
2024-12-15 23:14:51+08:00
[CATEGORIES]
cs.LG
Neural Port-Hamiltonian Differential Algebraic Equations for Compositional Learning of Electrical Networks
[AUTHORS]
Cyrus Neary, Nathan Tsao, Ufuk Topcu
[ABSTRACT]
We develop compositional learning algorithms for coupled dynamical systems.
While deep learning has proven effective at modeling complex relationships from
data, compositional couplings between system components typically introduce
algebraic constraints on state variables, posing challenges to many existing
data-driven approaches to modeling dynamical systems. Towards developing deep
learning models for constrained dynamical systems, we introduce neural
port-Hamiltonian differential algebraic equations (N-PHDAEs), which use neural
networks to parametrize unknown terms in both the differential and algebraic
components of a port-Hamiltonian DAE. To train these models, we propose an
algorithm that uses automatic differentiation to perform index reduction,
automatically transforming the neural DAE into an equivalent system of neural
ordinary differential equations (N-ODEs), for which established model inference
and backpropagation methods exist. The proposed compositional modeling
framework and learning algorithms may be applied broadly to learn
control-oriented models of dynamical systems in a variety of application areas;
however, in this work, we focus on their application to the modeling of
electrical networks. Experiments simulating the dynamics of nonlinear circuits
exemplify the benefits of our approach: the proposed N-PHDAE model achieves an
order of magnitude improvement in prediction accuracy and constraint
satisfaction when compared to a baseline N-ODE over long prediction time
horizons. We also validate the compositional capabilities of our approach
through experiments on a simulated D.C. microgrid: we train individual N-PHDAE
models for separate grid components, before coupling them to accurately predict
the behavior of larger-scale networks.
[LINK]
http://arxiv.org/abs/2412.11215v1
[DATE]
2024-12-15 23:13:11+08:00
[CATEGORIES]
cs.LG
ConfigX: Modular Configuration for Evolutionary Algorithms via Multitask Reinforcement Learning
[AUTHORS]
Hongshu Guo, Zeyuan Ma, Jiacheng Chen, Yining Ma, Zhiguang Cao, Xinglin Zhang, Yue-Jiao Gong
[ABSTRACT]
Recent advances in Meta-learning for Black-Box Optimization (MetaBBO) have
shown the potential of using neural networks to dynamically configure
evolutionary algorithms (EAs), enhancing their performance and adaptability
across various BBO instances. However, they are often tailored to a specific
EA, which limits their generalizability and necessitates retraining or
redesigns for different EAs and optimization problems. To address this
limitation, we introduce ConfigX, a new paradigm of the MetaBBO framework that
is capable of learning a universal configuration agent (model) for boosting
diverse EAs. To achieve this, ConfigX first leverages a novel modularization
system that enables the flexible combination of various optimization
sub-modules to generate diverse EAs during training. Additionally, we propose a
Transformer-based neural network to meta-learn a universal configuration policy
through multitask reinforcement learning across a designed joint optimization
task space. Extensive experiments verify that ConfigX, after large-scale
pre-training, achieves robust zero-shot generalization to unseen tasks and
outperforms state-of-the-art baselines. Moreover, ConfigX exhibits strong
lifelong learning capabilities, allowing efficient adaptation to new tasks
through fine-tuning. Our proposed ConfigX represents a significant step toward
an automatic, all-purpose configuration agent for EAs.
[LINK]
http://arxiv.org/abs/2412.07507v2
[DATE]
2024-12-15 23:09:38+08:00
[CATEGORIES]
cs.LG
Deep Learning-based Approaches for State Space Models: A Selective Review
[AUTHORS]
Jiahe Lin, George Michailidis
[ABSTRACT]
State-space models (SSMs) offer a powerful framework for dynamical system
analysis, wherein the temporal dynamics of the system are assumed to be
captured through the evolution of the latent states, which govern the values of
the observations. This paper provides a selective review of recent advancements
in deep neural network-based approaches for SSMs, and presents a unified
perspective for discrete time deep state space models and continuous time ones
such as latent neural Ordinary Differential and Stochastic Differential
Equations. It starts with an overview of the classical maximum likelihood based
approach for learning SSMs, reviews variational autoencoder as a general
learning pipeline for neural network-based approaches in the presence of latent
variables, and discusses in detail representative deep learning models that
fall under the SSM framework. Very recent developments, where SSMs are used as
standalone architectural modules for improving efficiency in sequence modeling,
are also examined. Finally, examples involving mixed frequency and
irregularly-spaced time series data are presented to demonstrate the advantage
of SSMs in these settings.
[LINK]
http://arxiv.org/abs/2412.11211v1
[DATE]
2024-12-15 23:04:35+08:00
[CATEGORIES]
cs.LG
ProFe: Communication-Efficient Decentralized Federated Learning via Distillation and Prototypes
[AUTHORS]
Pedro Miguel Sánchez Sánchez, Enrique Tomás Martínez Beltrán, Miguel Fernández Llamas, Gérôme Bovet, Gregorio Martínez Pérez, Alberto Huertas Celdrán
[ABSTRACT]
Decentralized Federated Learning (DFL) trains models in a collaborative and
privacy-preserving manner while removing model centralization risks and
improving communication bottlenecks. However, DFL faces challenges in efficient
communication management and model aggregation within decentralized
environments, especially with heterogeneous data distributions. Thus, this
paper introduces ProFe, a novel communication optimization algorithm for DFL
that combines knowledge distillation, prototype learning, and quantization
techniques. ProFe utilizes knowledge from large local models to train smaller
ones for aggregation, incorporates prototypes to better learn unseen classes,
and applies quantization to reduce data transmitted during communication
rounds. The performance of ProFe has been validated and compared to the
literature by using benchmark datasets like MNIST, CIFAR10, and CIFAR100.
Results showed that the proposed algorithm reduces communication costs by up to
~40-50% while maintaining or improving model performance. In addition, it adds
~20% training time due to increased complexity, generating a trade-off.
[LINK]
http://arxiv.org/abs/2412.11207v1
[DATE]
2024-12-15 22:49:29+08:00
[CATEGORIES]
cs.LG
Concept Learning in the Wild: Towards Algorithmic Understanding of Neural Networks
[AUTHORS]
Elad Shoham, Hadar Cohen, Khalil Wattad, Havana Rika, Dan Vilenchik
[ABSTRACT]
Explainable AI (XAI) methods typically focus on identifying essential input
features or more abstract concepts for tasks like image or text classification.
However, for algorithmic tasks like combinatorial optimization, these concepts
may depend not only on the input but also on the current state of the network,
like in the graph neural networks (GNN) case. This work studies concept
learning for an existing GNN model trained to solve Boolean satisfiability
(SAT). Our analysis reveals that the model learns key
concepts matching those guiding human-designed SAT heuristics, particularly the
notion of ‘support.’ We demonstrate that these concepts are encoded in the top
principal components (PCs) of the embedding’s covariance matrix, allowing for
unsupervised discovery. Using sparse PCA, we establish the minimality of these
concepts and show their teachability through a simplified GNN. Two direct
applications of our framework are: (a) we improve the convergence time of the
classical WalkSAT algorithm, and (b) we use the discovered concepts to
“reverse-engineer” the black-box GNN and rewrite it as a white-box textbook
algorithm. Our results highlight the potential of concept learning in
understanding and enhancing algorithmic neural networks for combinatorial
optimization tasks.
[LINK]
http://arxiv.org/abs/2412.11205v1
[DATE]
2024-12-15 22:37:56+08:00
[CATEGORIES]
cs.LG
GNNs-to-MLPs by Teacher Injection and Dirichlet Energy Distillation
[AUTHORS]
Ziang Zhou, Zhihao Ding, Jieming Shi, Qing Li, Shiqi Shen
[ABSTRACT]
Graph Neural Networks (GNNs) are fundamental to graph-based learning and
excel in node classification tasks. However, GNNs suffer from scalability
issues due to the need for multi-hop data during inference, limiting their use
in latency-sensitive applications. Recent studies attempt to distill GNNs into
multi-layer perceptrons (MLPs) for faster inference. They typically treat GNN
and MLP models as single units for distillation, insufficiently utilizing the
fine-grained knowledge within GNN layers. In this paper, we propose TINED, a
novel method that distills GNNs to MLPs layer-wise through Teacher Injection
with fine-tuning and Dirichlet Energy Distillation techniques. We analyze key
operations in GNN layers, feature transformation (FT) and graph propagation
(GP), and identify that an FT performs the same computation as a
fully-connected (FC) layer in MLPs. Thus, we propose directly injecting
valuable teacher parameters of an FT in a GNN into an FC layer of the student
MLP, assisted by fine-tuning. In TINED, FC layers in an MLP mirror the order of
the corresponding FTs and GPs in the GNN. We provide a theoretical bound on the
approximation of GPs. Moreover, we observe that in a GNN layer, FT and GP
operations often have opposing smoothing effects: GP is aggressive, while FT is
conservative, in smoothing. Using Dirichlet energy, we design a DE ratio to
quantify these smoothing effects and propose Dirichlet Energy Distillation to
distill these characteristics from GNN layers to MLP layers. Extensive
experiments demonstrate that TINED achieves superior performance over GNNs and
state-of-the-art distillation methods under various settings across seven
datasets. The code is in the supplementary material.
[COMMENTS]
14 pages
[LINK]
http://arxiv.org/abs/2412.11180v1
[DATE]
2024-12-15 21:18:56+08:00
[CATEGORIES]
cs.LG
pFedGPA: Diffusion-based Generative Parameter Aggregation for Personalized Federated Learning
[AUTHORS]
Jiahao Lai, Jiaqi Li, Jian Xu, Yanru Wu, Boshi Tang, Siqi Chen, Yongfeng Huang, Wenbo Ding, Yang Li
[ABSTRACT]
Federated Learning (FL) offers a decentralized approach to model training,
where data remains local and only model parameters are shared between the
clients and the central server. Traditional methods, such as Federated
Averaging (FedAvg), linearly aggregate these parameters which are usually
trained on heterogeneous data distributions, potentially overlooking the
complex, high-dimensional nature of the parameter space. This can result in
degraded performance of the aggregated model. While personalized FL approaches
can mitigate the heterogeneous data issue to some extent, the limitation of
linear aggregation remains unresolved. To alleviate this issue, we investigate
the generative approach of diffusion models and propose a novel generative
parameter aggregation framework for personalized FL, pFedGPA. In this
framework, we deploy a diffusion model on the server to integrate the diverse
parameter distributions and propose a parameter inversion method to efficiently
generate a set of personalized parameters for each client. This inversion
method transforms the uploaded parameters into a latent code, which is then
aggregated through denoising sampling to produce the final personalized
parameters. By encoding the dependence of a client’s model parameters on the
specific data distribution using the high-capacity diffusion model,
pFedGPA can effectively decouple the complexity of the overall
distribution of all clients’ model parameters from the complexity of each
individual client’s parameter distribution. Our experimental results
consistently demonstrate the superior performance of the proposed method across
multiple datasets, surpassing baseline approaches.
[LINK]
http://arxiv.org/abs/2409.05701v2
[DATE]
2024-12-15 21:15:09+08:00
[CATEGORIES]
cs.LG
A Progressive Transformer for Unifying Binary Code Embedding and Knowledge Transfer
[AUTHORS]
Hanxiao Lu, Hongyu Cai, Yiming Liang, Antonio Bianchi, Z. Berkay Celik
[ABSTRACT]
Language model approaches have recently been integrated into binary analysis
tasks, such as function similarity detection and function signature recovery.
These models typically employ a two-stage training process: pre-training via
Masked Language Modeling (MLM) on machine code and fine-tuning for specific
tasks. While MLM helps to understand binary code structures, it ignores
essential code characteristics, including control and data flow, which
negatively affects model generalization. Recent work leverages domain-specific
features (e.g., control flow graphs and dynamic execution traces) in
transformer-based approaches to improve binary code semantic understanding.
However, this approach involves complex feature engineering, a cumbersome and
time-consuming process that can introduce predictive uncertainty when dealing
with stripped or obfuscated code, leading to a performance drop. In this paper,
we introduce ProTST, a novel transformer-based methodology for binary code
embedding. ProTST employs a hierarchical training process based on a unique
tree-like structure, where knowledge progressively flows from fundamental tasks
at the root to more specialized tasks at the leaves. This progressive
teacher-student paradigm allows the model to build upon previously learned
knowledge, resulting in high-quality embeddings that can be effectively
leveraged for diverse downstream binary analysis tasks. The effectiveness of
ProTST is evaluated in seven binary analysis tasks, and the results show that
ProTST yields an average validation score (F1, MRR, and Recall@1) improvement
of 14.8% compared to traditional two-stage training and an average improvement
of 10.7% compared to multimodal two-stage frameworks.
[LINK]
http://arxiv.org/abs/2412.11177v1
[DATE]
2024-12-15 21:04:29+08:00
[CATEGORIES]
cs.LG
Semi-Supervised Risk Control via Prediction-Powered Inference
[AUTHORS]
Bat-Sheva Einbinder, Liran Ringel, Yaniv Romano
[ABSTRACT]
The risk-controlling prediction sets (RCPS) framework is a general tool for
transforming the output of any machine learning model to design a predictive
rule with rigorous error rate control. The key idea behind this framework is to
use labeled hold-out calibration data to tune a hyper-parameter that affects
the error rate of the resulting prediction rule. However, the limitation of
such a calibration scheme is that with limited hold-out data, the tuned
hyper-parameter becomes noisy and leads to a prediction rule with an error rate
that is often unnecessarily conservative. To overcome this sample-size barrier,
we introduce a semi-supervised calibration procedure that leverages unlabeled
data to rigorously tune the hyper-parameter without compromising statistical
validity. Our procedure builds upon the prediction-powered inference framework,
carefully tailoring it to risk-controlling tasks. We demonstrate the benefits
and validity of our proposal through two real-data experiments: few-shot image
classification and early time series classification.
[LINK]
http://arxiv.org/abs/2412.11174v1
[DATE]
2024-12-15 21:00:23+08:00
[CATEGORIES]
cs.LG
Learning Latent Spaces for Domain Generalization in Time Series Forecasting
[AUTHORS]
Songgaojun Deng, Maarten de Rijke
[ABSTRACT]
Time series forecasting is vital in many real-world applications, yet
developing models that generalize well on unseen relevant domains – such as
forecasting web traffic data on new platforms/websites or estimating e-commerce
demand in new regions – remains underexplored. Existing forecasting models
often struggle with domain shifts in time series data, as the temporal patterns
involve complex components like trends, seasonality, etc. While some prior work
addresses this by matching feature distributions across domains or
disentangling domain-shared features using label information, they fail to
reveal insights into the latent temporal dependencies, which are critical for
identifying common patterns across domains and achieving generalization.
We propose a framework for domain generalization in time series forecasting
by mining the latent factors that govern temporal dependencies across domains.
Our approach uses a decomposition-based architecture with a new Conditional
$\beta$-Variational Autoencoder (VAE), wherein time series data is first
decomposed into trend-cyclical and seasonal components, each modeled
independently through separate $\beta$-VAE modules. The $\beta$-VAE aims to
capture disentangled latent factors that control temporal dependencies across
domains. We enhance the learning of domain-specific information with a
decoder-conditional design and introduce domain regularization to improve the
separation of domain-shared and domain-specific latent factors. Our proposed
method is flexible and can be applied to various time series forecasting
models, enabling effective domain generalization with simplicity and
efficiency. We validate its effectiveness on five real-world time series
datasets, covering web traffic, e-commerce, finance and power consumption,
demonstrating improved generalization performance over state-of-the-art
methods.
[COMMENTS]
18 pages with 8 figures
[LINK]
http://arxiv.org/abs/2412.11171v1
[DATE]
2024-12-15 20:41:53+08:00
[CATEGORIES]
cs.LG
PGD-Imp: Rethinking and Unleashing Potential of Classic PGD with Dual Strategies for Imperceptible Adversarial Attacks
[AUTHORS]
Jin Li, Zitong Yu, Ziqiang He, Z. Jane Wang, Xiangui Kang
[ABSTRACT]
Imperceptible adversarial attacks have recently attracted increasing research
interest. Existing methods typically incorporate external modules or loss
terms other than a simple $l_p$-norm into the attack process to achieve
imperceptibility, while we argue that such additional designs may not be
necessary. In this paper, we rethink the essence of imperceptible attacks and
propose two simple yet effective strategies to unleash the potential of PGD,
the common and classical attack, for imperceptibility from an optimization
perspective. Specifically, the Dynamic Step Size is introduced to find the
optimal solution with minimal attack cost towards the decision boundary of the
attacked model, and the Adaptive Early Stop strategy is adopted to reduce the
redundant strength of adversarial perturbations to the minimum level. The
proposed PGD-Imperceptible (PGD-Imp) attack achieves state-of-the-art results
in imperceptible adversarial attacks for both untargeted and targeted
scenarios. When performing untargeted attacks against ResNet-50, PGD-Imp
attains 100% (+0.3%) ASR, 0.89 (-1.76) $l_2$ distance, and 52.93 (+9.2)
PSNR with 57s (-371s) running time, significantly outperforming existing
methods.
[LINK]
http://arxiv.org/abs/2412.11168v1
[DATE]
2024-12-15 20:34:22+08:00
[CATEGORIES]
cs.LG
OTLRM: Orthogonal Learning-based Low-Rank Metric for Multi-Dimensional Inverse Problems
[AUTHORS]
Xiangming Wang, Haijin Zeng, Jiaoyang Chen, Sheng Liu, Yongyong Chen, Guoqing Chao
[ABSTRACT]
In real-world scenarios, complex data such as multispectral images and
multi-frame videos inherently exhibit a robust low-rank property. This property
is vital for multi-dimensional inverse problems, such as tensor completion,
spectral imaging reconstruction, and multispectral image denoising. Existing
tensor singular value decomposition (t-SVD) definitions rely on hand-designed
or pre-given transforms, which lack flexibility for defining tensor nuclear
norm (TNN). The TNN-regularized optimization problem is solved by the singular
value thresholding (SVT) operator, which leverages the t-SVD framework to
obtain the low-rank tensor. However, it is quite complicated to introduce SVT
into deep neural networks due to the numerical instability problem in solving
the derivatives of the eigenvectors. In this paper, we introduce a novel
data-driven generative low-rank t-SVD model based on the learnable orthogonal
transform, which can be naturally solved under its representation. Prompted by
the linear algebra theorem of the Householder transformation, our learnable
orthogonal transform is achieved by constructing an endogenously orthogonal
matrix adaptable to neural networks, optimizing it as arbitrary orthogonal
matrices. Additionally, we propose a low-rank solver as a generalization of
SVT, which utilizes an efficient representation of generative networks to
obtain low-rank structures. Extensive experiments highlight its significant
restoration enhancements.
[COMMENTS]
AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11165v1
[DATE]
2024-12-15 20:28:57+08:00
[CATEGORIES]
cs.LG
Missing data imputation for noisy time-series data and applications in healthcare
[AUTHORS]
Lien P. Le, Xuan-Hien Nguyen Thi, Thu Nguyen, Michael A. Riegler, Pål Halvorsen, Binh T. Nguyen
[ABSTRACT]
Healthcare time series data is vital for monitoring patient activity but
often contains noise and missing values due to various reasons such as sensor
errors or data interruptions. Imputation, i.e., filling in the missing values,
is a common way to deal with this issue. In this study, we compare imputation
methods, including Multiple Imputation with Random Forest (MICE-RF) and
advanced deep learning approaches (SAITS, BRITS, Transformer) for noisy,
missing time series data in terms of MAE, F1-score, AUC, and MCC, across
missing data rates (10%-80%). Our results show that MICE-RF imputes missing
data effectively compared with the deep learning methods, and the improved
classification performance on the imputed data indicates that imputation can
have denoising effects. Therefore, applying an imputation algorithm to time
series with missing data can also denoise the data.
[LINK]
http://arxiv.org/abs/2412.11164v1
[DATE]
2024-12-15 20:23:20+08:00
[CATEGORIES]
cs.LG
Learning Parameter Sharing with Tensor Decompositions and Sparsity
[AUTHORS]
Cem Üyük, Mike Lasby, Mohamed Yassin, Utku Evci, Yani Ioannou
[ABSTRACT]
Large neural networks achieve remarkable performance, but their size hinders
deployment on resource-constrained devices. While various compression
techniques exist, parameter sharing remains relatively unexplored. This paper
introduces Fine-grained Parameter Sharing (FiPS), a novel algorithm that
leverages the relationship between parameter sharing, tensor decomposition, and
sparsity to efficiently compress large vision transformer models. FiPS employs
a shared base and sparse factors to represent shared neurons across multi-layer
perceptron (MLP) modules. Shared parameterization is initialized via Singular
Value Decomposition (SVD) and optimized by minimizing block-wise reconstruction
error. Experiments demonstrate that FiPS compresses DeiT-B and Swin-L MLPs to
25-40% of their original parameter count while maintaining accuracy within 1
percentage point of the original models.
[LINK]
http://arxiv.org/abs/2411.09816v2
[DATE]
2024-12-15 20:12:48+08:00
[CATEGORIES]
cs.LG
Supervised Score-Based Modeling by Gradient Boosting
[AUTHORS]
Changyuan Zhao, Hongyang Du, Guangyuan Liu, Dusit Niyato
[ABSTRACT]
Score-based generative models can effectively learn the distribution of data
by estimating the gradient of the distribution. Due to the multi-step denoising
characteristic, researchers have recently considered combining score-based
generative models with the gradient boosting algorithm, a multi-step supervised
learning algorithm, to solve supervised learning tasks. However, existing
generative model algorithms are often limited by the stochastic nature of the
models and the long inference time, impacting prediction performance.
Therefore, we propose a Supervised Score-based Model (SSM), which can be viewed
as a gradient boosting algorithm combining score matching. We provide a
theoretical analysis of learning and sampling for SSM to balance inference time
and prediction accuracy. Through ablation experiments on selected examples, we
demonstrate the strong performance of the proposed techniques.
Additionally, we compare our model with other probabilistic models, including
Natural Gradient Boosting (NGBoost), Classification and Regression Diffusion
Models (CARD), Diffusion Boosted Trees (DBT), and non-probabilistic GBM models.
The experimental results show that our model outperforms existing models in
both accuracy and inference time.
[LINK]
http://arxiv.org/abs/2411.01159v2
[DATE]
2024-12-15 20:02:51+08:00
[CATEGORIES]
cs.LG
GDSG: Graph Diffusion-based Solution Generator for Optimization Problems in MEC Networks
[AUTHORS]
Ruihuai Liang, Bo Yang, Pengyu Chen, Xuelin Cao, Zhiwen Yu, Mérouane Debbah, Dusit Niyato, H. Vincent Poor, Chau Yuen
[ABSTRACT]
Optimization is crucial for MEC networks to function efficiently and
reliably, but most of the optimization problems involved are NP-hard and lack
efficient approximation algorithms. This leads to a paucity of optimal
solutions, constraining the effectiveness of conventional deep learning
approaches. Most existing
learning-based methods necessitate extensive optimal data and fail to exploit
the potential benefits of suboptimal data that can be obtained with greater
efficiency and effectiveness. Taking the multi-server multi-user computation
offloading (MSCO) problem, which is widely observed in systems like
Internet-of-Vehicles (IoV) and Unmanned Aerial Vehicle (UAV) networks, as a
concrete scenario, we present a Graph Diffusion-based Solution Generation
(GDSG) method. This approach is designed to work with suboptimal datasets while
converging to the optimal solution with high probability. We transform the
optimization issue into distribution-learning and offer a clear explanation of
learning from suboptimal training datasets. We build GDSG as a multi-task
diffusion model utilizing a Graph Neural Network (GNN) to acquire the
distribution of high-quality solutions. We use a simple and efficient heuristic
approach to obtain a sufficient amount of training data composed entirely of
suboptimal solutions. In our implementation, we enhance the backbone GNN and
achieve improved generalization. GDSG also reaches nearly 100% task
orthogonality, ensuring no interference between the discrete and continuous
generation tasks. We further reveal that this orthogonality arises from the
diffusion-related training loss, rather than the neural network architecture
itself. The experiments demonstrate that GDSG surpasses other benchmark methods
on both the optimal and suboptimal training datasets. The MSCO datasets are
open-sourced at http://ieee-dataport.org/13824, and the GDSG algorithm code is
available at https://github.com/qiyu3816/GDSG.
[LINK]
http://arxiv.org/abs/2412.08296v2
[DATE]
2024-12-15 19:50:43+08:00
[CATEGORIES]
cs.LG
Early Concept Drift Detection via Prediction Uncertainty
[AUTHORS]
Pengqian Lu, Jie Lu, Anjin Liu, Guangquan Zhang
[ABSTRACT]
Concept drift, characterized by unpredictable changes in data distribution
over time, poses significant challenges to machine learning models in streaming
data scenarios. Although error rate-based concept drift detectors are widely
used, they often fail to identify drift in the early stages when the data
distribution changes but error rates remain constant. This paper introduces the
Prediction Uncertainty Index (PU-index), derived from the prediction
uncertainty of the classifier, as a superior alternative to the error rate for
drift detection. Our theoretical analysis demonstrates that: (1) The PU-index
can detect drift even when error rates remain stable. (2) Any change in the
error rate will lead to a corresponding change in the PU-index. These
properties make the PU-index a more sensitive and robust indicator for drift
detection compared to existing methods. We also propose a PU-index-based Drift
Detector (PUDD) that employs a novel Adaptive PU-index Bucketing algorithm for
detecting drift. Empirical evaluations on both synthetic and real-world
datasets demonstrate PUDD’s efficacy in detecting drift in structured and image
data.
[COMMENTS]
Accepted by AAAI-2025
[LINK]
http://arxiv.org/abs/2412.11158v1
[DATE]
2024-12-15 19:43:53+08:00
[CATEGORIES]
cs.LG
Partial Identifiability in Inverse Reinforcement Learning For Agents With Non-Exponential Discounting
[AUTHORS]
Joar Skalse, Alessandro Abate
[ABSTRACT]
The aim of inverse reinforcement learning (IRL) is to infer an agent’s
preferences from observing their behaviour. Usually, preferences are modelled
as a reward function, $R$, and behaviour is modelled as a policy, $\pi$. One of
the central difficulties in IRL is that multiple preferences may lead to the
same observed behaviour. That is, $R$ is typically underdetermined by $\pi$,
which means that $R$ is only partially identifiable. Recent work has
characterised the extent of this partial identifiability for different types of
agents, including optimal and Boltzmann-rational agents. However, work so far
has only considered agents that discount future reward exponentially: this is a
serious limitation, especially given that extensive work in the behavioural
sciences suggests that humans are better modelled as discounting
hyperbolically. In this work, we newly characterise partial identifiability in
IRL for agents with non-exponential discounting: our results are in particular
relevant for hyperbolic discounting, but they also more generally apply to
agents that use other types of (non-exponential) discounting. Significantly, we
show that IRL is generally unable to infer enough information about $R$ to
identify the correct optimal policy, which entails that IRL alone can be
insufficient to adequately characterise the preferences of such agents.
[LINK]
http://arxiv.org/abs/2412.11155v1
[DATE]
2024-12-15 19:08:58+08:00
[CATEGORIES]
cs.LG
ViSymRe: Vision-guided Multimodal Symbolic Regression
[AUTHORS]
Da Li, Junping Yin, Jin Xu, Xinxin Li, Juan Zhang
[ABSTRACT]
Symbolic regression automatically searches for mathematical equations to
reveal underlying mechanisms within datasets, offering enhanced
interpretability compared to black box models. Traditionally, symbolic
regression has been considered to be purely numeric-driven, with insufficient
attention given to the potential contributions of visual information in
augmenting this process. When dealing with high-dimensional and complex
datasets, existing symbolic regression models are often inefficient and tend to
generate overly complex equations, making subsequent mechanism analysis
complicated. In this paper, we propose the vision-guided multimodal symbolic
regression model, called ViSymRe, that systematically explores how visual
information can improve various metrics of symbolic regression. Compared to
traditional models, our proposed model has the following innovations: (1) It
integrates three modalities: vision, symbol and numeric to enhance symbolic
regression, enabling the model to benefit from the strengths of each modality;
(2) It establishes a meta-learning framework that can learn from historical
experiences to efficiently solve new symbolic regression problems; (3) It
emphasizes the simplicity and structural rationality of the equations rather
than merely numerical fitting. Extensive experiments show that our proposed
model exhibits strong generalization capability and noise resistance. The
equations it generates outperform state-of-the-art numeric-only baselines in
terms of fitting effect, simplicity and structural accuracy, thus being able to
facilitate accurate mechanism analysis and the development of theoretical
models.
[LINK]
http://arxiv.org/abs/2412.11139v1
[DATE]
2024-12-15 18:05:31+08:00
[CATEGORIES]
cs.LG
Safe Reinforcement Learning using Finite-Horizon Gradient-based Estimation
[AUTHORS]
Juntao Dai, Yaodong Yang, Qian Zheng, Gang Pan
[ABSTRACT]
A key aspect of Safe Reinforcement Learning (Safe RL) involves estimating the
constraint condition for the next policy, which is crucial for guiding the
optimization of safe policy updates. However, the existing Advantage-based
Estimation (ABE) method relies on the infinite-horizon discounted advantage
function. This dependence leads to catastrophic errors in finite-horizon
scenarios with non-discounted constraints, resulting in safety-violation
updates. In response, we propose the first estimation method for finite-horizon
non-discounted constraints in deep Safe RL, termed Gradient-based Estimation
(GBE), which relies on the analytic gradient derived along trajectories. Our
theoretical and empirical analyses demonstrate that GBE can effectively
estimate constraint changes over a finite horizon. Constructing a surrogate
optimization problem with GBE, we developed a novel Safe RL algorithm called
Constrained Gradient-based Policy Optimization (CGPO). CGPO identifies feasible
optimal policies by iteratively resolving sub-problems within trust regions.
Our empirical results reveal that CGPO, unlike baseline algorithms,
successfully estimates the constraint functions of subsequent policies, thereby
ensuring the efficiency and feasibility of each update.
[LINK]
http://arxiv.org/abs/2412.11138v1
[DATE]
2024-12-15 18:05:23+08:00
[CATEGORIES]
cs.LG
Modeling the Heterogeneous Duration of User Interest in Time-Dependent Recommendation: A Hidden Semi-Markov Approach
[AUTHORS]
Haidong Zhang, Wancheng Ni, Xin Li, Yiping Yang
[ABSTRACT]
Recommender systems are widely used for suggesting books, educational
materials, and products to users by exploring their behaviors. In reality,
users’ preferences often change over time, leading to studies on time-dependent
recommender systems. However, most existing approaches that deal with time
information remain primitive. In this paper, we extend existing methods and
propose a hidden semi-Markov model to track the change of users’ interests.
Particularly, this model allows for capturing the different durations of user
stays in a (latent) interest state, which can better model the heterogeneity of
user interests and focuses. We derive an expectation maximization algorithm to
estimate the parameters of the framework and predict users’ actions.
Experiments on three real-world datasets show that our model significantly
outperforms the state-of-the-art time-dependent and static benchmark methods.
Further analyses of the experiment results indicate that the performance
improvement is related to the heterogeneity of state durations and the drift of
user interests in the dataset.
[LINK]
http://arxiv.org/abs/2412.11127v1
[DATE]
2024-12-15 17:17:45+08:00
[CATEGORIES]
cs.LG
Attacking Byzantine Robust Aggregation in High Dimensions
[AUTHORS]
Sarthak Choudhary, Aashish Kolluri, Prateek Saxena
[ABSTRACT]
Training modern neural networks or models typically requires averaging over a
sample of high-dimensional vectors. Poisoning attacks can skew or bias the
average vectors used to train the model, forcing the model to learn specific
patterns or avoid learning anything useful. Byzantine robust aggregation is a
principled algorithmic defense against such biasing. Robust aggregators can
bound the maximum bias in computing centrality statistics, such as mean, even
when some fraction of inputs are arbitrarily corrupted. Designing such
aggregators is challenging when dealing with high dimensions. However, the
first polynomial-time algorithms with strong theoretical bounds on the bias
have recently been proposed. Their bounds are independent of the number of
dimensions, promising a conceptual limit on the power of poisoning attacks in
their ongoing arms race against defenses.
In this paper, we show a new attack called HIDRA on practical realizations of
strong defenses, which subverts their claim of dimension-independent bias. HIDRA
highlights a novel computational bottleneck that has not been a concern of
prior information-theoretic analysis. Our experimental evaluation shows that
our attacks almost completely destroy the model performance, whereas existing
attacks with the same goal fail to have much effect. Our findings leave the
arms race between poisoning attacks and provable defenses wide open.
[LINK]
http://arxiv.org/abs/2312.14461v3
[DATE]
2024-12-15 17:04:29+08:00
[CATEGORIES]
cs.LG
Hierarchical Bidirectional Transition Dispersion Entropy-based Lempel-Ziv Complexity and Its Application in Fault-Bearing Diagnosis
[AUTHORS]
Runze Jiang, Pengjian Shang
[ABSTRACT]
Lempel-Ziv complexity (LZC) is a key measure for detecting the irregularity
and complexity of nonlinear time series and has seen various improvements in
recent decades. However, existing LZC-based metrics, such as Permutation
Lempel-Ziv complexity (PLZC) and Dispersion-Entropy based Lempel-Ziv complexity
(DELZC), focus mainly on patterns of independent embedding vectors, often
overlooking the transition patterns within the time series. To address this
gap, this paper introduces a novel LZC-based method called Bidirectional
Transition Dispersion Entropy-based Lempel-Ziv complexity (BT-DELZC).
Leveraging Markov chain theory, this method integrates a bidirectional
transition network framework with DELZC to better capture dynamic signal
information. Additionally, an improved hierarchical decomposition algorithm is
used to extract features from various frequency components of the time series.
The proposed BT-DELZC method is first evaluated through four simulated
experiments, demonstrating its robustness and effectiveness in characterizing
nonlinear time series. Additionally, two fault-bearing diagnosis experiments
are conducted by combining the hierarchical BT-DELZC method with various
classifiers from the machine learning domain. The results indicate that
BT-DELZC achieves the highest accuracy across both datasets, significantly
outperforming existing methods such as LZC, PLZC, and DELZC in extracting
features related to fault bearings.
[LINK]
http://arxiv.org/abs/2412.11123v1
[DATE]
2024-12-15 17:00:28+08:00
[CATEGORIES]
cs.LG
Paid with Models: Optimal Contract Design for Collaborative Machine Learning
[AUTHORS]
Bingchen Wang, Zhaoxuan Wu, Fusheng Liu, Bryan Kian Hsiang Low
[ABSTRACT]
Collaborative machine learning (CML) provides a promising paradigm for
democratizing advanced technologies by enabling cost-sharing among
participants. However, the potential for rent-seeking behaviors among parties
can undermine such collaborations. Contract theory presents a viable solution
by rewarding participants with models of varying accuracy based on their
contributions. However, unlike monetary compensation, using models as rewards
introduces unique challenges, particularly due to the stochastic nature of
these rewards when contribution costs are privately held information. This
paper formalizes the optimal contracting problem within CML and proposes a
transformation that simplifies the non-convex optimization problem into one
that can be solved through convex optimization algorithms. We conduct a
detailed analysis of the properties that an optimal contract must satisfy when
models serve as the rewards, and we explore the potential benefits and welfare
implications of these contract-driven CML schemes through numerical
experiments.
[COMMENTS]
Accepted for publication in AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11122v1
[DATE]
2024-12-15 16:55:16+08:00
[CATEGORIES]
cs.LG
Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning
[AUTHORS]
Yun Qu, Yuhang Jiang, Boyuan Wang, Yixiu Mao, Cheems Wang, Chang Liu, Xiangyang Ji
[ABSTRACT]
Reinforcement learning (RL) often encounters delayed and sparse feedback in
real-world applications, even with only episodic rewards. Previous approaches
have made some progress in reward redistribution for credit assignment but
still face challenges, including training difficulties due to redundancy and
ambiguous attributions stemming from overlooking the multifaceted nature of
mission performance evaluation. Fortunately, Large Language Models (LLMs)
encompass rich decision-making knowledge and provide a plausible tool
for reward redistribution. Even so, deploying LLMs in this case is non-trivial
due to the misalignment between linguistic knowledge and the symbolic form
requirement, together with inherent randomness and hallucinations in inference.
To tackle these issues, we introduce LaRe, a novel LLM-empowered symbolic-based
decision-making framework, to improve credit assignment. Key to LaRe is the
concept of the Latent Reward, which works as a multi-dimensional performance
evaluation, enabling more interpretable goal attainment from various
perspectives and facilitating more effective reward redistribution. We find
that semantically generated code from the LLM can bridge linguistic knowledge and
symbolic latent rewards, as it is executable for symbolic objects. Meanwhile,
we design latent reward self-verification to increase the stability and
reliability of LLM inference. Theoretically, reward-irrelevant redundancy
elimination in the latent reward benefits RL performance through more accurate
reward estimation. Extensive experimental results show that LaRe (i)
achieves superior temporal credit assignment to SOTA methods, (ii) excels in
allocating contributions among multiple agents, and (iii) outperforms policies
trained with ground truth rewards for certain tasks.
[LINK]
http://arxiv.org/abs/2412.11120v1
[DATE]
2024-12-15 16:51:14+08:00
[CATEGORIES]
cs.LG
Impact of Adversarial Attacks on Deep Learning Model Explainability
[AUTHORS]
Gazi Nazia Nur, Mohammad Ahnaf Sadat
[ABSTRACT]
In this paper, we investigate the impact of adversarial attacks on the
explainability of deep learning models, which are commonly criticized for their
black-box nature despite their capacity for autonomous feature extraction. This
black-box nature can affect the perceived trustworthiness of these models. To
address this, explainability techniques such as GradCAM, SmoothGrad, and LIME
have been developed to clarify model decision-making processes. Our research
focuses on the robustness of these explanations when models are subjected to
adversarial attacks, specifically those involving subtle image perturbations
that are imperceptible to humans but can significantly mislead models. For
this, we utilize attack methods like the Fast Gradient Sign Method (FGSM) and
the Basic Iterative Method (BIM) and observe their effects on model accuracy
and explanations. The results reveal a substantial decline in model accuracy,
with accuracies dropping from 89.94% to 58.73% and 45.50% under FGSM and BIM
attacks, respectively. Despite these declines in accuracy, the explanation of
the models measured by metrics such as Intersection over Union (IoU) and Root
Mean Square Error (RMSE) shows negligible changes, suggesting that these
metrics may not be sensitive enough to detect the presence of adversarial
perturbations.
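As a rough illustration of the two attacks named above, the sketch below applies FGSM and BIM to a toy logistic-regression classifier in pure Python. The model, the helper names (`predict_proba`, `fgsm`, `bim`), and all parameter values are hypothetical stand-ins for the image models studied in the paper; only the update rules (a single signed-gradient step for FGSM, repeated and clipped steps for BIM) follow the standard definitions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a toy logistic-regression model (assumed model)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One FGSM step: move each input feature by eps along the sign of the loss gradient."""
    p = predict_proba(w, b, x)
    # Gradient of the cross-entropy loss with respect to the input x is (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

def bim(w, b, x, y, eps, alpha, steps):
    """BIM: repeat small FGSM steps of size alpha, clipping to the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        x_adv = fgsm(w, b, x_adv, y, alpha)
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv
```

For a correctly classified input, both functions return a perturbed input on which the model's confidence in the true label is lower, while every feature stays within `eps` of its original value.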
[COMMENTS]
29 pages with reference included, submitted to a journal
[LINK]
http://arxiv.org/abs/2412.11119v1
[DATE]
2024-12-15 16:41:37+08:00
[CATEGORIES]
cs.LG
Optimal Top-Two Method for Best Arm Identification and Fluid Analysis
[AUTHORS]
Agniv Bandyopadhyay, Sandeep Juneja, Shubhada Agrawal
[ABSTRACT]
Top-$2$ methods have become popular in solving the best arm identification
(BAI) problem. The best arm, or the arm with the largest mean amongst finitely
many, is identified through an algorithm that at any sequential step
independently pulls the empirical best arm, with a fixed probability $\beta$,
and pulls the best challenger arm otherwise. The probability of incorrect
selection is guaranteed to lie below a specified $\delta >0$. Information
theoretic lower bounds on sample complexity are well known for the BAI problem and
are matched asymptotically as $\delta \rightarrow 0$ by computationally
demanding plug-in methods. The above top-2 algorithm for any $\beta \in (0,1)$
has sample complexity within a constant of the lower bound. However,
determining the optimal $\beta$ that matches the lower bound has proven
difficult. In this paper, we address this and propose an optimal top-2 type
algorithm. We consider a function of allocations anchored at a threshold. If it
exceeds the threshold then the algorithm samples the empirical best arm.
Otherwise, it samples the challenger arm. We show that the proposed algorithm
is optimal as $\delta \rightarrow 0$. Our analysis relies on identifying a
limiting fluid dynamics of allocations that satisfy a series of ordinary
differential equations pasted together and that describe the asymptotic path
followed by our algorithm. We rely on the implicit function theorem to show
existence and uniqueness of these fluid ODEs and to show that the proposed
algorithm remains close to the ODE solution.
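The fixed-$\beta$ top-2 sampling rule described at the start of this abstract can be sketched as follows. This is a minimal illustration, not the paper's optimal algorithm: the function names are hypothetical, and the challenger rule is an assumption borrowed from the top-two literature for unit-variance Gaussian arms (the arm with the smallest transportation cost against the empirical best).

```python
import random

def empirical_best(means):
    """Index of the arm with the largest empirical mean."""
    return max(range(len(means)), key=lambda a: means[a])

def challenger(means, counts):
    """Arm that is cheapest to confuse with the empirical best (assumed Gaussian cost)."""
    best = empirical_best(means)

    def cost(j):
        gap = means[best] - means[j]
        # Transportation cost for unit-variance Gaussian arms (assumption).
        return gap * gap / (2.0 * (1.0 / counts[best] + 1.0 / counts[j]))

    others = [j for j in range(len(means)) if j != best]
    return min(others, key=cost)

def top_two_step(means, counts, beta, rng):
    """Pull the empirical best arm with probability beta, else the challenger."""
    if rng.random() < beta:
        return empirical_best(means)
    return challenger(means, counts)
```

With `means = [0.9, 0.5, 0.8]` and equal counts, the empirical best is arm 0 and the challenger is arm 2, since its mean is closest to the leader's.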
[COMMENTS]
To appear in NeurIPS 2024
[LINK]
http://arxiv.org/abs/2403.09123v2
[DATE]
2024-12-15 16:24:09+08:00
[CATEGORIES]
cs.LG
Multi-Graph Co-Training for Capturing User Intent in Session-based Recommendation
[AUTHORS]
Zhe Yang, Tiantian Liang
[COMMENTS]
COLING 2025 Main Conference
[LINK]
http://arxiv.org/abs/2412.11105v1
[DATE]
2024-12-15 16:08:07+08:00
[CATEGORIES]
cs.LG
ABC3: Active Bayesian Causal Inference with Cohn Criteria in Randomized Experiments
[AUTHORS]
Taehun Cha, Donghun Lee
[ABSTRACT]
In causal inference, the randomized experiment is the de facto method for overcoming
various theoretical issues in observational studies. However, experiments
are expensive to run, so an efficient experimental design is
necessary. We propose ABC3, a Bayesian active learning policy for causal
inference. We show a policy minimizing an estimation error on conditional
average treatment effect is equivalent to minimizing an integrated posterior
variance, similar to Cohn criteria \citep{cohn1994active}. We theoretically
prove ABC3 also minimizes an imbalance between the treatment and control groups
and the type 1 error probability. This imbalance-minimizing characteristic is
especially notable as several works have emphasized the importance of achieving
balance. Through extensive experiments on real-world data sets, ABC3 achieves
the highest efficiency, while empirically showing the theoretical results hold.
[COMMENTS]
AAAI 2025
[LINK]
http://arxiv.org/abs/2412.11104v1
[DATE]
2024-12-15 16:00:57+08:00
[CATEGORIES]
cs.LG
Transparent Networks for Multivariate Time Series
[AUTHORS]
Minkyu Kim, Suan Lee, Jinho Kim
[ABSTRACT]
Transparent models, which are machine learning models that produce inherently
interpretable predictions, are receiving significant attention in high-stakes
domains. However, despite much real-world data being collected as time series,
there is a lack of studies on transparent time series models. To address this
gap, we propose a novel transparent neural network model for time series called
Generalized Additive Time Series Model (GATSM). GATSM consists of two parts: 1)
independent feature networks to learn feature representations, and 2) a
transparent temporal module to learn temporal patterns across different time
steps using the feature representations. This structure allows GATSM to
effectively capture temporal patterns and handle dynamic-length time series
while preserving transparency. Empirical experiments show that GATSM
significantly outperforms existing generalized additive models and achieves
comparable performance to black-box time series models, such as recurrent
neural networks and Transformers. In addition, we demonstrate that GATSM finds
interesting patterns in time series. The source code is available at
https://github.com/gim4855744/GATSM.
[COMMENTS]
Additional experiments are added in appendix
[LINK]
http://arxiv.org/abs/2410.10535v2
[DATE]
2024-12-15 15:56:04+08:00
[CATEGORIES]
cs.LG
BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks
[AUTHORS]
Amrutha Varshini Ramesh, Vignesh Ganapathiraman, Issam H. Laradji, Mark Schmidt
[ABSTRACT]
Training large language models (LLMs) for pretraining or adapting to new
tasks and domains has become increasingly critical as their applications
expand. However, as the model and the data sizes grow, the training process
presents significant memory challenges, often requiring a prohibitive amount of
GPU memory that may not be readily available. Existing methods such as low-rank
adaptation (LoRA) add trainable low-rank matrix factorizations, altering the
training dynamics and limiting the model’s parameter search to a low-rank
subspace. GaLore, a more recent method, employs Gradient Low-Rank Projection to
reduce the memory footprint in the full-parameter training setting. However,
GaLore can only be applied to a subset of the LLM layers that satisfy the
“reversibility” property, thus limiting its applicability. In response to
these challenges, we introduce BlockLLM, an approach inspired by block
coordinate descent. Our method carefully selects and updates a very small
subset of the trainable parameters without altering any part of its
architecture and training procedure. BlockLLM achieves state-of-the-art
performance in both finetuning and pretraining tasks, while reducing the memory
footprint of the underlying optimization process. Our experiments demonstrate
that, when fine-tuning fewer than 5% of the parameters, BlockLLM achieves
state-of-the-art perplexity scores on the GLUE benchmarks. On a Llama model
pretrained on the C4 dataset, BlockLLM is able to train with significantly less
memory than the state-of-the-art, while still maintaining competitive
performance.
[COMMENTS]
18 pages, 7 figures
[LINK]
http://arxiv.org/abs/2406.17296v2
[DATE]
2024-12-15 15:51:18+08:00
[CATEGORIES]
cs.LG
Dynamic Graph Attention Networks for Travel Time Distribution Prediction in Urban Arterial Roads
[AUTHORS]
Nooshin Yousefzadeh, Rahul Sengupta, Sanjay Ranka
[ABSTRACT]
Effective congestion management along signalized corridors is essential for
improving productivity and reducing costs, with arterial travel time serving as
a key performance metric. Traditional approaches, such as Coordinated Signal
Timing and Adaptive Traffic Control Systems, often lack scalability and
generalizability across diverse urban layouts. We propose Fusion-based Dynamic
Graph Neural Networks (FDGNN), a structured framework for simultaneous modeling
of travel time distributions in both directions along arterial corridors. FDGNN
utilizes attentional graph convolution on dynamic, bidirectional graphs and
integrates fusion techniques to capture evolving spatiotemporal traffic
dynamics. The framework is trained on extensive hours of simulation data and
utilizes GPU computation to ensure scalability. The results demonstrate that
our framework can efficiently and accurately model travel time as a normal
distribution on arterial roads leveraging a unique dynamic graph representation
of corridor traffic states. This representation integrates sequential traffic
signal timing plans, local driving behaviors, temporal turning movement counts,
and ingress traffic volumes, even when aggregated over intervals as short as a
single cycle length. The results demonstrate resilience to effective traffic
variations, including cycle lengths, green time percentages, traffic density,
and counterfactual routes. Results further confirm its stability under varying
conditions at different intersections. This framework supports dynamic signal
timing, enhances congestion management, and improves travel time reliability in
real-world applications.
[COMMENTS]
11 pages, 4 figures, 3 tables
[LINK]
http://arxiv.org/abs/2412.11095v1
[DATE]
2024-12-15 15:30:01+08:00
[CATEGORIES]
cs.LG
GraphMoRE: Mitigating Topological Heterogeneity via Mixture of Riemannian Experts
[AUTHORS]
Zihao Guo, Qingyun Sun, Haonan Yuan, Xingcheng Fu, Min Zhou, Yisen Gao, Jianxin Li
[ABSTRACT]
Real-world graphs have inherently complex and diverse topological patterns,
known as topological heterogeneity. Most existing works learn graph
representation in a single constant curvature space that is insufficient to
match the complex geometric shapes, resulting in low-quality embeddings with
high distortion. This also constitutes a critical challenge for graph
foundation models, which are expected to uniformly handle a wide variety of
diverse graph data. Recent studies have indicated that product manifolds have
the potential to address topological heterogeneity. However, the product
manifold is still homogeneous, which is inadequate and inflexible for
representing the mixed heterogeneous topology. In this paper, we propose a
novel Graph Mixture of Riemannian Experts (GraphMoRE) framework to effectively
tackle topological heterogeneity by personalized fine-grained topology geometry
pattern preservation. Specifically, to minimize the embedding distortion, we
propose a topology-aware gating mechanism to select the optimal embedding space
for each node. By fusing the outputs of diverse Riemannian experts with learned
gating weights, we construct personalized mixed curvature spaces for nodes,
effectively embedding the graph into a heterogeneous manifold with varying
curvatures at different points. Furthermore, to fairly measure pairwise
distances between different embedding spaces, we present a concise and
effective alignment strategy. Extensive experiments on real-world and synthetic
datasets demonstrate that our method achieves superior performance with lower
distortion, highlighting its potential for modeling complex graphs with
topological heterogeneity, and providing a novel architectural perspective for
graph foundation models.
[COMMENTS]
Accepted by the Main Technical Track of the 39th Annual AAAI
Conference on Artificial Intelligence (AAAI-2025)
[LINK]
http://arxiv.org/abs/2412.11085v1
[DATE]
2024-12-15 14:52:40+08:00
[CATEGORIES]
cs.LG
BarcodeMamba: State Space Models for Biodiversity Analysis
[AUTHORS]
Tiancheng Gao, Graham W. Taylor
[ABSTRACT]
DNA barcodes are crucial in biodiversity analysis for building automatic
identification systems that recognize known species and discover unseen
species. Unlike human genome modeling, barcode-based invertebrate
identification poses challenges in the vast diversity of species and taxonomic
complexity. Among Transformer-based foundation models, BarcodeBERT excelled in
species-level identification of invertebrates, highlighting the effectiveness
of self-supervised pretraining on barcode-specific datasets. Recently,
structured state space models (SSMs) have emerged, with a time complexity that
scales sub-quadratically with the context length. SSMs provide an efficient
parameterization of sequence modeling relative to attention-based
architectures. Given the success of Mamba and Mamba-2 in natural language, we
designed BarcodeMamba, a performant and efficient foundation model for DNA
barcodes in biodiversity analysis. We conducted a comprehensive ablation study
on the impacts of self-supervised training and tokenization methods, and
compared both versions of Mamba layers in terms of expressiveness and their
capacity to identify “unseen” species held back from training. Our study shows
that BarcodeMamba has better performance than BarcodeBERT even when using only
8.3% as many parameters, and improves species-level accuracy to 99.2%
in linear probing without fine-tuning for “seen” species. In our
scaling study, BarcodeMamba with 63.6% of BarcodeBERT’s parameters achieved
70.2% genus-level accuracy in 1-nearest neighbor (1-NN) probing for unseen
species. The code repository to reproduce our experiments is available at
https://github.com/bioscan-ml/BarcodeMamba.
[COMMENTS]
9 pages, 2 figures, accepted at Foundation Models for Science:
Progress, Opportunities, and Challenges Workshop (NeurIPS 2024)
[LINK]
http://arxiv.org/abs/2412.11084v1
[DATE]
2024-12-15 14:52:18+08:00
[CATEGORIES]
cs.LG
EquiFlow: Equivariant Conditional Flow Matching with Optimal Transport for 3D Molecular Conformation Prediction
[AUTHORS]
Qingwen Tian, Yuxin Xu, Yixuan Yang, Zhen Wang, Ziqi Liu, Pengju Yan, Xiaolin Li
[ABSTRACT]
Molecular 3D conformations play a key role in determining how molecules
interact with other molecules or protein surfaces. Recent deep learning
advancements have improved conformation prediction, but slow training speeds
and difficulties in utilizing high-degree features limit performance. We
propose EquiFlow, an equivariant conditional flow matching model with optimal
transport. EquiFlow uniquely applies conditional flow matching in molecular 3D
conformation prediction, leveraging simulation-free training to address slow
training speeds. It uses a modified Equiformer model to encode Cartesian
molecular conformations along with their atomic and bond properties into
higher-degree embeddings. Additionally, EquiFlow employs an ODE solver,
providing faster inference speeds compared to diffusion models with SDEs.
Experiments on the QM9 dataset show that EquiFlow predicts small molecule
conformations more accurately than current state-of-the-art models.
[COMMENTS]
11 pages, 5 figures
[LINK]
http://arxiv.org/abs/2412.11082v1
[DATE]
2024-12-15 14:48:22+08:00
[CATEGORIES]
cs.LG
Deep Spectral Clustering via Joint Spectral Embedding and Kmeans
[AUTHORS]
Wengang Guo, Wei Ye
[ABSTRACT]
Spectral clustering is a popular clustering method. It first maps data into
the spectral embedding space and then uses Kmeans to find clusters. However,
the two decoupled steps prohibit joint optimization for the optimal solution.
In addition, it needs to construct the similarity graph for samples, which
suffers from the curse of dimensionality when the data are high-dimensional. To
address these two challenges, we introduce Deep Spectral
Clustering (DSC), which consists of two main modules: the
spectral embedding module and the greedy Kmeans module. The former module
learns to efficiently embed raw samples into the spectral embedding space using
deep neural networks and power iteration. The latter module improves the
cluster structures of Kmeans on the learned spectral embeddings by a greedy
optimization strategy, which iteratively reveals the direction of the worst
cluster structures and optimizes embeddings in this direction. To jointly
optimize spectral embeddings and clustering, we seamlessly integrate the two
modules and optimize them in an end-to-end manner. Experimental results on
seven real-world datasets demonstrate that DSC achieves state-of-the-art
clustering performance.
[LINK]
http://arxiv.org/abs/2412.11080v1
[DATE]
2024-12-15 14:40:22+08:00
[CATEGORIES]
cs.LG
Edge Contrastive Learning: An Augmentation-Free Graph Contrastive Learning Model
[AUTHORS]
Yujun Li, Hongyuan Zhang, Yuan Yuan
[ABSTRACT]
Graph contrastive learning (GCL) aims to learn representations from unlabeled
graph data in a self-supervised manner and has developed rapidly in recent
years. However, edge-level contrasts are not well explored by most existing GCL
methods. Most studies in GCL only regard edges as auxiliary information while
updating node features. One of the primary obstacles of edge-based GCL is the
heavy computation burden. To tackle this issue, we propose a model that can
efficiently learn edge features for GCL, namely Augmentation-Free Edge
Contrastive Learning (AFECL) to achieve edge-edge contrast. AFECL requires no
augmentation and consists of two parts. First, we design a novel edge feature
generation method, where edge features are computed by embedding concatenation
of their connected nodes. Secondly, an edge contrastive learning scheme is
developed, where edges connecting the same nodes are defined as positive pairs,
and other edges are defined as negative pairs. Experimental results show that
compared with recent state-of-the-art GCL methods or even some supervised GNNs,
AFECL achieves SOTA performance on link prediction and semi-supervised node
classification with extremely scarce labels. The source code is available at
https://github.com/YujunLi361/AFECL.
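The two parts described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, node embeddings are plain Python lists, and "edges connecting the same nodes" is read here as "edges that share at least one endpoint", which is one possible interpretation of the abstract rather than the authors' confirmed definition.

```python
def edge_features(node_emb, edges):
    """Build each edge's feature by concatenating the embeddings of its two endpoints."""
    return [list(node_emb[u]) + list(node_emb[v]) for u, v in edges]

def positive_pairs(edges):
    """Index pairs of distinct edges sharing at least one endpoint (assumed reading)."""
    pairs = []
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            if set(edges[i]) & set(edges[j]):
                pairs.append((i, j))
    return pairs
```

All remaining edge pairs would then serve as negatives in a contrastive loss; no graph augmentation is needed to form either set.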
[LINK]
http://arxiv.org/abs/2412.11075v1
[DATE]
2024-12-15 14:16:01+08:00
[CATEGORIES]
cs.LG
Learning Robust and Privacy-Preserving Representations via Information Theory
[AUTHORS]
Binghui Zhang, Sayedeh Leila Noorbakhsh, Yun Dong, Yuan Hong, Binghui Wang
[ABSTRACT]
Machine learning models are vulnerable to both security attacks (e.g.,
adversarial examples) and privacy attacks (e.g., private attribute inference).
We take the first step to mitigate both the security and privacy attacks, and
maintain task utility as well. Particularly, we propose an
information-theoretic framework to achieve the goals through the lens of
representation learning, i.e., learning representations that are robust to both
adversarial examples and attribute inference adversaries. We also derive novel
theoretical results under our framework, e.g., the inherent trade-off between
adversarial robustness/utility and attribute privacy, and guaranteed attribute
privacy leakage against attribute inference adversaries.
[LINK]
http://arxiv.org/abs/2412.11066v1
[DATE]
2024-12-15 13:51:48+08:00
[CATEGORIES]
cs.LG
Representation learning of dynamic networks
[AUTHORS]
Haixu Wang, Jiguo Cao, Jian Pei
[ABSTRACT]
This study presents a novel representation learning model tailored for
dynamic networks, which describes the continuously evolving relationships among
individuals within a population. The problem is encapsulated in the dimension
reduction topic of functional data analysis. With dynamic networks represented
as matrix-valued functions, our objective is to map this functional data into a
set of vector-valued functions in a lower-dimensional learning space. This
space, defined as a metric functional space, allows for the calculation of
norms and inner products. By constructing this learning space, we address (i)
attribute learning, (ii) community detection, and (iii) link prediction and
recovery of individual nodes in the dynamic network. Our model also
accommodates asymmetric low-dimensional representations, enabling the separate
study of nodes’ regulatory and receiving roles. Crucially, the learning method
accounts for the time-dependency of networks, ensuring that representations are
continuous over time. The functional learning space we define naturally spans
the time frame of the dynamic networks, facilitating both the inference of
network links at specific time points and the reconstruction of the entire
network structure without direct observation. We validated our approach through
simulation studies and real-world applications. In simulations, we compared our
method's link prediction performance to existing approaches under various data
corruption scenarios. For real-world applications, we examined a dynamic social
network replicated across six ant populations, demonstrating that our
low-dimensional learning space effectively captures interactions, roles of
individual ants, and the social evolution of the network. Our findings align
with existing knowledge of ant colony behavior.
[LINK]
http://arxiv.org/abs/2412.11065v1
[DATE]
2024-12-15 13:45:20+08:00
[CATEGORIES]
cs.LG
Set-Valued Sensitivity Analysis of Deep Neural Networks
[AUTHORS]
Xin Wang, Feilong Wang, Xuegang Ban
[ABSTRACT]
This paper proposes a sensitivity analysis framework based on set-valued
mapping for deep neural networks (DNN) to understand and compute how the
solutions (model weights) of DNN respond to perturbations in the training data.
As a DNN may not exhibit a unique solution (minima) and the algorithm of
solving a DNN may lead to different solutions with minor perturbations to input
data, we focus on the sensitivity of the solution set of DNN, instead of
studying a single solution. In particular, we are interested in the expansion
and contraction of the set in response to data perturbations. If the change of
the solution set can be bounded by the extent of the data perturbation, the model
is said to exhibit the Lipschitz-like property. This “set-to-set” analysis
approach provides a deeper understanding of the robustness and reliability of
DNNs during training. Our framework incorporates both isolated and non-isolated
minima, and critically, does not require the assumption that the Hessian of
the loss function is non-singular. By developing set-level metrics such as distance
between sets, convergence of sets, derivatives of set-valued mapping, and
stability across the solution set, we prove that the solution set of the Fully
Connected Neural Network holds Lipschitz-like properties. For general neural
networks (e.g., Resnet), we introduce a graphical-derivative-based method to
estimate the new solution set following data perturbation without retraining.
[LINK]
http://arxiv.org/abs/2412.11057v1
[DATE]
2024-12-15 13:22:38+08:00
[CATEGORIES]
cs.LG
DisCo-DSO: Coupling Discrete and Continuous Optimization for Efficient Generative Design in Hybrid Spaces
[AUTHORS]
Jacob F. Pettit, Chak Shing Lee, Jiachen Yang, Alex Ho, Daniel Faissol, Brenden Petersen, Mikel Landajuela
[ABSTRACT]
We consider the challenge of black-box optimization within hybrid
discrete-continuous and variable-length spaces, a problem that arises in
various applications, such as decision tree learning and symbolic regression.
We propose DisCo-DSO (Discrete-Continuous Deep Symbolic Optimization), a novel
approach that uses a generative model to learn a joint distribution over
discrete and continuous design variables to sample new hybrid designs. In
contrast to standard decoupled approaches, in which the discrete and continuous
variables are optimized separately, our joint optimization approach uses fewer
objective function evaluations, is robust against non-differentiable
objectives, and learns from prior samples to guide the search, leading to
significant improvement in performance and sample efficiency. Our experiments
on a diverse set of optimization tasks demonstrate that the advantages of
DisCo-DSO become increasingly evident as the complexity of the problem
increases. In particular, we illustrate DisCo-DSO’s superiority over the
state-of-the-art methods for interpretable reinforcement learning with decision
trees.
[COMMENTS]
Accepted at AAAI-25
[LINK]
http://arxiv.org/abs/2412.11051v1
[DATE]
2024-12-15 12:51:54+08:00
[CATEGORIES]
cs.LG
Research on short-term load forecasting model based on VMD and IPSO-ELM
[AUTHORS]
Qiang Xie
[ABSTRACT]
To enhance the accuracy of power load forecasting in wind farms, this study
introduces an advanced combined forecasting method that integrates Variational
Mode Decomposition (VMD) with an Improved Particle Swarm Optimization (IPSO)
algorithm to optimize the Extreme Learning Machine (ELM). Initially, the VMD
algorithm is employed to perform high-precision modal decomposition of the
original power load data, which is then categorized into high-frequency and
low-frequency sequences based on mutual information entropy theory.
Subsequently, this research profoundly modifies the traditional multiverse
optimizer by incorporating Tent chaos mapping, exponential travel distance
rate, and an elite reverse learning mechanism, developing the IPSO-ELM
prediction model. This model independently predicts the high and low-frequency
sequences and reconstructs the data to achieve the final forecasting results.
Simulation results indicate that the proposed method significantly improves
prediction accuracy and convergence speed compared to traditional ELM, PSO-ELM,
and PSO-ELM methods.
[COMMENTS]
10 pages, in Chinese language, 5 figures
[LINK]
http://arxiv.org/abs/2410.05300v2
[DATE]
2024-12-15 12:45:11+08:00
[CATEGORIES]
cs.LG
Temporal-Aware Evaluation and Learning for Temporal Graph Neural Networks
[AUTHORS]
Junwei Su, Shan Wu
[ABSTRACT]
Temporal Graph Neural Networks (TGNNs) are a family of graph neural networks
designed to model and learn dynamic information from temporal graphs. Given
their substantial empirical success, there is an escalating interest in TGNNs
within the research community. However, the majority of these efforts have been
channelled towards algorithm and system design, with the evaluation metrics
receiving comparatively less attention. Effective evaluation metrics are
crucial for providing detailed performance insights, particularly in the
temporal domain. This paper investigates the commonly used evaluation metrics
for TGNNs and illustrates the failure mechanisms of these metrics in capturing
essential temporal structures in the predictive behaviour of TGNNs. We provide
a mathematical formulation of existing performance metrics and utilize an
instance-based study to underscore their inadequacies in identifying volatility
clustering (the occurrence of emerging errors within a brief interval). This
phenomenon has profound implications for both algorithm and system design in
the temporal domain. To address this deficiency, we introduce a new
volatility-aware evaluation metric (termed volatility cluster statistics),
designed for a more refined analysis of model temporal performance.
Additionally, we demonstrate how this metric can serve as a
temporal-volatility-aware training objective to alleviate the clustering of
temporal errors. Through comprehensive experiments on various TGNN models, we
validate our analysis and the proposed approach. The empirical results offer
revealing insights: 1) existing TGNNs are prone to making errors with
volatility clustering, and 2) TGNNs with different mechanisms to capture
temporal information exhibit distinct volatility clustering patterns. Our
empirical findings demonstrate that our proposed training objective effectively
reduces volatility clusters in error.
[LINK]
http://arxiv.org/abs/2412.07273v2
[DATE]
2024-12-15 12:10:49+08:00
[CATEGORIES]
cs.LG
Understanding and Mitigating Memorization in Diffusion Models for Tabular Data
[AUTHORS]
Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiao Li, Jing Li
[ABSTRACT]
Tabular data generation has attracted significant research interest in recent
years, with the tabular diffusion models greatly improving the quality of
synthetic data. However, while memorization, where models inadvertently
replicate exact or near-identical training data, has been thoroughly
investigated in image and text generation, its effects on tabular data remain
largely unexplored. In this paper, we conduct the first comprehensive
investigation of memorization phenomena in diffusion models for tabular data.
Our empirical analysis reveals that memorization appears in tabular diffusion
models and increases with more training epochs. We further examine the
influence of factors such as dataset sizes, feature dimensions, and different
diffusion models on memorization. Additionally, we provide a theoretical
explanation for why memorization occurs in tabular diffusion models. To address
this issue, we propose TabCutMix, a simple yet effective data augmentation
technique that exchanges randomly selected feature segments between random
same-class training sample pairs. Building upon this, we introduce
TabCutMixPlus, an enhanced method that clusters features based on feature
correlations and ensures that features within the same cluster are exchanged
together during augmentation. This clustering mechanism mitigates
out-of-distribution (OOD) generation issues by maintaining feature coherence.
Experimental results across various datasets and diffusion models demonstrate
that TabCutMix effectively mitigates memorization while maintaining
high-quality data generation.
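The TabCutMix augmentation described above can be sketched as follows. This is a minimal reading of the abstract, not the authors' implementation: the function names are hypothetical, rows are plain Python lists, and each feature is swapped independently with probability 0.5, which is an assumption about how the "randomly selected feature segments" are chosen.

```python
import random

def tab_cutmix_pair(row_a, row_b, rng):
    """Mix two same-class rows: each feature comes from row_a or row_b at random."""
    mask = [rng.random() < 0.5 for _ in row_a]
    return [a if keep else b for a, b, keep in zip(row_a, row_b, mask)]

def tab_cutmix(rows, labels, n_new, rng):
    """Generate n_new augmented (row, label) pairs from random same-class pairs."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    # Only classes with at least two rows can supply a pair.
    eligible = [c for c, members in by_class.items() if len(members) >= 2]
    if not eligible:
        return []
    augmented = []
    for _ in range(n_new):
        c = rng.choice(eligible)
        a, b = rng.sample(by_class[c], 2)
        augmented.append((tab_cutmix_pair(a, b, rng), c))
    return augmented
```

Because both parents share a label, the mixed row keeps that label. TabCutMixPlus would additionally force correlated features to be exchanged together, which this sketch does not attempt.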
[LINK]
http://arxiv.org/abs/2412.11044v1
[DATE]
2024-12-15 12:04:37+08:00
[CATEGORIES]
cs.LG
Multi-Agent Inverse Reinforcement Learning in Real World Unstructured Pedestrian Crowds
[AUTHORS]
Rohan Chandra, Haresh Karnan, Negar Mehr, Peter Stone, Joydeep Biswas
[ABSTRACT]
Social robot navigation in crowded public spaces such as university campuses,
restaurants, grocery stores, and hospitals, is an increasingly important area
of research. One of the core strategies for achieving this goal is to
understand humans’ intent (the underlying psychological factors that govern their
motion) by learning their reward functions, typically via inverse reinforcement
learning (IRL). Despite significant progress in IRL, learning reward functions
of multiple agents simultaneously in dense unstructured pedestrian crowds has
remained intractable due to the nature of the tightly coupled social
interactions that occur in these scenarios, e.g., passing,
intersections, swerving, and weaving. In this paper, we present a new
multi-agent maximum entropy inverse reinforcement learning algorithm for real
world unstructured pedestrian crowds. Key to our approach is a simple but
effective mathematical trick, which we call the
tractability-rationality trade-off trick, that achieves tractability at the cost
of a slight reduction in accuracy. We compare our approach to the classical
single-agent MaxEnt IRL as well as state-of-the-art trajectory prediction
methods on several datasets including the ETH, UCY, SCAND, JRDB, and a new
dataset, called Speedway, collected at a busy intersection on a university
campus focusing on dense, complex agent interactions. Our key findings show
that, on the dense Speedway dataset, our approach ranks 1st among top 7
baselines with >2X improvement over single-agent IRL, and is competitive with
state-of-the-art large transformer-based encoder-decoder models on sparser
datasets such as ETH/UCY (ranks 3rd among top 7 baselines).
[LINK]
http://arxiv.org/abs/2405.16439v2
[DATE]
2024-12-15 11:48:49+08:00
[CATEGORIES]
cs.LG
Embracing Large Language Models in Traffic Flow Forecasting
[AUTHORS]
Yusheng Zhao, Xiao Luo, Haomin Wen, Zhiping Xiao, Wei Ju, Ming Zhang
[ABSTRACT]
Traffic flow forecasting aims to predict future traffic flows based on the
historical traffic conditions and the road network. It is an important problem
in intelligent transportation systems, with a plethora of methods having been
proposed. Existing efforts mainly focus on capturing and utilizing
spatio-temporal dependencies to predict future traffic flows. Though promising,
they fall short in adapting to test-time environmental changes of traffic
conditions. To tackle this challenge, we propose to introduce large language
models (LLMs) to help traffic flow forecasting and design a novel method named
Large Language Model Enhanced Traffic Flow Predictor (LEAF). LEAF adopts two
branches, capturing different spatio-temporal relations using graph and
hypergraph structures respectively. The two branches are first pre-trained
individually, and during test-time, they yield different predictions. Based on
these predictions, a large language model is used to select the most likely
result. Then, a ranking loss is applied as the learning objective to enhance
the prediction ability of the two branches. Extensive experiments on several
datasets demonstrate the effectiveness of the proposed LEAF.
[LINK]
http://arxiv.org/abs/2412.12201v1
[DATE]
2024-12-15 11:08:28+08:00
[CATEGORIES]
cs.LG
Golden Noise for Diffusion Models: A Learning Framework
[AUTHORS]
Zikai Zhou, Shitong Shao, Lichen Bai, Zhiqiang Xu, Bo Han, Zeke Xie
[ABSTRACT]
The text-to-image diffusion model is a popular paradigm that synthesizes
personalized images by providing a text prompt and a random Gaussian noise.
While people observe that some noises are “golden noises” that can achieve
better text-image alignment and higher human preference than others, we still
lack a machine learning framework to obtain those golden noises. To learn
golden noises for diffusion sampling, we mainly make three contributions in
this paper. First, we identify a new concept termed the noise prompt,
which aims at turning a random Gaussian noise into a golden noise by adding a
small desirable perturbation derived from the text prompt. Following the
concept, we first formulate the noise prompt learning framework that
systematically learns “prompted” golden noise associated with a text prompt
for diffusion models. Second, we design a noise prompt data collection pipeline
and collect a large-scale noise prompt dataset (NPD) that contains
100k pairs of random noises and golden noises with the associated text prompts.
With the prepared NPD as the training dataset, we trained a small noise
prompt network (NPNet) that can directly learn to transform a random noise
into a golden noise. The learned golden noise perturbation can be considered as
a kind of prompt for noise, as it is rich in semantic information and tailored
to the given text prompt. Third, our extensive experiments demonstrate the
impressive effectiveness and generalization of NPNet on improving the quality
of synthesized images across various diffusion models, including SDXL,
DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and
efficient controller that acts as a plug-and-play module with very limited
additional inference and computational costs, as it just provides a golden
noise instead of a random noise without accessing the original pipeline.
[LINK]
http://arxiv.org/abs/2411.09502v3
[DATE]
2024-12-15 10:48:48+08:00
[CATEGORIES]
cs.LG
PromptV: Leveraging LLM-powered Multi-Agent Prompting for High-quality Verilog Generation
[AUTHORS]
Zhendong Mi, Renming Zheng, Haowen Zhong, Yue Sun, Shaoyi Huang
[ABSTRACT]
Recent advances in agentic LLMs have demonstrated remarkable automated
Verilog code generation capabilities. However, existing approaches either
demand substantial computational resources or rely on LLM-assisted single-agent
prompt learning techniques, which we observe for the first time to suffer from a
degeneration issue, characterized by deteriorating generative performance and
diminished error detection and correction capabilities. This paper proposes a
novel multi-agent prompt learning framework to address these limitations and
enhance code generation quality. We show for the first time that multi-agent
architectures can effectively mitigate the degeneration risk while improving
code error correction capabilities, resulting in higher-quality Verilog code
generation. Experimental results show that the proposed method could achieve
96.4% and 96.5% pass@10 scores on VerilogEval Machine and Human benchmarks,
respectively, while attaining 100% Syntax and 99.9% Functionality pass@5 metrics
on the RTLLM benchmark.
[LINK]
http://arxiv.org/abs/2412.11014v1
[DATE]
2024-12-15 09:58:10+08:00
[CATEGORIES]
cs.LG
An advantage based policy transfer algorithm for reinforcement learning with measures of transferability
[AUTHORS]
Md Ferdous Alam, Parinaz Naghizadeh, David Hoelzle
[ABSTRACT]
Reinforcement learning (RL) enables sequential decision-making in complex and
high-dimensional environments through interaction with the environment. In most
real-world applications, however, a high number of interactions is infeasible.
In these environments, transfer RL algorithms, which can be used for the
transfer of knowledge from one or multiple source environments to a target
environment, have been shown to increase learning speed and improve initial and
asymptotic performance. However, most existing transfer RL algorithms are
on-policy and sample inefficient, fail in adversarial target tasks, and often
require heuristic choices in algorithm design. This paper proposes an
off-policy Advantage-based Policy Transfer algorithm, APT-RL, for fixed domain
environments. Its novelty is in using the popular notion of “advantage” as a
regularizer, to weigh the knowledge that should be transferred from the source,
relative to new knowledge learned in the target, removing the need for
heuristic choices. Further, we propose a new transfer performance measure to
evaluate the performance of our algorithm and unify existing transfer RL
frameworks. Finally, we present a scalable, theoretically-backed task
similarity measurement algorithm to illustrate the alignments between our
proposed transferability measure and similarities between source and target
environments. We compare APT-RL with several baselines, including existing
transfer-RL algorithms, in three high-dimensional continuous control tasks. Our
experiments demonstrate that APT-RL outperforms existing transfer RL algorithms
and is at least as good as learning from scratch in adversarial tasks.
[COMMENTS]
12 pages, 5 figures
[LINK]
http://arxiv.org/abs/2311.06731v2
[DATE]
2024-12-15 09:27:10+08:00
[CATEGORIES]
cs.LG
Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms
[AUTHORS]
Rayna Andreeva, Benjamin Dupuis, Rik Sarkar, Tolga Birdal, Umut Şimşekli
[ABSTRACT]
We present a novel set of rigorous and computationally efficient
topology-based complexity notions that exhibit a strong correlation with the
generalization gap in modern deep neural networks (DNNs). DNNs show remarkable
generalization properties, yet the source of these capabilities remains
elusive, defying the established statistical learning theory. Recent studies
have revealed that properties of training trajectories can be indicative of
generalization. Building on this insight, state-of-the-art methods have
leveraged the topology of these trajectories, particularly their fractal
dimension, to quantify generalization. Most existing works compute this
quantity by assuming continuous- or infinite-time training dynamics,
complicating the development of practical estimators capable of accurately
predicting generalization without access to test data. In this paper, we
respect the discrete-time nature of training trajectories and investigate the
underlying topological quantities that can be amenable to topological data
analysis tools. This leads to a new family of reliable topological complexity
measures that provably bound the generalization error, eliminating the need for
restrictive geometric assumptions. These measures are computationally friendly,
enabling us to propose simple yet effective algorithms for computing
generalization indices. Moreover, our flexible framework can be extended to
different domains, tasks, and architectures. Our experimental results
demonstrate that our new complexity measures correlate highly with
generalization error in industry-standard architectures such as transformers
and deep graph networks. Our approach consistently outperforms existing
topological bounds across a wide range of datasets, models, and optimizers,
highlighting the practical relevance and effectiveness of our complexity
measures.
[LINK]
http://arxiv.org/abs/2407.08723v2
[DATE]
2024-12-15 09:26:17+08:00
[CATEGORIES]
cs.LG
FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores
[AUTHORS]
Jinliang Shi, Shigang Li, Youxuan Xu, Rongtian Fu, Xueying Wang, Tong Wu
[ABSTRACT]
Sparse Matrix-matrix Multiplication (SpMM) and Sampled Dense-dense Matrix
Multiplication (SDDMM) are important sparse operators in scientific computing
and deep learning. Tensor Core Units (TCUs) enhance modern accelerators with
superior computing power, which promises to boost the performance of matrix
operators to a higher level. However, due to the irregularity of unstructured
sparse data, it is difficult to deliver practical speedups on TCUs. To this
end, we propose FlashSparse, a novel approach to bridge the gap between sparse
workloads and the TCU architecture. Specifically, FlashSparse minimizes the
sparse granularity for SpMM and SDDMM on TCUs through a novel
swap-and-transpose matrix multiplication strategy. Benefiting from the minimum
sparse granularity, the computation redundancy is remarkably reduced while the
computing power of TCUs is fully utilized. Besides, FlashSparse is equipped
with a memory-efficient thread mapping strategy for coalesced data access and a
sparse matrix storage format to save memory footprint. Extensive experimental
results on H100 and RTX 4090 GPUs show that FlashSparse sets a new
state-of-the-art for sparse matrix multiplications (geometric mean 5.5x speedup
over DTC-SpMM and 3.22x speedup over RoDe).
[COMMENTS]
Accepted by 30th ACM SIGPLAN Annual Symposium on Principles and
Practice of Parallel Programming (PPoPP’25)
[LINK]
http://arxiv.org/abs/2412.11007v1
[DATE]
2024-12-15 09:12:33+08:00
[CATEGORIES]
cs.LG
An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L
[AUTHORS]
Jett Janiak, Can Rager, James Dao, Yeu-Tong Lau
[ABSTRACT]
Prior work suggests that language models manage the limited bandwidth of the
residual stream through a “memory management” mechanism, where certain
attention heads and MLP layers clear residual stream directions set by earlier
layers. Our study provides concrete evidence for this erasure phenomenon in a
4-layer transformer, identifying heads that consistently remove the output of
earlier heads. We further demonstrate that direct logit attribution (DLA), a
common technique for interpreting the output of intermediate transformer
layers, can show misleading results by not accounting for erasure.
[LINK]
http://arxiv.org/abs/2310.07325v4
[DATE]
2024-12-15 06:22:55+08:00
[CATEGORIES]
cs.LG
Composers’ Evaluations of an AI Music Tool: Insights for Human-Centred Design
[AUTHORS]
Eleanor Row, György Fazekas
[ABSTRACT]
We present a study that explores the role of user-centred design in
developing Generative AI (GenAI) tools for music composition. Through
semi-structured interviews with professional composers, we gathered insights on
a novel generative model for creating variations, highlighting concerns around
trust, transparency, and ethical design. The findings helped form a feedback
loop, guiding improvements to the model that emphasised traceability,
transparency and explainability. They also revealed new areas for innovation,
including novel features for controllability and research questions on the
ethical and practical implementation of GenAI models.
[COMMENTS]
Accepted to NeurIPS 2024 Workshop on Generative AI and Creativity: A
dialogue between machine learning researchers and creative professionals in
Vancouver, Canada
[LINK]
http://arxiv.org/abs/2412.10968v1
[DATE]
2024-12-15 04:56:23+08:00
[CATEGORIES]
cs.LG
FlowDock: Geometric Flow Matching for Generative Protein-Ligand Docking and Affinity Prediction
[AUTHORS]
Alex Morehead, Jianlin Cheng
[ABSTRACT]
Powerful generative models of protein-ligand structure have recently been
proposed, but few of these methods support both flexible protein-ligand docking
and affinity estimation. Of those that do, none can directly model multiple
binding ligands concurrently or have been rigorously benchmarked on
pharmacologically relevant drug targets, hindering their widespread adoption in
drug discovery efforts. In this work, we propose FlowDock, a deep geometric
generative model based on conditional flow matching that learns to directly map
unbound (apo) structures to their bound (holo) counterparts for an arbitrary
number of binding ligands. Furthermore, FlowDock provides predicted structural
confidence scores and binding affinity values with each of its generated
protein-ligand complex structures, enabling fast virtual screening of new
(multi-ligand) drug targets. For the commonly-used PoseBusters Benchmark
dataset, FlowDock achieves a 51% blind docking success rate using unbound (apo)
protein input structures and without any information derived from multiple
sequence alignments, and for the challenging new DockGen-E dataset, FlowDock
matches the performance of single-sequence Chai-1 for binding pocket
generalization. Additionally, in the ligand category of the 16th community-wide
Critical Assessment of Techniques for Structure Prediction (CASP16), FlowDock
ranked among the top-5 methods for pharmacological binding affinity estimation
across 140 protein-ligand complexes, demonstrating the efficacy of its learned
representations in virtual screening. Source code, data, and pre-trained models
are available at https://github.com/BioinfoMachineLearning/FlowDock.
[COMMENTS]
10 pages, 2 tables, 2 algorithms, 7 figures. Code, data, pre-trained
models, and baseline method predictions are available at
https://github.com/BioinfoMachineLearning/FlowDock
[LINK]
http://arxiv.org/abs/2412.10966v1
[DATE]
2024-12-15 04:54:37+08:00
[CATEGORIES]
cs.LG
RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair
[AUTHORS]
André Silva, Sen Fang, Martin Monperrus
[ABSTRACT]
Automated Program Repair (APR) has evolved significantly with the advent of
Large Language Models (LLMs). Fine-tuning LLMs for program repair is a recent
avenue of research, with many dimensions which have not been explored. Existing
work mostly fine-tunes LLMs with naive code representations and does not scale
to frontier models. To address this problem, we propose RepairLLaMA, a novel
program repair approach that 1) identifies optimal code representations for APR
with fine-tuned models, and 2) pioneers a state-of-the-art parameter-efficient
fine-tuning (PEFT) technique for program repair. This results in RepairLLaMA
producing a highly effective ‘program repair adapter’ for fixing bugs with AI.
Our experiments demonstrate the validity of both concepts. First, fine-tuning
adapters with program repair specific code representations enables the model to
use meaningful repair signals and produce better patches. Second,
parameter-efficient fine-tuning helps fine-tuning to converge and clearly
contributes to the effectiveness of RepairLLaMA in fixing bugs outside the
fine-tuning data distribution. Overall, RepairLLaMA correctly fixes 144
Defects4J v2, 109 HumanEval-Java, and 20 GitBug-Java bugs, outperforming all
baselines.
[LINK]
http://arxiv.org/abs/2312.15698v5
[DATE]
2024-12-15 04:39:12+08:00
[CATEGORIES]
cs.LG
SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer
[AUTHORS]
Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, Emad Barsoum
[ABSTRACT]
Efficient image tokenization with high compression ratios remains a critical
challenge for training generative models. We present SoftVQ-VAE, a continuous
image tokenizer that leverages soft categorical posteriors to aggregate
multiple codewords into each latent token, substantially increasing the
representation capacity of the latent space. When applied to Transformer-based
architectures, our approach compresses 256x256 and 512x512 images using as few
as 32 or 64 1-dimensional tokens. Not only does SoftVQ-VAE show consistent and
high-quality reconstruction, more importantly, it also achieves
state-of-the-art and significantly faster image generation results across
different denoising-based generative models. Remarkably, SoftVQ-VAE improves
inference throughput by up to 18x for generating 256x256 images and 55x for
512x512 images while achieving competitive FID scores of 1.78 and 2.21 for
SiT-XL. It also improves the training efficiency of the generative models by
reducing the number of training iterations by 2.3x while maintaining comparable
performance. With its fully-differentiable design and semantic-rich latent
space, our experiment demonstrates that SoftVQ-VAE achieves efficient
tokenization without compromising generation quality, paving the way for more
efficient generative models. Code and model are released.
[COMMENTS]
Code and model: https://github.com/Hhhhhhao/continuous_tokenizer
[LINK]
http://arxiv.org/abs/2412.10958v1
[DATE]
2024-12-15 04:29:29+08:00
[CATEGORIES]
cs.LG
Deep Learning-Based Noninvasive Screening of Type 2 Diabetes with Chest X-ray Images and Electronic Health Records
[AUTHORS]
Sanjana Gundapaneni, Zhuo Zhi, Miguel Rodrigues
[ABSTRACT]
The imperative for early detection of type 2 diabetes mellitus (T2DM) is
challenged by its asymptomatic onset and dependence on suboptimal clinical
diagnostic tests, contributing to its widespread global prevalence. While
research into noninvasive T2DM screening tools has advanced, conventional
machine learning approaches remain limited to unimodal inputs due to extensive
feature engineering requirements. In contrast, deep learning models can
leverage multimodal data for a more holistic understanding of patients’ health
conditions. However, the potential of chest X-ray (CXR) imaging, one of the
most commonly performed medical procedures, remains underexplored. This study
evaluates the integration of CXR images with other noninvasive data sources,
including electronic health records (EHRs) and electrocardiography signals, for
T2DM detection. Utilising datasets meticulously compiled from the MIMIC-IV
databases, we investigated two deep fusion paradigms: an early fusion-based
multimodal transformer and a modular joint fusion ResNet-LSTM architecture. The
end-to-end trained ResNet-LSTM model achieved an AUROC of 0.86, surpassing the
CXR-only baseline by 2.3% with just 9863 training samples. These findings
demonstrate the diagnostic value of CXRs within multimodal frameworks for
identifying at-risk individuals early. The dataset preprocessing
pipeline has also been released to support further research in this domain.
[LINK]
http://arxiv.org/abs/2412.10955v1
[DATE]
2024-12-15 04:18:16+08:00
[CATEGORIES]
cs.LG
Optimizing AI-Assisted Code Generation
[AUTHORS]
Simon Torka, Sahin Albayrak
[ABSTRACT]
In recent years, the rise of AI-assisted code-generation tools has
significantly transformed software development. While code generators have
mainly been used to support conventional software development, their use will
be extended to powerful and secure AI systems. Systems capable of generating
code, such as ChatGPT, OpenAI Codex, GitHub Copilot, and AlphaCode, take
advantage of advances in machine learning (ML) and natural language processing
(NLP) enabled by large language models (LLMs). However, it must be borne in
mind that these models work probabilistically, which means that although they
can generate complex code from natural language input, there is no guarantee
for the functionality and security of the generated code.
To fully exploit the considerable potential of this technology, however, the
security, reliability, functionality, and quality of the generated code must be
guaranteed. This paper examines the implementation of these goals to date and
explores strategies to optimize them. In addition, we explore how these systems
can be optimized to create safe, high-performance, and executable artificial
intelligence (AI) models, and consider how to improve their accessibility to
make AI development more inclusive and equitable.
[LINK]
http://arxiv.org/abs/2412.10953v1
[DATE]
2024-12-15 04:14:44+08:00
[CATEGORIES]
cs.LG
Generative Modeling with Diffusion
[AUTHORS]
Justin Le
[ABSTRACT]
We introduce the diffusion model as a method to generate new samples.
Generative models have been recently adopted for tasks such as art generation
(Stable Diffusion, Dall-E) and text generation (ChatGPT). Diffusion models in
particular apply noise to sample data and then “reverse” this noising process
to generate new samples. We will formally define the noising and denoising
processes, then introduce algorithms to train and generate with a diffusion
model. Finally, we will explore a potential application of diffusion models in
improving classifier performance on imbalanced data.
[COMMENTS]
16 pages with 5 figures. This work was submitted to SIAM
Undergraduate Research Online for consideration in their journal
[LINK]
http://arxiv.org/abs/2412.10948v1
[DATE]
2024-12-15 04:04:46+08:00
[CATEGORIES]
cs.LG
APAR: Modeling Irregular Target Functions in Tabular Regression via Arithmetic-Aware Pre-Training and Adaptive-Regularized Fine-Tuning
[AUTHORS]
Hong-Wei Wu, Wei-Yao Wang, Kuang-Da Wang, Wen-Chih Peng
[ABSTRACT]
Tabular data are fundamental in common machine learning applications, ranging
from finance to genomics and healthcare. This paper focuses on tabular
regression tasks, a field where deep learning (DL) methods are not consistently
superior to machine learning (ML) models due to the challenges posed by
irregular target functions inherent in tabular data, causing sensitive label
changes with minor variations from features. To address these issues, we
propose a novel Arithmetic-Aware Pre-training and Adaptive-Regularized
Fine-tuning framework (APAR), which enables the model to fit irregular target
functions in tabular data while reducing the negative impact of overfitting. In
the pre-training phase, APAR introduces an arithmetic-aware pretext objective
to capture intricate sample-wise relationships from the perspective of
continuous labels. In the fine-tuning phase, a consistency-based adaptive
regularization technique is proposed to self-learn appropriate data
augmentation. Extensive experiments across 10 datasets demonstrated that APAR
outperforms existing GBDT-, supervised NN-, and pretrain-finetune NN-based
methods in RMSE (+9.43% $\sim$ 20.37%), and empirically validated the effects
of pre-training tasks, including the study of arithmetic operations. Our code
and data are publicly available at https://github.com/johnnyhwu/APAR.
[COMMENTS]
AAAI 2025 Main Track
[LINK]
http://arxiv.org/abs/2412.10941v1
[DATE]
2024-12-15 03:33:21+08:00
[CATEGORIES]
cs.LG
Unsupervised Learning Approach to Anomaly Detection in Gravitational Wave Data
[AUTHORS]
Ammar Fayad
[ABSTRACT]
Gravitational waves (GW), predicted by Einstein’s General Theory of
Relativity, provide a powerful probe of astrophysical phenomena and fundamental
physics. In this work, we propose an unsupervised anomaly detection method
using variational autoencoders (VAEs) to analyze GW time-series data. By
training on noise-only data, the VAE accurately reconstructs noise inputs while
failing to reconstruct anomalies, such as GW signals, which results in
measurable spikes in the reconstruction error. The method was applied to data
from the LIGO H1 and L1 detectors. Evaluation on testing datasets containing
both noise and GW events demonstrated reliable detection, achieving an area
under the ROC curve (AUC) of 0.89. This study introduces VAEs as a robust,
unsupervised approach for identifying anomalies in GW data, which offers a
scalable framework for detecting known and potentially new phenomena in
physics.
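
The detection rule implied above (flag inputs whose reconstruction error spikes) can be sketched as follows. This is a hedged illustration only: the `reconstruct` callable stands in for a trained VAE, and the mean-plus-`n_sigma` threshold is an assumption made here, not a rule stated in the abstract.

```python
import statistics

def mse(a, b):
    """Mean squared error between a window and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def fit_threshold(noise_windows, reconstruct, n_sigma=3.0):
    """Calibrate an error threshold on noise-only windows (assumed rule)."""
    errors = [mse(w, reconstruct(w)) for w in noise_windows]
    return statistics.mean(errors) + n_sigma * statistics.pstdev(errors)

def flag_anomalies(windows, reconstruct, threshold):
    """Mark each window whose reconstruction error exceeds the threshold."""
    return [mse(w, reconstruct(w)) > threshold for w in windows]
```

In use, `reconstruct` would be the encode-then-decode pass of a model trained on noise-only data, so that windows containing a signal reconstruct poorly and exceed the calibrated threshold.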
[COMMENTS]
The work is still in progress
[LINK]
http://arxiv.org/abs/2411.19450v2
[DATE]
2024-12-15 03:12:09+08:00
[CATEGORIES]
cs.LG
Multimodal Audio-based Disease Prediction with Transformer-based Hierarchical Fusion Network
[AUTHORS]
Jinjin Cai, Ruiqi Wang, Dezhong Zhao, Ziqin Yuan, Victoria McKenna, Aaron Friedman, Rachel Foot, Susan Storey, Ryan Boente, Sudip Vhaduri, Byung-Cheol Min
[ABSTRACT]
Audio-based disease prediction is emerging as a promising supplement to
traditional medical diagnosis methods, facilitating early, convenient, and
non-invasive disease detection and prevention. Multimodal fusion, which
integrates features from various domains within or across bio-acoustic
modalities, has proven effective in enhancing diagnostic performance. However,
most existing methods in the field employ unilateral fusion strategies that
focus solely on either intra-modal or inter-modal fusion. This approach limits
the full exploitation of the complementary nature of diverse acoustic feature
domains and bio-acoustic modalities. Additionally, the inadequate and isolated
exploration of latent dependencies within modality-specific and modality-shared
spaces curtails their capacity to manage the inherent heterogeneity in
multimodal data. To fill these gaps, we propose a transformer-based
hierarchical fusion network designed for general multimodal audio-based disease
prediction. Specifically, we seamlessly integrate intra-modal and inter-modal
fusion in a hierarchical manner and proficiently encode the necessary
intra-modal and inter-modal complementary correlations, respectively.
Comprehensive experiments demonstrate that our model achieves state-of-the-art
performance in predicting three diseases: COVID-19, Parkinson’s disease, and
pathological dysarthria, showcasing its promising potential in a broad context
of audio-based disease prediction tasks. Additionally, extensive ablation
studies and qualitative analyses highlight the significant benefits of each
main component within our model.
[LINK]
http://arxiv.org/abs/2410.09289v2
[DATE]
2024-12-15 03:08:02+08:00
[CATEGORIES]
cs.LG
Progressive Compression with Universally Quantized Diffusion Models
[AUTHORS]
Yibo Yang, Justus C. Will, Stephan Mandt
[ABSTRACT]
Diffusion probabilistic models have achieved mainstream success in many
generative modeling tasks, from image generation to inverse problem solving. A
distinct feature of these models is that they correspond to deep hierarchical
latent variable models optimizing a variational evidence lower bound (ELBO) on
the data likelihood. Drawing on a basic connection between likelihood modeling
and compression, we explore the potential of diffusion models for progressive
coding, resulting in a sequence of bits that can be incrementally transmitted
and decoded with progressively improving reconstruction quality. Unlike prior
work based on Gaussian diffusion or conditional diffusion models, we propose a
new form of diffusion model with uniform noise in the forward process, whose
negative ELBO corresponds to the end-to-end compression cost using universal
quantization. We obtain promising first results on image compression, achieving
competitive rate-distortion and rate-realism results on a wide range of
bit-rates with a single model, bringing neural codecs a step closer to
practical deployment.
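
For readers unfamiliar with universal quantization, the textbook scalar version can be sketched as below. This shows only the classical dithered-rounding idea, under the assumption that encoder and decoder share a random seed; how the paper applies it inside a diffusion model is not specified in the abstract, and the function name is hypothetical.

```python
import random

def universal_quantize(x, rng):
    """Dithered rounding: shift by shared uniform noise, round, shift back."""
    u = rng.uniform(-0.5, 0.5)  # dither; the decoder regenerates it from the shared seed
    k = round(x + u)            # integer symbol that would be entropy-coded
    return k, k - u             # (transmitted integer, reconstruction)
```

The reconstruction error equals the rounding error of `x + u`, so it never exceeds 0.5 in magnitude and is uniformly distributed regardless of `x`. That is why the quantizer behaves like adding uniform noise, matching the uniform-noise forward process mentioned above.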
[COMMENTS]
20 pages, 10 figures, submitted to ICLR 2025
[LINK]
http://arxiv.org/abs/2412.10935v1
[DATE]
2024-12-15 03:06:01+08:00
[CATEGORIES]
cs.LG
Linear Programming based Approximation to Individually Fair k-Clustering with Outliers
[AUTHORS]
Binita Maity, Shrutimoy Das, Anirban Dasgupta
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2412.10923v1
[DATE]
2024-12-15 02:16:46+08:00
[CATEGORIES]
cs.LG
Predicting Survival of Hemodialysis Patients using Federated Learning
[AUTHORS]
Abhiram Raju, Praneeth Vepakomma
[ABSTRACT]
Hemodialysis patients who are on donor lists for kidney transplant may get
misidentified, delaying their wait time. Thus, predicting their survival time
is crucial for optimizing waiting lists and personalizing treatment plans.
Predicting survival times for patients often requires large quantities of
high-quality but sensitive data. This data is siloed, and since individual datasets
are smaller and less diverse, locally trained survival models do not perform as
well as centralized ones. Hence, we propose the use of Federated Learning in
the context of predicting survival for hemodialysis patients. Federated
Learning (FL) can achieve better performance than local models
while not sharing data between centers. However, despite the increased use of
such technologies, the application of FL to survival prediction, and even more so
to dialysis patients, remains sparse. This paper studies the performance of FL for data of
hemodialysis patients from NephroPlus, the largest private network of dialysis
centers in India.
[COMMENTS]
6 pages, 2 figures, 4 tables, Presented at MIT Undergraduate Research
Technology Conference and to be published as conference proceeding at IEEE
Xplore
[LINK]
http://arxiv.org/abs/2412.10919v1
[DATE]
2024-12-15 02:10:44+08:00
[CATEGORIES]
cs.LG
Adaptive Reward Design for Reinforcement Learning in Complex Robotic Tasks
[AUTHORS]
Minjae Kwon, Ingy ElSayed-Aly, Lu Feng
[ABSTRACT]
There is a surge of interest in using formal languages such as Linear
Temporal Logic (LTL) and finite automata to precisely and succinctly specify
complex tasks and derive reward functions for reinforcement learning (RL) in
robotic applications. However, existing methods often assign sparse rewards
(e.g., giving a reward of 1 only if a task is completed and 0 otherwise),
necessitating extensive exploration to converge to a high-quality policy. To
address this limitation, we propose a suite of reward functions that
incentivize an RL agent to make measurable progress on tasks specified by LTL
formulas and develop an adaptive reward shaping approach that dynamically
updates these reward functions during the learning process. Experimental
results on a range of RL-based robotic tasks demonstrate that the proposed
approach is compatible with various RL algorithms and consistently outperforms
baselines, achieving earlier convergence to better policies with higher task
success rates and returns.
[COMMENTS]
9 pages, 7 figures. Under review at RA-L
[LINK]
http://arxiv.org/abs/2412.10917v1
[DATE]
2024-12-15 02:04:18+08:00
[CATEGORIES]
cs.LG
RL$^3$: Boosting Meta Reinforcement Learning via RL inside RL$^2$
[AUTHORS]
Abhinav Bhatia, Samer B. Nashed, Shlomo Zilberstein
[ABSTRACT]
Meta reinforcement learning (meta-RL) methods such as RL$^2$ have emerged as
promising approaches for learning data-efficient RL algorithms tailored to a
given task distribution. However, they show poor asymptotic performance and
struggle with out-of-distribution tasks because they rely on sequence models,
such as recurrent neural networks or transformers, to process experiences
rather than summarize them using general-purpose RL components such as value
functions. In contrast, traditional RL algorithms are data-inefficient as they
do not use domain knowledge, but do converge to an optimal policy in the limit.
We propose RL$^3$, a principled hybrid approach that incorporates
action-values, learned per task via traditional RL, in the inputs to meta-RL.
We show that RL$^3$ earns greater cumulative reward in the long term compared
to RL$^2$ while drastically reducing meta-training time and generalizes better
to out-of-distribution tasks. Experiments are conducted on both custom and
benchmark discrete domains from the meta-RL literature that exhibit a range of
short-term, long-term, and complex dependencies.
[LINK]
http://arxiv.org/abs/2306.15909v5
[DATE]
2024-12-15 01:58:44+08:00
[CATEGORIES]
cs.LG
PEARL: Input-Agnostic Prompt Enhancement with Negative Feedback Regulation for Class-Incremental Learning
[AUTHORS]
Yongchun Qin, Pengfei Fang, Hui Xue
[ABSTRACT]
Class-incremental learning (CIL) aims to continuously introduce novel
categories into a classification system without forgetting previously learned
ones, thus adapting to evolving data distributions. Researchers are currently
focusing on leveraging the rich semantic information of pre-trained models
(PTMs) in CIL tasks. Prompt learning has been adopted in CIL for its ability to
adjust data distribution to better align with pre-trained knowledge. From the
perspective of prompt learning, this paper critically examines the limitations
of existing methods, which rely heavily on input information. To address this
issue, we propose a novel PTM-based CIL method called Input-Agnostic Prompt
Enhancement with Negative Feedback Regulation (PEARL). In PEARL, we implement
an input-agnostic global prompt coupled with an adaptive momentum update
strategy to reduce the model’s dependency on data distribution, thereby
effectively mitigating catastrophic forgetting. Guided by negative feedback
regulation, this adaptive momentum update addresses the parameter sensitivity
inherent in fixed-weight momentum updates. Furthermore, it fosters the
continuous enhancement of the prompt for new tasks by harnessing correlations
between different tasks in CIL. Experiments on six benchmarks demonstrate that
our method achieves state-of-the-art performance. The code is available at:
https://github.com/qinyongchun/PEARL.
[COMMENTS]
Accepted by AAAI-25
[LINK]
http://arxiv.org/abs/2412.10900v1
[DATE]
2024-12-15 01:13:30+08:00
[CATEGORIES]
cs.LG
Task Diversity in Bayesian Federated Learning: Simultaneous Processing of Classification and Regression
[AUTHORS]
Junliang Lyu, Yixuan Zhang, Xiaoling Lu, Feng Zhou
[ABSTRACT]
This work addresses a key limitation in current federated learning
approaches, which predominantly focus on homogeneous tasks, neglecting the task
diversity on local devices. We propose a principled integration of multi-task
learning using multi-output Gaussian processes (MOGP) at the local level and
federated learning at the global level. MOGP handles correlated classification
and regression tasks, offering a Bayesian non-parametric approach that
naturally quantifies uncertainty. The central server aggregates the posteriors
from local devices, updating a global MOGP prior redistributed for training
local models until convergence. Challenges in performing posterior inference on
local devices are addressed through the Pólya-Gamma augmentation technique
and mean-field variational inference, enhancing computational efficiency and
convergence rate. Experimental results on both synthetic and real data
demonstrate superior predictive performance, OOD detection, uncertainty
calibration and convergence rate, highlighting the method’s potential in
diverse applications. Our code is publicly available at
https://github.com/JunliangLv/task_diversity_BFL.
[LINK]
http://arxiv.org/abs/2412.10897v1
[DATE]
2024-12-15 01:10:54+08:00
[CATEGORIES]
cs.LG
Multi-Class and Multi-Task Strategies for Neural Directed Link Prediction
[AUTHORS]
Claudio Moroni, Claudio Borile, Carolina Mattsson, Michele Starnini, André Panisson
[ABSTRACT]
Link Prediction is a foundational task in Graph Representation Learning,
supporting applications like link recommendation, knowledge graph completion
and graph generation. Graph Neural Networks have shown the most promising
results in this domain and are currently the de facto standard approach to
learning from graph data. However, a key distinction exists between Undirected
and Directed Link Prediction: the former just predicts the existence of an
edge, while the latter must also account for edge directionality and
bidirectionality. This translates to Directed Link Prediction (DLP) having
three sub-tasks, each defined by how training, validation and test sets are
structured. Most research on DLP overlooks this trichotomy, focusing solely on
the “existence” sub-task, where training and test sets are random, uncorrelated
samples of positive and negative directed edges. Even in the works that
recognize the aforementioned trichotomy, models fail to perform well across all
three sub-tasks. In this study, we experimentally demonstrate that training
Neural DLP (NDLP) models only on the existence sub-task, using methods adapted
from Neural Undirected Link Prediction, results in parameter configurations
that fail to capture directionality and bidirectionality, even after
rebalancing edge classes. To address this, we propose three strategies that
handle the three tasks simultaneously. Our first strategy, the Multi-Class
Framework for Neural Directed Link Prediction (MC-NDLP), maps NDLP to a
Multi-Class training objective. The second and third approaches adopt a
Multi-Task perspective, either with a Multi-Objective (MO-DLP) or a Scalarized
(S-DLP) strategy. Our results show that these methods outperform traditional
approaches across multiple datasets and models, achieving equivalent or
superior performance in addressing the three DLP sub-tasks.
[COMMENTS]
15 pages, 2 figures
[LINK]
http://arxiv.org/abs/2412.10895v1
[DATE]
2024-12-15 00:55:16+08:00
[CATEGORIES]
cs.LG
Know Unreported Roadway Incidents in Real-time: A Deep Learning Framework for Early Traffic Anomaly Detection
[AUTHORS]
Haocheng Duan, Hao Wu, Sean Qian
[ABSTRACT]
Conventional automatic incident detection (AID) has relied heavily on all
incident reports exclusively for training and evaluation. However, these
reports suffer from a number of issues, such as delayed reports, inaccurate
descriptions, false alarms, missing reports, and incidents that do not
necessarily influence traffic. Relying on these reports to train or calibrate
AID models hinders their ability to detect traffic anomalies effectively and
in a timely manner, even leading to convergence issues in the model training process.
Moreover, conventional AID models are not inherently designed to capture the
early indicators of any generic incidents. It remains unclear how far ahead an
AID model can report incidents. The AID applications in the literature are also
spatially limited because the data used by most models is often limited to
specific test road segments. To solve these problems, we propose a deep
learning framework utilizing prior domain knowledge and model-designing
strategies. This allows the model to detect a broader range of anomalies, not
only incidents that significantly influence traffic flow but also early
characteristics of incidents along with historically unreported anomalies. We
specially design the model to target the early-stage detection/prediction of an
incident. Additionally, unlike most conventional AID studies, we use widely
available data, enhancing our method’s scalability. The experimental results
across numerous road segments on different maps demonstrate that our model
leads to more effective and early anomaly detection. Our framework does not
focus on stacking or tweaking various deep learning models; instead, it focuses
on model design and training strategies to improve early detection performance.
[LINK]
http://arxiv.org/abs/2412.10892v1
[DATE]
2024-12-15 00:49:29+08:00
[CATEGORIES]
cs.LG
Integrating Generative and Physics-Based Models for Ptychographic Imaging with Uncertainty Quantification
[AUTHORS]
Canberk Ekmekci, Tekin Bicer, Zichao Wendy Di, Junjing Deng, Mujdat Cetin
[ABSTRACT]
Ptychography is a scanning coherent diffractive imaging technique that
enables imaging nanometer-scale features in extended samples. One main
challenge is that widely used iterative image reconstruction methods often
require a significant amount of overlap between adjacent scan locations, leading
to large data volumes and prolonged acquisition times. To address this key
limitation, this paper proposes a Bayesian inversion method for ptychography
that performs effectively even with less overlap between neighboring scan
locations. Furthermore, the proposed method can quantify the inherent
uncertainty on the ptychographic object, which is created by the ill-posed
nature of the ptychographic inverse problem. At a high level, the proposed
method first utilizes a deep generative model to learn the prior distribution
of the object and then generates samples from the posterior distribution of the
object by using a Markov Chain Monte Carlo algorithm. Our results from
simulated ptychography experiments show that the proposed framework can
consistently outperform a widely used iterative reconstruction algorithm in
cases of reduced overlap. Moreover, the proposed framework can provide
uncertainty estimates that closely correlate with the true error, which is not
available in practice.
[COMMENTS]
Machine Learning and the Physical Sciences Workshop at NeurIPS 2024,
7 pages, 4 figures
[LINK]
http://arxiv.org/abs/2412.10882v1
[DATE]
2024-12-15 00:16:37+08:00
[CATEGORIES]
cs.LG
Adaptive Quantization Resolution and Power Control for Federated Learning over Cell-free Networks
[AUTHORS]
Afsaneh Mahmoudi, Emil Björnson
[ABSTRACT]
Federated learning (FL) is a distributed learning framework where users train
a global model by exchanging local model updates with a server instead of raw
datasets, preserving data privacy and reducing communication overhead. However,
the latency grows with the number of users and the model size, impeding
successful FL over traditional wireless networks with orthogonal access.
Cell-free massive multiple-input multiple-output (CFmMIMO) is a promising
solution to serve numerous users on the same time/frequency resource with
similar rates. This architecture greatly reduces uplink latency through spatial
multiplexing but does not take application characteristics into account. In
this paper, we co-optimize the physical layer with the FL application to
mitigate the straggler effect. We introduce a novel adaptive mixed-resolution
quantization scheme of the local gradient vector updates, where only the most
essential entries are given high resolution. Thereafter, we propose a dynamic
uplink power control scheme to manage the varying user rates and mitigate the
straggler effect. The numerical results demonstrate that the proposed method
achieves test accuracy comparable to classic FL while reducing communication
overhead by at least 93% on the CIFAR-10, CIFAR-100, and Fashion-MNIST
datasets. We compare our methods against AQUILA, Top-q, and LAQ, using the
max-sum rate and Dinkelbach power control schemes. Our approach reduces the
communication overhead by 75% and achieves 10% higher test accuracy than these
benchmarks within a constrained total latency budget.
[LINK]
http://arxiv.org/abs/2412.10878v1
[DATE]
2024-12-15 00:08:05+08:00
[CATEGORIES]
cs.LG
A Novel End-To-End Event Geolocation Method Leveraging Hyperbolic Space and Toponym Hierarchies
[AUTHORS]
Yaqiong Qiao, Guojun Huang
[ABSTRACT]
Timely detection and geolocation of events based on social data can provide
critical information for applications such as crisis response and resource
allocation. However, most existing methods are greatly affected by event
detection errors, leading to insufficient geolocation accuracy. To this end,
this paper proposes a novel end-to-end event geolocation method (GTOP)
leveraging Hyperbolic space and toponym hierarchies. Specifically, the proposed
method contains one event detection module and one geolocation module. The
event detection module constructs a heterogeneous information network based on
social data, and then constructs a homogeneous message graph and combines it
with the text and time features of the message to learn initial features of
nodes. Node features are updated in Hyperbolic space and then fed into a
classifier for event detection. To reduce the geolocation error, this paper
proposes a noise toponym filtering algorithm (HIST) based on the hierarchical
structure of toponyms. HIST analyzes the hierarchical structure of toponyms
mentioned in the event cluster, taking the highly frequent city-level locations
as the coarse-grained locations for events. By comparing the hierarchical
structure of the toponyms within the cluster against those of the
coarse-grained locations of events, HIST filters out noisy toponyms. To further
improve the geolocation accuracy, we propose a fine-grained pseudo toponyms
generation algorithm (FIT) based on the output of HIST, and combine generated
pseudo toponyms with filtered toponyms to locate events based on the geographic
center points of the combined toponyms. Extensive experiments are conducted on
the Chinese dataset constructed in this paper and another public English
dataset. The experimental results show that the proposed method is superior to
the state-of-the-art baselines.
[LINK]
http://arxiv.org/abs/2412.10870v1
[DATE]
2024-12-14 23:43:58+08:00
[CATEGORIES]
cs.CL
CRENER: A Character Relation Enhanced Chinese NER Model
[AUTHORS]
Yaqiong Qiao, Shixuan Peng
[ABSTRACT]
Chinese Named Entity Recognition (NER) is an important task in information
extraction, which has a significant impact on downstream applications. Due to
the lack of natural separators in Chinese, previous NER methods mostly relied
on external dictionaries to enrich the semantic and boundary information of
Chinese words. However, such methods may introduce noise that affects the
accuracy of named entity recognition. To this end, we propose a character
relation enhanced Chinese NER model (CRENER). This model defines four types of
tags that reflect the relationships between characters, and proposes a
fine-grained modeling of the relationships between characters based on three
types of relationships: adjacency relations between characters, relations
between characters and tags, and relations between tags, to more accurately
identify entity boundaries and improve Chinese NER accuracy. Specifically, we
transform the Chinese NER task into a character-character relationship
classification task, ensuring the accuracy of entity boundary recognition
through joint modeling of relation tags. To enhance the model’s ability to
understand contextual information, CRENER further constructed an adapted
transformer encoder that combines unscaled direction-aware and distance-aware
masked self-attention mechanisms. Moreover, a relationship representation
enhancement module was constructed to model predefined relationship tags,
effectively mining the relationship representations between characters and
tags. Experiments conducted on four well-known Chinese NER benchmark datasets
have shown that the proposed model outperforms state-of-the-art baselines. The
ablation experiment also demonstrated the effectiveness of the proposed model.
[LINK]
http://arxiv.org/abs/2412.10858v1
[DATE]
2024-12-14 23:14:39+08:00
[CATEGORIES]
cs.CL
Superhuman performance of a large language model on the reasoning tasks of a physician
[AUTHORS]
Peter G. Brodeur, Thomas A. Buckley, Zahir Kanjee, Ethan Goh, Evelyn Bin Ling, Priyank Jain, Stephanie Cabral, Raja-Elie Abdulnour, Adrian Haimovich, Jason A. Freed, Andrew Olson, Daniel J. Morgan, Jason Hom, Robert Gallo, Eric Horvitz, Jonathan Chen, Arjun K. Manrai, Adam Rodman
[ABSTRACT]
Performance of large language models (LLMs) on medical tasks has
traditionally been evaluated using multiple choice question benchmarks.
However, such benchmarks are highly constrained, saturated with repeated
impressive performance by LLMs, and have an unclear relationship to performance
in real clinical scenarios. Clinical reasoning, the process by which physicians
employ critical thinking to gather and synthesize clinical data to diagnose and
manage medical problems, remains an attractive benchmark for model performance.
Prior LLMs have shown promise in outperforming clinicians in routine and
complex diagnostic scenarios. We sought to evaluate OpenAI’s o1-preview model,
a model developed to increase run-time via chain of thought processes prior to
generating a response. We characterize the performance of o1-preview with five
experiments including differential diagnosis generation, display of diagnostic
reasoning, triage differential diagnosis, probabilistic reasoning, and
management reasoning, adjudicated by physician experts with validated
psychometrics. Our primary outcome was comparison of the o1-preview output to
identical prior experiments that have historical human controls and benchmarks
of previous LLMs. Significant improvements were observed with differential
diagnosis generation and quality of diagnostic and management reasoning. No
improvements were observed with probabilistic reasoning or triage differential
diagnosis. This study highlights o1-preview’s ability to perform strongly on
tasks that require complex critical thinking such as diagnosis and management
while its performance on probabilistic reasoning tasks was similar to past
models. New robust benchmarks and scalable evaluation of LLM capabilities
compared to human physicians are needed along with trials evaluating AI in real
clinical settings.
[LINK]
http://arxiv.org/abs/2412.10849v1
[DATE]
2024-12-14 22:46:18+08:00
[CATEGORIES]
cs.CL
Rethinking Chain-of-Thought from the Perspective of Self-Training
[AUTHORS]
Zongqian Wu, Baoduo Xu, Ruochen Cui, Mengmeng Zhan, Xiaofeng Zhu, Lei Feng
[ABSTRACT]
Chain-of-thought (CoT) reasoning has emerged as an effective approach for
activating latent capabilities in large language models (LLMs). We observe that
CoT shares significant similarities with self-training in terms of their
learning processes. Motivated by these parallels, this paper explores the
underlying relationship between CoT and self-training, demonstrating how
insights from self-training can enhance CoT performance. Specifically, our
study first reveals that CoT, like self-training, follows the principle of
semantic entropy minimization. Leveraging this insight, we propose a novel CoT
framework that incorporates two key components: (i) a task-specific prompt
module designed to guide LLMs in generating high-quality initial reasoning
processes, and (ii) an adaptive reasoning iteration module for progressively
refining the reasoning process.
[COMMENTS]
16 pages, 12 figures
[LINK]
http://arxiv.org/abs/2412.10827v1
[DATE]
2024-12-14 21:12:50+08:00
[CATEGORIES]
cs.CL
CursorCore: Assist Programming through Aligning Anything
[AUTHORS]
Hao Jiang, Qi Liu, Rui Li, Shengyu Ye, Shijin Wang
[ABSTRACT]
Large language models have been successfully applied to programming
assistance tasks, such as code completion, code insertion, and instructional
code editing. However, these applications remain insufficiently automated and
struggle to effectively integrate various types of information during the
programming process, including coding history, current code, and user
instructions. In this work, we propose a new conversational framework that
comprehensively integrates these information sources, collect data to train our
models and evaluate their performance. Firstly, to thoroughly evaluate how well
models align with different types of information and the quality of their
outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to
comprehensively assess the performance of models in programming assistance
tasks. Then, for data collection, we develop a data generation pipeline,
Programming-Instruct, which synthesizes training data from diverse sources,
such as GitHub and online judge platforms. This pipeline can automatically
generate various types of messages throughout the programming process. Finally,
using this pipeline, we generate 219K samples, fine-tune multiple models, and
develop the CursorCore series. We show that CursorCore outperforms other models
of comparable size. This framework unifies applications such as inline chat and
automated editing, and contributes to the advancement of coding assistants. Code,
models and data are freely available at
https://github.com/TechxGenus/CursorCore.
[LINK]
http://arxiv.org/abs/2410.07002v2
[DATE]
2024-12-14 20:20:29+08:00
[CATEGORIES]
cs.CL
Evaluating Self-Generated Documents for Enhancing Retrieval-Augmented Generation with Large Language Models
[AUTHORS]
Jiatao Li, Xinyu Hu, Xunjian Yin, Xiaojun Wan
[ABSTRACT]
The integration of documents generated by LLMs themselves (Self-Docs)
alongside retrieved documents has emerged as a promising strategy for
retrieval-augmented generation systems. However, previous research primarily
focuses on optimizing the use of Self-Docs, with their inherent properties
remaining underexplored. To bridge this gap, we first investigate the overall
effectiveness of Self-Docs, identifying key factors that shape their
contribution to RAG performance (RQ1). Building on these insights, we develop a
taxonomy grounded in Systemic Functional Linguistics to compare the influence
of various Self-Docs categories (RQ2) and explore strategies for combining them
with external sources (RQ3). Our findings reveal which types of Self-Docs are
most beneficial and offer practical guidelines for leveraging them to achieve
significant improvements in knowledge-intensive question answering tasks.
[LINK]
http://arxiv.org/abs/2410.13192v2
[DATE]
2024-12-14 20:12:01+08:00
[CATEGORIES]
cs.CL
Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models
[AUTHORS]
Qingni Wang, Tiantian Geng, Zhiyuan Wang, Teng Wang, Bo Fu, Feng Zheng
[ABSTRACT]
Multimodal Large Language Models (MLLMs) exhibit promising advancements
across various tasks, yet they still encounter significant trustworthiness
issues. Prior studies apply Split Conformal Prediction (SCP) in language
modeling to construct prediction sets with statistical guarantees. However,
these methods typically rely on internal model logits or are restricted to
multiple-choice settings, which hampers their generalizability and adaptability
in dynamic, open-ended environments. In this paper, we introduce TRON, a
two-step framework for risk control and assessment, applicable to any MLLM that
supports sampling in both open-ended and closed-ended scenarios. TRON comprises
two main components: (1) a novel conformal score to sample response sets of
minimum size, and (2) a nonconformity score to identify high-quality responses
based on self-consistency theory, controlling the error rates by two specific
risk levels. Furthermore, we investigate semantic redundancy in prediction sets
within open-ended contexts for the first time, leading to a promising
evaluation metric for MLLMs based on average set size. Our comprehensive
experiments across four Video Question-Answering (VideoQA) datasets utilizing
eight MLLMs show that TRON achieves desired error rates bounded by two
user-specified risk levels. Additionally, deduplicated prediction sets maintain
adaptiveness while being more efficient and stable for risk assessment under
different risk levels.
[LINK]
http://arxiv.org/abs/2410.08174v2
[DATE]
2024-12-14 18:34:35+08:00
[CATEGORIES]
cs.CL
cs.LG
Selected Languages are All You Need for Cross-lingual Truthfulness Transfer
[AUTHORS]
Weihao Liu, Ning Wu, Wenbiao Ding, Shining Liang, Ming Gong, Dongmei Zhang
[ABSTRACT]
Truthfulness stands out as an essential challenge for Large Language Models
(LLMs). Although many works have developed various ways for truthfulness
enhancement, they seldom focus on truthfulness in multilingual scenarios.
Meanwhile, contemporary multilingual aligning technologies struggle to balance
numerous languages and often exhibit serious truthfulness gaps across different
languages, especially those that differ greatly from English. In our work, we
extend truthfulness evaluation to multilingual contexts and propose a practical
method for cross-lingual truthfulness transfer called Fact-aware Multilingual
Selective Synergy (FaMSS). FaMSS is able to select an optimal subset of all
tested languages by language bias and transfer contributions, and then employ
translation instruction tuning for cross-lingual truthfulness transfer.
Experimental results demonstrate that our approach can effectively reduce the
multilingual representation disparity and boost cross-lingual truthfulness
transfer of LLMs.
[COMMENTS]
16 pages
[LINK]
http://arxiv.org/abs/2406.14434v2
[DATE]
2024-12-14 17:58:07+08:00
[CATEGORIES]
cs.CL
Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study
[AUTHORS]
Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen
[ABSTRACT]
Recent advancements in large language models (LLMs) have significantly
enhanced their coding capabilities. However, existing benchmarks predominantly
focus on simplified or isolated aspects of coding, such as single-file code
generation or repository issue debugging, falling short of measuring the full
spectrum of challenges raised by real-world programming activities. In this
case study, we explore the performance of LLMs across the entire software
development lifecycle with DevEval, encompassing stages including software
design, environment setup, implementation, acceptance testing, and unit
testing. DevEval features four programming languages, multiple domains,
high-quality data collection, and carefully designed and verified metrics for
each task. Empirical studies show that current LLMs, including GPT-4, fail to
solve the challenges presented within DevEval. Our findings offer actionable
insights for the future development of LLMs toward real-world programming
applications.
[COMMENTS]
Our data and code are available at
https://github.com/open-compass/DevEval
[LINK]
http://arxiv.org/abs/2403.08604v3
[DATE]
2024-12-14 17:45:51+08:00
[CATEGORIES]
cs.CL
WEPO: Web Element Preference Optimization for LLM-based Web Navigation
[AUTHORS]
Jiarun Liu, Jia Hao, Chunhong Zhang, Zheng Hu
[ABSTRACT]
The rapid advancement of autonomous web navigation has significantly
benefited from grounding pretrained Large Language Models (LLMs) as agents.
However, current research has yet to fully leverage the redundancy of HTML
elements for contrastive training. This paper introduces a novel approach to
LLM-based web navigation tasks, called Web Element Preference Optimization
(WEPO). WEPO utilizes unsupervised preference learning by sampling
distance-based non-salient web elements as negative samples, optimizing the
maximum likelihood objective within Direct Preference Optimization (DPO). We evaluate
WEPO on the Mind2Web benchmark and empirically demonstrate that WEPO aligns
user high-level intent with output actions more effectively. The results show
that our method achieves state-of-the-art performance, with an improvement of 13.8%
over WebAgent and 5.3% over the visual language model CogAgent baseline. Our
findings underscore the potential of preference optimization to enhance web
navigation and other web page based tasks, suggesting a promising direction for
future research.
[COMMENTS]
Published at AAAI 2025
[LINK]
http://arxiv.org/abs/2412.10742v1
[DATE]
2024-12-14 16:25:28+08:00
[CATEGORIES]
cs.CL
An Active Learning Framework for Inclusive Generation by Large Language Models
[AUTHORS]
Sabit Hassan, Anthony Sicilia, Malihe Alikhani
[COMMENTS]
COLING, 2025
[LINK]
http://arxiv.org/abs/2410.13641v2
[DATE]
2024-12-14 16:15:55+08:00
[CATEGORIES]
cs.CL
Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing
[AUTHORS]
Hukai Huang, Shenghui Lu, Yahui Shan, He Qu, Wenhao Guan, Qingyang Hong, Lin Li
[ABSTRACT]
The Mixture of Experts (MoE) approach is well-suited for multilingual and
code-switching (CS) tasks due to its multi-expert architecture. This work
introduces the DLG-MoE, a Dynamic Language Group-based MoE optimized for
bilingual and CS scenarios. DLG-MoE operates based on a hierarchical routing
mechanism. First, the language router explicitly models the language and
dispatches the representations to the corresponding language expert groups.
Subsequently, the unsupervised router within each language group implicitly
models attributes beyond language, and coordinates expert routing and
collaboration. The model achieves state-of-the-art (SOTA) performance while
also having unparalleled flexibility. It supports different top-k inference and
streaming capabilities, and can also prune the model parameters to obtain a
monolingual sub-model. The code will be released.
[COMMENTS]
Submitted to ICASSP2025
[LINK]
http://arxiv.org/abs/2407.18581v3
[DATE]
2024-12-14 15:25:28+08:00
[CATEGORIES]
cs.CL
HITgram: A Platform for Experimenting with n-gram Language Models
[AUTHORS]
Shibaranjani Dasgupta, Chandan Maity, Somdip Mukherjee, Rohan Singh, Diptendu Dutta, Debasish Jana
[ABSTRACT]
Large language models (LLMs) are powerful but resource-intensive, limiting
accessibility. HITgram addresses this gap by offering a lightweight platform
for n-gram model experimentation, ideal for resource-constrained environments.
It supports unigrams to 4-grams and incorporates features like
context-sensitive weighting, Laplace smoothing, and dynamic corpus management to
enhance prediction accuracy, even for unseen word sequences. Experiments
demonstrate HITgram’s efficiency, achieving 50,000 tokens/second and generating
2-grams from a 320MB corpus in 62 seconds. HITgram scales efficiently,
constructing 4-grams from a 1GB file in under 298 seconds on an 8 GB RAM
system. Planned enhancements include multilingual support, advanced smoothing,
parallel processing, and model saving, further broadening its utility.
[LINK]
http://arxiv.org/abs/2412.10717v1
[DATE]
2024-12-14 15:20:35+08:00
[CATEGORIES]
cs.CL
On Effects of Steering Latent Representation for Large Language Model Unlearning
[AUTHORS]
Dang Huu-Tien, Trung-Tin Pham, Hoang Thanh-Tung, Naoya Inoue
[ABSTRACT]
Representation Misdirection for Unlearning (RMU), which steers model
representation in the intermediate layer to a target random representation, is
an effective method for large language model (LLM) unlearning. Despite its high
performance, the underlying cause and explanation remain underexplored. In this
paper, we theoretically demonstrate that steering forget representations in the
intermediate layer reduces token confidence, causing LLMs to generate wrong or
nonsense responses. We investigate how the coefficient influences the alignment
of forget-sample representations with the random direction and hint at the
optimal coefficient values for effective unlearning across different network
layers. We show that RMU unlearned models are robust against adversarial
jailbreak attacks. Furthermore, our empirical analysis shows that RMU is less
effective when applied to the middle and later layers in LLMs. To resolve this
drawback, we propose Adaptive RMU – a simple yet effective alternative method
that makes unlearning effective with most layers. Extensive experiments
demonstrate that Adaptive RMU significantly improves the unlearning performance
compared to prior art while incurring no additional computational cost.
[COMMENTS]
Accepted at AAAI-25 Main Technical Track
[LINK]
http://arxiv.org/abs/2408.06223v2
[DATE]
2024-12-14 15:08:18+08:00
[CATEGORIES]
cs.CL
Towards Effective, Efficient and Unsupervised Social Event Detection in the Hyperbolic Space
[AUTHORS]
Xiaoyan Yu, Yifan Wei, Shuaishuai Zhou, Zhiwei Yang, Li Sun, Hao Peng, Liehuang Zhu, Philip S. Yu
[ABSTRACT]
The vast, complex, and dynamic nature of social message data has posed
challenges to social event detection (SED). Despite considerable effort, these
challenges persist, often resulting in inadequately expressive message
representations (ineffective) and prolonged learning durations (inefficient).
In response to the challenges, this work introduces an unsupervised framework,
HyperSED (Hyperbolic SED). Specifically, the proposed framework first models
social messages into semantic-based message anchors, and then leverages the
structure of the anchor graph and the expressiveness of the hyperbolic space to
acquire structure- and geometry-aware anchor representations. Finally, HyperSED
builds the partitioning tree of the anchor message graph by incorporating
differentiable structural information as the reflection of the detected events.
Extensive experiments on public datasets demonstrate HyperSED’s competitive
performance, along with a substantial improvement in efficiency compared to the
current state-of-the-art unsupervised paradigm. Statistically, HyperSED boosts
incremental SED by an average of 2%, 2%, and 25% in NMI, AMI, and ARI,
respectively; enhancing efficiency by up to 37.41 times and at least 12.10
times, illustrating the advancement of the proposed framework. Our code is
publicly available at https://github.com/XiaoyanWork/HyperSED.
[COMMENTS]
Accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2412.10712v1
[DATE]
2024-12-14 14:55:27+08:00
[CATEGORIES]
cs.CL
Efficient Adaptation of Multilingual Models for Japanese ASR
[AUTHORS]
Mark Bajo, Haruka Fukukawa, Ryuji Morita, Yuma Ogasawara
[ABSTRACT]
This study explores fine-tuning multilingual ASR (Automatic Speech
Recognition) models, specifically OpenAI’s Whisper-Tiny, to improve performance
in Japanese. While multilingual models like Whisper offer versatility, they
often lack precision in specific languages. Conversely, monolingual models like
ReazonSpeech excel in language-specific tasks but are less adaptable. Using
Japanese-specific datasets and Low-Rank Adaptation (LoRA) along with end-to-end
(E2E) training, we fine-tuned Whisper-Tiny to bridge this gap. Our results show
that fine-tuning reduced Whisper-Tiny’s Character Error Rate (CER) from 32.7 to
20.8 with LoRA and to 14.7 with end-to-end fine-tuning, surpassing
Whisper-Base’s CER of 20.2. However, challenges with domain-specific terms
remain, highlighting the need for specialized datasets. These findings
demonstrate that fine-tuning multilingual models can achieve strong
language-specific performance while retaining their flexibility. This approach
provides a scalable solution for improving ASR in resource-constrained
environments and languages with complex writing systems like Japanese.
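
The abstract does not say how CER was computed. As a rough illustration, the sketch below uses the standard definition, character-level edit distance divided by reference length, and assumes the figures above are percentages. The function names and the example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One substituted character out of five gives a CER of 20.0.
example = cer("こんにちは", "こんにちわ")
```

Working at the character level avoids the need for word segmentation, which is one reason CER rather than word error rate is commonly reported for Japanese.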
[LINK]
http://arxiv.org/abs/2412.10705v1
[DATE]
2024-12-14 14:32:16+08:00
[CATEGORIES]
cs.CL
VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
[AUTHORS]
Manan Suri, Puneet Mathur, Franck Dernoncourt, Kanika Goswami, Ryan A. Rossi, Dinesh Manocha
[ABSTRACT]
Understanding information from a collection of multiple documents,
particularly those with visually rich elements, is important for
document-grounded question answering. This paper introduces VisDoMBench, the
first comprehensive benchmark designed to evaluate QA systems in multi-document
settings with rich multimodal content, including tables, charts, and
presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval
Augmented Generation (RAG) approach that simultaneously utilizes visual and
textual RAG, combining robust visual retrieval capabilities with sophisticated
linguistic reasoning. VisDoMRAG employs a multi-step reasoning process
encompassing evidence curation and chain-of-thought reasoning for concurrent
textual and visual RAG pipelines. A key novelty of VisDoMRAG is its
consistency-constrained modality fusion mechanism, which aligns the reasoning
processes across modalities at inference time to produce a coherent final
answer. This leads to enhanced accuracy in scenarios where critical information
is distributed across modalities and improved answer verifiability through
implicit context attribution. Through extensive experiments involving
open-source and proprietary large language models, we benchmark
state-of-the-art document QA methods on VisDoMBench. Extensive results show
that VisDoMRAG outperforms unimodal and long-context LLM baselines for
end-to-end multimodal document QA by 12-20%.
[LINK]
http://arxiv.org/abs/2412.10704v1
[DATE]
2024-12-14 14:24:55+08:00
[CATEGORIES]
cs.CL
Aligning Large Language Models with Human Opinions through Persona Selection and Value–Belief–Norm Reasoning
[AUTHORS]
Do Xuan Long, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen
[ABSTRACT]
Reasoning and predicting human opinions with large language models (LLMs) is
essential yet challenging. Current methods employ role-playing with personae
but face two major issues: LLMs are sensitive to even a single irrelevant
persona, skewing predictions by up to 30%, and LLMs fail to reason
strategically over personae. We propose Chain-of-Opinion (COO), a simple
four-step solution modeling which and how to reason with personae, inspired by
the Value–Belief–Norm (VBN) theory. COO differentiates between explicit
personae (demographics and ideology) and implicit personae (historical
opinions), and involves: (1) filtering irrelevant attributes from explicit
personae, (2) ranking implicit personae into a preferential list for selecting
top-k, (3) applying novel VBN reasoning to extract user environmental and
personal value, belief, and norm variables for accurate and reliable
predictions, and (4) iterating VBN reasoning with progressively larger lists of
implicit personae to handle potential persona insufficiency. COO efficiently
achieves new state-of-the-art opinion prediction via prompting with only 5
inference calls, improving prior techniques by up to 4%. Notably, fine-tuning
LMs with COO data results in significantly better opinion-aligned models, by up
to 23%.
[COMMENTS]
COLING 2025
[LINK]
http://arxiv.org/abs/2311.08385v5
[DATE]
2024-12-14 14:08:07+08:00
[CATEGORIES]
cs.CL
Neural-Symbolic Collaborative Distillation: Advancing Small Language Models for Complex Reasoning Tasks
[AUTHORS]
Huanxuan Liao, Shizhu He, Yao Xu, Yuanzhe Zhang, Kang Liu, Jun Zhao
[ABSTRACT]
In this paper, we propose $\textbf{Ne}$ural-$\textbf{Sy}$mbolic
$\textbf{C}$ollaborative $\textbf{D}$istillation ($\textbf{NesyCD}$), a novel
knowledge distillation method for learning the complex reasoning abilities of
Large Language Models (LLMs, e.g., $>$ 13B). We argue that complex
reasoning tasks are difficult for Small Language Models (SLMs, e.g., $\leq$
7B), as these tasks demand not only general cognitive abilities but also
specialized knowledge, which is often sparse and difficult for these
neural-based SLMs to effectively capture. Therefore, NesyCD distills the
general capabilities and specialized knowledge in LLMs in different manners.
On the one hand, we distill only general abilities from teacher LLMs into the
student SLMs of parameterized neural networks. On the other hand, for the
specialized abilities and uncommon knowledge of a complex reasoning task, we
employ a symbolic knowledge distillation approach to obtain and store the
specialized knowledge within a symbolic knowledge base (KB). By decoupling
general and specialized capabilities, the proposed NesyCD can achieve superior
performance cost-effectively, utilizing smaller models and blending
parameterized neural networks with symbolic KB. Moreover, the specialized KB
generalizes well and can be comprehended and manipulated by humans. Our experiments
show that NesyCD significantly boosts SLMs’ complex reasoning performance on
in-domain (BBH, GSM8K) and out-of-domain (AGIEval, ARC) datasets. Notably, our
approach enabled the LLaMA3-8B and Qwen2-7B to surpass GPT-3.5-turbo in
performance and come close to matching LLaMA3-70B, despite the latter having
nine times more parameters. Our code will be available at
https://github.com/Xnhyacinth/NesyCD.
[COMMENTS]
Accepted to AAAI 2025
[LINK]
http://arxiv.org/abs/2409.13203v2
[DATE]
2024-12-14 13:57:43+08:00
[CATEGORIES]
cs.CL
Awakening Augmented Generation: Learning to Awaken Internal Knowledge of Large Language Models for Question Answering
[AUTHORS]
Huanxuan Liao, Shizhu He, Yao Xu, Yuanzhe Zhang, Kang Liu, Shengping Liu, Jun Zhao
[ABSTRACT]
Retrieval-Augmented-Generation and Generation-Augmented-Generation have been
proposed to enhance the knowledge required for question answering with Large
Language Models (LLMs) by leveraging richer context. However, the former relies
on external resources, and both require incorporating explicit documents into
the context, which increases execution costs and susceptibility to noise data
during inference. Recent works indicate that LLMs model rich knowledge, but it
is often not effectively activated and awakened. Inspired by this, we propose a
novel knowledge-augmented framework, $\textbf{Awakening-Augmented-Generation}$
(AAG), which mimics the human ability to answer questions using only thinking
and recalling to compensate for knowledge gaps, thereby awakening relevant
knowledge in LLMs without relying on external resources. AAG consists of two
key components for awakening richer context. Explicit awakening fine-tunes a
context generator to create a synthetic, compressed document that functions as
symbolic context. Implicit awakening utilizes a hypernetwork to generate
adapters based on the question and synthetic document, which are inserted into
LLMs to serve as parameter context. Experimental results on three datasets
demonstrate that AAG exhibits significant advantages in both open-domain and
closed-book settings, as well as in out-of-distribution generalization. Our
code will be available at https://github.com/Xnhyacinth/IAG.
[COMMENTS]
Accepted to COLING 2025
[LINK]
http://arxiv.org/abs/2403.15268v5
[DATE]
2024-12-14 13:52:11+08:00
[CATEGORIES]
cs.CL
$\textit{SKIntern}$: Internalizing Symbolic Knowledge for Distilling Better CoT Capabilities into Small Language Models
[AUTHORS]
Huanxuan Liao, Shizhu He, Yupu Hao, Xiang Li, Yuanzhe Zhang, Jun Zhao, Kang Liu
[ABSTRACT]
Small Language Models (SLMs) are attracting attention due to the high
computational demands and privacy concerns of Large Language Models (LLMs).
Some studies fine-tune SLMs using Chains of Thought (CoT) data distilled from
LLMs, aiming to enhance their reasoning ability. Furthermore, some CoT
distillation methods introduce external symbolic knowledge into the generation
process to improve the limited knowledge memory, reasoning ability and
out-of-domain (OOD) generalization of SLMs. However, the introduction of
symbolic knowledge increases computational overhead and introduces potential
noise. In this paper, we introduce $\textit{SKIntern}$, an innovative approach
that empowers SLMs to internalize symbolic knowledge and few-shot examples
gradually through a progressive fine-tuning process, guided by a predefined
linear decay schedule under curriculum learning. By efficiently internalizing
knowledge, $\textit{SKIntern}$ reduces computational overhead and speeds up the
reasoning process by focusing solely on the question during inference. It
outperforms state-of-the-art baselines by over 5%, while reducing inference
costs (measured in FLOPs) by up to $4\times$ across a wide range of SLMs in
both in-domain (ID) and out-of-domain (OOD) tasks. Our code will be available
at https://github.com/Xnhyacinth/SKIntern.
[COMMENTS]
Accepted to COLING 2025
[LINK]
http://arxiv.org/abs/2409.13183v2
[DATE]
2024-12-14 13:50:41+08:00
[CATEGORIES]
cs.CL
Learning to Verify Summary Facts with Fine-Grained LLM Feedback
[AUTHORS]
Jihwan Oh, Jeonghwan Choi, Nicole Hee-Yeon Kim, Taewon Yun, Hwanjun Song
[ABSTRACT]
Training automatic summary fact verifiers often faces the challenge of a lack
of human-labeled data. In this paper, we explore an alternative way of leveraging
Large Language Model (LLM) generated feedback to address the inherent
limitation of using human-labeled data. We introduce FineSumFact, a large-scale
dataset containing fine-grained factual feedback on summaries. We employ 10
distinct LLMs for diverse summary generation and Llama-3-70B-Instruct for
feedback. We utilize this dataset to fine-tune the lightweight open-source
model Llama-3-8B-Instruct, optimizing resource efficiency while maintaining
high performance. Our experimental results reveal that the model trained on
extensive LLM-generated datasets surpasses that trained on smaller
human-annotated datasets when evaluated using human-generated test sets.
Fine-tuning fact verification models with LLM feedback can be more effective
and cost-efficient than using human feedback. The dataset is available at
https://github.com/DISL-Lab/FineSumFact.
[COMMENTS]
Accepted at COLING 2025
[LINK]
http://arxiv.org/abs/2412.10689v1
[DATE]
2024-12-14 13:28:44+08:00
[CATEGORIES]
cs.CL
Inference Scaling for Bridging Retrieval and Augmented Generation
[AUTHORS]
Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He
[ABSTRACT]
Retrieval-augmented generation (RAG) has emerged as a popular approach to
steering the output of a large language model (LLM) by incorporating retrieved
contexts as inputs. However, existing work has observed a generator bias, such
that improving the retrieval results may negatively affect the outcome. In this
work, we show that such bias can be mitigated through inference scaling, by
aggregating inference calls over permuted orders of retrieved contexts. The proposed
Mixture-of-Intervention (MOI) explicitly models the debiased utility of each
passage with multiple forward passes to construct a new ranking. We also show
that MOI can leverage the retriever’s prior knowledge to reduce the
computational cost by minimizing the number of permutations considered and
lowering the cost per LLM call. We showcase the effectiveness of MOI on diverse
RAG tasks, improving ROUGE-L on MS MARCO and EM on HotpotQA benchmarks by ~7
points.
[LINK]
http://arxiv.org/abs/2412.10684v1
[DATE]
2024-12-14 13:06:43+08:00
[CATEGORIES]
cs.CL
Chasing Progress, Not Perfection: Revisiting Strategies for End-to-End LLM Plan Generation
[AUTHORS]
Sukai Huang, Trevor Cohn, Nir Lipovetzky
[COMMENTS]
8 pages main body, 10 pages appendix, accepted by Workshop on
Planning in the Era of LLMs (LM4Plan @ AAAI 2025)
[LINK]
http://arxiv.org/abs/2412.10675v1
[DATE]
2024-12-14 12:23:14+08:00
[CATEGORIES]
cs.CL
KBAlign: Efficient Self Adaptation on Specific Knowledge Bases
[AUTHORS]
Zheni Zeng, Yuxuan Chen, Shi Yu, Ruobing Wang, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
[ABSTRACT]
Humans can utilize techniques to quickly acquire knowledge from specific
materials in advance, such as creating self-assessment questions, enabling us
to accomplish related tasks more efficiently. In contrast, large language models
(LLMs) usually rely on retrieval-augmented generation to exploit knowledge
materials in an instant manner, or require external signals such as human
preference data and stronger LLM annotations to conduct knowledge adaptation.
To unleash the self-learning potential of LLMs, we propose KBAlign, an approach
designed for efficient adaptation to downstream tasks involving knowledge
bases. Our method utilizes iterative training with self-annotated data such as
Q&A pairs and revision suggestions, enabling the model to grasp the knowledge
content efficiently. Experimental results on multiple datasets demonstrate the
effectiveness of our approach, significantly boosting model performance in
downstream tasks that require specific knowledge at a low cost. Notably, our
approach achieves over 90% of the performance improvement that can be obtained
by using GPT-4-turbo annotation, while relying entirely on self-supervision. We
release our experimental data, models, and process analyses to the community
for further exploration (https://github.com/thunlp/KBAlign).
[LINK]
http://arxiv.org/abs/2411.14790v3
[DATE]
2024-12-14 11:47:44+08:00
[CATEGORIES]
cs.CL
Thinking with Knowledge Graphs: Enhancing LLM Reasoning Through Structured Data
[AUTHORS]
Xue Wu, Kostas Tsioutsiouliklis
[ABSTRACT]
Large Language Models (LLMs) have demonstrated remarkable capabilities in
natural language understanding and generation. However, they often struggle
with complex reasoning tasks and are prone to hallucination. Recent research
has shown promising results in leveraging knowledge graphs (KGs) to enhance LLM
performance. KGs provide a structured representation of entities and their
relationships, offering a rich source of information that can enhance the
reasoning capabilities of LLMs. For this work, we have developed different
techniques that tightly integrate KG structures and semantics into LLM
representations. Our results show that we are able to significantly improve the
performance of LLMs in complex reasoning scenarios, and ground the reasoning
process with KGs. We are the first to represent KGs with programming language
and fine-tune pretrained LLMs with KGs. This integration facilitates more
accurate and interpretable reasoning processes, paving the way for more
advanced reasoning capabilities of LLMs.
[LINK]
http://arxiv.org/abs/2412.10654v1
[DATE]
2024-12-14 10:51:47+08:00
[CATEGORIES]
cs.CL
cs.LG
SecFormer: Fast and Accurate Privacy-Preserving Inference for Transformer Models via SMPC
[AUTHORS]
Jinglong Luo, Yehong Zhang, Zhuo Zhang, Jiaqi Zhang, Xin Mu, Hui Wang, Yue Yu, Zenglin Xu
[ABSTRACT]
With the growing use of Transformer models hosted on cloud platforms to offer
inference services, privacy concerns are escalating, especially concerning
sensitive data like investment plans and bank account details. Secure
Multi-Party Computing (SMPC) emerges as a promising solution to protect the
privacy of inference data and model parameters. However, the application of
SMPC in Privacy-Preserving Inference (PPI) for Transformer models often leads
to considerable slowdowns or declines in performance. This is largely due to
the multitude of nonlinear operations in the Transformer architecture, which
are not well-suited to SMPC and difficult to circumvent or optimize
effectively. To address this concern, we introduce a comprehensive PPI
framework called SecFormer to achieve fast and accurate PPI for Transformer
models. We successfully eliminate the high-cost exponential and maximum
operations in PPI without sacrificing model performance and develop a suite of
efficient SMPC protocols by employing suitable numerical computation methods to
boost other complex nonlinear functions in PPI, including GeLU, LayerNorm, and
a redesigned Softmax. Our extensive experiments reveal that SecFormer
outperforms MPCFormer in performance, showing improvements of $3.4\%$ and
$24.7\%$ for BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$, respectively. In
terms of efficiency, SecFormer is 3.57 and 3.58 times faster than PUMA for
BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$, demonstrating its effectiveness
and speed.
[COMMENTS]
ACL 2024
[LINK]
http://arxiv.org/abs/2401.00793v4
[DATE]
2024-12-14 10:42:10+08:00
[CATEGORIES]
cs.LG
cs.CL
Human-interpretable clustering of short-text using large language models
[AUTHORS]
Justin K. Miller, Tristram J. Alexander
[ABSTRACT]
Clustering short text is a difficult problem, due to the low word
co-occurrence between short text documents. This work shows that large language
models (LLMs) can overcome the limitations of traditional clustering approaches
by generating embeddings that capture the semantic nuances of short text. In
this study clusters are found in the embedding space using Gaussian Mixture
Modelling (GMM). The resulting clusters are found to be more distinctive and
more human-interpretable than clusters produced using the popular methods of
doc2vec and Latent Dirichlet Allocation (LDA). The success of the clustering
approach is quantified using human reviewers and through the use of a
generative LLM. The generative LLM shows good agreement with the human
reviewers, and is suggested as a means to bridge the ‘validation gap’ which
often exists between cluster production and cluster interpretation. The
comparison between LLM-coding and human-coding reveals intrinsic biases in
each, challenging the conventional reliance on human coding as the definitive
standard for cluster validation.
[COMMENTS]
Main text: 18 pages, 6 figures. Supplementary: 21 pages, 15 figures,
3 tables
[LINK]
http://arxiv.org/abs/2405.07278v2
[DATE]
2024-12-14 09:04:10+08:00
[CATEGORIES]
cs.CL
cs.LG
DUET: Dual Clustering Enhanced Multivariate Time Series Forecasting
[AUTHORS]
Xiangfei Qiu, Xingjian Wu, Yan Lin, Chenjuan Guo, Jilin Hu, Bin Yang
[ABSTRACT]
Multivariate time series forecasting is crucial for various applications,
such as financial investment, energy management, weather forecasting, and
traffic optimization. However, accurate forecasting is challenging due to two
main factors. First, real-world time series often show heterogeneous temporal
patterns caused by distribution shifts over time. Second, correlations among
channels are complex and intertwined, making it hard to model the interactions
among channels precisely and flexibly.
In this study, we address these challenges by proposing a general framework
called DUET, which introduces DUal clustering on the
temporal and channel dimensions to Enhance multivariate
Time series forecasting. First, we design a Temporal Clustering
Module (TCM) that clusters time series into fine-grained distributions to
handle heterogeneous temporal patterns. For different distribution clusters, we
design various pattern extractors to capture their intrinsic temporal patterns,
thus modeling the heterogeneity. Second, we introduce a novel
Channel-Soft-Clustering strategy and design a Channel Clustering Module (CCM),
which captures the relationships among channels in the frequency domain through
metric learning and applies sparsification to mitigate the adverse effects of
noisy channels. Finally, DUET combines TCM and CCM to incorporate both the
temporal and channel dimensions. Extensive experiments on 25 real-world
datasets from 10 application domains demonstrate the state-of-the-art
performance of DUET.
[COMMENTS]
Accepted by KDD 2025
[LINK]
http://arxiv.org/abs/2412.10859v1
[DATE]
2024-12-14 23:15:17+08:00
[CATEGORIES]
cs.LG
RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices
[AUTHORS]
Wonkyo Choe, Yangfeng Ji, Felix Lin
[ABSTRACT]
To deploy LLMs on resource-constrained platforms such as mobile robotics and
wearables, non-transformer LLMs have achieved major breakthroughs. Recently, a
novel RNN-based LLM family, Receptance Weighted Key Value (RWKV) models, have
shown promising results in text generation on resource-constrained devices
thanks to their computational efficiency. However, these models remain too
large to be deployed on embedded devices due to their high parameter count. In
this paper, we propose an efficient suite of compression techniques, tailored
to the RWKV architecture. These techniques include low-rank approximation,
sparsity predictors, and clustering head, designed to align with the model
size. Our methods compress the RWKV models by 4.95–3.8x with only 2.95pp loss
in accuracy.
[LINK]
http://arxiv.org/abs/2412.10856v1
[DATE]
2024-12-14 23:11:07+08:00
[CATEGORIES]
cs.LG
Fast and Robust Visuomotor Riemannian Flow Matching Policy
[AUTHORS]
Haoran Ding, Noémie Jaquier, Jan Peters, Leonel Rozo
[ABSTRACT]
Diffusion-based visuomotor policies excel at learning complex robotic tasks
by effectively combining visual data with high-dimensional, multi-modal action
distributions. However, diffusion models often suffer from slow inference due
to costly denoising processes or require complex sequential training arising
from recent distilling approaches. This paper introduces Riemannian Flow
Matching Policy (RFMP), a model that inherits the easy training and fast
inference capabilities of flow matching (FM). Moreover, RFMP inherently
incorporates geometric constraints commonly found in realistic robotic
applications, as the robot state resides on a Riemannian manifold. To enhance
the robustness of RFMP, we propose Stable RFMP (SRFMP), which leverages
LaSalle’s invariance principle to equip the dynamics of FM with stability to
the support of a target Riemannian distribution. Rigorous evaluation on eight
simulated and real-world tasks shows that RFMP successfully learns and
synthesizes complex sensorimotor policies on Euclidean and Riemannian spaces
with efficient training and inference phases, outperforming Diffusion Policies
while remaining competitive with Consistency Policies.
[COMMENTS]
14 pages, 10 figures, 9 tables, project website:
https://sites.google.com/view/rfmp
[LINK]
http://arxiv.org/abs/2412.10855v1
[DATE]
2024-12-14 23:03:33+08:00
[CATEGORIES]
cs.LG
Community-Centric Graph Unlearning
[AUTHORS]
Yi Li, Shichao Zhang, Guixian Zhang, Debo Cheng
[ABSTRACT]
Graph unlearning technology has become increasingly important since the
advent of the ‘right to be forgotten’ and the growing concerns about the
privacy and security of artificial intelligence. Graph unlearning aims to
quickly eliminate the effects of specific data on graph neural networks (GNNs).
However, most existing deterministic graph unlearning frameworks follow a
balanced partition-submodel training-aggregation paradigm, resulting in a lack
of structural information between subgraph neighborhoods and redundant
unlearning parameter calculations. To address this issue, we propose a novel
Graph Structure Mapping Unlearning paradigm (GSMU) and a novel method based on
it named Community-centric Graph Eraser (CGE). CGE maps community subgraphs to
nodes, thereby enabling the reconstruction of a node-level unlearning operation
within a reduced mapped graph. CGE exponentially reduces both the
amount of training data and the number of unlearning parameters. Extensive
experiments conducted on five real-world datasets and three widely used GNN
backbones have verified the high performance and efficiency of our CGE method,
highlighting its potential in the field of graph unlearning.
[LINK]
http://arxiv.org/abs/2408.09705v2
[DATE]
2024-12-14 23:01:30+08:00
[CATEGORIES]
cs.LG
Improving Graph Neural Networks via Adversarial Robustness Evaluation
[AUTHORS]
Yongyu Wang
[ABSTRACT]
Graph Neural Networks (GNNs) are currently one of the most powerful types of
neural network architectures. Their advantage lies in the ability to leverage
both the graph topology, which represents the relationships between samples,
and the features of the samples themselves. However, the given graph topology
often contains noisy edges, and GNNs are vulnerable to noise in the graph
structure. This issue remains unresolved. In this paper, we propose using
adversarial robustness evaluation to select a small subset of robust nodes that
are less affected by noise. We then only feed the features of these robust
nodes, along with the KNN graph constructed from these nodes, into the GNN for
classification. Additionally, we compute the centroids for each class. For the
remaining non-robust nodes, we assign them to the class whose centroid is
closest to them. Experimental results show that this method significantly
improves the accuracy of GNNs.
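
The last step described above, assigning non-robust nodes to the class with the closest centroid, can be sketched as follows. This is an illustration only: the abstract does not specify the distance metric or where the robust nodes' class labels come from, so Euclidean distance and the toy features and labels here are assumptions, and the function names are hypothetical.

```python
import math


def class_centroids(features, labels):
    """Mean feature vector per class, computed over the robust nodes."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def assign_to_nearest(x, centroids):
    """Return the class whose centroid is closest (Euclidean, assumed)."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))


# Hypothetical robust nodes with 2-D features and their class labels.
robust_features = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
robust_labels = [0, 0, 1, 1]
centroids = class_centroids(robust_features, robust_labels)

# Non-robust nodes are labelled by their nearest class centroid.
label_a = assign_to_nearest([1.0, 1.0], centroids)
label_b = assign_to_nearest([9.0, 9.0], centroids)
```

Because the non-robust nodes never enter the graph fed to the GNN, their possibly noisy edges cannot affect message passing; only their features are used at assignment time.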
[LINK]
http://arxiv.org/abs/2412.10850v1
[DATE]
2024-12-14 22:47:20+08:00
[CATEGORIES]
cs.LG
A Diagrammatic Approach to Improve Computational Efficiency in Group Equivariant Neural Networks
[AUTHORS]
Edward Pearce-Crump, William J. Knottenbelt
[ABSTRACT]
Group equivariant neural networks are growing in importance owing to their
ability to generalise well in applications where the data has known underlying
symmetries. Recent characterisations of a class of these networks that use
high-order tensor power spaces as their layers suggest that they have
significant potential; however, their implementation remains challenging owing
to the prohibitively expensive nature of the computations that are involved. In
this work, we present a fast matrix multiplication algorithm for any
equivariant weight matrix that maps between tensor power layer spaces in these
networks for four groups: the symmetric, orthogonal, special orthogonal, and
symplectic groups. We obtain this algorithm by developing a diagrammatic
framework based on category theory that enables us to not only express each
weight matrix as a linear combination of diagrams but also makes it possible
for us to use these diagrams to factor the original computation into a series
of steps that are optimal. We show that this algorithm improves the Big-$O$
time complexity exponentially in comparison to a naïve matrix
multiplication.
[COMMENTS]
51 pages
[LINK]
http://arxiv.org/abs/2412.10837v1
[DATE]
2024-12-14 22:08:06+08:00
[CATEGORIES]
cs.LG
Variational f-divergence Minimization
[AUTHORS]
Mingtian Zhang, Thomas Bird, Raza Habib, Tianlin Xu, David Barber
[ABSTRACT]
Probabilistic models are often trained by maximum likelihood, which
corresponds to minimizing a specific f-divergence between the model and data
distribution. In light of recent successes in training Generative Adversarial
Networks, alternative non-likelihood training criteria have been proposed.
Whilst not necessarily statistically efficient, these alternatives may better
match user requirements such as sharp image generation. A general variational
method for training probabilistic latent variable models using maximum
likelihood is well established; however, how to train latent variable models
using other f-divergences is comparatively unknown. We discuss a variational
approach that, when combined with the recently introduced Spread Divergence,
can be applied to train a large class of latent variable models using any
f-divergence.
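
For reference, the link between maximum likelihood and a particular f-divergence noted at the start of this abstract can be written out as follows. This uses the standard textbook definition; the symbols $p$, $q$, $p_{\text{data}}$, and $p_\theta$ are notation chosen here and may differ from the paper's.

```latex
% f-divergence for a convex function f with f(1) = 0
D_f(p \,\|\, q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx

% Choosing f(t) = t \log t recovers the Kullback-Leibler divergence
D_f(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx = \mathrm{KL}(p \,\|\, q)

% Maximum likelihood minimizes this divergence from the data distribution
\arg\max_\theta \, \mathbb{E}_{p_{\text{data}}}\!\left[\log p_\theta(x)\right]
  = \arg\min_\theta \, \mathrm{KL}\!\left(p_{\text{data}} \,\|\, p_\theta\right)
```

The last line holds because the entropy of $p_{\text{data}}$ does not depend on $\theta$. Other choices of $f$ give other training criteria, which is the setting the abstract addresses.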
[LINK]
http://arxiv.org/abs/1907.11891v2
[DATE]
2024-12-14 21:56:40+08:00
[CATEGORIES]
cs.LG
Uncovering Temporal Patterns in Visualizations of High-Dimensional Data
[AUTHORS]
Pavlin G. Poličar, Blaž Zupan
[ABSTRACT]
With the increasing availability of high-dimensional data, analysts often
rely on exploratory data analysis to understand complex data sets. A key
approach to exploring such data is dimensionality reduction, which embeds
high-dimensional data in two dimensions to enable visual exploration. However,
popular embedding techniques, such as t-SNE and UMAP, typically assume that
data points are independent. When this assumption is violated, as in
time-series data, the resulting visualizations may fail to reveal important
temporal patterns and trends. To address this, we propose a formal extension to
existing dimensionality reduction methods that incorporates two temporal loss
terms that explicitly highlight temporal progression in the embedded
visualizations. Through a series of experiments on both synthetic and
real-world datasets, we demonstrate that our approach effectively uncovers
temporal patterns and improves the interpretability of the visualizations.
Furthermore, the method improves temporal coherence while preserving the
fidelity of the embeddings, providing a robust tool for dynamic data analysis.
[LINK]
http://arxiv.org/abs/2403.19040v2
[DATE]
2024-12-14 21:35:23+08:00
[CATEGORIES]
cs.LG
Kernel Stochastic Configuration Networks for Nonlinear Regression
[AUTHORS]
Yongxuan Chen, Dianhui Wang
[ABSTRACT]
Stochastic configuration networks (SCNs), a class of randomized learner
models, are characterized by assigning random parameters under a supervisory
mechanism, which yields the universal approximation property at the
algorithmic level. This paper presents a kernel version of SCNs, termed KSCNs,
aiming to enhance the model’s representation learning capability and performance
stability. The random bases of a built SCN model can be used to span a
reproducing kernel Hilbert space (RKHS), followed by our proposed algorithm for
constructing KSCNs. It is shown that the data distribution in the
reconstructive space is favorable for regression solving and the proposed KSCN
learner models hold the universal approximation property. Three benchmark
datasets including two industrial datasets are used in this study for
performance evaluation. Experimental results with comparisons against existing
solutions clearly demonstrate that the proposed KSCN remarkably outperforms the
original SCNs and some typical kernel methods for resolving nonlinear
regression problems in terms of the learning performance, the model’s stability
and robustness with respect to the kernel parameter settings.
[COMMENTS]
13 pages, 20 figures
[LINK]
http://arxiv.org/abs/2412.05846v2
[DATE]
2024-12-14 21:32:58+08:00
[CATEGORIES]
cs.LG
ASTD Patterns for Integrated Continuous Anomaly Detection In Data Logs
[AUTHORS]
Chaymae El Jabri, Marc Frappier, Pierre-Martin Tardif
[ABSTRACT]
This paper investigates the use of the ASTD language for ensemble anomaly
detection in data logs. It uses a sliding window technique for continuous
learning in data streams, coupled with updating learning models upon the
completion of each window to maintain accurate detection and align with current
data trends. It proposes ASTD patterns for combining learning models,
especially in the context of unsupervised learning, which is commonly used for
data streams. To facilitate this, a new ASTD operator is proposed, the
Quantified Flow, which enables the seamless combination of learning models
while ensuring that the specification remains concise. Our contribution is a
specification pattern, highlighting the capacity of ASTDs to abstract and
modularize anomaly detection systems. The ASTD language provides a unique
approach to develop data flow anomaly detection systems, grounded in the
combination of processes through the graphical representation of the language
operators. This simplifies the design task for developers, who can focus
primarily on defining the functional operations that constitute the system.
[LINK]
http://arxiv.org/abs/2411.07272v3
[DATE]
2024-12-14 21:07:00+08:00
[CATEGORIES]
cs.LG
Graph Attention Hamiltonian Neural Networks: A Lattice System Analysis Model Based on Structural Learning
[AUTHORS]
Ru Geng, Yixian Gao, Jian Zu, Hong-Kun Zhang
[ABSTRACT]
A deep understanding of the intricate interactions between particles within a
system is a key approach to revealing the essential characteristics of the
system, whether it is an in-depth analysis of molecular properties in the field
of chemistry or the design of new materials for specific performance
requirements in materials science. To this end, we propose Graph Attention
Hamiltonian Neural Network (GAHN), a neural network method that can understand
the underlying structure of lattice Hamiltonian systems solely through the
dynamic trajectories of particles. We can determine which particles in the
system interact with each other, the proportion of interactions between
different particles, and whether the potential energy of interactions between
particles exhibits even symmetry or not. The obtained structure helps the
neural network model to continue predicting the trajectory of the system and
further understand the dynamic properties of the system. In addition to
understanding the underlying structure of the system, it can be used for
detecting lattice structural abnormalities, such as link defects, abnormal
interactions, etc. These insights benefit system optimization, design, and
detection of aging or damage. Moreover, this approach can integrate other
components to deduce the link structure needed for specific parts, showcasing
its scalability and potential. We tested it on a challenging molecular dynamics
dataset, and the results proved its ability to accurately infer molecular bond
connectivity, highlighting its scientific research potential.
[COMMENTS]
17 pages, 7 figures
[LINK]
http://arxiv.org/abs/2412.10821v1
[DATE]
2024-12-14 21:03:15+08:00
[CATEGORIES]
cs.LG
Diffusion-based Method for Satellite Pattern-of-Life Identification
[AUTHORS]
Yongchao Ye, Xinting Zhu, Xuejin Shen, Xiaoyu Chen, Lishuai Li, S. Joe Qin
[ABSTRACT]
Satellite pattern-of-life (PoL) identification is crucial for space safety
and satellite monitoring, involving the analysis of typical satellite behaviors
such as station-keeping, drift, etc. However, existing PoL identification
methods remain underdeveloped due to the complexity of aerospace systems,
variability in satellite behaviors, and fluctuating observation sampling rates.
In a first attempt, we developed a domain expertise-informed machine learning
method (Expert-ML) to combine satellite orbital movement knowledge and machine
learning models. The Expert-ML method achieved high accuracy results in
simulation data and real-world data with a normal sampling rate. However, this
approach lacks generality, as it requires domain expertise and its
performance degrades significantly when the data sampling rate varies. To achieve
generality, we propose a novel diffusion-based PoL identification method.
Distinct from prior approaches, the proposed method leverages a diffusion model
to achieve end-to-end identification without manual refinement or
domain-specific knowledge. Specifically, we employ a multivariate time-series
encoder to capture hidden representations of satellite positional data. The
encoded features are subsequently incorporated as conditional information in
the denoising process to generate PoL labels. Through experimentation across
real-world satellite settings, our proposed diffusion-based method demonstrates
its high identification quality and provides a robust solution even with
reduced data sampling rates, indicating its great potential in practical
satellite behavior pattern identification, tracking and related mission
deployment.
[LINK]
http://arxiv.org/abs/2412.10814v1
[DATE]
2024-12-14 20:39:19+08:00
[CATEGORIES]
cs.LG
AuctionNet: A Novel Benchmark for Decision-Making in Large-Scale Games
[AUTHORS]
Kefan Su, Yusen Huo, Zhilin Zhang, Shuai Dou, Chuan Yu, Jian Xu, Zongqing Lu, Bo Zheng
[ABSTRACT]
Decision-making in large-scale games is an essential research area in
artificial intelligence (AI) with significant real-world impact. However, the
limited access to realistic large-scale game environments has hindered research
progress in this area. In this paper, we present AuctionNet, a
benchmark for bid decision-making in large-scale ad auctions derived from a
real-world online advertising platform. AuctionNet is composed of three parts:
an ad auction environment, a pre-generated dataset based on the environment,
and performance evaluations of several baseline bid decision-making algorithms.
More specifically, the environment effectively replicates the integrity and
complexity of real-world ad auctions through the interaction of several
modules: the ad opportunity generation module employs deep generative models to
bridge the gap between simulated and real-world data while mitigating the risk
of sensitive data exposure; the bidding module implements diverse auto-bidding
agents trained with different decision-making algorithms; and the auction
module is anchored in the classic Generalized Second Price (GSP) auction but
also allows for customization of auction mechanisms as needed. To facilitate
research and provide insights into the game environment, we have also
pre-generated a substantial dataset based on the environment. The dataset
contains trajectories involving 48 diverse agents competing with each other,
totaling over 500 million records and occupying 80GB of storage. Performance
evaluations of baseline algorithms such as linear programming, reinforcement
learning, and generative models for bid decision-making are also presented as
part of AuctionNet. We note that AuctionNet is applicable not only to research
on bid decision-making algorithms in ad auctions but also to the general area
of decision-making in large-scale games.
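
The Generalized Second Price rule that the auction module is anchored in can be sketched in a few lines. This is a minimal illustration of the textbook GSP mechanism, not AuctionNet's implementation: the function name `gsp_auction` is hypothetical, and reserve prices, quality scores, and the customizable mechanisms mentioned above are deliberately left out.

```python
from typing import List, Tuple


def gsp_auction(bids: List[float], num_slots: int) -> List[Tuple[int, float]]:
    """Return (bidder_index, price) for each ad slot, best slot first.

    Bidders are ranked by bid. The winner of each slot pays the bid of the
    bidder ranked immediately below, or 0.0 if no lower bid exists.
    Assumption: no reserve price and no quality weighting.
    """
    ranked = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    results = []
    for slot in range(min(num_slots, len(ranked))):
        winner = ranked[slot]
        price = bids[ranked[slot + 1]] if slot + 1 < len(ranked) else 0.0
        results.append((winner, price))
    return results


# Four bidders compete for two slots.
allocation = gsp_auction([3.0, 5.0, 1.0, 4.0], num_slots=2)
print(allocation)  # [(1, 4.0), (3, 3.0)]
```

Bidder 1 wins the top slot and pays the second-highest bid (4.0); bidder 3 wins the second slot and pays the third-highest bid (3.0).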
[LINK]
http://arxiv.org/abs/2412.10798v1
[DATE]
2024-12-14 19:31:21+08:00
[CATEGORIES]
cs.LG
Exploring Critical Testing Scenarios for Decision-Making Policies: An LLM Approach
[AUTHORS]
Weichao Xu, Huaxin Pei, Jingxuan Yang, Yuchen Shi, Yi Zhang, Qianchuan Zhao
[ABSTRACT]
Recent advances in decision-making policies have led to significant progress
in fields such as autonomous driving and robotics. However, testing these
policies remains crucial with the existence of critical scenarios that may
threaten their reliability. Despite ongoing research, challenges such as low
testing efficiency and limited diversity persist due to the complexity of the
decision-making policies and their environments. To address these challenges,
this paper proposes an adaptable Large Language Model (LLM)-driven online
testing framework to explore critical and diverse testing scenarios for
decision-making policies. Specifically, we design a “generate-test-feedback”
pipeline with templated prompt engineering to harness the world knowledge and
reasoning abilities of LLMs. Additionally, a multi-scale scenario generation
strategy is proposed to address the limitations of LLMs in making fine-grained
adjustments, further enhancing testing efficiency. Finally, the proposed
LLM-driven method is evaluated on five widely recognized benchmarks, and the
experimental results demonstrate that our method significantly outperforms
baseline methods in uncovering both critical and diverse scenarios. These
findings suggest that LLM-driven methods hold significant promise for advancing
the testing of decision-making policies.
[COMMENTS]
16 pages, 13 figures
[LINK]
http://arxiv.org/abs/2412.06684v2
[DATE]
2024-12-14 19:06:37+08:00
[CATEGORIES]
cs.LG
Audio-based Anomaly Detection in Industrial Machines Using Deep One-Class Support Vector Data Description
[AUTHORS]
Sertac Kilickaya, Mete Ahishali, Cansu Celebioglu, Fahad Sohrab, Levent Eren, Turker Ince, Murat Askar, Moncef Gabbouj
[ABSTRACT]
The frequent breakdowns and malfunctions of industrial equipment have driven
increasing interest in utilizing cost-effective and easy-to-deploy sensors,
such as microphones, for effective condition monitoring of machinery.
Microphones offer a low-cost alternative to widely used condition monitoring
sensors, with high bandwidth and the capability to detect subtle anomalies
to which other sensors might be less sensitive. In this study, we investigate
malfunctioning industrial machines to evaluate and compare anomaly detection
performance across different machine types and fault conditions. Log-Mel
spectrograms of machinery sound are used as input, and the performance is
evaluated using the area under the curve (AUC) score for two different methods:
baseline dense autoencoder (AE) and one-class deep Support Vector Data
Description (deep SVDD) with different subspace dimensions. Our results over
the MIMII sound dataset demonstrate that the deep SVDD method with a subspace
dimension of 2 provides superior anomaly detection performance, achieving
average AUC scores of 0.84, 0.80, and 0.69 for 6 dB, 0 dB, and -6 dB
signal-to-noise ratios (SNRs), respectively, compared to 0.82, 0.72, and 0.64
for the baseline model. Moreover, deep SVDD requires 7.4 times fewer trainable
parameters than the baseline dense AE, emphasizing its advantage in both
effectiveness and computational efficiency.
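
The scoring step of one-class deep SVDD can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' pipeline: the encoder that maps a Log-Mel spectrogram to a low-dimensional embedding is omitted, the 2-dimensional embeddings below are made-up placeholder values, and the helper names `center_of` and `svdd_score` are hypothetical. It only shows the usual deep SVDD idea of scoring a sample by its squared distance to a fixed center in the learned subspace.

```python
from typing import List


def center_of(embeddings: List[List[float]]) -> List[float]:
    """Mean of the embeddings of normal training samples (the hypersphere center)."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def svdd_score(embedding: List[float], center: List[float]) -> float:
    """Anomaly score: squared Euclidean distance to the center (larger = more anomalous)."""
    return sum((a - b) ** 2 for a, b in zip(embedding, center))


# Placeholder embeddings in a subspace of dimension 2 (illustrative values only).
normal_embeddings = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]]
center = center_of(normal_embeddings)

score_normal = svdd_score([0.1, 0.1], center)
score_anomalous = svdd_score([1.1, 0.1], center)
```

Thresholding the score, or sweeping the threshold to compute an AUC, then separates normal from anomalous machine sounds.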
[COMMENTS]
To be published in 2025 IEEE Symposium Series on Computational
Intelligence
[LINK]
http://arxiv.org/abs/2412.10792v1
[DATE]
2024-12-14 19:05:06+08:00
[CATEGORIES]
cs.LG
Scaling Up Graph Propagation Computation on Large Graphs: A Local Chebyshev Approximation Approach
[AUTHORS]
Yichun Yang, Rong-Hua Li, Meihao Liao, Longlong Lin, Guoren Wang
[ABSTRACT]
Graph propagation (GP) computation plays a crucial role in graph data
analysis, supporting various applications such as graph node similarity
queries, graph node ranking, graph clustering, and graph neural networks.
Existing methods, mainly relying on power iteration or push computation
frameworks, often face challenges with slow convergence rates when applied to
large-scale graphs. To address this issue, we propose a novel and powerful
approach that accelerates power iteration and push methods using Chebyshev
polynomials. Specifically, we first present a novel Chebyshev expansion formula
for general GP functions, offering a new perspective on GP computation and
achieving accelerated convergence. Building on these theoretical insights, we
develop a novel Chebyshev power iteration method (\ltwocheb) and a novel
Chebyshev push method (\chebpush). Our \ltwocheb method demonstrates an
approximate acceleration of $O(\sqrt{N})$ compared to existing power iteration
techniques for both personalized PageRank and heat kernel PageRank
computations, which are well-studied GP problems. For \chebpush, we propose an
innovative subset Chebyshev recurrence technique, enabling the design of a
push-style local algorithm with provable error guarantee and reduced time
complexity compared to existing push methods. We conduct extensive experiments
using 5 large real-world datasets to evaluate our proposed algorithms,
demonstrating their superior efficiency compared to state-of-the-art
approaches.
[COMMENTS]
15 pages
[LINK]
http://arxiv.org/abs/2412.10789v1
[DATE]
2024-12-14 18:56:02+08:00
[CATEGORIES]
cs.LG
ANaGRAM: A Natural Gradient Relative to Adapted Model for efficient PINNs learning
[AUTHORS]
Nilo Schwencke, Cyril Furtlehner
[ABSTRACT]
In recent years, Physics Informed Neural Networks (PINNs) have received
strong interest as a method to solve PDE-driven systems, in particular for data
assimilation purposes. This method is still in its infancy, with many
shortcomings and failures that are not yet properly understood. In this paper we
propose a natural gradient approach to PINNs that speeds up training and
improves its accuracy. Based on an in-depth analysis of the
differential geometric structures of the problem, we come up with two distinct
contributions: (i) a new natural gradient algorithm that scales as $\min(P^2S,
S^2P)$, where $P$ is the number of parameters, and $S$ the batch size; (ii) a
mathematically principled reformulation of the PINNs problem that allows the
extension of natural gradient to it, with proved connections to Green’s
function theory.
[COMMENTS]
submitted to ICLR 2025
[LINK]
http://arxiv.org/abs/2412.10782v1
[DATE]
2024-12-14 18:38:09+08:00
[CATEGORIES]
cs.LG
Continual Learning for Behavior-based Driver Identification
[AUTHORS]
Mattia Fanan, Davide Dalle Pezze, Emad Efatinasab, Ruggero Carli, Mirco Rampazzo, Gian Antonio Susto
[ABSTRACT]
Behavior-based Driver Identification is an emerging technology that
recognizes drivers based on their unique driving behaviors, offering important
applications such as vehicle theft prevention and personalized driving
experiences. However, most studies fail to account for the real-world
challenges of deploying Deep Learning models within vehicles. These challenges
include operating under limited computational resources, adapting to new
drivers, and changes in driving behavior over time. The objective of this study
is to evaluate if Continual Learning (CL) is well-suited to address these
challenges, as it enables models to retain previously learned knowledge while
continually adapting with minimal computational overhead and resource
requirements. We tested several CL techniques across three scenarios of
increasing complexity based on the well-known OCSLab dataset. This work
provides an important step forward in scalable driver identification solutions,
demonstrating that CL approaches, such as DER, can obtain strong performance,
with only an 11% reduction in accuracy compared to the static scenario.
Furthermore, to enhance the performance, we propose two new methods, SmooER and
SmooDER, that leverage the temporal continuity of driver identity over time to
enhance classification accuracy. Our novel method, SmooDER, achieves optimal
results with only a 2% reduction compared to the 11% of the DER approach. In
conclusion, this study proves the feasibility of CL approaches to address the
challenges of Driver Identification in dynamic environments, making them
suitable for deployment on cloud infrastructure or directly within vehicles.
[LINK]
http://arxiv.org/abs/2412.10780v1
[DATE]
2024-12-14 18:24:44+08:00
[CATEGORIES]
cs.LG
Communication-Efficient Federated Learning via Clipped Uniform Quantization
[AUTHORS]
Zavareh Bozorgasl, Hao Chen
[ABSTRACT]
This paper presents a novel approach to enhance communication efficiency in
federated learning through clipped uniform quantization. By leveraging optimal
clipping thresholds and client-specific adaptive quantization schemes, the
proposed method significantly reduces bandwidth and memory requirements for
model weight transmission between clients and the server while maintaining
competitive accuracy. We investigate the effects of symmetric clipping and
uniform quantization on model performance, emphasizing the role of stochastic
quantization in mitigating artifacts and improving robustness. Extensive
simulations demonstrate that the method achieves near-full-precision
performance with substantial communication savings. Moreover, the proposed
approach facilitates efficient weight averaging based on the inverse of the
mean squared quantization errors, effectively balancing the trade-off between
communication efficiency and model accuracy. Moreover, in contrast to federated
averaging, this design obviates the need to disclose client-specific data
volumes to the server, thereby enhancing client privacy. Comparative analysis
with conventional quantization methods further confirms the efficacy of the
proposed scheme.
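
The two ingredients named above, clipped uniform quantization with stochastic rounding and averaging weights based on the inverse mean squared quantization error, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the fixed clipping threshold, and the normalization of the inverse-MSE weights so that they sum to one are all assumptions made for the example.

```python
import random
from typing import List


def clipped_uniform_quantize(weights: List[float], clip: float, bits: int,
                             rng: random.Random) -> List[float]:
    """Clip to [-clip, clip], then stochastically round onto 2**bits uniform levels."""
    levels = 2 ** bits - 1          # number of intervals between grid points
    step = 2 * clip / levels        # spacing of the uniform grid
    quantized = []
    for w in weights:
        clipped = max(-clip, min(clip, w))
        pos = (clipped + clip) / step   # grid position, in [0, levels]
        low = int(pos)                  # floor, since pos is non-negative
        frac = pos - low
        # Round up with probability equal to the fractional part (unbiased rounding).
        idx = low + (1 if rng.random() < frac else 0)
        idx = min(idx, levels)          # guard against floating-point overshoot
        quantized.append(-clip + idx * step)
    return quantized


def inverse_mse_weights(mses: List[float]) -> List[float]:
    """Aggregation weights proportional to 1 / MSE, normalized to sum to one."""
    inverse = [1.0 / (m + 1e-12) for m in mses]
    total = sum(inverse)
    return [v / total for v in inverse]


rng = random.Random(0)
original = [-2.0, -0.3, 0.0, 0.7, 5.0]
quantized = clipped_uniform_quantize(original, clip=1.0, bits=4, rng=rng)
client_weights = inverse_mse_weights([0.01, 0.04])
```

A client with a smaller quantization error receives a larger aggregation weight, and the weights do not depend on how much data each client holds.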
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2405.13365v2
[DATE]
2024-12-14 17:43:24+08:00
[CATEGORIES]
cs.LG
Disentangled Dynamic Intrusion Detection
[AUTHORS]
Chenyang Qiu, Guoshun Nan, Hongrui Xia, Zheng Weng, Xueting Wang, Meng Shen, Xiaofeng Tao, Jun Liu
[ABSTRACT]
Network-based intrusion detection system (NIDS) monitors network traffic for
malicious activities, forming the frontline defense against increasing attacks
over information infrastructures. Although promising, our quantitative analysis
shows that existing methods perform inconsistently in declaring various
attacks, and perform poorly in few-shot intrusion detections. We reveal that
the underlying cause is entangled distributions of flow features. This
motivates us to propose DIDS-MFL, a disentangled intrusion detection method to
handle various intrusion detection scenarios. DIDS-MFL involves two key
components: a double Disentanglement-based Intrusion Detection
System (DIDS) and a plug-and-play Multi-scale Few-shot Learning-based (MFL)
intrusion detection module. Specifically, the proposed DIDS first disentangles
traffic features by a non-parameterized optimization, automatically
differentiating tens and hundreds of complex features of various attacks. Such
differentiated features will be further disentangled to highlight the
attack-specific features. Our DIDS additionally uses a novel graph diffusion
method that dynamically fuses the network topology in evolving data streams.
Furthermore, the proposed MFL involves an alternating optimization framework to
address the entangled representations in few-shot traffic threats with rigorous
derivation. MFL first captures multiscale information in latent space to
distinguish attack-specific information and then optimizes the disentanglement
term to highlight the attack-specific information. Finally, MFL fuses and
alternately solves them in an end-to-end way. Experiments show the superiority
of our proposed DIDS-MFL. Our code is available at
https://github.com/qcydm/DIDS-MFL
[COMMENTS]
V2(Existing Extension Version):A SUBMISSION TO IEEE TRANSACTION ON
PATTERN ANALYSIS AND MACHINE INTELLIGENCE(TPAMI) (Under Review) |||||
V1:Accepted and appeared in the proceedings of the KDD 2023 Research Track
(DOI:10.1145/3580305.3599238)
[LINK]
http://arxiv.org/abs/2307.11079v2
[DATE]
2024-12-14 17:12:39+08:00
[CATEGORIES]
cs.LG
Explainable Fuzzy Neural Network with Multi-Fidelity Reinforcement Learning for Micro-Architecture Design Space Exploration
[AUTHORS]
Hanwei Fan, Ya Wang, Sicheng Li, Tingyuan Liang, Wei Zhang
[ABSTRACT]
With the continuous advancement of processors, modern micro-architecture
designs have become increasingly complex. The vast design space presents
significant challenges for human designers, making design space exploration
(DSE) algorithms a significant tool for $\mu$-arch design. In recent years,
efforts have been made in the development of DSE algorithms, and promising
results have been achieved. However, the existing DSE algorithms, e.g.,
Bayesian Optimization and ensemble learning, suffer from poor interpretability,
hindering designers’ understanding of the decision-making process. To address
this limitation, we propose utilizing Fuzzy Neural Networks to induce and
summarize knowledge and insights from the DSE process, enhancing
interpretability and controllability. Furthermore, to improve efficiency, we
introduce a multi-fidelity reinforcement learning approach, which primarily
conducts exploration using cheap but less precise data, thereby substantially
diminishing the reliance on costly data. Experimental results show that our
method achieves excellent results with a very limited sample budget and
successfully surpasses the current state-of-the-art. Our DSE framework is
open-sourced and available at https://github.com/fanhanwei/FNN_MFRL_ArchDSE/.
[COMMENTS]
preprint version, published on DAC24
[LINK]
http://arxiv.org/abs/2412.10754v1
[DATE]
2024-12-14 16:54:07+08:00
[CATEGORIES]
cs.LG
ConvTimeNet: A Deep Hierarchical Fully Convolutional Model for Multivariate Time Series Analysis
[AUTHORS]
Mingyue Cheng, Jiqian Yang, Tingyue Pan, Qi Liu, Zhi Li
[ABSTRACT]
Designing effective models for learning time series representations is
foundational for time series analysis. Many previous works have explored time
series representation modeling approaches and have made progress in this area.
Despite their effectiveness, they lack adaptive perception of local patterns in
temporally dependent basic units and fail to capture the multi-scale dependency
among these units. Instead of relying on prevalent methods centered around
self-attention mechanisms, we propose ConvTimeNet, a hierarchical pure
convolutional model designed for time series analysis. ConvTimeNet introduces a
deformable patch layer that adaptively perceives local patterns of temporally
dependent basic units in a data-driven manner. Based on the extracted local
patterns, hierarchical pure convolutional blocks are designed to capture
dependency relationships among the representations of basic units at different
scales. Moreover, a large kernel mechanism is employed to ensure that
convolutional blocks can be deeply stacked, thereby achieving a larger
receptive field. In this way, local patterns and their multi-scale dependencies
can be effectively modeled within a single model. Extensive experiments
comparing a wide range of different types of models demonstrate that pure
convolutional models still exhibit strong viability, effectively addressing the
aforementioned two challenges and showing superior performance across multiple
tasks. The code is available for reproducibility.
[LINK]
http://arxiv.org/abs/2403.01493v2
[DATE]
2024-12-14 16:52:06+08:00
[CATEGORIES]
cs.LG
p-Mean Regret for Stochastic Bandits
[AUTHORS]
Anand Krishna, Philips George John, Adarsh Barik, Vincent Y. F. Tan
[ABSTRACT]
In this work, we extend the concept of the $p$-mean welfare objective from
social choice theory (Moulin 2004) to study $p$-mean regret in stochastic
multi-armed bandit problems. The $p$-mean regret, defined as the difference
between the optimal mean among the arms and the $p$-mean of the expected
rewards, offers a flexible framework for evaluating bandit algorithms, enabling
algorithm designers to balance fairness and efficiency by adjusting the
parameter $p$. Our framework encompasses both average cumulative regret and
Nash regret as special cases.
We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that
achieves novel $p$-mean regret bounds. Our algorithm consists of two phases: a
carefully calibrated uniform exploration phase to initialize sample means,
followed by the UCB1 algorithm of Auer, Cesa-Bianchi, and Fischer (2002). Under
mild assumptions, we prove that our algorithm achieves a $p$-mean regret bound
of $\tilde{O}\left(\sqrt{\frac{k}{T^{\frac{1}{2|p|}}}}\right)$ for all $p \leq
-1$, where $k$ represents the number of arms and $T$ the time horizon. When
$-1<p<0$, we achieve a regret bound of
$\tilde{O}\left(\sqrt{\frac{k^{1.5}}{T^{\frac{1}{2}}}}\right)$. For the range
$0< p \leq 1$, we achieve a $p$-mean regret scaling as
$\tilde{O}\left(\sqrt{\frac{k}{T}}\right)$, which matches the previously
established lower bound up to logarithmic factors (Auer et al. 1995). This
result stems from the fact that the $p$-mean regret of any algorithm is at
least its average cumulative regret for $p \leq 1$.
In the case of Nash regret (the limit as $p$ approaches zero), our unified
approach differs from prior work (Barman et al. 2023), which requires a new
Nash Confidence Bound algorithm. Notably, we achieve the same regret bound up
to constant factors using our more general method.
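
The $p$-mean itself is the standard generalized (power) mean, and the regret definition above can be illustrated numerically. The sketch below is an assumption-laden illustration, not the paper's code: the function names are hypothetical, and the list `expected_rewards` stands in for the expected rewards of the arms pulled over the rounds, which must be strictly positive for $p \leq 0$.

```python
import math
from typing import List


def p_mean(values: List[float], p: float) -> float:
    """Generalized mean of positive values; p = 0 is the geometric-mean limit."""
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def p_mean_regret(best_mean: float, expected_rewards: List[float], p: float) -> float:
    """Optimal arm mean minus the p-mean of the expected rewards."""
    return best_mean - p_mean(expected_rewards, p)


# Illustrative values: the best arm has mean 0.8; two rounds earn 0.2 and 0.8.
expected_rewards = [0.2, 0.8]
regret_average = p_mean_regret(0.8, expected_rewards, p=1)    # average regret
regret_nash = p_mean_regret(0.8, expected_rewards, p=0)       # Nash regret
regret_harmonic = p_mean_regret(0.8, expected_rewards, p=-1)  # p = -1
```

Because the generalized mean is non-decreasing in $p$, the regret grows as $p$ decreases, consistent with the statement that $p$-mean regret is at least the average regret for $p \leq 1$.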
[LINK]
http://arxiv.org/abs/2412.10751v1
[DATE]
2024-12-14 16:38:26+08:00
[CATEGORIES]
cs.LG
A Pioneering Neural Network Method for Efficient and Robust Fuel Sloshing Simulation in Aircraft
[AUTHORS]
Yu Chen, Shuai Zheng, Nianyi Wang, Menglong Jin, Yan Chang
[ABSTRACT]
Simulating fuel sloshing within aircraft tanks during flight is crucial for
aircraft safety research. Traditional methods based on Navier-Stokes equations
are computationally expensive. In this paper, we treat fluid motion as point
cloud transformation and propose the first neural network method specifically
designed for simulating fuel sloshing in aircraft. This model is also the first
deep learning model capable of stably modeling fluid
particle dynamics in such complex scenarios. Our triangle feature fusion design
achieves an optimal balance among fluid dynamics modeling, momentum
conservation constraints, and global stability control. Additionally, we
constructed the Fueltank dataset, the first dataset for aircraft fuel surface
sloshing. It comprises 320,000 frames across four typical tank types and covers
a wide range of flight maneuvers, including multi-directional rotations. We
conducted comprehensive experiments on both our dataset and the take-off
scenario of the aircraft. Compared to existing neural network-based fluid
simulation algorithms, we significantly enhanced accuracy while maintaining
high computational speed. Compared to traditional SPH methods, our speed
improved approximately 10 times. Furthermore, compared to traditional fluid
simulation software such as Flow3D, our computation speed increased by more
than 300 times.
[COMMENTS]
This paper has been accepted by AAAI Conference on Artificial
Intelligence (AAAI-25)
[LINK]
http://arxiv.org/abs/2412.10748v1
[DATE]
2024-12-14 16:31:56+08:00
[CATEGORIES]
cs.LG
RegMixMatch: Optimizing Mixup Utilization in Semi-Supervised Learning
[AUTHORS]
Haorong Han, Jidong Yuan, Chixuan Wei, Zhongyang Yu
[ABSTRACT]
Consistency regularization and pseudo-labeling have significantly advanced
semi-supervised learning (SSL). Prior works have effectively employed Mixup for
consistency regularization in SSL. However, our findings indicate that applying
Mixup for consistency regularization may degrade SSL performance by
compromising the purity of artificial labels. Moreover, most pseudo-labeling-based
methods use a thresholding strategy to exclude low-confidence data,
aiming to mitigate confirmation bias; however, this approach limits the utility
of unlabeled samples. To address these challenges, we propose RegMixMatch, a
novel framework that optimizes the use of Mixup with both high- and
low-confidence samples in SSL. First, we introduce semi-supervised RegMixup,
which effectively addresses the reduced purity of artificial labels by using both
mixed samples and clean samples for training. Second, we develop a class-aware
Mixup technique that integrates information from the top-2 predicted classes
into low-confidence samples and their artificial labels, reducing the
confirmation bias associated with these samples and enhancing their effective
utilization. Experimental results demonstrate that RegMixMatch achieves
state-of-the-art performance across various SSL benchmarks.
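
For readers unfamiliar with the building block, standard Mixup forms a convex combination of two samples and of their labels, with the mixing coefficient drawn from a Beta distribution. The sketch below shows only this plain Mixup step; it is not the semi-supervised RegMixup or the class-aware Mixup proposed in the paper, and the function name and the choice `alpha=0.4` are assumptions for illustration.

```python
import random
from typing import List, Tuple


def mixup(x1: List[float], y1: List[float], x2: List[float], y2: List[float],
          alpha: float, rng: random.Random) -> Tuple[List[float], List[float], float]:
    """Mix two samples and their (soft or one-hot) labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x_mixed = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y_mixed = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x_mixed, y_mixed, lam


rng = random.Random(0)
x_mixed, y_mixed, lam = mixup(
    x1=[1.0, 0.0, 2.0], y1=[1.0, 0.0],   # sample labeled as class 0
    x2=[0.0, 4.0, 2.0], y2=[0.0, 1.0],   # sample labeled as class 1
    alpha=0.4, rng=rng,
)
```

The mixed label is no longer one-hot, which is the sense in which Mixup can reduce the purity of artificial labels when those labels are themselves pseudo-labels.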
[LINK]
http://arxiv.org/abs/2412.10741v1
[DATE]
2024-12-14 16:22:49+08:00
[CATEGORIES]
cs.LG
Multistep Brent Oil Price Forecasting with a Multi-Aspect Meta-heuristic Optimization and Ensemble Deep Learning Model
[AUTHORS]
Mohammed Alruqimi, Luca Di Persio
[ABSTRACT]
Accurate crude oil price forecasting is crucial for various economic
activities, including energy trading, risk management, and investment planning.
Although deep learning models have emerged as powerful tools for crude oil
price forecasting, achieving accurate forecasts remains challenging. Deep
learning models’ performance is heavily influenced by hyperparameter tuning,
and they are expected to perform differently under various circumstances.
Furthermore, price volatility is also sensitive to external factors such as
world events. To address these limitations, we propose a hybrid approach that
integrates metaheuristic optimization with an ensemble of five widely used
neural network architectures for time series forecasting. Unlike existing
methods that apply metaheuristics to optimise hyperparameters within the neural
network architecture, we exploit the GWO metaheuristic optimiser at four
levels: feature selection, data preparation, model training, and forecast
blending. The proposed approach has been evaluated for forecasting three days
ahead using real-world Brent crude oil price data, and the obtained results
demonstrate that the proposed approach improves the forecasting performance
measured using various benchmarks, achieving an MSE of 0.000127.
[LINK]
http://arxiv.org/abs/2407.12062v2
[DATE]
2024-12-14 15:53:46+08:00
[CATEGORIES]
cs.LG
NoisyEQA: Benchmarking Embodied Question Answering Against Noisy Queries
[AUTHORS]
Tao Wu, Chuhao Zhou, Yen Heng Wong, Lin Gu, Jianfei Yang
[ABSTRACT]
The rapid advancement of Vision-Language Models (VLMs) has significantly
accelerated the development of Embodied Question Answering (EQA), enhancing
agents’ abilities in language understanding and reasoning within complex and
realistic scenarios. However, EQA in real-world scenarios remains challenging,
as human-posed questions often contain noise that can interfere with an agent’s
exploration and response, bringing challenges especially for language beginners
and non-expert users. To address this, we introduce a NoisyEQA benchmark
designed to evaluate an agent’s ability to recognize and correct noisy
questions. This benchmark introduces four common types of noise found in
real-world applications: Latent Hallucination Noise, Memory Noise, Perception
Noise, and Semantic Noise, generated through an automated dataset creation
framework. We also propose a ‘Self-Correction’ prompting
mechanism and a new evaluation metric to enhance and measure both noise
detection capability and answer quality. Our comprehensive evaluation reveals
that current EQA agents often struggle to detect noise in questions, leading to
responses that frequently contain erroneous information. Through our
Self-Correct Prompting mechanism, we can effectively improve the accuracy of
agent answers.
[LINK]
http://arxiv.org/abs/2412.10726v1
[DATE]
2024-12-14 15:52:24+08:00
[CATEGORIES]
cs.LG
Hyper-Compression: Model Compression via Hyperfunction
[AUTHORS]
Fenglei Fan, Juntong Fan, Dayang Wang, Jingbo Zhang, Zelin Dong, Shijun Zhang, Ge Wang, Tieyong Zeng
[ABSTRACT]
The rapid growth of large models’ size has far outpaced that of GPU memory.
To bridge this gap, inspired by the parsimonious relationship between genotype
and phenotype, we turn the model compression problem into the issue of
parameter representation to propose the so-called hyper-compression. The
hyper-compression uses a hyperfunction to represent the parameters of the
target network based on ergodic theory, which addresses the following
approximation problem: whether a low-dimensional dynamical system can
eventually fill the high-dimensional space. Empirically, the proposed hyper-compression enjoys the
following merits: 1) Preferable compression ratio; 2) No
post-hoc retraining; 3) Affordable inference time; and 4)
Short compression time. It compresses LLaMA2-7B in an hour and
achieves close-to-int4-quantization performance, without retraining and with a
performance drop of less than 1%. Our work can help reconcile
the scaling law with the stagnation of hardware upgrades by saving
both computation and data. We have open-sourced our code at
https://github.com/Juntongkuki/Hyper-Compression.git for readers to
download and evaluate freely.
[LINK]
http://arxiv.org/abs/2409.00592v2
[DATE]
2024-12-14 15:52:04+08:00
[CATEGORIES]
cs.LG
HEP-NAS: Towards Efficient Few-shot Neural Architecture Search via Hierarchical Edge Partitioning
[AUTHORS]
Jianfeng Li, Jiawen Zhang, Feng Wang, Lianbo Ma
[ABSTRACT]
One-shot methods have significantly advanced the field of neural architecture
search (NAS) by adopting a weight-sharing strategy to reduce search costs.
However, the accuracy of performance estimation can be compromised by
co-adaptation. Few-shot methods divide the entire supernet into individual
sub-supernets by splitting edge by edge to alleviate this issue, yet they neglect
relationships among edges, which degrades performance on huge search
spaces. In this paper, we introduce HEP-NAS, a hierarchy-wise partition
algorithm designed to further enhance accuracy. To begin with, HEP-NAS treats
edges sharing the same end node as a hierarchy, permuting and splitting edges
within the same hierarchy to directly search for the optimal operation
combination for each intermediate node. This approach aligns more closely with
the ultimate goal of NAS. Furthermore, HEP-NAS selects the most promising
sub-supernet after each segmentation, progressively narrowing the search space
in which the optimal architecture may exist. To improve performance evaluation
of sub-supernets, HEP-NAS employs search space mutual distillation, stabilizing
the training process and accelerating the convergence of each individual
sub-supernet. Within a given budget, HEP-NAS enables the splitting of all edges
and gradually searches for architectures with higher accuracy. Experimental
results across various datasets and search spaces demonstrate the superiority
of HEP-NAS compared to state-of-the-art methods.
[LINK]
http://arxiv.org/abs/2412.10723v1
[DATE]
2024-12-14 15:42:56+08:00
[CATEGORIES]
cs.LG
A Visual-inertial Localization Algorithm using Opportunistic Visual Beacons and Dead-Reckoning for GNSS-Denied Large-scale Applications
[AUTHORS]
Liqiang Zhang, Ye Tian, Dongyan Wei
[ABSTRACT]
With the development of smart cities, the demand for continuous pedestrian
navigation in large-scale urban environments has significantly increased. While
global navigation satellite systems (GNSS) provide low-cost and reliable
positioning services, they are often hindered in complex urban canyon
environments. Thus, exploring opportunistic signals for positioning in urban
areas has become a key solution. Augmented reality (AR) allows pedestrians to
acquire real-time visual information. Accordingly, we propose a low-cost
visual-inertial positioning solution. This method comprises a lightweight
multi-scale group convolution (MSGC)-based visual place recognition (VPR)
neural network, a pedestrian dead reckoning (PDR) algorithm, and a
visual/inertial fusion approach based on a Kalman filter with gross error
suppression. The VPR serves as a conditional observation to the Kalman filter,
effectively correcting the errors accumulated through the PDR method. This
enables the entire algorithm to ensure the reliability of long-term positioning
in GNSS-denied areas. Extensive experimental results demonstrate that our
method maintains stable positioning during large-scale movements. Compared to
the lightweight MobileNetV3-based VPR method, our proposed VPR solution
improves Recall@1 by at least 3% on two public datasets while reducing the
number of parameters by 63.37%. It also achieves performance that is
comparable to the VGG16-based method. The VPR-PDR algorithm improves
localization accuracy by more than 40% compared to the original PDR.
[LINK]
http://arxiv.org/abs/2411.19845v2
[DATE]
2024-12-14 15:36:15+08:00
[CATEGORIES]
cs.LG
Control of Overfitting with Physics
[AUTHORS]
Sergei V. Kozyrev, Ilya A Lopatin, Alexander N Pechen
[ABSTRACT]
While there are many works on the applications of machine learning, few of
them try to understand the theoretical justification for
their efficiency. In this work, overfitting control (or the generalization
property) in machine learning is explained using analogies from physics and
biology. For stochastic gradient Langevin dynamics, we show that the Eyring
formula of kinetic theory allows one to control overfitting in the algorithmic
stability approach, in which wide minima of the risk function with low free energy
correspond to low overfitting. For the generative adversarial network (GAN)
model, we establish an analogy between GAN and the predator-prey model in
biology. An application of this analogy allows us to explain the selection of
wide likelihood maxima and overfitting reduction for GANs.
[COMMENTS]
20 pages, 6 figures
[LINK]
http://arxiv.org/abs/2412.10716v1
[DATE]
2024-12-14 15:20:33+08:00
[CATEGORIES]
cs.LG
Automated Molecular Concept Generation and Labeling with Large Language Models
[AUTHORS]
Zimin Zhang, Qianli Wu, Botao Xia, Fang Sun, Ziniu Hu, Yizhou Sun, Shichang Zhang
[LINK]
http://arxiv.org/abs/2406.09612v2
[DATE]
2024-12-14 15:16:41+08:00
[CATEGORIES]
cs.LG
RAT: Adversarial Attacks on Deep Reinforcement Agents for Targeted Behaviors
[AUTHORS]
Fengshuo Bai, Runze Liu, Yali Du, Ying Wen, Yaodong Yang
[ABSTRACT]
Evaluating deep reinforcement learning (DRL) agents against targeted behavior
attacks is critical for assessing their robustness. These attacks aim to
manipulate the victim into specific behaviors that align with the attacker’s
objectives, often bypassing traditional reward-based defenses. Prior methods
have primarily focused on reducing cumulative rewards; however, rewards are
typically too generic to capture complex safety requirements effectively. As a
result, focusing solely on reward reduction can lead to suboptimal attack
strategies, particularly in safety-critical scenarios where more precise
behavior manipulation is needed. To address these challenges, we propose RAT, a
method designed for universal, targeted behavior attacks. RAT trains an
intention policy that is explicitly aligned with human preferences, serving as
a precise behavioral target for the adversary. Concurrently, an adversary
manipulates the victim’s policy to follow this target behavior. To enhance the
effectiveness of these attacks, RAT dynamically adjusts the state occupancy
measure within the replay buffer, allowing for more controlled and effective
behavior manipulation. Our empirical results on robotic simulation tasks
demonstrate that RAT outperforms existing adversarial attack algorithms in
inducing specific behaviors. Additionally, RAT shows promise in improving agent
robustness, leading to more resilient policies. We further validate RAT by
guiding Decision Transformer agents to adopt behaviors aligned with human
preferences in various MuJoCo tasks, demonstrating its effectiveness across
diverse tasks.
[COMMENTS]
Accepted by AAAI 2025
[LINK]
http://arxiv.org/abs/2412.10713v1
[DATE]
2024-12-14 14:56:11+08:00
[CATEGORIES]
cs.LG
Diffusion Models as Network Optimizers: Explorations and Analysis
[AUTHORS]
Ruihuai Liang, Bo Yang, Pengyu Chen, Xianjin Li, Yifan Xue, Zhiwen Yu, Xuelin Cao, Yan Zhang, Mérouane Debbah, H. Vincent Poor, Chau Yuen
[ABSTRACT]
Network optimization is a fundamental challenge in the Internet of Things
(IoT) network, often characterized by complex features that make it difficult
to solve these problems. Recently, generative diffusion models (GDMs) have
emerged as a promising new approach to network optimization, with the potential
to directly address these optimization problems. However, the application of
GDMs in this field is still in its early stages, and there is a noticeable lack
of theoretical research and empirical findings. In this study, we first explore
the intrinsic characteristics of generative models. Next, we provide a concise
theoretical proof and intuitive demonstration of the advantages of generative
models over discriminative models in network optimization. Based on this
exploration, we implement GDMs as optimizers aimed at learning high-quality
solution distributions for given inputs, sampling from these distributions
during inference to approximate or achieve optimal solutions. Specifically, we
utilize denoising diffusion probabilistic models (DDPMs) and employ a
classifier-free guidance mechanism to manage conditional guidance based on
input parameters. We conduct extensive experiments across three challenging
network optimization problems. By investigating various model configurations
and the principles of GDMs as optimizers, we demonstrate the ability to
overcome prediction errors and validate the convergence of generated solutions
to optimal solutions. We provide code and data at
https://github.com/qiyu3816/DiffSG.
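The classifier-free guidance step mentioned above can be sketched as follows. This is a minimal illustration of one common form of the technique, not the authors' implementation: the function name `guided_noise`, the list-based noise vectors, and the `scale` parameterization are assumptions, and the noise predictions would in practice come from a trained denoising network.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Blend unconditional and conditional noise predictions (hypothetical helper).

    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates further toward the condition.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    # Toy predictions standing in for the outputs of a denoising network.
    eps_u = [0.0, 1.0]
    eps_c = [1.0, 3.0]
    print(guided_noise(eps_u, eps_c, 2.0))
```

Other parameterizations of the guidance weight exist; the one above was chosen only because its limiting cases are easy to check.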
[LINK]
http://arxiv.org/abs/2411.00453v3
[DATE]
2024-12-14 14:41:52+08:00
[CATEGORIES]
cs.LG
BlockDoor: Blocking Backdoor Based Watermarks in Deep Neural Networks
[AUTHORS]
Yi Hao Puah, Anh Tu Ngo, Nandish Chattopadhyay, Anupam Chattopadhyay
[ABSTRACT]
The adoption of machine learning models across industries has turned Deep Neural
Networks (DNNs) into prized Intellectual Property (IP), which needs to be
protected from theft and unauthorized use. This need
gave rise to multiple watermarking schemes through which one can establish
the ownership of a model. Watermarking using backdooring is the most
well-established method available in the literature, with specific works
demonstrating the difficulty in removing the watermarks, embedded as backdoors
within the weights of the network. However, in our work, we have identified a
critical flaw in the design of the watermark verification with backdoors,
pertaining to the behaviour of the samples of the Trigger Set, which acts as
the secret key. In this paper, we present BlockDoor, which is a comprehensive
package of techniques that is used as a wrapper to block all three different
kinds of Trigger samples, which are used in the literature as means to embed
watermarks within the trained neural networks as backdoors. The framework
implemented through BlockDoor is able to detect potential Trigger samples,
through separate functions for adversarial noise based triggers,
out-of-distribution triggers and random label based triggers. Apart from a
simple Denial-of-Service for a potential Trigger sample, our approach is also
able to modify the Trigger samples for correct machine learning functionality.
Extensive evaluation of BlockDoor establishes that it is able to significantly
reduce the watermark validation accuracy of the Trigger set by up to 98%
without compromising functionality, with a drop of less than 1%
on the clean samples. BlockDoor has been tested on multiple datasets and neural
architectures.
[COMMENTS]
Accepted as a Work-in-Progress paper at SPACE 2024: Fourteenth
International Conference on Security, Privacy and Applied Cryptographic
Engineering
[LINK]
http://arxiv.org/abs/2412.12194v1
[DATE]
2024-12-14 14:38:01+08:00
[CATEGORIES]
cs.LG
Doubly-Bounded Queue for Constrained Online Learning: Keeping Pace with Dynamics of Both Loss and Constraint
[AUTHORS]
Juncheng Wang, Bingjie Yan, Yituo Liu
[ABSTRACT]
We consider online convex optimization with time-varying constraints and
conduct performance analysis using two stringent metrics: dynamic regret with
respect to the online solution benchmark, and hard constraint violation that
does not allow any compensated violation over time. We propose an efficient
algorithm called Constrained Online Learning with Doubly-bounded Queue (COLDQ),
which introduces a novel virtual queue that is both lower and upper bounded,
allowing tight control of the constraint violation without the need for the
Slater condition. We prove via a new Lyapunov drift analysis that COLDQ
achieves $O(T^{\frac{1+V_x}{2}})$ dynamic regret and $O(T^{V_g})$ hard constraint
violation, where $V_x$ and $V_g$ capture the dynamics of the loss and
constraint functions. For the first time, the two bounds smoothly approach
the best-known $O(T^{\frac{1}{2}})$ regret and $O(1)$ violation as the dynamics
of the losses and constraints diminish. For strongly convex loss functions,
COLDQ matches the best-known $O(\log{T})$ static regret while maintaining the
$O(T^{V_g})$ hard constraint violation. We further introduce an expert-tracking
variation of COLDQ, which achieves the same performance bounds without any
prior knowledge of the system dynamics. Simulation results demonstrate that
COLDQ outperforms the state-of-the-art approaches.
[COMMENTS]
To appear in AAAI 2025
[LINK]
http://arxiv.org/abs/2412.10703v1
[DATE]
2024-12-14 14:22:49+08:00
[CATEGORIES]
cs.LG
Cluster-Based Multi-Agent Task Scheduling for Space-Air-Ground Integrated Networks
[AUTHORS]
Zhiying Wang, Gang Sun, Yuhui Wang, Hongfang Yu, Dusit Niyato
[ABSTRACT]
The Space-Air-Ground Integrated Network (SAGIN) framework is a crucial
foundation for future networks, where satellites and aerial nodes assist in
computational task offloading. The low-altitude economy, leveraging the
flexibility and multifunctionality of Unmanned Aerial Vehicles (UAVs) in SAGIN,
holds significant potential for development in areas such as communication and
sensing. However, effective coordination is needed to streamline information
exchange and enable efficient system resource allocation. In this paper, we
propose a Clustering-based Multi-agent Deep Deterministic Policy Gradient
(CMADDPG) algorithm to address the multi-UAV cooperative task scheduling
challenges in SAGIN. The CMADDPG algorithm leverages dynamic UAV clustering to
partition UAVs into clusters, each managed by a Cluster Head (CH) UAV,
facilitating a distributed-centralized control approach. Within each cluster,
UAVs delegate offloading decisions to the CH UAV, reducing intra-cluster
communication costs and decision conflicts, thereby enhancing task scheduling
efficiency. Additionally, by employing a multi-agent reinforcement learning
framework, the algorithm leverages the extensive coverage of satellites to
achieve centralized training and distributed execution of multi-agent tasks,
while maximizing overall system profit through optimized task offloading
decision-making. Simulation results reveal that the CMADDPG algorithm
effectively optimizes resource allocation, minimizes queue delays, maintains
balanced load distribution, and surpasses existing methods by achieving at
least a 25% improvement in system profit, showcasing its robustness and
adaptability across diverse scenarios.
[LINK]
http://arxiv.org/abs/2412.10700v1
[DATE]
2024-12-14 14:17:33+08:00
[CATEGORIES]
cs.LG
Linked Adapters: Linking Past and Future to Present for Effective Continual Learning
[AUTHORS]
Dupati Srikar Chandra, P. K. Srijith, Dana Rezazadegan, Chris McCarthy
[ABSTRACT]
Continual learning allows the system to learn and adapt to new tasks while
retaining the knowledge acquired from previous tasks. However, deep learning
models suffer from catastrophic forgetting of knowledge learned from earlier
tasks while learning a new task. Moreover, retraining large models like
transformers from scratch for every new task is costly. An effective approach
to address continual learning is to use a large pre-trained model with
task-specific adapters to adapt to the new tasks. Though this approach can
mitigate catastrophic forgetting, it fails to transfer knowledge across tasks,
as each task learns its adapters separately. To address this, we propose a
novel approach, Linked Adapters, that allows knowledge transfer through a
weighted attention mechanism to other task-specific adapters. Linked adapters
use a multi-layer perceptron (MLP) to model the attention weights, which
overcomes the challenge of backward knowledge transfer in continual learning in
addition to modeling the forward knowledge transfer. During inference, our
proposed approach effectively leverages knowledge transfer through MLP-based
attention weights across all the lateral task adapters. Through numerous
experiments conducted on diverse image classification datasets, we
demonstrate improved performance on continual learning tasks
using Linked Adapters.
[COMMENTS]
13 Pages, 5 Figures
[LINK]
http://arxiv.org/abs/2412.10687v1
[DATE]
2024-12-14 13:25:17+08:00
[CATEGORIES]
cs.LG
Analysis of Corrected Graph Convolutions
[AUTHORS]
Robert Wang, Aseem Baranwal, Kimon Fountoulakis
[ABSTRACT]
Machine learning for node classification on graphs is a prominent area driven
by applications such as recommendation systems. State-of-the-art models often
use multiple graph convolutions on the data, as empirical evidence suggests
they can enhance performance. However, it has been shown empirically and
theoretically that too many graph convolutions can degrade performance
significantly, a phenomenon known as oversmoothing. In this paper, we provide a
rigorous theoretical analysis, based on the two-class contextual stochastic
block model (CSBM), of the performance of vanilla graph convolution from which
we remove the principal eigenvector to avoid oversmoothing. We perform a
spectral analysis for $k$ rounds of corrected graph convolutions, and we
provide results for partial and exact classification. For partial
classification, we show that each round of convolution can reduce the
misclassification error exponentially up to a saturation level, after which
performance does not worsen. We also extend this analysis to the multi-class
setting with features distributed according to a Gaussian mixture model. For
exact classification, we show that the separability threshold can be improved
exponentially up to $O({\log{n}}/{\log\log{n}})$ corrected convolutions.
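As a rough illustration of the correction described above, the sketch below removes the principal eigenvector from a symmetrically normalized adjacency matrix and then applies $k$ rounds of the corrected convolution. The normalization $D^{-1/2} A D^{-1/2}$, the helper names, and the toy graph are assumptions made for illustration; the paper's exact operator may differ. The sketch also assumes a connected graph with no isolated nodes, so that the top eigenvector is proportional to the square roots of the degrees.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} A D^{-1/2} for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def principal_eigenvector(adj):
    """Unit-norm top eigenvector of D^{-1/2} A D^{-1/2}: sqrt(degree), normalized."""
    deg = [sum(row) for row in adj]
    norm = math.sqrt(sum(deg))
    return [math.sqrt(d) / norm for d in deg]


def corrected_convolution(adj):
    """Normalized adjacency with its principal eigen-component subtracted."""
    a_hat = normalized_adjacency(adj)
    v = principal_eigenvector(adj)
    n = len(adj)
    return [[a_hat[i][j] - v[i] * v[j] for j in range(n)] for i in range(n)]


def apply_rounds(matrix, features, k):
    """Apply the matrix k times to one scalar feature per node."""
    x = list(features)
    for _ in range(k):
        x = [sum(m * xj for m, xj in zip(row, x)) for row in matrix]
    return x


if __name__ == "__main__":
    # Toy graph: a triangle (nodes 0, 1, 2) with a pendant node 3 attached to node 2.
    adj = [[0, 1, 1, 0],
           [1, 0, 1, 0],
           [1, 1, 0, 1],
           [0, 0, 1, 0]]
    conv = corrected_convolution(adj)
    print(apply_rounds(conv, [1.0, 0.9, 1.1, -1.0], k=2))
```

Because the principal component is subtracted, repeated application no longer collapses every feature vector onto the same degree-weighted direction, which is the oversmoothing effect the abstract refers to.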
[LINK]
http://arxiv.org/abs/2405.13987v2
[DATE]
2024-12-14 13:02:54+08:00
[CATEGORIES]
cs.LG
Stochastic $k$-Submodular Bandits with Full Bandit Feedback
[AUTHORS]
Guanyu Nie, Vaneet Aggarwal, Christopher John Quinn
[ABSTRACT]
In this paper, we present the first sublinear $\alpha$-regret bounds for
online $k$-submodular optimization problems with full-bandit feedback, where
$\alpha$ is a corresponding offline approximation ratio. Specifically, we
propose online algorithms for multiple $k$-submodular stochastic combinatorial
multi-armed bandit problems, including (i) monotone functions and individual
size constraints, (ii) monotone functions with matroid constraints, (iii)
non-monotone functions with matroid constraints, (iv) non-monotone functions
without constraints, and (v) monotone functions without constraints. We
transform approximation algorithms for offline $k$-submodular maximization
problems into online algorithms through the offline-to-online framework
proposed by Nie et al. (2023a). A key contribution of our work is analyzing the
robustness of the offline algorithms.
[COMMENTS]
26 pages, 1 figure
[LINK]
http://arxiv.org/abs/2412.10682v1
[DATE]
2024-12-14 13:02:53+08:00
[CATEGORIES]
cs.LG
Proposing and solving olympiad geometry with guided tree search
[AUTHORS]
Chi Zhang, Jiajun Song, Siyu Li, Yitao Liang, Yuxi Ma, Wei Wang, Yixin Zhu, Song-Chun Zhu
[ABSTRACT]
Mathematics olympiads are prestigious competitions, with problem proposing
and solving highly honored. Building artificial intelligence that proposes and
solves olympiads presents an unresolved challenge in automated theorem
discovery and proving, especially in geometry for its combination of numerical
and spatial elements. We introduce TongGeometry, a Euclidean geometry system
supporting tree-search-based guided problem proposing and solving. The
efficient geometry system establishes the most extensive repository of geometry
theorems to date: within the same computational budget as the existing
state-of-the-art, TongGeometry discovers 6.7 billion geometry theorems
requiring auxiliary constructions, including 4.1 billion exhibiting geometric
symmetry. Among them, 10 theorems were proposed to regional mathematical
olympiads with 3 of TongGeometry’s proposals selected in real competitions,
earning spots in a national team qualifying exam or a top civil olympiad in
China and the US. Guided by fine-tuned large language models, TongGeometry
solved all International Mathematical Olympiad geometry problems in IMO-AG-30,
outperforming gold medalists for the first time. It also surpasses the existing
state-of-the-art across a broader spectrum of olympiad-level problems. The full
capabilities of the system can be utilized on a consumer-grade machine, making
the model more accessible and fostering widespread democratization of its use.
By analogy, unlike existing systems that merely solve problems like students,
TongGeometry acts like a geometry coach, discovering, presenting, and proving
theorems.
[LINK]
http://arxiv.org/abs/2412.10673v1
[DATE]
2024-12-14 12:20:47+08:00
[CATEGORIES]
cs.LG
FairGP: A Scalable and Fair Graph Transformer Using Graph Partitioning
[AUTHORS]
Renqiang Luo, Huafei Huang, Ivan Lee, Chengpei Xu, Jianzhong Qi, Feng Xia
[ABSTRACT]
Recent studies have highlighted significant fairness issues in Graph
Transformer (GT) models, particularly against subgroups defined by sensitive
features. Additionally, GTs are computationally intensive and memory-demanding,
limiting their application to large-scale graphs. Our experiments demonstrate
that graph partitioning can enhance the fairness of GT models while reducing
computational complexity. To understand this improvement, we conducted a
theoretical investigation into the root causes of fairness issues in GT models.
We found that the sensitive features of higher-order nodes disproportionately
influence lower-order nodes, resulting in sensitive feature bias. We propose
Fairness-aware scalable GT based on Graph Partitioning (FairGP), which
partitions the graph to minimize the negative impact of higher-order nodes. By
optimizing attention mechanisms, FairGP mitigates the bias introduced by global
attention, thereby enhancing fairness. Extensive empirical evaluations on six
real-world datasets validate the superior performance of FairGP in achieving
fairness compared to state-of-the-art methods. The code is available at
https://github.com/LuoRenqiang/FairGP.
[COMMENTS]
11 pages, 2 figures, Accepted at AAAI 2025
[LINK]
http://arxiv.org/abs/2412.10669v1
[DATE]
2024-12-14 12:15:32+08:00
[CATEGORIES]
cs.LG
Pretrained Event Classification Model for High Energy Physics Analysis
[AUTHORS]
Joshua Ho, Benjamin Ryan Roberts, Shuo Han, Haichen Wang
[ABSTRACT]
We introduce a foundation model for event classification in high-energy
physics, built on a Graph Neural Network architecture and trained on 120
million simulated proton-proton collision events spanning 12 distinct physics
processes. The model is pretrained to learn a general and robust representation
of collision data using challenging multiclass and multilabel classification
tasks. Its performance is evaluated across five event classification tasks,
which include both physics processes used during pretraining and new processes
not encountered during pretraining. Fine-tuning the pretrained model
significantly improves classification performance, particularly in scenarios
with limited training data, demonstrating gains in both accuracy and
computational efficiency. To investigate the underlying mechanisms behind these
performance improvements, we employ a representational similarity evaluation
framework based on Centered Kernel Alignment. This analysis reveals notable
differences in the learned representations of fine-tuned pretrained models
compared to baseline models trained from scratch.
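For readers unfamiliar with Centered Kernel Alignment, the sketch below computes its linear-kernel variant between two sets of representations of the same inputs. Whether the paper uses the linear or a kernelized variant is not stated in the abstract, so the choice of linear CKA, the function names, and the toy matrices are all assumptions for illustration.

```python
import math


def center_columns(mat):
    """Subtract the column mean from every entry (rows are examples)."""
    n = len(mat)
    means = [sum(col) / n for col in zip(*mat)]
    return [[x - m for x, m in zip(row, means)] for row in mat]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


if __name__ == "__main__":
    # Toy representations: 4 examples, 2 features each (hypothetical values).
    rep_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
    rep_b = [[0.5, 1.0], [2.0, 0.0], [1.0, 3.0], [1.5, 1.0]]
    print(linear_cka(rep_a, rep_b))
```

The score lies between 0 and 1 and is unchanged by rescaling either representation, which makes it convenient for comparing layers of different models.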
[COMMENTS]
9 pages, 1 figure
[LINK]
http://arxiv.org/abs/2412.10665v1
[DATE]
2024-12-14 11:45:07+08:00
[CATEGORIES]
cs.LG
Structured Sampling for Robust Euclidean Distance Geometry
[AUTHORS]
Chandra Kundu, Abiy Tasissa, HanQin Cai
[ABSTRACT]
This paper addresses the problem of estimating the positions of points from
distance measurements corrupted by sparse outliers. Specifically, we consider a
setting with two types of nodes: anchor nodes, for which exact distances to
each other are known, and target nodes, for which complete but corrupted
distance measurements to the anchors are available. To tackle this problem, we
propose a novel algorithm powered by the Nyström method and robust principal
component analysis. Our method is computationally efficient as it processes
only a localized subset of the distance matrix and does not require distance
measurements between target nodes. Empirical evaluations on synthetic datasets,
designed to mimic sensor localization, and on molecular experiments,
demonstrate that our algorithm achieves accurate recovery with a modest number
of anchors, even in the presence of high levels of sparse outliers.
[LINK]
http://arxiv.org/abs/2412.10664v1
[DATE]
2024-12-14 11:42:18+08:00
[CATEGORIES]
cs.LG
ProtoS-ViT: Visual foundation models for sparse self-explainable classifications
[AUTHORS]
Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis
[COMMENTS]
Update publication to match paper presented at the Interpretable AI:
Past, Present and Future Workshop at NeurIPS 2024
[LINK]
http://arxiv.org/abs/2406.10025v2
[DATE]
2024-12-14 11:38:30+08:00
[CATEGORIES]
cs.LG
Memory-Efficient 4-bit Preconditioned Stochastic Optimization
[AUTHORS]
Jingyang Li, Kuangyu Ding, Kim-Chuan Toh, Pan Zhou
[ABSTRACT]
Preconditioned stochastic optimization algorithms, exemplified by Shampoo,
have demonstrated superior performance over first-order optimizers, providing
both theoretical advantages in convergence rates and practical improvements in
large-scale neural network training. However, they incur substantial memory
overhead due to the storage demands of non-diagonal preconditioning matrices.
To address this, we introduce 4-bit quantization for Shampoo’s preconditioners.
We introduce two key methods. First, we apply Cholesky decomposition followed
by quantization of the Cholesky factors, reducing memory usage by leveraging
their lower triangular structure while preserving symmetry and positive
definiteness to minimize information loss. To our knowledge, this is the first
quantization approach applied to Cholesky factors of preconditioners. Second,
we incorporate error feedback in the quantization process, efficiently storing
Cholesky factors and error states in the lower and upper triangular parts of
the same matrix. Through extensive experiments, we demonstrate that combining
Cholesky quantization with error feedback enhances memory efficiency and
algorithm performance in large-scale deep-learning tasks. Theoretically, we
also provide convergence proofs for quantized Shampoo under both smooth and
non-smooth stochastic optimization settings.
[LINK]
http://arxiv.org/abs/2412.10663v1
[DATE]
2024-12-14 11:32:54+08:00
[CATEGORIES]
cs.LG
DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing
[AUTHORS]
Utsab Saha, Tanvir Muntakim Tonoy, Hafiz Imtiaz
[ABSTRACT]
In recent years, the growth of data across various sectors, including
healthcare, security, finance, and education, has created significant
opportunities for analysis and informed decision-making. However, these
datasets often contain sensitive and personal information, which raises serious
privacy concerns. Protecting individual privacy is crucial, yet many existing
machine learning and data publishing algorithms struggle with high-dimensional
data, facing challenges related to computational efficiency and privacy
preservation. To address these challenges, we introduce an effective data
publishing algorithm, DP-CDA. Our proposed algorithm generates synthetic
datasets by randomly mixing data in a class-specific manner and inducing
carefully tuned randomness to ensure formal privacy guarantees. Our
comprehensive privacy accounting shows that DP-CDA provides a stronger privacy
guarantee compared to existing methods, allowing for better utility while
maintaining a strict level of privacy. To evaluate the effectiveness of DP-CDA,
we examine the accuracy of predictive models trained on the synthetic data,
which serves as a measure of dataset utility. Importantly, we identify an
optimal order of mixing that balances privacy guarantee with predictive
accuracy. Our results indicate that synthetic datasets produced using
DP-CDA can achieve superior utility compared to those generated by traditional
data publishing algorithms, even when subject to the same privacy requirements.
[COMMENTS]
Under review in Elsevier Array
[LINK]
http://arxiv.org/abs/2411.16121v2
[DATE]
2024-12-14 11:30:27+08:00
[CATEGORIES]
cs.LG
Online Optimization for Learning to Communicate over Time-Correlated Channels
[AUTHORS]
Zheshun Wu, Junfan Li, Zenglin Xu, Sumei Sun, Jie Liu
[ABSTRACT]
Machine learning techniques have garnered great interest in designing
communication systems owing to their capacity for tackling channel
uncertainty. To provide theoretical guarantees for learning-based communication
systems, some recent works analyze generalization bounds for devised methods
based on the assumption of Independently and Identically Distributed (I.I.D.)
channels, a condition rarely met in practical scenarios. In this paper, we drop
the I.I.D. channel assumption and study an online optimization problem of
learning to communicate over time-correlated channels. To address this issue,
we further focus on two specific tasks: optimizing channel decoders for
time-correlated fading channels and selecting optimal codebooks for
time-correlated additive noise channels. To exploit the temporal dependence of
the considered channels and better learn communication systems, we develop two
online optimization algorithms based on the optimistic online mirror descent
framework. Furthermore, we provide theoretical guarantees for the proposed
algorithms by deriving sub-linear regret bounds on the expected error
probability of learned systems. Extensive simulation experiments have been
conducted to validate that our presented approaches can leverage the channel
correlation to achieve a lower average symbol error rate compared to baseline
methods, consistent with our theoretical findings.
[COMMENTS]
16 pages, 4 figures, submitted for possible journal publication
[LINK]
http://arxiv.org/abs/2409.00575v2
[DATE]
2024-12-14 11:14:47+08:00
[CATEGORIES]
cs.LG
MEATRD: Multimodal Anomalous Tissue Region Detection Enhanced with Spatial Transcriptomics
[AUTHORS]
Kaichen Xu, Qilong Wu, Yan Lu, Yinan Zheng, Wenlin Li, Xingjie Tang, Jun Wang, Xiaobo Sun
[ABSTRACT]
The detection of anomalous tissue regions (ATRs) within affected tissues is
crucial in clinical diagnosis and pathological studies. Conventional automated
ATR detection methods, primarily based on histology images alone, falter in
cases where ATRs and normal tissues have subtle visual differences. The recent
spatial transcriptomics (ST) technology profiles gene expressions across tissue
regions, offering a molecular perspective for detecting ATRs. However, there is
a dearth of ATR detection methods that effectively harness complementary
information from both histology images and ST. To address this gap, we propose
MEATRD, a novel ATR detection method that integrates histology image and ST
data. MEATRD is trained to reconstruct image patches and gene expression
profiles of normal tissue spots (inliers) from their multimodal embeddings,
followed by learning a one-class classification anomaly detection (AD) model based on latent
multimodal reconstruction errors. This strategy harmonizes the strengths of
reconstruction-based and one-class classification approaches. At the heart of
MEATRD is an innovative masked graph dual-attention transformer (MGDAT)
network, which not only facilitates cross-modality and cross-node information
sharing but also addresses the model over-generalization issue commonly seen in
reconstruction-based AD methods. Additionally, we demonstrate that
modality-specific, task-relevant information is collated and condensed in
multimodal bottleneck encoding generated in MGDAT, marking the first
theoretical analysis of the informational properties of multimodal bottleneck
encoding. Extensive evaluations across eight real ST datasets reveal MEATRD’s
superior performance in ATR detection, surpassing various state-of-the-art AD
methods. Remarkably, MEATRD also proves adept at discerning ATRs that only show
slight visual deviations from normal tissues.
[COMMENTS]
AAAI 2025. Code: https://github.com/wqlzuel/MEATRD
[LINK]
http://arxiv.org/abs/2412.10659v1
[DATE]
2024-12-14 11:09:04+08:00
[CATEGORIES]
cs.LG
Global Estimation of Subsurface Eddy Kinetic Energy of Mesoscale Eddies Using a Multiple-input Residual Neural Network
[AUTHORS]
Chenyue Xie, An-Kang Gao, Xiyun Lu
[ABSTRACT]
Oceanic eddy kinetic energy (EKE) is a key quantity for measuring the
intensity of mesoscale eddies and for parameterizing eddy effects in ocean
climate models. Three decades of satellite altimetry observations allow a
global assessment of sea surface information. However, the subsurface EKE with
spatial filter has not been systematically studied due to the sparseness of
subsurface observational data. The subsurface EKE can be inferred both
theoretically and numerically from sea surface observations but is limited by
the issue of decreasing correlation with sea surface variables as depth
increases. In this work, inspired by the Taylor-series expansion of subsurface
EKE, a multiple-input neural network approach is proposed to reconstruct the
subsurface monthly mean EKE from sea surface variables and subsurface
climatological variables (e.g., horizontal filtered velocity gradients). Four
neural networks are trained on a high-resolution global ocean reanalysis
dataset, namely, surface-input fully connected neural network model (FCNN),
surface-input Residual neural network model (ResNet), multiple-input fully
connected neural network model (MI-FCNN), and multiple-input residual neural
network model (MI-ResNet). The proposed MI-FCNN and MI-ResNet models integrate
the surface input variables and the vertical profiles of subsurface variables.
The MI-ResNet model outperforms the FCNN, ResNet, and MI-FCNN models, and
traditional physics-based models in both regional and global reconstruction of
subsurface EKE in the upper 2000 m. In addition, the MI-ResNet model performs
well for both regional and global observational data based on transfer
learning. These findings reveal the potential of the MI-ResNet model for
efficient and accurate reconstruction of subsurface oceanic variables.
[LINK]
http://arxiv.org/abs/2412.10656v1
[DATE]
2024-12-14 10:54:07+08:00
[CATEGORIES]
cs.LG
Centaur: Bridging the Impossible Trinity of Privacy, Efficiency, and Performance in Privacy-Preserving Transformer Inference
[AUTHORS]
Jinglong Luo, Guanzhong Chen, Yehong Zhang, Shiyu Liu, Hui Wang, Yue Yu, Xun Zhou, Yuan Qi, Zenglin Xu
[ABSTRACT]
As pre-trained models, like Transformers, are increasingly deployed on cloud
platforms for inference services, the privacy concerns surrounding model
parameters and inference data are becoming more acute. Current
Privacy-Preserving Transformer Inference (PPTI) frameworks struggle with the
“impossible trinity” of privacy, efficiency, and performance. For instance,
Secure Multi-Party Computation (SMPC)-based solutions offer strong privacy
guarantees but come with significant inference overhead and performance
trade-offs. On the other hand, PPTI frameworks that use random permutations
achieve inference efficiency close to that of plaintext and maintain accurate
results but require exposing some model parameters and intermediate results,
thereby risking substantial privacy breaches. Addressing this “impossible
trinity” with a single technique proves challenging. To overcome this
challenge, we propose Centaur, a novel hybrid PPTI framework. Unlike existing
methods, Centaur protects model parameters with random permutations and
inference data with SMPC, leveraging the structure of Transformer models. By
designing a series of efficient privacy-preserving algorithms, Centaur
leverages the strengths of both techniques to achieve a better balance between
privacy, efficiency, and performance in PPTI. We comprehensively evaluate the
effectiveness of Centaur on various types of Transformer models and datasets.
Experimental results demonstrate that the privacy protection capabilities
offered by Centaur can withstand various existing model inversion attack
methods. In terms of performance and efficiency, Centaur not only maintains the
same performance as plaintext inference but also improves inference speed by
5.0 to 30.4 times.
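One general property that makes random permutations usable for hiding parameters can be shown on a toy linear layer: if the hidden dimension of the activations and the rows of the weight matrix are permuted consistently, the product is unchanged. The sketch below demonstrates only this property. It does not reproduce Centaur's protocol or its SMPC component, and every name in it is hypothetical.

```python
import random


def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def permute_columns(x, perm):
    """Reorder the columns of x so that new column k is old column perm[k]."""
    return [[row[p] for p in perm] for row in x]


def permute_rows(w, perm):
    """Reorder the rows of w so that new row k is old row perm[k]."""
    return [w[p] for p in perm]


rng = random.Random(0)
x = [[rng.randint(-3, 3) for _ in range(4)] for _ in range(2)]  # activations, 2 x 4
w = [[rng.randint(-3, 3) for _ in range(3)] for _ in range(4)]  # weights, 4 x 3

perm = list(range(4))
rng.shuffle(perm)  # secret permutation of the hidden dimension

plain = matmul(x, w)
permuted = matmul(permute_columns(x, perm), permute_rows(w, perm))
```

Integer entries are used so that the two products can be compared exactly; with floating-point values the comparison would need a tolerance because the summation order changes.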
[LINK]
http://arxiv.org/abs/2412.10652v1
[DATE]
2024-12-14 10:50:30+08:00
[CATEGORIES]
cs.LG
Upstream flow geometries can be uniquely learnt from single-point turbulence signatures
[AUTHORS]
Mukesh Karunanethy, Raghunathan Rengaswamy, Mahesh V Panchagnula
[ABSTRACT]
We test the hypothesis that the microscopic temporal structure of near-field
turbulence downstream of a sudden contraction contains geometry-identifiable
information pertaining to the shape of the upstream obstruction. We measure a
set of spatially sparse velocity time-series data downstream of
differently-shaped orifices. We then train random forest multiclass classifier
models on a vector of invariants derived from this time-series. We test the
above hypothesis with 25 somewhat similar orifice shapes to push the model to
its extreme limits. Remarkably, the algorithm was able to identify the orifice
shape with 100% accuracy and 100% precision. This outcome is enabled by the
uniqueness in the downstream temporal evolution of turbulence structures in the
flow past orifices, combined with the random forests’ ability to learn subtle
yet discerning features in the turbulence microstructure. We are also able to
explain the underlying flow physics that enables such classification by listing
the invariant measures in the order of increasing information entropy. We show
that the temporal autocorrelation coefficients of the time-series are most
sensitive to orifice shape and are therefore informative. The ability to
identify changes in system geometry without the need for physical disassembly
offers tremendous potential for flow control and system identification.
Furthermore, the proposed approach could potentially have significant
applications in other unrelated fields as well, by deploying the core
methodology of training random forest classifiers on vectors of invariant
measures obtained from time-series data.
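The abstract singles out temporal autocorrelation coefficients as the most shape-sensitive invariants. A minimal sketch of computing such a feature vector from a single velocity time series follows; the biased sample estimator, the function name, and the toy signal are assumptions, and the classifier that would consume these features is not shown.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation coefficients r_0 .. r_max_lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("constant series has undefined autocorrelation")
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / variance
        for k in range(max_lag + 1)
    ]


# Toy alternating signal standing in for a measured velocity time series.
signal = [1.0, -1.0] * 4
features = autocorrelation(signal, max_lag=2)
```

For this alternating signal the lag-1 coefficient is strongly negative and the lag-2 coefficient strongly positive, which is the kind of temporal signature such a feature vector is meant to capture.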
[COMMENTS]
Manuscript: 10 pages, 4 figures; SI Appendix: 24 pages, 3 figures;
Submitted to PNAS
[LINK]
http://arxiv.org/abs/2412.10630v1
[DATE]
2024-12-14 08:55:34+08:00
[CATEGORIES]
cs.LG
Erasing Undesirable Concepts in Diffusion Models with Adversarial Preservation
[AUTHORS]
Anh Bui, Long Vuong, Khanh Doan, Trung Le, Paul Montague, Tamas Abraham, Dinh Phung
[ABSTRACT]
Diffusion models excel at generating visually striking content from text but
can inadvertently produce undesirable or harmful content when trained on
unfiltered internet data. A practical solution is to selectively remove
target concepts from the model, but this may impact the remaining concepts.
Prior approaches have tried to balance this by introducing a loss term to
preserve neutral content or a regularization term to minimize changes in the
model parameters, yet resolving this trade-off remains challenging. In this
work, we propose to identify and preserve the concepts most affected by parameter
changes, termed adversarial concepts. This approach ensures stable
erasure with minimal impact on the other concepts. We demonstrate the
effectiveness of our method using the Stable Diffusion model, showing that it
outperforms state-of-the-art erasure methods in eliminating unwanted content
while maintaining the integrity of other unrelated elements. Our code is
available at
https://github.com/tuananhbui89/Erasing-Adversarial-Preservation.
[COMMENTS]
Erasing Concepts, Generative Unlearning, NeurIPS 2024
[LINK]
http://arxiv.org/abs/2410.15618v3
[DATE]
2024-12-14 08:50:36+08:00
[CATEGORIES]
cs.LG
Empowering Distributed Training with Sparsity-driven Data Synchronization
[AUTHORS]
Zhuang Wang, Zhaozhuo Xu, Jingyi Xi, Yuke Wang, Anshumali Shrivastava, T. S. Eugene Ng
[ABSTRACT]
Distributed training is the de facto standard to scale up the training of
deep learning models with multiple GPUs. Its performance bottleneck lies in
communications for gradient synchronization. Although high tensor sparsity is
widely observed, the optimal communication scheme to fully leverage sparsity is
still missing. This paper aims to bridge this gap. We first analyze the
characteristics of sparse tensors in popular models to understand the
fundamentals of sparsity. We then systematically explore the design space of
communication schemes for sparse tensors and find the optimal ones. These
findings give a new understanding and inspire us to develop a holistic gradient
synchronization system called Zen for sparse tensors. We demonstrate that Zen
can achieve up to $5.09\times$ speedup in communication time and up to $2.48\times$
speedup in training throughput compared to the state-of-the-art methods.
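To make the premise concrete, the sketch below shows why gradient sparsity can reduce synchronization traffic: each worker sends only its non-zero entries as index/value pairs, and the receiver sums them into a dense gradient. This is a generic illustration under assumed names and toy data, not Zen's communication scheme.

```python
def to_sparse(dense, tol=0.0):
    """Encode a dense gradient as {index: value}, dropping entries with |v| <= tol."""
    return {i: v for i, v in enumerate(dense) if abs(v) > tol}


def aggregate(sparse_grads, length):
    """Sum sparse gradients from several workers into one dense gradient."""
    total = [0.0] * length
    for grad in sparse_grads:
        for i, v in grad.items():
            total[i] += v
    return total


# Hypothetical gradients from two workers; most entries are zero.
worker_grads = [
    [0.0, 0.5, 0.0, 0.0, -1.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 1.0, 0.0],
]
encoded = [to_sparse(g) for g in worker_grads]
summed = aggregate(encoded, length=6)

sent = sum(len(g) for g in encoded)             # non-zero entries transmitted
dense_cost = sum(len(g) for g in worker_grads)  # entries a dense exchange would send
```

Each transmitted entry carries an index as well as a value, so the saving in bytes is smaller than the ratio of `sent` to `dense_cost` suggests. Note also that the sum of sparse gradients can be denser than any single worker's gradient.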
[LINK]
http://arxiv.org/abs/2309.13254v2
[DATE]
2024-12-14 08:20:13+08:00
[CATEGORIES]
cs.LG
Ares: Approximate Representations via Efficient Sparsification – A Stateless Approach through Polynomial Homomorphism
[AUTHORS]
Dongfang Zhao
[ABSTRACT]
The increasing prevalence of high-dimensional data demands efficient and
scalable compression methods to support modern applications. However, existing
techniques like PCA and Autoencoders often rely on auxiliary metadata or
intricate architectures, limiting their practicality for streaming or infinite
datasets. In this paper, we introduce a stateless compression framework that
leverages polynomial representations to achieve compact, interpretable, and
scalable data reduction. By eliminating the need for auxiliary data, our method
supports direct algebraic operations in the compressed domain while minimizing
error growth during computations. Through extensive experiments on synthetic
and real-world datasets, we show that our approach achieves high compression
ratios without compromising reconstruction accuracy, all while maintaining
simplicity and scalability.
[LINK]
http://arxiv.org/abs/2412.10623v1
[DATE]
2024-12-14 08:05:43+08:00
[CATEGORIES]
cs.LG
WaveGNN: Modeling Irregular Multivariate Time Series for Accurate Predictions
[AUTHORS]
Arash Hajisafi, Maria Despoina Siampou, Bita Azarijoo, Cyrus Shahabi
[ABSTRACT]
Accurately modeling and analyzing time series data is crucial for downstream
applications across various fields, including healthcare, finance, astronomy,
and epidemiology. However, real-world time series often exhibit irregularities
such as misaligned timestamps, missing entries, and variable sampling rates,
complicating their analysis. Existing approaches often rely on imputation,
which can introduce biases. The few approaches that directly model irregularity
tend to focus exclusively on either capturing intra-series patterns or
inter-series relationships, missing the benefits of integrating both. To this
end, we present WaveGNN, a novel framework designed to directly (i.e., no
imputation) embed irregularly sampled multivariate time series data for
accurate predictions. WaveGNN utilizes a Transformer-based encoder to capture
intra-series patterns by directly encoding the temporal dynamics of each time
series. To capture inter-series relationships, WaveGNN uses a dynamic graph
neural network model, where each node represents a sensor, and the edges
capture the long- and short-term relationships between them. Our experimental
results on real-world healthcare datasets demonstrate that WaveGNN consistently
outperforms existing state-of-the-art methods, with an average relative
improvement of 14.7% in F1-score when compared to the second-best baseline in
cases with extreme sparsity. Our ablation studies reveal that both intra-series
and inter-series modeling significantly contribute to this notable improvement.
[LINK]
http://arxiv.org/abs/2412.10621v1
[DATE]
2024-12-14 08:03:44+08:00
[CATEGORIES]
cs.LG