Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads
[AUTHORS]
Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu
[ABSTRACT]
Large language models (LLMs) have shown remarkable advances in supporting
long-context comprehension and processing tasks. However, scaling the
generation inference of LLMs to such long contexts incurs significant
additional computation load, and demands a substantial GPU memory footprint to
maintain the key-value (KV) cache of transformer-based LLMs. Existing KV cache
compression methods, such as quantization, face memory bottlenecks as context
length increases, while static-sized caches, such as eviction, suffer from
inefficient policies. These limitations restrict deployment on consumer-grade
devices like a single Nvidia 4090 GPU. To overcome this, we propose Locret, a
framework for long-context LLM inference that introduces retaining heads to
evaluate the causal importance of KV cache units, allowing for more accurate
eviction within a fixed cache size. Locret is fine-tuned on top of the frozen
backbone LLM using a minimal amount of data from standard long-context SFT
datasets. During inference, we evict low-importance cache units along with a
chunked prefill pattern, significantly reducing peak GPU memory usage. We
conduct an extensive empirical study to evaluate Locret, where the experimental
results show that Locret outperforms the recent competitive approaches,
including InfLLM, Quantization, SirLLM, and MInference, in terms of memory
efficiency and the quality of generated contents – Locret achieves over a 20x
and 8x KV cache compression ratio compared to the full KV cache for
Phi-3-mini-128K and Llama-3.1-8B-instruct. Additionally, Locret can be combined
with other methods, such as quantization and token merging. To our knowledge,
Locret is the first framework capable of deploying Llama-3.1-8B or similar
models on a single Nvidia 4090 GPU, enabling 128K long-context inference
without compromising generation quality, and requiring little additional system
optimizations.
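The eviction-with-chunked-prefill idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: `score_fn` is a hypothetical stand-in for the importance score a trained retaining head would produce, and `chunk_size` and `budget` are illustrative parameter names. The sketch only shows how evicting after every chunk keeps the cache at or below `budget + chunk_size` units at any time.

```python
from typing import Callable, List, Tuple


def chunked_prefill_with_eviction(
    tokens: List[int],
    score_fn: Callable[[int, int], float],  # hypothetical importance scorer
    chunk_size: int,
    budget: int,
) -> List[Tuple[int, int]]:
    """Return the (position, token) cache units kept after prefill."""
    cache: List[Tuple[float, int, int]] = []  # (score, position, token)
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # Prefill one chunk: every new unit gets an importance score.
        for offset, tok in enumerate(chunk):
            pos = start + offset
            cache.append((score_fn(pos, tok), pos, tok))
        # Evict the lowest-scoring units so the cache fits the budget again.
        if len(cache) > budget:
            cache.sort(key=lambda unit: unit[0], reverse=True)
            cache = cache[:budget]
    # Restore positional order for the retained units.
    cache.sort(key=lambda unit: unit[1])
    return [(pos, tok) for _, pos, tok in cache]


if __name__ == "__main__":
    kept = chunked_prefill_with_eviction(
        tokens=[5, 1, 9, 3, 7, 2, 8],
        score_fn=lambda pos, tok: float(tok),  # toy score: the token value
        chunk_size=3,
        budget=2,
    )
    print(kept)
```

In this toy run the score is simply the token value, so the two largest tokens survive regardless of which chunk they arrived in.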
[COMMENTS]
Preprints
[LINK]
http://arxiv.org/abs/2410.01805v1
[DATE]
2024-10-03 01:59:52+08:00
[CATEGORIES]
cs.CL
Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models
[AUTHORS]
Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen
[ABSTRACT]
Predicting phenotypes with complex genetic bases based on a small,
interpretable set of variant features remains a challenging task.
Conventionally, data-driven approaches are utilized for this task, yet the high
dimensional nature of genotype data makes the analysis and prediction
difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and
their success in processing complex biomedical concepts, we set out to examine the
ability of LLMs in feature selection and engineering for tabular genotype data,
with a novel knowledge-driven framework. We develop FREEFORM, Free-flow
Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling,
designed with chain-of-thought and ensembling principles, to select and
engineer features with the intrinsic knowledge of LLMs. Evaluated on two
distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing
loss, we find this framework outperforms several data-driven methods,
particularly in low-shot regimes. FREEFORM is available as an open-source
framework on GitHub: https://github.com/PennShenLab/FREEFORM.
[LINK]
http://arxiv.org/abs/2410.01795v1
[DATE]
2024-10-03 01:53:08+08:00
[CATEGORIES]
cs.LG
cs.CL
OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models
[AUTHORS]
Heng Yang, Jack Cole, Ke Li
[ABSTRACT]
The advancements in artificial intelligence in recent years, such as Large
Language Models (LLMs), have fueled expectations for breakthroughs in genomic
foundation models (GFMs). The code of nature, hidden in diverse genomes since
the very beginning of life’s evolution, holds immense potential for impacting
humans and ecosystems through genome modeling. Recent breakthroughs in GFMs,
such as Evo, have attracted significant investment and attention to genomic
modeling, as they address long-standing challenges and transform in-silico
genomic studies into automated, reliable, and efficient paradigms. In the
context of this flourishing era of consecutive technological revolutions in
genomics, GFM studies face two major challenges: the lack of GFM benchmarking
tools and the absence of open-source software for diverse genomics. These
challenges hinder the rapid evolution of GFMs and their wide application in
tasks such as understanding and synthesizing genomes, problems that have
persisted for decades. To address these challenges, we introduce GFMBench, a
framework dedicated to GFM-oriented benchmarking. GFMBench standardizes
benchmark suites and automates benchmarking for a wide range of open-source
GFMs. It integrates millions of genomic sequences across hundreds of genomic
tasks from four large-scale benchmarks, democratizing GFMs for a wide range of
in-silico genomic applications. Additionally, GFMBench is released as
open-source software, offering user-friendly interfaces and diverse tutorials,
applicable for AutoBench and complex tasks like RNA design and structure
prediction. To facilitate further advancements in genome modeling, we have
launched a public leaderboard showcasing the benchmark performance derived from
AutoBench. GFMBench represents a step toward standardizing GFM benchmarking and
democratizing GFM applications.
[COMMENTS]
https://github.com/yangheng95/OmniGenomeBench
[LINK]
http://arxiv.org/abs/2410.01784v1
[DATE]
2024-10-03 01:40:44+08:00
[CATEGORIES]
cs.CL
What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations
[AUTHORS]
Kavya Manohar, Leena G Pillai
[COMMENTS]
Accepted to EMNLP 2024 Main
[LINK]
http://arxiv.org/abs/2409.02449v2
[DATE]
2024-10-03 01:40:25+08:00
[CATEGORIES]
cs.CL
Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models
[AUTHORS]
Shayekh Bin Islam, Md Asib Rahman, K S M Tozammel Hossain, Enamul Hoque, Shafiq Joty, Md Rizwan Parvez
[ABSTRACT]
Retrieval-Augmented Generation (RAG) has been shown to enhance the factual
accuracy of Large Language Models (LLMs), but existing methods often suffer
from limited reasoning capabilities in effectively using the retrieved
evidence, particularly when using open-source LLMs. To mitigate this gap, we
introduce a novel framework, Open-RAG, designed to enhance reasoning
capabilities in RAG with open-source LLMs. Our framework transforms an
arbitrary dense LLM into a parameter-efficient sparse mixture of experts (MoE)
model capable of handling complex reasoning tasks, including both single- and
multi-hop queries. Open-RAG uniquely trains the model to navigate challenging
distractors that appear relevant but are misleading. As a result, Open-RAG
leverages latent learning, dynamically selecting relevant experts and
integrating external knowledge effectively for more accurate and contextually
relevant responses. In addition, we propose a hybrid adaptive retrieval method
to determine retrieval necessity and balance the trade-off between performance
gain and inference speed. Experimental results show that the Llama2-7B-based
Open-RAG outperforms state-of-the-art LLMs and RAG models such as ChatGPT,
Self-RAG, and Command R+ in various knowledge-intensive tasks. We open-source
our code and models at https://openragmoe.github.io/
[COMMENTS]
Accepted to EMNLP 2024 Findings. Website:
https://openragmoe.github.io/. 14 pages, 7 figures, 5 tables
[LINK]
http://arxiv.org/abs/2410.01782v1
[DATE]
2024-10-03 01:37:18+08:00
[CATEGORIES]
cs.CL
cs.LG
Social Conjuring: Multi-User Runtime Collaboration with AI in Building Virtual 3D Worlds
[AUTHORS]
Amina Kobenova, Cyan DeVeaux, Samyak Parajuli, Andrzej Banburski-Fahey, Judith Amores Fernandez, Jaron Lanier
[ABSTRACT]
Generative artificial intelligence has shown promise in prompting virtual
worlds into existence, yet little attention has been given to understanding how
this process unfolds as social interaction. We present Social Conjurer, a
framework for AI-augmented dynamic 3D scene co-creation, where multiple users
collaboratively build and modify virtual worlds in real-time. Through an
expanded set of interactions, including social and tool-based engagements as
well as spatial reasoning, our framework facilitates the creation of rich,
diverse virtual environments. Findings from a preliminary user study (N=12)
provide insight into the user experience of this approach, how social contexts
shape the prompting of spatial environments, and perspective on social
applications of prompt-based 3D co-creation. In addition to highlighting the
potential of AI-supported multi-user world creation and offering new pathways
for AI-augmented creative processes in VR, this article presents a set of
implications for designing human-centered interfaces that incorporate AI models
into 3D content generation.
[COMMENTS]
27 pages + Appendix, 16 figures; fixed some minor UTF-8 encoding
issues in arXiv compilation
[LINK]
http://arxiv.org/abs/2410.00274v2
[DATE]
2024-10-03 01:34:41+08:00
[CATEGORIES]
cs.CL
Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in Neural Nets
[AUTHORS]
Yuandong Tian
[ABSTRACT]
We prove rich algebraic structures of the solution space for 2-layer neural
networks with quadratic activation and $L_2$ loss, trained on reasoning tasks
in Abelian groups (e.g., modular addition). Such a rich structure enables
analytical construction of global optimal solutions from partial solutions that
only satisfy part of the loss, despite its high nonlinearity. We coin the
framework as CoGO (Composing Global Optimizers). Specifically, we show that the
weight space over different numbers of hidden nodes of the 2-layer network is
equipped with a semi-ring algebraic structure, and the loss function to be
optimized consists of monomial potentials, which are ring homomorphisms,
allowing partial solutions to be composed into global ones by ring addition and
multiplication. Our experiments show that around $95\%$ of the solutions
obtained by gradient descent match exactly our theoretical constructions.
Although the global optimizers constructed only required a small number of
hidden nodes, our analysis on gradient dynamics shows that
over-parameterization asymptotically decouples training dynamics and is
beneficial. We further show that training dynamics favors simpler solutions
under weight decay, and thus high-order global optimizers such as perfect
memorization are unfavorable.
[LINK]
http://arxiv.org/abs/2410.01779v1
[DATE]
2024-10-03 01:33:26+08:00
[CATEGORIES]
cs.LG
cs.CL
DeFine: Enhancing LLM Decision-Making with Factor Profiles and Analogical Reasoning
[AUTHORS]
Yebowen Hu, Xiaoyang Wang, Wenlin Yao, Yiming Lu, Daoan Zhang, Hassan Foroosh, Dong Yu, Fei Liu
[ABSTRACT]
LLMs are ideal for decision-making due to their ability to reason over long
contexts and identify critical factors. However, challenges arise when
processing transcripts of spoken speech describing complex scenarios. These
transcripts often contain ungrammatical or incomplete sentences, repetitions,
hedging, and vagueness. For example, during a company’s earnings call, an
executive might project a positive revenue outlook to reassure investors,
despite significant uncertainty regarding future earnings. It is crucial for
LLMs to incorporate this uncertainty systematically when making decisions. In
this paper, we introduce DeFine, a new framework that constructs probabilistic
factor profiles from complex scenarios. DeFine then integrates these profiles
with analogical reasoning, leveraging insights from similar past experiences to
guide LLMs in making critical decisions in novel situations. Our framework
separates the tasks of quantifying uncertainty in complex scenarios and
incorporating it into LLM decision-making. This approach is particularly useful
in fields such as medical consultations, negotiations, and political debates,
where making decisions under uncertainty is vital.
[LINK]
http://arxiv.org/abs/2410.01772v1
[DATE]
2024-10-03 01:29:34+08:00
[CATEGORIES]
cs.CL
Quantifying Generalization Complexity for Large Language Models
[AUTHORS]
Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, James Glass
[ABSTRACT]
While large language models (LLMs) have shown exceptional capabilities in
understanding complex queries and performing sophisticated tasks, their
generalization abilities are often deeply entangled with memorization,
necessitating more precise evaluation. To address this challenge, we introduce
Scylla, a dynamic evaluation framework that quantitatively measures the
generalization abilities of LLMs. Scylla disentangles generalization from
memorization by assessing model performance on both in-distribution (ID) and
out-of-distribution (OOD) data through 20 tasks across 5 levels of complexity.
Through extensive experiments, we uncover a non-monotonic relationship between
task complexity and the performance gap between ID and OOD data, which we term
the generalization valley. Specifically, this phenomenon reveals a critical
threshold - referred to as critical complexity - where reliance on
non-generalizable behavior peaks, indicating the upper bound of LLMs’
generalization capabilities. As model size increases, the critical complexity
shifts toward higher levels of task complexity, suggesting that larger models
can handle more complex reasoning tasks before over-relying on memorization.
Leveraging Scylla and the concept of critical complexity, we benchmark 28 LLMs,
including both open-source models such as the LLaMA and Qwen families, and
closed-source models like Claude and GPT, providing a more robust evaluation
and establishing a clearer understanding of LLMs’ generalization capabilities.
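The notion of critical complexity above can be made concrete with a small sketch. It assumes the ID/OOD gap is measured as a simple accuracy difference per complexity level and that the critical complexity is the level where that gap peaks; the function name and the example accuracies are illustrative assumptions, not values or code from the paper.

```python
from typing import Dict, Tuple


def critical_complexity(
    id_acc: Dict[int, float], ood_acc: Dict[int, float]
) -> Tuple[int, float]:
    """Return (level, gap) for the complexity level with the largest ID-OOD gap."""
    # Gap per level: how much better the model does on in-distribution data.
    gaps = {level: id_acc[level] - ood_acc[level] for level in id_acc}
    level = max(gaps, key=gaps.get)
    return level, gaps[level]


if __name__ == "__main__":
    # Made-up accuracies for five complexity levels (illustration only).
    id_acc = {1: 0.90, 2: 0.80, 3: 0.70, 4: 0.40, 5: 0.20}
    ood_acc = {1: 0.85, 2: 0.70, 3: 0.40, 4: 0.30, 5: 0.15}
    print(critical_complexity(id_acc, ood_acc))
```

With these made-up numbers the gap first widens and then narrows as complexity grows, which is the non-monotonic shape the abstract describes.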
[LINK]
http://arxiv.org/abs/2410.01769v1
[DATE]
2024-10-03 01:25:37+08:00
[CATEGORIES]
cs.CL
Eliminating Position Bias of Language Models: A Mechanistic Approach
[AUTHORS]
Ziqi Wang, Hanlin Zhang, Xiner Li, Kuan-Hao Huang, Chi Han, Shuiwang Ji, Sham M. Kakade, Hao Peng, Heng Ji
[ABSTRACT]
Position bias has proven to be a prevalent issue of modern language models
(LMs), where the models prioritize content based on its position within the
given context. This bias often leads to unexpected model failures and hurts
performance, robustness, and reliability across various applications. Our
mechanistic analysis attributes the position bias to two components employed in
nearly all state-of-the-art LMs: causal attention and relative positional
encodings. Based on the analyses, we propose to eliminate position bias (e.g.,
different retrieved documents’ orders in QA affect performance) with a
training-free zero-shot approach. Our method changes the causal attention to
bidirectional attention between documents and utilizes model attention values
to decide the relative orders of documents instead of using the order provided
in input prompts, therefore enabling Position-INvariant inferencE (PINE) at the
document level. By eliminating position bias, models achieve better performance
and reliability in downstream tasks, including LM-as-a-judge,
retrieval-augmented QA, molecule generation, and math reasoning. Notably, PINE
is especially useful when adapting LMs for evaluating reasoning pairs: it
consistently provides 8 to 10 percentage points performance gains, making
Llama-3-70B-Instruct perform even better than GPT-4-0125-preview and
GPT-4o-2024-08-06 on the RewardBench reasoning set.
[COMMENTS]
26 pages, 6 figures, 15 tables
[LINK]
http://arxiv.org/abs/2407.01100v2
[DATE]
2024-10-03 01:09:53+08:00
[CATEGORIES]
cs.CL
cs.LG
Scaling Optimal LR Across Token Horizons
[AUTHORS]
Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, Xia Song
[ABSTRACT]
State-of-the-art LLMs are powered by scaling – scaling model size, dataset
size and cluster size. It is economically infeasible to extensively tune
hyperparameters for the largest runs. Instead, approximately optimal
hyperparameters must be inferred or transferred from smaller
experiments. Hyperparameter transfer across model sizes has been studied in
Yang et al. However, hyperparameter transfer across dataset size – or token
horizon – has not been studied yet. To remedy this, we conduct a large-scale
empirical study on how optimal learning rate (LR) depends on token horizon in
LLM training. We first demonstrate that the optimal LR changes significantly
with token horizon – longer training necessitates a smaller LR. Second, we
demonstrate that the optimal LR follows a scaling law, and that the optimal LR
for longer horizons can be accurately estimated from shorter horizons via such
scaling laws. We also provide a rule-of-thumb for transferring LR across token
horizons with zero overhead over current practices. Lastly, we provide evidence
that LLama-1 used too high an LR, and estimate the resulting performance hit. We
thus argue that hyperparameter transfer across data size is an important and
overlooked component of LLM training.
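Extrapolating the optimal LR from short horizons to a long one can be sketched as a curve fit. The abstract does not state the functional form of the scaling law, so the power law `lr = a * horizon ** b` used here is an assumption, as are the function names and the synthetic data points.

```python
import math
from typing import List, Tuple


def fit_power_law(horizons: List[float], best_lrs: List[float]) -> Tuple[float, float]:
    """Least-squares fit of log(lr) = log(a) + b * log(horizon); returns (a, b)."""
    xs = [math.log(h) for h in horizons]
    ys = [math.log(lr) for lr in best_lrs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope and intercept of ordinary least squares in log-log space.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_lr(a: float, b: float, horizon: float) -> float:
    """Evaluate the fitted power law at a new token horizon."""
    return a * horizon ** b


if __name__ == "__main__":
    # Synthetic (horizon, best LR) pairs generated from an assumed power law.
    horizons = [1e9, 2e9, 4e9]
    best_lrs = [3.0 * h ** -0.3 for h in horizons]
    a, b = fit_power_law(horizons, best_lrs)
    print(b, predict_lr(a, b, 8e9))
```

A negative fitted exponent `b` corresponds to the abstract's observation that longer training calls for a smaller LR.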
[LINK]
http://arxiv.org/abs/2409.19913v2
[DATE]
2024-10-03 01:03:25+08:00
[CATEGORIES]
cs.LG
cs.CL
LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks
[AUTHORS]
Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Meng Jiang, Dong Yu
[ABSTRACT]
Text-rich images, where text serves as the central visual element guiding the
overall understanding, are prevalent in real-world applications, such as
presentation slides, scanned documents, and webpage snapshots. Tasks involving
multiple text-rich images are especially challenging, as they require not only
understanding the content of individual images but also reasoning about
inter-relationships and logical flows across multiple visual inputs. Despite
the importance of these scenarios, current multimodal large language models
(MLLMs) struggle to handle such tasks due to two key challenges: (1) the
scarcity of high-quality instruction tuning datasets for text-rich multi-image
scenarios, and (2) the difficulty in balancing image resolution with visual
feature sequence length. To address these challenges, we propose LEOPARD, an
MLLM designed specifically for handling vision-language tasks involving
multiple text-rich images. First, we curated about one million high-quality
multimodal instruction-tuning samples, tailored to text-rich, multi-image
scenarios. Second, we developed an adaptive high-resolution multi-image
encoding module to dynamically optimize the allocation of visual sequence
length based on the original aspect ratios and resolutions of the input images.
Experiments across a wide range of benchmarks demonstrate our model’s superior
capabilities in text-rich, multi-image evaluations and competitive performance
in general domain evaluations.
[COMMENTS]
Our code is available at https://github.com/Jill0001/Leopard
[LINK]
http://arxiv.org/abs/2410.01744v1
[DATE]
2024-10-03 00:55:01+08:00
[CATEGORIES]
cs.CL
Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment
[AUTHORS]
Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Zexu Sun, Bowen Sun, Huimin Chen, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, Maosong Sun
[COMMENTS]
EMNLP 2024 main conference
[LINK]
http://arxiv.org/abs/2402.19085v2
[DATE]
2024-10-03 00:54:33+08:00
[CATEGORIES]
cs.CL
README: Bridging Medical Jargon and Lay Understanding for Patient Education through Data-Centric NLP
[AUTHORS]
Zonghai Yao, Nandyala Siddharth Kantu, Guanghao Wei, Hieu Tran, Zhangqi Duan, Sunjae Kwon, Zhichao Yang, README annotation team, Hong Yu
[ABSTRACT]
The advancement in healthcare has shifted focus toward patient-centric
approaches, particularly in self-care and patient education, facilitated by
access to Electronic Health Records (EHR). However, medical jargon in EHRs
poses significant challenges in patient comprehension. To address this, we
introduce a new task of automatically generating lay definitions, aiming to
simplify complex medical terms into patient-friendly lay language. We first
created the README dataset, an extensive collection of over 50,000 unique
(medical term, lay definition) pairs and 300,000 mentions, each offering
context-aware lay definitions manually annotated by domain experts. We have
also engineered a data-centric Human-AI pipeline that synergizes data
filtering, augmentation, and selection to improve data quality. We then used
README as the training data for models and leveraged a Retrieval-Augmented
Generation method to reduce hallucinations and improve the quality of model
outputs. Our extensive automatic and human evaluations demonstrate that
open-source mobile-friendly models, when fine-tuned with high-quality data, are
capable of matching or even surpassing the performance of state-of-the-art
closed-source large language models like ChatGPT. This research represents a
significant stride in closing the knowledge gap in patient education and
advancing patient-centric healthcare solutions.
[COMMENTS]
To appear in Findings of the Association for Computational
Linguistics: EMNLP 2024
[LINK]
http://arxiv.org/abs/2312.15561v4
[DATE]
2024-10-03 00:52:30+08:00
[CATEGORIES]
cs.CL
Recursive Abstractive Processing for Retrieval in Dynamic Datasets
[AUTHORS]
Charbel Chucri, Rami Azouz, Joachim Ott
[ABSTRACT]
Recent retrieval-augmented models enhance basic methods by building a
hierarchical structure over retrieved text chunks through recursive embedding,
clustering, and summarization. The most relevant information is then retrieved
from both the original text and generated summaries. However, such approaches
face limitations with dynamic datasets, where adding or removing documents over
time complicates the updating of hierarchical representations formed through
clustering. We propose a new algorithm to efficiently maintain the
recursive-abstractive tree structure in dynamic datasets, without compromising
performance. Additionally, we introduce a novel post-retrieval method that
applies query-focused recursive abstractive processing to substantially improve
context quality. Our method overcomes the limitations of other approaches by
functioning as a black-box post-retrieval layer compatible with any retrieval
algorithm. Both algorithms are validated through extensive experiments on
real-world datasets, demonstrating their effectiveness in handling dynamic data
and improving retrieval performance.
[LINK]
http://arxiv.org/abs/2410.01736v1
[DATE]
2024-10-03 00:47:35+08:00
[CATEGORIES]
cs.CL
cs.LG
SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking
[AUTHORS]
Zhuang Li, Yuncheng Hua, Thuy-Trang Vu, Haolan Zhan, Lizhen Qu, Gholamreza Haffari
[ABSTRACT]
Recent studies have shown that maintaining a consistent response style by
human experts and enhancing data quality in training sets can significantly
improve the performance of fine-tuned Large Language Models (LLMs) while
reducing the number of training examples needed. However, the precise
definition of style and the relationship between style, data quality, and LLM
performance remain unclear. This research identifies two key stylistic
elements in responses: linguistic form and semantic surprisal. We find that,
among training data of comparable quality, higher consistency in these response
elements leads to better LLM performance. Inspired by this, we introduce Style
Consistency-Aware Response Ranking (SCAR), which automatically prioritizes
instruction-response pairs in the training set based on their response
stylistic consistency. By selecting the most style-consistent examples,
sometimes as few as 0.7% of the full dataset, the fine-tuned LLMs can match or
even surpass the performance of models trained on the entire dataset in coding
and open-ended question-answering benchmarks. Code and data are available at
https://github.com/zhuang-li/SCAR .
[COMMENTS]
27 pages
[LINK]
http://arxiv.org/abs/2406.10882v5
[DATE]
2024-10-03 00:46:54+08:00
[CATEGORIES]
cs.CL
LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
[AUTHORS]
Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
[ABSTRACT]
Reward Models (RMs) play a crucial role in aligning LLMs with human
preferences, enhancing their performance by ranking outputs during inference or
iterative training. However, the degree to which an RM generalizes to new tasks
is often not known a priori (e.g. some RMs may excel at scoring creative
writing vs. math reasoning). Therefore, using only one fixed RM while training
LLMs can be suboptimal. Moreover, optimizing LLMs with multiple RMs
simultaneously can be prohibitively computationally intensive and challenging
due to conflicting signals from different RMs, potentially degrading
performance. To address these challenges, we introduce LASeR (Learning to
Adaptively Select Rewards), which iteratively trains LLMs using multiple RMs,
selecting and utilizing the most well-suited RM for each instance to rank
outputs and generate preference data, framed as a multi-armed bandit problem.
Our results on commonsense and math reasoning tasks demonstrate that LASeR can
boost iterative LLM optimization by optimizing for multiple RMs, improving the
absolute average accuracy of Llama-3-8B over three datasets by 2.67% over
training with ensemble RM scores while also showing superior training
efficiency (e.g., a 2x speedup). Moreover, on WildChat, a benchmark of
instruction-following prompts, we find that using Llama-3-8B LASeR leads to a
71.45% AlpacaEval win rate over sequentially optimizing multiple RMs. Extending
to long-context generation tasks, we find that on Llama-3-8B, LASeR achieves an
average improvement of 2.64 F1 and 2.42 F1 on single- and multi-document QA
over random RM selection when used with best-of-n sampling. LASeR is robust to
noisy rewards and generalizes to multiple settings. Finally, LASeR’s RM
selection changes depending on the underlying task or instance and we verify
the presence of conflicting preferences from multiple RMs that can be mitigated
using LASeR.
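Framing reward-model selection as a multi-armed bandit can be illustrated with a generic selector. The abstract does not say which bandit algorithm LASeR uses, so the sketch below swaps in the standard UCB1 rule; the class name and the toy reward signal are hypothetical. Each arm stands for one RM, and the reward stands for whatever feedback the training loop reports after using that RM.

```python
import math
from typing import List


class UCB1Selector:
    """Standard UCB1 bandit; each arm represents one candidate reward model."""

    def __init__(self, n_arms: int) -> None:
        self.counts: List[int] = [0] * n_arms
        self.means: List[float] = [0.0] * n_arms

    def select(self) -> int:
        # Try every arm once before applying the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        ucb = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Incremental running mean of the observed reward for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


if __name__ == "__main__":
    toy_rewards = [0.2, 0.8, 0.5]  # made-up usefulness of three RMs
    selector = UCB1Selector(n_arms=3)
    for _ in range(500):
        arm = selector.select()
        selector.update(arm, toy_rewards[arm])
    print(selector.counts)
```

Over repeated rounds the selector concentrates its pulls on the arm with the highest observed reward while still occasionally revisiting the others.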
[COMMENTS]
20 pages; First two authors contributed equally. Code:
https://github.com/duykhuongnguyen/LASeR-MAB
[LINK]
http://arxiv.org/abs/2410.01735v1
[DATE]
2024-10-03 00:46:38+08:00
[CATEGORIES]
cs.CL
cs.LG
Visual Perception in Text Strings
[AUTHORS]
Qi Jia, Xiang Yue, Shanshan Huang, Ziheng Qin, Yizhu Liu, Bill Yuchen Lin, Yang You
[ABSTRACT]
Understanding visual semantics embedded in consecutive characters is a
crucial capability for both large language models (LLMs) and multi-modal large
language models (MLLMs). This type of artifact possesses the unique
characteristic that identical information can be readily formulated in both
texts and images, making them a significant proxy for analyzing modern LLMs’
and MLLMs’ capabilities in modality-agnostic vision understanding. In this
work, we select ASCII art as a representative artifact, where the lines and
brightness used to depict each concept are rendered by characters, and we frame
the problem as an ASCII art recognition task. We benchmark model performance on
this task by constructing an evaluation dataset with an elaborate
categorization tree and also collect a training set to elicit the models’
visual perception ability. Through a comprehensive analysis of dozens of
models, results reveal that although humans can achieve nearly 100% accuracy,
the state-of-the-art LLMs and MLLMs lag far behind. Models are capable of
recognizing concepts depicted in ASCII art given only text inputs, as
indicated by over 60% accuracy for some concepts, but most of them achieve
merely around 30% accuracy when averaged across all categories. When provided
with images as inputs, GPT-4o gets 82.68%, outperforming the strongest
open-source MLLM by 21.95%. Although models favor different kinds of ASCII art
depending on the modality provided, none of the MLLMs successfully benefit when
both modalities are supplied simultaneously. Moreover, supervised fine-tuning
helps improve models’ accuracy especially when provided with the image
modality, but also highlights the need for better training techniques to
enhance the information fusion among modalities.
[LINK]
http://arxiv.org/abs/2410.01733v1
[DATE]
2024-10-03 00:46:01+08:00
[CATEGORIES]
cs.CL
ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation
[AUTHORS]
Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Gal Chechik
[ABSTRACT]
The practical use of text-to-image generation has evolved from simple,
monolithic models to complex workflows that combine multiple specialized
components. While workflow-based approaches can lead to improved image quality,
crafting effective workflows requires significant expertise, owing to the large
number of available components, their complex inter-dependence, and their
dependence on the generation prompt. Here, we introduce the novel task of
prompt-adaptive workflow generation, where the goal is to automatically tailor
a workflow to each user prompt. We propose two LLM-based approaches to tackle
this task: a tuning-based method that learns from user-preference data, and a
training-free method that uses the LLM to select existing flows. Both
approaches lead to improved image quality when compared to monolithic models or
generic, prompt-independent workflows. Our work shows that prompt-dependent
flow prediction offers a new pathway to improving text-to-image generation
quality, complementing existing research directions in the field.
[COMMENTS]
Project website: https://comfygen-paper.github.io/
[LINK]
http://arxiv.org/abs/2410.01731v1
[DATE]
2024-10-03 00:43:24+08:00
[CATEGORIES]
cs.CL
Auto-Demo Prompting: Leveraging Generated Outputs as Demonstrations for Enhanced Batch Prompting
[AUTHORS]
Longyu Feng, Mengze Hong, Chen Jason Zhang
[ABSTRACT]
Batch prompting is a common technique in large language models (LLMs) used to
process multiple inputs simultaneously, aiming to improve computational
efficiency. However, as batch sizes increase, performance degradation often
occurs due to the model’s difficulty in handling lengthy context inputs.
Existing methods that attempt to mitigate these issues rely solely on batch
data arrangement and majority voting rather than improving the design of the
batch prompt itself. In this paper, we address these limitations by proposing
“Auto-Demo Prompting,” a novel approach that leverages the question-output
pairs from earlier questions within a batch as demonstrations for subsequent
answer inference. We provide a formal theoretical analysis of how Auto-Demo
Prompting functions within the autoregressive generation process of LLMs,
illustrating how it utilizes prior outputs to optimize the model’s internal
representations. Our method effectively bridges the gap between batch prompting
and few-shot prompting, enhancing performance with only a slight compromise in
token usage. Experimental results across five NLP tasks demonstrate its
effectiveness in mitigating performance degradation and occasionally
outperforming single prompts. Furthermore, it opens new avenues for applying
few-shot learning techniques, such as demonstration selection, within batch
prompting, making it a robust solution for real-world applications.
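One way to picture the idea above is a batch prompt that asks the model to restate each question before answering it, so that earlier question-answer pairs sit in the generated context when later questions are answered. The prompt wording and the function name below are illustrative assumptions, not the authors' template.

```python
from typing import List


def build_auto_demo_prompt(questions: List[str]) -> str:
    """Build a batch prompt whose output interleaves questions and answers."""
    lines = [
        "Answer the questions below in order.",
        "For each one, first repeat the question, then give its answer,",
        "using the format 'Question i: ...' followed by 'Answer i: ...'.",
        "",
    ]
    # List the batch; the model is asked to echo each item as it answers.
    for i, question in enumerate(questions, start=1):
        lines.append(f"Q{i}: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_auto_demo_prompt(["What is 2 + 2?", "What is the capital of France?"]))
```

Because generation is autoregressive, each restated question and its answer precede the next one in the output, which is what lets them act as demonstrations.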
[LINK]
http://arxiv.org/abs/2410.01724v1
[DATE]
2024-10-03 00:34:40+08:00
[CATEGORIES]
cs.CL
Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective
[AUTHORS]
Zeyu Gan, Yong Liu
[ABSTRACT]
Synthetic data has become a pivotal resource in post-training tasks for large
language models (LLMs) due to the scarcity of high-quality, specific data.
While various methods have been developed to generate synthetic data, there
remains a discernible gap between the practical effects of synthetic data and
our theoretical comprehension. To address this challenge, we commence by
presenting a detailed modeling of the prevalent synthetic data generation
process. Building upon this modeling, we demonstrate that the generalization
capability of the post-trained model is critically determined by the
information gain derived from the generative model, as analyzed from a novel
reverse-bottleneck perspective. Moreover, we introduce the concept of
Generalization Gain via Mutual Information (GGMI) and elucidate the
relationship between generalization gain and information gain. This analysis
serves as a theoretical foundation for synthetic data generation and further
highlights its connection with the generalization capability of post-trained
models, offering an understanding about the design of synthetic data generation
techniques and the optimization of the post-training process. We open source
our code through an anonymous GitHub repository at
https://anonymous.4open.science/r/Understanding-Synthetic.
[LINK]
http://arxiv.org/abs/2410.01720v1
[DATE]
2024-10-03 00:32:05+08:00
[CATEGORIES]
cs.CL
cs.LG
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models
[AUTHORS]
Yuxuan Wan, Wenxuan Wang, Yiliu Yang, Youliang Yuan, Jen-tse Huang, Pinjia He, Wenxiang Jiao, Michael R. Lyu
[ABSTRACT]
We introduce LogicAsker, a novel approach for evaluating and enhancing the
logical reasoning capabilities of large language models (LLMs) such as ChatGPT
and GPT-4. Despite LLMs’ prowess in tasks like writing assistance, code
generation, and machine translation, assessing their ability to reason has been
challenging. Traditional evaluations often prioritize accuracy on downstream
tasks over direct assessments of reasoning processes. LogicAsker addresses this
gap by employing a set of atomic reasoning skills grounded in propositional and
predicate logic to systematically examine and improve the reasoning prowess of
LLMs. Our methodology reveals significant gaps in LLMs’ learning of logical
rules, with identified reasoning failures ranging from 29% to 90% across
different models. Moreover, we leverage these findings to construct targeted
demonstration examples and fine-tune data, notably enhancing logical reasoning
in models like GPT-4o by up to 5%. To our knowledge, this is the first effort
to utilize test case outcomes to effectively refine LLMs’ formal reasoning
capabilities. We make our code, data, and results publicly available
(https://github.com/yxwan123/LogicAsker) to facilitate further research and
replication of our findings.
[COMMENTS]
Accepted by EMNLP 2024
[LINK]
http://arxiv.org/abs/2401.00757v2
[DATE]
2024-10-03 00:30:34+08:00
[CATEGORIES]
cs.CL
SysCaps: Language Interfaces for Simulation Surrogates of Complex Systems
[AUTHORS]
Patrick Emami, Zhaonan Li, Saumya Sinha, Truc Nguyen
[ABSTRACT]
Surrogate models are used to predict the behavior of complex energy systems
that are too expensive to simulate with traditional numerical methods. Our work
introduces the use of language descriptions, which we call “system captions” or
SysCaps, to interface with such surrogates. We argue that interacting with
surrogates through text, particularly natural language, makes these models more
accessible for both experts and non-experts. We introduce a lightweight
multimodal text and timeseries regression model and a training pipeline that
uses large language models (LLMs) to synthesize high-quality captions from
simulation metadata. Our experiments on two real-world simulators of buildings
and wind farms show that our SysCaps-augmented surrogates have better accuracy
on held-out systems than traditional methods while enjoying new generalization
abilities, such as handling semantically related descriptions of the same test
system. Additional experiments also highlight the potential of SysCaps to
unlock language-driven design space exploration and to regularize training
through prompt augmentation.
[COMMENTS]
21 pages. Under review
[LINK]
http://arxiv.org/abs/2405.19653v2
[DATE]
2024-10-03 00:23:12+08:00
[CATEGORIES]
cs.LG
cs.CL
Interpretable Contrastive Monte Carlo Tree Search Reasoning
[AUTHORS]
Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, Lijie Wen
[ABSTRACT]
We propose SC-MCTS: a novel Monte Carlo Tree Search (MCTS) reasoning
algorithm for Large Language Models (LLMs) that significantly improves both
reasoning accuracy and speed. Our motivation comes from three observations:
1. Previous MCTS LLM reasoning works often overlooked its biggest drawback,
slower speed compared to CoT; 2. Previous research mainly used MCTS as a tool
for LLM reasoning on various tasks, with limited quantitative analysis or
ablation studies of its components from a reasoning interpretability
perspective; 3. The reward model is the most crucial component in MCTS, yet
previous work has rarely conducted in-depth study or improvement of MCTS’s
reward models. Thus, we conducted
extensive ablation studies and quantitative analysis on components of MCTS,
revealing the impact of each component on the MCTS reasoning performance of
LLMs. Building on this, (i) we designed a highly interpretable reward model
based on the principle of contrastive decoding and (ii) achieved an average
speed improvement of 51.9% per node using speculative decoding. Additionally,
(iii) we improved the UCT node selection strategy and backpropagation used in
previous works, resulting in significant performance improvement. We
outperformed o1-mini by an average of 17.4% on the Blocksworld multi-step
reasoning dataset using Llama-3.1-70B with SC-MCTS.
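
For context, the baseline that such work modifies is the standard UCT rule, which scores each child by its mean value plus an exploration bonus. The sketch below shows only that textbook rule; it is not the improved selection strategy described in the abstract, and the `(total_value, visits)` child representation is an assumption made for illustration.

```python
import math


def uct_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCT: mean value plus c * sqrt(ln(parent visits) / visits)."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, c=math.sqrt(2)):
    """Return the index of the child with the highest UCT score.

    children: list of (total_value, visits) pairs (hypothetical layout).
    """
    parent_visits = max(1, sum(visits for _, visits in children))
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], parent_visits, c),
    )
```

Setting `c=0` reduces selection to picking the highest mean value, while larger `c` favors rarely visited children.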
[LINK]
http://arxiv.org/abs/2410.01707v1
[DATE]
2024-10-03 00:15:31+08:00
[CATEGORIES]
cs.CL
An Exploration of Self-Supervised Mutual Information Alignment for Multi-Task Settings
[AUTHORS]
Soham Govande
[ABSTRACT]
There is a growing need for pluralistic alignment methods that can steer
language models towards individual attributes and preferences. One such method,
Self-Supervised Alignment with Mutual Information (SAMI), uses conditional
mutual information to encourage the connection between behavioral preferences
and model responses. We conduct two experiments exploring SAMI in multi-task
settings. First, we compare SAMI to Direct Preference Optimization (DPO) on a
multi-task benchmark (MT-Bench), using a stronger model to generate training
data for a weaker one across diverse categories (humanities, STEM, extraction,
coding, math, reasoning, and roleplay). Our results indicate that one iteration
of SAMI has a 57% win rate against DPO, with significant variation in
performance between task categories. Second, we examine SAMI’s impact on
mathematical accuracy (GSM-8K) relative to supervised fine-tuning (SFT). While
SAMI increases zero-shot performance by 1.1%, SFT is more effective with a 3.2%
boost. However, SAMI shows interesting scaling trends. When given 10 attempts,
SAMI improves accuracy by 3.9%, while SFT achieves a 10.1% increase. Combining
SAMI with SFT yields an additional improvement of 1.3% in multi-attempt
settings, though single-attempt accuracy remains unchanged.
[LINK]
http://arxiv.org/abs/2410.01704v1
[DATE]
2024-10-03 00:15:04+08:00
[CATEGORIES]
cs.CL
Optimized Multi-Token Joint Decoding with Auxiliary Model for LLM Inference
[AUTHORS]
Zongyue Qin, Ziniu Hu, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun
[ABSTRACT]
Large language models (LLMs) have achieved remarkable success across diverse
tasks, yet their inference processes are hindered by substantial time and
energy demands due to single-token generation at each decoding step. While
previous methods such as speculative decoding mitigate these inefficiencies by
producing multiple tokens per step, each token is still generated by its
single-token distribution, thereby enhancing speed without improving
effectiveness. In contrast, our work simultaneously enhances inference speed
and improves the output effectiveness. We consider multi-token joint decoding
(MTJD), which generates multiple tokens from their joint distribution at each
iteration, theoretically reducing perplexity and enhancing task performance.
However, MTJD suffers from the high cost of sampling from the joint
distribution of multiple tokens. Inspired by speculative decoding, we introduce
multi-token assisted decoding (MTAD), a novel framework designed to accelerate
MTJD. MTAD leverages a smaller auxiliary model to approximate the joint
distribution of a larger model, incorporating a verification mechanism that not
only ensures the accuracy of this approximation, but also improves the decoding
efficiency over conventional speculative decoding. Theoretically, we
demonstrate that MTAD closely approximates exact MTJD with bounded error.
Empirical evaluations using Llama-2 and OPT models ranging from 13B to 70B
parameters across various tasks reveal that MTAD reduces perplexity by 21.2%
and improves downstream performance compared to standard single-token sampling.
Furthermore, MTAD achieves a 1.42x speed-up and consumes 1.54x less energy than
conventional speculative decoding methods. These results highlight MTAD’s
ability to make multi-token joint decoding both effective and efficient,
promoting more sustainable and high-performance deployment of LLMs.
[LINK]
http://arxiv.org/abs/2407.09722v2
[DATE]
2024-10-03 00:14:09+08:00
[CATEGORIES]
cs.CL
cs.LG
CreDes: Causal Reasoning Enhancement and Dual-End Searching for Solving Long-Range Reasoning Problems using LLMs
[AUTHORS]
Kangsheng Wang, Xiao Zhang, Hao Liu, Songde Han, Huimin Ma, Tianyu Hu
[ABSTRACT]
Large language models (LLMs) have demonstrated limitations in handling
combinatorial optimization problems involving long-range reasoning, partially
due to causal hallucinations and huge search space. As for causal
hallucinations, i.e., the inconsistency between reasoning and corresponding
state transition, this paper introduces the Causal Relationship Enhancement
(CRE) mechanism combining cause-effect interventions and the Individual
Treatment Effect (ITE) to guarantee the solid causal rightness between each
step of reasoning and state transition. As for the long causal range and huge
search space limiting the performances of existing models featuring
single-direction search, a Dual-End Searching (DES) approach is proposed to
seek solutions by simultaneously starting from both the initial and goal states
on the causal probability tree. By integrating CRE and DES (CreDes), our model
has realized simultaneous multi-step reasoning, circumventing the
inefficiencies from cascading multiple one-step reasoning like the
Chain-of-Thought (CoT). Experiments demonstrate that CreDes significantly
outperforms existing State-Of-The-Art (SOTA) solutions in long-range reasoning
tasks in terms of both accuracy and time efficiency.
[LINK]
http://arxiv.org/abs/2410.01696v1
[DATE]
2024-10-03 00:05:01+08:00
[CATEGORIES]
cs.CL
U-shaped and Inverted-U Scaling behind Emergent Abilities of Large Language Models
[AUTHORS]
Tung-Yu Wu, Pei-Yu Lo
[ABSTRACT]
Large language models (LLMs) have been shown to exhibit emergent abilities in
some downstream tasks, where performance seems to stagnate at first and then
improve sharply and unpredictably with scale beyond a threshold. By dividing
questions in the datasets according to difficulty level by average performance,
we observe U-shaped scaling for hard questions, and inverted-U scaling followed
by steady improvement for easy questions. Moreover, the emergence threshold
roughly coincides with the point at which performance on easy questions reverts
from inverse scaling to standard scaling. Capitalizing on the observable though
opposing scaling trend on easy and hard questions, we propose a simple yet
effective pipeline, called Slice-and-Sandwich, to predict both the emergence
threshold and model performance beyond the threshold.
[COMMENTS]
Preprint. Under review
[LINK]
http://arxiv.org/abs/2410.01692v1
[DATE]
2024-10-03 00:03:49+08:00
[CATEGORIES]
cs.CL
FactAlign: Long-form Factuality Alignment of Large Language Models
[AUTHORS]
Chao-Wei Huang, Yun-Nung Chen
[ABSTRACT]
Large language models have demonstrated significant potential as the
next-generation information access engines. However, their reliability is
hindered by issues of hallucination and generating non-factual content. This is
particularly problematic in long-form responses, where assessing and ensuring
factual accuracy is complex. In this paper, we address this gap by proposing
FactAlign, a novel alignment framework designed to enhance the factuality of
LLMs’ long-form responses while maintaining their helpfulness. We introduce
fKTO, a fine-grained, sentence-level alignment algorithm that extends the
Kahneman-Tversky Optimization (KTO) alignment method. Leveraging recent
advances in automatic factuality evaluation, FactAlign utilizes fine-grained
factuality assessments to guide the alignment process. Our experiments on
open-domain prompts and information-seeking questions demonstrate that
FactAlign significantly improves the factual accuracy of LLM responses while
also improving their helpfulness. Further analyses identify that FactAlign is
capable of training LLMs to provide more information without losing factual
precision, thus improving the factual F1 score. Our source code, datasets, and
trained models are publicly available at https://github.com/MiuLab/FactAlign
[COMMENTS]
Accepted to EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2410.01691v1
[DATE]
2024-10-03 00:03:13+08:00
[CATEGORIES]
cs.CL
Disentangling Singlish Discourse Particles with Task-Driven Representation
[AUTHORS]
Linus Tze En Foo, Lynnette Hui Xian Ng
[ABSTRACT]
Singlish, or formally Colloquial Singapore English, is an English-based
creole language originating from the Southeast Asian country Singapore. The
language contains influences from Sinitic languages such as Chinese dialects,
Malay, Tamil and so forth. A fundamental task in understanding Singlish is to
first understand the pragmatic functions of its discourse particles, upon which
Singlish relies heavily to convey meaning. This work offers a preliminary
effort to disentangle the Singlish discourse particles (lah, meh and hor) with
task-driven representation learning. After disentanglement, we cluster these
discourse particles to differentiate their pragmatic functions, and perform
Singlish-to-English machine translation. Our work provides a computational
method for understanding Singlish discourse particles, and opens avenues towards
a deeper comprehension of the language and its usage.
[LINK]
http://arxiv.org/abs/2409.20366v2
[DATE]
2024-10-03 10:18:43+08:00
[CATEGORIES]
cs.CL
Show and Guide: Instructional-Plan Grounded Vision and Language Model
[AUTHORS]
Diogo Glória-Silva, David Semedo, João Magalhães
[ABSTRACT]
Guiding users through complex procedural plans is an inherently multimodal
task in which having visually illustrated plan steps is crucial to delivering
effective plan guidance. However, existing works on plan-following language
models (LMs) often are not capable of multimodal input and output. In this
work, we present MM-PlanLLM, the first multimodal LLM designed to assist users
in executing instructional tasks by leveraging both textual plans and visual
information. Specifically, we bring cross-modality through two key tasks:
Conversational Video Moment Retrieval, where the model retrieves relevant
step-video segments based on user queries, and Visually-Informed Step
Generation, where the model generates the next step in a plan, conditioned on
an image of the user’s current progress. MM-PlanLLM is trained using a novel
multitask-multistage approach, designed to gradually expose the model to
multimodal instructional-plans semantic layers, achieving strong performance on
both multimodal and textual dialogue in a plan-grounded setting. Furthermore,
we show that the model delivers cross-modal temporal and plan-structure
representations aligned between textual plan steps and instructional video
moments.
[COMMENTS]
Accepted at EMNLP 2024 Main Track
[LINK]
http://arxiv.org/abs/2409.19074v2
[DATE]
2024-10-03 04:47:17+08:00
[CATEGORIES]
cs.CL
PROXI: Challenging the GNNs for Link Prediction
[AUTHORS]
Astrit Tola, Jack Myrick, Baris Coskunuzer
[ABSTRACT]
Over the past decade, Graph Neural Networks (GNNs) have transformed graph
representation learning. In the widely adopted message-passing GNN framework,
nodes refine their representations by aggregating information from neighboring
nodes iteratively. While GNNs excel in various domains, recent theoretical
studies have raised concerns about their capabilities. GNNs aim to address
various graph-related tasks by utilizing such node representations; however,
this one-size-fits-all approach proves suboptimal for diverse tasks.
Motivated by these observations, we conduct empirical tests to compare the
performance of current GNN models with more conventional and direct methods in
link prediction tasks. Introducing our model, PROXI, which leverages proximity
information of node pairs in both graph and attribute spaces, we find that
standard machine learning (ML) models perform competitively, even outperforming
cutting-edge GNN models when applied to these proximity metrics derived from
node neighborhoods and attributes. This holds true across both homophilic and
heterophilic networks, as well as small and large benchmark datasets, including
those from the Open Graph Benchmark (OGB). Moreover, we show that augmenting
traditional GNNs with PROXI significantly boosts their link prediction
performance. Our empirical findings corroborate the previously mentioned
theoretical observations and imply that there exists ample room for enhancement
in current GNN models to reach their potential.
[LINK]
http://arxiv.org/abs/2410.01802v1
[DATE]
2024-10-03 01:57:38+08:00
[CATEGORIES]
cs.LG
Efficient $1$-bit tensor approximations
[AUTHORS]
Alex W. Neal Riasanovsky, Sarah El Kazdadi
[ABSTRACT]
We present a spatially efficient decomposition of matrices and
arbitrary-order tensors as linear combinations of tensor products of $\{-1,
1\}$-valued vectors. For any matrix $A \in \mathbb{R}^{m \times n}$, $$A - R_w
= S_w C_w T_w^\top = \sum_{j=1}^w c_j \cdot \mathbf{s}_j \mathbf{t}_j^\top$$ is
a {\it $w$-width signed cut decomposition of $A$}. Here $C_w =
\operatorname{diag}(\mathbf{c}_w)$ for some $\mathbf{c}_w \in \mathbb{R}^w,$ and $S_w, T_w$,
and the vectors $\mathbf{s}_j, \mathbf{t}_j$ are $\{-1, 1\}$-valued. To store
$(S_w, T_w, C_w)$, we may pack $w \cdot (m + n)$ bits, and require only $w$
floating point numbers. As a function of $w$, $\|R_w\|_F$ exhibits exponential
decay when applied to \textit{f32} matrices with i.i.d. $\mathcal N (0, 1)$ entries.
Choosing $w$ so that $(S_w, T_w, C_w)$ has the same memory footprint as a
\textit{f16} or \textit{bf16} matrix, the relative error is comparable. Our
algorithm yields efficient signed cut decompositions in $20$ lines of
pseudocode. It reflects a simple modification from a celebrated 1999 paper [1]
of Frieze and Kannan. As a first application, we approximate the weight
matrices in the open \textit{Mistral-7B-v0.1} Large Language Model to a $50\%$
spatial compression. Remarkably, all $226$ remainder matrices have a relative
error $<6\%$ and the expanded model closely matches \textit{Mistral-7B-v0.1} on
the {\it huggingface} leaderboard [2]. Benchmark performance degrades slowly as
we reduce the spatial compression from $50\%$ to $25\%$. We optimize our open
source \textit{rust} implementation [3] with \textit{simd} instructions on
\textit{avx2} and \textit{avx512} architectures. We also extend our algorithm
from matrices to tensors of arbitrary order and use it to compress a picture of
the first author’s cat Angus.
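
The form $A - R_w = \sum_{j=1}^w c_j \mathbf{s}_j \mathbf{t}_j^\top$ can be illustrated with a simple greedy procedure. This is a sketch under stated assumptions, not the authors' algorithm: it picks sign vectors by an alternating heuristic and then uses the least-squares coefficient $c = \mathbf{s}^\top R\, \mathbf{t} / (mn)$, which guarantees the Frobenius norm of the remainder never increases. All function names are hypothetical.

```python
import random


def sign(x):
    return 1 if x >= 0 else -1


def frobenius(R):
    return sum(v * v for row in R for v in row) ** 0.5


def random_matrix(m, n, seed=0):
    """i.i.d. N(0, 1) test matrix as a list of rows."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]


def best_signs(R, sweeps=10):
    """Alternating heuristic for {-1, 1} vectors s, t with large s^T R t."""
    m, n = len(R), len(R[0])
    s, t = [1] * m, [1] * n
    for _ in range(sweeps):
        s = [sign(sum(R[i][j] * t[j] for j in range(n))) for i in range(m)]
        t = [sign(sum(R[i][j] * s[i] for i in range(m))) for j in range(n)]
    return s, t


def signed_cut_decomposition(A, width):
    """Greedy width-w decomposition; returns ([(c_j, s_j, t_j)], remainder R_w)."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    terms = []
    for _ in range(width):
        s, t = best_signs(R)
        # Least-squares coefficient for the fixed rank-one sign pattern s t^T.
        c = sum(s[i] * R[i][j] * t[j] for i in range(m) for j in range(n)) / (m * n)
        terms.append((c, s, t))
        for i in range(m):
            for j in range(n):
                R[i][j] -= c * s[i] * t[j]
    return terms, R
```

Each term costs $m + n$ sign bits plus one float, matching the storage count given in the abstract; comparing `frobenius(R)` across widths shows how the remainder shrinks.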
[COMMENTS]
16 pages, one cat picture reused a lot
[LINK]
http://arxiv.org/abs/2410.01799v1
[DATE]
2024-10-03 01:56:32+08:00
[CATEGORIES]
cs.LG
Bellman Diffusion: Generative Modeling as Learning a Linear Operator in the Distribution Space
[AUTHORS]
Yangming Li, Chieh-Hsin Lai, Carola-Bibiane Schönlieb, Yuki Mitsufuji, Stefano Ermon
[ABSTRACT]
Deep Generative Models (DGMs), including Energy-Based Models (EBMs) and
Score-based Generative Models (SGMs), have advanced high-fidelity data
generation and complex continuous distribution approximation. However, their
application in Markov Decision Processes (MDPs), particularly in distributional
Reinforcement Learning (RL), remains underexplored, with conventional
histogram-based methods dominating the field. This paper rigorously highlights
that this application gap is caused by the nonlinearity of modern DGMs, which
conflicts with the linearity required by the Bellman equation in MDPs. For
instance, EBMs involve nonlinear operations such as exponentiating energy
functions and normalizing constants. To address this, we introduce Bellman
Diffusion, a novel DGM framework that maintains linearity in MDPs through
gradient and scalar field modeling. With divergence-based training techniques
to optimize neural network proxies and a new type of stochastic differential
equation (SDE) for sampling, Bellman Diffusion is guaranteed to converge to the
target distribution. Our empirical results show that Bellman Diffusion achieves
accurate field estimations and is a capable image generator, converging 1.5x
faster than the traditional histogram-based baseline in distributional RL
tasks. This work enables the effective integration of DGMs into MDP
applications, unlocking new avenues for advanced decision-making frameworks.
[COMMENTS]
Paper under review
[LINK]
http://arxiv.org/abs/2410.01796v1
[DATE]
2024-10-03 01:53:23+08:00
[CATEGORIES]
cs.LG
Thermodynamic Bayesian Inference
[AUTHORS]
Maxwell Aifer, Samuel Duffield, Kaelan Donatella, Denis Melanson, Phoebe Klett, Zach Belateche, Gavin Crooks, Antonio J. Martinez, Patrick J. Coles
[ABSTRACT]
A fully Bayesian treatment of complicated predictive models (such as deep
neural networks) would enable rigorous uncertainty quantification and the
automation of higher-level tasks including model selection. However, the
intractability of sampling Bayesian posteriors over many parameters inhibits
the use of Bayesian methods where they are most needed. Thermodynamic computing
has emerged as a paradigm for accelerating operations used in machine learning,
such as matrix inversion, and is based on the mapping of Langevin equations to
the dynamics of noisy physical systems. Hence, it is natural to consider the
implementation of Langevin sampling algorithms on thermodynamic devices. In
this work we propose electronic analog devices that sample from Bayesian
posteriors by realizing Langevin dynamics physically. Circuit designs are given
for sampling the posterior of a Gaussian-Gaussian model and for Bayesian
logistic regression, and are validated by simulations. It is shown, under
reasonable assumptions, that the Bayesian posteriors for these models can be
sampled in time scaling with $\ln(d)$, where $d$ is dimension. For the
Gaussian-Gaussian model, the energy cost is shown to scale with $ d \ln(d)$.
These results highlight the potential for fast, energy-efficient Bayesian
inference using thermodynamic computing.
[COMMENTS]
20 pages, 8 figures
[LINK]
http://arxiv.org/abs/2410.01793v1
[DATE]
2024-10-03 01:51:58+08:00
[CATEGORIES]
cs.LG
Learning To Solve Differential Equation Constrained Optimization Problems
[AUTHORS]
Vincenzo Di Vito, Mostafa Mohammadian, Kyri Baker, Ferdinando Fioretto
[ABSTRACT]
Differential equations (DE) constrained optimization plays a critical role in
numerous scientific and engineering fields, including energy systems, aerospace
engineering, ecology, and finance, where optimal configurations or control
strategies must be determined for systems governed by ordinary or stochastic
differential equations. Despite its significance, the computational challenges
associated with these problems have limited their practical use. To address
these limitations, this paper introduces a learning-based approach to
DE-constrained optimization that combines techniques from proxy optimization
and neural differential equations. The proposed approach uses a dual-network
architecture, with one approximating the control strategies, focusing on
steady-state constraints, and another solving the associated DEs. This
combination enables the approximation of optimal strategies while accounting
for dynamic constraints in near real-time. Experiments across problems in
energy optimization and finance modeling show that this method provides full
compliance with dynamic constraints and it produces results up to 25 times more
precise than other methods which do not explicitly model the system’s dynamic
equations.
[LINK]
http://arxiv.org/abs/2410.01786v1
[DATE]
2024-10-03 01:42:16+08:00
[CATEGORIES]
cs.LG
TopER: Topological Embeddings in Graph Representation Learning
[AUTHORS]
Astrit Tola, Funmilola Mary Taiwo, Cuneyt Gurcan Akcora, Baris Coskunuzer
[ABSTRACT]
Graph embeddings play a critical role in graph representation learning,
allowing machine learning models to explore and interpret graph-structured
data. However, existing methods often rely on opaque, high-dimensional
embeddings, limiting interpretability and practical visualization.
In this work, we introduce Topological Evolution Rate (TopER), a novel,
low-dimensional embedding approach grounded in topological data analysis. TopER
simplifies a key topological approach, Persistent Homology, by calculating the
evolution rate of graph substructures, resulting in intuitive and interpretable
visualizations of graph data. This approach not only enhances the exploration
of graph datasets but also delivers competitive performance in graph clustering
and classification tasks. Our TopER-based models achieve or surpass
state-of-the-art results across molecular, biological, and social network
datasets in tasks such as classification, clustering, and visualization.
[COMMENTS]
17 pages, 7 figures
[LINK]
http://arxiv.org/abs/2410.01778v2
[DATE]
2024-10-03 09:58:26+08:00
[CATEGORIES]
cs.LG
Dynamical-generative downscaling of climate model ensembles
[AUTHORS]
Ignacio Lopez-Gomez, Zhong Yi Wan, Leonardo Zepeda-Núñez, Tapio Schneider, John Anderson, Fei Sha
[ABSTRACT]
Regional high-resolution climate projections are crucial for many
applications, such as agriculture, hydrology, and natural hazard risk
assessment. Dynamical downscaling, the state-of-the-art method to produce
localized future climate information, involves running a regional climate model
(RCM) driven by an Earth System Model (ESM), but it is too computationally
expensive to apply to large climate projection ensembles. We propose a novel
approach combining dynamical downscaling with generative artificial
intelligence to reduce the cost and improve the uncertainty estimates of
downscaled climate projections. In our framework, an RCM dynamically downscales
ESM output to an intermediate resolution, followed by a generative diffusion
model that further refines the resolution to the target scale. This approach
leverages the generalizability of physics-based models and the sampling
efficiency of diffusion models, enabling the downscaling of large multi-model
ensembles. We evaluate our method against dynamically-downscaled climate
projections from the CMIP6 ensemble. Our results demonstrate its ability to
provide more accurate uncertainty bounds on future regional climate than
alternatives such as dynamical downscaling of smaller ensembles, or traditional
empirical statistical downscaling methods. We also show that
dynamical-generative downscaling results in significantly lower errors than
bias correction and spatial disaggregation (BCSD), and captures more accurately
the spectra and multivariate correlations of meteorological fields. These
characteristics make the dynamical-generative framework a flexible, accurate,
and efficient way to downscale large ensembles of climate projections,
currently out of reach for pure dynamical downscaling.
[LINK]
http://arxiv.org/abs/2410.01776v1
[DATE]
2024-10-03 01:31:01+08:00
[CATEGORIES]
cs.LG
Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context
[AUTHORS]
Spencer Frei, Gal Vardi
[ABSTRACT]
Transformers have the capacity to act as supervised learning algorithms: by
properly encoding a set of labeled training (“in-context”) examples and an
unlabeled test example into an input sequence of vectors of the same dimension,
the forward pass of the transformer can produce predictions for that unlabeled
test example. A line of recent work has shown that when linear transformers are
pre-trained on random instances for linear regression tasks, these trained
transformers make predictions using an algorithm similar to that of ordinary
least squares. In this work, we investigate the behavior of linear transformers
trained on random linear classification tasks. Via an analysis of the implicit
regularization of gradient descent, we characterize how many pre-training tasks
and in-context examples are needed for the trained transformer to generalize
well at test-time. We further show that in some settings, these trained
transformers can exhibit “benign overfitting in-context”: when in-context
examples are corrupted by label flipping noise, the transformer memorizes all
of its in-context examples (including those with noisy labels) yet still
generalizes near-optimally for clean test examples.
[COMMENTS]
34 pages
[LINK]
http://arxiv.org/abs/2410.01774v1
[DATE]
2024-10-03 01:30:21+08:00
[CATEGORIES]
cs.LG
Temporal Test-Time Adaptation with State-Space Models
[AUTHORS]
Mona Schirmer, Dan Zhang, Eric Nalisnick
[ABSTRACT]
Distribution shifts between training and test data are inevitable over the
lifecycle of a deployed model, leading to performance decay. Adapting a model
on test samples can help mitigate this drop in performance. However, most
test-time adaptation methods have focused on synthetic corruption shifts,
leaving a variety of distribution shifts underexplored. In this paper, we focus
on distribution shifts that evolve gradually over time, which are common in the
wild but challenging for existing methods, as we show. To address this, we
propose STAD, a probabilistic state-space model that adapts a deployed model to
temporal distribution shifts by learning the time-varying dynamics in the last
set of hidden features. Without requiring labels, our model infers
time-evolving class prototypes that act as a dynamic classification head.
Through experiments on real-world temporal distribution shifts, we show that
our method excels in handling small batch sizes and label shift.
[LINK]
http://arxiv.org/abs/2407.12492v2
[DATE]
2024-10-03 01:29:54+08:00
[CATEGORIES]
cs.LG
Bayesian Binary Search
[AUTHORS]
Vikash Singh, Matthew Khanzadeh, Vincent Davis, Harrison Rush, Emanuele Rossi, Jesse Shrader, Pietro Lio
[ABSTRACT]
We present Bayesian Binary Search (BBS), a novel probabilistic variant of the
classical binary search/bisection algorithm. BBS leverages machine
learning/statistical techniques to estimate the probability density of the
search space and modifies the bisection step to split based on probability
density rather than the traditional midpoint, allowing for the learned
distribution of the search space to guide the search algorithm. Search space
density estimation can flexibly be performed using supervised probabilistic
machine learning techniques (e.g., Gaussian process regression, Bayesian neural
networks, quantile regression) or unsupervised learning algorithms (e.g.,
Gaussian mixture models, kernel density estimation (KDE), maximum likelihood
estimation (MLE)). We demonstrate significant efficiency gains of using BBS on
both simulated data across a variety of distributions and in a real-world
binary search use case of probing channel balances in the Bitcoin Lightning
Network, for which we have deployed the BBS algorithm in a production setting.
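
The modified bisection step can be sketched as follows: instead of probing the midpoint of the remaining interval, probe the point that splits the remaining probability mass in half. This is a minimal illustration over a discrete index range, assuming the density estimate is already available as non-negative per-index weights; it is not the authors' implementation, and the function names and oracle interface are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def classic_search(n, target_le):
    """Midpoint bisection over indices 0..n-1; returns (index, probes).

    target_le(i) is an oracle that is True when the target index is <= i.
    """
    lo, hi, probes = 0, n - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if target_le(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo, probes


def density_search(weights, target_le):
    """Bisection that splits the remaining probability mass in half."""
    cdf = [0.0] + list(accumulate(weights))  # cdf[k] = mass of indices < k
    lo, hi, probes = 0, len(weights) - 1, 0
    while lo < hi:
        half = cdf[lo] + (cdf[hi + 1] - cdf[lo]) / 2.0
        # Smallest index whose inclusive cumulative mass reaches half.
        mid = bisect_left(cdf, half, lo + 1, hi + 2) - 1
        mid = min(max(mid, lo), hi - 1)  # keep the interval shrinking
        probes += 1
        if target_le(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo, probes
```

With uniform weights this behaves like ordinary bisection; when the weights concentrate near the true target, fewer probes are needed. A poor density estimate costs extra probes but does not affect correctness, since the search interval always contains the target.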
[LINK]
http://arxiv.org/abs/2410.01771v1
[DATE]
2024-10-03 01:28:22+08:00
[CATEGORIES]
cs.LG
Explainable Earth Surface Forecasting under Extreme Events
[AUTHORS]
Oscar J. Pellicer-Valero, Miguel-Ángel Fernández-Torres, Chaonan Ji, Miguel D. Mahecha, Gustau Camps-Valls
[ABSTRACT]
With climate change-related extreme events on the rise, high dimensional
Earth observation data presents a unique opportunity for forecasting and
understanding impacts on ecosystems. This is, however, impeded by the
complexity of processing, visualizing, modeling, and explaining this data. To
showcase how this challenge can be met, here we train a convolutional long
short-term memory-based architecture on the novel DeepExtremeCubes dataset.
DeepExtremeCubes includes around 40,000 long-term Sentinel-2 minicubes (January
2016-October 2022) worldwide, along with labeled extreme events, meteorological
data, vegetation land cover, and topography maps, sampled from locations
affected by extreme climate events and surrounding areas. When predicting
future reflectances and vegetation impacts through kernel normalized difference
vegetation index, the model achieved an R$^2$ score of 0.9055 in the test set.
Explainable artificial intelligence was used to analyze the model’s predictions
during the October 2020 Central South America compound heatwave and drought
event. We chose the same area exactly one year before the event as
counterfactual, finding that the average temperature and surface pressure are
generally the best predictors under normal conditions. In contrast, minimum
anomalies of evaporation and surface latent heat flux take the lead during the
event. A change of regime is also observed in the attributions before the
event, which might help assess how long the event was brewing before happening.
The code to replicate all experiments and figures in this paper is publicly
available at https://github.com/DeepExtremes/txyXAI
[LINK]
http://arxiv.org/abs/2410.01770v1
[DATE]
2024-10-03 01:27:13+08:00
[CATEGORIES]
cs.LG
Decision-Focused Uncertainty Quantification
[AUTHORS]
Santiago Cortes-Gomez, Carlos Patiño, Yewon Byun, Steven Wu, Eric Horvitz, Bryan Wilder
[ABSTRACT]
There is increasing interest in “decision-focused” machine learning methods
which train models to account for how their predictions are used in downstream
optimization problems. Doing so can often improve performance on subsequent
decision problems. However, current methods for uncertainty quantification do
not incorporate any information at all about downstream decisions. We develop a
framework based on conformal prediction to produce prediction sets that account
for a downstream decision loss function, making them more appropriate to inform
high-stakes decision-making. Our approach harnesses the strengths of conformal
methods (modularity, model-agnosticism, and statistical coverage guarantees)
while incorporating downstream decisions and user-specified utility
functions. We prove that our methods retain standard coverage guarantees.
Empirical evaluation across a range of datasets and utility metrics
demonstrates that our methods achieve significantly lower decision loss
compared to standard conformal methods. Additionally, we present a real-world
use case in healthcare diagnosis, where our method effectively incorporates the
hierarchical structure of dermatological diseases. It successfully generates
sets with coherent diagnostic meaning, aiding the triage process during
dermatology diagnosis and illustrating how our method can ground high-stakes
decision-making on external domain knowledge.
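
For context, the standard split conformal procedure that serves as the baseline here can be sketched as follows. This is the textbook construction, not the decision-focused method of the paper; the function names and the score `1 - p(true class)` are illustrative assumptions.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold targeting 1 - alpha marginal coverage.

    cal_probs  -- list of per-class probability vectors on held-out data
    cal_labels -- list of true class indices for the same examples
    """
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the calibration quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points: fall back to including every class.
        return float("inf")
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """Return all classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]
```

A decision-focused variant would change how the sets are scored or shaped so that they reflect the downstream loss; the sketch above does not attempt that.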
[LINK]
http://arxiv.org/abs/2410.01767v1
[DATE]
2024-10-03 01:22:09+08:00
[CATEGORIES]
cs.LG
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters
[AUTHORS]
Mouxiang Chen, Lefei Shen, Zhuo Li, Xiaoyun Joy Wang, Jianling Sun, Chenghao Liu
[ABSTRACT]
Foundation models have emerged as a promising approach in time series
forecasting (TSF). Existing approaches either repurpose large language models
(LLMs) or build large-scale time series datasets to develop TSF foundation
models for universal forecasting. However, these methods face challenges due to
the severe cross-domain gap or in-domain heterogeneity. This paper explores a
new road to building a TSF foundation model from rich, high-quality natural
images. Our key insight is that a visual masked autoencoder, pre-trained on the
ImageNet dataset, can naturally be a numeric series forecaster. By
reformulating TSF as an image reconstruction task, we bridge the gap between
image pre-training and TSF downstream tasks. Surprisingly, without further
adaptation in the time-series domain, the proposed VisionTS could achieve
superior zero-shot forecasting performance compared to existing TSF foundation
models. With fine-tuning for one epoch, VisionTS could further improve the
forecasting and achieve state-of-the-art performance in most cases. Extensive
experiments reveal intrinsic similarities between images and real-world time
series, suggesting visual models may offer a “free lunch” for TSF and
highlighting the potential for future cross-modality research. Our code is
publicly available at https://github.com/Keytoyze/VisionTS.
[COMMENTS]
v2: add more experiments
[LINK]
http://arxiv.org/abs/2408.17253v2
[DATE]
2024-10-03 01:21:47+08:00
[CATEGORIES]
cs.LG
Concept-skill Transferability-based Data Selection for Large Vision-Language Models
[AUTHORS]
Jaewoo Lee, Boyang Li, Sung Ju Hwang
[ABSTRACT]
Instruction tuning, or supervised finetuning on extensive task-specific data,
is necessary for Large Vision-Language Models (LVLMs) to generalize well across
a broad range of vision-language (VL) tasks. However, training on large VL
datasets can become prohibitively expensive. In this work, we introduce
COINCIDE, an effective and scalable data selection technique that uses a small
model as a reference model to select visual instruction tuning data for
efficient finetuning of a target LVLM, focusing on diversity and
transferability. Specifically, we cluster the training data using internal
activations from a small model, which identifies VL concept-skill compositions
needed by a target LVLM. We then sample data from these diverse clusters by
considering their density and transferability, or the ability to transfer well
to other concept-skill compositions. This approach ensures the diversity of
these compositions, which is vital for LVLM generalization. Extensive
experiments demonstrate that COINCIDE achieves superior performance and data
selection efficiency against 8 strong baselines on two distinct datasets:
LLaVA-1.5 and Vision-Flan. Using only 20% of the LLaVA-1.5 dataset, COINCIDE
achieves performance comparable to the LVLM finetuned on the whole dataset,
with 70% reduction of the wall-clock running time. On the Vision-Flan dataset,
our method achieves superior results with only 16.7% of the training data.
[COMMENTS]
EMNLP 2024
[LINK]
http://arxiv.org/abs/2406.10995v2
[DATE]
2024-10-03 01:20:28+08:00
[CATEGORIES]
cs.LG
Integrating Protein Sequence and Expression Level to Analysis Molecular Characterization of Breast Cancer Subtypes
[AUTHORS]
Hossein Sholehrasa
[ABSTRACT]
Breast cancer’s complexity and variability pose significant challenges in
understanding its progression and guiding effective treatment. This study aims
to integrate protein sequence data with expression levels to improve the
molecular characterization of breast cancer subtypes and predict clinical
outcomes. Using ProtGPT2, a language model designed for protein sequences, we
generated embeddings that capture the functional and structural properties of
protein sequences. These embeddings were integrated with protein expression
levels to form enriched biological representations, which were analyzed using
machine learning methods like ensemble K-means for clustering and XGBoost for
classification. Our approach enabled successful clustering of patients into
biologically distinct groups and accurately predicted clinical outcomes such as
survival and biomarker status, achieving high performance metrics, notably an
F1 score of 0.88 for survival and 0.87 for biomarker status prediction.
Analysis of feature importance highlighted key proteins like KMT2C, GCN1, and
CLASP2, linked to hormone receptor and Human Epidermal Growth Factor Receptor 2
(HER2) expression, which play a role in tumor progression and patient outcomes,
respectively. Furthermore, protein-protein interaction networks and correlation
analyses revealed the interdependence of proteins that may influence breast
cancer subtype behaviors. These findings suggest that integrating protein
sequence and expression data provides valuable insights into tumor biology and
has significant potential to enhance personalized treatment strategies in
breast cancer care.
[LINK]
http://arxiv.org/abs/2410.01755v1
[DATE]
2024-10-03 01:05:48+08:00
[CATEGORIES]
cs.LG
TorchSISSO: A PyTorch-Based Implementation of the Sure Independence Screening and Sparsifying Operator for Efficient and Interpretable Model Discovery
[AUTHORS]
Madhav Muthyala, Farshud Sorourifar, Joel A. Paulson
[ABSTRACT]
Symbolic regression (SR) is a powerful machine learning approach that
searches for both the structure and parameters of algebraic models, offering
interpretable and compact representations of complex data. Unlike traditional
regression methods, SR explores progressively complex feature spaces, which can
uncover simple models that generalize well, even from small datasets. Among SR
algorithms, the Sure Independence Screening and Sparsifying Operator (SISSO)
has proven particularly effective in the natural sciences, helping to
rediscover fundamental physical laws as well as discover new interpretable
equations for materials property modeling. However, its widespread adoption has
been limited by performance inefficiencies and the challenges posed by its
FORTRAN-based implementation, especially in modern computing environments. In
this work, we introduce TorchSISSO, a native Python implementation built in the
PyTorch framework. TorchSISSO leverages GPU acceleration, easy integration, and
extensibility, offering a significant speed-up and improved accuracy over the
original. We demonstrate that TorchSISSO matches or exceeds the performance of
the original SISSO across a range of tasks, while dramatically reducing
computational time and improving accessibility for broader scientific
applications.
[LINK]
http://arxiv.org/abs/2410.01752v1
[DATE]
2024-10-03 01:02:17+08:00
[CATEGORIES]
cs.LG
Towards Privacy-Preserving Relational Data Synthesis via Probabilistic Relational Models
[AUTHORS]
Malte Luttermann, Ralf Möller, Mattis Hartwig
[ABSTRACT]
Probabilistic relational models provide a well-established formalism to
combine first-order logic and probabilistic models, thereby allowing
relationships between objects in a relational domain to be represented. At the same
time, the field of artificial intelligence requires increasingly large amounts
of relational training data for various machine learning tasks. Collecting
real-world data, however, is often challenging due to privacy concerns, data
protection regulations, high costs, and so on. To mitigate these challenges,
the generation of synthetic data is a promising approach. In this paper, we
solve the problem of generating synthetic relational data via probabilistic
relational models. In particular, we propose a fully-fledged pipeline to go
from relational database to probabilistic relational model, which can then be
used to sample new synthetic relational data points from its underlying
probability distribution. As part of our proposed pipeline, we introduce a
learning algorithm to construct a probabilistic relational model from a given
relational database.
[COMMENTS]
Accepted to the Proceedings of the 47th German Conference on
Artificial Intelligence (KI 2024)
[LINK]
http://arxiv.org/abs/2409.04194v2
[DATE]
2024-10-03 01:01:58+08:00
[CATEGORIES]
cs.LG
Not All LLM Reasoners Are Created Equal
[AUTHORS]
Arian Hosseini, Alessandro Sordoni, Daniel Toyama, Aaron Courville, Rishabh Agarwal
[ABSTRACT]
We study the depth of grade-school math (GSM) problem-solving capabilities of
LLMs. To this end, we evaluate their performance on pairs of existing math word
problems together so that the answer to the second problem depends on correctly
answering the first problem. Our findings reveal a significant reasoning gap in
most LLMs, that is, a performance difference between solving the compositional
pairs and solving each question independently. This gap is more pronounced in
smaller, more cost-efficient, and math-specialized models. Moreover,
instruction-tuning recipes and code generation have varying effects across LLM
sizes, while finetuning on GSM can lead to task overfitting. Our analysis
indicates that large reasoning gaps are not because of test-set leakage, but
due to distraction from additional context and poor second-hop reasoning.
Overall, LLMs exhibit systematic differences in their reasoning abilities,
despite what their performance on standard benchmarks indicates.
[LINK]
http://arxiv.org/abs/2410.01748v1
[DATE]
2024-10-03 01:01:10+08:00
[CATEGORIES]
cs.LG
Transformers are Minimax Optimal Nonparametric In-Context Learners
[AUTHORS]
Juno Kim, Tai Nakamaki, Taiji Suzuki
[ABSTRACT]
In-context learning (ICL) of large language models has proven to be a
surprisingly effective method of learning a new task from only a few
demonstrative examples. In this paper, we study the efficacy of ICL from the
viewpoint of statistical learning theory. We develop approximation and
generalization error bounds for a transformer composed of a deep neural network
and one linear attention layer, pretrained on nonparametric regression tasks
sampled from general function spaces including the Besov space and piecewise
$\gamma$-smooth class. We show that sufficiently trained transformers can
achieve – and even improve upon – the minimax optimal estimation risk in
context by encoding the most relevant basis representations during pretraining.
Our analysis extends to high-dimensional or sequential data and distinguishes
the pretraining and in-context generalization gaps. Furthermore,
we establish information-theoretic lower bounds for meta-learners w.r.t. both
the number of tasks and in-context examples. These findings shed light on the
roles of task diversity and representation learning for ICL.
[COMMENTS]
NeurIPS 2024; 40 pages, 3 figures
[LINK]
http://arxiv.org/abs/2408.12186v2
[DATE]
2024-10-03 00:58:37+08:00
[CATEGORIES]
cs.LG
HOPE for a Robust Parameterization of Long-memory State Space Models
[AUTHORS]
Annan Yu, Michael W. Mahoney, N. Benjamin Erichson
[ABSTRACT]
State-space models (SSMs) that utilize linear, time-invariant (LTI) systems
are known for their effectiveness in learning long sequences. To achieve
state-of-the-art performance, an SSM often needs a specifically designed
initialization, and the training of state matrices is on a logarithmic scale
with a very small learning rate. To understand these choices from a unified
perspective, we view SSMs through the lens of Hankel operator theory. Building
upon it, we develop a new parameterization scheme, called HOPE, for LTI systems
that utilizes Markov parameters within Hankel operators. Our approach helps
improve the initialization and training stability, leading to a more robust
parameterization. We efficiently implement these innovations by nonuniformly
sampling the transfer functions of LTI systems, and they require fewer
parameters compared to canonical SSMs. When benchmarked against
HiPPO-initialized models such as S4 and S4D, an SSM parameterized by Hankel
operators demonstrates improved performance on Long-Range Arena (LRA) tasks.
Moreover, our new parameterization endows the SSM with non-decaying memory
within a fixed time window, which is empirically corroborated by a sequential
CIFAR-10 task with padded noise.
[LINK]
http://arxiv.org/abs/2405.13975v2
[DATE]
2024-10-03 00:56:09+08:00
[CATEGORIES]
cs.LG
PreND: Enhancing Intrinsic Motivation in Reinforcement Learning through Pre-trained Network Distillation
[AUTHORS]
Mohammadamin Davoodabadi, Negin Hashemi Dijujin, Mahdieh Soleymani Baghshah
[ABSTRACT]
Intrinsic motivation, inspired by the psychology of developmental learning in
infants, stimulates exploration in agents without relying solely on sparse
external rewards. Existing methods in reinforcement learning like Random
Network Distillation (RND) face significant limitations, including (1) relying
on raw visual inputs, leading to a lack of meaningful representations, (2) the
inability to build a robust latent space, (3) poor target network
initialization and (4) rapid degradation of intrinsic rewards. In this paper,
we introduce Pre-trained Network Distillation (PreND), a novel approach to
enhance intrinsic motivation in reinforcement learning (RL) by improving upon
the widely used prediction-based method, RND. PreND addresses these challenges
by incorporating pre-trained representation models into both the target and
predictor networks, resulting in more meaningful and stable intrinsic rewards,
while enhancing the representation learned by the model. We also tried simple
but effective variants of the predictor network optimization by controlling the
learning rate. Through experiments on the Atari domain, we demonstrate that
PreND significantly outperforms RND, offering a more robust intrinsic
motivation signal that leads to better exploration, improving overall
performance and sample efficiency. This research highlights the importance of
target and predictor network representations in prediction-based intrinsic
motivation, setting a new direction for improving RL agents’ learning
efficiency in sparse reward environments.
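
The RND baseline that PreND builds on can be sketched in a few lines: a fixed, randomly initialized target network, a trainable predictor, and an intrinsic reward equal to the predictor's squared error. The sketch below uses single linear layers and plain gradient steps purely for illustration; all names are hypothetical, and PreND's pre-trained representation models are not modeled.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix (out_dim rows, in_dim columns)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, obs):
    """Matrix-vector product: one output per weight row."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def intrinsic_reward(target_w, predictor_w, obs):
    """Squared prediction error between predictor and fixed target outputs."""
    target_out = apply_linear(target_w, obs)
    predictor_out = apply_linear(predictor_w, obs)
    return sum((t - p) ** 2 for t, p in zip(target_out, predictor_out))


def update_predictor(target_w, predictor_w, obs, lr=0.1):
    """One gradient step on the squared error; the target stays frozen."""
    target_out = apply_linear(target_w, obs)
    predictor_out = apply_linear(predictor_w, obs)
    for row, t, p in zip(predictor_w, target_out, predictor_out):
        err = p - t
        for j, x in enumerate(obs):
            row[j] -= lr * 2.0 * err * x
```

Repeated updates on the same observation shrink its reward toward zero, which is the decay of the novelty signal that the abstract lists among RND's limitations.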
[COMMENTS]
8 pages, 4 figures
[LINK]
http://arxiv.org/abs/2410.01745v1
[DATE]
2024-10-03 00:56:03+08:00
[CATEGORIES]
cs.LG
Mimicking Human Intuition: Cognitive Belief-Driven Q-Learning
[AUTHORS]
Xingrui Gu, Guanren Qiao, Chuyi Jiang, Tianqing Xia, Hangyu Mao
[ABSTRACT]
Reinforcement learning encounters challenges in various environments related
to robustness and explainability. Traditional Q-learning algorithms cannot
effectively make decisions and utilize the historical learning experience. To
overcome these limitations, we propose Cognitive Belief-Driven Q-Learning
(CBDQ), which integrates subjective belief modeling into the Q-learning
framework, enhancing decision-making accuracy by endowing agents with
human-like learning and reasoning capabilities. Drawing inspiration from
cognitive science, our method maintains a subjective belief distribution over
the expectation of actions, leveraging a cluster-based subjective belief model
that enables agents to reason about the potential probability associated with
each decision. CBDQ effectively mitigates overestimation and optimizes
decision-making policies by integrating historical experiences with current
contextual information, mimicking the dynamics of human decision-making. We
evaluate the proposed method on discrete control benchmark tasks in various
complicated environments. The results demonstrate that CBDQ exhibits stronger
adaptability, robustness, and human-like characteristics in handling these
environments, outperforming other baselines. We hope this work will give
researchers a fresh perspective on understanding and explaining Q-learning.
[COMMENTS]
Under review by ICLR 25
[LINK]
http://arxiv.org/abs/2410.01739v1
[DATE]
2024-10-03 00:50:29+08:00
[CATEGORIES]
cs.LG
Latent Diffusion Models for Controllable RNA Sequence Generation
[AUTHORS]
Kaixuan Huang, Yukang Yang, Kaidi Fu, Yanyi Chu, Le Cong, Mengdi Wang
[ABSTRACT]
This work presents RNAdiffusion, a latent diffusion model for generating and
optimizing discrete RNA sequences of variable lengths. RNA is a key
intermediary between DNA and protein, exhibiting high sequence diversity and
complex three-dimensional structures to support a wide range of functions. We
utilize pretrained BERT-type models to encode raw RNA sequences into
token-level, biologically meaningful representations. A Query Transformer is
employed to compress such representations into a set of fixed-length latent
vectors, with an autoregressive decoder trained to reconstruct RNA sequences
from these latent variables. We then develop a continuous diffusion model
within this latent space. To enable optimization, we integrate the gradients of
reward models (surrogates for RNA functional properties) into the backward
diffusion process, thereby generating RNAs with high reward scores. Empirical
results confirm that RNAdiffusion generates non-coding RNAs that align with
natural distributions across various biological metrics. Further, we fine-tune
the diffusion model on mRNA 5’ untranslated regions (5’-UTRs) and optimize
sequences for high translation efficiencies. Our guided diffusion model
effectively generates diverse 5’-UTRs with high Mean Ribosome Loading (MRL) and
Translation Efficiency (TE), outperforming baselines in balancing the trade-off
between rewards and structural stability. Our findings hold potential for advancing RNA
sequence-function research and therapeutic RNA design.
[LINK]
http://arxiv.org/abs/2409.09828v2
[DATE]
2024-10-03 00:42:46+08:00
[CATEGORIES]
cs.LG
Deep Separable Spatiotemporal Learning for Fast Dynamic Cardiac MRI
[AUTHORS]
Zi Wang, Min Xiao, Yirong Zhou, Chengyan Wang, Naiming Wu, Yi Li, Yiwen Gong, Shufu Chang, Yinyin Chen, Liuhong Zhu, Jianjun Zhou, Congbo Cai, He Wang, Di Guo, Guang Yang, Xiaobo Qu
[ABSTRACT]
Dynamic magnetic resonance imaging (MRI) plays an indispensable role in
cardiac diagnosis. To enable fast imaging, the k-space data can be undersampled,
but the image reconstruction poses a great challenge of high-dimensional
processing. This challenge necessitates extensive training data in deep
learning reconstruction methods. In this work, we propose a novel and efficient
approach, leveraging a dimension-reduced separable learning scheme that can
perform exceptionally well even with highly limited training data. We design
this new approach by incorporating spatiotemporal priors into the development
of a Deep Separable Spatiotemporal Learning network (DeepSSL), which unrolls an
iteration process of a 2D spatiotemporal reconstruction model with both
temporal low-rankness and spatial sparsity. Intermediate outputs can also be
visualized to provide insights into the network behavior and enhance
interpretability. Extensive results on cardiac cine datasets demonstrate that
the proposed DeepSSL surpasses state-of-the-art methods both visually and
quantitatively, while reducing the demand for training cases by up to 75%.
Additionally, its preliminary adaptability to unseen cardiac patients has been
verified through a blind reader study conducted by experienced radiologists and
cardiologists. Furthermore, DeepSSL enhances the accuracy of the downstream
task of cardiac segmentation and exhibits robustness in prospectively
undersampled real-time cardiac MRI.
[COMMENTS]
12 pages, 14 figures, 4 tables
[LINK]
http://arxiv.org/abs/2402.15939v2
[DATE]
2024-10-03 00:42:35+08:00
[CATEGORIES]
cs.LG
Test Time Learning for Time Series Forecasting
[AUTHORS]
Panayiotis Christou, Shichu Chen, Xupeng Chen, Parijat Dube
[ABSTRACT]
Time-series forecasting has seen significant advancements with the
introduction of token prediction mechanisms such as multi-head attention.
However, these methods often struggle to achieve the same performance as in
language modeling, primarily due to the quadratic computational cost and the
complexity of capturing long-range dependencies in time-series data.
State-space models (SSMs), such as Mamba, have shown promise in addressing
these challenges by offering efficient solutions with linear RNNs capable of
modeling long sequences with larger context windows. However, there remains
room for improvement in accuracy and scalability.
We propose the use of Test-Time Training (TTT) modules in a parallel
architecture to enhance performance in long-term time series forecasting.
Through extensive experiments on standard benchmark datasets, we demonstrate
that TTT modules consistently outperform state-of-the-art models, including the
Mamba-based TimeMachine, particularly in scenarios involving extended sequence
and prediction lengths. Our results show significant improvements in Mean
Squared Error (MSE) and Mean Absolute Error (MAE), especially on larger
datasets such as Electricity, Traffic, and Weather, underscoring the
effectiveness of TTT in capturing long-range dependencies. Additionally, we
explore various convolutional architectures within the TTT framework, showing
that even simple configurations like 1D convolution with small filters can
achieve competitive results. This work sets a new benchmark for time-series
forecasting and lays the groundwork for future research in scalable,
high-performance forecasting models.
[LINK]
http://arxiv.org/abs/2409.14012v2
[DATE]
2024-10-03 00:40:10+08:00
[CATEGORIES]
cs.LG
Strategies for Pretraining Neural Operators
[AUTHORS]
Anthony Zhou, Cooper Lorsung, AmirPouya Hemmasian, Amir Barati Farimani
[ABSTRACT]
Pretraining for partial differential equation (PDE) modeling has recently
shown promise in scaling neural operators across datasets to improve
generalizability and performance. Despite these advances, our understanding of
how pretraining affects neural operators is still limited; studies generally
propose tailored architectures and datasets that make it challenging to compare
or examine different pretraining frameworks. To address this, we compare
various pretraining methods without optimizing architecture choices to
characterize pretraining dynamics on different models and datasets as well as
to understand its scaling and generalization behavior. We find that pretraining
is highly dependent on model and dataset choices, but in general transfer
learning or physics-based pretraining strategies work best. In addition,
pretraining performance can be further improved by using data augmentations.
Lastly, pretraining can be additionally beneficial when fine-tuning in scarce
data regimes or when generalizing to downstream data similar to the pretraining
distribution. Through providing insights into pretraining neural operators for
physics prediction, we hope to motivate future work in developing and
evaluating pretraining methods for PDEs.
[COMMENTS]
29 pages, 5 figures
[LINK]
http://arxiv.org/abs/2406.08473v2
[DATE]
2024-10-03 00:37:16+08:00
[CATEGORIES]
cs.LG
Meta-TTT: A Meta-learning Minimax Framework For Test-Time Training
[AUTHORS]
Chen Tao, Li Shen, Soumik Mondal
[ABSTRACT]
Test-time domain adaptation is a challenging task that aims to adapt a
pre-trained model to limited, unlabeled target data during inference. Current
methods that rely on self-supervision and entropy minimization underperform
when the self-supervised learning (SSL) task does not align well with the
primary objective. Additionally, minimizing entropy can lead to suboptimal
solutions when there is limited diversity within minibatches. This paper
introduces a meta-learning minimax framework for test-time training on batch
normalization (BN) layers, ensuring that the SSL task aligns with the primary
task while addressing minibatch overfitting. We adopt a mixed-BN approach that
interpolates current test batch statistics with the statistics from source
domains and propose a stochastic domain synthesizing method to improve model
generalization and robustness to domain shifts. Extensive experiments
demonstrate that our method surpasses state-of-the-art techniques across
various domain adaptation and generalization benchmarks, significantly
enhancing the pre-trained model’s robustness on unseen domains.
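
The mixed-BN idea of interpolating test-batch statistics with source-domain statistics can be sketched for a single feature channel as below. The linear interpolation weight `alpha` and the function name are assumptions made for illustration; the paper's exact mixing rule and its meta-learning procedure are not reproduced here.

```python
def mixed_bn_normalize(batch, source_mean, source_var, alpha=0.5, eps=1e-5):
    """Normalize one channel with a blend of test-batch and source statistics.

    batch       -- list of activations for one channel in the current test batch
    source_mean -- running mean stored from the source domain
    source_var  -- running variance stored from the source domain
    alpha       -- weight on the test-batch statistics (hypothetical parameter)
    """
    n = len(batch)
    batch_mean = sum(batch) / n
    batch_var = sum((x - batch_mean) ** 2 for x in batch) / n
    # Convex combination of the two sets of statistics.
    mean = alpha * batch_mean + (1.0 - alpha) * source_mean
    var = alpha * batch_var + (1.0 - alpha) * source_var
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]
```

Setting `alpha=0` recovers standard inference-time batch normalization with source statistics, while `alpha=1` relies on the test batch alone, which is the regime where small or low-diversity minibatches are most likely to hurt.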
[COMMENTS]
10 pages, 7 tables, 1 figure
[LINK]
http://arxiv.org/abs/2410.01709v1
[DATE]
2024-10-03 00:16:05+08:00
[CATEGORIES]
cs.LG
Performant, Memory Efficient and Scalable Multi-Agent Reinforcement Learning
[AUTHORS]
Omayma Mahjoub, Sasha Abramowitz, Ruan de Kock, Wiem Khlifi, Simon du Toit, Jemma Daniel, Louay Ben Nessir, Louise Beyers, Claude Formanek, Liam Clark, Arnu Pretorius
[ABSTRACT]
As the field of multi-agent reinforcement learning (MARL) progresses towards
larger and more complex environments, achieving strong performance while
maintaining memory efficiency and scalability to many agents becomes
increasingly important. Although recent research has led to several advanced
algorithms, to date, none fully address all of these key properties
simultaneously. In this work, we introduce Sable, a novel and theoretically
sound algorithm that adapts the retention mechanism from Retentive Networks to
MARL. Sable’s retention-based sequence modelling architecture allows for
computationally efficient scaling to a large number of agents, as well as
maintaining a long temporal context, making it well-suited for large-scale
partially observable environments. Through extensive evaluations across six
diverse environments, we demonstrate how Sable is able to significantly
outperform existing state-of-the-art methods in the majority of tasks (34 out
of 45, roughly 75%). Furthermore, Sable demonstrates stable performance as we
scale the number of agents, handling environments with more than a thousand
agents while exhibiting a linear increase in memory usage. Finally, we conduct
ablation studies to isolate the source of Sable’s performance gains and confirm
its efficient computational memory usage. Our results highlight Sable’s
performance and efficiency, positioning it as a leading approach to MARL at
scale.
[LINK]
http://arxiv.org/abs/2410.01706v1
[DATE]
2024-10-03 00:15:26+08:00
[CATEGORIES]
cs.LG
MOREL: Enhancing Adversarial Robustness through Multi-Objective Representation Learning
[AUTHORS]
Sedjro Salomon Hotegni, Sebastian Peitz
[ABSTRACT]
Extensive research has shown that deep neural networks (DNNs) are vulnerable
to slight adversarial perturbations: small changes to the input data that
appear insignificant but cause the model to produce drastically different
outputs. In addition to augmenting training data with adversarial examples
generated from a specific attack method, most of the current defense strategies
necessitate modifying the original model architecture components to improve
robustness or performing test-time data purification to handle adversarial
attacks. In this work, we demonstrate that strong feature representation
learning during training can significantly enhance the original model’s
robustness. We propose MOREL, a multi-objective feature representation learning
approach, encouraging classification models to produce similar features for
inputs within the same class, despite perturbations. Our training method
involves an embedding space where cosine similarity loss and multi-positive
contrastive loss are used to align natural and adversarial features from the
model encoder and ensure tight clustering. Concurrently, the classifier is
motivated to achieve accurate predictions. Through extensive experiments, we
demonstrate that our approach significantly enhances the robustness of DNNs
against white-box and black-box adversarial attacks, outperforming other
methods that similarly require no architectural changes or test-time data
purification. Our code is available at https://github.com/salomonhotegni/MOREL
[LINK]
http://arxiv.org/abs/2410.01697v1
[DATE]
2024-10-03 00:05:03+08:00
[CATEGORIES]
cs.LG
Dimensionality Reduction and Nearest Neighbors for Improving Out-of-Distribution Detection in Medical Image Segmentation
[AUTHORS]
McKell Woodland, Nihil Patel, Austin Castelo, Mais Al Taie, Mohamed Eltaher, Joshua P. Yung, Tucker J. Netherton, Tiffany L. Calderone, Jessica I. Sanchez, Darrel W. Cleere, Ahmed Elsaiey, Nakul Gupta, David Victor, Laura Beretta, Ankit B. Patel, Kristy K. Brock
[ABSTRACT]
Clinically deployed deep learning-based segmentation models are known to fail
on data outside of their training distributions. While clinicians review the
segmentations, these models tend to perform well in most instances, which could
exacerbate automation bias. Therefore, detecting out-of-distribution images at
inference is critical to warn the clinicians that the model likely failed. This
work applied the Mahalanobis distance (MD) post hoc to the bottleneck features
of four Swin UNETR and nnU-net models that segmented the liver on T1-weighted
magnetic resonance imaging and computed tomography. By reducing the dimensions
of the bottleneck features with either principal component analysis or uniform
manifold approximation and projection, images the models failed on were
detected with high performance and minimal computational load. In addition,
this work explored a non-parametric alternative to the MD, a k-th nearest
neighbors distance (KNN). KNN drastically improved scalability and performance
over MD when both were applied to raw and average-pooled bottleneck features.
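A minimal sketch of the k-th nearest neighbor score described above, assuming Euclidean distance and toy 2-D features. The function name, the feature values, and k=2 are illustrative assumptions, not taken from the paper's code. A larger score means the query lies farther from the training features and is more likely out-of-distribution.

```python
import math

def kth_nn_distance(train_feats, query, k=1):
    """Distance from `query` to its k-th nearest neighbour in `train_feats`."""
    if not 1 <= k <= len(train_feats):
        raise ValueError("k must be between 1 and the number of training features")
    dists = sorted(math.dist(f, query) for f in train_feats)
    return dists[k - 1]

# Hypothetical in-distribution features (stand-ins for pooled bottleneck features).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

in_score = kth_nn_distance(train, (0.5, 0.5), k=2)   # query inside the training cloud
ood_score = kth_nn_distance(train, (5.0, 5.0), k=2)  # query far from the training cloud
```

Unlike the Mahalanobis distance, this score needs no covariance estimate or matrix inverse, which is one plausible reason it scales better to high-dimensional raw features.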
[COMMENTS]
Accepted for publication at the Journal of Machine Learning for
Biomedical Imaging (MELBA) https://melba-journal.org/2024:020. Expansion of
“Dimensionality Reduction for Improving Out-of-Distribution Detection in
Medical Image Segmentation” arXiv:2308.03723. Code available at
https://github.com/mckellwoodland/dimen_reduce_mahal
(https://zenodo.org/records/13881989)
[LINK]
http://arxiv.org/abs/2408.02761v3
[DATE]
2024-10-03 00:00:29+08:00
[CATEGORIES]
cs.LG
Uncertainty Quantification with Bayesian Higher Order ReLU KANs
[AUTHORS]
James Giroux, Cristiano Fanelli
[ABSTRACT]
We introduce the first method of uncertainty quantification in the domain of
Kolmogorov-Arnold Networks, specifically focusing on (Higher Order) ReLUKANs to
enhance computational efficiency given the computational demands of Bayesian
methods. The method we propose is general in nature, providing access to both
epistemic and aleatoric uncertainties. It can also be generalized to
various other basis functions. We validate our method through a series of
closure tests, including simple one-dimensional functions and application to
the domain of (Stochastic) Partial Differential Equations. Referring to the
latter, we demonstrate the method’s ability to correctly identify functional
dependencies introduced through the inclusion of a stochastic term. The code
supporting this work can be found at
https://github.com/wmdataphys/Bayesian-HR-KAN
[COMMENTS]
13 pages, 7 Figures
[LINK]
http://arxiv.org/abs/2410.01687v2
[DATE]
2024-10-03 10:21:38+08:00
[CATEGORIES]
cs.LG
Integrative Decoding: Improve Factuality via Implicit Self-consistency
[AUTHORS]
Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong
[ABSTRACT]
Self-consistency-based approaches, which involve repeatedly sampling multiple
outputs and selecting the most consistent one as the final response, prove to
be remarkably effective in improving the factual accuracy of large language
models. Nonetheless, existing methods usually have strict constraints on the
task format, largely limiting their applicability. In this paper, we present
Integrative Decoding (ID), to unlock the potential of self-consistency in
open-ended generation tasks. ID operates by constructing a set of inputs, each
prepended with a previously sampled response, and then processes them
concurrently, with the next token being selected by aggregating all their
corresponding predictions at each decoding step. In essence, this simple
approach implicitly incorporates self-consistency in the decoding objective.
Extensive evaluation shows that ID consistently enhances factuality over a wide
range of language models, with substantial improvements on the TruthfulQA
(+11.2%), Biographies (+15.4%) and LongFact (+8.5%) benchmarks. The performance
gains amplify progressively as the number of sampled responses increases,
indicating the potential of ID to scale up with repeated sampling.
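A minimal sketch of one decoding step as described above. The abstract does not state the aggregation rule, so summing log-probabilities across inputs is an assumption here, and the function name and toy distributions are hypothetical.

```python
import math

def integrate_step(per_input_probs):
    """Pick the next token by aggregating the predictions of all inputs.

    per_input_probs: one dict per input (token -> probability), where each
    input is the prompt prepended with one previously sampled response.
    Aggregation by summed log-probability is an assumption of this sketch.
    """
    vocab = set(per_input_probs[0])
    scores = {
        tok: sum(math.log(probs[tok]) for probs in per_input_probs)
        for tok in vocab
    }
    return max(scores, key=scores.get)

# Toy next-token distributions from three inputs (hypothetical values).
preds = [
    {"Paris": 0.6, "Lyon": 0.3, "Rome": 0.1},
    {"Paris": 0.5, "Lyon": 0.4, "Rome": 0.1},
    {"Paris": 0.3, "Lyon": 0.6, "Rome": 0.1},
]
next_token = integrate_step(preds)
```

The token that is most consistently supported across the inputs wins, even if one input prefers a different token.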
[LINK]
http://arxiv.org/abs/2410.01556v2
[DATE]
2024-10-03 11:11:24+08:00
[CATEGORIES]
cs.CL
cs.LG
Personalisation via Dynamic Policy Fusion
[AUTHORS]
Ajsal Shereef Palattuparambil, Thommen George Karimpanal, Santu Rana
[ABSTRACT]
Deep reinforcement learning (RL) policies, although optimal in terms of task
rewards, may not align with the personal preferences of human users. To ensure
this alignment, a naive solution would be to retrain the agent using a reward
function that encodes the user’s specific preferences. However, such a reward
function is typically not readily available, and as such, retraining the agent
from scratch can be prohibitively expensive. We propose a more practical
approach - to adapt the already trained policy to user-specific needs with the
help of human feedback. To this end, we infer the user’s intent through
trajectory-level feedback and combine it with the trained task policy via a
theoretically grounded dynamic policy fusion approach. As our approach collects
human feedback on the very same trajectories used to learn the task policy, it
does not require any additional interactions with the environment, making it a
zero-shot approach. We empirically demonstrate in a number of environments that
our proposed dynamic policy fusion approach consistently achieves the intended
task while simultaneously adhering to user-specific needs.
[LINK]
http://arxiv.org/abs/2409.20016v2
[DATE]
2024-10-03 11:15:28+08:00
[CATEGORIES]
cs.LG
VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment
[AUTHORS]
Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, Nicolas Le Roux
[ABSTRACT]
Large language models (LLMs) are increasingly applied to complex reasoning
tasks that require executing several complex steps before receiving any reward.
Properly assigning credit to these steps is essential for enhancing model
performance. Proximal Policy Optimization (PPO), a state-of-the-art
reinforcement learning (RL) algorithm used for LLM finetuning, employs value
networks to tackle credit assignment. However, value networks face challenges
in predicting the expected cumulative rewards accurately in complex reasoning
tasks, often leading to high-variance updates and suboptimal performance. In
this work, we systematically evaluate the efficacy of value networks and reveal
their significant shortcomings in reasoning-heavy LLM tasks, showing that they
barely outperform a random baseline when comparing alternative steps. To
address this, we propose VinePPO, a straightforward approach that leverages the
flexibility of language environments to compute unbiased Monte Carlo-based
estimates, bypassing the need for large value networks. Our method consistently
outperforms PPO and RL-free baselines across the MATH and GSM8K datasets with
fewer gradient updates (up to 9x) and less wall-clock time (up to 3.0x). These
results emphasize the importance of accurate credit assignment in RL finetuning
of LLMs and demonstrate VinePPO’s potential as a superior alternative.
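A minimal sketch of a Monte Carlo value estimate, assuming the environment can be restarted from any intermediate state. The toy rollout, the sample counts, and the advantage computed as a difference of two estimates are illustrative assumptions, not the authors' implementation.

```python
import random

def mc_value(state, rollout, num_samples=8, rng=None):
    """Estimate the value of `state` as the mean return of sampled rollouts."""
    if rng is None:
        rng = random.Random(0)
    returns = [rollout(state, rng) for _ in range(num_samples)]
    return sum(returns) / len(returns)

def toy_rollout(state, rng):
    """Hypothetical environment: the episode succeeds with probability `state`."""
    return 1.0 if rng.random() < state else 0.0

# Value before and after one reasoning step (toy states 0.2 and 0.9).
v_before = mc_value(0.2, toy_rollout, num_samples=1000, rng=random.Random(0))
v_after = mc_value(0.9, toy_rollout, num_samples=1000, rng=random.Random(1))

# With no intermediate reward, the step's credit is the change in value.
step_advantage = v_after - v_before
```

Because each estimate is an average of real returns, it is unbiased for any number of samples; more samples only reduce its variance.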
[LINK]
http://arxiv.org/abs/2410.01679v1
[DATE]
2024-10-02 23:49:30+08:00
[CATEGORIES]
cs.LG
cs.CL
Trying to be human: Linguistic traces of stochastic empathy in language models
[AUTHORS]
Bennett Kleinberg, Jari Zegers, Jonas Festor, Stefana Vida, Julian Präsent, Riccardo Loconte, Sanne Peereboom
[ABSTRACT]
Differentiating between generated and human-written content is important for
navigating the modern world. Large language models (LLMs) are crucial drivers
behind the increased quality of computer-generated content. Reportedly, humans
find it increasingly difficult to identify whether an AI model generated a
piece of text. Our work tests how two important factors contribute to the human
vs AI race: empathy and an incentive to appear human. We address both aspects
in two experiments: human participants and a state-of-the-art LLM wrote
relationship advice (Study 1, n=530) or mere descriptions (Study 2, n=610),
either instructed to be as human as possible or not. New samples of humans
(n=428 and n=408) then judged the texts’ source. Our findings show that when
empathy is required, humans excel. Contrary to expectations, instructions to
appear human were only effective for the LLM, so the human advantage
diminished. Computational text analysis revealed that LLMs become more human
because they may have an implicit representation of what makes a text human and
effortlessly apply these heuristics. The model resorts to a conversational,
self-referential, informal tone with a simpler vocabulary to mimic stochastic
empathy. We discuss these findings in light of recent claims on the on-par
performance of LLMs.
[COMMENTS]
preprint
[LINK]
http://arxiv.org/abs/2410.01675v1
[DATE]
2024-10-02 23:46:40+08:00
[CATEGORIES]
cs.CL
Addition is All You Need for Energy-efficient Language Models
[AUTHORS]
Hongyin Luo, Wei Sun
[ABSTRACT]
Large neural networks spend most computation on floating point tensor
multiplications. In this work, we find that a floating point multiplier can be
approximated by one integer adder with high precision. We propose the
linear-complexity multiplication L-Mul algorithm that approximates floating
point number multiplication with integer addition operations. Compared to
8-bit floating point multiplication, the new algorithm achieves higher
precision while consuming significantly less bit-level computation. Since
multiplying floating point numbers requires substantially more energy than
integer addition, applying the L-Mul operation in tensor processing hardware
can potentially reduce the energy cost of element-wise floating point tensor
multiplications by 95% and that of dot products by 80%. We calculated the
theoretical error expectation of L-Mul, and evaluated the algorithm on a wide
range of textual, visual, and symbolic tasks, including natural language
understanding, structural reasoning, mathematics, and commonsense question
answering. Our numerical analysis experiments agree with the theoretical error
estimation, which indicates that L-Mul with a 4-bit mantissa achieves precision
comparable to float8_e4m3 multiplications, and L-Mul with a 3-bit mantissa
outperforms float8_e5m2. Evaluation results on popular benchmarks show that
directly applying L-Mul to the attention mechanism is almost lossless. We
further show that replacing all floating point multiplications with 3-bit
mantissa L-Mul in a transformer model achieves precision equivalent to using
float8_e4m3 as accumulation precision in both fine-tuning and inference.
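The abstract does not spell out the L-Mul formula, so the sketch below is not L-Mul itself. It shows the classic Mitchell-style trick that illustrates the same general idea: for positive float32 values, adding the two bit patterns as integers and subtracting the bit pattern of 1.0 approximates their product. The function names are hypothetical, and overflow, underflow, subnormals, zero, and negative inputs are not handled.

```python
import struct

ONE_BITS = 0x3F800000  # IEEE 754 float32 bit pattern of 1.0

def float_to_bits(x):
    """Reinterpret a float32 value as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b):
    """Reinterpret an unsigned 32-bit integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def approx_mul(a, b):
    """Approximate a * b with one integer addition and one subtraction."""
    if a <= 0 or b <= 0:
        raise ValueError("this sketch only handles positive inputs")
    return bits_to_float(float_to_bits(a) + float_to_bits(b) - ONE_BITS)

exact_case = approx_mul(1.5, 2.0)  # exact when one mantissa fraction is zero
worst_case = approx_mul(1.5, 1.5)  # true product is 2.25; this trick gives 2.0
```

Adding the exponent fields is exact, while adding the mantissa fields drops the cross term of the two fractional parts. That dropped term is the error a method of this kind has to correct.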
[LINK]
http://arxiv.org/abs/2410.00907v2
[DATE]
2024-10-02 23:34:12+08:00
[CATEGORIES]
cs.CL
What is “Typological Diversity” in NLP?
[AUTHORS]
Esther Ploeger, Wessel Poelman, Miryam de Lhoneux, Johannes Bjerva
[ABSTRACT]
The NLP research community has devoted increased attention to languages
beyond English, resulting in considerable improvements for multilingual NLP.
However, these improvements only apply to a small subset of the world’s
languages. Aiming to extend this, an increasing number of papers aspires to
enhance generalizable multilingual performance across languages. To this end,
linguistic typology is commonly used to motivate language selection, on the
basis that a broad typological sample ought to imply generalization across a
broad range of languages. These selections are often described as being
‘typologically diverse’. In this work, we systematically investigate NLP
research that includes claims regarding ‘typological diversity’. We find there
are no set definitions or criteria for such claims. We introduce metrics to
approximate the diversity of language selection along several axes and find
that the results vary considerably across papers. Crucially, we show that
skewed language selection can lead to overestimated multilingual performance.
We recommend future work to include an operationalization of ‘typological
diversity’ that empirically justifies the diversity of language samples.
[COMMENTS]
EMNLP 2024: Main Conference
[LINK]
http://arxiv.org/abs/2402.04222v4
[DATE]
2024-10-02 23:27:36+08:00
[CATEGORIES]
cs.CL
Gemma 2: Improving Open Language Models at a Practical Size
[AUTHORS]
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momchev, Matt Hoffman, Shantanu Thakoor, Jean-Bastien Grill, Behnam Neyshabur, Olivier Bachem, Alanna Walton, Aliaksei Severyn, Alicia Parrish, Aliya Ahmad, Allen Hutchison, Alvin Abdagic, Amanda Carl, Amy Shen, Andy Brock, Andy Coenen, Anthony Laforge, Antonia Paterson, Ben Bastian, Bilal Piot, Bo Wu, Brandon Royal, Charlie Chen, Chintu Kumar, Chris Perry, Chris Welty, Christopher A. Choquette-Choo, Danila Sinopalnikov, David Weinberger, Dimple Vijaykumar, Dominika Rogozińska, Dustin Herbison, Elisa Bandy, Emma Wang, Eric Noland, Erica Moreira, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin, Gary Wei, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak-Plucińska, Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jeff Stanway, Jetha Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju-yeong Ji, Kareem Mohamed, Kartikeya Badola, Kat Black, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, Lars Lowe Sjoesund, Lauren Usui, Laurent Sifre, Lena Heuermann, Leticia Lago, Lilly McNealus, Livio Baldini Soares, Logan Kilpatrick, Lucas Dixon, Luciano Martins, Machel Reid, Manvinder Singh, Mark Iverson, Martin Görner, Mat Velloso, Mateo Wirth, Matt Davidow, Matt Miller, Matthew Rahtz, Matthew Watson, Meg Risdal, Mehran Kazemi, Michael Moynihan, Ming Zhang, Minsuk Kahng, Minwoo Park, Mofi Rahman, Mohit Khatwani, Natalie Dao, Nenshad Bardoliwalla, Nesh Devanathan, Neta Dumai, Nilay Chauhan, Oscar Wahltinez, Pankil Botarda, Parker Barnes, 
Paul Barham, Paul Michel, Pengchong Jin, Petko Georgiev, Phil Culliton, Pradeep Kuppala, Ramona Comanescu, Ramona Merhej, Reena Jana, Reza Ardeshir Rokni, Rishabh Agarwal, Ryan Mullins, Samaneh Saadat, Sara Mc Carthy, Sarah Cogan, Sarah Perrin, Sébastien M. R. Arnold, Sebastian Krause, Shengyang Dai, Shruti Garg, Shruti Sheth, Sue Ronstrom, Susan Chan, Timothy Jordan, Ting Yu, Tom Eccles, Tom Hennigan, Tomas Kocisky, Tulsee Doshi, Vihan Jain, Vikas Yadav, Vilobh Meshram, Vishal Dharmadhikari, Warren Barkley, Wei Wei, Wenming Ye, Woohyun Han, Woosuk Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan Wei, Victor Cotruta, Phoebe Kirk, Anand Rao, Minh Giang, Ludovic Peran, Tris Warkentin, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, D. Sculley, Jeanine Banks, Anca Dragan, Slav Petrov, Oriol Vinyals, Jeff Dean, Demis Hassabis, Koray Kavukcuoglu, Clement Farabet, Elena Buchatskaya, Sebastian Borgeaud, Noah Fiedel, Armand Joulin, Kathleen Kenealy, Robert Dadashi, Alek Andreev
[ABSTRACT]
In this work, we introduce Gemma 2, a new addition to the Gemma family of
lightweight, state-of-the-art open models, ranging in scale from 2 billion to
27 billion parameters. In this new version, we apply several known technical
modifications to the Transformer architecture, such as interleaving
local-global attentions (Beltagy et al., 2020a) and group-query attention
(Ainslie et al., 2023). We also train the 2B and 9B models with knowledge
distillation (Hinton et al., 2015) instead of next token prediction. The
resulting models deliver the best performance for their size, and even offer
competitive alternatives to models that are 2-3 times bigger. We release all
our models to the community.
[LINK]
http://arxiv.org/abs/2408.00118v3
[DATE]
2024-10-02 23:22:49+08:00
[CATEGORIES]
cs.CL
Efficient Long-range Language Modeling with Self-supervised Causal Retrieval
[AUTHORS]
Xiang Hu, Zhihao Teng, Wei Wu, Kewei Tu
[ABSTRACT]
Recently, retrieval-based language models (RLMs) have received much
attention. However, most of them leverage a pre-trained retriever with fixed
parameters, which may not adapt well to causal language models. In this work,
we propose Grouped Cross-Attention, a novel module enabling joint pre-training
of the retriever and causal LM, and apply it to long-context modeling. For a
given input sequence, we split it into chunks and use the current chunk to
retrieve past chunks for subsequent text generation. Our innovation allows the
retriever to learn how to retrieve past chunks that better minimize the
auto-regressive loss of subsequent tokens in an end-to-end manner. By
integrating top-$k$ retrieval, our model can be pre-trained efficiently from
scratch with context lengths up to 64K tokens. Our experiments show our model,
compared with long-range LM baselines, can achieve lower perplexity with
comparable or lower pre-training and inference costs.
[COMMENTS]
preprint
[LINK]
http://arxiv.org/abs/2410.01651v1
[DATE]
2024-10-02 23:18:34+08:00
[CATEGORIES]
cs.CL
On The Adaptation of Unlimiformer for Decoder-Only Transformers
[AUTHORS]
Kian Ahrabian, Alon Benhaim, Barun Patra, Jay Pujara, Saksham Singhal, Xia Song
[ABSTRACT]
One of the prominent issues stifling the current generation of large language
models is their limited context length. Recent proprietary models such as GPT-4
and Claude 2 have introduced longer context lengths, 8k/32k and 100k,
respectively; however, despite the efforts in the community, most common
models, such as LLama-2, have a context length of 4k or less. Unlimiformer
(Bertsch et al., 2023) is a recently popular vector-retrieval augmentation
method that offloads cross-attention computations to a kNN index. However, its
main limitation is incompatibility with decoder-only transformers out of the
box. In this work, we explore practical considerations of adapting Unlimiformer
to decoder-only transformers and introduce a series of modifications to
overcome this limitation. Moreover, we expand the original experimental setup
on summarization to include a new task (i.e., free-form Q&A) and an
instruction-tuned model (i.e., a custom 6.7B GPT model). Our results showcase
the effectiveness of these modifications on summarization, performing on par
with a model with 2x the context length. Moreover, we discuss limitations and
future directions for free-form Q&A and instruction-tuned models.
[COMMENTS]
8 pages, 6 figures
[LINK]
http://arxiv.org/abs/2410.01637v1
[DATE]
2024-10-02 23:08:12+08:00
[CATEGORIES]
cs.CL
Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
[AUTHORS]
Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, Han Xiao
[ABSTRACT]
Many use cases require retrieving smaller portions of text, and dense
vector-based retrieval systems often perform better with shorter text segments,
as the semantics are less likely to be over-compressed in the embeddings.
Consequently, practitioners often split text documents into smaller chunks and
encode them separately. However, chunk embeddings created in this way can lose
contextual information from surrounding chunks, resulting in sub-optimal
representations. In this paper, we introduce a novel method called late
chunking, which leverages long context embedding models to first embed all
tokens of the long text, with chunking applied after the transformer model and
just before mean pooling - hence the term late in its naming. The resulting
chunk embeddings capture the full contextual information, leading to superior
results across various retrieval tasks. The method is generic enough to be
applied to a wide range of long-context embedding models and works without
additional training. To further increase the effectiveness of late chunking, we
propose a dedicated fine-tuning approach for embedding models.
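A minimal sketch of the pooling step described above, assuming the contextual token embeddings for the whole document have already been computed by a long-context model. The function names, the toy vectors, and the (start, end) span format are illustrative assumptions.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def late_chunk(token_embeddings, spans):
    """Pool token embeddings per chunk after the full document was encoded.

    token_embeddings: one contextual vector per token of the whole document.
    spans: (start, end) token index ranges, end exclusive, one per chunk.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]

# Toy 2-D token embeddings for a four-token document (hypothetical values).
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
chunks = late_chunk(tokens, [(0, 2), (2, 4)])
```

In conventional chunking the text is split before encoding, so each token vector only reflects its own chunk. Here the split happens after encoding, so each pooled vector can carry context from the rest of the document.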
[COMMENTS]
11 pages, 3rd draft
[LINK]
http://arxiv.org/abs/2409.04701v2
[DATE]
2024-10-02 23:07:09+08:00
[CATEGORIES]
cs.CL
TOPFORMER: Topology-Aware Authorship Attribution of Deepfake Texts with Diverse Writing Styles
[AUTHORS]
Adaku Uchendu, Thai Le, Dongwon Lee
[ABSTRACT]
Recent advances in Large Language Models (LLMs) have enabled the generation
of open-ended, high-quality texts that are non-trivial to distinguish from
human-written texts. We refer to such LLM-generated texts as deepfake texts.
There are currently over 72K text generation models in the huggingface model
repo. As such, users with malicious intent can easily use these open-sourced
LLMs to generate harmful texts and dis/misinformation at scale. To mitigate
this problem, a computational method to determine if a given text is a deepfake
text or not is desired–i.e., Turing Test (TT). In particular, in this work, we
investigate the more general version of the problem, known as Authorship
Attribution (AA), in a multi-class setting–i.e., not only determining if a
given text is a deepfake text or not but also being able to pinpoint which LLM
is the author. We propose TopFormer to improve existing AA solutions by
capturing more linguistic patterns in deepfake texts by including a Topological
Data Analysis (TDA) layer in the Transformer-based model. We show the benefits
of having a TDA layer when dealing with imbalanced and multi-style datasets,
by extracting TDA features from the reshaped $pooled_output$ of our backbone
as input. This Transformer-based model captures contextual representations
(i.e., semantic and syntactic linguistic features), while TDA captures the
shape and structure of data (i.e., linguistic structures). Finally, TopFormer
outperforms all baselines on all 3 datasets, achieving up to a 7% increase in
Macro F1 score. Our code and datasets are available at:
https://github.com/AdaUchendu/topformer
[COMMENTS]
Accepted at The 27th European Conference on Artificial Intelligence
(ECAI 2024)
[LINK]
http://arxiv.org/abs/2309.12934v3
[DATE]
2024-10-02 23:04:59+08:00
[CATEGORIES]
cs.CL
A Thematic Framework for Analyzing Large-scale Self-reported Social Media Data on Opioid Use Disorder Treatment Using Buprenorphine Product
[AUTHORS]
Madhusudan Basak, Omar Sharif, Sarah E. Lord, Jacob T. Borodovsky, Lisa A. Marsch, Sandra A. Springer, Edward Nunes, Charlie D. Brackett, Luke J. ArchiBald, Sarah M. Preum
[ABSTRACT]
Background: One of the key FDA-approved medications for Opioid Use Disorder
(OUD) is buprenorphine. Despite its popularity, individuals often report
various information needs regarding buprenorphine treatment on social media
platforms like Reddit. However, the key challenge is to characterize these
needs. In this study, we propose a theme-based framework to curate and analyze
large-scale data from social media to characterize self-reported treatment
information needs (TINs).
Methods: We collected 15,253 posts from r/Suboxone, one of the largest Reddit
sub-communities for buprenorphine products. Following the standard protocol, we
first identified and defined five main themes from the data and then coded
6,000 posts based on these themes, where each post can be labeled with
one to three applicable themes. Finally, we determined the most frequently
appearing sub-themes (topics) for each theme by analyzing samples from each
group.
Results: Among the 6,000 posts, 40.3% contained a single theme, 36% two
themes, and 13.9% three themes. The most frequent topics for each theme or
theme combination came with several key findings - prevalent reporting of
psychological and physical effects during recovery, complexities in accessing
buprenorphine, and significant information gaps regarding medication
administration, tapering, and usage of substances during different stages of
recovery. Moreover, self-treatment strategies and peer-driven advice reveal
valuable insights and potential misconceptions.
Conclusions: The findings obtained using our proposed framework can inform
better patient education and patient-provider communication, design systematic
interventions to address treatment-related misconceptions and rumors, and
streamline the generation of hypotheses for future research.
[LINK]
http://arxiv.org/abs/2410.01633v1
[DATE]
2024-10-02 23:04:21+08:00
[CATEGORIES]
cs.CL
Intent Detection in the Age of LLMs
[AUTHORS]
Gaurav Arora, Shreya Jain, Srujana Merugu
[ABSTRACT]
Intent detection is a critical component of task-oriented dialogue systems
(TODS) which enables the identification of suitable actions to address user
utterances at each dialog turn. Traditional approaches relied on
computationally efficient supervised sentence transformer encoder models, which
require substantial training data and struggle with out-of-scope (OOS)
detection. The emergence of generative large language models (LLMs) with
intrinsic world knowledge presents new opportunities to address these
challenges. In this work, we adapt 7 SOTA LLMs using adaptive in-context
learning and chain-of-thought prompting for intent detection, and compare their
performance with contrastively fine-tuned sentence transformer (SetFit) models
to highlight the tradeoff between prediction quality and latency. We propose a
hybrid system that combines the two approaches through an uncertainty-based
routing strategy and, together with negative data augmentation, achieves the
best of both worlds (i.e., within 2% of native LLM accuracy with 50% less latency). To
better understand LLM OOS detection capabilities, we perform controlled
experiments revealing that this capability is significantly influenced by the
scope of intent labels and the size of the label space. We also introduce a
two-step approach utilizing internal LLM representations, demonstrating
empirical gains in OOS detection accuracy and F1-score by >5% for the
Mistral-7B model.
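A minimal sketch of uncertainty-based routing, assuming the fast classifier returns a probability per intent and that its top probability serves as the confidence signal. The threshold of 0.7, the stub models, and the intent labels are hypothetical and not taken from the paper.

```python
def hybrid_predict(utterance, fast_model, llm, threshold=0.7):
    """Answer with the fast model when it is confident, else defer to the LLM."""
    probs = fast_model(utterance)
    label, confidence = max(probs.items(), key=lambda item: item[1])
    if confidence >= threshold:
        return label, "fast"
    return llm(utterance), "llm"

def toy_fast_model(utterance):
    """Stand-in for a fine-tuned sentence-transformer classifier."""
    if "balance" in utterance:
        return {"check_balance": 0.92, "transfer": 0.05, "oos": 0.03}
    return {"check_balance": 0.40, "transfer": 0.35, "oos": 0.25}

def toy_llm(utterance):
    """Stand-in for a slower LLM call; always answers out-of-scope here."""
    return "oos"

confident = hybrid_predict("what is my balance", toy_fast_model, toy_llm)
uncertain = hybrid_predict("tell me a joke", toy_fast_model, toy_llm)
```

Only the uncertain queries pay the LLM latency, which is how a router of this kind can stay close to LLM accuracy at a lower average cost.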
[COMMENTS]
Accepted at EMNLP 2024 Industry Track
[LINK]
http://arxiv.org/abs/2410.01627v1
[DATE]
2024-10-02 23:01:55+08:00
[CATEGORIES]
cs.CL
Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging
[AUTHORS]
Tingfeng Hui, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Hua Wu, Sen Su
[ABSTRACT]
Mixture-of-Experts (MoE) shines brightly in large language models (LLMs) and
demonstrates outstanding performance in plentiful natural language processing
tasks. However, existing methods transforming LLMs from dense to MoE face
significant data requirements and typically rely on large-scale post-training.
In this paper, we propose Upcycling Instruction Tuning (UpIT), a data-efficient
approach for tuning a dense pre-trained model into a MoE instruction model.
Specifically, we first point out that intermediate checkpoints during
instruction tuning of the dense model are naturally suitable for specialized
experts, and then propose an expert expansion stage to obtain models
with flexible numbers of experts, where a genetic algorithm and parameter merging
are introduced to ensure sufficient diversity among the newly added experts. To
ensure that each specialized expert in the MoE model works as expected, we
select a small amount of seed data on which each expert excels to pre-optimize the
router. Extensive experiments with various data scales and upcycling settings
demonstrate the outstanding performance and data efficiency of UpIT, as well as
stable improvement in expert or data scaling. Further analysis reveals the
importance of ensuring expert diversity in upcycling.
[COMMENTS]
work in progress
[LINK]
http://arxiv.org/abs/2410.01610v1
[DATE]
2024-10-02 22:48:22+08:00
[CATEGORIES]
cs.CL
Multiple Heads are Better than One: Mixture of Modality Knowledge Experts for Entity Representation Learning
[AUTHORS]
Yichi Zhang, Zhuo Chen, Lingbing Guo, Yajing Xu, Binbin Hu, Ziqi Liu, Wen Zhang, Huajun Chen
[ABSTRACT]
Learning high-quality multi-modal entity representations is an important goal
of multi-modal knowledge graph (MMKG) representation learning, which can
enhance reasoning tasks within the MMKGs, such as MMKG completion (MMKGC). The
main challenge is to collaboratively model the structural information concealed
in massive triples and the multi-modal features of the entities. Existing
methods focus on crafting elegant entity-wise multi-modal fusion strategies,
yet they overlook the utilization of multi-perspective features concealed
within the modalities under diverse relational contexts. To address this issue,
we introduce a novel framework with Mixture of Modality Knowledge experts
(MoMoK for short) to learn adaptive multi-modal entity representations for
better MMKGC. We design relation-guided modality knowledge experts to acquire
relation-aware modality embeddings and integrate the predictions from
multi-modalities to achieve joint decisions. Additionally, we disentangle the
experts by minimizing their mutual information. Experiments on four public MMKG
benchmarks demonstrate the outstanding performance of MoMoK under complex
scenarios.
[COMMENTS]
Work in progress. Code and data will be released at
https://github.com/zjukg/MoMoK
[LINK]
http://arxiv.org/abs/2405.16869v2
[DATE]
2024-10-02 22:42:10+08:00
[CATEGORIES]
cs.CL
ENTP: Encoder-only Next Token Prediction
[AUTHORS]
Ethan Ewer, Daewon Chae, Thomas Zeng, Jinkyu Kim, Kangwook Lee
[ABSTRACT]
Next-token prediction models have predominantly relied on decoder-only
Transformers with causal attention, driven by the common belief that causal
attention is essential to prevent “cheating” by masking future tokens. We
challenge this widely accepted notion and argue that this design choice is
about efficiency rather than necessity. While decoder-only Transformers are
still a good choice for practical reasons, they are not the only viable option.
In this work, we introduce Encoder-only Next Token Prediction (ENTP). We
explore the differences between ENTP and decoder-only Transformers in
expressive power and complexity, highlighting potential advantages of ENTP. We
introduce the Triplet-Counting task and show, both theoretically and
experimentally, that while ENTP can perform this task easily, a decoder-only
Transformer cannot. Finally, we empirically demonstrate ENTP’s superior
performance across various realistic tasks, such as length generalization and
in-context learning.
[LINK]
http://arxiv.org/abs/2410.01600v1
[DATE]
2024-10-02 22:39:13+08:00
[CATEGORIES]
cs.LG
cs.CL
CUTE: Measuring LLMs’ Understanding of Their Tokens
[AUTHORS]
Lukas Edman, Helmut Schmid, Alexander Fraser
[COMMENTS]
Accepted to EMNLP 2024 main conference
[LINK]
http://arxiv.org/abs/2409.15452v2
[DATE]
2024-10-02 22:35:40+08:00
[CATEGORIES]
cs.CL
Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey
[AUTHORS]
Sourav Verma
[ABSTRACT]
Large Language Models (LLMs) showcase remarkable abilities, yet they struggle
with limitations such as hallucinations, outdated knowledge, opacity, and
inexplicable reasoning. To address these challenges, Retrieval-Augmented
Generation (RAG) has proven to be a viable solution. It leverages external
databases to improve the consistency and coherence of generated content, which is
especially valuable for complex, knowledge-rich tasks, and it facilitates
continuous improvement by drawing on domain-specific insights. By combining the
intrinsic knowledge of LLMs with the vast, dynamic repositories of external
databases, RAG achieves a synergistic effect. However, RAG is not without its
limitations, including a limited context window, irrelevant information, and
the high processing overhead for extensive contextual data. In this
comprehensive work, we explore the evolution of Contextual Compression
paradigms, providing an in-depth examination of the field. Finally, we outline
the current challenges and suggest potential research and development
directions, paving the way for future advancements in this area.
[COMMENTS]
Ongoing Work
[LINK]
http://arxiv.org/abs/2409.13385v2
[DATE]
2024-10-02 22:30:28+08:00
[CATEGORIES]
cs.CL
Entity or Relation Embeddings? An Analysis of Encoding Strategies for Relation Extraction
[AUTHORS]
Frank Mtumbuka, Steven Schockaert
[ABSTRACT]
Relation extraction is essentially a text classification problem, which can
be tackled by fine-tuning a pre-trained language model (LM). However, a key
challenge arises from the fact that relation extraction cannot
straightforwardly be reduced to sequence or token classification. Existing
approaches therefore solve the problem in an indirect way: they fine-tune an LM
to learn embeddings of the head and tail entities, and then predict the
relationship from these entity embeddings. Our hypothesis in this paper is that
relation extraction models can be improved by capturing relationships in a more
direct way. In particular, we experiment with appending a prompt with a [MASK]
token, whose contextualised representation is treated as a relation embedding.
While, on its own, this strategy significantly underperforms the aforementioned
approach, we find that the resulting relation embeddings are highly
complementary to what is captured by embeddings of the head and tail entity. By
jointly considering both types of representations, we end up with a simple
model that outperforms the state-of-the-art across several relation extraction
benchmarks.
[COMMENTS]
Accepted in the Findings of EMNLP 2024
[LINK]
http://arxiv.org/abs/2312.11062v2
[DATE]
2024-10-02 22:26:30+08:00
[CATEGORIES]
cs.CL
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
[AUTHORS]
Tu Anh Dinh, Carlos Mullov, Leonard Bärmann, Zhaolin Li, Danni Liu, Simon Reiß, Jueun Lee, Nathan Lerzer, Fabian Ternava, Jianfeng Gao, Tobias Röddiger, Alexander Waibel, Tamim Asfour, Michael Beigl, Rainer Stiefelhagen, Carsten Dachsbacher, Klemens Böhm, Jan Niehues
[COMMENTS]
Accepted to EMNLP 2024 Main Conference
[LINK]
http://arxiv.org/abs/2406.10421v3
[DATE]
2024-10-02 22:20:50+08:00
[CATEGORIES]
cs.CL
KnowTuning: Knowledge-aware Fine-tuning for Large Language Models
[AUTHORS]
Yougang Lyu, Lingyong Yan, Shuaiqiang Wang, Haibo Shi, Dawei Yin, Pengjie Ren, Zhumin Chen, Maarten de Rijke, Zhaochun Ren
[COMMENTS]
EMNLP 2024 main paper
[LINK]
http://arxiv.org/abs/2402.11176v3
[DATE]
2024-10-02 22:20:29+08:00
[CATEGORIES]
cs.CL
Spoken Grammar Assessment Using LLM
[AUTHORS]
Sunil Kumar Kopparapu, Chitralekha Bhat, Ashish Panda
[ABSTRACT]
Spoken language assessment (SLA) systems restrict themselves to evaluating
the pronunciation and oral fluency of a speaker by analysing the read and
spontaneous spoken utterances respectively. The assessment of language grammar
or vocabulary is relegated to written language assessment (WLA) systems. Most
WLA systems present a set of sentences from a curated finite-size database of
sentences thereby making it possible to anticipate the test questions and train
oneself. In this paper, we propose a novel end-to-end SLA system to assess
language grammar from spoken utterances thus making WLA systems redundant;
additionally, we make the assessment largely unteachable by employing a large
language model (LLM) to bring in variations in the test. We further demonstrate
that a hybrid automatic speech recognition (ASR) with a custom-built language
model outperforms the state-of-the-art ASR engine for spoken grammar
assessment.
[COMMENTS]
5 pages, 2 figures
[LINK]
http://arxiv.org/abs/2410.01579v1
[DATE]
2024-10-02 22:15:13+08:00
[CATEGORIES]
cs.CL
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
[AUTHORS]
Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman
[ABSTRACT]
Mathematical reasoning continues to be a critical challenge in large language
model (LLM) development with significant interest. However, most of the
cutting-edge progress in mathematical reasoning with LLMs has become
closed-source due to lack of access to training data. This lack of data
access limits researchers from understanding the impact of different choices
for synthesizing and utilizing the data. With the goal of creating a
high-quality finetuning (SFT) dataset for math reasoning, we conduct careful
ablation experiments on data synthesis using the recently released
Llama3.1 family of models. Our experiments show that: (a) solution
format matters, with excessively verbose solutions proving detrimental to SFT
performance, (b) data generated by a strong teacher outperforms
on-policy data generated by a weak student model, (c) SFT is robust to
low-quality solutions, allowing for imprecise data filtering, and (d) question
diversity is crucial for achieving data scaling gains. Based on these insights,
we create the OpenMathInstruct-2 dataset, which consists of 14M
question-solution pairs (≈ 600K unique questions), making it nearly
eight times larger than the previous largest open-source math reasoning
dataset. Finetuning the Llama-3.1-8B-Base using OpenMathInstruct-2
outperforms Llama3.1-8B-Instruct on MATH by an absolute 15.9% (51.9%
→ 67.8%). Finally, to accelerate the open-source efforts, we
release the code, the finetuned models, and the OpenMathInstruct-2 dataset
under a commercially permissive license.
[LINK]
http://arxiv.org/abs/2410.01560v1
[DATE]
2024-10-02 22:00:09+08:00
[CATEGORIES]
cs.CL
cs.LG
AutoPal: Autonomous Adaptation to Users for Personal AI Companionship
[AUTHORS]
Yi Cheng, Wenge Liu, Kaishuai Xu, Wenjun Hou, Yi Ouyang, Chak Tou Leong, Xian Wu, Yefeng Zheng
[ABSTRACT]
Previous research has demonstrated the potential of AI agents to act as
companions that can provide constant emotional support for humans. In this
paper, we emphasize the necessity of autonomous adaptation in personal AI
companionship, an underexplored yet promising direction. Such adaptability is
crucial as it can facilitate more tailored interactions with users and allow
the agent to evolve in response to users’ changing needs. However, imbuing
agents with autonomous adaptability presents unique challenges, including
identifying optimal adaptations to meet users’ expectations and ensuring a
smooth transition during the adaptation process. To address them, we devise a
hierarchical framework, AutoPal, that enables controllable and authentic
adjustments to the agent’s persona based on user interactions. A
persona-matching dataset is constructed to facilitate the learning of optimal
persona adaptations. Extensive experiments demonstrate the effectiveness of
AutoPal and highlight the importance of autonomous adaptability in AI
companionship.
[LINK]
http://arxiv.org/abs/2406.13960v2
[DATE]
2024-10-02 21:59:40+08:00
[CATEGORIES]
cs.CL
Integrative Decoding: Improve Factuality via Implicit Self-consistency
[AUTHORS]
Yi Cheng, Xiao Liang, Yeyun Gong, Wen Xiao, Song Wang, Yuji Zhang, Wenjun Hou, Kaishuai Xu, Wenge Liu, Wenjie Li, Jian Jiao, Qi Chen, Peng Cheng, Wayne Xiong
[ABSTRACT]
Self-consistency-based approaches, which involve repeatedly sampling multiple
outputs and selecting the most consistent one as the final response, prove to
be remarkably effective in improving the factual accuracy of large language
models. Nonetheless, existing methods usually have strict constraints on the
task format, largely limiting their applicability. In this paper, we present
Integrative Decoding (ID), to unlock the potential of self-consistency in
open-ended generation tasks. ID operates by constructing a set of inputs, each
prepended with a previously sampled response, and then processes them
concurrently, with the next token being selected by aggregating all their
corresponding predictions at each decoding step. In essence, this simple
approach implicitly incorporates self-consistency in the decoding objective.
Extensive evaluation shows that ID consistently enhances factuality over a wide
range of language models, with substantial improvements on the TruthfulQA
(+11.2%), Biographies (+15.4%) and LongFact (+8.5%) benchmarks. The performance
gains amplify progressively as the number of sampled responses increases,
indicating the potential of ID to scale up with repeated sampling.
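The aggregation step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how predictions are aggregated, so averaging per-input log-probabilities is an assumption, and the function name `aggregate_next_token` is hypothetical.

```python
import math


def aggregate_next_token(logprob_rows):
    """Pick the next token by averaging log-probabilities across inputs.

    logprob_rows: list of lists; row i holds the log-probabilities over the
    vocabulary predicted from the input prepended with sampled response i.
    Averaging is an assumption made for this sketch.
    """
    if not logprob_rows:
        raise ValueError("need at least one prediction row")
    vocab_size = len(logprob_rows[0])
    if any(len(row) != vocab_size for row in logprob_rows):
        raise ValueError("all rows must cover the same vocabulary")
    n = len(logprob_rows)
    # Mean log-probability of each token id across all inputs.
    mean_logprobs = [
        sum(row[t] for row in logprob_rows) / n for t in range(vocab_size)
    ]
    # Greedy choice over the aggregated scores.
    best = max(range(vocab_size), key=lambda t: mean_logprobs[t])
    return best, mean_logprobs


# Three inputs (one per previously sampled response), toy vocabulary of size 3.
rows = [
    [math.log(0.6), math.log(0.3), math.log(0.1)],
    [math.log(0.2), math.log(0.7), math.log(0.1)],
    [math.log(0.3), math.log(0.6), math.log(0.1)],
]
token, scores = aggregate_next_token(rows)
```

In this toy case the first input alone would prefer token 0, but the aggregate prefers token 1, which is the sense in which the choice reflects consistency across the sampled responses.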
[LINK]
http://arxiv.org/abs/2410.01556v1
[DATE]
2024-10-02 21:52:55+08:00
[CATEGORIES]
cs.CL
cs.LG
ACE: A LLM-based Negotiation Coaching System
[AUTHORS]
Ryan Shea, Aymen Kallala, Xin Lucy Liu, Michael W. Morris, Zhou Yu
[COMMENTS]
EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.01555v1
[DATE]
2024-10-02 21:52:09+08:00
[CATEGORIES]
cs.CL
MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework
[AUTHORS]
Zonghai Yao, Zihao Zhang, Chaolong Tang, Xingyu Bian, Youxia Zhao, Zhichao Yang, Junda Wang, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Hong Yu
[ABSTRACT]
Artificial intelligence (AI) and large language models (LLMs) in healthcare
require advanced clinical skills (CS), yet current benchmarks fail to evaluate
these comprehensively. We introduce MedQA-CS, an AI-SCE framework inspired by
medical education’s Objective Structured Clinical Examinations (OSCEs), to
address this gap. MedQA-CS evaluates LLMs through two instruction-following
tasks, LLM-as-medical-student and LLM-as-CS-examiner, designed to reflect real
clinical scenarios. Our contributions include developing MedQA-CS, a
comprehensive evaluation framework with publicly available data and expert
annotations, and providing the quantitative and qualitative assessment of LLMs
as reliable judges in CS evaluation. Our experiments show that MedQA-CS is a
more challenging benchmark for evaluating clinical skills than traditional
multiple-choice QA benchmarks (e.g., MedQA). Combined with existing benchmarks,
MedQA-CS enables a more comprehensive evaluation of LLMs’ clinical capabilities
for both open- and closed-source LLMs.
[LINK]
http://arxiv.org/abs/2410.01553v1
[DATE]
2024-10-02 21:47:17+08:00
[CATEGORIES]
cs.CL
Unveiling the Invisible: Captioning Videos with Metaphors
[AUTHORS]
Abisek Rajakumar Kalarani, Pushpak Bhattacharyya, Sumit Shekhar
[ABSTRACT]
Metaphors are a common communication tool used in our day-to-day life. The
detection and generation of metaphors in textual form have been studied
extensively but metaphors in other forms have been under-explored. Recent
studies have shown that Vision-Language (VL) models cannot understand visual
metaphors in memes and adverts. As of now, no probing studies have been done
that involve complex language phenomena like metaphors with videos. Hence, we
introduce a new VL task of describing the metaphors present in the videos in
our work. To facilitate this novel task, we construct and release a manually
created dataset with 705 videos and 2115 human-written captions, along with a
new metric called Average Concept Distance (ACD), to automatically evaluate the
creativity of the metaphors generated. We also propose a novel low-resource
video metaphor captioning system: GIT-LLaVA, which obtains comparable
performance to SoTA video language models on the proposed task. We perform a
comprehensive analysis of existing video language models on this task and
publish our dataset, models, and benchmark results to enable further research.
[LINK]
http://arxiv.org/abs/2406.04886v2
[DATE]
2024-10-02 21:40:10+08:00
[CATEGORIES]
cs.CL
In-Context Transfer Learning: Demonstration Synthesis by Transferring Similar Tasks
[AUTHORS]
Dingzirui Wang, Xuangliang Zhang, Qiguang Chen, Longxu Dou, Xiao Xu, Rongyu Cao, Yingwei Ma, Qingfu Zhu, Wanxiang Che, Binhua Li, Fei Huang, Yongbin Li
[ABSTRACT]
In-context learning (ICL) is an effective approach to help large language
models (LLMs) adapt to various tasks by providing demonstrations of the target
task. Considering the high cost of labeling demonstrations, many methods
propose synthesizing demonstrations from scratch using LLMs. However, the
quality of the demonstrations synthesized from scratch is limited by the
capabilities and knowledge of LLMs. To address this, inspired by transfer
learning, we propose In-Context Transfer Learning (ICTL), which synthesizes
target task demonstrations by transferring labeled demonstrations from similar
source tasks. ICTL consists of two steps: source sampling and target transfer.
First, we define an optimization objective, which minimizes transfer error to
sample source demonstrations similar to the target task. Then, we employ LLMs
to transfer the sampled source demonstrations to the target task, matching the
definition and format of the target task. Experiments on Super-NI show that
ICTL outperforms synthesis from scratch by 2.0% on average, demonstrating the
effectiveness of our method.
[LINK]
http://arxiv.org/abs/2410.01548v1
[DATE]
2024-10-02 21:37:54+08:00
[CATEGORIES]
cs.CL
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
[AUTHORS]
Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang
[ABSTRACT]
Safety guard models that detect malicious queries aimed at large language
models (LLMs) are essential for ensuring the secure and responsible deployment
of LLMs in real-world applications. However, deploying existing safety guard
models with billions of parameters alongside LLMs on mobile devices is
impractical due to substantial memory requirements and latency. To reduce this
cost, we distill a large teacher safety guard model into a smaller one using a
labeled dataset of instruction-response pairs with binary harmfulness labels.
Due to the limited diversity of harmful instructions in the existing labeled
dataset, naively distilled models tend to underperform compared to larger
models. To bridge the gap between small and large models, we propose HarmAug, a
simple yet effective data augmentation method that involves jailbreaking an LLM
and prompting it to generate harmful instructions. Given a prompt such as,
“Make a single harmful instruction prompt that would elicit offensive content”,
we add an affirmative prefix (e.g., “I have an idea for a prompt:”) to the
LLM’s response. This encourages the LLM to continue generating the rest of the
response, leading to sampling harmful instructions. Another LLM generates a
response to the harmful instruction, and the teacher model labels the
instruction-response pair. We empirically show that our HarmAug outperforms
other relevant baselines. Moreover, a 435-million-parameter safety guard model
trained with HarmAug achieves an F1 score comparable to larger models with over
7 billion parameters, and even outperforms them in AUPRC, while operating at
less than 25% of their computational cost.
[LINK]
http://arxiv.org/abs/2410.01524v1
[DATE]
2024-10-02 21:12:13+08:00
[CATEGORIES]
cs.CL
InfiniPot: Infinite Context Processing on Memory-Constrained LLMs
[AUTHORS]
Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang
[COMMENTS]
EMNLP 2024 Main
[LINK]
http://arxiv.org/abs/2410.01518v1
[DATE]
2024-10-02 21:09:41+08:00
[CATEGORIES]
cs.CL
InstaTrans: An Instruction-Aware Translation Framework for Non-English Instruction Datasets
[AUTHORS]
Yungi Kim, Chanjun Park
[ABSTRACT]
It is challenging to generate high-quality instruction datasets for
non-English languages due to tail phenomena, which limit performance on less
frequently observed data. To mitigate this issue, we propose translating
existing high-quality English instruction datasets as a solution, emphasizing
the need for complete and instruction-aware translations to maintain the
inherent attributes of these datasets. We claim that fine-tuning LLMs with
datasets translated in this way can improve their performance in the target
language. To this end, we introduce a new translation framework tailored for
instruction datasets, named InstaTrans (INSTruction-Aware TRANSlation). Through
extensive experiments, we demonstrate the superiority of InstaTrans over other
competitors in terms of completeness and instruction-awareness of translation,
highlighting its potential to broaden the accessibility of LLMs across diverse
languages at a relatively low cost. Furthermore, we have validated that
fine-tuning LLMs with datasets translated by InstaTrans can effectively improve
their performance in the target language.
[LINK]
http://arxiv.org/abs/2410.01512v1
[DATE]
2024-10-02 21:02:23+08:00
[CATEGORIES]
cs.CL
Disentangling Latent Shifts of In-Context Learning Through Self-Training
[AUTHORS]
Josip Jukić, Jan Šnajder
[ABSTRACT]
In-context learning (ICL) has become essential in natural language
processing, particularly with autoregressive large language models capable of
learning from demonstrations provided within the prompt. However, ICL faces
challenges with stability and long contexts, especially as the number of
demonstrations grows, leading to poor generalization and inefficient inference.
To address these issues, we introduce STICL (Self-Training ICL), an approach
that disentangles the latent shifts of demonstrations from the latent shift of
the query through self-training. STICL employs a teacher model to generate
pseudo-labels and trains a student model using these labels, encoded in an
adapter module. The student model exhibits weak-to-strong generalization,
progressively refining its predictions over time. Our empirical results show
that STICL improves generalization and stability, consistently outperforming
traditional ICL methods and other disentangling strategies across both
in-domain and out-of-domain data.
[LINK]
http://arxiv.org/abs/2410.01508v1
[DATE]
2024-10-02 21:00:21+08:00
[CATEGORIES]
cs.CL
S2-Attention: Hardware-Aware Context Sharding Among Attention Heads
[AUTHORS]
Xihui Lin, Yunan Zhang, Suyu Ge, Liliang Ren, Barun Patra, Vishrav Chaudhary, Hao Peng, Xia Song
[ABSTRACT]
Sparse attention, which selectively attends to a subset of tokens in the
context, was supposed to be efficient. However, its theoretical reduction in
FLOPs has rarely translated into wall-clock speed-up over its dense attention
counterparts due to the lack of hardware-aware optimizations like
FlashAttention. Meanwhile, it remains unclear whether, and how, sparse
attention can maintain the model’s quality at the scale of today’s large language
models (LLMs). This paper presents Sparsely-Sharded (S2) Attention, a Triton library
that provides kernel optimization for sparse attention customizable at both
per-head and per-context-range levels. S2-Attention enables the exploration of
novel and high-performance sparse attention techniques, which we demonstrate
through extensive ablations across a wide range of sparse attention designs at
various model scales. From these insights, we present several basic guidelines
to design sparse attention that can achieve not only practical efficiency
improvements, but also strong downstream performance. To achieve high
parallelization and optimized memory IO, sparse attention should shard the
context heterogeneously across attention heads, where each head attends to a
different subset of tokens while collectively covering the full context.
Meanwhile, we find hybrid architectures combining sparse and dense attention
particularly beneficial in practice. S2-Attention achieves wall-clock speedup
of 8.79X, 15.87X, 25.3X compared to the strong FlashAttention-2 baseline with
strong downstream performance on par with full attention and perfect retrieval
performance at a 128k context length. At inference, for 7B models, our model,
with the help of our S2-Attention kernel, achieves 4.5x speed-up compared to
dense counterparts. S2-Attention is released with easy-to-customize APIs for
direct usage in Megatron and vLLM.
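The guideline that each head attends to a different subset of tokens while the heads collectively cover the full context can be illustrated with a toy assignment. This is a sketch under stated assumptions: the strided (round-robin) assignment and the name `shard_context` are hypothetical and are not taken from the S2-Attention library.

```python
def shard_context(seq_len, n_heads):
    """Assign token positions to heads in a strided (round-robin) fashion.

    Head h receives positions h, h + n_heads, h + 2 * n_heads, ...
    The strided rule is an illustrative assumption, not the library's scheme.
    """
    if seq_len < 0 or n_heads <= 0:
        raise ValueError("seq_len must be >= 0 and n_heads must be > 0")
    return [list(range(h, seq_len, n_heads)) for h in range(n_heads)]


shards = shard_context(seq_len=10, n_heads=4)
# Union of all per-head subsets; it should equal the full context.
covered = sorted(t for shard in shards for t in shard)
```

Each head sees roughly `seq_len / n_heads` positions, yet no position is dropped, which is the property the abstract describes.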
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2407.17678v3
[DATE]
2024-10-02 20:47:50+08:00
[CATEGORIES]
cs.CL
DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models
[AUTHORS]
Yuxuan Zhang, Ruizhe Li
[ABSTRACT]
Recent advancements in Large Language Models (LLMs) have achieved robust
performance across diverse tasks, but fine-tuning these models for specific
domains remains resource-intensive. Parameter-Efficient Fine-Tuning (PEFT)
methods like Low-Rank Adaptation (LoRA) address this challenge by fine-tuning a
small subset of parameters. However, existing methods for fusing multiple LoRAs
lack dynamic fusion based on contextual inputs and often increase inference
time due to token-level operations. We propose DLP-LoRA, a Dynamic Lightweight
Plugin that employs a mini-MLP module with only 5M parameters to dynamically
fuse multiple LoRAs at the sentence level using top-p sampling strategies. This
approach reduces inference time to less than twice that of single LoRA
inference by leveraging parallel computation. Evaluations across 26 tasks,
including multiple-choice questions and question answering, demonstrate
that DLP-LoRA achieves an average accuracy of 92.34% on multiple-choice
datasets and significant improvements in BLEU and ROUGE scores on QA datasets,
outperforming different LLM backbones under composite task settings. DLP-LoRA
effectively balances performance and efficiency, making it a practical solution
for dynamic multi-task adaptation in LLMs. Our code is available at
https://github.com/MeCuping/DLP-LoRA.
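The top-p selection mentioned in this abstract can be sketched generically. This is a minimal illustration of nucleus-style selection over per-LoRA scores; the function name `top_p_select`, the example probabilities, and the renormalization of the kept weights are assumptions, not the DLP-LoRA code.

```python
def top_p_select(probs, p):
    """Keep the smallest set of highest-probability items whose mass reaches p.

    probs: per-LoRA probabilities (hypothetically produced by a small
    classifier); returns {index: renormalized weight} for the kept items.
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    # Indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        if total >= p:
            break
    # Renormalize so the kept weights sum to 1 (an assumption of this sketch).
    return {i: probs[i] / total for i in chosen}


weights = top_p_select([0.5, 0.3, 0.15, 0.05], p=0.75)
```

With these numbers, the first two adapters are kept, because together they are the smallest prefix whose cumulative probability reaches the 0.75 threshold.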
[COMMENTS]
Preprint under review, 18 pages, 7 figures
[LINK]
http://arxiv.org/abs/2410.01497v1
[DATE]
2024-10-02 20:45:52+08:00
[CATEGORIES]
cs.CL
cs.LG
Extending Context Window of Large Language Models from a Distributional Perspective
[AUTHORS]
Yingsheng Wu, Yuxuan Gu, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin
[ABSTRACT]
Scaling the rotary position embedding (RoPE) has become a common method for
extending the context window of RoPE-based large language models (LLMs).
However, existing scaling methods often rely on empirical approaches and lack a
profound understanding of the internal distribution within RoPE, resulting in
suboptimal performance in extending the context window length. In this paper,
we propose to optimize the context window extending task from the view of
rotary angle distribution. Specifically, we first estimate the distribution of
the rotary angles within the model and analyze the extent to which length
extension perturbs this distribution. Then, we present a novel extension
strategy that minimizes the disturbance between rotary angle distributions to
maintain consistency with the pre-training phase, enhancing the model’s
capability to generalize to longer sequences. Experimental comparisons with
strong baseline methods demonstrate that our approach reduces the
distributional disturbance by up to 72% when extending LLaMA2’s context window
to 8k, and by up to 32% when extending to 16k. On the LongBench-E benchmark,
our method achieves an average improvement of up to 4.33% over existing
state-of-the-art methods. Furthermore, our method maintains the model’s
performance on the Hugging Face Open LLM benchmark after context window
extension, with only an average performance fluctuation ranging from -0.12 to
+0.22.
[COMMENTS]
14 pages, 8 figures, Accepted to EMNLP2024
[LINK]
http://arxiv.org/abs/2410.01490v1
[DATE]
2024-10-02 20:40:11+08:00
[CATEGORIES]
cs.CL
GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning
[AUTHORS]
Aleksander Ficek, Jiaqi Zeng, Oleksii Kuchaiev
[ABSTRACT]
Parameter-Efficient Fine-Tuning (PEFT) and Retrieval-Augmented Generation
(RAG) have become popular methods for adapting large language models while
minimizing compute requirements. In this paper, we apply PEFT methods
(P-tuning, Adapters, and LoRA) to a modified Retrieval-Enhanced Transformer
(RETRO) and a baseline GPT model across several sizes, ranging from 823 million
to 48 billion parameters. We show that RETRO models outperform GPT models in
zero-shot settings due to their unique pre-training process but GPT models have
higher performance potential with PEFT. Additionally, our study indicates that
8B parameter models strike an optimal balance between cost and performance and
P-tuning lags behind other PEFT techniques. We further provide a comparative
analysis between applying PEFT to an Instruction-tuned RETRO model and base
RETRO model. This work presents the first comprehensive comparison of various
PEFT methods integrated with RAG, applied to both GPT and RETRO models,
highlighting their relative performance.
[COMMENTS]
EMNLP 2024
[LINK]
http://arxiv.org/abs/2407.04528v2
[DATE]
2024-10-02 20:38:39+08:00
[CATEGORIES]
cs.CL
cs.LG
A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts
[AUTHORS]
Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng
[ABSTRACT]
Training and serving long-context large language models (LLMs) incurs
substantial overhead. To address this, two critical steps are often required: a
pretrained LLM typically undergoes a separate stage for context length
extension by training on long-context data, followed by architectural
modifications to reduce the overhead of KV cache during serving. This paper
argues that integrating length extension with a GPU-friendly KV cache reduction
architecture not only reduces training overhead during length extension, but
also achieves better long-context performance. This leads to our proposed
LongGen, which finetunes a pretrained LLM into an efficient architecture during
length extension. LongGen builds on three key insights: (1) Sparse attention
patterns, such as window attention (attending to recent tokens), attention sink
(initial ones), and blockwise sparse attention (strided token blocks) are
well-suited for building efficient long-context models, primarily due to their
GPU-friendly memory access patterns, enabling efficiency gains not just
theoretically but in practice as well. (2) It is essential for the model to
have direct access to all tokens. A hybrid architecture with 1/3 full attention
layers and 2/3 efficient ones achieves a balanced trade-off between efficiency
and long-context performance. (3) Lightweight training on 5B long-context data
is sufficient to extend the hybrid model’s context length from 4K to 128K.
We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its
effectiveness across different scales. During training with 128K-long contexts,
LongGen achieves 1.55x training speedup and reduces wall-clock time by 36%,
compared to a full-attention baseline. During inference, LongGen reduces KV
cache memory by 62%, achieving 1.67x prefilling speedup and 1.41x decoding
speedup.
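Two of the sparse patterns named in insight (1), window attention and attention sink, can be combined into a single causal mask. This is a minimal sketch and not LongGen's implementation: the function name and parameters are hypothetical, and the blockwise pattern is omitted because the abstract does not specify its layout.

```python
def window_sink_mask(seq_len, window, n_sink):
    """Build a causal mask where query i may attend to key j if
    j is one of the first n_sink tokens (attention sink) or
    j is among the `window` most recent tokens up to and including i.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i          # never attend to future tokens
            recent = i - j < window  # inside the sliding window
            sink = j < n_sink        # one of the initial tokens
            row.append(causal and (recent or sink))
        mask.append(row)
    return mask


mask = window_sink_mask(seq_len=6, window=2, n_sink=1)
```

Each query row keeps at most `window + n_sink` keys regardless of sequence length, so in this sketch the amount of KV state a layer needs stays bounded as the context grows.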
[LINK]
http://arxiv.org/abs/2410.01485v1
[DATE]
2024-10-02 20:35:53+08:00
[CATEGORIES]
cs.CL
Exploring Multilingual Concepts of Human Value in Large Language Models: Is Value Alignment Consistent, Transferable and Controllable across Languages?
[AUTHORS]
Shaoyang Xu, Weilong Dong, Zishan Guo, Xinwei Wu, Deyi Xiong
[COMMENTS]
EMNLP 2024 findings, code&dataset:
https://github.com/shaoyangxu/Multilingual-Human-Value-Concepts
[LINK]
http://arxiv.org/abs/2402.18120v3
[DATE]
2024-10-02 20:34:25+08:00
[CATEGORIES]
cs.CL
What’s Mine becomes Yours: Defining, Annotating and Detecting Context-Dependent Paraphrases in News Interview Dialogs
[AUTHORS]
Anna Wegmann, Tijs van den Broek, Dong Nguyen
[COMMENTS]
Accepted as main conference paper to EMNLP 2024
[LINK]
http://arxiv.org/abs/2404.06670v2
[DATE]
2024-10-02 20:26:54+08:00
[CATEGORIES]
cs.CL
Agent-Driven Large Language Models for Mandarin Lyric Generation
[AUTHORS]
Hong-Hsiang Liu, Yi-Wen Liu
[ABSTRACT]
Generative Large Language Models have shown impressive in-context learning
abilities, performing well across various tasks with just a prompt. Previous
melody-to-lyric research has been limited by scarce high-quality aligned data
and unclear standards for creativity. Most efforts focused on general themes
or emotions, which are less valuable given current language model capabilities.
In tonal contour languages like Mandarin, pitch contours are influenced by both
melody and tone, leading to variations in lyric-melody fit. Our study,
validated by the Mpop600 dataset, confirms that lyricists and melody writers
consider this fit during their composition process. In this research, we
developed a multi-agent system that decomposes the melody-to-lyric task into
sub-tasks, with each agent controlling rhyme, syllable count, lyric-melody
alignment, and consistency. Listening tests were conducted via a
diffusion-based singing voice synthesizer to evaluate the quality of lyrics
generated by different agent groups.
[COMMENTS]
6 pages, figures, Accepted at O-COCOSDA 2024
[LINK]
http://arxiv.org/abs/2410.01450v1
[DATE]
2024-10-02 20:01:32+08:00
[CATEGORIES]
cs.CL
SAAS: Solving Ability Amplification Strategy for Enhanced Mathematical Reasoning in Large Language Models
[AUTHORS]
Hyeonwoo Kim, Gyoungjin Gim, Yungi Kim, Jihoo Kim, Byungju Kim, Wonseok Lee, Chanjun Park
[COMMENTS]
Accepted to EMNLP 2024 Industry Track
[LINK]
http://arxiv.org/abs/2404.03887v4
[DATE]
2024-10-02 19:56:35+08:00
[CATEGORIES]
cs.CL
Geometric Signatures of Compositionality Across a Language Model’s Lifetime
[AUTHORS]
Jin Hwa Lee, Thomas Jiralerspong, Lei Yu, Yoshua Bengio, Emily Cheng
[COMMENTS]
Under review as a conference paper at ICLR 2025
[LINK]
http://arxiv.org/abs/2410.01444v1
[DATE]
2024-10-02 19:54:06+08:00
[CATEGORIES]
cs.CL
cs.LG
Dual-Phase Accelerated Prompt Optimization
[AUTHORS]
Muchen Yang, Moxin Li, Yongle Li, Zijun Chen, Chongming Gao, Junqi Zhang, Yangyang Li, Fuli Feng
[COMMENTS]
EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2406.13443v2
[DATE]
2024-10-02 19:46:10+08:00
[CATEGORIES]
cs.CL
Urdu Dependency Parsing and Treebank Development: A Syntactic and Morphological Perspective
[AUTHORS]
Nudrat Habib
[ABSTRACT]
Parsing is the process of analyzing a sentence’s syntactic structure by
breaking it down into its grammatical components, and is critical for various
linguistic applications. Urdu is a low-resource, free word-order language and
exhibits complex morphology. Literature suggests that dependency parsing is
well-suited for such languages. Our approach begins with a basic feature model
encompassing word location, head word identification, and dependency relations,
followed by a more advanced model integrating part-of-speech (POS) tags and
morphological attributes (e.g., suffixes, gender). We manually annotated a
corpus of news articles of varying complexity. Using Maltparser and the
NivreEager algorithm, we achieved a best labeled accuracy (LA) of 70% and an
unlabeled attachment score (UAS) of 84%, demonstrating the feasibility of
dependency parsing for Urdu.
[LINK]
http://arxiv.org/abs/2406.09549v2
[DATE]
2024-10-02 19:44:26+08:00
[CATEGORIES]
cs.CL
cs.LG
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
[AUTHORS]
Philipp Mondorf, Sondre Wold, Barbara Plank
[ABSTRACT]
A fundamental question in interpretability research is to what extent neural
networks, particularly language models, implement reusable functions via
subnetworks that can be composed to perform more complex tasks. Recent
developments in mechanistic interpretability have made progress in identifying
subnetworks, often referred to as circuits, which represent the minimal
computational subgraph responsible for a model’s behavior on specific tasks.
However, most studies focus on identifying circuits for individual tasks
without investigating how functionally similar circuits relate to each other.
To address this gap, we examine the modularity of neural networks by analyzing
circuits for highly compositional subtasks within a transformer-based language
model. Specifically, given a probabilistic context-free grammar, we identify
and compare circuits responsible for ten modular string-edit operations. Our
results indicate that functionally similar circuits exhibit both notable node
overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits
identified can be reused and combined through subnetwork set operations to
represent more complex functional capabilities of the model.
[COMMENTS]
24 pages, 17 figures
[LINK]
http://arxiv.org/abs/2410.01434v1
[DATE]
2024-10-02 19:36:45+08:00
[CATEGORIES]
cs.LG
cs.CL
Document-Level In-Context Few-Shot Relation Extraction via Pre-Trained Language Models
[AUTHORS]
Yilmazcan Ozyurt, Stefan Feuerriegel, Ce Zhang
[ABSTRACT]
Document-level relation extraction aims at inferring structured human
knowledge from textual documents. State-of-the-art methods for this task use
pre-trained language models (LMs) via fine-tuning, yet fine-tuning is
computationally expensive and cannot adapt to new relation types or new LMs. As
a remedy, we leverage the generalization capabilities of pre-trained LMs and
present a novel framework for document-level in-context few-shot relation
extraction. Our framework has three strengths: it eliminates the need (1) for
named entity recognition and (2) for human annotations of documents, and (3) it
can be updated to new LMs without re-training. We evaluate our framework using
DocRED, the largest publicly available dataset for document-level relation
extraction, and demonstrate that our framework achieves state-of-the-art
performance. We further show that our framework actually performs much better
than the original labels from the development set of DocRED. Finally, we
conduct an extensive benchmark demonstrating the effectiveness of our
framework, achieving state-of-the-art results across six relation extraction
datasets and outperforming more than 30 baseline methods. Unlike our framework,
the baseline methods have large computational overhead (e.g., from
fine-tuning). To the best of our knowledge, we are the first to reformulate the
document-level relation extraction task as a tailored in-context few-shot
learning paradigm.
[LINK]
http://arxiv.org/abs/2310.11085v4
[DATE]
2024-10-02 19:35:45+08:00
[CATEGORIES]
cs.CL
cs.LG
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
[AUTHORS]
Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee
[ABSTRACT]
In evaluating the long-context capabilities of large language models (LLMs),
benchmarks such as “Needle-in-a-Haystack” (NIAH), Ruler, and Needlebench are
commonly used. While these benchmarks measure how well models understand
long-context input sequences, they do not effectively gauge the quality of
long-form text generation, a critical aspect for applications such as design
proposals and creative writing. To address this gap, we have introduced a new
long-form text evaluation benchmark, LongGenBench, which tests models’ ability
to identify specific events within generated long text sequences. In this
benchmark, we prompt long-context LMs to create long-form text that must
include particular events or constraints and evaluate their ability to
incorporate these elements. We evaluated ten long-context LMs across four
distinct scenarios, three types of prompt instructions, and two different
generation-length settings (16K and 32K). Although these models perform well on
NIAH benchmarks, none demonstrated satisfactory performance on the
LongGenBench, raising concerns about their ability to generate coherent
long-form text that follows instructions. Additionally, as the length of the
generated text increases, all models exhibit a significant drop in performance.
[COMMENTS]
work in progress; Github: https://github.com/mozhu621/LongGenBench/
[LINK]
http://arxiv.org/abs/2409.02076v5
[DATE]
2024-10-02 19:29:18+08:00
[CATEGORIES]
cs.CL
Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks
[AUTHORS]
Xingxuan Li, Weiwen Xu, Ruochen Zhao, Fangkai Jiao, Shafiq Joty, Lidong Bing
[ABSTRACT]
State-of-the-art large language models (LLMs) exhibit impressive
problem-solving capabilities but may struggle with complex reasoning and
factual correctness. Existing methods harness the strengths of chain-of-thought
and retrieval-augmented generation (RAG) to decompose a complex problem into
simpler steps and apply retrieval to improve factual correctness. These methods
work well on straightforward reasoning tasks but often falter on challenging
tasks such as competitive programming and mathematics, due to frequent
reasoning errors and irrelevant knowledge retrieval. To address this, we
introduce Critic-guided planning with Retrieval-augmentation, CR-Planner, a
novel framework that leverages fine-tuned critic models to guide both reasoning
and retrieval processes through planning. CR-Planner solves a problem by
iteratively selecting and executing sub-goals. Initially, it identifies the
most promising sub-goal from reasoning, query generation, and retrieval, guided
by rewards given by a critic model named sub-goal critic. It then executes this
sub-goal through sampling and selecting the optimal output based on evaluations
from another critic model named execution critic. This iterative process,
informed by retrieved information and critic models, enables CR-Planner to
effectively navigate the solution space towards the final answer. We employ
Monte Carlo Tree Search to collect the data for training the critic models,
allowing for a systematic exploration of action sequences and their long-term
impacts. We validate CR-Planner on challenging domain-knowledge-intensive and
reasoning-heavy tasks, including competitive programming, theorem-driven math
reasoning, and complex domain retrieval problems. Our experiments demonstrate
that CR-Planner significantly outperforms baselines, highlighting its
effectiveness in addressing challenging problems by improving both reasoning
and retrieval.
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2410.01428v1
[DATE]
2024-10-02 19:26:02+08:00
[CATEGORIES]
cs.CL
Model-based Preference Optimization in Abstractive Summarization without Human Feedback
[AUTHORS]
Jaepill Choi, Kyubyung Chae, Jiwoo Song, Yohan Jo, Taesup Kim
[COMMENTS]
Accepted by EMNLP 2024
[LINK]
http://arxiv.org/abs/2409.18618v3
[DATE]
2024-10-02 19:08:29+08:00
[CATEGORIES]
cs.CL
An LLM Feature-based Framework for Dialogue Constructiveness Assessment
[AUTHORS]
Lexin Zhou, Youmna Farag, Andreas Vlachos
[COMMENTS]
Paper accepted by EMNLP 2024
[LINK]
http://arxiv.org/abs/2406.14760v2
[DATE]
2024-10-02 19:03:16+08:00
[CATEGORIES]
cs.CL
cs.LG
Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
[AUTHORS]
Yixin Cheng, Markos Georgopoulos, Volkan Cevher, Grigorios G. Chrysos
[ABSTRACT]
Large Language Models (LLMs) are susceptible to Jailbreaking attacks, which
aim to extract harmful information by subtly modifying the attack query. As
defense mechanisms evolve, directly obtaining harmful information becomes
increasingly challenging for Jailbreaking attacks. In this work, inspired by
Chomsky’s transformational-generative grammar theory and human practices of
indirect context to elicit harmful information, we focus on a new attack form,
called Contextual Interaction Attack. We contend that the prior
context (the information preceding the attack query) plays a pivotal
role in enabling strong Jailbreaking attacks. Specifically, we propose a first
multi-turn approach that leverages benign preliminary questions to interact
with the LLM. Due to the autoregressive nature of LLMs, which use previous
conversation rounds as context during generation, we guide the model’s
question-response pair to construct a context that is semantically aligned with
the attack query to execute the attack. We conduct experiments on seven
different LLMs and demonstrate the efficacy of this attack, which is black-box
and can also transfer across LLMs. We believe this can lead to further
developments and understanding of security in LLMs.
[COMMENTS]
29 pages
[LINK]
http://arxiv.org/abs/2402.09177v2
[DATE]
2024-10-02 18:43:07+08:00
[CATEGORIES]
cs.LG
cs.CL
Cross-Domain Content Generation with Domain-Specific Small Language Models
[AUTHORS]
Ankit Maloo, Abhinav Garg
[ABSTRACT]
Generating domain-specific content using small language models poses
challenges, especially when dealing with multiple distinct datasets with
minimal overlap. In this study, we explore methods to enable a small language
model to produce coherent and relevant outputs for two different domains:
stories (Dataset A) and recipes (Dataset B). Our initial experiments show that
training individual models on each dataset yields satisfactory results, with
each model generating appropriate content within its domain. We find that
utilizing custom tokenizers tailored to each dataset significantly enhances
generation quality compared to using a generic tokenizer. Attempts to adapt a
single model to both domains using Low-Rank Adaptation (LoRA) or standard
fine-tuning do not yield substantial results, often failing to produce
meaningful outputs. Moreover, full fine-tuning without freezing the model’s
existing weights leads to catastrophic forgetting, where the model loses
previously learned information and only retains knowledge from the new data. To
overcome these challenges, we employ a knowledge expansion strategy: training
only with additional parameters. This approach enables the model to generate
both stories and recipes upon request, effectively handling multiple domains
without suffering from catastrophic forgetting. Our findings demonstrate that
knowledge expansion with frozen layers is an effective method for small
language models to generate domain-specific content across distinct datasets.
This work contributes to the development of efficient multi-domain language
models and provides insights into managing catastrophic forgetting in
small-scale architectures.
[COMMENTS]
15 pages
[LINK]
http://arxiv.org/abs/2409.17171v2
[DATE]
2024-10-02 18:28:02+08:00
[CATEGORIES]
cs.CL
Question-guided Knowledge Graph Re-scoring and Injection for Knowledge Graph Question Answering
[AUTHORS]
Yu Zhang, Kehai Chen, Xuefeng Bai, Zhao Kang, Quanjiang Guo, Min Zhang
[ABSTRACT]
Knowledge graph question answering (KGQA) involves answering natural language
questions by leveraging structured information stored in a knowledge graph.
Typically, KGQA systems first retrieve a targeted subgraph from a large-scale
knowledge graph, which serves as the basis for reasoning models to address
queries. However, the retrieved subgraph inevitably introduces distracting
information, impeding the model’s ability to perform
accurate reasoning. To address this issue, we propose a Question-guided
Knowledge Graph Re-scoring method (Q-KGR) to eliminate noisy pathways for the
input question, thereby focusing specifically on pertinent factual knowledge.
Moreover, we introduce Knowformer, a parameter-efficient method for injecting
the re-scored knowledge graph into large language models to enhance their
ability to perform factual reasoning. Extensive experiments on multiple KGQA
benchmarks demonstrate the superiority of our method over existing systems.
[COMMENTS]
findings of EMNLP2024
[LINK]
http://arxiv.org/abs/2410.01401v1
[DATE]
2024-10-02 18:27:07+08:00
[CATEGORIES]
cs.CL
PairDistill: Pairwise Relevance Distillation for Dense Retrieval
[AUTHORS]
Chao-Wei Huang, Yun-Nung Chen
[ABSTRACT]
Effective information retrieval (IR) from vast datasets relies on advanced
techniques to extract relevant information in response to queries. Recent
advancements in dense retrieval have showcased remarkable efficacy compared to
traditional sparse retrieval methods. To further enhance retrieval performance,
knowledge distillation techniques, often leveraging robust cross-encoder
rerankers, have been extensively explored. However, existing approaches
primarily distill knowledge from pointwise rerankers, which assign absolute
relevance scores to documents, thus facing challenges related to inconsistent
comparisons. This paper introduces Pairwise Relevance Distillation
(PairDistill) to leverage pairwise reranking, offering fine-grained
distinctions between similarly relevant documents to enrich the training of
dense retrieval models. Our experiments demonstrate that PairDistill
outperforms existing methods, achieving new state-of-the-art results across
multiple benchmarks. This highlights the potential of PairDistill in advancing
dense retrieval techniques effectively. Our source code and trained models are
released at https://github.com/MiuLab/PairDistill
[COMMENTS]
Accepted to EMNLP 2024 Main Conference
[LINK]
http://arxiv.org/abs/2410.01383v1
[DATE]
2024-10-02 17:51:42+08:00
[CATEGORIES]
cs.CL
PCQPR: Proactive Conversational Question Planning with Reflection
[AUTHORS]
Shasha Guo, Lizi Liao, Jing Zhang, Cuiping Li, Hong Chen
[ABSTRACT]
Conversational Question Generation (CQG) enhances the interactivity of
conversational question-answering systems in fields such as education, customer
service, and entertainment. However, traditional CQG, focusing primarily on the
immediate context, lacks the conversational foresight necessary to guide
conversations toward specified conclusions. This limitation significantly
restricts their ability to achieve conclusion-oriented conversational outcomes.
In this work, we redefine the CQG task as Conclusion-driven Conversational
Question Generation (CCQG) by focusing on proactivity, not merely reacting to
the unfolding conversation but actively steering it towards a
conclusion-oriented question-answer pair. To address this, we propose a novel
approach, called Proactive Conversational Question Planning with self-Refining
(PCQPR). Concretely, by integrating a planning algorithm inspired by Monte
Carlo Tree Search (MCTS) with the analytical capabilities of large language
models (LLMs), PCQPR predicts future conversation turns and continuously
refines its questioning strategies. This iterative self-refining mechanism
ensures the generation of contextually relevant questions strategically devised
to reach a specified outcome. Our extensive evaluations demonstrate that PCQPR
significantly surpasses existing CQG methods, marking a paradigm shift towards
conclusion-oriented conversational question-answering systems.
[COMMENTS]
Accepted by EMNLP 2024 Main
[LINK]
http://arxiv.org/abs/2410.01363v1
[DATE]
2024-10-02 17:23:07+08:00
[CATEGORIES]
cs.CL
How Reliable Are Automatic Evaluation Methods for Instruction-Tuned LLMs?
[AUTHORS]
Ehsan Doostmohammadi, Oskar Holmström, Marco Kuhlmann
[ABSTRACT]
Work on instruction-tuned Large Language Models (LLMs) has used automatic
methods based on text overlap and LLM judgments as cost-effective alternatives
to human evaluation. In this paper, we perform a meta-evaluation of such
methods and assess their reliability across a broad range of tasks. In
evaluating how well automatic methods align with human evaluations, correlation
metrics are the most commonly employed method despite their inherent
limitations when dealing with ties and different scales. To address these
shortcomings, we use Pairwise Accuracy as an alternative to standard
correlation measures. We observe that while automatic evaluation methods can
approximate human ratings under specific conditions, their validity is highly
context-dependent. Specifically, the simple ROUGE-L metric correlates very well
with human ratings for short-answer English tasks but is unreliable in
free-form generation tasks and cross-lingual scenarios. The effectiveness of
the more advanced method of using GPT-4 as a judge diminishes significantly if
reference answers are not included in the prompt, which is the scenario where
this method has the potential to provide the most value compared to other
metrics. Our findings enhance the understanding of how automatic methods should
be applied and interpreted when developing and evaluating instruction-tuned
LLMs.
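
To make the Pairwise Accuracy idea above concrete, here is a minimal sketch. It assumes one common formulation: the fraction of item pairs on which the automatic metric and the human ratings agree about the ordering, with a tie counting as its own outcome. The abstract does not spell out the exact definition used in the paper, so treat the function below as an illustration, not the authors' implementation.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of item pairs ordered the same way (higher, lower, or tied)
    by the automatic metric and by the human ratings."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    pairs = list(combinations(range(len(metric_scores)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agreements = sum(
        sign(metric_scores[i] - metric_scores[j])
        == sign(human_scores[i] - human_scores[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

Because only the direction of each pair is compared, the two score lists can live on different scales (for example ROUGE-L in [0, 1] against a 1-5 human rating), and tied human ratings are handled explicitly instead of being left to the tie conventions of a correlation coefficient.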
[LINK]
http://arxiv.org/abs/2402.10770v4
[DATE]
2024-10-02 17:18:47+08:00
[CATEGORIES]
cs.CL
Assisted Data Annotation for Business Process Information Extraction from Textual Documents
[AUTHORS]
Julian Neuberger, Han van der Aa, Lars Ackermann, Daniel Buschek, Jannic Herrmann, Stefan Jablonski
[ABSTRACT]
Machine-learning based generation of process models from natural language
text process descriptions provides a solution for the time-intensive and
expensive process discovery phase. Many organizations have to carry out this
phase, before they can utilize business process management and its benefits.
Yet, research towards this is severely restrained by an apparent lack of large
and high-quality datasets. This lack of data can be attributed to, among other
things, an absence of proper tool assistance for dataset creation, resulting in
high workloads and inferior data quality. We explore two assistance features to
support dataset creation, a recommendation system for identifying process
information in the text and visualization of the current state of already
identified process information as a graphical business process model. A
controlled user study with 31 participants shows that assisting dataset
creators with recommendations lowers all aspects of workload, by up to 51.0%,
and significantly improves annotation quality, by up to 38.9%. We make all
data and code available to encourage further research on additional novel
assistance strategies.
[LINK]
http://arxiv.org/abs/2410.01356v1
[DATE]
2024-10-02 17:14:39+08:00
[CATEGORIES]
cs.CL
Moshi: a speech-text foundation model for real-time dialogue
[AUTHORS]
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
[ABSTRACT]
We introduce Moshi, a speech-text foundation model and full-duplex spoken
dialogue framework. Current systems for spoken dialogue rely on pipelines of
independent components, namely voice activity detection, speech recognition,
textual dialogue and text-to-speech. Such frameworks cannot emulate the
experience of real conversations. First, their complexity induces a latency of
several seconds between interactions. Second, text being the intermediate
modality for dialogue, non-linguistic information that modifies meaning – such
as emotion or non-speech sounds – is lost in the interaction. Finally, they
rely on a segmentation into speaker turns, which does not take into account
overlapping speech, interruptions and interjections. Moshi solves these
independent issues altogether by casting spoken dialogue as speech-to-speech
generation. Starting from a text language model backbone, Moshi generates
speech as tokens from the residual quantizer of a neural audio codec, while
modeling separately its own speech and that of the user into parallel streams.
This allows for the removal of explicit speaker turns, and the modeling of
arbitrary conversational dynamics. We moreover extend the hierarchical
semantic-to-acoustic token generation of previous work to first predict
time-aligned text tokens as a prefix to audio tokens. Not only does this “Inner
Monologue” method significantly improve the linguistic quality of generated
speech, but we also illustrate how it can provide streaming speech recognition
and text-to-speech. Our resulting model is the first real-time full-duplex
spoken large language model, with a theoretical latency of 160ms, 200ms in
practice, and is available at https://github.com/kyutai-labs/moshi.
[LINK]
http://arxiv.org/abs/2410.00037v2
[DATE]
2024-10-02 17:11:45+08:00
[CATEGORIES]
cs.CL
cs.LG
FIZZ: Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document
[AUTHORS]
Joonho Yang, Seunghyun Yoon, Byeongjeong Kim, Hwanhee Lee
[COMMENTS]
Published as a main conference paper at EMNLP 2024
[LINK]
http://arxiv.org/abs/2404.11184v3
[DATE]
2024-10-02 17:07:15+08:00
[CATEGORIES]
cs.CL
Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models
[AUTHORS]
Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, Bing Liu
[ABSTRACT]
Model merging, such as model souping, is the practice of combining different
models with the same architecture together without further training. In this
work, we present a model merging methodology that addresses the difficulty of
fine-tuning Large Language Models (LLMs) for target tasks in non-English
languages, where task-specific data is often unavailable. We focus on
mathematical reasoning and, without in-language math data, facilitate
cross-lingual transfer by composing language and math capabilities. Starting
from the same pretrained model, we fine-tune separate “experts” on math
instruction data in English and on generic instruction data in the target
language. We then replace the top and bottom transformer layers of the math
expert directly with layers from the language expert, which consequently
enhances math performance in the target language. The resulting merged models
outperform the individual experts and other merging methods on the math
benchmark, MGSM, by 10% across four major languages where math instruction data
is scarce. In addition, this layer swapping is simple, inexpensive, and
intuitive, as it is based on an interpretative analysis of the most important
parameter changes during the fine-tuning of each expert. The ability to
successfully re-compose LLMs for cross-lingual transfer in this manner opens up
future possibilities to combine model expertise, create modular solutions, and
transfer reasoning capabilities across languages all post hoc.
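
The layer-swapping step described above can be sketched as a plain dictionary operation. The sketch makes several assumptions that the abstract does not fix: both experts are represented as name-to-weight mappings with identical keys, per-layer parameters are named `layers.<index>.<param>` (a hypothetical naming scheme), and the number of bottom and top layers to swap is left as a free parameter.

```python
def swap_layers(math_expert, language_expert, num_layers, k_bottom, k_top):
    """Return a copy of math_expert whose first k_bottom and last k_top
    transformer layers are taken from language_expert.

    Parameters outside the transformer layers (e.g. embeddings) are kept
    from the math expert in this sketch.
    """
    swapped = set(range(k_bottom)) | set(range(num_layers - k_top, num_layers))
    merged = dict(math_expert)  # shallow copy; the inputs are not modified
    for name, weight in language_expert.items():
        parts = name.split(".")
        is_layer_param = (
            len(parts) > 2 and parts[0] == "layers" and parts[1].isdigit()
        )
        if is_layer_param and int(parts[1]) in swapped:
            merged[name] = weight
    return merged
```

With real checkpoints, the same loop would run over the two experts' state dictionaries after adapting the key pattern to the model's actual parameter names.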
[COMMENTS]
11 main pages, 23 pages total, 9 figures, 5 tables
[LINK]
http://arxiv.org/abs/2410.01335v1
[DATE]
2024-10-02 16:53:07+08:00
[CATEGORIES]
cs.CL
cs.LG
Unveiling Language Skills under Circuits
[AUTHORS]
Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang
[ABSTRACT]
The exploration of language skills in language models (LMs) has always been
one of the central goals in mechanistic interpretability. However, existing
circuit analyses often fall short in representing the full functional scope of
these models, primarily due to the exclusion of Feed-Forward layers.
Additionally, isolating the effect of a single language skill from a text,
which inherently involves multiple entangled skills, poses a significant
challenge. To address these gaps, we introduce a novel concept, Memory Circuit,
a minimum unit that fully and independently manipulates the memory-reading
functionality of a language model, and disentangle the transformer model
precisely into a circuit graph which is an ensemble of paths connecting
different memory circuits. Based on this disentanglement, we identify salient
circuit paths, called skill paths, responsible for three crucial language
skills, i.e., the Previous Token Skill, Induction Skill and In-Context Learning
(ICL) Skill, leveraging causal effect estimation through interventions and
counterfactuals. Our experiments on various datasets confirm the correspondence
between our identified skill paths and language skills, and validate three
longstanding hypotheses: 1) Language skills are identifiable through circuit
dissection; 2) Simple language skills reside in shallow layers, whereas complex
language skills are found in deeper layers; 3) Complex language skills are
formed on top of simpler language skills. Our code is available at:
https://github.com/Zodiark-ch/Language-Skill-of-LLMs.
[LINK]
http://arxiv.org/abs/2410.01334v1
[DATE]
2024-10-02 16:52:58+08:00
[CATEGORIES]
cs.CL
Bayesian WeakS-to-Strong from Text Classification to Generation
[AUTHORS]
Ziyun Cui, Ziyang Zhang, Wen Wu, Guangzhi Sun, Chao Zhang
[ABSTRACT]
Advances in large language models raise the question of how alignment
techniques will adapt as models become increasingly complex and humans will
only be able to supervise them weakly. Weak-to-Strong mimics such a scenario
where weak model supervision attempts to harness the full capabilities of a
much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by
exploring an ensemble of weak models which simulate the variability in human
opinions. Confidence scores are estimated using a Bayesian approach to guide
the WeakS-to-Strong generalization. Furthermore, we extend the application of
WeakS-to-Strong from text classification tasks to text generation tasks where
more advanced strategies are investigated for supervision. Moreover, direct
preference optimization is applied to advance the student model’s preference
learning, beyond the basic learning framework of teacher forcing. Results
demonstrate the effectiveness of the proposed approach for the reliability of a
strong student model, showing potential for superalignment.
[LINK]
http://arxiv.org/abs/2406.03199v2
[DATE]
2024-10-02 16:45:32+08:00
[CATEGORIES]
cs.CL
cs.LG
Nebula: A discourse aware Minecraft Builder
[AUTHORS]
Akshay Chaturvedi, Kate Thompson, Nicholas Asher
[ABSTRACT]
When engaging in collaborative tasks, humans efficiently exploit the semantic
structure of a conversation to optimize verbal and nonverbal interactions. But
in recent “language to code” or “language to action” models, this information
is lacking. We show how incorporating the prior discourse and nonlinguistic
context of a conversation situated in a nonlinguistic environment can improve
the “language to action” component of such interactions. We finetune an LLM to
predict actions based on prior context; our model, Nebula, doubles the
net-action F1 score over the baseline on this task of Jayannavar et al. (2020).
We also investigate our model’s ability to construct shapes and understand
location descriptions using a synthetic dataset.
[COMMENTS]
EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2406.18164v2
[DATE]
2024-10-02 16:40:36+08:00
[CATEGORIES]
cs.CL
cs.LG
Are LLMs Effective Negotiators? Systematic Evaluation of the Multifaceted Capabilities of LLMs in Negotiation Dialogues
[AUTHORS]
Deuksin Kwon, Emily Weiss, Tara Kulshrestha, Kushal Chawla, Gale M. Lucas, Jonathan Gratch
[ABSTRACT]
A successful negotiation requires a range of capabilities, including
comprehension of the conversation context, Theory-of-Mind (ToM) skills to infer
the partner’s motives, strategic reasoning, and effective communication, making
it challenging for automated systems. Despite the remarkable performance of
LLMs in various NLP tasks, there is no systematic evaluation of their
capabilities in negotiation. Such an evaluation is critical for advancing AI
negotiation agents and negotiation research, ranging from designing dialogue
systems to providing pedagogical feedback and scaling up data collection
practices. This work aims to systematically analyze the multifaceted
capabilities of LLMs across diverse dialogue scenarios throughout the stages of
a typical negotiation interaction. Our analysis highlights GPT-4’s superior
performance in many tasks while identifying specific challenges, such as making
subjective assessments and generating contextually appropriate, strategically
advantageous responses.
[COMMENTS]
Accepted to Findings of EMNLP 2024
[LINK]
http://arxiv.org/abs/2402.13550v2
[DATE]
2024-10-02 16:32:31+08:00
[CATEGORIES]
cs.CL
Llamipa: An Incremental Discourse Parser
[AUTHORS]
Kate Thompson, Akshay Chaturvedi, Julie Hunter, Nicholas Asher
[ABSTRACT]
This paper provides the first discourse parsing experiments with a large
language model (LLM) finetuned on corpora annotated in the style of SDRT
(Segmented Discourse Representation Theory; Asher, 1993; Asher and Lascarides,
2003). The result is a discourse parser, Llamipa (Llama Incremental Parser),
that leverages discourse context, leading to substantial performance gains over
approaches that use encoder-only models to provide local, context-sensitive
representations of discourse units. Furthermore, it can process discourse data
incrementally, which is essential for the eventual use of discourse information
in downstream tasks.
[COMMENTS]
EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2406.18256v2
[DATE]
2024-10-02 16:28:36+08:00
[CATEGORIES]
cs.CL
Emotion-Aware Response Generation Using Affect-Enriched Embeddings with LLMs
[AUTHORS]
Abdur Rasool, Muhammad Irfan Shahzad, Hafsa Aslam, Vincent Chan
[LINK]
http://arxiv.org/abs/2410.01306v1
[DATE]
2024-10-02 16:01:05+08:00
[CATEGORIES]
cs.CL
Revisiting Hierarchical Text Classification: Inference and Metrics
[AUTHORS]
Roman Plaud, Matthieu Labeau, Antoine Saillenfest, Thomas Bonald
[ABSTRACT]
Hierarchical text classification (HTC) is the task of assigning labels to a
text within a structured space organized as a hierarchy. Recent works treat HTC
as a conventional multilabel classification problem, therefore evaluating it as
such. We instead propose to evaluate models based on specifically designed
hierarchical metrics and we demonstrate the intricacy of metric choice and
prediction inference method. We introduce a new challenging dataset and we
fairly evaluate recent sophisticated models, comparing them with a range of
simple but strong baselines, including a new theoretically motivated loss.
Finally, we show that those baselines are very often competitive with the
latest models. This highlights the importance of carefully considering the
evaluation methodology when proposing new methods for HTC. Code implementation
and dataset are available at https://github.com/RomanPlaud/revisitingHTC.
[COMMENTS]
Accepted at CoNLL 2024
[LINK]
http://arxiv.org/abs/2410.01305v1
[DATE]
2024-10-02 15:57:33+08:00
[CATEGORIES]
cs.CL
cs.LG
Pruning Multilingual Large Language Models for Multilingual Inference
[AUTHORS]
Hwichan Kim, Jun Suzuki, Tosho Hirasawa, Mamoru Komachi
[ABSTRACT]
Multilingual large language models (MLLMs), trained on multilingual balanced
data, demonstrate better zero-shot learning performance in non-English
languages compared to large language models trained on English-dominant data.
However, the disparity in performance between English and non-English languages
remains a challenge yet to be fully addressed. A distinctive characteristic of
MLLMs is their high-quality translation capabilities, indicating an acquired
proficiency in aligning between languages. This study explores how to enhance
the zero-shot performance of MLLMs in non-English languages by leveraging their
alignment capability between English and non-English languages. To achieve
this, we first analyze the behavior of MLLMs when performing translation and
reveal that there are large magnitude features that play a critical role in the
translation process. Inspired by these findings, we retain the weights
associated with operations involving the large magnitude features and prune
other weights to force MLLMs to rely on these features for tasks beyond
translation. We empirically demonstrate that this pruning strategy can enhance
the MLLMs’ performance in non-English languages.
[COMMENTS]
Accepted at EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2409.16911v2
[DATE]
2024-10-02 15:52:56+08:00
[CATEGORIES]
cs.CL
Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision
[AUTHORS]
Fan Jiang, Tom Drummond, Trevor Cohn
[ABSTRACT]
Cross-lingual open domain question answering (CLQA) is a complex problem,
comprising cross-lingual retrieval from a multilingual knowledge base, followed
by answer generation in the query language. Both steps are usually tackled by
separate models, requiring substantial annotated datasets, and typically
auxiliary resources, like machine translation systems to bridge between
languages. In this paper, we show that CLQA can be addressed using a single
encoder-decoder model. To effectively train this model, we propose a
self-supervised method based on exploiting the cross-lingual link structure
within Wikipedia. We demonstrate how linked Wikipedia pages can be used to
synthesise supervisory signals for cross-lingual retrieval, through a form of
cloze query, and generate more natural questions to supervise answer
generation. Together, we show our approach, CLASS, outperforms
comparable methods on both supervised and zero-shot language adaptation
settings, including those using machine translation.
[COMMENTS]
EMNLP 2024 Main
[LINK]
http://arxiv.org/abs/2402.16508v3
[DATE]
2024-10-02 15:51:47+08:00
[CATEGORIES]
cs.CL
Endless Jailbreaks with Bijection Learning
[AUTHORS]
Brian R. Y. Huang, Maximilian Li, Leonard Tang
[ABSTRACT]
Despite extensive safety training, LLMs are vulnerable to adversarial inputs.
In this work, we introduce a simple but powerful attack paradigm, bijection
learning, that yields a practically endless set of jailbreak prompts. We
exploit language models’ advanced reasoning capabilities to teach them
invertible languages (bijections) in context, pass encoded queries to the model
to bypass built-in safety mechanisms, and finally decode responses back into
English, yielding helpful replies to harmful requests. Our approach proves
effective on a wide range of frontier language models and harm categories.
Bijection learning is an automated and universal attack that grows stronger
with scale: larger models with more advanced reasoning capabilities are more
susceptible to bijection learning jailbreaks despite stronger safety
mechanisms.
[LINK]
http://arxiv.org/abs/2410.01294v1
[DATE]
2024-10-02 15:40:56+08:00
[CATEGORIES]
cs.CL
Bone: Block Affine Transformation as Parameter Efficient Fine-tuning Methods for Large Language Models
[AUTHORS]
Jiale Kang
[ABSTRACT]
Low-Rank Adaptation (LoRA) has achieved remarkable training results by
freezing the original weights and training only low-rank matrices, establishing
itself as the predominant fine-tuning method for LLMs. In pursuit of
performance closer to full-parameter training, a series of LoRA variants have
emerged, such as LoRA+, PISSA, Olora, and LoRA-GA. However, these improvements
complicate the initial setup of model training and increase initialization
time. More importantly, they overlook the internal interactions of the original
weight information. To address these issues, we introduce a novel theory,
“Weight Guide”, aimed at continuously guiding trainable matrices through the
original weights during training to enhance the utilization of weight
information. Based on this theory, we designed a new PEFT technique called Bone
(Block Affine), which not only enhances the
utilization of original weight information but also emphasizes the internal
connections between weights, leading to faster convergence and better data
fitting. Experimental comparisons across two different LLM architectures
(LLaMA2, RWKV6) and various parameter scales demonstrate that the Bone
structure can achieve rapid convergence and superior data fitting without the
need for complex initialization. For example, when fine-tuning LLaMA2-7B on the
MetaMathQA dataset and validating on GSM8k and math benchmarks, Bone achieved
fine-tuning scores of 49.36 and 8.8, respectively, outperforming PISSA by
5.84% and 1.96%.
[LINK]
http://arxiv.org/abs/2409.15371v3
[DATE]
2024-10-02 15:38:02+08:00
[CATEGORIES]
cs.CL
Mitigating Copy Bias in In-Context Learning through Neuron Pruning
[AUTHORS]
Ameen Ali, Lior Wolf, Ivan Titov
[ABSTRACT]
Large language models (LLMs) have demonstrated impressive few-shot in-context
learning (ICL) abilities. Still, we show that they are sometimes prone to a
‘copying bias’, where they copy answers from provided examples instead of
learning the underlying patterns. In this work, we propose a novel and simple
method to mitigate such copying bias. First, we create a synthetic task and use
the Integrated Gradients method to identify neurons that prioritize copying
over generalization. We demonstrate that pruning these neurons consistently
improves performance across a diverse set of ICL tasks. We also show that our
method is applicable across various LLM architectures, including Transformers
and State-Space Models, without requiring modifications. In our analysis, we
adopt a task-recognition perspective on ICL and examine task vectors (Hendel et
al., 2023) induced by the model. We find that pruning enhances the quality of
these vectors, suggesting that the pruned neurons previously hindered effective
task recognition.
[LINK]
http://arxiv.org/abs/2410.01288v1
[DATE]
2024-10-02 15:18:16+08:00
[CATEGORIES]
cs.CL
cs.LG
Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration
[AUTHORS]
Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng
[COMMENTS]
Accepted to the EMNLP 2024 main conference
[LINK]
http://arxiv.org/abs/2410.01285v1
[DATE]
2024-10-02 15:14:26+08:00
[CATEGORIES]
cs.CL
Unlocking the Power of GANs in Non-Autoregressive Text Generation
[AUTHORS]
Da Ren, Yi Cai, Qing Li
[ABSTRACT]
Generative Adversarial Networks (GANs) have been studied in text generation
to tackle the exposure bias problem. Despite their remarkable development, they
adopt autoregressive structures and therefore suffer from high latency in both
training and inference. Although GANs have the potential to support efficient
generation by adopting non-autoregressive (NAR) structures, their exploration
in NAR models is extremely limited. In this work, we conduct a pioneering study
of building language GANs based on NAR structures. We identify two issues that
constrain the performance of GAN-based NAR models. Firstly, existing methods of
incorporating latent variables provide highly similar representations which
cannot describe the diversity of different words in sentences. We tackle this
problem by proposing Position-Aware Self-Modulation, providing more diverse and
effective representations. Secondly, the attention mechanism in Transformer
cannot accurately build word dependencies in the unstable training of GANs, and
we adopt Dependency Feed Forward Network to enhance the model capacity in
dependency modeling. Armed with these two facilities, we propose a GAN-based
NAR model, Adversarial Non-autoregressive Transformer (ANT). The experimental
results demonstrate that ANT can achieve comparable performance with mainstream
models in a single forward pass and has great potential in various applications
like latent interpolation and semi-supervised learning.
[LINK]
http://arxiv.org/abs/2305.03977v3
[DATE]
2024-10-02 14:35:36+08:00
[CATEGORIES]
cs.CL
Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale
[AUTHORS]
Wenzhen Zheng, Wenbo Pan, Xu Xu, Libo Qin, Li Yue, Ming Zhou
[ABSTRACT]
In recent years, Large Language Models (LLMs) have made significant strides
towards Artificial General Intelligence. However, training these models from
scratch requires substantial computational resources and vast amounts of text
data. In this paper, we explore an alternative approach to constructing an LLM
for a new language by continually pretraining (CPT) from existing pretrained
LLMs, instead of using randomly initialized parameters. Based on parallel
experiments on 40 model sizes ranging from 40M to 5B parameters, we find that
1) CPT converges faster and saves significant resources in a scalable manner;
2) CPT adheres to an extended scaling law derived from Hoffmann et al. (2022)
with a joint data-parameter scaling term; 3) The compute-optimal data-parameter
allocation for CPT markedly differs based on our estimated scaling factors; 4)
The effectiveness of transfer at scale is influenced by training duration and
linguistic properties, while robust to data replaying, a method that
effectively mitigates catastrophic forgetting in CPT. We hope our findings
provide deeper insights into the transferability of LLMs at scale for the
research community.
[COMMENTS]
8 pages. Accepted at EMNLP 2024
[LINK]
http://arxiv.org/abs/2407.02118v2
[DATE]
2024-10-02 14:32:10+08:00
[CATEGORIES]
cs.CL
Advancing Event Causality Identification via Heuristic Semantic Dependency Inquiry Network
[AUTHORS]
Haoran Li, Qiang Gao, Hongmei Wu, Li Huang
[ABSTRACT]
Event Causality Identification (ECI) focuses on extracting causal relations
between events in texts. Existing methods for ECI primarily rely on causal
features and external knowledge. However, these approaches fall short in two
dimensions: (1) causal features between events in a text often lack explicit
clues, and (2) external knowledge may introduce bias, while specific problems
require tailored analyses. To address these issues, we propose SemDI, a simple
and effective Semantic Dependency Inquiry Network for ECI. SemDI captures
semantic dependencies within the context using a unified encoder. Then, it
utilizes a Cloze Analyzer to generate a fill-in token based on comprehensive
context understanding. Finally, this fill-in token is used to inquire about the
causal relation between two events. Extensive experiments demonstrate the
effectiveness of SemDI, surpassing state-of-the-art methods on three widely
used benchmarks. Code is available at https://github.com/hrlics/SemDI.
[COMMENTS]
EMNLP 2024 camera-ready version. Code is released at
https://github.com/hrlics/SemDI
[LINK]
http://arxiv.org/abs/2409.13621v2
[DATE]
2024-10-02 14:14:17+08:00
[CATEGORIES]
cs.CL
Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction
[AUTHORS]
Bowen Zhang, Harold Soh
[ABSTRACT]
In this work, we are interested in automated methods for knowledge graph
creation (KGC) from input text. Progress on large language models (LLMs) has
prompted a series of recent works applying them to KGC, e.g., via zero/few-shot
prompting. Despite successes on small domain-specific datasets, these models
face difficulties scaling up to text common in many real-world applications. A
principal issue is that, in prior methods, the KG schema has to be included in
the LLM prompt to generate valid triplets; larger and more complex schemas
easily exceed the LLMs’ context window length. Furthermore, there are scenarios
where a fixed pre-defined schema is not available and we would like the method
to construct a high-quality KG with a succinct self-generated schema. To
address these problems, we propose a three-phase framework named
Extract-Define-Canonicalize (EDC): open information extraction followed by
schema definition and post-hoc canonicalization. EDC is flexible in that it can
be applied to settings where a pre-defined target schema is available and when
it is not; in the latter case, it constructs a schema automatically and applies
self-canonicalization. To further improve performance, we introduce a trained
component that retrieves schema elements relevant to the input text; this
improves the LLMs’ extraction performance in a retrieval-augmented
generation-like manner. We demonstrate on three KGC benchmarks that EDC is able
to extract high-quality triplets without any parameter tuning and with
significantly larger schemas compared to prior works. Code for EDC is available
at https://github.com/clear-nus/edc.
[COMMENTS]
18 pages, 3 figures, Proceedings of the 2024 Conference on Empirical
Methods in Natural Language Processing
[LINK]
http://arxiv.org/abs/2404.03868v2
[DATE]
2024-10-02 13:51:53+08:00
[CATEGORIES]
cs.CL
cs.LG
AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses
[AUTHORS]
Xiaotian Lu, Jiyi Li, Koh Takeuchi, Hisashi Kashima
[COMMENTS]
Accepted for EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2410.01246v1
[DATE]
2024-10-02 13:22:07+08:00
[CATEGORIES]
cs.CL
RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance
[AUTHORS]
Haolin Jin, Zechao Sun, Yiheng Yang, Huaming Chen
[ABSTRACT]
Large Language Models (LLMs) have shown incredible potential in code
generation tasks, and recent research in prompt engineering has enhanced LLMs’
understanding of textual information. However, ensuring the accuracy of
generated code often requires extensive testing and validation by programmers.
While LLMs can typically generate code based on task descriptions, their
accuracy remains limited, especially for complex tasks that require a deeper
understanding of both the problem statement and the code generation process.
This limitation is primarily due to the LLMs’ need to simultaneously comprehend
text and generate syntactically and semantically correct code, without having
the capability to automatically refine the code. In real-world software
development, programmers rarely produce flawless code in a single attempt based
on the task description alone; they rely on iterative feedback and debugging to
refine their programs. Inspired by this process, we introduce a novel
architecture of LLM-based agents for code generation and automatic debugging:
Refinement and Guidance Debugging (RGD). The RGD framework is a multi-LLM-based
agent debugger that leverages three distinct LLM agents: Guide Agent, Debug
Agent, and Feedback Agent. RGD decomposes the code generation task into
multiple steps, ensuring a clearer workflow and enabling iterative code
refinement based on self-reflection and feedback. Experimental results
demonstrate that RGD exhibits remarkable code generation capabilities,
achieving state-of-the-art performance with a 9.8% improvement on the HumanEval
dataset and a 16.2% improvement on the MBPP dataset compared to the
state-of-the-art approaches and traditional direct prompting approaches. We
highlight the effectiveness of the RGD framework in enhancing LLMs’ ability to
generate and refine code autonomously.
[LINK]
http://arxiv.org/abs/2410.01242v1
[DATE]
2024-10-02 13:07:02+08:00
[CATEGORIES]
cs.CL
OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs
[AUTHORS]
Jintian Zhang, Cheng Peng, Mengshu Sun, Xiang Chen, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen, Ningyu Zhang
[ABSTRACT]
Despite the recent advancements in Large Language Models (LLMs), which have
significantly enhanced the generative capabilities for various NLP tasks, LLMs
still face limitations in directly handling retrieval tasks. However, many
practical applications demand the seamless integration of both retrieval and
generation. This paper introduces a novel and efficient One-pass Generation and
retrieval framework (OneGen), designed to improve LLMs’ performance on tasks
that require both generation and retrieval. The proposed framework bridges the
traditionally separate training approaches for generation and retrieval by
incorporating retrieval tokens generated autoregressively. This enables a
single LLM to handle both tasks simultaneously in a unified forward pass. We
conduct experiments on two distinct types of composite tasks, RAG and Entity
Linking, to validate the pluggability, effectiveness, and efficiency of OneGen
in training and inference. Furthermore, our results show that integrating
generation and retrieval within the same context preserves the generative
capabilities of LLMs while improving retrieval performance. To the best of our
knowledge, OneGen is the first to enable LLMs to conduct vector retrieval
during the generation.
[COMMENTS]
EMNLP 2024 Findings; code is available at
https://github.com/zjunlp/OneGen
[LINK]
http://arxiv.org/abs/2409.05152v2
[DATE]
2024-10-02 13:02:02+08:00
[CATEGORIES]
cs.CL
cs.LG
UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity
[AUTHORS]
Yicheng Fu, Raviteja Anantha, Prabal Vashisht, Jianpeng Cheng, Etai Littwin
[ABSTRACT]
Generating user intent from a sequence of user interface (UI) actions is a
core challenge in comprehensive UI understanding. Recent advancements in
multimodal large language models (MLLMs) have led to substantial progress in
this area, but their demands for extensive model parameters and computing power,
and their high latency, make them impractical for scenarios requiring lightweight,
on-device solutions with low latency or heightened privacy. Additionally, the
lack of high-quality datasets has hindered the development of such lightweight
models. To address these challenges, we propose UI-JEPA, a novel framework that
employs masking strategies to learn abstract UI embeddings from unlabeled data
through self-supervised learning, combined with an LLM decoder fine-tuned for
user intent prediction. We also introduce two new UI-grounded multimodal
datasets, “Intent in the Wild” (IIW) and “Intent in the Tame” (IIT), designed
for few-shot and zero-shot UI understanding tasks. IIW consists of 1.7K videos
across 219 intent categories, while IIT contains 914 videos across 10
categories. We establish the first baselines for these datasets, showing that
representations learned using a JEPA-style objective, combined with an LLM
decoder, can achieve user intent predictions that match the performance of
state-of-the-art large MLLMs, but with significantly reduced annotation and
deployment resources. Measured by intent similarity scores, UI-JEPA outperforms
GPT-4 Turbo and Claude 3.5 Sonnet by 10.0% and 7.2% respectively, averaged
across two datasets. Notably, UI-JEPA achieves this performance with a 50.5x
reduction in computational cost and a 6.6x improvement in latency on the IIW
dataset. These results underscore the effectiveness of UI-JEPA, highlighting
its potential for lightweight, high-performance UI understanding.
[LINK]
http://arxiv.org/abs/2409.04081v3
[DATE]
2024-10-02 13:00:57+08:00
[CATEGORIES]
cs.CL
cs.LG
Unlabeled Debiasing in Downstream Tasks via Class-wise Low Variance Regularization
[AUTHORS]
Shahed Masoudian, Markus Frohmann, Navid Rekabsaz, Markus Schedl
[ABSTRACT]
Language models frequently inherit societal biases from their training data.
Numerous techniques have been proposed to mitigate these biases during both the
pre-training and fine-tuning stages. However, fine-tuning a pre-trained
debiased language model on a downstream task can reintroduce biases into the
model. Additionally, existing debiasing methods for downstream tasks either (i)
require labels of protected attributes (e.g., age, race, or political views)
that are often not available or (ii) rely on indicators of bias, which
restricts their applicability to gender debiasing since they rely on
gender-specific words. To address this, we introduce a novel debiasing
regularization technique based on the class-wise variance of embeddings.
Crucially, our method does not require attribute labels and targets any
attribute, thus addressing the shortcomings of existing debiasing methods. Our
experiments on encoder language models and three datasets demonstrate that our
method outperforms existing strong debiasing baselines that rely on target
attribute labels while maintaining performance on the target task.
[COMMENTS]
Accepted to EMNLP 2024
[LINK]
http://arxiv.org/abs/2409.19541v3
[DATE]
2024-10-02 12:15:11+08:00
[CATEGORIES]
cs.CL
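The abstract above describes a regularizer "based on the class-wise variance of embeddings" but does not give its form. The following is a minimal sketch of one possible reading, not the authors' definition: the function name `classwise_variance_penalty`, the grouping by task label, and the averaging scheme are all assumptions made for illustration. Note that it uses only task-class labels, never protected-attribute labels.

```python
from collections import defaultdict


def classwise_variance_penalty(embeddings, labels):
    """Hypothetical penalty: mean over task classes of the average
    per-dimension variance of the embeddings assigned to that class."""
    by_class = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        by_class[label].append(vec)

    penalties = []
    for vectors in by_class.values():
        n = len(vectors)
        dim = len(vectors[0])
        # Centroid of this class in embedding space.
        centroid = [sum(v[k] for v in vectors) / n for k in range(dim)]
        # Average squared deviation from the centroid over samples and dimensions.
        sq_dev = sum((v[k] - centroid[k]) ** 2 for v in vectors for k in range(dim))
        penalties.append(sq_dev / (n * dim))
    return sum(penalties) / len(penalties)
```

In a training loop, a term like this would be added to the task loss with a weighting coefficient; that weighting, too, is an assumption here.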
AgriLLM: Harnessing Transformers for Farmer Queries
[AUTHORS]
Krish Didwania, Pratinav Seth, Aditya Kasliwal, Amit Agarwal
[COMMENTS]
Accepted at the 3rd Workshop on NLP for Positive Impact @ EMNLP.
Also accepted at the Undergraduate Consortium at KDD 2024 (KDD-UC)
[LINK]
http://arxiv.org/abs/2407.04721v2
[DATE]
2024-10-02 11:59:41+08:00
[CATEGORIES]
cs.CL
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
[AUTHORS]
Yuling Shi, Songsong Wang, Chengcheng Wan, Xiaodong Gu
[ABSTRACT]
While large language models have made significant strides in code generation,
the pass rate of the generated code is bottlenecked by subtle errors, often
requiring human intervention to pass tests, especially for complex problems.
Existing LLM-based debugging systems treat generated programs as monolithic
units, failing to address bugs at multiple levels of granularity, from
low-level syntax errors to high-level algorithmic flaws. In this paper, we
introduce Multi-Granularity Debugger (MGDebugger), a hierarchical code debugger
that isolates, identifies, and resolves bugs at various levels of granularity.
MGDebugger decomposes problematic code into a hierarchical tree structure of
subfunctions, with each level representing a particular granularity of error.
During debugging, it analyzes each subfunction and iteratively resolves bugs in
a bottom-up manner. To effectively test each subfunction, we propose an
LLM-simulated Python executor, which traces code execution and tracks important
variable states to pinpoint errors accurately. Extensive experiments
demonstrate that MGDebugger outperforms existing debugging systems, achieving
an 18.9% improvement in accuracy over seed generations in HumanEval and a 97.6%
repair success rate in HumanEvalFix. Furthermore, MGDebugger effectively fixes
bugs across different categories and difficulty levels, demonstrating its
robustness and effectiveness.
[COMMENTS]
Code and data available at https://github.com/YerbaPage/MGDebugger
[LINK]
http://arxiv.org/abs/2410.01215v1
[DATE]
2024-10-02 11:57:21+08:00
[CATEGORIES]
cs.CL
Enhancing Unsupervised Sentence Embeddings via Knowledge-Driven Data Augmentation and Gaussian-Decayed Contrastive Learning
[AUTHORS]
Peichao Lai, Zhengfeng Zhang, Wentao Zhang, Fangcheng Fu, Bin Cui
[ABSTRACT]
Recently, using large language models (LLMs) for data augmentation has led to
considerable improvements in unsupervised sentence embedding models. However,
existing methods encounter two primary challenges: limited data diversity and
high data noise. Current approaches often neglect fine-grained knowledge, such
as entities and quantities, leading to insufficient diversity. Additionally,
unsupervised data frequently lacks discriminative information, and the
generated synthetic samples may introduce noise. In this paper, we propose a
pipeline-based data augmentation method via LLMs and introduce the
Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model
to enhance unsupervised sentence embeddings. To tackle the issue of low data
diversity, our pipeline utilizes knowledge graphs (KGs) to extract entities and
quantities, enabling LLMs to generate more diverse, knowledge-enriched samples.
To address high data noise, the GCSE model uses a Gaussian-decayed function to
limit the impact of false hard negative samples, enhancing the model’s
discriminative capability. Experimental results show that our approach achieves
state-of-the-art performance in semantic textual similarity (STS) tasks, using
fewer data samples and smaller LLMs, demonstrating its efficiency and
robustness across various models.
[LINK]
http://arxiv.org/abs/2409.12887v2
[DATE]
2024-10-02 11:24:50+08:00
[CATEGORIES]
cs.CL
Learning to Extract Structured Entities Using Language Models
[AUTHORS]
Haolun Wu, Ye Yuan, Liana Mikaelyan, Alexander Meulemans, Xue Liu, James Hensman, Bhaskar Mitra
[ABSTRACT]
Recent advances in machine learning have significantly impacted the field of
information extraction, with Language Models (LMs) playing a pivotal role in
extracting structured information from unstructured text. Prior works typically
represent information extraction as triplet-centric and use classical metrics
such as precision and recall for evaluation. We reformulate the task to be
entity-centric, enabling the use of diverse metrics that can provide more
insights from various perspectives. We contribute to the field by introducing
Structured Entity Extraction and proposing the Approximate Entity Set OverlaP
(AESOP) metric, designed to appropriately assess model performance. Later, we
introduce a new Multistage Structured Entity Extraction (MuSEE) model that
harnesses the power of LMs for enhanced effectiveness and efficiency by
decomposing the extraction task into multiple stages. Quantitative and human
side-by-side evaluations confirm that our model outperforms baselines, offering
promising directions for future advancements in structured entity extraction.
Our source code and datasets are available at
https://github.com/microsoft/Structured-Entity-Extraction.
[COMMENTS]
18 pages, 11 figures
[LINK]
http://arxiv.org/abs/2402.04437v5
[DATE]
2024-10-02 11:21:43+08:00
[CATEGORIES]
cs.CL
cs.LG
Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs
[AUTHORS]
Chengyuan Liu, Shihang Wang, Lizhi Qing, Kun Kuang, Yangyang Kang, Changlong Sun, Fei Wu
[ABSTRACT]
While Large Language Models (LLMs) demonstrate impressive generation
abilities, they frequently struggle when it comes to specialized domains due to
their limited domain-specific knowledge. Studies on domain-specific LLMs resort
to expanding the vocabulary before fine-tuning on a domain-specific corpus,
aiming to decrease the sequence length and enhance efficiency during decoding,
without thoroughly investigating the results of vocabulary expansion to LLMs
over different domains. Our pilot study reveals that expansion with only a
subset of the entire vocabulary may lead to superior performance. Guided by the
discovery, this paper explores how to identify a vocabulary subset to achieve
the optimal results. We introduce VEGAD, an adaptive method that automatically
identifies valuable words from a given domain vocabulary. Our method has been
validated through experiments on three Chinese datasets, demonstrating its
effectiveness. Additionally, we have undertaken comprehensive analyses of the
method. The selection of an optimal subset for expansion has been shown to enhance
performance on both domain-specific tasks and general tasks, showcasing the
potential of VEGAD.
[COMMENTS]
Accepted by EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.01188v1
[DATE]
2024-10-02 10:47:39+08:00
[CATEGORIES]
cs.CL
FastLexRank: Efficient Lexical Ranking for Structuring Social Media Posts
[AUTHORS]
Mao Li, Frederick Conrad, Johann Gagnon-Bartsch
[ABSTRACT]
We present FastLexRank (https://github.com/LiMaoUM/FastLexRank), an
efficient and scalable implementation of the LexRank algorithm for text
ranking. Designed to address the computational and memory complexities of the
original LexRank method, FastLexRank significantly reduces time and memory
requirements from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$ without compromising
the quality or accuracy of the results. By employing an optimized approach to
calculating the stationary distribution of sentence graphs, FastLexRank
produces results identical to the original LexRank scores while enhancing
computational efficiency. This paper details the algorithmic improvements that
enable the processing of large datasets, such as social media corpora, in
real-time. Empirical results demonstrate its effectiveness, and we propose its
use in identifying central tweets, which can be further analyzed using advanced
NLP techniques. FastLexRank offers a scalable solution for text centrality
calculation, addressing the growing need for efficient processing of digital
content.
[LINK]
http://arxiv.org/abs/2410.01183v1
[DATE]
2024-10-02 10:34:33+08:00
[CATEGORIES]
cs.CL
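The FastLexRank abstract above does not spell out how the stationary distribution is obtained in linear time. One way such a reduction can arise, shown here as a sketch under stated assumptions rather than as the paper's implementation: for a symmetric, non-negative similarity matrix S = E Eᵀ, the stationary distribution of the row-normalized random walk is proportional to the row sums of S, and those row sums equal E (Eᵀ 1), which never requires forming the n × n matrix. The function names and the use of plain dot-product similarity are hypothetical.

```python
def centrality_linear(embeddings):
    """Scores proportional to the row sums of E @ E.T, computed in O(n * dim)
    by first summing all embeddings and then taking one dot product per sentence."""
    dim = len(embeddings[0])
    total = [0.0] * dim
    for e in embeddings:
        for k, value in enumerate(e):
            total[k] += value
    scores = [sum(a * b for a, b in zip(e, total)) for e in embeddings]
    norm = sum(scores)
    return [s / norm for s in scores]


def centrality_quadratic(embeddings):
    """Reference version: builds every pairwise similarity, O(n^2 * dim)."""
    n = len(embeddings)
    row_sums = [
        sum(sum(a * b for a, b in zip(embeddings[i], embeddings[j])) for j in range(n))
        for i in range(n)
    ]
    norm = sum(row_sums)
    return [s / norm for s in row_sums]
```

Both functions return the same normalized scores when all pairwise similarities are non-negative; the linear version simply reorders the arithmetic.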
More Than Catastrophic Forgetting: Integrating General Capabilities For Domain-Specific LLMs
[AUTHORS]
Chengyuan Liu, Yangyang Kang, Shihang Wang, Lizhi Qing, Fubang Zhao, Changlong Sun, Kun Kuang, Fei Wu
[COMMENTS]
Accepted by EMNLP 2024
[LINK]
http://arxiv.org/abs/2405.17830v2
[DATE]
2024-10-02 10:31:04+08:00
[CATEGORIES]
cs.CL
VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka
[AUTHORS]
Li-Wei Chen, Hung-Shin Lee, Chen-Chi Chang
[ABSTRACT]
This paper introduces VoxHakka, a text-to-speech (TTS) system designed for
Taiwanese Hakka, a critically under-resourced language spoken in Taiwan.
Leveraging the YourTTS framework, VoxHakka achieves high naturalness and
accuracy and low real-time factor in speech synthesis while supporting six
distinct Hakka dialects. This is achieved by training the model with
dialect-specific data, allowing for the generation of speaker-aware Hakka
speech. To address the scarcity of publicly available Hakka speech corpora, we
employed a cost-effective approach utilizing a web scraping pipeline coupled
with automatic speech recognition (ASR)-based data cleaning techniques. This
process ensured the acquisition of a high-quality, multi-speaker, multi-dialect
dataset suitable for TTS training. Subjective listening tests conducted using
comparative mean opinion scores (CMOS) demonstrate that VoxHakka significantly
outperforms existing publicly available Hakka TTS systems in terms of
pronunciation accuracy, tone correctness, and overall naturalness. This work
represents a significant advancement in Hakka language technology and provides
a valuable resource for language preservation and revitalization efforts.
[COMMENTS]
Accepted to O-COCOSDA 2024
[LINK]
http://arxiv.org/abs/2409.01548v3
[DATE]
2024-10-02 10:25:30+08:00
[CATEGORIES]
cs.CL
Ask-before-Plan: Proactive Language Agents for Real-World Planning
[AUTHORS]
Xuan Zhang, Yang Deng, Zifeng Ren, See-Kiong Ng, Tat-Seng Chua
[ABSTRACT]
The evolution of large language models (LLMs) has enhanced the planning
capabilities of language agents in diverse real-world scenarios. Despite these
advancements, the potential of LLM-powered agents to comprehend ambiguous user
instructions for reasoning and decision-making is still under exploration. In
this work, we introduce a new task, Proactive Agent Planning, which requires
language agents to predict clarification needs based on user-agent conversation
and agent-environment interaction, invoke external tools to collect valid
information, and generate a plan to fulfill the user’s demands. To study this
practical problem, we establish a new benchmark dataset, Ask-before-Plan. To
tackle the deficiency of LLMs in proactive planning, we propose a novel
multi-agent framework, Clarification-Execution-Planning (CEP), which
consists of three agents specialized in clarification, execution, and planning.
We introduce the trajectory tuning scheme for the clarification agent and
static execution agent, as well as the memory recollection mechanism for the
dynamic execution agent. Extensive evaluations and comprehensive analyses
conducted on the Ask-before-Plan dataset validate the effectiveness of our
proposed framework.
[COMMENTS]
Accepted by EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2406.12639v2
[DATE]
2024-10-02 10:02:56+08:00
[CATEGORIES]
cs.CL
Towards Inference-time Category-wise Safety Steering for Large Language Models
[AUTHORS]
Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien
[ABSTRACT]
While large language models (LLMs) have seen unprecedented advancements in
capabilities and applications across a variety of use-cases, safety alignment
of these models is still an area of active research. The fragile nature of
LLMs, even models that have undergone extensive alignment and safety training
regimes, warrants additional safety steering steps via training-free,
inference-time methods. While recent work in the area of mechanistic
interpretability has investigated how activations in latent representation
spaces may encode concepts, and thereafter performed representation engineering
to induce such concepts in LLM outputs, the applicability of such techniques to safety is
relatively under-explored. Unlike recent inference-time safety steering works,
in this paper we explore safety steering of LLM outputs using: (i)
category-specific steering vectors, thereby enabling fine-grained control over
the steering, and (ii) sophisticated methods for extracting informative
steering vectors for more effective safety steering while retaining quality of
the generated text. We demonstrate our exploration on multiple LLMs and
datasets, and showcase the effectiveness of the proposed steering method, along
with a discussion on the implications and best practices.
[LINK]
http://arxiv.org/abs/2410.01174v1
[DATE]
2024-10-02 10:02:06+08:00
[CATEGORIES]
cs.CL
BordIRlines: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation
[AUTHORS]
Bryan Li, Samar Haider, Fiona Luo, Adwait Agashe, Chris Callison-Burch
[ABSTRACT]
Large language models excel at creative generation but continue to struggle
with the issues of hallucination and bias. While retrieval-augmented generation
(RAG) provides a framework for grounding LLMs’ responses in accurate and
up-to-date information, it still raises the question of bias: which sources
should be selected for inclusion in the context? And how should their
importance be weighted? In this paper, we study the challenge of cross-lingual
RAG and present a dataset to investigate the robustness of existing systems at
answering queries about geopolitical disputes, which exist at the intersection
of linguistic, cultural, and political boundaries. Our dataset is sourced from
Wikipedia pages containing information relevant to the given queries and we
investigate the impact of including additional context, as well as the
composition of this context in terms of language and source, on an LLM’s
response. Our results show that existing RAG systems continue to be challenged
by cross-lingual use cases and suffer from a lack of consistency when they are
provided with competing information in multiple languages. We present case
studies to illustrate these issues and outline steps for future research to
address these challenges. We make our dataset and code publicly available at
https://github.com/manestay/bordIRlines.
[COMMENTS]
NLP for Wikipedia workshop at EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.01171v1
[DATE]
2024-10-02 09:59:07+08:00
[CATEGORIES]
cs.CL
Unifying the Scope of Bridging Anaphora Types in English: Bridging Annotations in ARRAU and GUM
[AUTHORS]
Lauren Levine, Amir Zeldes
[COMMENTS]
The Seventh Workshop on Computational Models of Reference, Anaphora
and Coreference (CRAC 2024), EMNLP 2024 Workshop, 15 November 2024
[LINK]
http://arxiv.org/abs/2410.01170v1
[DATE]
2024-10-02 09:56:28+08:00
[CATEGORIES]
cs.CL
GADFA: Generator-Assisted Decision-Focused Approach for Opinion Expressing Timing Identification
[AUTHORS]
Chung-Chi Chen, Hiroya Takamura, Ichiro Kobayashi, Yusuke Miyao
[ABSTRACT]
The advancement of text generation models has granted us the capability to
produce coherent and convincing text on demand. Yet, in real-life
circumstances, individuals do not continuously generate text or voice their
opinions. For instance, consumers pen product reviews after weighing the merits
and demerits of a product, and professional analysts issue reports following
significant news releases. In essence, opinion expression is typically prompted
by particular reasons or signals. Despite long-standing developments in opinion
mining, the appropriate timing for expressing an opinion remains largely
unexplored. To address this deficit, our study introduces an innovative task -
the identification of news-triggered opinion expressing timing. We ground this
task in the actions of professional stock analysts and develop a novel dataset
for investigation. Our approach is decision-focused, leveraging text generation
models to steer the classification model, thus enhancing overall performance.
Our experimental findings demonstrate that the text generated by our model
contributes fresh insights from various angles, effectively aiding in
identifying the optimal timing for opinion expression.
[LINK]
http://arxiv.org/abs/2410.01169v1
[DATE]
2024-10-02 09:54:46+08:00
[CATEGORIES]
cs.CL
Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification
[AUTHORS]
Kush Dubey
[COMMENTS]
To appear in the GenBench Workshop at EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.00179v2
[DATE]
2024-10-02 09:50:17+08:00
[CATEGORIES]
cs.CL
cs.LG
Unveiling the Achilles’ Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models
[AUTHORS]
Yiming Chen, Chen Zhang, Danqing Luo, Luis Fernando D’Haro, Robby T. Tan, Haizhou Li
[ABSTRACT]
The automatic evaluation of natural language generation (NLG) systems
presents a long-lasting challenge. Recent studies have highlighted various
neural metrics that align well with human evaluations. Yet, the robustness of
these evaluators against adversarial perturbations remains largely
under-explored due to the unique challenges in obtaining adversarial data for
different NLG evaluation tasks. To address the problem, we introduce AdvEval, a
novel black-box adversarial framework against NLG evaluators. AdvEval is
specially tailored to generate data that yield strong disagreements between
human and victim evaluators. Specifically, inspired by the recent success of
large language models (LLMs) in text generation and evaluation, we adopt strong
LLMs as both the data generator and gold evaluator. Adversarial data are
automatically optimized with feedback from the gold and victim evaluator. We
conduct experiments on 12 victim evaluators and 11 NLG datasets, spanning tasks
including dialogue, summarization, and question evaluation. The results show
that AdvEval can lead to significant performance degradation of various victim
metrics, thereby validating its efficacy.
[COMMENTS]
ACL24 Findings
[LINK]
http://arxiv.org/abs/2405.14646v2
[DATE]
2024-10-02 09:47:34+08:00
[CATEGORIES]
cs.CL
Unleashing the Power of Large Language Models in Zero-shot Relation Extraction via Self-Prompting
[AUTHORS]
Siyi Liu, Yang Li, Jiang Li, Shan Yang, Yunshi Lan
[ABSTRACT]
Recent research in zero-shot Relation Extraction (RE) has focused on using
Large Language Models (LLMs) due to their impressive zero-shot capabilities.
However, current methods often perform suboptimally, mainly due to a lack of
detailed, context-specific prompts needed for understanding various sentences
and relations. To address this, we introduce the Self-Prompting framework, a
novel method designed to fully harness the embedded RE knowledge within LLMs.
Specifically, our framework employs a three-stage diversity approach to prompt
LLMs, generating multiple synthetic samples that encapsulate specific relations
from scratch. These generated samples act as in-context learning samples,
offering explicit and context-specific guidance to efficiently prompt LLMs for
RE. Experimental evaluations on benchmark datasets show our approach
outperforms existing LLM-based zero-shot RE methods. Additionally, our
experiments confirm the effectiveness of our generation pipeline in producing
high-quality synthetic data that enhances performance.
[COMMENTS]
EMNLP 2024 Short
[LINK]
http://arxiv.org/abs/2410.01154v1
[DATE]
2024-10-02 09:12:54+08:00
[CATEGORIES]
cs.CL
Unsupervised Domain Adaptation for Keyphrase Generation using Citation Contexts
[AUTHORS]
Florian Boudin, Akiko Aizawa
[ABSTRACT]
Adapting keyphrase generation models to new domains typically involves
few-shot fine-tuning with in-domain labeled data. However, annotating documents
with keyphrases is often prohibitively expensive and impractical, requiring
expert annotators. This paper presents silk, an unsupervised method designed to
address this issue by extracting silver-standard keyphrases from citation
contexts to create synthetic labeled data for domain adaptation. Extensive
experiments across three distinct domains demonstrate that our method yields
high-quality synthetic samples, resulting in significant and consistent
improvements in in-domain performance over strong baselines.
[COMMENTS]
Accepted at EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2409.13266v2
[DATE]
2024-10-02 09:11:59+08:00
[CATEGORIES]
cs.CL
Suri: Multi-constraint Instruction Following for Long-form Text Generation
[AUTHORS]
Chau Minh Pham, Simeng Sun, Mohit Iyyer
[COMMENTS]
Accepted to EMNLP’24 (Findings)
[LINK]
http://arxiv.org/abs/2406.19371v2
[DATE]
2024-10-02 09:01:57+08:00
[CATEGORIES]
cs.CL
Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs
[AUTHORS]
Doohee You, Karim Lasri, Samuel Fraiberger
[ABSTRACT]
This study investigates efficient deduplication techniques for a large NLP
dataset of economic research paper titles. We explore various pairing methods
alongside established distance measures (Levenshtein distance, cosine
similarity) and an sBERT model for semantic evaluation. Our findings suggest a
potentially low prevalence of duplicates based on the observed semantic
similarity across different methods. Further exploration with a human-annotated
ground truth set is completed for a more conclusive assessment. The result
supports the findings from the NLP- and LLM-based distance metrics.
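The two established distance measures named in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names, the bag-of-words vectorization for cosine similarity, and both thresholds are assumptions chosen for the example, and the sBERT step is omitted.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic program)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over lowercase word counts (assumed vectorization)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def is_candidate_duplicate(a: str, b: str,
                           max_edits: int = 3,
                           min_cosine: float = 0.9) -> bool:
    """Flag a title pair if either measure passes its (hypothetical) threshold."""
    return levenshtein(a, b) <= max_edits or cosine_similarity(a, b) >= min_cosine
```

Pairs flagged this way would still need a semantic check, which is the role the abstract assigns to the sBERT model and the human-annotated set.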
[COMMENTS]
6 pages, 1 figure
[LINK]
http://arxiv.org/abs/2410.01141v1
[DATE]
2024-10-02 08:43:10+08:00
[CATEGORIES]
cs.CL
Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters
[AUTHORS]
Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe
[ABSTRACT]
Scaling the context size of large language models (LLMs) enables them to
perform various new tasks, e.g., book summarization. However, the memory cost
of the Key and Value (KV) cache in attention significantly limits the practical
applications of LLMs. Recent works have explored token pruning for KV cache
reduction in LLMs, relying solely on attention scores as a token importance
indicator. However, our investigation into value vector norms revealed a
notably non-uniform pattern questioning their reliance only on attention
scores. Inspired by this, we propose a new method: Value-Aware Token Pruning
(VATP) which uses both attention scores and the $ \ell_{1} $ norm of value
vectors to evaluate token importance. Extensive experiments on LLaMA2-7B-chat
and Vicuna-v1.5-7B across 16 LongBench tasks demonstrate that VATP outperforms
attention-score-only baselines in over 12 tasks, confirming the effectiveness
of incorporating value vector norms into token importance evaluation of LLMs.
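A small sketch of how a value-aware importance score might change which tokens are kept. The abstract only says that VATP uses both attention scores and the $ \ell_{1} $ norm of value vectors; combining them by multiplication, and the names `value_aware_scores` and `keep_top_k`, are assumptions made for this illustration.

```python
def value_aware_scores(attn_scores, values):
    """Per-token score: attention score times the L1 norm of the value vector.

    attn_scores: list of floats, one per cached token.
    values: list of value vectors (lists of floats), one per cached token.
    The product is an assumed way of combining the two signals.
    """
    if len(attn_scores) != len(values):
        raise ValueError("need one attention score per value vector")
    return [s * sum(abs(x) for x in v) for s, v in zip(attn_scores, values)]


def keep_top_k(scores, k):
    """Indices of the k highest-scoring tokens, returned in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy cache: token 0 has the highest attention but a near-zero value vector.
attn = [0.5, 0.3, 0.2]
vals = [[0.1, -0.1], [1.0, 1.0], [2.0, -2.0]]
kept_attention_only = keep_top_k(attn, 2)
kept_value_aware = keep_top_k(value_aware_scores(attn, vals), 2)
```

In this toy case the attention-only ranking keeps tokens 0 and 1, while the value-aware ranking keeps tokens 1 and 2, because token 0 contributes little to the attention output despite its high score.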
[COMMENTS]
Accepted at EMNLP 2024 (Main)
[LINK]
http://arxiv.org/abs/2406.12335v2
[DATE]
2024-10-02 08:19:13+08:00
[CATEGORIES]
cs.CL
cs.LG
Observational Scaling Laws and the Predictability of Language Model Performance
[AUTHORS]
Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto
[ABSTRACT]
Understanding how language model performance varies with scale is critical to
benchmark and algorithm development. Scaling laws are one approach to building
this understanding, but the requirement of training models across many
different scales has limited their use. We propose an alternative,
observational approach that bypasses model training and instead builds scaling
laws from ~100 publicly available models. Building a single scaling law from
multiple model families is challenging due to large variations in their
training compute efficiencies and capabilities. However, we show that these
variations are consistent with a simple, generalized scaling law where language
model performance is a function of a low-dimensional capability space, and
model families only vary in their efficiency in converting training compute to
capabilities. Using this approach, we show the surprising predictability of
complex scaling phenomena: we show that several emergent phenomena follow a
smooth, sigmoidal behavior and are predictable from small models; we show that
the agent performance of models such as GPT-4 can be precisely predicted from
simpler non-agentic benchmarks; and we show how to predict the impact of
post-training interventions like Chain-of-Thought and Self-Consistency as
language model capabilities continue to improve.
[COMMENTS]
Accepted at NeurIPS 2024 as a spotlight
[LINK]
http://arxiv.org/abs/2405.10938v3
[DATE]
2024-10-02 07:38:10+08:00
[CATEGORIES]
cs.LG
cs.CL
Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance – A Case Study in Finance
[AUTHORS]
Meni Brief, Oded Ovadia, Gil Shenderovitz, Noga Ben Yoash, Rachel Lemberg, Eitam Sheetrit
[ABSTRACT]
The application of large language models (LLMs) in domain-specific contexts,
including finance, has expanded rapidly. Domain-specific LLMs are typically
evaluated based on their performance in various downstream tasks relevant to
the domain. In this work, we present a detailed analysis of fine-tuning LLMs
for such tasks. Somewhat counterintuitively, we find that in domain-specific
cases, fine-tuning exclusively on the target task is not always the most
effective strategy. Instead, multi-task fine-tuning - where models are trained
on a cocktail of related tasks - can significantly enhance performance. We
demonstrate how this approach enables a small model, such as Phi-3-Mini, to
achieve state-of-the-art results, even surpassing the much larger GPT-4-o model
on financial benchmarks. Our study involves a large-scale experiment, training
over 200 models using several widely adopted LLMs as baselines, and empirically
confirms the benefits of multi-task fine-tuning. Additionally, we explore the
use of general instruction data as a form of regularization, suggesting that it
helps minimize performance degradation. We also investigate the inclusion of
mathematical data, finding improvements in numerical reasoning that transfer
effectively to financial tasks. Finally, we note that while fine-tuning for
downstream tasks leads to targeted improvements in task performance, it does
not necessarily result in broader gains in domain knowledge or complex domain
reasoning abilities.
[LINK]
http://arxiv.org/abs/2410.01109v1
[DATE]
2024-10-02 06:35:56+08:00
[CATEGORIES]
cs.CL
Approximately Aligned Decoding
[AUTHORS]
Daniel Melcer, Sujan Gonugondla, Pramuditha Perera, Haifeng Qian, Wen-Hao Chiang, Yanjun Wang, Nihal Jain, Pranav Garg, Xiaofei Ma, Anoop Deoras
[ABSTRACT]
It is common to reject undesired outputs of Large Language Models (LLMs);
however, current methods to do so require an excessive amount of computation,
or severely distort the distribution of outputs. We present a method to balance
the distortion of the output distribution with computational efficiency,
allowing for the generation of long sequences of text with difficult-to-satisfy
constraints, with less amplification of low probability outputs compared to
existing methods. We show through a series of experiments that the
task-specific performance of our method is comparable to methods that do not
distort the output distribution, while being much more computationally
efficient.
[COMMENTS]
9 pages main, 22 pages total
[LINK]
http://arxiv.org/abs/2410.01103v1
[DATE]
2024-10-02 06:22:13+08:00
[CATEGORIES]
cs.CL
Unlocking Korean Verbs: A User-Friendly Exploration into the Verb Lexicon
[AUTHORS]
Seohyun Song, Eunkyul Leah Jo, Yige Chen, Jeen-Pyo Hong, Kyuwon Kim, Jin Wee, Miyoung Kang, KyungTae Lim, Jungyeul Park, Chulwoo Park
[ABSTRACT]
The Sejong dictionary dataset offers a valuable resource, providing extensive
coverage of morphology, syntax, and semantic representation. This dataset can
be utilized to explore linguistic information in greater depth. The labeled
linguistic structures within this dataset form the basis for uncovering
relationships between words and phrases and their associations with target
verbs. This paper introduces a user-friendly web interface designed for the
collection and consolidation of verb-related information, with a particular
focus on subcategorization frames. Additionally, it outlines our efforts in
mapping this information by aligning subcategorization frames with
corresponding illustrative sentence examples. Furthermore, we provide a Python
library that would simplify syntactic parsing and semantic role labeling. These
tools are intended to assist individuals interested in harnessing the Sejong
dictionary dataset to develop applications for Korean language processing.
[COMMENTS]
COLING2025 System Demonstrations (Submitted)
[LINK]
http://arxiv.org/abs/2410.01100v1
[DATE]
2024-10-02 06:03:34+08:00
[CATEGORIES]
cs.CL
Are Large Language Models Consistent over Value-laden Questions?
[AUTHORS]
Jared Moore, Tanvi Deshpande, Diyi Yang
[COMMENTS]
9 pages, 10 figures, In Findings of EMNLP 2024
[LINK]
http://arxiv.org/abs/2407.02996v2
[DATE]
2024-10-02 05:23:18+08:00
[CATEGORIES]
cs.CL
Concept Space Alignment in Multilingual LLMs
[AUTHORS]
Qiwei Peng, Anders Søgaard
[COMMENTS]
EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.01079v1
[DATE]
2024-10-02 05:21:00+08:00
[CATEGORIES]
cs.CL
Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning
[AUTHORS]
Zhihan Zhang, Tao Ge, Zhenwen Liang, Wenhao Yu, Dian Yu, Mengzhao Jia, Dong Yu, Meng Jiang
[COMMENTS]
Accepted to the main conference of EMNLP 2024
[LINK]
http://arxiv.org/abs/2406.12050v2
[DATE]
2024-10-02 04:50:05+08:00
[CATEGORIES]
cs.CL
From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems
[AUTHORS]
Ali Mohammadjafari, Anthony S. Maida, Raju Gottumukkala
[ABSTRACT]
Since the onset of LLMs, translating natural language queries to structured
SQL commands is assuming increasing importance. Unlike previous reviews, this survey
provides a comprehensive study of the evolution of LLM-based text-to-SQL
systems, from early rule-based models to advanced LLM approaches, and how LLMs
impacted this field. We discuss benchmarks, evaluation methods and evaluation
metrics. Also, we uniquely study the role of integration of knowledge graphs
for better contextual accuracy and schema linking in these systems. The current
techniques fall into two categories: in-context learning of corpus and
fine-tuning, which then leads to approaches such as zero-shot, few-shot
learning from the end, and data augmentation. Finally, we highlight key
challenges such as computational efficiency, model robustness, and data privacy
with perspectives toward their development and improvements in potential areas
for the future of LLM-based text-to-SQL systems.
[COMMENTS]
12 pages, 5 figures, 3 tables
[LINK]
http://arxiv.org/abs/2410.01066v1
[DATE]
2024-10-02 04:46:25+08:00
[CATEGORIES]
cs.CL
CA-BERT: Leveraging Context Awareness for Enhanced Multi-Turn Chat Interaction
[AUTHORS]
Minghao Liu, Mingxiu Sui, Yi Nan, Cangqing Wang, Zhijie Zhou
[ABSTRACT]
Effective communication in automated chat systems hinges on the ability to
understand and respond to context. Traditional models often struggle with
determining when additional context is necessary for generating appropriate
responses. This paper introduces Context-Aware BERT (CA-BERT), a
transformer-based model specifically fine-tuned to address this challenge.
CA-BERT innovatively applies deep learning techniques to discern context
necessity in multi-turn chat interactions, enhancing both the relevance and
accuracy of responses.
We describe the development of CA-BERT, which adapts the robust architecture
of BERT with a novel training regimen focused on a specialized dataset of chat
dialogues. The model is evaluated on its ability to classify context necessity,
demonstrating superior performance over baseline BERT models in terms of
accuracy and efficiency. Furthermore, CA-BERT’s implementation showcases
significant reductions in training time and resource usage, making it feasible
for real-time applications.
The results indicate that CA-BERT can effectively enhance the functionality
of chatbots by providing a nuanced understanding of context, thereby improving
user experience and interaction quality in automated systems. This study not
only advances the field of NLP in chat applications but also provides a
framework for future research into context-sensitive AI developments.
[COMMENTS]
This paper has been accepted by ICBASE 2024
[LINK]
http://arxiv.org/abs/2409.13701v2
[DATE]
2024-10-02 04:45:26+08:00
[CATEGORIES]
cs.CL
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
[AUTHORS]
Anton Xue, Avishree Khare, Rajeev Alur, Surbhi Goel, Eric Wong
[ABSTRACT]
We study how to subvert large language models (LLMs) from following
prompt-specified rules. We model rule-following as inference in propositional
Horn logic, a mathematical system in which rules have the form “if $P$ and
$Q$, then $R$” for some propositions $P$, $Q$, and $R$. We prove that although
LLMs can faithfully follow such rules, maliciously crafted prompts can mislead
even idealized, theoretically constructed models. Empirically, we find that the
reasoning behavior of LLMs aligns with that of our theoretical constructions,
and popular attack algorithms find adversarial prompts with characteristics
predicted by our theory. Our logic-based framework provides a novel perspective
for mechanistically understanding the behavior of LLMs in rule-based settings
such as jailbreak attacks.
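Inference in propositional Horn logic, as described in this abstract, can be illustrated with standard forward chaining. This is a generic textbook sketch, not the paper's theoretical construction; the rule encoding as `(premises, conclusion)` pairs and the name `forward_chain` are assumptions for the example.

```python
def forward_chain(facts, rules):
    """Derive every proposition that follows from `facts` under Horn rules.

    facts: iterable of proposition names known to be true.
    rules: iterable of (premises, conclusion) pairs, read as
           "if all premises hold, then the conclusion holds".
    """
    known = set(facts)
    changed = True
    while changed:  # repeat until no rule adds a new proposition
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


# "If P and Q, then R" followed by "if R, then S".
rules = [(("P", "Q"), "R"), (("R",), "S")]
derived = forward_chain({"P", "Q"}, rules)
```

Starting from P and Q, both rules fire and the derived set contains R and S; starting from P alone, nothing new is derived. Under this encoding, subverting rule-following means making a model's output differ from this closure.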
[LINK]
http://arxiv.org/abs/2407.00075v2
[DATE]
2024-10-02 04:42:41+08:00
[CATEGORIES]
cs.CL
cs.LG
Watch Your Steps: Observable and Modular Chains of Thought
[AUTHORS]
Cassandra A. Cohen, William W. Cohen
[ABSTRACT]
We propose a variant of chain of thought (CoT) prompting called Program Trace
Prompting that makes explanations more observable while preserving the power,
generality and flexibility of CoT. In our approach, few-shot CoT demonstrations
are wrapped in a formal syntax based on Python, and each prompt: identifies and
names steps; defines the input/output behavior of steps; and replaces CoT
explanations of in-context examples with chains of these formalized steps on
the same examples. Program Trace Prompting is applicable to many tasks,
achieving strong results on the 23 diverse tasks in the BIG-Bench Hard
benchmark. More importantly, by instrumenting explanations in this way, we
enable new types of analysis. In particular, we identify “non-local errors”
(which correspond to incorrectly learning the reasoning method illustrated in
the demonstrations) as an unaddressed issue in CoT learning, and we present
methods for verifying the modularity of steps in a CoT explanation.
[LINK]
http://arxiv.org/abs/2409.15359v2
[DATE]
2024-10-02 04:24:38+08:00
[CATEGORIES]
cs.CL
cs.LG
From Facts to Insights: A Study on the Generation and Evaluation of Analytical Reports for Deciphering Earnings Calls
[AUTHORS]
Tomas Goldsack, Yang Wang, Chenghua Lin, Chung-Chi Chen
[ABSTRACT]
This paper explores the use of Large Language Models (LLMs) in the generation
and evaluation of analytical reports derived from Earnings Calls (ECs).
Addressing a current gap in research, we explore the generation of analytical
reports with LLMs in a multi-agent framework, designing specialized agents that
introduce diverse viewpoints and desirable topics of analysis into the report
generation process. Through multiple analyses, we examine the alignment between
generated and human-written reports and the impact of both individual and
collective agents. Our findings suggest that the introduction of additional
agents results in more insightful reports, although reports generated by human
experts remain preferred in the majority of cases. Finally, we address the
challenging issue of report evaluation, examining the limitations and
strengths of LLMs in assessing the quality of generated reports in different
settings, revealing a significant correlation with human experts across
multiple dimensions.
[COMMENTS]
Pre-print
[LINK]
http://arxiv.org/abs/2410.01039v1
[DATE]
2024-10-02 04:03:22+08:00
[CATEGORIES]
cs.CL
MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling
[AUTHORS]
Philipp Seeberger, Dominik Wagner, Korbinian Riedhammer
[COMMENTS]
Accepted to Findings of EMNLP 2024
[LINK]
http://arxiv.org/abs/2406.12420v2
[DATE]
2024-10-02 04:02:46+08:00
[CATEGORIES]
cs.CL
cs.LG
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
[AUTHORS]
Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan
[ABSTRACT]
We propose a novel text-to-video (T2V) generation benchmark,
ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the
T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast
to existing benchmarks that focus on visual quality and textual relevance of
generated videos, ChronoMagic-Bench focuses on the model’s ability to generate
time-lapse videos with significant metamorphic amplitude and temporal
coherence. The benchmark probes T2V models for their physics, biology, and
chemistry capabilities, in a free-form text query. For these purposes,
ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references,
categorized into four major types of time-lapse videos: biological,
human-created, meteorological, and physical phenomena, which are further
divided into 75 subcategories. This categorization comprehensively evaluates
the model’s capacity to handle diverse and complex transformations. To
accurately align human preference with the benchmark, we introduce two new
automatic metrics, MTScore and CHScore, to evaluate the videos’ metamorphic
attributes and temporal coherence. MTScore measures the metamorphic amplitude,
reflecting the degree of change over time, while CHScore assesses the temporal
coherence, ensuring the generated videos maintain logical progression and
continuity. Based on ChronoMagic-Bench, we conduct comprehensive manual
evaluations of ten representative T2V models, revealing their strengths and
weaknesses across different categories of prompts, and providing a thorough
evaluation framework that addresses current gaps in video generation research.
Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k
high-quality pairs of 720p time-lapse videos and detailed captions ensuring
high physical pertinence and large metamorphic amplitude.
[COMMENTS]
NeurIPS D&B 2024 (Spotlight)
[LINK]
http://arxiv.org/abs/2406.18522v2
[DATE]
2024-10-02 04:00:27+08:00
[CATEGORIES]
cs.CL
RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs
[AUTHORS]
Ekaterina Taktasheva, Maxim Bazhukov, Kirill Koncha, Alena Fenogenova, Ekaterina Artemova, Vladislav Mikhailov
[ABSTRACT]
Minimal pairs are a well-established approach to evaluating the grammatical
knowledge of language models. However, existing resources for minimal pairs
address a limited number of languages and lack diversity of language-specific
grammatical phenomena. This paper introduces the Russian Benchmark of
Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that
differ in grammaticality and isolate a morphological, syntactic, or semantic
phenomenon. In contrast to existing benchmarks of linguistic minimal pairs,
RuBLiMP is created by applying linguistic perturbations to automatically
annotated sentences from open text corpora and carefully curating test data. We
describe the data collection protocol and present the results of evaluating 25
language models in various scenarios. We find that the widely used language
models for Russian are sensitive to morphological and agreement-oriented
contrasts but fall behind humans on phenomena requiring understanding of
structural relations, negation, transitivity, and tense. RuBLiMP, the codebase,
and other materials are publicly available.
[COMMENTS]
to appear in EMNLP 2024 (main)
[LINK]
http://arxiv.org/abs/2406.19232v3
[DATE]
2024-10-02 03:56:57+08:00
[CATEGORIES]
cs.CL
MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
[AUTHORS]
Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
[COMMENTS]
Accepted at EMNLP 2024 Main Conference
[LINK]
http://arxiv.org/abs/2410.01036v1
[DATE]
2024-10-02 03:54:10+08:00
[CATEGORIES]
cs.CL
Investigating the Synergistic Effects of Dropout and Residual Connections on Language Model Training
[AUTHORS]
Qingyang Li, Weimao Ke
[ABSTRACT]
This paper examines the pivotal role of dropout techniques in mitigating
overfitting in language model training. It conducts a comprehensive
investigation into the influence of variable dropout rates on both individual
layers and residual connections within the context of language modeling. Our
study conducts training of a decoder implementation on the classic Tiny
Shakespeare data to examine the effects of the adjustments on training
efficiency and validation error. Results not only confirm the benefits of
dropout for regularization and residuals for convergence, but also reveal their
interesting interactions. There exists an important trade-off between the depth
of residual connections and the dropout on these connections for optimal deep
neural network convergence and generalization.
[COMMENTS]
5 pages, 4 figures
[LINK]
http://arxiv.org/abs/2410.01019v1
[DATE]
2024-10-02 03:27:00+08:00
[CATEGORIES]
cs.CL
cs.LG
“Hiding in Plain Sight”: Designing Synthetic Dialog Generation for Uncovering Socially Situated Norms
[AUTHORS]
Chengfei Wu, Dan Goldwasser
[ABSTRACT]
Naturally situated conversations capture the underlying social norms
appropriate for the topic of conversation, the relationship between
interlocutors and their communicative intent. This paper proposes a framework
for controlled generation of dialogues, spanning a wide range of interlocutors
attributes (such as age group, profession and personality types), relationship
types, conversation topics and conversational trajectories. We use this
framework to generate NormHint, a collection of dialogues consistent with these
rich settings and analyzed for norm violation leading to conflicts, and
potential steps for avoiding these conflicts by adhering to social norms and
preferring respectful utterances maintaining the communicative intents of the
original utterance. We present the results of human validation and automated
analysis of NormHint and show it captures a wide range of conversational topics
and scored highly by humans for the naturalness of the conversations based on
the prompted context.
[COMMENTS]
Pre-Print
[LINK]
http://arxiv.org/abs/2410.00998v1
[DATE]
2024-10-02 02:38:23+08:00
[CATEGORIES]
cs.CL
Creative and Context-Aware Translation of East Asian Idioms with GPT-4
[AUTHORS]
Kenan Tang, Peiyang Song, Yao Qin, Xifeng Yan
[ABSTRACT]
As a type of figurative language, an East Asian idiom condenses rich cultural
background into only a few characters. Translating such idioms is challenging
for human translators, who often resort to choosing a context-aware translation
from an existing list of candidates. However, compiling a dictionary of
candidate translations demands much time and creativity even for expert
translators. To alleviate such burden, we evaluate if GPT-4 can help generate
high-quality translations. Based on automatic evaluations of faithfulness and
creativity, we first identify Pareto-optimal prompting strategies that can
outperform translation engines from Google and DeepL. Then, at a low cost, our
context-aware translations can achieve far more high-quality translations per
idiom than the human baseline. We open-source all code and data to facilitate
further research.
[LINK]
http://arxiv.org/abs/2410.00988v1
[DATE]
2024-10-02 02:24:43+08:00
[CATEGORIES]
cs.CL
Characterizing Online Toxicity During the 2022 Mpox Outbreak: A Computational Analysis of Topical and Network Dynamics
[AUTHORS]
Lizhou Fan, Lingyao Li, Libby Hemphill
[ABSTRACT]
Background: Online toxicity, encompassing behaviors such as harassment,
bullying, hate speech, and the dissemination of misinformation, has become a
pressing social concern in the digital age. The 2022 Mpox outbreak, initially
termed “Monkeypox” but subsequently renamed to mitigate associated stigmas and
societal concerns, serves as a poignant backdrop to this issue. Objective: In
this research, we undertake a comprehensive analysis of the toxic online
discourse surrounding the 2022 Mpox outbreak. Our objective is to dissect its
origins, characterize its nature and content, trace its dissemination patterns,
and assess its broader societal implications, with the goal of providing
insights that can inform strategies to mitigate such toxicity in future crises.
Methods: We collected more than 1.6 million unique tweets and analyzed them
from five dimensions, including context, extent, content, speaker, and intent.
Utilizing BERT-based topic modeling and social network community clustering, we
delineated the toxic dynamics on Twitter. Results: We identified five
high-level topic categories in the toxic online discourse on Twitter, including
disease (46.6%), health policy and healthcare (19.3%), homophobia (23.9%),
politics (6.0%), and racism (4.1%). Through the toxicity diffusion networks of
mentions, retweets, and the top users, we found that retweets of toxic content
were widespread, while influential users rarely engaged with or countered this
toxicity through retweets. Conclusions: By tracking topical dynamics, we can
track the changing popularity of toxic content online, providing a better
understanding of societal challenges. Network dynamics spotlight key social
media influencers and their intents, indicating that addressing these central
figures in toxic discourse can enhance crisis communication and inform
policy-making.
[COMMENTS]
36 pages, 8 figures, and 12 tables
[LINK]
http://arxiv.org/abs/2408.11962v3
[DATE]
2024-10-02 01:50:31+08:00
[CATEGORIES]
cs.CL
Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning
[AUTHORS]
Santosh Kumar Radha, Yasamin Nouri Jelyani, Ara Ghukasyan, Oktay Goktas
[ABSTRACT]
Iterative human engagement is a common and effective means of leveraging the
advanced language processing power of large language models (LLMs). Using
well-structured prompts in a conversational manner, human users can effectively
influence an LLM to develop more thoughtful and accurate responses. Motivated
by this insight, we propose the Iteration of Thought (IoT) framework for
enhancing LLM responses by generating “thought”-provoking prompts vis a vis an
input query and the current iteration of an LLM’s response. Unlike static or
semi-static approaches, e.g. Chain of Thought (CoT) or Tree of Thoughts (ToT),
IoT adapts its reasoning path dynamically, based on evolving context, and
without generating alternate explorative thoughts which are ultimately
discarded. The three components of the IoT framework are (1) an Inner Dialogue
Agent (IDA) responsible for generating instructive, context-specific prompts;
(2) an LLM Agent (LLMA) that processes these prompts to refine its responses;
and (3) an iterative prompting loop that implements a conversation between the
former two components. We introduce two variants of our framework: Autonomous
Iteration of Thought (AIoT), where an LLM decides when to stop iterating, and
Guided Iteration of Thought (GIoT), which always forces a fixed number of
iterations. We investigate the performance of IoT across various datasets,
spanning complex reasoning tasks from the GPQA dataset, explorative
problem-solving in Game of 24, puzzle solving in Mini Crosswords, and multi-hop
question answering from the HotpotQA dataset. Our results show that IoT
represents a viable paradigm for autonomous response refinement in LLMs,
showcasing significant improvements over CoT and thereby enabling more adaptive
and efficient reasoning systems that minimize human intervention.
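The three components described above can be sketched as a short loop. This is only an illustrative reading of the abstract, not the authors' implementation: the function name, the callable signatures, and the `should_stop` check are hypothetical, and in practice each callable would wrap an LLM call.

```python
from typing import Callable


def iteration_of_thought(
    query: str,
    inner_dialogue_agent: Callable[[str, str], str],
    llm_agent: Callable[[str, str], str],
    should_stop: Callable[[str, str], bool],
    max_iterations: int = 5,
    guided: bool = False,
) -> str:
    """Hypothetical IoT loop; every name here is an assumption for illustration."""
    # Initial answer, produced without any guiding prompt.
    response = llm_agent(query, "")
    for _ in range(max_iterations):
        # Autonomous variant (AIoT): a stop check may end the loop early.
        # Guided variant (GIoT): the check is skipped, so the loop always
        # runs the fixed number of iterations.
        if not guided and should_stop(query, response):
            break
        # The inner dialogue agent turns the query and the current response
        # into a new context-specific prompt.
        prompt = inner_dialogue_agent(query, response)
        # The LLM agent refines its response using that prompt.
        response = llm_agent(query, prompt)
    return response
```

With `guided=True` the call performs exactly `max_iterations` refinements; with `guided=False` it stops as soon as `should_stop` returns true.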
[LINK]
http://arxiv.org/abs/2409.12618v2
[DATE]
2024-10-02 01:50:25+08:00
[CATEGORIES]
cs.CL
cs.LG
AIPatient: Simulating Patients with EHRs and LLM Powered Agentic Workflow
[AUTHORS]
Huizi Yu, Jiayan Zhou, Lingyao Li, Shan Chen, Jack Gallifant, Anye Shi, Xiang Li, Wenyue Hua, Mingyu Jin, Guang Chen, Yang Zhou, Zhao Li, Trisha Gupte, Ming-Li Chen, Zahra Azizi, Yongfeng Zhang, Themistocles L. Assimes, Xin Ma, Danielle S. Bitterman, Lin Lu, Lizhou Fan
[ABSTRACT]
Simulated patient systems play a crucial role in modern medical education and
research, providing safe, integrative learning environments and enabling
clinical decision-making simulations. Large Language Models (LLMs) could advance
simulated patient systems by replicating medical conditions and patient-doctor
interactions with high fidelity and low cost. However, ensuring the
effectiveness and trustworthiness of these systems remains a challenge, as they
require a large, diverse, and precise patient knowledgebase, along with a
robust and stable knowledge diffusion to users. Here, we developed AIPatient,
an advanced simulated patient system with AIPatient Knowledge Graph (AIPatient
KG) as the input and the Reasoning Retrieval-Augmented Generation (Reasoning
RAG) agentic workflow as the generation backbone. AIPatient KG samples data
from Electronic Health Records (EHRs) in the Medical Information Mart for
Intensive Care (MIMIC)-III database, producing a clinically diverse and
relevant cohort of 1,495 patients with high knowledgebase validity (F1 0.89).
Reasoning RAG leverages six LLM powered agents spanning tasks including
retrieval, KG query generation, abstraction, checker, rewrite, and
summarization. This agentic framework reaches an overall accuracy of 94.15% in
EHR-based medical Question Answering (QA), outperforming benchmarks that use
either no agent or only partial agent integration. Our system also presents
high readability (median Flesch Reading Ease 77.23; median Flesch Kincaid Grade
5.6), robustness (ANOVA F-value 0.6126, p>0.1), and stability (ANOVA F-value
0.782, p>0.1). The promising performance of the AIPatient system highlights its
potential to support a wide range of applications, including medical education,
model evaluation, and system integration.
[COMMENTS]
42 pages, 6 figures, 7 tables
[LINK]
http://arxiv.org/abs/2409.18924v2
[DATE]
2024-10-02 01:49:00+08:00
[CATEGORIES]
cs.CL
FLRT: Fluent Student-Teacher Redteaming
[AUTHORS]
T. Ben Thompson, Michael Sklar
[ABSTRACT]
Many publicly available language models have been safety tuned to reduce the
likelihood of toxic or liability-inducing text. To redteam or jailbreak these
models for compliance with toxic requests, users and security analysts have
developed adversarial prompting techniques. One attack method is to apply
discrete optimization techniques to the prompt. However, the resulting attack
strings are often gibberish text, easily filtered by defenders due to high
measured perplexity, and may fail for unseen tasks and/or well-tuned models. In
this work, we improve existing algorithms (primarily GCG and BEAST) to develop
powerful and fluent attacks on safety-tuned models like Llama-2 and Phi-3. Our
technique centers around a new distillation-based approach that encourages the
victim model to emulate a toxified finetune, either in terms of output
probabilities or internal activations. To encourage human-fluent attacks, we
add a multi-model perplexity penalty and a repetition penalty to the objective.
We also enhance optimizer strength by allowing token insertions, token swaps,
and token deletions and by using longer attack sequences. The resulting process
is able to reliably jailbreak the most difficult target models with prompts
that appear similar to human-written prompts. On Advbench we achieve attack
success rates $>93$% for Llama-2-7B, Llama-3-8B, and Vicuna-7B, while
maintaining model-measured perplexity $<33$; we achieve $95$% attack success
for Phi-3, though with higher perplexity. We also find a universally-optimized
single fluent prompt that induces $>88$% compliance on previously unseen tasks
across Llama-2-7B, Phi-3-mini and Vicuna-7B and transfers to other black-box
models.
[LINK]
http://arxiv.org/abs/2407.17447v2
[DATE]
2024-10-02 01:39:09+08:00
[CATEGORIES]
cs.CL
Conversational Complexity for Assessing Risk in Large Language Models
[AUTHORS]
John Burden, Manuel Cebrian, Jose Hernandez-Orallo
[ABSTRACT]
Large Language Models (LLMs) present a dual-use dilemma: they enable
beneficial applications while harboring potential for harm, particularly
through conversational interactions. Despite various safeguards, advanced LLMs
remain vulnerable. A watershed case was Kevin Roose’s notable conversation with
Bing, which elicited harmful outputs after extended interaction. This contrasts
with simpler early jailbreaks that produced similar content more easily,
raising the question: How much conversational effort is needed to elicit
harmful information from LLMs? We propose two measures: Conversational Length
(CL), which quantifies the conversation length used to obtain a specific
response, and Conversational Complexity (CC), defined as the Kolmogorov
complexity of the user’s instruction sequence leading to the response. To
address the incomputability of Kolmogorov complexity, we approximate CC using a
reference LLM to estimate the compressibility of user instructions. Applying
this approach to a large red-teaming dataset, we perform a quantitative
analysis examining the statistical distribution of harmful and harmless
conversational lengths and complexities. Our empirical findings suggest that
this distributional analysis and the minimisation of CC serve as valuable tools
for understanding AI safety, offering insights into the accessibility of
harmful information. This work establishes a foundation for a new perspective
on LLM safety, centered around the algorithmic complexity of pathways to harm.
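As a rough illustration of the two measures, the sketch below swaps the reference LLM for a general-purpose compressor (zlib), which is a much cruder stand-in than the approach the abstract describes. Measuring length in characters and joining user turns with newlines are assumptions made here, and the function names are hypothetical.

```python
import zlib
from typing import Sequence


def compressed_bits(text: str) -> int:
    """Upper-bound proxy for description length: size of the zlib output in bits."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))


def conversational_length(user_turns: Sequence[str]) -> int:
    """CL proxy (assumption): total number of characters in the user's turns."""
    return sum(len(turn) for turn in user_turns)


def conversational_complexity_proxy(user_turns: Sequence[str]) -> int:
    """CC proxy: compressed size of the whole user instruction sequence.

    A reference LLM would assign shorter codes to predictable text; zlib only
    captures literal repetition, so this is a coarse substitute.
    """
    return compressed_bits("\n".join(user_turns))
```

Two conversations of similar length can differ sharply under the proxy: a user who repeats the same request many times yields a highly compressible sequence, while a user who varies each instruction does not.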
[COMMENTS]
15 pages, 6 figures
[LINK]
http://arxiv.org/abs/2409.01247v2
[DATE]
2024-10-02 01:21:28+08:00
[CATEGORIES]
cs.CL
Automated Peer Reviewing in Paper SEA: Standardization, Evaluation, and Analysis
[AUTHORS]
Jianxiang Yu, Zichen Ding, Jiaqi Tan, Kangyang Luo, Zhenmin Weng, Chenghua Gong, Long Zeng, Renjing Cui, Chengcheng Han, Qiushi Sun, Zhiyong Wu, Yunshi Lan, Xiang Li
[COMMENTS]
Accepted by EMNLP 2024
[LINK]
http://arxiv.org/abs/2407.12857v2
[DATE]
2024-10-02 01:13:38+08:00
[CATEGORIES]
cs.CL
Do Music Generation Models Encode Music Theory?
[AUTHORS]
Megan Wei, Michael Freeman, Chris Donahue, Chen Sun
[ABSTRACT]
Music foundation models possess impressive music generation capabilities.
When people compose music, they may infuse their understanding of music into
their work, by using notes and intervals to craft melodies, chords to build
progressions, and tempo to create a rhythmic feel. To what extent is this true
of music generation models? More specifically, are fundamental Western music
theory concepts observable within the “inner workings” of these models? Recent
work proposed leveraging latent audio representations from music generation
models towards music information retrieval tasks (e.g. genre classification,
emotion recognition), which suggests that high-level musical characteristics
are encoded within these models. However, probing individual music theory
concepts (e.g. tempo, pitch class, chord quality) remains under-explored. Thus,
we introduce SynTheory, a synthetic MIDI and audio music theory dataset,
consisting of tempos, time signatures, notes, intervals, scales, chords, and
chord progressions concepts. We then propose a framework to probe for these
music theory concepts in music foundation models (Jukebox and MusicGen) and
assess how strongly they encode these concepts within their internal
representations. Our findings suggest that music theory concepts are
discernible within foundation models and that the degree to which they are
detectable varies by model size and layer.
[COMMENTS]
Accepted at ISMIR 2024. Dataset:
https://huggingface.co/datasets/meganwei/syntheory Code:
https://github.com/brown-palm/syntheory Website:
https://brown-palm.github.io/music-theory
[LINK]
http://arxiv.org/abs/2410.00872v1
[DATE]
2024-10-02 01:06:30+08:00
[CATEGORIES]
cs.CL
cs.LG
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model
[AUTHORS]
Jiahao Huo, Yibo Yan, Boren Hu, Yutao Yue, Xuming Hu
[COMMENTS]
Accepted by the Main Conference of Empirical Methods in Natural
Language Processing (EMNLP) 2024
[LINK]
http://arxiv.org/abs/2406.11193v2
[DATE]
2024-10-02 01:04:22+08:00
[CATEGORIES]
cs.CL
On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation
[AUTHORS]
Eleftheria Briakou, Zhongtao Liu, Colin Cherry, Markus Freitag
[ABSTRACT]
This paper investigates the impact of verbose LLM translations on evaluation.
We first demonstrate the prevalence of this behavior across several LLM outputs
drawn from the WMT 2024 general shared task on machine translation. We then
identify the primary triggers of verbosity, including safety, copyright
concerns, and insufficient context in short input queries. Finally, we show
that ignoring this behavior unfairly penalizes more verbose LLMs according to
both automatic and human evaluations, highlighting the need to address this
issue for more accurate future evaluations.
[LINK]
http://arxiv.org/abs/2410.00863v1
[DATE]
2024-10-02 00:59:01+08:00
[CATEGORIES]
cs.CL
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
[AUTHORS]
Mohammad Abuzar Hashemi, Zhanghexuan Li, Mihir Chauhan, Yan Shen, Abhishek Satbhai, Mir Basheer Ali, Mingchen Gao, Sargur Srihari
[ABSTRACT]
Pre-training visual and textual representations from large-scale image-text
pairs is becoming a standard approach for many downstream vision-language
tasks. The transformer-based models learn inter- and intra-modal attention
through a list of self-supervised learning tasks. This paper proposes LAViTeR,
a novel architecture for visual and textual representation learning. The main
module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks,
GAN-based image synthesis and Image Captioning. We also propose a new
evaluation metric measuring the similarity between the learnt visual and
textual embeddings. The experimental results on two public datasets, CUB and
MS-COCO, demonstrate superior visual and textual representation alignment in
the joint feature embedding space.
[COMMENTS]
15 pages, 10 Figures, 5 Tables. Accepted for Oral Presentation at
Irish Machine Vision and Image Processing Conference Proceedings (IMVIP),
2024
[LINK]
http://arxiv.org/abs/2109.04993v4
[DATE]
2024-10-02 00:54:57+08:00
[CATEGORIES]
cs.CL
Quantifying reliance on external information over parametric knowledge during Retrieval Augmented Generation (RAG) using mechanistic analysis
[AUTHORS]
Reshmi Ghosh, Rahul Seetharaman, Hitesh Wadhwa, Somyaa Aggarwal, Samyadeep Basu, Soundararajan Srinivasan, Wenlong Zhao, Shreyas Chaudhari, Ehsan Aghazadeh
[ABSTRACT]
Retrieval Augmented Generation (RAG) is a widely used approach for leveraging
external context in several natural language applications such as question
answering and information retrieval. Yet the exact manner in which a Language
Model (LM) leverages this non-parametric memory or retrieved context isn’t
clearly understood. This paper mechanistically examines the RAG pipeline to
highlight that LMs demonstrate a “shortcut” effect and have a strong bias
towards utilizing the retrieved context to answer questions, while relying
minimally on model priors. We propose (a) Causal Mediation Analysis, for
proving that parametric memory is minimally utilized when answering a question,
and (b) Attention Contributions and Knockouts, for showing that the last-token
residual stream does not get enriched from the subject token in the question but
does get enriched from tokens of the RAG context. We find this pronounced
“shortcut” behaviour to hold across both LLMs (e.g., LlaMa) and SLMs (e.g., Phi).
[COMMENTS]
Accepted to Blackbox NLP @ EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.00857v1
[DATE]
2024-10-02 00:48:13+08:00
[CATEGORIES]
cs.CL
Dual-Space Knowledge Distillation for Large Language Models
[AUTHORS]
Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, Jinan Xu
[COMMENTS]
The camera-ready version for EMNLP 2024 main conference. 17 pages, 11
figures, code available at: https://github.com/songmzhang/DSKD
[LINK]
http://arxiv.org/abs/2406.17328v3
[DATE]
2024-10-02 00:45:12+08:00
[CATEGORIES]
cs.CL
Does Vision Accelerate Hierarchical Generalization in Neural Language Learners?
[AUTHORS]
Tatsuki Kuribayashi, Timothy Baldwin
[ABSTRACT]
Neural language models (LMs) are arguably less data-efficient than humans
from a language acquisition perspective. One fundamental question is why this
human-LM gap arises. This study explores the advantage of grounded language
acquisition, specifically the impact of visual information – which humans can
usually rely on but LMs largely do not have access to during language
acquisition – on syntactic generalization in LMs. Our experiments, following
the poverty of stimulus paradigm under two scenarios (using artificial vs.
naturalistic images), demonstrate that if the alignments between the linguistic
and visual components are clear in the input, access to vision data does help
with the syntactic generalization of LMs, but if not, visual input does not
help. This highlights the need for additional biases or signals, such as mutual
gaze, to enhance cross-modal alignment and enable efficient syntactic
generalization in multimodal LMs.
[COMMENTS]
14 pages
[LINK]
http://arxiv.org/abs/2302.00667v2
[DATE]
2024-10-02 00:29:14+08:00
[CATEGORIES]
cs.CL
VHASR: A Multimodal Speech Recognition System With Vision Hotwords
[AUTHORS]
Jiliang Hu, Zuchao Li, Ping Wang, Haojun Ai, Lefei Zhang, Hai Zhao
[COMMENTS]
14 pages, 6 figures, accepted by EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.00822v1
[DATE]
2024-10-02 00:06:02+08:00
[CATEGORIES]
cs.CL
Almost Sure Convergence of Average Reward Temporal Difference Learning
[AUTHORS]
Ethan Blaser, Shangtong Zhang
[ABSTRACT]
Tabular average reward Temporal Difference (TD) learning is perhaps the
simplest and the most fundamental policy evaluation algorithm in average reward
reinforcement learning. At least 25 years after its discovery, we are
finally able to provide a long-awaited almost sure convergence analysis.
Namely, we are the first to prove that, under very mild conditions, tabular
average reward TD converges almost surely to a sample path dependent fixed
point. Key to this success is a new general stochastic approximation result
concerning nonexpansive mappings with Markovian and additive noise, built on
recent advances in stochastic Krasnoselskii-Mann iterations.
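For readers unfamiliar with the algorithm under analysis, a minimal sketch of tabular average reward TD follows. It uses a common textbook form of the update (a running average-reward estimate plus a differential value table); the exact update rule, the constant step sizes, and all names are assumptions made here, and the paper's conditions may differ.

```python
def average_reward_td(transitions, num_states, alpha=0.1, beta=0.01):
    """Tabular differential TD(0), textbook form (an assumption, not the paper's code).

    transitions: iterable of (state, reward, next_state) along one sample path.
    Returns the differential value table and the average-reward estimate.
    """
    values = [0.0] * num_states
    avg_reward = 0.0
    for state, reward, next_state in transitions:
        # TD error measured relative to the current average-reward estimate.
        td_error = reward - avg_reward + values[next_state] - values[state]
        values[state] += alpha * td_error
        # Running estimate of the long-run average reward.
        avg_reward += beta * (reward - avg_reward)
    return values, avg_reward


# Deterministic two-state cycle: 0 -> 1 with reward 1, then 1 -> 0 with reward 3.
cycle = [(0, 1.0, 1), (1, 3.0, 0)] * 10_000
values, avg = average_reward_td(cycle, num_states=2)
```

On this toy cycle the average-reward estimate settles near 2 and the difference `values[0] - values[1]` settles near -1. Only differences between values are pinned down by the update, which is one way to see why the limit point can depend on the sample path.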
[LINK]
http://arxiv.org/abs/2409.19546v3
[DATE]
2024-10-02 23:57:57+08:00
[CATEGORIES]
cs.LG
Positional Attention: Out-of-Distribution Generalization and Expressivity for Neural Algorithmic Reasoning
[AUTHORS]
Artur Back de Luca, George Giapitzakis, Shenghao Yang, Petar Veličković, Kimon Fountoulakis
[ABSTRACT]
There has been a growing interest in the ability of neural networks to solve
algorithmic tasks, such as arithmetic, summary statistics, and sorting. While
state-of-the-art models like Transformers have demonstrated good generalization
performance on in-distribution tasks, their out-of-distribution (OOD)
performance is poor when trained end-to-end. In this paper, we focus on value
generalization, a common instance of OOD generalization where the test
distribution has the same input sequence length as the training distribution,
but the value ranges in the training and test distributions do not necessarily
overlap. To address this issue, we propose that using fixed positional
encodings to determine attention weights, referred to as positional
attention, enhances empirical OOD performance while maintaining expressivity. We
support our claim about expressivity by proving that Transformers with
positional attention can effectively simulate parallel algorithms.
[COMMENTS]
37 pages, 22 figures
[LINK]
http://arxiv.org/abs/2410.01686v1
[DATE]
2024-10-02 23:55:08+08:00
[CATEGORIES]
cs.LG
Differentially Private Bootstrap: New Privacy Analysis and Inference Strategies
[AUTHORS]
Zhanyu Wang, Guang Cheng, Jordan Awan
[ABSTRACT]
Differentially private (DP) mechanisms protect individual-level information
by introducing randomness into the statistical analysis procedure. Despite the
availability of numerous DP tools, there remains a lack of general techniques
for conducting statistical inference under DP. We examine a DP bootstrap
procedure that releases multiple private bootstrap estimates to infer the
sampling distribution and construct confidence intervals (CIs). Our privacy
analysis presents new results on the privacy cost of a single DP bootstrap
estimate, applicable to any DP mechanism, and identifies some misapplications
of the bootstrap in the existing literature. For the composition of the DP
bootstrap, we present a numerical method to compute the exact privacy cost of
releasing multiple DP bootstrap estimates, and using the Gaussian-DP (GDP)
framework (Dong et al., 2022), we show that the release of $B$ DP bootstrap
estimates from mechanisms satisfying $(\mu/\sqrt{(2-2/\mathrm{e})B})$-GDP
asymptotically satisfies $\mu$-GDP as $B$ goes to infinity. Then, we perform
private statistical inference by post-processing the DP bootstrap estimates. We
prove that our point estimates are consistent, our standard CIs are
asymptotically valid, and both enjoy optimal convergence rates. To further
improve the finite-sample performance, we use deconvolution with DP bootstrap
estimates to accurately infer the sampling distribution. We derive CIs for
tasks such as population mean estimation, logistic regression, and quantile
regression, and we compare them to existing methods using simulations and
real-world experiments on 2016 Canada Census data. Our private CIs achieve the
nominal coverage level and offer the first approach to private inference for
quantile regression.
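The asymptotic composition result can be read as a budgeting rule: to target an overall $\mu$-GDP guarantee across $B$ releases, each release may use parameter $\mu/\sqrt{(2-2/\mathrm{e})B}$. The helper below only evaluates that expression. Its name and arguments are hypothetical, and because the statement is asymptotic in $B$, a finite-$B$ analysis would need the exact numerical method the abstract mentions.

```python
import math


def per_estimate_gdp(mu_total: float, num_bootstrap: int) -> float:
    """Per-release GDP parameter from the asymptotic formula mu / sqrt((2 - 2/e) * B)."""
    if mu_total <= 0 or num_bootstrap < 1:
        raise ValueError("mu_total must be positive and num_bootstrap at least 1")
    return mu_total / math.sqrt((2.0 - 2.0 / math.e) * num_bootstrap)
```

Quadrupling the number of bootstrap estimates halves the parameter available to each one, since the expression scales as $1/\sqrt{B}$.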
[LINK]
http://arxiv.org/abs/2210.06140v3
[DATE]
2024-10-02 23:43:43+08:00
[CATEGORIES]
cs.LG
Sparse Covariance Neural Networks
[AUTHORS]
Andrea Cavallo, Zhan Gao, Elvin Isufi
[ABSTRACT]
Covariance Neural Networks (VNNs) perform graph convolutions on the
covariance matrix of tabular data and achieve success in a variety of
applications. However, the empirical covariance matrix on which the VNNs
operate may contain many spurious correlations, making VNNs’ performance
inconsistent due to these noisy estimates and decreasing their computational
efficiency. To tackle this issue, we put forth Sparse coVariance Neural
Networks (S-VNNs), a framework that applies sparsification techniques on the
sample covariance matrix before convolution. When the true covariance matrix is
sparse, we propose hard and soft thresholding to improve covariance estimation
and reduce computational cost. Instead, when the true covariance is dense, we
propose stochastic sparsification where data correlations are dropped in
probability according to principled strategies. We show that S-VNNs are more
stable than nominal VNNs as well as sparse principal component analysis. By
analyzing the impact of sparsification on their behavior, we provide novel
connections between S-VNN stability and data distribution. We support our
theoretical findings with experimental results on various application
scenarios, ranging from brain data to human action recognition, and show an
improved task performance, stability, and computational efficiency of S-VNNs
compared with nominal VNNs.
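Hard and soft thresholding of a sample covariance matrix can be sketched in a few lines. This is a generic illustration rather than the S-VNN implementation: leaving the diagonal (the variances) untouched and using one global threshold `tau` are assumptions made here.

```python
def hard_threshold(cov, tau):
    """Keep an off-diagonal entry only if its magnitude is at least tau."""
    n = len(cov)
    return [
        [cov[i][j] if i == j or abs(cov[i][j]) >= tau else 0.0 for j in range(n)]
        for i in range(n)
    ]


def soft_threshold(cov, tau):
    """Shrink each off-diagonal entry toward zero by tau, clipping at zero."""
    n = len(cov)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            entry = cov[i][j]
            if i == j:
                out[i][j] = entry  # variances are left as they are (assumption)
            else:
                shrunk = max(abs(entry) - tau, 0.0)
                out[i][j] = shrunk if entry >= 0 else -shrunk
    return out
```

Hard thresholding keeps surviving entries at their original value, while soft thresholding also shrinks them, so the two produce the same sparsity pattern but different magnitudes.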
[LINK]
http://arxiv.org/abs/2410.01669v1
[DATE]
2024-10-02 23:37:12+08:00
[CATEGORIES]
cs.LG
Conformal Generative Modeling with Improved Sample Efficiency through Sequential Greedy Filtering
[AUTHORS]
Klaus-Rudolf Kladny, Bernhard Schölkopf, Michael Muehlebach
[ABSTRACT]
Generative models lack rigorous statistical guarantees for their outputs and
are therefore unreliable in safety-critical applications. In this work, we
propose Sequential Conformal Prediction for Generative Models (SCOPE-Gen), a
sequential conformal prediction method producing prediction sets that satisfy a
rigorous statistical guarantee called conformal admissibility control. This
guarantee states that with high probability, the prediction sets contain at
least one admissible (or valid) example. To this end, our method first samples
an initial set of i.i.d. examples from a black box generative model. Then, this
set is iteratively pruned via so-called greedy filters. As a consequence of the
iterative generation procedure, admissibility of the final prediction set
factorizes as a Markov chain. This factorization is crucial because it allows
each factor to be controlled separately, using conformal prediction. In comparison to
prior work, our method demonstrates a large reduction in the number of
admissibility evaluations during calibration. This reduction is important in
safety-critical applications, where these evaluations must be conducted
manually by domain experts and are therefore costly and time consuming. We
highlight the advantages of our method in terms of admissibility evaluations
and cardinality of the prediction sets through experiments in natural language
generation and molecular graph extension tasks.
[LINK]
http://arxiv.org/abs/2410.01660v1
[DATE]
2024-10-02 23:26:52+08:00
[CATEGORIES]
cs.LG
Smaller Confidence Intervals From IPW Estimators via Data-Dependent Coarsening
[AUTHORS]
Alkis Kalavasis, Anay Mehrotra, Manolis Zampetakis
[ABSTRACT]
Inverse propensity-score weighted (IPW) estimators are prevalent in causal
inference for estimating average treatment effects in observational studies.
Under unconfoundedness, given accurate propensity scores and $n$ samples, the
size of confidence intervals of IPW estimators scales down with $n$, and
several of their variants improve the rate of scaling. However, neither IPW
estimators nor their variants are robust to inaccuracies: even if a single
covariate has an $\varepsilon>0$ additive error in the propensity score, the
size of confidence intervals of these estimators can increase arbitrarily.
Moreover, even without errors, the rate with which the confidence intervals of
these estimators go to zero with $n$ can be arbitrarily slow in the presence of
extreme propensity scores (those close to 0 or 1).
We introduce a family of Coarse IPW (CIPW) estimators that captures existing
IPW estimators and their variants. Each CIPW estimator is an IPW estimator on a
coarsened covariate space, where certain covariates are merged. Under mild
assumptions, e.g., Lipschitzness in expected outcomes and sparsity of extreme
propensity scores, we give an efficient algorithm to find a robust estimator:
given $\varepsilon$-inaccurate propensity scores and $n$ samples, its
confidence interval size scales with $\varepsilon+1/\sqrt{n}$. In contrast,
under the same assumptions, existing estimators’ confidence interval sizes are
$\Omega(1)$ irrespective of $\varepsilon$ and $n$. Crucially, our estimator is
data-dependent and we show that no data-independent CIPW estimator can be
robust to inaccuracies.
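To make the sensitivity to extreme propensity scores concrete, here is a sketch of the standard IPW estimate of the average treatment effect. It is the baseline estimator, not the coarsened CIPW variant, whose construction the abstract does not spell out. The function name and argument layout are assumptions.

```python
def ipw_ate(treated, outcomes, propensities):
    """Standard IPW estimate of the average treatment effect.

    treated: 0/1 treatment indicators; outcomes: observed outcomes;
    propensities: estimated probability of treatment for each unit.
    """
    n = len(outcomes)
    total = 0.0
    for t, y, e in zip(treated, outcomes, propensities):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        # Treated units are weighted by 1/e, control units by 1/(1 - e).
        total += t * y / e - (1 - t) * y / (1.0 - e)
    return total / n
```

Because each weight is the reciprocal of a propensity score, a single treated unit with a score near 0 can dominate the sum, which is the failure mode that motivates merging covariates.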
[COMMENTS]
Accepted for presentation at the 37th Conference on Learning Theory
(COLT) 2024
[LINK]
http://arxiv.org/abs/2410.01658v1
[DATE]
2024-10-02 23:25:26+08:00
[CATEGORIES]
cs.LG
Scalable and Consistent Graph Neural Networks for Distributed Mesh-based Data-driven Modeling
[AUTHORS]
Shivam Barwey, Riccardo Balin, Bethany Lusch, Saumil Patel, Ramesh Balakrishnan, Pinaki Pal, Romit Maulik, Venkatram Vishwanath
[ABSTRACT]
This work develops a distributed graph neural network (GNN) methodology for
mesh-based modeling applications using a consistent neural message passing
layer. As the name implies, the focus is on enabling scalable operations that
satisfy physical consistency via halo nodes at sub-graph boundaries. Here,
consistency refers to the fact that a GNN trained and evaluated on one rank
(one large graph) is arithmetically equivalent to evaluations on multiple ranks
(a partitioned graph). This concept is demonstrated by interfacing GNNs with
NekRS, a GPU-capable exascale CFD solver developed at Argonne National
Laboratory. It is shown how the NekRS mesh partitioning can be linked to the
distributed GNN training and inference routines, resulting in a scalable
mesh-based data-driven modeling workflow. We study the impact of consistency on
the scalability of mesh-based GNNs, demonstrating efficient scaling in
consistent GNNs for up to O(1B) graph nodes on the Frontier exascale
supercomputer.
[LINK]
http://arxiv.org/abs/2410.01657v1
[DATE]
2024-10-02 23:22:27+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Jane H. Lee, Anay Mehrotra, Manolis Zampetakis
[ABSTRACT]
We study the estimation of distributional parameters when samples are shown
only if they fall in some unknown set $S \subseteq \mathbb{R}^d$. Kontonis,
Tzamos, and Zampetakis (FOCS’19) gave a $d^{\mathrm{poly}(1/\varepsilon)}$ time
algorithm for finding $\varepsilon$-accurate parameters for the special case of
Gaussian distributions with diagonal covariance matrix. Recently, Diakonikolas,
Kane, Pittas, and Zarifis (COLT’24) showed that this exponential dependence on
$1/\varepsilon$ is necessary even when $S$ belongs to some well-behaved
classes. These works leave the following open problems which we address in this
work: Can we estimate the parameters of any Gaussian or even extend beyond
Gaussians? Can we design $\mathrm{poly}(d/\varepsilon)$ time algorithms when
$S$ is a simple set such as a halfspace?
We make progress on both of these questions by providing the following
results:
[COMMENTS]
Accepted for presentation at the 65th IEEE Symposium on Foundations
of Computer Science (FOCS), 2024; abstract shortened for arXiv
[LINK]
http://arxiv.org/abs/2410.01656v1
[DATE]
2024-10-02 23:21:07+08:00
[CATEGORIES]
cs.LG
Extending Contextual Self-Modulation: Meta-Learning Across Modalities, Task Dimensionalities, and Data Regimes
[AUTHORS]
Roussel Desmond Nzoyem, David A. W. Barton, Tom Deakin
[ABSTRACT]
Contextual Self-Modulation (CSM) is a potent regularization mechanism for the
Neural Context Flow (NCF) framework which demonstrates powerful meta-learning
of physical systems. However, CSM has limitations in its applicability across
different modalities and in high-data regimes. In this work, we introduce two
extensions: $i$CSM, which expands CSM to infinite-dimensional tasks, and
StochasticNCF, which improves scalability. These extensions are demonstrated
through comprehensive experimentation on a range of tasks, including dynamical
systems with parameter variations, computer vision challenges, and curve
fitting problems. $i$CSM embeds the contexts into an infinite-dimensional
function space, as opposed to CSM which uses finite-dimensional context
vectors. StochasticNCF enables the application of both CSM and $i$CSM to
high-data scenarios by providing an unbiased approximation of meta-gradient
updates through a sampled set of nearest environments. Additionally, we
incorporate higher-order Taylor expansions via Taylor-Mode automatic
differentiation, revealing that higher-order approximations do not necessarily
enhance generalization. Finally, we demonstrate how CSM can be integrated into
other meta-learning frameworks with FlashCAVIA, a computationally efficient
extension of the CAVIA meta-learning framework (Zintgraf et al. 2019).
FlashCAVIA outperforms its predecessor across various benchmarks and reinforces
the utility of bi-level optimization techniques. Together, these contributions
establish a robust framework for tackling an expanded spectrum of meta-learning
tasks, offering practical insights for out-of-distribution generalization. Our
open-sourced library, designed for flexible integration of self-modulation into
contextual meta-learning workflows, is available at
github.com/ddrous/self-mod.
[COMMENTS]
23 pages, 11 figures, 5 tables
[LINK]
http://arxiv.org/abs/2410.01655v1
[DATE]
2024-10-02 23:19:35+08:00
[CATEGORIES]
cs.LG
Neural Context Flows for Meta-Learning of Dynamical Systems
[AUTHORS]
Roussel Desmond Nzoyem, David A. W. Barton, Tom Deakin
[ABSTRACT]
Neural Ordinary Differential Equations (NODEs) often struggle to adapt to new
dynamic behaviors caused by parameter changes in the underlying system, even
when these dynamics are similar to previously observed behaviors. This problem
becomes more challenging when the changing parameters are unobserved, meaning
their value or influence cannot be directly measured when collecting data. To
address this issue, we introduce Neural Context Flow (NCF), a robust and
interpretable Meta-Learning framework that includes uncertainty estimation. NCF
uses higher-order Taylor expansion to enable contextual self-modulation,
allowing context vectors to influence dynamics from other domains while also
modulating themselves. After establishing convergence guarantees, we
empirically test NCF and compare it to related adaptation methods. Our results
show that NCF achieves state-of-the-art Out-of-Distribution performance on 5
out of 6 linear and non-linear benchmark problems. Through extensive
experiments, we explore the flexible model architecture of NCF and the encoded
representations within the learned context vectors. Our findings highlight the
potential implications of NCF for foundational models in the physical sciences,
offering a promising approach to improving the adaptability and generalization
of NODEs in various scientific applications. Our code is openly available at
https://github.com/ddrous/ncflow.
[COMMENTS]
31 pages, 19 figures, 8 tables
[LINK]
http://arxiv.org/abs/2405.02154v3
[DATE]
2024-10-02 23:18:44+08:00
[CATEGORIES]
cs.LG
Towards Futuristic Autonomous Experimentation–A Surprise-Reacting Sequential Experiment Policy
[AUTHORS]
Imtiaz Ahmed, Satish Bukkapatnam, Bhaskar Botcha, Yu Ding
[ABSTRACT]
An autonomous experimentation platform in manufacturing is supposedly capable
of conducting a sequential search for finding suitable manufacturing conditions
by itself or even for discovering new materials with minimal human
intervention. The core of the intelligent control of such platforms is a policy
to decide where to conduct the next experiment based on what has been done thus
far. Such a policy inevitably trades off between exploitation and exploration.
Currently, the prevailing approach is to use various acquisition functions in
the Bayesian optimization framework. We discuss whether it is beneficial to
trade off exploitation versus exploration by measuring the element and degree
of surprise associated with the immediate past observation. We devise a
surprise-reacting policy using two existing surprise metrics, known as the
Shannon surprise and Bayesian surprise. Our analysis shows that the
surprise-reacting policy appears to be better suited for quickly characterizing
the overall landscape of a response surface under resource constraints. We do
not claim that we have a fully autonomous experimentation system but believe
that the surprise-reacting capability benefits the automation of sequential
decisions in autonomous experimentation.
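
The two surprise metrics named in this abstract have standard textbook forms: Shannon surprise is the negative log predictive probability of an observation, and Bayesian surprise is a KL divergence between the updated and the previous belief. The sketch below computes both for a 1-D Gaussian belief with known noise variance. The conjugate-Gaussian model, the direction of the KL divergence, and all function names are assumptions made for illustration; this is not the authors' policy.

```python
import math


def shannon_surprise(y, pred_mean, pred_var):
    """Negative log density of y under a Gaussian predictive distribution."""
    return 0.5 * math.log(2 * math.pi * pred_var) + (y - pred_mean) ** 2 / (2 * pred_var)


def gaussian_posterior(prior_mean, prior_var, y, noise_var):
    """Conjugate update of a Gaussian belief about a mean after observing y."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + y / noise_var)
    return post_mean, post_var


def bayesian_surprise(prior_mean, prior_var, post_mean, post_var):
    """KL(posterior || prior) between two 1-D Gaussians (assumed direction)."""
    return 0.5 * (
        math.log(prior_var / post_var)
        + (post_var + (post_mean - prior_mean) ** 2) / prior_var
        - 1.0
    )


if __name__ == "__main__":
    prior_mean, prior_var, noise_var = 0.0, 1.0, 1.0
    for y in (0.1, 3.0):
        # The predictive variance adds the observation noise to the belief variance.
        s = shannon_surprise(y, prior_mean, prior_var + noise_var)
        m, v = gaussian_posterior(prior_mean, prior_var, y, noise_var)
        b = bayesian_surprise(prior_mean, prior_var, m, v)
        print(f"y={y}: Shannon surprise={s:.3f}, Bayesian surprise={b:.3f}")
```

An observation far from the current belief scores higher on both metrics, which is the signal a surprise-reacting policy could use to lean towards exploration.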
[LINK]
http://arxiv.org/abs/2112.00600v3
[DATE]
2024-10-02 23:17:56+08:00
[CATEGORIES]
cs.LG
shapiq: Shapley Interactions for Machine Learning
[AUTHORS]
Maximilian Muschalik, Hubert Baniecki, Fabian Fumagalli, Patrick Kolpaczki, Barbara Hammer, Eyke Hüllermeier
[ABSTRACT]
Originally rooted in game theory, the Shapley Value (SV) has recently become
an important tool in machine learning research. Perhaps most notably, it is
used for feature attribution and data valuation in explainable artificial
intelligence. Shapley Interactions (SIs) naturally extend the SV and address
its limitations by assigning joint contributions to groups of entities, which
enhances understanding of black-box machine learning models. Due to the
exponential complexity of computing SVs and SIs, various methods have been
proposed that exploit structural assumptions or yield probabilistic estimates
given limited resources. In this work, we introduce shapiq, an open-source
Python package that unifies state-of-the-art algorithms to efficiently compute
SVs and any-order SIs in an application-agnostic framework. Moreover, it
includes a benchmarking suite containing 11 machine learning applications of
SIs with pre-computed games and ground-truth values to systematically assess
computational performance across domains. For practitioners, shapiq is able to
explain and visualize any-order feature interactions in predictions of models,
including vision transformers, language models, as well as XGBoost and LightGBM
with TreeSHAP-IQ. With shapiq, we extend shap beyond feature attributions and
consolidate the application of SVs and SIs in machine learning that facilitates
future research. The source code and documentation are available at
https://github.com/mmschlk/shapiq.
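
To make the distinction between SVs and SIs concrete, the sketch below computes exact Shapley values and the pairwise Shapley interaction index by brute-force subset enumeration, which is where the exponential cost mentioned above comes from. It deliberately does not use the shapiq package; the toy game and all function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def subsets(players):
    """Yield every subset of `players` as a frozenset."""
    for r in range(len(players) + 1):
        for combo in combinations(players, r):
            yield frozenset(combo)


def shapley_value(n, value, i):
    """Exact Shapley value of player i in an n-player game `value`."""
    others = [p for p in range(n) if p != i]
    total = 0.0
    for s in subsets(others):
        weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
        total += weight * (value(s | {i}) - value(s))
    return total


def shapley_interaction(n, value, i, j):
    """Exact pairwise Shapley interaction index of players i and j."""
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for s in subsets(others):
        weight = factorial(len(s)) * factorial(n - len(s) - 2) / factorial(n - 1)
        # Second-order discrete derivative: the joint effect beyond the individual effects.
        delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
        total += weight * delta
    return total


def toy_game(coalition):
    """Additive weights plus a bonus that appears only when players 0 and 1 cooperate."""
    weights = {0: 1.0, 1: 2.0, 2: 3.0}
    bonus = 4.0 if {0, 1} <= coalition else 0.0
    return sum(weights[p] for p in coalition) + bonus


if __name__ == "__main__":
    print([shapley_value(3, toy_game, i) for i in range(3)])
    print(shapley_interaction(3, toy_game, 0, 1), shapley_interaction(3, toy_game, 0, 2))
```

In this toy game the Shapley values split the cooperation bonus evenly between players 0 and 1, while the interaction index attributes the whole bonus to the pair, which is the extra information an SI carries.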
[COMMENTS]
NeurIPS 2024
[LINK]
http://arxiv.org/abs/2410.01649v1
[DATE]
2024-10-02 23:16:53+08:00
[CATEGORIES]
cs.LG
Fitting an ellipsoid to a quadratic number of random points
[AUTHORS]
Afonso S. Bandeira, Antoine Maillard, Shahar Mendelson, Elliot Paquette
[ABSTRACT]
We consider the problem $(\mathrm{P})$ of fitting $n$ standard Gaussian
random vectors in $\mathbb{R}^d$ to the boundary of a centered ellipsoid, as
$n, d \to \infty$. This problem is conjectured to have a sharp feasibility
transition: for any $\varepsilon > 0$, if $n \leq (1 - \varepsilon) d^2 / 4$
then $(\mathrm{P})$ has a solution with high probability, while $(\mathrm{P})$
has no solutions with high probability if $n \geq (1 + \varepsilon) d^2 /4$. So
far, only a trivial bound $n \geq d^2 / 2$ is known on the negative side, while
the best results on the positive side assume $n \leq d^2 /
\mathrm{polylog}(d)$. In this work, we improve over previous approaches using a
key result of Bartl & Mendelson (2022) on the concentration of Gram matrices of
random vectors under mild assumptions on their tail behavior. This allows us to
give a simple proof that $(\mathrm{P})$ is feasible with high probability when
$n \leq d^2 / C$, for a (possibly large) constant $C > 0$.
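
Problem $(\mathrm{P})$ asks for a positive semidefinite matrix $A$ with $x_i^\top A x_i = 1$ for every point $x_i$, which is a set of linear constraints on the entries of $A$ plus a cone constraint. The toy sketch below works in $d = 2$ with three hand-picked points (not Gaussian samples), solves the linear constraints by Cramer's rule, and then checks positive semidefiniteness. All function names are hypothetical, and the example only illustrates the feasibility question, not the proof technique of the paper.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def fit_centered_conic(points):
    """Solve x^T A x = 1 at three 2-D points for A = [[a11, a12], [a12, a22]]."""
    # Each constraint reads a11*x^2 + 2*a12*x*y + a22*y^2 = 1, linear in (a11, a12, a22).
    rows = [[x * x, 2.0 * x * y, y * y] for x, y in points]
    d = det3(rows)
    if abs(d) < 1e-12:
        raise ValueError("points do not determine A uniquely")
    solution = []
    for col in range(3):
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = 1.0  # Cramer's rule: replace one column by the right-hand side
        solution.append(det3(m) / d)
    return tuple(solution)


def is_psd(a11, a12, a22, tol=1e-9):
    """A symmetric 2x2 matrix is PSD iff its diagonal and determinant are nonnegative."""
    return a11 >= -tol and a22 >= -tol and a11 * a22 - a12 * a12 >= -tol


if __name__ == "__main__":
    s = math.sqrt(0.5)
    on_circle = [(1.0, 0.0), (0.0, 1.0), (s, s)]
    off_ellipse = [(1.0, 0.0), (0.0, 1.0), (0.1, 0.1)]
    for pts in (on_circle, off_ellipse):
        a = fit_centered_conic(pts)
        print(a, "feasible" if is_psd(*a) else "infeasible")
```

The second point set satisfies the linear constraints only with an indefinite matrix, so no centered ellipse passes through it; this is the kind of infeasibility that becomes typical once $n$ is large relative to $d^2$.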
[COMMENTS]
17 pages; Update (v2) to match the published version
[LINK]
http://arxiv.org/abs/2307.01181v2
[DATE]
2024-10-02 23:13:40+08:00
[CATEGORIES]
cs.LG
Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?
[AUTHORS]
Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, Guoren Wang
[ABSTRACT]
Low-rank training has emerged as a promising approach for reducing memory
usage in training Large Language Models (LLMs). Previous methods either rely on
decomposing weight matrices (e.g., LoRA), or seek to decompose gradient
matrices (e.g., GaLore) to ensure reduced memory consumption. However, both of
them constrain the training in a low-rank subspace, thus inevitably leading to
sub-optimal performance. This raises a question: whether it is possible to
consistently preserve the low-rank constraint for memory efficiency, while
achieving full-rank training (i.e., training with full-rank gradients of
full-rank weights) to avoid inferior outcomes? In this paper, we propose a new
plug-and-play training framework for LLMs called Fira, as the first attempt to
achieve this goal. First, we observe an interesting phenomenon during LLM
training: the scaling impact of adaptive optimizers (e.g., Adam) on the
gradient norm remains similar from low-rank to full-rank training. Based on
this observation, we propose a norm-based scaling method, which utilizes the
scaling impact of low-rank optimizers as substitutes for that of original
full-rank optimizers to enable full-rank training. In this way, we can preserve
the low-rank constraint in the optimizer while achieving full-rank training for
better performance. Moreover, we find that there are sudden gradient rises
during the optimization process, potentially causing loss spikes. To address
this, we further put forward a norm-growth limiter to smooth the gradient via
regulating the relative increase of gradient norms. Extensive experiments on
the pre-training and fine-tuning of LLMs show that Fira outperforms both LoRA
and GaLore, achieving performance that is comparable to or even better than
full-rank training.
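
The norm-growth limiter is described here only at a high level. One plausible reading, sketched below in plain Python, caps the ratio between consecutive gradient norms at a threshold and rescales the gradient when the cap is exceeded. The threshold name `gamma`, its default value, and the helper names are assumptions for illustration; the rule implemented in the Fira code base may differ.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def limit_norm_growth(grad, prev_norm, gamma=1.01):
    """Rescale `grad` so its norm is at most gamma times the previous step's norm.

    Returns the (possibly rescaled) gradient and the norm to carry to the next step.
    """
    norm = l2_norm(grad)
    if prev_norm is not None and prev_norm > 0.0 and norm > gamma * prev_norm:
        scale = gamma * prev_norm / norm
        grad = [x * scale for x in grad]
        norm = gamma * prev_norm
    return grad, norm


if __name__ == "__main__":
    prev = None
    # The second gradient is a sudden spike; the limiter shrinks it.
    for g in ([3.0, 4.0], [30.0, 40.0], [1.0, 0.0]):
        limited, prev = limit_norm_growth(g, prev, gamma=2.0)
        print(limited, prev)
```

Unlike clipping to a fixed absolute norm, a relative cap of this kind adapts to the current gradient scale, so gradual changes in norm pass through while abrupt jumps are damped.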
[COMMENTS]
Code is available at: https://github.com/xichen-fy/Fira
[LINK]
http://arxiv.org/abs/2410.01623v1
[DATE]
2024-10-02 22:58:27+08:00
[CATEGORIES]
cs.LG
MallowsPO: Fine-Tune Your LLM with Preference Dispersions
[AUTHORS]
Haoxian Chen, Hanyang Zhao, Henry Lam, David Yao, Wenpin Tang
[ABSTRACT]
Direct Preference Optimization (DPO) has recently emerged as a popular
approach to improve reinforcement learning with human feedback (RLHF), leading
to better techniques to fine-tune large language models (LLM). A weakness of
DPO, however, lies in its lack of capability to characterize the diversity of
human preferences. Inspired by Mallows’ theory of preference ranking, we
develop in this paper a new approach, the MallowsPO. A distinct feature of this
approach is a dispersion index, which reflects the dispersion of human
preference to prompts. We show that existing DPO models can be reduced to
special cases of this dispersion index, thus unified with MallowsPO. More
importantly, we demonstrate (empirically) how to use this dispersion index to
enhance the performance of DPO in a broad array of benchmark tasks, from
synthetic bandit selection to controllable generations and dialogues, while
maintaining great generalization capabilities. MallowsPO is also compatible
with other SOTA offline preference optimization methods, boosting nearly 2%
extra LC win rate when used as a plugin for fine-tuning Llama3-Instruct.
[LINK]
http://arxiv.org/abs/2405.14953v3
[DATE]
2024-10-02 22:56:33+08:00
[CATEGORIES]
cs.LG
On Using Certified Training towards Empirical Robustness
[AUTHORS]
Alessandro De Palma, Serge Durand, Zakaria Chihani, François Terrier, Caterina Urban
[ABSTRACT]
Adversarial training is arguably the most popular way to provide empirical
robustness against specific adversarial examples. While variants based on
multi-step attacks incur significant computational overhead, single-step
variants are vulnerable to a failure mode known as catastrophic overfitting,
which hinders their practical utility for large perturbations. A parallel line
of work, certified training, has focused on producing networks amenable to
formal guarantees of robustness against any possible attack. However, the wide
gap between the best-performing empirical and certified defenses has severely
limited the applicability of the latter. Inspired by recent developments in
certified training, which rely on a combination of adversarial attacks with
network over-approximations, and by the connections between local linearity and
catastrophic overfitting, we present experimental evidence on the practical
utility and limitations of using certified training towards empirical
robustness. We show that, when tuned for the purpose, a recent certified
training algorithm can prevent catastrophic overfitting on single-step attacks,
and that it can bridge the gap to multi-step baselines under appropriate
experimental settings. Finally, we present a novel regularizer for network
over-approximations that can achieve similar effects while markedly reducing
runtime.
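
Network over-approximations in certified training are commonly computed with interval bound propagation (IBP), which pushes an input box through each layer to obtain sound output bounds. The abstract does not say which relaxation the evaluated algorithm uses, so the sketch below should be read as a generic IBP example for one affine layer followed by a ReLU; the function names and the tiny network are hypothetical.

```python
def affine_bounds(weights, bias, lower, upper):
    """Propagate elementwise bounds [lower, upper] through y = W x + b."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A positive weight maps the lower input bound to the lower output bound;
            # a negative weight swaps the roles of the two bounds.
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


if __name__ == "__main__":
    x, eps = [1.0, 0.0], 0.5
    lower = [v - eps for v in x]
    upper = [v + eps for v in x]
    W = [[1.0, -1.0], [2.0, 0.0]]
    b = [0.0, -2.0]
    print(relu_bounds(*affine_bounds(W, b, lower, upper)))
```

Every input inside the perturbation box is guaranteed to produce an output inside the returned bounds, and certified training penalizes the worst case over those bounds rather than a single attacked point.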
[LINK]
http://arxiv.org/abs/2410.01617v1
[DATE]
2024-10-02 22:56:21+08:00
[CATEGORIES]
cs.LG
Heterogeneous Multi-Agent Reinforcement Learning for Zero-Shot Scalable Collaboration
[AUTHORS]
Xudong Guo, Daming Shi, Junjie Yu, Wenhui Fan
[ABSTRACT]
The emergence of multi-agent reinforcement learning (MARL) is significantly
transforming various fields like autonomous vehicle networks. However,
real-world multi-agent systems typically contain multiple roles, and the scale
of these systems dynamically fluctuates. Consequently, in order to achieve
zero-shot scalable collaboration, it is essential that strategies for different
roles can be updated flexibly according to the scales, which is still a
challenge for current MARL frameworks. To address this, we propose a novel MARL
framework named Scalable and Heterogeneous Proximal Policy Optimization
(SHPPO), integrating heterogeneity into parameter-shared PPO-based MARL
networks. We first leverage a latent network to learn strategy patterns for
each agent adaptively. Second, we introduce a heterogeneous layer to be
inserted into decision-making networks, whose parameters are specifically
generated by the learned latent variables. Our approach is scalable as all the
parameters are shared except for the heterogeneous layer, and gains both
inter-individual and temporal heterogeneity, allowing SHPPO to adapt
effectively to varying scales. SHPPO exhibits superior performance in classic
MARL environments like StarCraft Multi-Agent Challenge (SMAC) and Google
Research Football (GRF), showcasing enhanced zero-shot scalability, and
offering insights into the learned latent variables’ impact on team performance
by visualization.
[LINK]
http://arxiv.org/abs/2404.03869v2
[DATE]
2024-10-02 22:52:13+08:00
[CATEGORIES]
cs.LG
DRUPI: Dataset Reduction Using Privileged Information
[AUTHORS]
Shaobo Wang, Yantai Yang, Shuaiyu Zhang, Chenghao Sun, Weiya Li, Xuming Hu, Linfeng Zhang
[ABSTRACT]
Dataset reduction (DR) seeks to select or distill samples from large datasets
into smaller subsets while preserving performance on target tasks. Existing
methods primarily focus on pruning or synthesizing data in the same format as
the original dataset, typically the input data and corresponding labels.
However, in DR settings, we find it is possible to synthesize more information
beyond the data-label pair as an additional learning target to facilitate model
training. In this paper, we introduce Dataset Reduction Using Privileged
Information (DRUPI), which enriches DR by synthesizing privileged information
alongside the reduced dataset. This privileged information can take the form of
feature labels or attention labels, providing auxiliary supervision to improve
model learning. Our findings reveal that effective feature labels must balance
between being overly discriminative and excessively diverse, with a moderate
level proving optimal for improving the reduced dataset’s efficacy. Extensive
experiments on ImageNet, CIFAR-10/100, and Tiny ImageNet demonstrate that DRUPI
integrates seamlessly with existing dataset reduction methods, offering
significant performance gains.
[LINK]
http://arxiv.org/abs/2410.01611v1
[DATE]
2024-10-02 22:49:05+08:00
[CATEGORIES]
cs.LG
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
[AUTHORS]
Maya Pavlova, Erik Brinkman, Krithika Iyer, Vitor Albiero, Joanna Bitton, Hailey Nguyen, Joe Li, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori
[ABSTRACT]
Red teaming assesses how large language models (LLMs) can produce content
that violates norms, policies, and rules set during their safety training.
However, most existing automated methods in the literature are not
representative of the way humans tend to interact with AI models. Common users
of AI models may not have advanced knowledge of adversarial machine learning
methods or access to model internals, and they do not spend a lot of time
crafting a single highly effective adversarial prompt. Instead, they are likely
to make use of techniques commonly shared online and exploit the multiturn
conversational nature of LLMs. While manual testing addresses this gap, it is
an inefficient and often expensive process. To address these limitations, we
introduce the Generative Offensive Agent Tester (GOAT), an automated agentic
red teaming system that simulates plain language adversarial conversations
while leveraging multiple adversarial prompting techniques to identify
vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by
prompting a general-purpose model in a way that encourages reasoning through
the choices of methods available, the current target model’s response, and the
next steps. Our approach is designed to be extensible and efficient, allowing
human testers to focus on exploring new areas of risk while automation covers
the scaled adversarial stress-testing of known risk territory. We present the
design and evaluation of GOAT, demonstrating its effectiveness in identifying
vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 97% against Llama
3.1 and 88% against GPT-4 on the JailbreakBench dataset.
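
ASR@10 is not defined in this abstract. A common reading of such a metric, assumed here, is the fraction of test behaviors for which at least one of the first k attempts is judged successful. The sketch below computes that quantity from per-attempt boolean outcomes; the data layout and function name are hypothetical, and the outcomes are made up for illustration rather than taken from the paper.

```python
def asr_at_k(outcomes, k):
    """Fraction of behaviors with at least one success among the first k attempts.

    `outcomes` holds one list of booleans per behavior, one boolean per attempt.
    """
    if not outcomes:
        raise ValueError("need at least one behavior")
    hits = sum(1 for attempts in outcomes if any(attempts[:k]))
    return hits / len(outcomes)


if __name__ == "__main__":
    # Four hypothetical behaviors with three attempts each.
    toy = [
        [False, True, False],
        [False, False, False],
        [True, False, False],
        [False, False, True],
    ]
    print(asr_at_k(toy, 1), asr_at_k(toy, 3))
```

Under this reading the metric can only grow with k, so an ASR@10 figure is not directly comparable to a single-attempt success rate.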
[LINK]
http://arxiv.org/abs/2410.01606v1
[DATE]
2024-10-02 22:47:05+08:00
[CATEGORIES]
cs.LG
Towards Model Discovery Using Domain Decomposition and PINNs
[AUTHORS]
Tirtho S. Saha, Alexander Heinlein, Cordula Reisch
[ABSTRACT]
We enhance machine learning algorithms for learning model parameters in
complex systems represented by ordinary differential equations (ODEs) with
domain decomposition methods. The study evaluates the performance of two
approaches, namely (vanilla) Physics-Informed Neural Networks (PINNs) and
Finite Basis Physics-Informed Neural Networks (FBPINNs), in learning the
dynamics of test models with a quasi-stationary longtime behavior. We test the
approaches for data sets in different dynamical regions and with varying noise
levels. We find that the FBPINN approach performs better than the vanilla PINN
approach, even in cases with data from only a quasi-stationary time domain with
few dynamics.
[LINK]
http://arxiv.org/abs/2410.01599v1
[DATE]
2024-10-02 22:38:37+08:00
[CATEGORIES]
cs.LG
Sequential transport maps using SoS density estimation and $α$-divergences
[AUTHORS]
Benjamin Zanger, Olivier Zahm, Tiangang Cui, Martin Schreiber
[ABSTRACT]
Transport-based density estimation methods are receiving growing interest
because of their ability to efficiently generate samples from the approximated
density. We further investigate the sequential transport maps framework
proposed in arXiv:2106.04170 and arXiv:2303.02554, which builds on a sequence of
composed Knothe-Rosenblatt (KR) maps. Each of those maps is built by first
estimating an intermediate density of moderate complexity, and then by
computing the exact KR map from a reference density to the precomputed
approximate density. In our work, we explore the use of Sum-of-Squares (SoS)
densities and $\alpha$-divergences for approximating the intermediate
densities. Combining SoS densities with $\alpha$-divergence interestingly
yields convex optimization problems which can be efficiently solved using
semidefinite programming. The main advantage of $\alpha$-divergences is to
enable working with unnormalized densities, which provides benefits both
numerically and theoretically. In particular, we provide a new convergence
analysis of the sequential transport maps based on information geometric
properties of $\alpha$-divergences. The choice of intermediate densities is
also crucial for the efficiency of the method. While tempered (or annealed)
densities are the state of the art, we introduce diffusion-based intermediate
densities, which permit approximating densities known from samples only. Such
intermediate densities are well established in machine learning for generative
modeling. Finally, we propose low-dimensional maps (or lazy maps) for dealing
with high-dimensional problems and numerically demonstrate our methods on
Bayesian inference problems and unsupervised learning tasks.
[LINK]
http://arxiv.org/abs/2402.17943v2
[DATE]
2024-10-02 22:37:12+08:00
[CATEGORIES]
cs.LG
Longhorn: State Space Models are Amortized Online Learners
[AUTHORS]
Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, Qiang Liu
[ABSTRACT]
Modern large language models are built on sequence modeling via next-token
prediction. While the Transformer remains the dominant architecture for
sequence modeling, its quadratic decoding complexity in sequence length poses a
major limitation. State-space models (SSMs) present a competitive alternative,
offering linear decoding efficiency while maintaining parallelism during
training. However, most existing SSMs rely on linear recurrence designs that
appear somewhat ad hoc. In this work, we explore SSM design through the lens of
online learning, conceptualizing SSMs as meta-modules for specific online
learning problems. This approach links SSM design to formulating precise online
learning objectives, with state transition rules derived from solving these
objectives. Based on this insight, we introduce a novel deep SSM architecture,
Longhorn, whose update resembles the closed-form solution for solving the
online associative recall problem. Our experimental results show that Longhorn
outperforms state-of-the-art SSMs, including the Mamba model, on standard
sequence modeling benchmarks, language modeling, and vision tasks.
Specifically, Longhorn achieves a 1.8x improvement in sample efficiency
compared to Mamba, and can extrapolate over contexts that are up to 16x longer
during inference.
[LINK]
http://arxiv.org/abs/2407.14207v5
[DATE]
2024-10-02 22:32:59+08:00
[CATEGORIES]
cs.LG
HAMLET: Graph Transformer Neural Operator for Partial Differential Equations
[AUTHORS]
Andrey Bryutkin, Jiahao Huang, Zhongying Deng, Guang Yang, Carola-Bibiane Schönlieb, Angelica Aviles-Rivero
[ABSTRACT]
We present a novel graph transformer framework, HAMLET, designed to address
the challenges in solving partial differential equations (PDEs) using neural
networks. The framework uses graph transformers with modular input encoders to
directly incorporate differential equation information into the solution
process. This modularity enhances parameter correspondence control, making
HAMLET adaptable to PDEs of arbitrary geometries and varied input formats.
Notably, HAMLET scales effectively with increasing data complexity and noise,
showcasing its robustness. HAMLET is not just tailored to a single type of
physical simulation, but can be applied across various domains. Moreover, it
boosts model resilience and performance, especially in scenarios with limited
data. We demonstrate, through extensive experiments, that our framework is
capable of outperforming current techniques for PDEs.
[COMMENTS]
18 pages, 7 figures, 6 tables
[LINK]
http://arxiv.org/abs/2402.03541v2
[DATE]
2024-10-02 22:30:15+08:00
[CATEGORIES]
cs.LG
DynFrs: An Efficient Framework for Machine Unlearning in Random Forest
[AUTHORS]
Shurong Wang, Zhuoyang Shen, Xinbao Qiao, Tongning Zhang, Meng Zhang
[ABSTRACT]
Random Forests are widely recognized for their established efficacy in
classification and regression tasks, standing out in various domains such as
medical diagnosis, finance, and personalized recommendations. These domains,
however, are inherently sensitive to privacy concerns, as personal and
confidential data are involved. With increasing demand for the right to be
forgotten, particularly under regulations such as GDPR and CCPA, the ability to
perform machine unlearning has become crucial for Random Forests. However,
insufficient attention was paid to this topic, and existing approaches face
difficulties in being applied to real-world scenarios. Addressing this gap, we
propose the DynFrs framework designed to enable efficient machine unlearning in
Random Forests while preserving predictive accuracy. DynFrs leverages the
subsampling method Occ(q) and a lazy tag strategy Lzy, and is still adaptable
to any Random Forest variant. In essence, Occ(q) ensures that each sample in
the training set occurs only in a proportion of trees so that the impact of
deleting samples is limited, and Lzy delays the reconstruction of a tree node
until necessary, thereby avoiding unnecessary modifications on tree structures.
In experiments, applying DynFrs to Extremely Randomized Trees yields
substantial improvements, achieving orders of magnitude faster unlearning
performance and better predictive accuracy than existing machine unlearning
methods for Random Forests.
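
The effect of Occ(q) on unlearning cost can be illustrated with a small sketch: if each training sample is placed in only a fraction q of the trees, deleting a sample touches only those trees. The assignment rule below (each sample goes to ceil(q * n_trees) trees drawn uniformly at random) and the function names are assumptions for illustration; the actual subsampling procedure in DynFrs may differ.

```python
import math
import random


def assign_samples_to_trees(n_samples, n_trees, q, seed=0):
    """Map each sample index to the sorted list of tree indices that train on it."""
    rng = random.Random(seed)
    k = max(1, math.ceil(q * n_trees))  # assumed: fixed number of trees per sample
    return [sorted(rng.sample(range(n_trees), k)) for _ in range(n_samples)]


def trees_affected_by_deletion(assignment, sample_ids):
    """Trees that must be updated when the given samples are unlearned."""
    affected = set()
    for sid in sample_ids:
        affected.update(assignment[sid])
    return affected


if __name__ == "__main__":
    assignment = assign_samples_to_trees(n_samples=100, n_trees=8, q=0.25)
    touched = trees_affected_by_deletion(assignment, [0])
    print(f"deleting sample 0 touches {len(touched)} of 8 trees")
```

With q = 0.25 and 8 trees, a single deletion involves 2 trees instead of all 8; the lazy tag strategy described above would then further postpone the rebuilding work inside those trees.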
[LINK]
http://arxiv.org/abs/2410.01588v1
[DATE]
2024-10-02 22:20:30+08:00
[CATEGORIES]
cs.LG
Learning-Augmented Robust Algorithmic Recourse
[AUTHORS]
Kshitij Kayastha, Vasilis Gkatzelis, Shahin Jabbari
[ABSTRACT]
The widespread use of machine learning models in high-stakes domains can have
a major negative impact, especially on individuals who receive undesirable
outcomes. Algorithmic recourse provides such individuals with suggestions of
minimum-cost improvements they can make to achieve a desirable outcome in the
future. However, machine learning models often get updated over time and this
can cause a recourse to become invalid (i.e., not lead to the desirable
outcome). The robust recourse literature aims to choose recourses that are less
sensitive, even against adversarial model changes, but this comes at a higher
cost. To overcome this obstacle, we initiate the study of algorithmic recourse
through the learning-augmented framework and evaluate the extent to which a
designer equipped with a prediction regarding future model changes can reduce
the cost of recourse when the prediction is accurate (consistency) while also
limiting the cost even when the prediction is inaccurate (robustness). We
propose a novel algorithm for this problem, study the robustness-consistency
trade-off, and analyze how prediction accuracy affects performance.
[LINK]
http://arxiv.org/abs/2410.01580v1
[DATE]
2024-10-02 22:15:32+08:00
[CATEGORIES]
cs.LG
Fake It Until You Break It: On the Adversarial Robustness of AI-generated Image Detectors
[AUTHORS]
Sina Mavali, Jonas Ricker, David Pape, Yash Sharma, Asja Fischer, Lea Schoenherr
[ABSTRACT]
While generative AI (GenAI) offers countless possibilities for creative and
productive tasks, artificially generated media can be misused for fraud,
manipulation, scams, misinformation campaigns, and more. To mitigate the risks
associated with maliciously generated media, forensic classifiers are employed
to identify AI-generated content. However, current forensic classifiers are
often not evaluated in practically relevant scenarios, such as the presence of
an attacker or when real-world artifacts like social media degradations affect
images. In this paper, we evaluate state-of-the-art AI-generated image (AIGI)
detectors under different attack scenarios. We demonstrate that forensic
classifiers can be effectively attacked in realistic settings, even when the
attacker does not have access to the target model and post-processing occurs
after the adversarial examples are created, which is standard on social media
platforms. These attacks can significantly reduce detection accuracy to the
extent that the risks of relying on detectors outweigh their benefits. Finally,
we propose a simple defense mechanism to make CLIP-based detectors, which are
currently the best-performing detectors, robust against these attacks.
[LINK]
http://arxiv.org/abs/2410.01574v1
[DATE]
2024-10-02 22:11:29+08:00
[CATEGORIES]
cs.LG
Truncated Kernel Stochastic Gradient Descent on Spheres
[AUTHORS]
JinHui Bai, Lei Shi
[ABSTRACT]
Inspired by the structure of spherical harmonics, we propose the truncated
kernel stochastic gradient descent (T-kernel SGD) algorithm with a least-square
loss function for spherical data fitting. T-kernel SGD employs a “truncation”
operation, enabling the application of a series-based kernel function in
stochastic gradient descent, thereby avoiding the difficulties of finding
suitable closed-form kernel functions in high-dimensional spaces. In contrast
to traditional kernel SGD, T-kernel SGD is more effective in balancing bias and
variance by dynamically adjusting the hypothesis space during iterations. The
most significant advantage of the proposed algorithm is that it can achieve
theoretically optimal convergence rates using a constant step size (independent
of the sample size) while overcoming the inherent saturation problem of kernel
SGD. Additionally, we leverage the structure of spherical polynomials to derive
an equivalent T-kernel SGD, significantly reducing storage and computational
costs compared to kernel SGD. Typically, T-kernel SGD requires only
$\mathcal{O}(n^{1+\frac{d}{d-1}\epsilon})$ computational complexity and
$\mathcal{O}(n^{\frac{d}{d-1}\epsilon})$ storage to achieve optimal rates for
the d-dimensional sphere, where $0<\epsilon<\frac{1}{2}$ can be arbitrarily
small if the optimal fitting or the underlying space possesses sufficient
regularity. This regularity is determined by the smoothness parameter of the
objective function and the decaying rate of the eigenvalues of the integral
operator associated with the kernel function, both of which reflect the
difficulty of the estimation problem. Our main results quantitatively
characterize how this prior information influences the convergence of T-kernel
SGD. The numerical experiments further validate the theoretical findings
presented in this paper.
[COMMENTS]
57 pages, 7 figures
[LINK]
http://arxiv.org/abs/2410.01570v1
[DATE]
2024-10-02 22:09:51+08:00
[CATEGORIES]
cs.LG
Improving Fairness and Mitigating MADness in Generative Models
[AUTHORS]
Paul Mayer, Lorenzo Luzi, Ali Siahkoohi, Don H. Johnson, Richard G. Baraniuk
[ABSTRACT]
Generative models unfairly penalize data belonging to minority classes,
suffer from model autophagy disorder (MADness), and learn biased estimates of
the underlying distribution parameters. Our theoretical and empirical results
show that training generative models with intentionally designed hypernetworks
leads to models that 1) are more fair when generating datapoints belonging to
minority classes, 2) are more stable in a self-consumed (i.e., MAD) setting, and
3) learn parameters that are less statistically biased. To further mitigate
unfairness, MADness, and bias, we introduce a regularization term that
penalizes discrepancies between a generative model’s estimated weights when
trained on real data versus its own synthetic data. To facilitate training
existing deep generative models within our framework, we offer a scalable
implementation of hypernetworks that automatically generates a hypernetwork
architecture for any given generative model.
[LINK]
http://arxiv.org/abs/2405.13977v2
[DATE]
2024-10-02 22:01:49+08:00
[CATEGORIES]
cs.LG
Bayes’ Power for Explaining In-Context Learning Generalizations
[AUTHORS]
Samuel Müller, Noah Hollmann, Frank Hutter
[ABSTRACT]
Traditionally, neural network training has been primarily viewed as an
approximation of maximum likelihood estimation (MLE). This interpretation
originated in a time when training for multiple epochs on small datasets was
common and performance was data bound; but it falls short in the era of
large-scale single-epoch trainings ushered in by large self-supervised setups,
like language models. In this new setup, performance is compute-bound, but data
is readily available. As models became more powerful, in-context learning
(ICL), i.e., learning in a single forward-pass based on the context, emerged as
one of the dominant paradigms. In this paper, we argue that a more useful
interpretation of neural network behavior in this era is as an approximation of
the true posterior, as defined by the data-generating process. We demonstrate
this interpretation’s power for ICL and its usefulness to predict
generalizations to previously unseen tasks. We show how models become robust
in-context learners by effectively composing knowledge from their training
data. We illustrate this with experiments that reveal surprising
generalizations, all explicable through the exact posterior. Finally, we show
the inherent constraints of the generalization capabilities of posteriors and
the limitations of neural networks in approximating these posteriors.
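
The contrast between the MLE view and the posterior view can be seen in a coin-flip toy problem, which is not taken from the paper: given a few observed flips, the maximum likelihood estimate and the Bayesian posterior predictive give different next-flip probabilities, and the gap shrinks as data grows. The Beta prior and the function names below are assumptions chosen for illustration.

```python
from fractions import Fraction


def mle_predictive(heads, tails):
    """Probability of heads on the next flip under the maximum likelihood estimate."""
    return Fraction(heads, heads + tails)


def posterior_predictive(heads, tails, alpha=1, beta=1):
    """Probability of heads on the next flip under a Beta(alpha, beta) prior."""
    return Fraction(heads + alpha, heads + tails + alpha + beta)


if __name__ == "__main__":
    for heads, tails in ((1, 0), (30, 10)):
        print(heads, tails, mle_predictive(heads, tails), posterior_predictive(heads, tails))
```

After a single observed head, the MLE predicts heads with certainty while the posterior predictive under a uniform prior gives 2/3; a model trained on many such tasks to minimize next-token loss is pushed towards the latter, which is the behavior the posterior interpretation predicts for in-context learning.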
[LINK]
http://arxiv.org/abs/2410.01565v1
[DATE]
2024-10-02 22:01:34+08:00
[CATEGORIES]
cs.LG
HRTF Estimation using a Score-based Prior
[AUTHORS]
Etienne Thuillier, Jean-Marie Lemercier, Eloi Moliner, Timo Gerkmann, Vesa Välimäki
[ABSTRACT]
We present a head-related transfer function (HRTF) estimation method which
relies on a data-driven prior given by a score-based diffusion model. The HRTF
is estimated in reverberant environments using natural excitation signals, e.g.,
human speech. The impulse response of the room is estimated along with the HRTF
by optimizing a parametric model of reverberation based on the statistical
behaviour of room acoustics. The posterior distribution of HRTF given the
reverberant measurement and excitation signal is modelled using the score-based
HRTF prior and a log-likelihood approximation. We show that the resulting
method outperforms several baselines, including an oracle recommender system
that assigns the optimal HRTF in our training set based on the smallest
distance to the true HRTF at the given direction of arrival. In particular, we
show that the diffusion prior can account for the large variability of
high-frequency content in HRTFs.
[LINK]
http://arxiv.org/abs/2410.01562v1
[DATE]
2024-10-02 22:00:41+08:00
[CATEGORIES]
cs.LG
Closed-loop Diffusion Control of Complex Physical Systems
[AUTHORS]
Long Wei, Haodong Feng, Yuchen Yang, Ruiqi Feng, Peiyan Hu, Xiang Zheng, Tao Zhang, Dixia Fan, Tailin Wu
[ABSTRACT]
The control problems of complex physical systems have broad applications in
science and engineering. Previous studies have shown that generative control
methods based on diffusion models offer significant advantages for solving
these problems. However, existing generative control approaches face challenges
in both performance and efficiency when extended to the closed-loop setting,
which is essential for effective control. In this paper, we propose an
efficient Closed-Loop Diffusion method for Physical systems Control
(CL-DiffPhyCon). By employing an asynchronous denoising framework for different
physical time steps, CL-DiffPhyCon generates control signals conditioned on
real-time feedback from the environment with significantly reduced
computational cost during sampling. Additionally, the control process could be
further accelerated by incorporating fast sampling techniques, such as DDIM. We
evaluate CL-DiffPhyCon on two tasks: 1D Burgers’ equation control and 2D
incompressible fluid control. The results demonstrate that CL-DiffPhyCon
achieves superior control performance with significant improvements in sampling
efficiency.
[LINK]
http://arxiv.org/abs/2408.03124v2
[DATE]
2024-10-02 21:45:11+08:00
[CATEGORIES]
cs.LG
A Synthesis of Green Architectural Tactics for ML-Enabled Systems
[AUTHORS]
Heli Järvenpää, Patricia Lago, Justus Bogner, Grace Lewis, Henry Muccini, Ipek Ozkaya
[ABSTRACT]
The rapid adoption of artificial intelligence (AI) and machine learning (ML)
has generated growing interest in understanding their environmental impact and
the challenges associated with designing environmentally friendly ML-enabled
systems. While Green AI research, i.e., research that tries to minimize the
energy footprint of AI, is receiving increasing attention, very few concrete
guidelines are available on how ML-enabled systems can be designed to be more
environmentally sustainable. In this paper, we provide a catalog of 30 green
architectural tactics for ML-enabled systems to fill this gap. An architectural
tactic is a high-level design technique to improve software quality, in our
case environmental sustainability. We derived the tactics from the analysis of
51 peer-reviewed publications that primarily explore Green AI, and validated
them using a focus group approach with three experts. The 30 tactics we
identified are intended to serve as an initial reference guide for further
exploration into Green AI from a software engineering perspective, and assist
in designing sustainable ML-enabled systems. To enhance transparency and
facilitate their widespread use and extension, we make the tactics available
online in easily consumable formats. Widespread adoption of these tactics has
the potential to substantially reduce the societal impact of ML-enabled systems
regarding their energy and carbon footprint.
[COMMENTS]
Accepted for publication at the 2024 International Conference on
Software Engineering - Software Engineering in Society (ICSE-SEIS’2024)
[LINK]
http://arxiv.org/abs/2312.09610v2
[DATE]
2024-10-02 21:42:53+08:00
[CATEGORIES]
cs.LG
Comparing and Contrasting Deep Learning Weather Prediction Backbones on Navier-Stokes and Atmospheric Dynamics
[AUTHORS]
Matthias Karlbauer, Danielle C. Maddix, Abdul Fatir Ansari, Boran Han, Gaurav Gupta, Yuyang Wang, Andrew Stuart, Michael W. Mahoney
[ABSTRACT]
Remarkable progress in the development of Deep Learning Weather Prediction
(DLWP) models positions them to become competitive with traditional numerical
weather prediction (NWP) models. Indeed, a large number of DLWP architectures –
based on various backbones, including U-Net, Transformer, Graph Neural Network
(GNN), and Fourier Neural Operator (FNO) – have demonstrated their potential
at forecasting atmospheric states. However, due to differences in training
protocols, forecast horizons, and data choices, it remains unclear which (if
any) of these methods and architectures are most suitable for weather
forecasting and for future model development. Here, we step back and provide a
detailed empirical analysis, under controlled conditions, comparing and
contrasting the most prominent DLWP models, along with their backbones. We
accomplish this by predicting synthetic two-dimensional incompressible
Navier-Stokes and real-world global weather dynamics. In terms of accuracy,
memory consumption, and runtime, our results illustrate various tradeoffs. For
example, on synthetic data, we observe favorable performance of FNO; and on the
real-world WeatherBench dataset, our results demonstrate the suitability of
ConvLSTM and SwinTransformer for short-to-mid-ranged forecasts. For long-ranged
weather rollouts of up to 365 days, we observe superior stability and physical
soundness in architectures that formulate a spherical data representation,
i.e., GraphCast and Spherical FNO. In addition, we observe that all of these
model backbones “saturate,” i.e., none of them exhibit so-called neural
scaling, which highlights an important direction for future work on these and
related models. The code is available at
https://github.com/amazon-science/dlwp-benchmark.
[LINK]
http://arxiv.org/abs/2407.14129v2
[DATE]
2024-10-02 21:42:29+08:00
[CATEGORIES]
cs.LG
Dynamic Graph Representation Learning via Edge Temporal States Modeling and Structure-reinforced Transformer
[AUTHORS]
Shengxiang Hu, Guobing Zou, Song Yang, Shiyi Lin, Yanglan Gan, Bofeng Zhang
[ABSTRACT]
Dynamic graph representation learning has emerged as a crucial research area,
driven by the growing need for analyzing time-evolving graph data in real-world
applications. While recent approaches leveraging recurrent neural networks
(RNNs) and graph neural networks (GNNs) have shown promise, they often fail to
adequately capture the impact of temporal edge states on inter-node
relationships, consequently overlooking the dynamic changes in node features
induced by these evolving relationships. Furthermore, these methods suffer from
GNNs’ inherent over-smoothing problem, which hinders the extraction of global
structural features. To address these challenges, we introduce the Recurrent
Structure-reinforced Graph Transformer (RSGT), a novel framework for dynamic
graph representation learning. It first designs a heuristic method to
explicitly model edge temporal states by employing different edge types and
weights based on the differences between consecutive snapshots, thereby
integrating varying edge temporal states into the graph’s topological
structure. We then propose a structure-reinforced graph transformer that
captures temporal node representations encoding both graph topology and
evolving dynamics through a recurrent learning paradigm, enabling the
extraction of both local and global structural features. Comprehensive
experiments on four real-world datasets demonstrate RSGT’s superior performance
in discrete dynamic graph representation learning, consistently outperforming
existing methods in dynamic link prediction tasks.
[COMMENTS]
This work has been submitted to Elsevier for possible
publication. Copyright may be transferred without notice, after which this
version may no longer be accessible
[LINK]
http://arxiv.org/abs/2304.10079v3
[DATE]
2024-10-02 21:40:53+08:00
[CATEGORIES]
cs.LG
Motion meets Attention: Video Motion Prompts
[AUTHORS]
Qixiang Chen, Lei Wang, Piotr Koniusz, Tom Gedeon
[ABSTRACT]
Videos contain rich spatio-temporal information. Traditional methods for
extracting motion, used in tasks such as action recognition, often rely on
visual contents rather than precise motion features. This phenomenon is
referred to as ‘blind motion extraction’ behavior, which proves inefficient in
capturing motions of interest due to a lack of motion-guided cues. Recently,
attention mechanisms have enhanced many computer vision tasks by effectively
highlighting salient visual areas. Inspired by this, we propose a modified
Sigmoid function with learnable slope and shift parameters as an attention
mechanism to modulate motion signals from frame differencing maps. This
approach generates a sequence of attention maps that enhance the processing of
motion-related video content. To ensure temporal continuity and smoothness of
the attention maps, we apply pair-wise temporal attention variation
regularization to remove unwanted motions (e.g., noise) while preserving
important ones. We then perform the Hadamard product between each pair of attention
maps and the original video frames to highlight the evolving motions of
interest over time. These highlighted motions, termed video motion prompts, are
subsequently used as inputs to the model instead of the original video frames.
We formalize this process as a motion prompt layer and incorporate the
regularization term into the loss function to learn better motion prompts. This
layer serves as an adapter between the model and the video data, bridging the
gap between traditional ‘blind motion extraction’ and the extraction of
relevant motions of interest. We show that our lightweight, plug-and-play
motion prompt layer seamlessly integrates into models like SlowFast, X3D, and
TimeSformer, enhancing performance on benchmarks such as FineGym and MPII
Cooking 2.
[COMMENTS]
Accepted at the 16th Asian Conference on Machine Learning (ACML 2024)
[LINK]
http://arxiv.org/abs/2407.03179v2
[DATE]
2024-10-02 21:32:56+08:00
[CATEGORIES]
cs.LG
Lines of Thought in Large Language Models
[AUTHORS]
Raphaël Sarfati, Toni J. B. Liu, Nicolas Boullé, Christopher J. Earls
[ABSTRACT]
Large Language Models achieve next-token prediction by transporting a
vectorized piece of text (prompt) across an accompanying embedding space under
the action of successive transformer layers. The resulting high-dimensional
trajectories realize different contextualization, or ‘thinking’, steps, and
fully determine the output probability distribution. We aim to characterize the
statistical properties of ensembles of these ‘lines of thought.’ We observe
that independent trajectories cluster along a low-dimensional, non-Euclidean
manifold, and that their path can be well approximated by a stochastic equation
with few parameters extracted from data. We find it remarkable that the vast
complexity of such large models can be reduced to a much simpler form, and we
reflect on implications.
[LINK]
http://arxiv.org/abs/2410.01545v1
[DATE]
2024-10-02 21:31:06+08:00
[CATEGORIES]
cs.LG
Edge-preserving noise for diffusion models
[AUTHORS]
Jente Vandersanden, Sascha Holl, Xingchang Huang, Gurprit Singh
[ABSTRACT]
Classical generative diffusion models learn an isotropic Gaussian denoising
process, treating all spatial regions uniformly, thus neglecting potentially
valuable structural information in the data. Inspired by the long-established
work on anisotropic diffusion in image processing, we present a novel
edge-preserving diffusion model that is a generalization of denoising diffusion
probabilistic models (DDPM). In particular, we introduce an edge-aware noise
scheduler that varies between edge-preserving and isotropic Gaussian noise. We
show that our model’s generative process converges faster to results that more
closely match the target distribution. We demonstrate its capability to better
learn the low-to-mid frequencies within the dataset, which plays a crucial role
in representing shapes and structural information. Our edge-preserving
diffusion process consistently outperforms state-of-the-art baselines in
unconditional image generation. It is also more robust for generative tasks
guided by a shape-based prior, such as stroke-to-image generation. We present
qualitative and quantitative results showing consistent improvements (FID
score) of up to 30% for both tasks.
[LINK]
http://arxiv.org/abs/2410.01540v1
[DATE]
2024-10-02 21:29:52+08:00
[CATEGORIES]
cs.LG
Attention layers provably solve single-location regression
[AUTHORS]
Pierre Marion, Raphaël Berthier, Gérard Biau, Claire Boyer
[ABSTRACT]
Attention-based models, such as Transformer, excel across various tasks but
lack a comprehensive theoretical understanding, especially regarding token-wise
sparsity and internal linear representations. To address this gap, we introduce
the single-location regression task, where only one token in a sequence
determines the output, and its position is a latent random variable,
retrievable via a linear projection of the input. To solve this task, we
propose a dedicated predictor, which turns out to be a simplified version of a
non-linear self-attention layer. We study its theoretical properties, by
showing its asymptotic Bayes optimality and analyzing its training dynamics. In
particular, despite the non-convex nature of the problem, the predictor
effectively learns the underlying structure. This work highlights the capacity
of attention mechanisms to handle sparse token information and internal linear
structures.
[COMMENTS]
41 pages, 7 figures
[LINK]
http://arxiv.org/abs/2410.01537v1
[DATE]
2024-10-02 21:28:02+08:00
[CATEGORIES]
cs.LG
HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?
[AUTHORS]
Ivan Karpukhin, Foma Shipilov, Andrey Savchenko
[ABSTRACT]
Accurately forecasting multiple future events within a given time horizon is
crucial for finance, retail, social networks, and healthcare applications.
Event timing and labels are typically modeled using Marked Temporal Point
Processes (MTPP), with evaluations often focused on next-event prediction
quality. While some studies have extended evaluations to a fixed number of
future events, we demonstrate that this approach leads to inaccuracies in
handling false positives and false negatives. To address these issues, we
propose a novel evaluation method inspired by object detection techniques from
computer vision. Specifically, we introduce Temporal mean Average Precision
(T-mAP), a temporal variant of mAP, which overcomes the limitations of existing
long-horizon evaluation metrics. Our extensive experiments demonstrate that
models with strong next-event prediction accuracy can yield poor long-horizon
forecasts and vice versa, indicating that specialized methods are needed for
each task. To support further research, we release HoTPP, the first benchmark
designed explicitly for evaluating long-horizon MTPP predictions. HoTPP
includes large-scale datasets with up to 43 million events and provides
optimized procedures for both autoregressive and parallel inference, paving the
way for future advancements in the field.
[LINK]
http://arxiv.org/abs/2406.14341v2
[DATE]
2024-10-02 21:24:42+08:00
[CATEGORIES]
cs.LG
TiVaT: Joint-Axis Attention for Time Series Forecasting with Lead-Lag Dynamics
[AUTHORS]
Junwoo Ha, Hyukjae Kwon, Sungsoo Kim, Kisu Lee, Ha Young Kim
[ABSTRACT]
Multivariate time series (MTS) forecasting plays a crucial role in various
real-world applications, yet simultaneously capturing both temporal and
inter-variable dependencies remains a challenge. Conventional Channel-Dependent
(CD) models handle these dependencies separately, limiting their ability to
model complex interactions such as lead-lag dynamics. To address these
limitations, we propose TiVaT (Time-Variable Transformer), a novel architecture
that integrates temporal and variate dependencies through its Joint-Axis (JA)
attention mechanism. TiVaT’s ability to capture intricate variate-temporal
dependencies, including asynchronous interactions, is further enhanced by the
incorporation of Distance-aware Time-Variable (DTV) Sampling, which reduces
noise and improves accuracy through a learned 2D map that focuses on key
interactions. TiVaT effectively models both temporal and variate dependencies,
consistently delivering strong performance across diverse datasets. Notably, it
excels in capturing complex patterns within multivariate time series, enabling
it to surpass or remain competitive with state-of-the-art methods. This
positions TiVaT as a new benchmark in MTS forecasting, particularly in handling
datasets characterized by intricate and challenging dependencies.
[COMMENTS]
15 pages, 5 figures
[LINK]
http://arxiv.org/abs/2410.01531v1
[DATE]
2024-10-02 21:24:24+08:00
[CATEGORIES]
cs.LG
Cost-Effective Online Multi-LLM Selection with Versatile Reward Models
[AUTHORS]
Xiangxiang Dai, Jin Li, Xutong Liu, Anqi Yu, John C. S. Lui
[ABSTRACT]
With the rapid advancement of large language models (LLMs), the diversity of
multi-LLM tasks and the variability in their pricing structures have become
increasingly important, as costs can vary greatly between different LLMs. To
tackle these challenges, we introduce C2MAB-V, a
Cost-effective Combinatorial Multi-armed
Bandit with Versatile reward models for optimal LLM
selection and usage. This online model differs from traditional static
approaches or those reliant on a single LLM without cost consideration. With
multiple LLMs deployed on a scheduling cloud and a local server dedicated to
handling user queries, C2MAB-V facilitates the selection of multiple
LLMs over a combinatorial search space, specifically tailored for various
collaborative task types with different reward models. Based on our designed
online feedback mechanism and confidence bound technique, C2MAB-V can
effectively address the multi-LLM selection challenge by managing the
exploration-exploitation trade-off across different models, while also
balancing cost and reward for diverse tasks. The NP-hard integer linear
programming problem for selecting multiple LLMs with trade-off dilemmas is
addressed by: i) decomposing the integer problem into a relaxed form by the
local server, ii) utilizing a discretization rounding scheme that provides
optimal LLM combinations by the scheduling cloud, and iii) continual online
updates based on feedback. Theoretically, we prove that C2MAB-V offers
strict guarantees over versatile reward models, matching state-of-the-art
results for regret and violations in some degenerate cases. Empirically, we
show that C2MAB-V effectively balances performance and cost-efficiency
with nine LLMs for three application scenarios.
[COMMENTS]
32 pages, 14 figures, conference
[LINK]
http://arxiv.org/abs/2405.16587v2
[DATE]
2024-10-02 21:22:27+08:00
[CATEGORIES]
cs.LG
DeTPP: Leveraging Object Detection for Robust Long-Horizon Event Prediction
[AUTHORS]
Ivan Karpukhin, Andrey Savchenko
[ABSTRACT]
Long-horizon event forecasting is critical across various domains, including
retail, finance, healthcare, and social networks. Traditional methods, such as
Marked Temporal Point Processes (MTPP), often rely on autoregressive models to
predict multiple future events. However, these models frequently suffer from
issues like converging to constant or repetitive outputs, which limits their
effectiveness and general applicability. To address these challenges, we
introduce DeTPP (Detection-based Temporal Point Processes), a novel approach
inspired by object detection techniques from computer vision. DeTPP employs a
unique matching-based loss function that selectively prioritizes reliably
predictable events, improving the accuracy and diversity of predictions during
inference. Our method establishes a new state-of-the-art in long-horizon event
forecasting, achieving up to a 77% relative improvement over existing MTPP and
next-K methods. The proposed hybrid approach enhances the accuracy of next
event prediction by up to 2.7% on a large transactional dataset. Notably, DeTPP
is also among the fastest methods for inference. The implementation of DeTPP is
publicly available on GitHub.
[LINK]
http://arxiv.org/abs/2408.13131v2
[DATE]
2024-10-02 21:21:50+08:00
[CATEGORIES]
cs.LG
Exploratory Optimal Stopping: A Singular Control Formulation
[AUTHORS]
Jodi Dianetti, Giorgio Ferrari, Renyuan Xu
[ABSTRACT]
This paper explores continuous-time and state-space optimal stopping problems
from a reinforcement learning perspective. We begin by formulating the stopping
problem using randomized stopping times, where the decision maker’s control is
represented by the probability of stopping within a given time, specifically a
bounded, non-decreasing, càdlàg control process. To encourage exploration
and facilitate learning, we introduce a regularized version of the problem by
penalizing it with the cumulative residual entropy of the randomized stopping
time. The regularized problem takes the form of an (n+1)-dimensional degenerate
singular stochastic control with finite-fuel. We address this through the
dynamic programming principle, which enables us to identify the unique optimal
exploratory strategy. For the specific case of a real option problem, we derive
a semi-explicit solution to the regularized problem, allowing us to assess the
impact of entropy regularization and analyze the vanishing entropy limit.
Finally, we propose a reinforcement learning algorithm based on policy
iteration. We show both policy improvement and policy convergence results for
our proposed algorithm.
[COMMENTS]
49 pages, 3 figures
[LINK]
http://arxiv.org/abs/2408.09335v2
[DATE]
2024-10-02 21:13:06+08:00
[CATEGORIES]
cs.LG
Optimization by Parallel Quasi-Quantum Annealing with Gradient-Based Sampling
[AUTHORS]
Yuma Ichikawa, Yamato Arai
[ABSTRACT]
Learning-based methods have gained attention as general-purpose solvers due
to their ability to automatically learn problem-specific heuristics, reducing
the need for manually crafted heuristics. However, these methods often face
scalability challenges. To address these issues, the improved Sampling
algorithm for Combinatorial Optimization (iSCO), using discrete Langevin
dynamics, has been proposed, demonstrating better performance than several
learning-based solvers. This study proposes a different approach that
integrates gradient-based update through continuous relaxation, combined with
Quasi-Quantum Annealing (QQA). QQA smoothly transitions the objective function,
starting from a simple convex function, minimized at half-integral values, to
the original objective function, where the relaxed variables are minimized only
in the discrete space. Furthermore, we incorporate parallel run communication
leveraging GPUs to enhance exploration capabilities and accelerate convergence.
Numerical experiments demonstrate that our method is a competitive
general-purpose solver, achieving performance comparable to iSCO and
learning-based solvers across various benchmark problems. Notably, our method
exhibits superior speed-quality trade-offs for large-scale instances compared
to iSCO, learning-based solvers, commercial solvers, and specialized
algorithms.
[COMMENTS]
21 pages, 3 figures
[LINK]
http://arxiv.org/abs/2409.02135v2
[DATE]
2024-10-02 21:11:08+08:00
[CATEGORIES]
cs.LG
Bounds on $L_p$ Errors in Density Ratio Estimation via $f$-Divergence Loss Functions
[AUTHORS]
Yoshiaki Kitazawa
[ABSTRACT]
Density ratio estimation (DRE) is a fundamental machine learning technique
for identifying relationships between two probability distributions.
$f$-divergence loss functions, derived from variational representations of
$f$-divergence, are commonly employed in DRE to achieve state-of-the-art
results. This study presents a novel perspective on DRE using $f$-divergence
loss functions by deriving the upper and lower bounds on $L_p$ errors. These
bounds apply to any estimator within a class of Lipschitz continuous
estimators, irrespective of the specific $f$-divergence loss functions
utilized. The bounds are formulated as a product of terms that include the data
dimension and the expected value of the density ratio raised to the power of
$p$. Notably, the lower bound incorporates an exponential term dependent on the
Kullback–Leibler divergence, indicating that the $L_p$ error significantly
increases with the Kullback–Leibler divergence for $p > 1$, and this increase
becomes more pronounced as $p$ increases. Furthermore, these theoretical
findings are substantiated through numerical experiments.
[LINK]
http://arxiv.org/abs/2410.01516v1
[DATE]
2024-10-02 21:05:09+08:00
[CATEGORIES]
cs.LG
Optimal Causal Representations and the Causal Information Bottleneck
[AUTHORS]
Francisco N. F. Q. Simoes, Mehdi Dastani, Thijs van Ommen
[COMMENTS]
Submitted to ICLR 2025. Code available at
github.com/francisco-simoes/cib-optimization-psagd
[LINK]
http://arxiv.org/abs/2410.00535v2
[DATE]
2024-10-02 21:02:06+08:00
[CATEGORIES]
cs.LG
LEGO: Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion
[AUTHORS]
Dexuan Ding, Lei Wang, Liyun Zhu, Tom Gedeon, Piotr Koniusz
[ABSTRACT]
In computer vision tasks, features often come from diverse representations,
domains, and modalities, such as text, images, and videos. Effectively fusing
these features is essential for robust performance, especially with the
availability of powerful pre-trained models like vision-language models.
However, common fusion methods, such as concatenation, element-wise operations,
and non-linear techniques, often fail to capture structural relationships, deep
feature interactions, and suffer from inefficiency or misalignment of features
across domains. In this paper, we shift from high-dimensional feature space to
a lower-dimensional, interpretable graph space by constructing similarity
graphs that encode feature relationships at different levels, e.g., clip,
frame, patch, token, etc. To capture deeper interactions, we use graph power
expansions and introduce a learnable graph fusion operator to combine these
graph powers for more effective fusion. Our approach is relationship-centric,
operates in a homogeneous space, and is mathematically principled, resembling
element-wise similarity score aggregation via multilinear polynomials. We
demonstrate the effectiveness of our graph-based fusion method on video anomaly
detection, showing strong performance across multi-representational,
multi-modal, and multi-domain feature fusion tasks.
[COMMENTS]
Research paper
[LINK]
http://arxiv.org/abs/2410.01506v1
[DATE]
2024-10-02 20:58:55+08:00
[CATEGORIES]
cs.LG
$α$-Divergence Loss Function for Neural Density Ratio Estimation
[AUTHORS]
Yoshiaki Kitazawa
[ABSTRACT]
Density ratio estimation (DRE) is a fundamental machine learning technique
for capturing relationships between two probability distributions.
State-of-the-art DRE methods estimate the density ratio using neural networks
trained with loss functions derived from variational representations of
$f$-divergence. However, existing methods face optimization challenges, such as
overfitting due to lower-unbounded loss functions, biased mini-batch gradients,
vanishing training loss gradients, and high sample requirements for
Kullback-Leibler (KL) divergence loss functions. To address these issues, we
focus on $\alpha$-divergence, which provides a suitable variational
representation of $f$-divergence. Subsequently, a novel loss function for DRE,
the $\alpha$-divergence loss function ($\alpha$-Div), is derived. $\alpha$-Div
is concise but offers stable and effective optimization for DRE. The
boundedness of $\alpha$-divergence provides the potential for successful DRE
with data exhibiting high KL-divergence. Our numerical experiments demonstrate
the effectiveness in optimization using $\alpha$-Div. However, the experiments
also show that the proposed loss function offers no significant advantage over
the KL-divergence loss function in terms of RMSE for DRE. This indicates that
the accuracy of DRE is primarily determined by the amount of KL-divergence in
the data and is less dependent on $\alpha$-divergence.
[COMMENTS]
$\mathcal{T}_{\text{Lip}}$ in Theorem 7.1 (Theorem B.15.) was changed
to the set of all locally Lipschitz continuous functions. In the previous
version, $\mathcal{T}_{\text{Lip}}$ was defined as the set of all Lipschitz
continuous functions, which is unsuitable for the statement of case (ii) in
the theorem
[LINK]
http://arxiv.org/abs/2402.02041v3
[DATE]
2024-10-02 20:57:41+08:00
[CATEGORIES]
cs.LG
Training-Free Message Passing for Learning on Hypergraphs
[AUTHORS]
Bohan Tang, Zexi Liu, Keyue Jiang, Siheng Chen, Xiaowen Dong
[ABSTRACT]
Hypergraphs are crucial for modelling higher-order interactions in real-world
data. Hypergraph neural networks (HNNs) effectively utilise these structures by
message passing to generate informative node features for various downstream
tasks like node classification. However, the message passing module in existing
HNNs typically requires a computationally intensive training process, which
limits their practical use. To tackle this challenge, we propose an alternative
approach by decoupling the usage of hypergraph structural information from the
model learning stage. This leads to a novel training-free message passing
module, named TF-MP-Module, which can be precomputed in the data preprocessing
stage, thereby reducing the computational burden. We refer to the hypergraph
neural network equipped with our TF-MP-Module as TF-HNN. We theoretically
support the efficiency and effectiveness of TF-HNN by showing that: 1) It is
more training-efficient compared to existing HNNs; 2) It utilises as much
information as existing HNNs for node feature generation; and 3) It is robust
against the oversmoothing issue while using long-range interactions.
Experiments based on seven real-world hypergraph benchmarks in node
classification and hyperlink prediction show that, compared to state-of-the-art
HNNs, TF-HNN exhibits both competitive performance and superior training
efficiency. Specifically, on the large-scale benchmark, Trivago, TF-HNN
outperforms the node classification accuracy of the best baseline by 10% with
just 1% of the training time of that baseline.
[LINK]
http://arxiv.org/abs/2402.05569v4
[DATE]
2024-10-02 20:57:32+08:00
[CATEGORIES]
cs.LG
Discrete Diffusion Schrödinger Bridge Matching for Graph Transformation
[AUTHORS]
Jun Hyeong Kim, Seonghwan Kim, Seokhyun Moon, Hyeongwoo Kim, Jeheon Woo, Woo Youn Kim
[ABSTRACT]
Transporting between arbitrary distributions is a fundamental goal in
generative modeling. Recently proposed diffusion bridge models provide a
potential solution, but they rely on a joint distribution that is difficult to
obtain in practice. Furthermore, formulations based on continuous domains limit
their applicability to discrete domains such as graphs. To overcome these
limitations, we propose Discrete Diffusion Schrödinger Bridge Matching
(DDSBM), a novel framework that utilizes continuous-time Markov chains to solve
the Schrödinger bridge (SB) problem in a high-dimensional discrete state space. Our approach extends
Iterative Markovian Fitting to discrete domains, and we have proved its
convergence to the SB. Furthermore, we adapt our framework for graph
transformation and show that our design choice of underlying dynamics
characterized by independent modifications of nodes and edges can be
interpreted as the entropy-regularized version of optimal transport with a cost
function described by the graph edit distance. To demonstrate the effectiveness
of our framework, we have applied DDSBM to molecular optimization in the field
of chemistry. Experimental results demonstrate that DDSBM effectively optimizes
molecules’ property-of-interest with minimal graph transformation, successfully
retaining other features.
[LINK]
http://arxiv.org/abs/2410.01500v1
[DATE]
2024-10-02 20:51:25+08:00
[CATEGORIES]
cs.LG
Understanding the Training and Generalization of Pretrained Transformer for Sequential Decision Making
[AUTHORS]
Hanzhao Wang, Yu Pan, Fupeng Sun, Shang Liu, Kalyan Talluri, Guanting Chen, Xiaocheng Li
[ABSTRACT]
In this paper, we consider the supervised pre-trained transformer for a class
of sequential decision-making problems. The class of considered problems is a
subset of the general formulation of reinforcement learning in that there is no
transition probability matrix; though seemingly restrictive, the subset class
of problems covers bandits, dynamic pricing, and newsvendor problems as special
cases. Such a structure enables the use of optimal actions/decisions in the
pre-training phase, and the usage also provides new insights for the training
and generalization of the pre-trained transformer. We first note the training
of the transformer model can be viewed as a performative prediction problem,
and the existing methods and theories largely ignore or cannot resolve an
out-of-distribution issue. We propose a natural solution that includes the
transformer-generated action sequences in the training procedure, and it enjoys
better properties both numerically and theoretically. The availability of the
optimal actions in the considered tasks also allows us to analyze the
properties of the pre-trained transformer as an algorithm and explains why it
may lack exploration and how this can be automatically resolved. Numerically,
we categorize the advantages of pre-trained transformers over the structured
algorithms such as UCB and Thompson sampling into three cases: (i) it better
utilizes the prior knowledge in the pre-training data; (ii) it can elegantly
handle the misspecification issue suffered by the structured algorithms; (iii)
for short time horizons such as $T\le50$, it behaves more greedily and enjoys much
better regret than the structured algorithms designed for asymptotic
optimality.
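
For readers unfamiliar with the structured baselines named above, the following is a minimal sketch of the standard UCB1 index rule on a Bernoulli bandit. It is not the authors' code; the function name `ucb1`, the arm means, and the horizon are illustrative assumptions.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run standard UCB1 on a Bernoulli bandit; return pull counts per arm.

    arm_means and horizon are illustrative inputs, not values from the paper.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms      # number of pulls per arm
    sums = [0.0] * n_arms      # cumulative reward per arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # pull every arm once to initialise estimates
        else:
            # empirical mean plus exploration bonus sqrt(2 ln t / n_i)
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


counts = ucb1([0.2, 0.8], horizon=50)
```

The exploration bonus keeps the rule sampling arms that currently look worse, which is one plausible reading of why a greedier policy can have lower regret on short horizons such as $T\le50$.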
[LINK]
http://arxiv.org/abs/2405.14219v2
[DATE]
2024-10-02 20:45:50+08:00
[CATEGORIES]
cs.LG
$σ$-zero: Gradient-based Optimization of $\ell_0$-norm Adversarial Examples
[AUTHORS]
Antonio Emanuele Cinà, Francesco Villani, Maura Pintor, Lea Schönherr, Battista Biggio, Marcello Pelillo
[ABSTRACT]
Evaluating the adversarial robustness of deep networks to gradient-based
attacks is challenging. While most attacks consider $\ell_2$- and
$\ell_\infty$-norm constraints to craft input perturbations, only a few
investigate sparse $\ell_1$- and $\ell_0$-norm attacks. In particular,
$\ell_0$-norm attacks remain the least studied due to the inherent complexity
of optimizing over a non-convex and non-differentiable constraint. However,
evaluating adversarial robustness under these attacks could reveal weaknesses
otherwise left untested with more conventional $\ell_2$- and $\ell_\infty$-norm
attacks. In this work, we propose a novel $\ell_0$-norm attack, called
$\sigma$-zero, which leverages a differentiable approximation of the $\ell_0$
norm to facilitate gradient-based optimization, and an adaptive projection
operator to dynamically adjust the trade-off between loss minimization and
perturbation sparsity. Extensive evaluations using MNIST, CIFAR10, and ImageNet
datasets, involving robust and non-robust models, show that $\sigma$-zero finds
minimum $\ell_0$-norm adversarial examples without requiring any time-consuming
hyperparameter tuning, and that it outperforms all competing sparse attacks in
terms of success rate, perturbation size, and efficiency.
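
The idea of a differentiable approximation of the $\ell_0$ norm can be sketched as follows. The abstract does not state the form of the approximation, so the surrogate $\sum_i \delta_i^2 / (\delta_i^2 + \sigma)$ used here is an assumption chosen for illustration, and the function names are hypothetical.

```python
def smooth_l0(delta, sigma=1e-3):
    """Smooth surrogate for the l0 norm: sum_i d_i^2 / (d_i^2 + sigma).

    Each term is 0 at d_i = 0 and approaches 1 as |d_i| grows, so the sum
    approximates the number of non-zero entries. Assumed form, see lead-in.
    """
    return sum(d * d / (d * d + sigma) for d in delta)


def smooth_l0_grad(delta, sigma=1e-3):
    """Gradient of the surrogate: 2 * sigma * d_i / (d_i^2 + sigma)^2."""
    return [2.0 * sigma * d / (d * d + sigma) ** 2 for d in delta]


perturbation = [1.0, 0.0, -2.0]
approx_count = smooth_l0(perturbation)
gradient = smooth_l0_grad(perturbation)
```

Smaller values of `sigma` make the surrogate track the true count more closely, at the cost of gradients that vanish faster away from zero.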
[COMMENTS]
Code available at
https://github.com/Cinofix/sigma-zero-adversarial-attack
[LINK]
http://arxiv.org/abs/2402.01879v2
[DATE]
2024-10-02 20:42:56+08:00
[CATEGORIES]
cs.LG
One Wave to Explain Them All: A Unifying Perspective on Post-hoc Explainability
[AUTHORS]
Gabriel Kasmi, Amandine Brunetto, Thomas Fel, Jayneel Parekh
[ABSTRACT]
Despite the growing use of deep neural networks in safety-critical
decision-making, their inherent black-box nature hinders transparency and
interpretability. Explainable AI (XAI) methods have thus emerged to understand
a model’s internal workings, notably attribution methods, also called
saliency maps. Conventional attribution methods typically identify the
locations – the where – of significant regions within an input. However,
because they overlook the inherent structure of the input data, these methods
often fail to interpret what these regions represent in terms of structural
components (e.g., textures in images or transients in sounds). Furthermore,
existing methods are usually tailored to a single data modality, limiting their
generalizability. In this paper, we propose leveraging the wavelet domain as a
robust mathematical foundation for attribution. Our approach, the Wavelet
Attribution Method (WAM), extends existing gradient-based feature
attributions into the wavelet domain, providing a unified framework for
explaining classifiers across images, audio, and 3D shapes. Empirical
evaluations demonstrate that WAM matches or surpasses state-of-the-art methods
across faithfulness metrics and models in image, audio, and 3D explainability.
Finally, we show how our method explains not only the where – the important
parts of the input – but also the what – the relevant patterns in terms of
structural components.
[COMMENTS]
main: 10 pages, appendix: 14 pages, 5 Tables, 25 Figures
[LINK]
http://arxiv.org/abs/2410.01482v1
[DATE]
2024-10-02 20:34:04+08:00
[CATEGORIES]
cs.LG
Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales
[AUTHORS]
Joakim Wallmark, Maria Josefsson, Marie Wiberg
[ABSTRACT]
Item Response Theory (IRT) is a powerful statistical approach for evaluating
test items and determining test taker abilities through response analysis. An
IRT model that better fits the data leads to more accurate latent trait
estimates. In this study, we present a new model for multiple choice data, the
monotone multiple choice (MMC) model, which we fit using autoencoders. Using
both simulated scenarios and real data from the Swedish Scholastic Aptitude
Test, we demonstrate empirically that the MMC model outperforms the traditional
nominal response IRT model in terms of fit. Furthermore, we illustrate how the
latent trait scale from any fitted IRT model can be transformed into a ratio
scale, aiding in score interpretation and making it easier to compare different
types of IRT models. We refer to these new scales as bit scales. Bit scales are
especially useful for models for which minimal or no assumptions are made for
the latent trait scale distributions, such as for the autoencoder fitted models
in this study.
[LINK]
http://arxiv.org/abs/2410.01480v1
[DATE]
2024-10-02 20:33:16+08:00
[CATEGORIES]
cs.LG
Reducing Variance in Meta-Learning via Laplace Approximation for Regression Tasks
[AUTHORS]
Alfredo Reichlin, Gustaf Tegnér, Miguel Vasco, Hang Yin, Mårten Björkman, Danica Kragic
[ABSTRACT]
Given a finite set of sample points, meta-learning algorithms aim to learn an
optimal adaptation strategy for new, unseen tasks. Often, this data can be
ambiguous as it might belong to different tasks concurrently. This is
particularly the case in meta-regression tasks. In such cases, the estimated
adaptation strategy is subject to high variance due to the limited amount of
support data for each task, which often leads to sub-optimal generalization
performance. In this work, we address the problem of variance reduction in
gradient-based meta-learning and formalize the class of problems prone to this,
a condition we refer to as task overlap. Specifically, we propose a
novel approach that reduces the variance of the gradient estimate by weighting
each support point individually by the variance of its posterior over the
parameters. To estimate the posterior, we utilize the Laplace approximation,
which allows us to express the variance in terms of the curvature of the loss
landscape of our meta-learner. Experimental results demonstrate the
effectiveness of the proposed method and highlight the importance of variance
reduction in meta-learning.
[LINK]
http://arxiv.org/abs/2410.01476v1
[DATE]
2024-10-02 20:30:05+08:00
[CATEGORIES]
cs.LG
Correlations Are Ruining Your Gradient Descent
[AUTHORS]
Nasir Ahmad
[ABSTRACT]
Herein the topics of (natural) gradient descent, data decorrelation, and
approximate methods for backpropagation are brought into a common discussion.
Natural gradient descent illuminates how gradient vectors, pointing at
directions of steepest descent, can be improved by considering the local
curvature of loss landscapes. We extend this perspective and show that to fully
solve the problem illuminated by natural gradients in neural networks, one must
recognise that correlations in the data at any linear transformation, including
node responses at every layer of a neural network, cause a non-orthonormal
relationship between the model’s parameters. To solve this requires a method
for decorrelating inputs at each individual layer of a neural network. We
describe a range of methods which have been proposed for decorrelation and
whitening of node output, and expand on these to provide a novel method
specifically useful for distributed computing and computational neuroscience.
Implementing decorrelation within multi-layer neural networks, we can show that
not only is training via backpropagation sped up significantly but also
existing approximations of backpropagation, which have failed catastrophically
in the past, benefit significantly in their accuracy and convergence speed.
This has the potential to provide a route forward for approximate gradient
descent methods which have previously been discarded, training approaches for
analogue and neuromorphic hardware, and potentially insights as to the efficacy
and utility of decorrelation processes in the brain.
[COMMENTS]
15 pages, 4 figures
[LINK]
http://arxiv.org/abs/2407.10780v2
[DATE]
2024-10-02 20:27:13+08:00
[CATEGORIES]
cs.LG
Off-policy Evaluation with Deeply-abstracted States
[AUTHORS]
Meiling Hao, Pingfan Su, Liyuan Hu, Zoltan Szabo, Qingyuan Zhao, Chengchun Shi
[ABSTRACT]
Off-policy evaluation (OPE) is crucial for assessing a target policy’s impact
offline before its deployment. However, achieving accurate OPE in large state
spaces remains challenging. This paper studies state abstractions – originally
designed for policy learning – in the context of OPE. Our contributions are
three-fold: (i) We define a set of irrelevance conditions central to learning
state abstractions for OPE, and derive a backward-model-irrelevance condition
for achieving irrelevance in (marginalized) importance sampling
ratios by constructing a time-reversed Markov decision process (MDP). (ii) We
propose a novel iterative procedure that sequentially projects the original
state space into a smaller space, resulting in a deeply-abstracted state, which
substantially simplifies the sample complexity of OPE arising from high
cardinality. (iii) We prove the Fisher consistencies of various OPE estimators
when applied to our proposed abstract state spaces.
[COMMENTS]
56 pages, 5 figures
[LINK]
http://arxiv.org/abs/2406.19531v2
[DATE]
2024-10-02 20:22:51+08:00
[CATEGORIES]
cs.LG
ShortCircuit: AlphaZero-Driven Circuit Design
[AUTHORS]
Dimitrios Tsaras, Antoine Grosnit, Lei Chen, Zhiyao Xie, Haitham Bou-Ammar, Mingxuan Yuan
[ABSTRACT]
Chip design relies heavily on generating Boolean circuits, such as
AND-Inverter Graphs (AIGs), from functional descriptions like truth tables.
This generation operation is a key process in logic synthesis, a primary chip
design stage. While recent advances in deep learning have aimed to accelerate
circuit design, these efforts have mostly focused on tasks other than
synthesis, and traditional heuristic methods have plateaued. In this paper, we
introduce ShortCircuit, a novel transformer-based architecture that leverages
the structural properties of AIGs and performs efficient space exploration.
Contrary to prior approaches attempting end-to-end generation of logic circuits
using deep networks, ShortCircuit employs a two-phase process combining
supervised with reinforcement learning to enhance generalization to unseen
truth tables. We also propose an AlphaZero variant to handle the doubly
exponentially large state space and the reward sparsity, enabling the discovery
of near-optimal designs. To evaluate the generative performance of our model,
we extract 500 truth tables from a set of 20 real-world circuits. ShortCircuit
successfully generates AIGs for $98\%$ of the 8-input test truth tables, and
outperforms the state-of-the-art logic synthesis tool, ABC, by $18.62\%$ in
terms of circuit size.
[LINK]
http://arxiv.org/abs/2408.09858v2
[DATE]
2024-10-02 20:22:10+08:00
[CATEGORIES]
cs.LG
Flow Matching for Accelerated Simulation of Atomic Transport in Materials
[AUTHORS]
Juno Nam, Sulin Liu, Gavin Winter, KyuJung Jun, Soojung Yang, Rafael Gómez-Bombarelli
[ABSTRACT]
We introduce LiFlow, a generative framework to accelerate molecular dynamics
(MD) simulations for crystalline materials that formulates the task as
conditional generation of atomic displacements. The model uses flow matching,
with a Propagator submodel to generate atomic displacements and a Corrector to
locally correct unphysical geometries, and incorporates an adaptive prior based
on the Maxwell-Boltzmann distribution to account for chemical and thermal
conditions. We benchmark LiFlow on a dataset comprising 25-ps trajectories of
lithium diffusion across 4,186 solid-state electrolyte (SSE) candidates at four
temperatures. The model obtains a consistent Spearman rank correlation of
0.7-0.8 for lithium mean squared displacement (MSD) predictions on unseen
compositions. Furthermore, LiFlow generalizes from short training trajectories
to larger supercells and longer simulations while maintaining high accuracy.
With speed-ups of up to 600,000$\times$ compared to first-principles methods,
LiFlow enables scalable simulations at significantly larger length and time
scales.
[LINK]
http://arxiv.org/abs/2410.01464v1
[DATE]
2024-10-02 20:16:46+08:00
[CATEGORIES]
cs.LG
Selective Aggregation for Low-Rank Adaptation in Federated Learning
[AUTHORS]
Pengxin Guo, Shuang Zeng, Yanran Wang, Huijie Fan, Feifei Wang, Liangqiong Qu
[ABSTRACT]
We investigate LoRA in federated learning through the lens of the asymmetry
analysis of the learned $A$ and $B$ matrices. In doing so, we uncover that $A$
matrices are responsible for learning general knowledge, while $B$ matrices
focus on capturing client-specific knowledge. Based on this finding, we
introduce Federated Share-A Low-Rank Adaptation (FedSA-LoRA), which employs two
low-rank trainable matrices $A$ and $B$ to model the weight update, but only
$A$ matrices are shared with the server for aggregation. Moreover, we delve
into the relationship between the learned $A$ and $B$ matrices in other LoRA
variants, such as rsLoRA and VeRA, revealing a consistent pattern.
Consequently, we extend our FedSA-LoRA method to these LoRA variants, resulting
in FedSA-rsLoRA and FedSA-VeRA. In this way, we establish a general paradigm
for integrating LoRA with FL, offering guidance for future work on subsequent
LoRA variants combined with FL. Extensive experimental results on natural
language understanding and generation tasks demonstrate the effectiveness of
the proposed method.
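
The selective aggregation described above, in which only the $A$ matrices are sent to the server, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the dictionary layout, the function names, and the use of an unweighted mean are all hypothetical choices.

```python
def average_matrices(matrices):
    """Element-wise mean of equally shaped matrices (lists of lists)."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m[r][c] for m in matrices) / n for c in range(cols)]
        for r in range(rows)
    ]


def share_a_round(clients):
    """One aggregation round: average A across clients, keep each B local.

    clients is a list of dicts with keys "A" and "B" (hypothetical layout).
    """
    global_a = average_matrices([c["A"] for c in clients])
    return [{"A": [row[:] for row in global_a], "B": c["B"]} for c in clients]


clients = [
    {"A": [[1.0, 2.0]], "B": [[0.5], [0.5]]},
    {"A": [[3.0, 4.0]], "B": [[-1.0], [2.0]]},
]
updated = share_a_round(clients)
```

After the round every client holds the same averaged $A$, while each $B$ is unchanged and never leaves its client. A real system would likely weight the average by client data size, which is omitted here.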
[LINK]
http://arxiv.org/abs/2410.01463v1
[DATE]
2024-10-02 20:14:36+08:00
[CATEGORIES]
cs.LG
Learning Explainable and Better Performing Representations of POMDP Strategies
[AUTHORS]
Alexander Bork, Debraj Chakraborty, Kush Grover, Jan Kretinsky, Stefanie Mohr
[ABSTRACT]
Strategies for partially observable Markov decision processes (POMDP)
typically require memory. One way to represent this memory is via automata. We
present a method to learn an automaton representation of a strategy using a
modification of the L*-algorithm. Compared to the tabular representation of a
strategy, the resulting automaton is dramatically smaller and thus also more
explainable. Moreover, in the learning process, our heuristics may even improve
the strategy’s performance. In contrast to approaches that synthesize an
automaton directly from the POMDP, thereby solving it, our approach is
incomparably more scalable.
[COMMENTS]
Technical report for the submission to TACAS 24
[LINK]
http://arxiv.org/abs/2401.07656v4
[DATE]
2024-10-02 20:12:31+08:00
[CATEGORIES]
cs.LG
From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge
[AUTHORS]
Xiefeng Wu
[ABSTRACT]
Q-shaping is an extension of Q-value initialization and serves as an
alternative to reward shaping for incorporating domain knowledge to accelerate
agent training, thereby improving sample efficiency by directly shaping
Q-values. This approach is both general and robust across diverse tasks,
allowing for immediate impact assessment while guaranteeing optimality. We
evaluated Q-shaping across 20 different environments using a large language
model (LLM) as the heuristic provider. The results demonstrate that Q-shaping
significantly enhances sample efficiency, achieving a 16.87%
improvement over the best baseline in each environment and a 253.80%
improvement compared to LLM-based reward shaping methods. These findings
establish Q-shaping as a superior and unbiased alternative to conventional
reward shaping in reinforcement learning.
[COMMENTS]
q-shaping, reinforcement learning, reward shaping
[LINK]
http://arxiv.org/abs/2410.01458v1
[DATE]
2024-10-02 20:10:07+08:00
[CATEGORIES]
cs.LG
Verbalized Graph Representation Learning: A Fully Interpretable Graph Model Based on Large Language Models Throughout the Entire Process
[AUTHORS]
Xingyu Ji, Jiale Liu, Lu Li, Maojun Wang, Zeyu Zhang
[ABSTRACT]
Representation learning on text-attributed graphs (TAGs) has attracted
significant interest due to its wide-ranging real-world applications,
particularly through Graph Neural Networks (GNNs). Traditional GNN methods
focus on encoding the structural information of graphs, often using shallow
text embeddings for node or edge attributes. This limits the model’s ability to
understand the rich semantic information in the data and to reason about
complex downstream tasks, and such models also lack interpretability. With the
rise of large language models (LLMs), an increasing number of studies are
combining them with GNNs for graph representation learning and downstream
tasks. While these approaches effectively leverage the rich semantic
information in TAG datasets, their main drawback is that they are only
partially interpretable, which limits their application in critical fields. In
this paper, we propose a verbalized graph representation learning (VGRL) method
which is fully interpretable. In contrast to traditional graph machine learning
models, which are usually optimized within a continuous parameter space, VGRL
constrains this parameter space to text descriptions, which ensures complete
interpretability throughout the entire process, making it easier for users to
understand and trust the decisions of the model. We conduct several studies to
empirically evaluate the effectiveness of VGRL, and we believe this method can
serve as a stepping stone in graph representation learning.
[COMMENTS]
under review. corresponding author: Zeyu Zhang
[LINK]
http://arxiv.org/abs/2410.01457v1
[DATE]
2024-10-02 20:07:47+08:00
[CATEGORIES]
cs.LG
Ensembles provably learn equivariance through data augmentation
[AUTHORS]
Oskar Nordenfors, Axel Flinth
[ABSTRACT]
Recently, it was proved that group equivariance emerges in ensembles of
neural networks as the result of full augmentation in the limit of infinitely
wide neural networks (neural tangent kernel limit). In this paper, we extend
this result significantly. We provide a proof that this emergence does not
depend on the neural tangent kernel limit at all. We also consider stochastic
settings, and furthermore general architectures. For the latter, we provide a
simple sufficient condition on the relation between the architecture and the
action of the group for our results to hold. We validate our findings through
simple numeric experiments.
[LINK]
http://arxiv.org/abs/2410.01452v1
[DATE]
2024-10-02 20:02:43+08:00
[CATEGORIES]
cs.LG
Multiple-Input Fourier Neural Operator (MIFNO) for source-dependent 3D elastodynamics
[AUTHORS]
Fanny Lehmann, Filippo Gatti, Didier Clouteau
[ABSTRACT]
Numerical simulations are essential tools to evaluate the solution of the
wave equation in complex settings, such as three-dimensional (3D) domains with
heterogeneous properties. However, their application is limited by high
computational costs and existing surrogate models lack the flexibility of
numerical solvers. This work introduces the Multiple-Input Fourier Neural
Operator (MIFNO) to deal with structured 3D fields representing material
properties as well as vectors describing the source characteristics. The MIFNO
is applied to the problem of elastic wave propagation in the Earth’s crust. It
is trained on the HEMEW^S-3D database containing 30000 earthquake simulations
in different heterogeneous domains with random source positions and
orientations. Outputs are time- and space-dependent surface wavefields. The
MIFNO predictions are assessed as good to excellent based on Goodness-Of-Fit
(GOF) criteria. Wave arrival times and wave fronts’ propagation are very
accurate since 80% of the predictions have an excellent phase GOF. The
fluctuation amplitudes are good for 87% of the predictions. The envelope score
is hindered by the small-scale fluctuations that are challenging to capture due
to the complex physical phenomena associated with high-frequency features.
Nevertheless, the MIFNO can generalize to sources located outside the training
domain and it shows good generalization ability to a real complex overthrust
geology. When focusing on a region of interest, transfer learning improves the
accuracy with limited additional costs, since GOF scores improved by more than
1 GOF unit with only 500 additional specific samples. The MIFNO is the first
surrogate model offering the flexibility of an earthquake simulator with
varying sources and material properties. Its good accuracy and massive speed-up
offer new perspectives to replace numerical simulations in many-query problems.
[LINK]
http://arxiv.org/abs/2404.10115v2
[DATE]
2024-10-02 19:59:27+08:00
[CATEGORIES]
cs.LG
Closed-loop Long-horizon Robotic Planning via Equilibrium Sequence Modeling
[AUTHORS]
Jinghan Li, Zhicheng Sun, Fei Li, Cao Sheng, Jiazhong Yu, Yadong Mu
[ABSTRACT]
In the endeavor to make autonomous robots take actions, task planning is a
major challenge that requires translating high-level task descriptions into
long-horizon action sequences. Despite recent advances in language model
agents, they remain prone to planning errors and limited in their ability to
plan ahead. To address these limitations in robotic planning, we advocate a
self-refining scheme that iteratively refines a draft plan until an equilibrium
is reached. Remarkably, this process can be optimized end-to-end from an
analytical perspective without the need to curate additional verifiers or
reward models, allowing us to train self-refining planners in a simple
supervised learning fashion. Meanwhile, a nested equilibrium sequence modeling
procedure is devised for efficient closed-loop planning that incorporates
useful feedback from the environment (or an internal world model). Our method
is evaluated on the VirtualHome-Env benchmark, showing advanced performance
with better scaling for inference computation. Code is available at
https://github.com/Singularity0104/equilibrium-planner.
[LINK]
http://arxiv.org/abs/2410.01440v1
[DATE]
2024-10-02 19:42:49+08:00
[CATEGORIES]
cs.LG
Information-Theoretical Principled Trade-off between Jailbreakability and Stealthiness on Vision Language Models
[AUTHORS]
Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu, Chu-Song Chen
[ABSTRACT]
In recent years, Vision-Language Models (VLMs) have demonstrated significant
advancements in artificial intelligence, transforming tasks across various
domains. Despite their capabilities, these models are susceptible to jailbreak
attacks, which can compromise their safety and reliability. This paper explores
the trade-off between jailbreakability and stealthiness in VLMs, presenting a
novel algorithm to detect non-stealthy jailbreak attacks and enhance model
robustness. We introduce a stealthiness-aware jailbreak attack using diffusion
models, highlighting the challenge of detecting AI-generated content. Our
approach leverages Fano’s inequality to elucidate the relationship between
attack success rates and stealthiness scores, providing an explainable
framework for evaluating these threats. Our contributions aim to fortify AI
systems against sophisticated attacks, ensuring their outputs remain aligned
with ethical standards and user expectations.
[LINK]
http://arxiv.org/abs/2410.01438v1
[DATE]
2024-10-02 19:40:49+08:00
[CATEGORIES]
cs.LG
Adaptive teachers for amortized samplers
[AUTHORS]
Minsu Kim, Sanghyeok Choi, Taeyoung Yun, Emmanuel Bengio, Leo Feng, Jarrid Rector-Brooks, Sungsoo Ahn, Jinkyoo Park, Nikolay Malkin, Yoshua Bengio
[ABSTRACT]
Amortized inference is the task of training a parametric model, such as a
neural network, to approximate a distribution with a given unnormalized density
where exact sampling is intractable. When sampling is implemented as a
sequential decision-making process, reinforcement learning (RL) methods, such
as generative flow networks, can be used to train the sampling policy.
Off-policy RL training facilitates the discovery of diverse, high-reward
candidates, but existing methods still face challenges in efficient
exploration. We propose to use an adaptive training distribution (the Teacher)
to guide the training of the primary amortized sampler (the Student) by
prioritizing high-loss regions. The Teacher, an auxiliary behavior model, is
trained to sample high-error regions of the Student and can generalize across
unexplored modes, thereby enhancing mode coverage by providing an efficient
training curriculum. We validate the effectiveness of this approach in a
synthetic environment designed to present an exploration challenge, two
diffusion-based sampling tasks, and four biochemical discovery tasks,
demonstrating its ability to improve sample efficiency and mode coverage.
[COMMENTS]
26 pages, 12 figures
[LINK]
http://arxiv.org/abs/2410.01432v1
[DATE]
2024-10-02 19:33:13+08:00
[CATEGORIES]
cs.LG
Scalable Reinforcement Learning-based Neural Architecture Search
[AUTHORS]
Amber Cassimon, Siegfried Mercelis, Kevin Mets
[ABSTRACT]
In this publication, we assess the ability of a novel Reinforcement
Learning-based solution to the problem of Neural Architecture Search, where a
Reinforcement Learning (RL) agent learns to search for good architectures,
rather than to return a single optimal architecture. We consider both the
NAS-Bench-101 and NAS-Bench-301 settings, and compare against various known
strong baselines, such as local search and random search. We conclude that our
Reinforcement Learning agent displays strong scalability with regard to the
size of the search space, but limited robustness to hyperparameter changes.
[COMMENTS]
33 Pages, 19 Figures
[LINK]
http://arxiv.org/abs/2410.01431v1
[DATE]
2024-10-02 19:31:48+08:00
[CATEGORIES]
cs.LG
On exploring the potential of quantum auto-encoder for learning quantum systems
[AUTHORS]
Yuxuan Du, Dacheng Tao
[ABSTRACT]
The frequent interactions between quantum computing and machine learning
revolutionize both fields. One prototypical achievement is the quantum
auto-encoder (QAE), as the leading strategy to relieve the curse of
dimensionality ubiquitous in the quantum world. Despite its attractive
capabilities, practical applications of QAE remain largely unexplored. To
narrow this knowledge gap, here we devise three effective QAE-based learning
protocols to address three classically computationally hard learning problems
when learning quantum systems, which are low-rank state fidelity estimation,
quantum Fisher information estimation, and Gibbs state preparation. Attributed
to the versatility of QAE, our proposals can be readily executed on near-term
quantum machines. Besides, we analyze the error bounds of the trained protocols
and showcase the necessary conditions to provide practical utility from the
perspective of complexity theory. We conduct numerical simulations to confirm
the effectiveness of the proposed three protocols. Our work sheds new light on
developing advanced quantum learning algorithms to accomplish hard quantum
physics and quantum information processing tasks.
[COMMENTS]
Accepted to IEEE Transactions on Neural Networks and Learning Systems
[LINK]
http://arxiv.org/abs/2106.15432v2
[DATE]
2024-10-02 19:20:22+08:00
[CATEGORIES]
cs.LG
Fair4Free: Generating High-fidelity Fair Synthetic Samples using Data Free Distillation
[AUTHORS]
Md Fahim Sikder, Daniel de Leng, Fredrik Heintz
[ABSTRACT]
This work presents Fair4Free, a novel generative model to generate synthetic
fair data using data-free distillation in the latent space. Fair4Free can work
in situations where the data is private or inaccessible. In our approach, we
first train a teacher model to create fair representation and then distil the
knowledge to a student model (using a smaller architecture). The process of
distilling the student model is data-free, i.e. the student model does not have
access to the training dataset while distilling. After the distillation, we use
the distilled model to generate fair synthetic samples. Our extensive
experiments show that our synthetic samples outperform state-of-the-art models
in all three criteria (fairness, utility and synthetic quality) with a
performance increase of 5% for fairness, 8% for utility and 12% in synthetic
quality for both tabular and image datasets.
[LINK]
http://arxiv.org/abs/2410.01423v1
[DATE]
2024-10-02 19:16:11+08:00
[CATEGORIES]
cs.LG
Open-Set Graph Anomaly Detection via Normal Structure Regularisation
[AUTHORS]
Qizhou Wang, Guansong Pang, Mahsa Salehi, Xiaokun Xia, Christopher Leckie
[ABSTRACT]
This paper considers an important Graph Anomaly Detection (GAD) task, namely
open-set GAD, which aims to train a detection model using a small number of
normal and anomaly nodes (referred to as seen anomalies) to detect both seen
anomalies and unseen anomalies (i.e., anomalies that cannot be illustrated by the
training anomalies). Those labelled training data provide crucial prior
knowledge about abnormalities for GAD models, enabling substantially reduced
detection errors. However, current supervised GAD methods tend to
over-emphasise fitting the seen anomalies, leading to many errors of detecting
the unseen anomalies as normal nodes. Further, existing open-set AD models were
introduced to handle Euclidean data, failing to effectively capture
discriminative features from graph structure and node attributes for GAD. In
this work, we propose a novel open-set GAD approach, namely normal structure
regularisation (NSReg), to achieve generalised detection ability to unseen
anomalies, while maintaining its effectiveness on detecting seen anomalies. The
key idea in NSReg is to introduce a regularisation term that enforces the
learning of compact, semantically-rich representations of normal nodes based on
their structural relations to other nodes. When being optimised with supervised
anomaly detection losses, the regularisation term helps incorporate strong
normality into the modelling, and thus, it effectively avoids over-fitting the
seen anomalies and learns a better normality decision boundary, largely
reducing the false negatives of detecting unseen anomalies as normal. Extensive
empirical results on seven real-world datasets show that NSReg significantly
outperforms state-of-the-art competing methods by at least 14% AUC-ROC on the
unseen anomaly classes and by 10% AUC-ROC on all anomaly classes.
[LINK]
http://arxiv.org/abs/2311.06835v4
[DATE]
2024-10-02 19:15:25+08:00
[CATEGORIES]
cs.LG
Conditional Diffusion on Web-Scale Image Pairs leads to Diverse Image Variations
[AUTHORS]
Manoj Kumar, Neil Houlsby, Emiel Hoogeboom
[ABSTRACT]
Generating image variations, where a model produces variations of an input
image while preserving the semantic context, has gained increasing attention.
Current image variation techniques involve adapting a text-to-image model to
reconstruct an input image conditioned on the same image. We first demonstrate
that a diffusion model trained to reconstruct an input image from frozen
embeddings can reconstruct the image with minor variations. Second, inspired
by how text-to-image models learn from web-scale text-image pairs, we explore a
new pretraining strategy to generate image variations using a large collection
of image pairs. Our diffusion model Semantica receives a random
(encoded) image from a webpage as conditional input and denoises another noisy
random image from the same webpage. We carefully examine various design choices
for the image encoder, given its crucial role in extracting relevant context
from the input image. Once trained, Semantica can adaptively generate
new images from a dataset by simply using images from that dataset as input.
Finally, we identify limitations in standard image consistency metrics for
evaluating image variations and propose alternative metrics based on few-shot
generation.
[LINK]
http://arxiv.org/abs/2405.14857v3
[DATE]
2024-10-02 18:34:09+08:00
[CATEGORIES]
cs.LG
On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding
[AUTHORS]
Kevin Xu, Issei Sato
[ABSTRACT]
Looped Transformers offer advantages in parameter efficiency and Turing
completeness. However, their expressive power for function approximation and
approximation rate remains underexplored. In this paper, we establish
approximation rates of Looped Transformers by defining the concept of the
modulus of continuity for sequence-to-sequence functions. This reveals a
limitation specific to the looped architecture. That is, the analysis prompts
us to incorporate scaling parameters for each loop, conditioned on timestep
encoding. Experimental results demonstrate that increasing the number of loops
enhances performance, with further gains achieved through the timestep encoding
architecture.
[LINK]
http://arxiv.org/abs/2410.01405v1
[DATE]
2024-10-02 18:31:17+08:00
[CATEGORIES]
cs.LG
DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation
[AUTHORS]
Jiwook Kim, Seonho Lee, Jaeyo Shin, Jiho Choi, Hyunjung Shim
[ABSTRACT]
Score distillation sampling (SDS) has emerged as an effective framework in
text-driven 3D editing tasks, leveraging diffusion models for 3D consistent
editing. However, existing SDS-based 3D editing methods suffer from long
training times and produce low-quality results. We identify that the root cause
of this performance degradation is their conflict with the sampling dynamics of
diffusion models. Addressing this conflict allows us to treat SDS as a
diffusion reverse process for 3D editing via sampling from data space. In
contrast, existing methods naively distill the score function using diffusion
models. From these insights, we propose DreamCatalyst, a novel framework that
considers these sampling dynamics in the SDS framework. Specifically, we devise
the optimization process of our DreamCatalyst to approximate the diffusion
reverse process in editing tasks, thereby aligning with diffusion sampling
dynamics. As a result, DreamCatalyst successfully reduces training time and
improves editing quality. Our method offers two modes: (1) a fast mode that
edits Neural Radiance Fields (NeRF) scenes approximately 23 times faster than
current state-of-the-art NeRF editing methods, and (2) a high-quality mode that
produces superior results about 8 times faster than these methods. Notably, our
high-quality mode outperforms current state-of-the-art NeRF editing methods in
terms of both speed and quality. DreamCatalyst also surpasses the
state-of-the-art 3D Gaussian Splatting (3DGS) editing methods, establishing
itself as an effective and model-agnostic 3D editing solution. See more
extensive results on our project page: https://dream-catalyst.github.io.
[COMMENTS]
ProjectPage: https://dream-catalyst.github.io Code:
https://github.com/kaist-cvml/DreamCatalyst (Appendix included)
[LINK]
http://arxiv.org/abs/2407.11394v2
[DATE]
2024-10-02 18:28:14+08:00
[CATEGORIES]
cs.LG
Overpredictive Signal Analytics in Federated Learning: Algorithms and Analysis
[AUTHORS]
Vijay Anavangot
[ABSTRACT]
Edge signal processing facilitates distributed learning and inference in the
client-server model proposed in federated learning. In traditional machine
learning, clients (IoT devices) that acquire raw signal samples can help a data
center (server) learn a global signal model by pooling these distributed
samples at a third-party location. Despite the promising capabilities of IoTs,
these distributed deployments often face the challenge of sensitive private
data and communication rate constraints. This necessitates a learning approach
that communicates a processed approximation of the distributed samples instead
of the raw signals. Such a decentralized learning approach using signal
approximations will be termed distributed signal analytics in this work.
Overpredictive signal approximations may be desired for distributed signal
analytics, especially in network demand (capacity) planning applications
motivated by federated learning. In this work, we propose algorithms that
compute an overpredictive signal approximation at the client devices using an
efficient convex optimization framework. Tradeoffs between communication cost,
sampling rate, and the signal approximation error are quantified using
mathematical analysis. We also show the performance of the proposed distributed
algorithms on a publicly available residential energy consumption dataset.
[LINK]
http://arxiv.org/abs/2410.01399v1
[DATE]
2024-10-02 18:21:55+08:00
[CATEGORIES]
cs.LG
Gaussian kernel expansion with basis functions uniformly bounded in $\mathcal{L}_{\infty}$
[AUTHORS]
Mauro Bisiacco, Gianluigi Pillonetto
[ABSTRACT]
Kernel expansions are a topic of considerable interest in machine learning,
also because of their relation to the so-called feature maps introduced in
machine learning. Properties of the associated basis functions and weights
(corresponding to eigenfunctions and eigenvalues in the Mercer setting) give
insight into, for example, the structure of the associated reproducing kernel
Hilbert space, the goodness of approximation schemes, the convergence rates and
generalization properties of kernel machines. Recent work in the literature has
derived some of these results by assuming uniformly bounded basis functions in
$\mathcal{L}_\infty$. Motivated by this line of research, we investigate under
this constraint all possible kernel expansions of the Gaussian kernel, one of
the most widely used models in machine learning. Our main result is the
construction on $\mathbb{R}^2$ of a Gaussian kernel expansion with weights in
$\ell_p$ for any $p>1$. This result is optimal since we also prove that $p=1$
cannot be reached by the Gaussian kernel, nor by any of the other radial basis
function kernels commonly used in the literature. A consequence for this kind
of kernels is also the non-existence of Mercer expansions on $\mathbb{R}^2$,
with respect to any finite measure, whose eigenfunctions all belong to a closed
ball of $\mathcal{L}_\infty$.
[LINK]
http://arxiv.org/abs/2410.01394v1
[DATE]
2024-10-02 18:10:30+08:00
[CATEGORIES]
cs.LG
FLAME: Adaptive and Reactive Concept Drift Mitigation for Federated Learning Deployments
[AUTHORS]
Ioannis Mavromatis, Stefano De Feo, Aftab Khan
[ABSTRACT]
This paper presents Federated Learning with Adaptive Monitoring and
Elimination (FLAME), a novel solution capable of detecting and mitigating
concept drift in Federated Learning (FL) Internet of Things (IoT) environments.
Concept drift poses significant challenges for FL models deployed in dynamic
and real-world settings. FLAME leverages an FL architecture, considers a
real-world FL pipeline, and proves capable of maintaining model performance and
accuracy while addressing bandwidth and privacy constraints. Introducing
various features and extensions on previous works, FLAME offers a robust
solution to concept drift, significantly reducing computational load and
communication overhead. Compared to well-known lightweight mitigation methods,
FLAME demonstrates superior performance in maintaining high F1 scores and
reducing resource utilisation in large-scale IoT deployments, making it a
promising approach for real-world applications.
[COMMENTS]
Accepted for Publication at EMERGE Workshop - EWSN 2024
[LINK]
http://arxiv.org/abs/2410.01386v1
[DATE]
2024-10-02 17:55:58+08:00
[CATEGORIES]
cs.LG
A Conditional Independence Test in the Presence of Discretization
[AUTHORS]
Boyang Sun, Yu Yao, Huangyuan Hao, Yumou Qiu, Kun Zhang
[ABSTRACT]
Testing conditional independence has many applications, such as in Bayesian
network learning and causal discovery. Different test methods have been
proposed. However, existing methods generally cannot work when only
discretized observations are available. Specifically, suppose $X_1$,
$\tilde{X}_2$ and $X_3$ are observed variables, where $\tilde{X}_2$ is a
discretization of latent variables $X_2$. Applying existing test methods to the
observations of $X_1$, $\tilde{X}_2$ and $X_3$ can lead to a false conclusion
about the underlying conditional independence of variables $X_1$, $X_2$ and
$X_3$. Motivated by this, we propose a conditional independence test
specifically designed to accommodate the presence of such discretization. To
achieve this, we design the bridge equations to recover the parameter
reflecting the statistical information of the underlying latent continuous
variables. An appropriate test statistic and its asymptotic distribution under
the null hypothesis of conditional independence have also been derived. Both
theoretical results and empirical validation have been provided, demonstrating
the effectiveness of our test methods.
[LINK]
http://arxiv.org/abs/2404.17644v3
[DATE]
2024-10-02 17:55:25+08:00
[CATEGORIES]
cs.LG
RMLR: Extending Multinomial Logistic Regression into General Geometries
[AUTHORS]
Ziheng Chen, Yue Song, Rui Wang, Xiaojun Wu, Nicu Sebe
[ABSTRACT]
Riemannian neural networks, which extend deep learning techniques to
Riemannian spaces, have gained significant attention in machine learning. To
better classify the manifold-valued features, researchers have started
extending Euclidean multinomial logistic regression (MLR) into Riemannian
manifolds. However, existing approaches suffer from limited applicability due
to their strong reliance on specific geometric properties. This paper proposes
a framework for designing Riemannian MLR over general geometries, referred to
as RMLR. Our framework only requires minimal geometric properties, thus
exhibiting broad applicability and enabling its use with a wide range of
geometries. Specifically, we showcase our framework on the Symmetric Positive
Definite (SPD) manifold and special orthogonal group, i.e., the set of rotation
matrices. On the SPD manifold, we develop five families of SPD MLRs under five
types of power-deformed metrics. On rotation matrices we propose Lie MLR based
on the popular bi-invariant metric. Extensive experiments on different
Riemannian backbone networks validate the effectiveness of our framework.
[COMMENTS]
Accepted to NeurIPS 2024
[LINK]
http://arxiv.org/abs/2409.19433v2
[DATE]
2024-10-02 17:53:48+08:00
[CATEGORIES]
cs.LG
Scrutinize What We Ignore: Reining In Task Representation Shift Of Context-Based Offline Meta Reinforcement Learning
[AUTHORS]
Hai Zhang, Boyuan Zheng, Tianying Ji, Jinhang Liu, Anqi Guo, Junqiao Zhao, Lanqing Li
[ABSTRACT]
Offline meta reinforcement learning (OMRL) has emerged as a promising
approach for interaction avoidance and strong generalization performance by
leveraging pre-collected data and meta-learning techniques. Previous
context-based approaches predominantly rely on the intuition that alternating
optimization between the context encoder and the policy can lead to performance
improvements, as long as the context encoder follows the principle of
maximizing the mutual information between the task variable $M$ and its latent
representation $Z$ ($I(Z;M)$) while the policy adopts the standard offline
reinforcement learning (RL) algorithms conditioning on the learned task
representation. Despite promising results, the theoretical justification of
performance improvements for such intuition remains underexplored. Inspired by
the return discrepancy scheme in the model-based RL field, we find that the
previous optimization framework can be linked with the general RL objective of
maximizing the expected return, thereby explaining performance improvements.
Furthermore, after scrutinizing this optimization framework, we find it ignores
the variation of the task representation in the alternating optimization
process, which weakens the condition necessary for monotonic performance
improvements, and may therefore violate the monotonicity. We name this issue
\underline{task representation shift} and theoretically prove that the
monotonic performance improvements can be guaranteed with appropriate context
encoder updates. We use different settings to rein in the task representation
shift on three widely adopted training objectives concerning maximizing
$I(Z;M)$ across different data qualities. Empirical results show that reining in
the task representation shift can indeed improve performance.
[LINK]
http://arxiv.org/abs/2405.12001v3
[DATE]
2024-10-02 17:40:06+08:00
[CATEGORIES]
cs.LG
Towards Dynamic Graph Neural Networks with Provably High-Order Expressive Power
[AUTHORS]
Zhe Wang, Tianjian Zhao, Zhen Zhang, Jiawei Chen, Sheng Zhou, Yan Feng, Chun Chen, Can Wang
[ABSTRACT]
Dynamic Graph Neural Networks (DyGNNs) have garnered increasing research
attention for learning representations on evolving graphs. Despite their
effectiveness, the limited expressive power of existing DyGNNs hinders them
from capturing important evolving patterns of dynamic graphs. Although some
works attempt to enhance expressive capability with heuristic features, there
remains a lack of DyGNN frameworks with provable and quantifiable high-order
expressive power. To address this research gap, we first propose the
k-dimensional Dynamic WL tests (k-DWL) as the referencing algorithms to
quantify the expressive power of DyGNNs. We demonstrate that the expressive
power of existing DyGNNs is upper bounded by the 1-DWL test. To enhance the
expressive power, we propose Dynamic Graph Neural Network with High-order
expressive power (HopeDGN), which updates the representation of the central node
pair by aggregating the interaction history with neighboring node pairs. Our
theoretical results demonstrate that HopeDGN can achieve expressive power
equivalent to the 2-DWL test. We then present a Transformer-based
implementation for the local variant of HopeDGN. Experimental results show that
HopeDGN achieved performance improvements of up to 3.12%, demonstrating the
effectiveness of HopeDGN.
[LINK]
http://arxiv.org/abs/2410.01367v1
[DATE]
2024-10-02 17:28:59+08:00
[CATEGORIES]
cs.LG
One-shot Active Learning Based on Lewis Weight Sampling for Multiple Deep Models
[AUTHORS]
Sheng-Jun Huang, Yi Li, Yiming Sun, Ying-Peng Tang
[ABSTRACT]
Active learning (AL) for multiple target models aims to reduce labeled data
querying while effectively training multiple models concurrently. Existing AL
algorithms often rely on iterative model training, which can be computationally
expensive, particularly for deep models. In this paper, we propose a one-shot
AL method to address this challenge, which performs all label queries without
repeated model training.
Specifically, we extract different representations of the same dataset using
distinct network backbones, and actively learn the linear prediction layer on
each representation via an $\ell_p$-regression formulation. The regression
problems are solved approximately by sampling and reweighting the unlabeled
instances based on their maximum Lewis weights across the representations. An
upper bound on the number of samples needed is provided with a rigorous
analysis for $p\in [1, +\infty)$.
Experimental results on 11 benchmarks show that our one-shot approach
achieves competitive performances with the state-of-the-art AL methods for
multiple target models.
[COMMENTS]
The proof of Lemma 3.11 is fixed
[LINK]
http://arxiv.org/abs/2405.14121v2
[DATE]
2024-10-02 17:28:42+08:00
[CATEGORIES]
cs.LG
SEMF: Supervised Expectation-Maximization Framework for Predicting Intervals
[AUTHORS]
Ilia Azizi, Marc-Olivier Boldi, Valérie Chavez-Demoulin
[ABSTRACT]
This work introduces the Supervised Expectation-Maximization Framework
(SEMF), a versatile and model-agnostic approach for generating prediction
intervals in datasets with complete or missing data. SEMF extends the
Expectation-Maximization algorithm, traditionally used in unsupervised
learning, to a supervised context, leveraging latent variable modeling for
uncertainty estimation. Extensive empirical evaluations across 11 tabular
datasets show that SEMF often achieves narrower normalized prediction intervals
and higher coverage rates than traditional quantile regression methods.
Furthermore, SEMF can be integrated with machine learning models like
gradient-boosted trees and neural networks, highlighting its practical
applicability. The results indicate that SEMF enhances uncertainty
quantification, particularly in scenarios with complete data.
[LINK]
http://arxiv.org/abs/2405.18176v3
[DATE]
2024-10-02 17:25:10+08:00
[CATEGORIES]
cs.LG
FlashMask: Efficient and Rich Mask Extension of FlashAttention
[AUTHORS]
Guoxia Wang, Jinle Zeng, Xiyuan Xiao, Siming Wu, Jiabin Yang, Lujing Zheng, Zeyu Chen, Jiang Bian, Dianhai Yu, Haifeng Wang
[ABSTRACT]
The computational and memory demands of vanilla attention scale quadratically
with the sequence length $N$, posing significant challenges for processing long
sequences in Transformer models. FlashAttention alleviates these challenges by
eliminating the $O(N^2)$ memory dependency and reducing attention latency
through IO-aware memory optimizations. However, its native support for certain
attention mask types is limited, and it does not inherently accommodate more
complex masking requirements. Previous approaches resort to using dense masks
with $O(N^2)$ memory complexity, leading to inefficiencies. In this paper, we
propose FlashMask, an extension of FlashAttention that introduces a column-wise
sparse representation of attention masks. This approach efficiently represents
a wide range of mask types and facilitates the development of optimized kernel
implementations. By adopting this novel representation, FlashMask achieves
linear memory complexity $O(N)$, suitable for modeling long-context sequences.
Moreover, this representation enables kernel optimizations that eliminate
unnecessary computations by leveraging sparsity in the attention mask, without
sacrificing computational accuracy, resulting in higher computational
efficiency. We evaluate FlashMask’s performance in fine-tuning and alignment
training of LLMs such as SFT, LoRA, DPO, and RM. FlashMask achieves significant
throughput improvements, with end-to-end speedups ranging from 1.65x to 3.22x
compared to the existing FlashAttention dense method. Additionally, our
kernel-level comparisons demonstrate that FlashMask surpasses the latest
counterpart, FlexAttention, by 12.1% to 60.7% in terms of kernel TFLOPs/s,
achieving 37.8% to 62.3% of the theoretical maximum FLOPs/s on the A100 GPU.
The code is open-sourced on PaddlePaddle and integrated into PaddleNLP,
supporting models with over 100 billion parameters for contexts up to 128K
tokens.
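To make the memory argument concrete, the following is a minimal sketch of a column-wise mask encoding, under a simplifying assumption: for each key column, a single half-open range of query rows is masked out below the diagonal. The abstract does not spell out FlashMask's exact encoding, so the function names and the one-range-per-column layout here are hypothetical and purely illustrative. The point is only that two length-$N$ vectors can stand in for an $N \times N$ dense mask.

```python
from typing import List, Tuple


def document_mask_columns(doc_lengths: List[int]) -> Tuple[List[int], List[int]]:
    """Encode a causal, per-document attention mask column by column.

    Assumption (illustrative only): for key column j, query rows in the
    half-open range [starts[j], ends[j]) are masked out. Storage is O(N).
    """
    n = sum(doc_lengths)
    starts: List[int] = []
    ends: List[int] = []
    offset = 0
    for length in doc_lengths:
        doc_end = offset + length
        for _ in range(length):
            # Queries belonging to later documents must not attend to this key.
            starts.append(doc_end)
            ends.append(n)
        offset = doc_end
    return starts, ends


def is_visible(i: int, j: int, starts: List[int], ends: List[int]) -> bool:
    """Return True if query row i may attend to key column j."""
    if j > i:  # causal constraint: no attention to future positions
        return False
    return not (starts[j] <= i < ends[j])


def to_dense(starts: List[int], ends: List[int]) -> List[List[bool]]:
    """Expand the O(N) encoding into an O(N^2) dense mask, for checking only."""
    n = len(starts)
    return [[is_visible(i, j, starts, ends) for j in range(n)] for i in range(n)]


# Two packed documents of lengths 2 and 3 (sequence length N = 5).
starts, ends = document_mask_columns([2, 3])
dense = to_dense(starts, ends)
```

The dense expansion exists only to check the encoding; a kernel following this idea would consult the per-column ranges directly and could skip blocks that are fully masked.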
[LINK]
http://arxiv.org/abs/2410.01359v1
[DATE]
2024-10-02 17:17:26+08:00
[CATEGORIES]
cs.LG
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
[AUTHORS]
Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, Ming Jin
[ABSTRACT]
Deep learning for time series forecasting has seen significant advancements
over the past decades. However, despite the success of large-scale pre-training
in language and vision domains, pre-trained time series models remain limited
in scale and operate at a high cost, hindering the development of larger, more
capable forecasting models in real-world applications. In response, we
introduce Time-MoE, a scalable and unified architecture designed to pre-train
larger, more capable forecasting foundation models while reducing inference
costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE
enhances computational efficiency by activating only a subset of networks for
each prediction, reducing computational load while maintaining high model
capacity. This allows Time-MoE to scale effectively without a corresponding
increase in inference costs. Time-MoE comprises a family of decoder-only
transformer models that operate in an auto-regressive manner and support
flexible forecasting horizons with varying input context lengths. We
pre-trained these models on our newly introduced large-scale dataset Time-300B,
which spans over 9 domains and encompasses over 300 billion time points. For
the first time, we scaled a time series foundation model up to 2.4 billion
parameters, achieving significantly improved forecasting precision. Our results
validate the applicability of scaling laws for training tokens and model size
in the context of time series forecasting. Compared to dense models with the
same number of activated parameters or equivalent computation budgets, our
models consistently outperform them by a large margin. These advancements
position Time-MoE as a state-of-the-art solution for tackling real-world time
series forecasting challenges with superior capability, efficiency, and
flexibility.
[COMMENTS]
30 pages, 10 figures, 13 tables
[LINK]
http://arxiv.org/abs/2409.16040v2
[DATE]
2024-10-02 17:08:21+08:00
[CATEGORIES]
cs.LG
Mini-batch Submodular Maximization
[AUTHORS]
Gregory Schwartzman
[ABSTRACT]
We present the first mini-batch algorithm for maximizing a non-negative
monotone decomposable submodular function, $F=\sum_{i=1}^N f^i$, under a set of
constraints. We consider two sampling approaches: uniform and weighted. We
first show that mini-batch with weighted sampling improves over the state of
the art sparsifier-based approach both in theory and in practice.
Surprisingly, our experimental results show that uniform sampling is superior
to weighted sampling. However, it is impossible to explain this using
worst-case analysis. Our main contribution is using smoothed analysis to
provide a theoretical foundation for our experimental results. We show that,
under very mild assumptions, uniform sampling is superior for both the
mini-batch and the sparsifier approaches. We empirically verify that these
assumptions hold for our datasets. Uniform sampling is simple to implement and
has complexity independent of $N$, making it the perfect candidate to tackle
massive real-world datasets.
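As a rough illustration of the uniform-sampling idea, the sketch below runs greedy selection under a cardinality constraint, estimating marginal gains from a uniformly sampled mini-batch of the component functions $f^i$ at each step. It is an assumption-laden toy rather than the paper's algorithm: the coverage-style components, the per-step resampling, and the name `minibatch_greedy` are all hypothetical. Rescaling the batch sum by $N/\text{batch size}$ would not change which element attains the maximum, so it is omitted.

```python
import random
from typing import List, Set


def minibatch_greedy(covers: List[Set[int]], ground: List[int], k: int,
                     batch_size: int, seed: int = 0) -> List[int]:
    """Greedy maximization of F(S) = sum_i f^i(S) using sampled components.

    Component i is a coverage function: f^i(S) = 1 if S intersects covers[i],
    else 0. Each f^i is monotone and submodular, so F is as well.
    """
    rng = random.Random(seed)
    chosen: List[int] = []
    for _ in range(k):
        # Uniformly sample component indices without replacement.
        batch = rng.sample(range(len(covers)), batch_size)
        picked = set(chosen)
        # Components already satisfied contribute no marginal gain.
        active = [covers[i] for i in batch if not (covers[i] & picked)]
        candidates = [e for e in ground if e not in picked]
        if not candidates:
            break
        # Estimated marginal gain of e = number of active components it covers.
        best = max(candidates, key=lambda e: sum(1 for c in active if e in c))
        chosen.append(best)
    return chosen


# Five components over a ground set of three elements.
covers = [{0}, {0}, {0}, {1}, {2}]
full = minibatch_greedy(covers, ground=[0, 1, 2], k=2, batch_size=5)
sampled = minibatch_greedy(covers, ground=[0, 1, 2], k=2, batch_size=2)
```

With `batch_size` equal to the number of components, the sketch reduces to ordinary greedy; smaller batches trade estimation noise for a per-step cost that no longer depends on $N$.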
[LINK]
http://arxiv.org/abs/2401.12478v2
[DATE]
2024-10-02 17:02:19+08:00
[CATEGORIES]
cs.LG
Response Estimation and System Identification of Dynamical Systems via Physics-Informed Neural Networks
[AUTHORS]
Marcus Haywood-Alexander, Giacamo Arcieri, Antonios Kamariotis, Eleni Chatzi
[ABSTRACT]
The accurate modelling of structural dynamics is crucial across numerous
engineering applications, such as Structural Health Monitoring (SHM), seismic
analysis, and vibration control. Often, these models originate from
physics-based principles and can be derived from corresponding governing
equations, often of differential equation form. However, complex system
characteristics, such as nonlinearities and energy dissipation mechanisms,
often imply that such models are approximative and often imprecise. This
challenge is further compounded in SHM, where sensor data is often sparse,
making it difficult to fully observe the system’s states. To address these
issues, this paper explores the use of Physics-Informed Neural Networks
(PINNs), a class of physics-enhanced machine learning (PEML) techniques, for
the identification and estimation of dynamical systems. PINNs offer a unique
advantage by embedding known physical laws directly into the neural network’s
loss function, allowing for simple embedding of complex phenomena, even in the
presence of uncertainties. This study specifically investigates three key
applications of PINNs: state estimation in systems with sparse sensing, joint
state-parameter estimation, when both system response and parameters are
unknown, and parameter estimation within a Bayesian framework to quantify
uncertainties. The results demonstrate that PINNs deliver an efficient tool
across all aforementioned tasks, even in the presence of modelling errors. However,
these errors tend to have a more significant impact on parameter estimation, as
the optimization process must reconcile discrepancies between the prescribed
model and the true system behavior. Despite these challenges, PINNs show
promise in dynamical system modeling, offering a robust approach to handling
uncertainties.
[LINK]
http://arxiv.org/abs/2410.01340v1
[DATE]
2024-10-02 16:58:30+08:00
[CATEGORIES]
cs.LG
PhyMPGN: Physics-encoded Message Passing Graph Network for spatiotemporal PDE systems
[AUTHORS]
Bocheng Zeng, Qi Wang, Mengtao Yan, Yang Liu, Ruizhi Chengze, Yi Zhang, Hongsheng Liu, Zidong Wang, Hao Sun
[ABSTRACT]
Solving partial differential equations (PDEs) serves as a cornerstone for
modeling complex dynamical systems. Recent progress has demonstrated great
benefits of data-driven neural-based models for predicting spatiotemporal
dynamics (e.g., tremendous speedup gain compared with classical numerical
methods). However, most existing neural models rely on rich training data, have
limited extrapolation and generalization abilities, and struggle to produce
precise or reliable physical prediction under intricate conditions (e.g.,
irregular mesh or geometry, complex boundary conditions, diverse PDE
parameters, etc.). To this end, we propose a new graph learning approach,
namely, Physics-encoded Message Passing Graph Network (PhyMPGN), to model
spatiotemporal PDE systems on irregular meshes given small training datasets.
Specifically, we incorporate a GNN into a numerical integrator to approximate
the temporal marching of spatiotemporal dynamics for a given PDE system.
Considering that many physical phenomena are governed by diffusion processes,
we further design a learnable Laplace block, which encodes the discrete
Laplace-Beltrami operator, to aid and guide the GNN learning in a physically
feasible solution space. A boundary condition padding strategy is also designed
to improve the model convergence and accuracy. Extensive experiments
demonstrate that PhyMPGN is capable of accurately predicting various types of
spatiotemporal dynamics on coarse unstructured meshes, consistently achieves
the state-of-the-art results, and outperforms other baselines with considerable
gains.
[LINK]
http://arxiv.org/abs/2410.01337v1
[DATE]
2024-10-02 16:54:18+08:00
[CATEGORIES]
cs.LG
Transferability Bound Theory: Exploring Relationship between Adversarial Transferability and Flatness
[AUTHORS]
Mingyuan Fan, Xiaodan Li, Cen Chen, Wenmeng Zhou, Yaliang Li
[ABSTRACT]
A prevailing belief in the attack and defense community is that the higher
flatness of adversarial examples enables their better cross-model
transferability, leading to a growing interest in employing sharpness-aware
minimization and its variants. However, the theoretical relationship between
the transferability of adversarial examples and their flatness has not been
well established, making the belief questionable. To bridge this gap, we embark
on a theoretical investigation and, for the first time, derive a theoretical
bound for the transferability of adversarial examples with few practical
assumptions. Our analysis challenges this belief by demonstrating that the
increased flatness of adversarial examples does not necessarily guarantee
improved transferability. Moreover, building upon the theoretical analysis, we
propose TPA, a Theoretically Provable Attack that optimizes a surrogate of the
derived bound to craft adversarial examples. Extensive experiments across
widely used benchmark datasets and various real-world applications show that
TPA can craft more transferable adversarial examples compared to
state-of-the-art baselines. We hope that these results can recalibrate
preconceived impressions within the community and facilitate the development of
stronger adversarial attack and defense mechanisms. The source code is
available at https://github.com/fmy266/TPA.
[COMMENTS]
Accepted by NIPS 2024
[LINK]
http://arxiv.org/abs/2311.06423v2
[DATE]
2024-10-02 16:50:05+08:00
[CATEGORIES]
cs.LG
Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting
[AUTHORS]
Alessio Russo, Alberto Maria Metelli, Marcello Restelli
[ABSTRACT]
Dealing with Partially Observable Markov Decision Processes is notably a
challenging task. We face an average-reward infinite-horizon POMDP setting with
an unknown transition model, where we assume the knowledge of the observation
model. Under this assumption, we propose the Observation-Aware Spectral (OAS)
estimation technique, which enables the POMDP parameters to be learned from
samples collected using a belief-based policy. Then, we propose the OAS-UCRL
algorithm that implicitly balances the exploration-exploitation trade-off
following the $\textit{optimism in the face of uncertainty}$ principle. The
algorithm runs through episodes of increasing length. For each episode, the
optimal belief-based policy of the estimated POMDP interacts with the
environment and collects samples that will be used in the next episode by the
OAS estimation procedure to compute a new estimate of the POMDP parameters.
Given the estimated model, an optimization oracle computes the new optimal
policy. We show the consistency of the OAS procedure, and we prove a regret
guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL
algorithm. We compare against the oracle playing the optimal stochastic
belief-based policy and show the efficient scaling of our approach with respect
to the dimensionality of the state, action, and observation space. We finally
conduct numerical simulations to validate and compare the proposed technique
with other baseline approaches.
[LINK]
http://arxiv.org/abs/2410.01331v1
[DATE]
2024-10-02 16:46:34+08:00
[CATEGORIES]
cs.LG
Fair Class-Incremental Learning using Sample Weighting
[AUTHORS]
Jaeyoung Park, Minsu Kim, Steven Euijong Whang
[ABSTRACT]
Model fairness is becoming important in class-incremental learning for
Trustworthy AI. While accuracy has been a central focus in class-incremental
learning, fairness has been relatively understudied. However, naively using all
the samples of the current task for training results in unfair catastrophic
forgetting for certain sensitive groups including classes. We theoretically
analyze that forgetting occurs if the average gradient vector of the current
task data is in an “opposite direction” compared to the average gradient vector
of a sensitive group, which means their inner products are negative. We then
propose a fair class-incremental learning framework that adjusts the training
weights of current task samples to change the direction of the average gradient
vector and thus reduce the forgetting of underperforming groups and achieve
fairness. For various group fairness measures, we formulate optimization
problems to minimize the overall losses of sensitive groups while minimizing
the disparities among them. We also show the problems can be solved with linear
programming and propose an efficient Fairness-aware Sample Weighting (FSW)
algorithm. Experiments show that FSW achieves better accuracy-fairness tradeoff
results than state-of-the-art approaches on real datasets.
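The gradient condition can be illustrated with a toy calculation. To first order, a step $\theta \leftarrow \theta - \eta\, g_{\text{cur}}$ changes a group's loss by about $-\eta \langle g_{\text{grp}}, g_{\text{cur}} \rangle$, so a negative inner product means the group's loss goes up. The sketch below uses made-up two-dimensional gradients and hand-picked sample weights; it is not the FSW algorithm, which the abstract says is obtained by linear programming, and the helper names are hypothetical.

```python
from typing import List, Optional


def avg_gradient(grads: List[List[float]],
                 weights: Optional[List[float]] = None) -> List[float]:
    """Weighted average of per-sample gradient vectors (uniform if no weights)."""
    n = len(grads)
    if weights is None:
        weights = [1.0 / n] * n
    total = sum(weights)
    dim = len(grads[0])
    return [sum(w * g[d] for w, g in zip(weights, grads)) / total
            for d in range(dim)]


def dot(u: List[float], v: List[float]) -> float:
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


# Made-up gradients: two current-task samples and one sensitive group.
current_grads = [[1.0, 0.0], [-3.0, 0.0]]
group_grad = [1.0, 0.0]

# Uniform weighting: the average gradient opposes the group's gradient.
uniform_alignment = dot(avg_gradient(current_grads), group_grad)

# Hand-picked weights that drop the conflicting sample flip the sign.
reweighted_alignment = dot(avg_gradient(current_grads, [1.0, 0.0]), group_grad)
```

Here `uniform_alignment` is negative and `reweighted_alignment` is positive, which is the directional change that sample weighting is meant to achieve.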
[LINK]
http://arxiv.org/abs/2410.01324v1
[DATE]
2024-10-02 16:32:21+08:00
[CATEGORIES]
cs.LG
Forte: Finding Outliers with Representation Typicality Estimation
[AUTHORS]
Debargha Ganguly, Warren Morningstar, Andrew Yu, Vipin Chaudhary
[ABSTRACT]
Generative models can now produce photorealistic synthetic data which is
virtually indistinguishable from the real data used to train it. This is a
significant evolution over previous models which could produce reasonable
facsimiles of the training data, but ones which could be visually distinguished
from the training data by human evaluation. Recent work on OOD detection has
raised doubts that generative model likelihoods are optimal OOD detectors due
to issues involving likelihood misestimation, entropy in the generative
process, and typicality. We speculate that generative OOD detectors also failed
because their models focused on the pixels rather than the semantic content of
the data, leading to failures in near-OOD cases where the pixels may be similar
but the information content is significantly different. We hypothesize that
estimating typical sets using self-supervised learners leads to better OOD
detectors. We introduce a novel approach that leverages representation
learning, and informative summary statistics based on manifold estimation, to
address all of the aforementioned issues. Our method outperforms other
unsupervised approaches and achieves state-of-the-art performance on
well-established challenging benchmarks, and new synthetic data detection
tasks.
[LINK]
http://arxiv.org/abs/2410.01322v1
[DATE]
2024-10-02 16:26:37+08:00
[CATEGORIES]
cs.LG
Fine-Tuning is Fine, if Calibrated
[AUTHORS]
Zheda Mai, Arpita Chowdhury, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Vardaan Pahuja, Tanya Berger-Wolf, Song Gao, Charles Stewart, Yu Su, Wei-Lun Chao
[COMMENTS]
The first three authors contribute equally. The paper has been
accepted to NeurIPS 2024
[LINK]
http://arxiv.org/abs/2409.16223v2
[DATE]
2024-10-02 16:23:07+08:00
[CATEGORIES]
cs.LG
Fast Summation of Radial Kernels via QMC Slicing
[AUTHORS]
Johannes Hertrich, Tim Jahn, Michael Quellmalz
[ABSTRACT]
The fast computation of large kernel sums is a challenging task, which arises
as a subproblem in any kernel method. We approach the problem by slicing, which
relies on random projections to one-dimensional subspaces and fast Fourier
summation. We prove bounds for the slicing error and propose a quasi-Monte
Carlo (QMC) approach for selecting the projections based on spherical
quadrature rules. Numerical examples demonstrate that our QMC-slicing approach
significantly outperforms existing methods like (QMC-)random Fourier features,
orthogonal Fourier features or non-QMC slicing on standard test datasets.
[LINK]
http://arxiv.org/abs/2410.01316v1
[DATE]
2024-10-02 16:12:29+08:00
[CATEGORIES]
cs.LG
Sampling from Energy-based Policies using Diffusion
[AUTHORS]
Vineet Jain, Tara Akhound-Sadegh, Siamak Ravanbakhsh
[ABSTRACT]
Energy-based policies offer a flexible framework for modeling complex,
multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the
optimal policy is a Boltzmann distribution derived from the soft Q-function,
but direct sampling from this distribution in continuous action spaces is
computationally intractable. As a result, existing methods typically use
simpler parametric distributions, like Gaussians, for policy representation,
limiting their ability to capture the full complexity of multimodal action
distributions. In this paper, we introduce a diffusion-based approach for
sampling from energy-based policies, where the negative Q-function defines the
energy function. Based on this approach, we propose an actor-critic method
called Diffusion Q-Sampling (DQS) that enables more expressive policy
representations, allowing stable learning in diverse environments. We show that
our approach enhances exploration and captures multimodal behavior in
continuous control tasks, addressing key limitations of existing methods.
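As a minimal illustration of the Boltzmann policy described above, the sketch below works on a small discrete action set, where the normalizing constant is a plain sum. This is not the DQS method: the paper targets continuous action spaces, where this normalization is intractable. The function name, the temperature `alpha`, and the Q-values are illustrative assumptions.

```python
import math
import random


def boltzmann_policy(q_values, alpha=1.0):
    """Return pi(a) proportional to exp(Q(a) / alpha) over a discrete action set."""
    scaled = [q / alpha for q in q_values]
    shift = max(scaled)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical soft Q-values for four actions in a single state.
q_values = [1.0, 2.0, 2.0, -1.0]
probs = boltzmann_policy(q_values, alpha=0.5)

# The two equally good actions get equal mass, so the policy stays bimodal.
rng = random.Random(0)
action = rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

A lower `alpha` concentrates probability on the highest-valued actions, while a higher `alpha` moves the policy toward uniform.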
[LINK]
http://arxiv.org/abs/2410.01312v1
[DATE]
2024-10-02 16:09:33+08:00
[CATEGORIES]
cs.LG
Getting Free Bits Back from Rotational Symmetries in LLMs
[AUTHORS]
Jiajun He, Gergely Flamich, José Miguel Hernández-Lobato
[ABSTRACT]
Current methods for compressing neural network weights, such as
decomposition, pruning, quantization, and channel simulation, often overlook
the inherent symmetries within these networks and thus waste bits on encoding
redundant information. In this paper, we propose a format based on bits-back
coding for storing rotationally symmetric Transformer weights more efficiently
than the usual array layout at the same floating-point precision. We evaluate
our method on Large Language Models (LLMs) pruned by SliceGPT (Ashkboos et al.,
2024) and achieve a 3-5% reduction in total bit usage for free across different
model sizes and architectures without impacting model performance within a
certain numerical precision.
[COMMENTS]
14 pages, 3 figures
[LINK]
http://arxiv.org/abs/2410.01309v1
[DATE]
2024-10-02 16:03:47+08:00
[CATEGORIES]
cs.LG
Rethinking the Expressiveness of GNNs: A Computational Model Perspective
[AUTHORS]
Guanyu Cui, Zhewei Wei, Hsin-Hao Su
[ABSTRACT]
Graph Neural Networks (GNNs) are extensively employed in graph machine
learning, with considerable research focusing on their expressiveness. Current
studies often assess GNN expressiveness by comparing them to the
Weisfeiler-Lehman (WL) tests or classical graph algorithms. However, we
identify three key issues in existing analyses: (1) some studies use
preprocessing to enhance expressiveness but overlook its computational costs;
(2) some claim the anonymous WL test’s limited power while enhancing
expressiveness using non-anonymous features, creating a mismatch; and (3) some
characterize message-passing GNNs (MPGNNs) with the CONGEST model but make
unrealistic assumptions about computational resources, allowing
$\textsf{NP-Complete}$ problems to be solved in $O(m)$ depth. We contend that a
well-defined computational model is urgently needed to serve as the foundation
for discussions on GNN expressiveness. To address these issues, we introduce
the Resource-Limited CONGEST (RL-CONGEST) model, incorporating optional
preprocessing and postprocessing to form a framework for analyzing GNN
expressiveness. Our framework sheds light on computational aspects, including
the computational hardness of hash functions in the WL test and the role of
virtual nodes in reducing network capacity. Additionally, we suggest that
high-order GNNs correspond to first-order model-checking problems, offering new
insights into their expressiveness.
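For readers unfamiliar with the WL test referenced above, the following is a minimal sketch of 1-WL colour refinement in the anonymous setting (all nodes start with the same colour). It is a textbook illustration, not the RL-CONGEST model from the paper. The function names and example graphs are assumptions, and the relabelling dictionary stands in for the injective hash function whose cost the abstract discusses.

```python
from collections import Counter


def wl_colors(adjacency, rounds):
    """Run 1-WL colour refinement on a graph given as {node: [neighbours]}."""
    colors = {v: 0 for v in adjacency}  # anonymous start: identical colours
    for _ in range(rounds):
        # A node's signature is its own colour plus the multiset of neighbour colours.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adjacency[v])))
            for v in adjacency
        }
        # Map each distinct signature to a small integer (the "hash" step).
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adjacency}
    return colors


def wl_distinguishes(g1, g2, rounds=3):
    """Refine both graphs jointly so that colours are comparable across them."""
    union = {("a", v): [("a", u) for u in nbrs] for v, nbrs in g1.items()}
    union.update({("b", v): [("b", u) for u in nbrs] for v, nbrs in g2.items()})
    colors = wl_colors(union, rounds)
    hist_a = Counter(c for (tag, _), c in colors.items() if tag == "a")
    hist_b = Counter(c for (tag, _), c in colors.items() if tag == "b")
    return hist_a != hist_b


hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star4 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

The hexagon and the pair of triangles are both 2-regular, so anonymous 1-WL cannot separate them, whereas the path and the star differ already after one round.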
[LINK]
http://arxiv.org/abs/2410.01308v1
[DATE]
2024-10-02 16:01:50+08:00
[CATEGORIES]
cs.LG
Towards a Law of Iterated Expectations for Heuristic Estimators
[AUTHORS]
Paul Christiano, Jacob Hilton, Andrea Lincoln, Eric Neyman, Mark Xu
[ABSTRACT]
Christiano et al. (2022) define a heuristic estimator to be a hypothetical
algorithm that estimates the values of mathematical expressions from arguments.
In brief, a heuristic estimator $\mathbb{G}$ takes as input a mathematical
expression $Y$ and a formal “heuristic argument” $\pi$, and outputs an estimate
$\mathbb{G}(Y \mid \pi)$ of $Y$. In this work, we argue for the informal
principle that a heuristic estimator ought not to be able to predict its own
errors, and we explore approaches to formalizing this principle. Most simply,
the principle suggests that $\mathbb{G}(Y - \mathbb{G}(Y \mid \pi) \mid \pi)$
ought to equal zero for all $Y$ and $\pi$. We argue that an ideal heuristic
estimator ought to satisfy two stronger properties in this vein, which we term
iterated estimation (by analogy to the law of iterated expectations) and
error orthogonality.
Although iterated estimation and error orthogonality are intuitively
appealing, it can be difficult to determine whether a given heuristic estimator
satisfies the properties. As an alternative approach, we explore accuracy: a
property that (roughly) states that $\mathbb{G}$ has zero average error over a
distribution of mathematical expressions. However, in the context of two
estimation problems, we demonstrate barriers to creating an accurate heuristic
estimator. We finish by discussing challenges and potential paths forward for
finding a heuristic estimator that accords with our intuitive understanding of
how such an estimator ought to behave, as well as the potential applications of
heuristic estimators to understanding the behavior of neural networks.
[COMMENTS]
47 pages, 2 tables, 1 figure
[LINK]
http://arxiv.org/abs/2410.01290v1
[DATE]
2024-10-02 15:33:27+08:00
[CATEGORIES]
cs.LG
Approximate Nearest Neighbour Search on Dynamic Datasets: An Investigation
[AUTHORS]
Ben Harwood, Amir Dezfouli, Iadine Chades, Conrad Sanderson
[ABSTRACT]
Approximate k-Nearest Neighbour (ANN) methods are often used for mining
information and aiding machine learning on large scale high-dimensional
datasets. ANN methods typically differ in the index structure used for
accelerating searches, resulting in various recall/runtime trade-off points.
For applications with static datasets, runtime constraints and dataset
properties can be used to empirically select an ANN method with suitable
operating characteristics. However, for applications with dynamic datasets,
which are subject to frequent online changes (like addition of new samples),
there is currently no consensus as to which ANN methods are most suitable.
Traditional evaluation approaches do not consider the computational costs of
updating the index structure, as well as the rate and size of index updates. To
address this, we empirically evaluate 5 popular ANN methods on two main
applications (online data collection and online feature learning) while taking
into account these considerations. Two dynamic datasets are used, derived from
the SIFT1M dataset with 1 million samples and the DEEP1B dataset with 1 billion
samples. The results indicate that the often used k-d trees method is not
suitable on dynamic datasets as it is slower than a straightforward baseline
exhaustive search method. For online data collection, the Hierarchical
Navigable Small World Graphs method achieves a consistent speedup over baseline
across a wide range of recall rates. For online feature learning, the Scalable
Nearest Neighbours method is faster than baseline for recall rates below 75%.
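To make the baseline and the recall metric concrete, here is a minimal sketch of exhaustive k-nearest-neighbour search and recall@k on synthetic data. It does not reproduce the paper's benchmark: the "approximate" method below is a hypothetical stand-in that searches only a random half of the points, and all names and sizes are illustrative assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exhaustive_knn(points, query, k):
    """Baseline search: scan every point, so there is no index to maintain."""
    order = sorted(range(len(points)), key=lambda i: squared_distance(points[i], query))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true k nearest neighbours that the approximate result found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


rng = random.Random(0)
points = [[rng.random() for _ in range(8)] for _ in range(200)]
query = [rng.random() for _ in range(8)]

exact = exhaustive_knn(points, query, k=10)

# Hypothetical approximate method: search a random half of the dataset only.
subset = rng.sample(range(len(points)), 100)
local = exhaustive_knn([points[i] for i in subset], query, k=10)
approx = [subset[i] for i in local]
recall = recall_at_k(approx, exact)

# Online addition of a sample: the baseline appends and needs no index rebuild.
points.append(list(query))
exact_after = exhaustive_knn(points, query, k=10)
```

For an indexed method, the append step would also incur an index update, which is the cost the abstract argues traditional evaluations leave out.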
[LINK]
http://arxiv.org/abs/2404.19284v4
[DATE]
2024-10-02 15:30:02+08:00
[CATEGORIES]
cs.LG
A Generative Approach to Control Complex Physical Systems
[AUTHORS]
Long Wei, Peiyan Hu, Ruiqi Feng, Haodong Feng, Yixuan Du, Tao Zhang, Rui Wang, Yue Wang, Zhi-Ming Ma, Tailin Wu
[ABSTRACT]
Controlling the evolution of complex physical systems is a fundamental task
across science and engineering. Classical techniques suffer from limited
applicability or huge computational costs. On the other hand, recent deep
learning and reinforcement learning-based approaches often struggle to optimize
long-term control sequences under the constraints of system dynamics. In this
work, we introduce Diffusion Physical systems Control (DiffPhyCon), a new class
of methods to address the physical systems control problem. DiffPhyCon excels by
simultaneously minimizing both the learned generative energy function and the
predefined control objectives across the entire trajectory and control
sequence. Thus, it can explore globally and plan near-optimal control
sequences. Moreover, we enhance DiffPhyCon with prior reweighting, enabling the
discovery of control sequences that significantly deviate from the training
distribution. We test our method on three tasks: 1D Burgers’ equation, 2D
jellyfish movement control, and 2D high-dimensional smoke control, where our
generated jellyfish dataset is released as a benchmark for complex physical
system control research. Our method outperforms widely applied classical
approaches and state-of-the-art deep learning and reinforcement learning
methods. Notably, DiffPhyCon unveils an intriguing fast-close-slow-open pattern
observed in the jellyfish, aligning with established findings in the field of
fluid dynamics. The project website, jellyfish dataset, and code can be found
at https://github.com/AI4Science-WestlakeU/diffphycon.
[COMMENTS]
NeurIPS 2024 poster. 51 pages, 19 figures
[LINK]
http://arxiv.org/abs/2407.06494v3
[DATE]
2024-10-02 15:26:08+08:00
[CATEGORIES]
cs.LG
Deep Kernel Posterior Learning under Infinite Variance Prior Weights
[AUTHORS]
Jorge Loría, Anindya Bhadra
[ABSTRACT]
Neal (1996) proved that infinitely wide shallow Bayesian neural networks
(BNN) converge to Gaussian processes (GP), when the network weights have
bounded prior variance. Cho & Saul (2009) provided a useful recursive formula
for deep kernel processes for relating the covariance kernel of each layer to
the layer immediately below. Moreover, they worked out the form of the
layer-wise covariance kernel in an explicit manner for several common
activation functions. Recent works, including Aitchison et al. (2021), have
highlighted that the covariance kernels obtained in this manner are
deterministic and hence preclude any possibility of representation learning,
which amounts to learning a non-degenerate posterior of a random kernel given
the data. To address this, they propose adding artificial noise to the kernel
to retain stochasticity, and develop deep kernel inverse Wishart processes.
Nonetheless, this artificial noise injection could be critiqued in that it
would not naturally emerge in a classic BNN architecture under an
infinite-width limit. To address this, we show that a Bayesian deep neural
network, where each layer width approaches infinity, and all network weights
are elliptically distributed with infinite variance, converges to a process
with $\alpha$-stable marginals in each layer that has a conditionally Gaussian
representation. These conditional random covariance kernels could be
recursively linked in the manner of Cho & Saul (2009), even though marginally
the process exhibits stable behavior, and hence covariances are not even
necessarily defined. We also provide useful generalizations of the recent
results of Loría & Bhadra (2024) on shallow networks to multi-layer networks,
and remedy the computational burden of their approach. The computational and
statistical benefits over competing approaches stand out in simulations and in
demonstrations on benchmark data sets.
[COMMENTS]
21 pages, 11 figures
[LINK]
http://arxiv.org/abs/2410.01284v1
[DATE]
2024-10-02 15:13:17+08:00
[CATEGORIES]
cs.LG
Uncertainty-aware Human Mobility Modeling and Anomaly Detection
[AUTHORS]
Haomin Wen, Shurui Cao, Leman Akoglu
[ABSTRACT]
Given the GPS coordinates of a large collection of human agents over time,
how can we model their mobility behavior toward effective anomaly detection
(e.g. for bad-actor or malicious behavior detection) without any labeled data?
Human mobility and trajectory modeling have been studied extensively with
varying capacity to handle complex input, and performance-efficiency
trade-offs. With the arrival of more expressive models in machine learning, we
attempt to model GPS data as a sequence of stay-point events, each with a set
of characterizing spatiotemporal features, and leverage modern sequence models
such as Transformers for un/self-supervised training and inference. Notably,
driven by the inherent stochasticity of certain individuals’ behavior, we equip
our model with aleatoric/data uncertainty estimation. In addition, to handle
data sparsity of a large variety of behaviors, we incorporate epistemic/model
uncertainty into our model. Together, aleatoric and epistemic uncertainty
enable a robust loss and training dynamics, as well as uncertainty-aware
decision making in anomaly scoring. Experiments on large expert-simulated
datasets with tens of thousands of agents demonstrate the effectiveness of our
model against both forecasting and anomaly detection baselines.
[LINK]
http://arxiv.org/abs/2410.01281v1
[DATE]
2024-10-02 14:57:08+08:00
[CATEGORIES]
cs.LG
Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models
[AUTHORS]
Can Demircan, Tankred Saanum, Akshay K. Jagadish, Marcel Binz, Eric Schulz
[ABSTRACT]
In-context learning, the ability to adapt based on a few examples in the
input prompt, is a ubiquitous feature of large language models (LLMs). However,
as LLMs’ in-context learning abilities continue to improve, understanding this
phenomenon mechanistically becomes increasingly important. In particular, it is
not well-understood how LLMs learn to solve specific classes of problems, such
as reinforcement learning (RL) problems, in-context. Through three different
tasks, we first show that Llama 3 70B can solve simple RL problems
in-context. We then analyze the residual stream of Llama using Sparse
Autoencoders (SAEs) and find representations that closely match temporal
difference (TD) errors. Notably, these representations emerge despite the model
only being trained to predict the next token. We verify that these
representations are indeed causally involved in the computation of TD errors
and $Q$-values by performing carefully designed interventions on them. Taken
together, our work establishes a methodology for studying and manipulating
in-context learning with SAEs, paving the way for a more mechanistic
understanding.
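For reference, the quantities the abstract refers to can be sketched with a tabular Q-learning update. This is a minimal sketch under assumptions: the abstract does not state which TD variant is matched, so the Q-learning form, the two-state table, the learning rate, and the discount factor below are all illustrative choices.

```python
def td_error(q, state, action, reward, next_state, gamma=0.9, terminal=False):
    """Q-learning TD error: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    bootstrap = 0.0 if terminal else gamma * max(q[next_state])
    return reward + bootstrap - q[state][action]


def q_learning_step(q, state, action, reward, next_state,
                    lr=0.5, gamma=0.9, terminal=False):
    """Move Q(s, a) a fraction lr of the way along the TD error, in place."""
    delta = td_error(q, state, action, reward, next_state, gamma, terminal)
    q[state][action] += lr * delta
    return delta


# Hypothetical table with two states and two actions, initialised to zero.
q = [[0.0, 0.0], [0.0, 0.0]]

# The same rewarded, terminal transition observed twice in a row.
delta_first = q_learning_step(q, 0, 1, 1.0, 1, terminal=True)
delta_second = q_learning_step(q, 0, 1, 1.0, 1, terminal=True)
```

The TD error shrinks on the second visit because the Q-value has already moved toward the observed reward, which is the kind of signal a TD-like internal representation would be expected to track.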
[LINK]
http://arxiv.org/abs/2410.01280v1
[DATE]
2024-10-02 14:51:12+08:00
[CATEGORIES]
cs.LG
A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking
[AUTHORS]
Jun Zhang, Wenxuan Ao, Junbo Yan, Depeng Jin, Yong Li
[ABSTRACT]
With the development of artificial intelligence techniques, transportation
system optimization is evolving from traditional methods relying on expert
experience to simulation and learning-based decision and optimization methods.
Learning-based optimization methods require extensive interactions with highly
realistic microscopic traffic simulators. However, existing microscopic traffic
simulators are inefficient in large-scale scenarios and thus fail to support
the adoption of these methods in large-scale transportation system optimization
scenarios. In addition, the optimization scenarios supported by existing
simulators are limited, mainly focusing on traffic signal control. To
address these challenges, we propose the first open-source GPU-accelerated
large-scale microscopic simulator for transportation system simulation and
optimization. The simulator can iterate at 84.09Hz, which achieves 88.92 times
computational acceleration in the large-scale scenario with 2,464,950 vehicles
compared to the best baseline CityFlow. It also simulates more realistic
average road speeds on real datasets by adopting the IDM model as the
car-following model and the randomized MOBIL model as the lane-changing model.
On top of the simulator, we implement a set of microscopic and macroscopic
controllable objects and metrics, exposed through a Python API, to support typical transportation
system optimization scenarios. We choose five representative scenarios and
benchmark classical rule-based algorithms, reinforcement learning algorithms,
and black-box optimization algorithms in four cities. These experiments
effectively demonstrate the usability of the simulator for large-scale traffic
system optimization. The code of the simulator is available at
https://github.com/tsinghua-fib-lab/moss. We build an open-registration web
platform available at https://moss.fiblab.net to support no-code trials.
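The IDM car-following model named above has a standard closed form, sketched below in its common textbook version. This is not code from the simulator: the function name and all parameter values are illustrative assumptions, and clamping the interaction term at zero is a common safeguard rather than something the abstract specifies.

```python
import math


def idm_acceleration(v, v_lead, gap, v0=30.0, T=1.5, a_max=1.0, b=2.0,
                     s0=2.0, delta=4.0):
    """Intelligent Driver Model acceleration for one follower vehicle.

    v, v_lead: follower and leader speeds (m/s); gap: bumper-to-bumper gap (m).
    v0: desired speed, T: desired time headway, a_max: maximum acceleration,
    b: comfortable deceleration, s0: minimum gap, delta: acceleration exponent.
    """
    dv = v - v_lead  # positive when closing in on the leader
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)


# Standing start on an effectively empty road: close to maximum acceleration.
free_road = idm_acceleration(v=0.0, v_lead=0.0, gap=1e9)

# Fast follower approaching a slow leader at short range: strong braking.
closing_in = idm_acceleration(v=25.0, v_lead=10.0, gap=20.0)

# Cruising at the desired speed with no leader nearby: acceleration near zero.
cruising = idm_acceleration(v=30.0, v_lead=30.0, gap=1e9)
```

Because each vehicle's acceleration depends only on its own state and that of its leader, the update is independent per vehicle, which is one reason this kind of model parallelizes well.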
[COMMENTS]
Submitted to ICLR2025
[LINK]
http://arxiv.org/abs/2406.10661v2
[DATE]
2024-10-02 14:43:58+08:00
[CATEGORIES]
cs.LG
Deep Unlearn: Benchmarking Machine Unlearning
[AUTHORS]
Xavier F. Cadet, Anastasia Borovykh, Mohammad Malekzadeh, Sara Ahmadi-Abhari, Hamed Haddadi
[ABSTRACT]
Machine unlearning (MU) aims to remove the influence of particular data
points from the learnable parameters of a trained machine learning model. This
is a crucial capability in light of data privacy requirements, trustworthiness,
and safety in deployed models. MU is particularly challenging for deep neural
networks (DNNs), such as convolutional nets or vision transformers, as such
DNNs tend to memorize a notable portion of their training dataset.
Nevertheless, the community lacks a rigorous and multifaceted study that looks
into the success of MU methods for DNNs. In this paper, we investigate 18
state-of-the-art MU methods across various benchmark datasets and models, with
each evaluation conducted over 10 different initializations, a comprehensive
evaluation involving MU over 100K models. We show that, with the proper
hyperparameters, Masked Small Gradients (MSG) and Convolution Transpose (CT)
consistently perform better in terms of model accuracy and run-time efficiency
across different models, datasets, and initializations, assessed by
population-based membership inference attacks (MIA) and per-sample unlearning
likelihood ratio attacks (U-LiRA). Furthermore, our benchmark highlights the
fact that comparing an MU method only with commonly used baselines, such as
Gradient Ascent (GA) or Successive Random Relabeling (SRL), is inadequate, and
we need better baselines like Negative Gradient Plus (NG+) with proper
hyperparameter selection.
[LINK]
http://arxiv.org/abs/2410.01276v1
[DATE]
2024-10-02 14:41:58+08:00
[CATEGORIES]
cs.LG
Towards Generalizable Reinforcement Learning via Causality-Guided Self-Adaptive Representations
[AUTHORS]
Yupei Yang, Biwei Huang, Fan Feng, Xinyue Wang, Shikui Tu, Lei Xu
[ABSTRACT]
General intelligence requires quick adaptation across tasks. While existing
reinforcement learning (RL) methods have made progress in generalization, they
typically assume only distribution changes between source and target domains.
In this paper, we explore a wider range of scenarios where not only the
distribution but also the environment spaces may change. For example, in the
CoinRun environment, we train agents from easy levels and generalize them to
difficult levels where there could be new enemies that have never appeared
before. To address this challenging setting, we introduce a causality-guided
self-adaptive representation-based approach, called CSR, that equips the agent
to generalize effectively across tasks with evolving dynamics. Specifically, we
employ causal representation learning to characterize the latent causal
variables within the RL system. Such compact causal representations uncover the
structural relationships among variables, enabling the agent to autonomously
determine whether changes in the environment stem from distribution shifts or
variations in space, and to precisely locate these changes. We then devise a
three-step strategy to fine-tune the causal model under different scenarios
accordingly. Empirical experiments show that CSR efficiently adapts to the
target domains with only a few samples and outperforms state-of-the-art
baselines on a wide range of scenarios, including our simulated environments,
CartPole, CoinRun and Atari games.
[LINK]
http://arxiv.org/abs/2407.20651v3
[DATE]
2024-10-02 14:32:21+08:00
[CATEGORIES]
cs.LG
Disentangling and Integrating Relational and Sensory Information in Transformer Architectures
[AUTHORS]
Awni Altabaa, John Lafferty
[ABSTRACT]
Relational reasoning is a central component of generally intelligent systems,
enabling robust and data-efficient inductive generalization. Recent empirical
evidence shows that many existing neural architectures, including Transformers,
struggle with tasks requiring relational reasoning. In this work, we
distinguish between two types of information: sensory information about the
properties of individual objects, and relational information about the
relationships between objects. While neural attention provides a powerful
mechanism for controlling the flow of sensory information between objects, the
Transformer lacks an explicit computational mechanism for routing and
processing relational information. To address this limitation, we propose an
architectural extension of the Transformer framework that we call the Dual
Attention Transformer (DAT), featuring two distinct attention mechanisms:
sensory attention for directing the flow of sensory information, and a novel
relational attention mechanism for directing the flow of relational
information. We empirically evaluate DAT on a diverse set of tasks ranging from
synthetic relational benchmarks to complex real-world tasks such as language
modeling and visual processing. Our results demonstrate that integrating
explicit relational computational mechanisms into the Transformer architecture
leads to significant performance gains in terms of data efficiency and
parameter efficiency.
[COMMENTS]
27 pages, 11 figures
[LINK]
http://arxiv.org/abs/2405.16727v2
[DATE]
2024-10-02 14:31:42+08:00
[CATEGORIES]
cs.LG
“No Matter What You Do!”: Mitigating Backdoor Attacks in Graph Neural Networks
[AUTHORS]
Jiale Zhang, Chengcheng Zhu, Bosen Rao, Hao Sui, Xiaobing Sun, Bing Chen, Chunyi Zhou, Shouling Ji
[ABSTRACT]
Recent studies have shown that GNNs are vulnerable to several adversarial
attacks, among which the backdoor attack is one of the toughest. As with Deep
Neural Networks (DNNs), a backdoor attack on a GNN works by having the
attacker modify a portion of the graph data by embedding triggers and force
the model to learn the trigger feature during the model training process.
Despite the massive prior backdoor defense works on DNNs, defending against
backdoor attacks in GNNs is largely unexplored, severely hindering the
widespread application of GNNs in real-world tasks. To bridge this gap, we
present GCleaner, the first backdoor mitigation method on GNNs. GCleaner can
mitigate the presence of the backdoor logic within backdoored GNNs by reversing
the backdoor learning procedure, aiming to restore the model performance to a
level similar to that of a model trained directly on the original clean dataset. To
achieve this objective, we ask: How to recover universal and hard backdoor
triggers in GNNs? How to unlearn the backdoor trigger feature while maintaining
the model performance? We conduct the graph trigger recovery via the
explanation method to identify optimal trigger locations, facilitating the
search of universal and hard backdoor triggers in the feature space of the
backdoored model through maximal similarity. Subsequently, we introduce the
backdoor unlearning mechanism, which combines knowledge distillation and
gradient-based explainable knowledge for fine-grained backdoor erasure.
Extensive experimental evaluations on four benchmark datasets demonstrate that
GCleaner can reduce the backdoor attack success rate to 10% with only 1% of
clean data, and has almost negligible degradation in model performance, which
far outperforms the state-of-the-art (SOTA) defense methods.
[COMMENTS]
18 pages, 12 figures, 9 tables
[LINK]
http://arxiv.org/abs/2410.01272v1
[DATE]
2024-10-02 14:30:49+08:00
[CATEGORIES]
cs.LG
Sample what you cant compress
[AUTHORS]
Vighnesh Birodkar, Gabriel Barcik, James Lyon, Sergey Ioffe, David Minnen, Joshua V. Dillon
[ABSTRACT]
For learned image representations, basic autoencoders often produce blurry
results. Reconstruction quality can be improved by incorporating additional
penalties such as adversarial (GAN) and perceptual losses. Arguably, these
approaches lack a principled interpretation. Concurrently, in generative
settings diffusion has demonstrated a remarkable ability to create crisp, high
quality results and has solid theoretical underpinnings (from variational
inference to direct study as the Fisher Divergence). Our work combines
autoencoder representation learning with diffusion and is, to our knowledge,
the first to demonstrate the efficacy of jointly learning a continuous encoder
and decoder under a diffusion-based loss. We demonstrate that this approach
yields better reconstruction quality as compared to GAN-based autoencoders
while being easier to tune. We also show that the resulting representation is
easier to model with a latent diffusion model as compared to the representation
obtained from a state-of-the-art GAN-based loss. Since our decoder is
stochastic, it can generate details not encoded in the otherwise deterministic
latent representation; we therefore name our approach “Sample what you can’t
compress”, or SWYCC for short.
[LINK]
http://arxiv.org/abs/2409.02529v2
[DATE]
2024-10-02 14:30:19+08:00
[CATEGORIES]
cs.LG
Deep Bayesian Filter for Bayes-faithful Data Assimilation
[AUTHORS]
Yuta Tarumi, Keisuke Fukuda, Shin-ichi Maeda
[ABSTRACT]
State estimation for nonlinear state space models (SSMs) is a challenging
task. Existing assimilation methodologies predominantly assume Gaussian
posteriors on physical space, where true posteriors become inevitably
non-Gaussian. We propose Deep Bayesian Filtering (DBF) for data assimilation on
nonlinear SSMs. DBF constructs new latent variables $h_t$ in addition to the
original physical variables $z_t$ and assimilates observations $o_t$. By (i)
constraining the state transition on the new latent space to be linear and (ii)
learning a Gaussian inverse observation operator $r(h_t|o_t)$, posteriors
remain Gaussian. Notably, the structured design of test distributions enables
an analytical formula for the recursive computation, eliminating the
accumulation of Monte Carlo sampling errors across time steps. DBF trains the
Gaussian inverse observation operators $r(h_t|o_t)$ and other latent SSM
parameters (e.g., dynamics matrix) by maximizing the evidence lower bound.
Experiments demonstrate that DBF outperforms model-based approaches and latent
assimilation methods in tasks where the true posterior distribution on physical
space is significantly non-Gaussian.
[COMMENTS]
Main text 10 pages
[LINK]
http://arxiv.org/abs/2405.18674v2
[DATE]
2024-10-02 14:29:35+08:00
[CATEGORIES]
cs.LG
TabKANet: Tabular Data Modeling with Kolmogorov-Arnold Network and Transformer
[AUTHORS]
Weihao Gao, Zheng Gong, Zhuo Deng, Fuju Rong, Chucheng Chen, Lan Ma
[ABSTRACT]
Tabular data is the most common type of data in real-life scenarios. In this
study, we propose the TabKANet model for tabular data modeling, which targets
the bottlenecks in learning from numerical content. We constructed a
Kolmogorov-Arnold Network (KAN) based Numerical Embedding Module and unified
numerical and categorical features encoding within a Transformer architecture.
TabKANet has demonstrated stable and significantly superior performance
compared to Neural Networks (NNs) across multiple public datasets in binary
classification, multi-class classification, and regression tasks. Its
performance is comparable to or surpasses that of Gradient Boosted Decision
Tree models (GBDTs). Our code is publicly available on GitHub:
https://github.com/AI-thpremed/TabKANet.
[COMMENTS]
13 pages, 5 figures
[LINK]
http://arxiv.org/abs/2409.08806v2
[DATE]
2024-10-02 14:22:48+08:00
[CATEGORIES]
cs.LG
Transformers Handle Endogeneity in In-Context Linear Regression
[AUTHORS]
Haodong Liang, Krishnakumar Balasubramanian, Lifeng Lai
[ABSTRACT]
We explore the capability of transformers to address endogeneity in
in-context linear regression. Our main finding is that transformers inherently
possess a mechanism to handle endogeneity effectively using instrumental
variables (IV). First, we demonstrate that the transformer architecture can
emulate a gradient-based bi-level optimization procedure that converges to the
widely used two-stage least squares $(\textsf{2SLS})$ solution at an
exponential rate. Next, we propose an in-context pretraining scheme and provide
theoretical guarantees showing that the global minimizer of the pre-training
loss achieves a small excess loss. Our extensive experiments validate these
theoretical findings, showing that the trained transformer provides more robust
and reliable in-context predictions and coefficient estimates than the
$\textsf{2SLS}$ method, in the presence of endogeneity.
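For context, the classical 2SLS estimator that the transformer is compared against can be sketched for a single regressor and a single instrument. This is a minimal sketch on simulated data, not the paper's experimental setup: the data-generating process, sample size, and function names are illustrative assumptions.

```python
import random


def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_stage_least_squares(z, x, y):
    """2SLS with one endogenous regressor x and one instrument z."""
    # Stage 1: project the endogenous regressor onto the instrument.
    gain, offset = ols_slope_intercept(z, x)
    x_hat = [offset + gain * zi for zi in z]
    # Stage 2: regress the outcome on the stage-1 fitted values.
    beta, _ = ols_slope_intercept(x_hat, y)
    return beta


rng = random.Random(0)
n, true_beta = 5000, 2.0
z = [rng.gauss(0, 1) for _ in range(n)]  # instrument, independent of u
u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
x = [zi + ui + rng.gauss(0, 0.5) for zi, ui in zip(z, u)]
y = [true_beta * xi + 2.0 * ui + rng.gauss(0, 0.5) for xi, ui in zip(x, u)]

beta_ols, _ = ols_slope_intercept(x, y)       # biased upward by the confounder
beta_2sls = two_stage_least_squares(z, x, y)  # consistent for true_beta
```

Here the confounder `u` enters both `x` and `y`, so plain OLS overestimates the coefficient, while the instrument `z` affects `y` only through `x` and lets 2SLS recover it.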
[COMMENTS]
30 pages
[LINK]
http://arxiv.org/abs/2410.01265v1
[DATE]
2024-10-02 14:21:04+08:00
[CATEGORIES]
cs.LG
Aggregation of Multi Diffusion Models for Enhancing Learned Representations
[AUTHORS]
Conghan Yue, Zhengwei Peng, Shiyan Du, Zhi Ji, Dongyu Zhang
[ABSTRACT]
Diffusion models have achieved remarkable success in image generation,
particularly with the various applications of classifier-free guidance
conditional diffusion models. While many diffusion models perform well when
controlling for a particular aspect among style, character, and interaction, they
struggle with fine-grained control due to dataset limitations and intricate
model architecture design. This paper introduces a novel algorithm, Aggregation
of Multi Diffusion Models (AMDM), which synthesizes features from multiple
diffusion models into a specified model, enhancing its learned representations
to activate specific features for fine-grained control. AMDM consists of two
key components: spherical aggregation and manifold optimization. Spherical
aggregation merges intermediate variables from different diffusion models with
minimal manifold deviation, while manifold optimization refines these variables
to align with the intermediate data manifold, enhancing sampling quality.
Experimental results demonstrate that AMDM significantly improves fine-grained
control without additional training or inference time, proving its
effectiveness. Additionally, it reveals that diffusion models initially focus
on features such as position, attributes, and style, with later stages
improving generation quality and consistency. AMDM offers a new perspective for
tackling the challenges of fine-grained conditional control generation in
diffusion models: We can fully utilize existing conditional diffusion models
that control specific aspects, or develop new ones, and then aggregate them
using the AMDM algorithm. This eliminates the need for constructing complex
datasets, designing intricate model architectures, and incurring high training
costs. Code is available at: https://github.com/Hammour-steak/AMDM
[LINK]
http://arxiv.org/abs/2410.01262v1
[DATE]
2024-10-02 14:16:06+08:00
[CATEGORIES]
cs.LG
Revisiting Optimism and Model Complexity in the Wake of Overparameterized Machine Learning
[AUTHORS]
Pratik Patil, Jin-Hong Du, Ryan J. Tibshirani
[ABSTRACT]
Common practice in modern machine learning involves fitting a large number of
parameters relative to the number of observations. These overparameterized
models can exhibit surprising generalization behavior, e.g., “double descent”
in the prediction error curve when plotted against the raw number of model
parameters, or another simplistic notion of complexity. In this paper, we
revisit model complexity from first principles, by first reinterpreting and
then extending the classical statistical concept of (effective) degrees of
freedom. Whereas the classical definition is connected to fixed-X prediction
error (in which prediction error is defined by averaging over the same,
nonrandom covariate points as those used during training), our extension of
degrees of freedom is connected to random-X prediction error (in which
prediction error is averaged over a new, random sample from the covariate
distribution). The random-X setting more naturally embodies modern machine
learning problems, where highly complex models, even those complex enough to
interpolate the training data, can still lead to desirable generalization
performance under appropriate conditions. We demonstrate the utility of our
proposed complexity measures through a mix of conceptual arguments, theory, and
experiments, and illustrate how they can be used to interpret and compare
arbitrary prediction models.
[COMMENTS]
59 pages, 17 figures
[LINK]
http://arxiv.org/abs/2410.01259v1
[DATE]
2024-10-02 14:09:57+08:00
[CATEGORIES]
cs.LG
Resource-efficient equivariant quantum convolutional neural networks
[AUTHORS]
Koki Chinzei, Quoc Hoan Tran, Yasuhiro Endo, Hirotaka Oshima
[ABSTRACT]
Equivariant quantum neural networks (QNNs) are promising quantum machine
learning models that exploit symmetries to provide potential quantum
advantages. Despite theoretical developments in equivariant QNNs, their
implementation on near-term quantum devices remains challenging due to limited
computational resources. This study proposes a resource-efficient model of
equivariant quantum convolutional neural networks (QCNNs) called equivariant
split-parallelizing QCNN (sp-QCNN). Using a group-theoretical approach, we
encode general symmetries into our model beyond the translational symmetry
addressed by previous sp-QCNNs. We achieve this by splitting the circuit at the
pooling layer while preserving symmetry. This splitting structure effectively
parallelizes QCNNs to improve measurement efficiency in estimating the
expectation value of an observable and its gradient by a factor on the order of
the number of qubits. Our model also exhibits high trainability and generalization
performance, including the absence of barren plateaus. Numerical experiments
demonstrate that the equivariant sp-QCNN can be trained and generalized with
fewer measurement resources than a conventional equivariant QCNN in a noisy
quantum data classification task. Our results contribute to the advancement of
practical quantum machine learning algorithms.
[COMMENTS]
20 pages, 7 figures, 1 table
[LINK]
http://arxiv.org/abs/2410.01252v1
[DATE]
2024-10-02 13:51:33+08:00
[CATEGORIES]
cs.LG
Dual Approximation Policy Optimization
[AUTHORS]
Zhihan Xiong, Maryam Fazel, Lin Xiao
[ABSTRACT]
We propose Dual Approximation Policy Optimization (DAPO), a framework that
incorporates general function approximation into policy mirror descent methods.
In contrast to the popular approach of using the $L_2$-norm to measure function
approximation errors, DAPO uses the dual Bregman divergence induced by the
mirror map for policy projection. This duality framework has both theoretical
and practical implications: not only does it achieve fast linear convergence
with general function approximation, but it also includes several well-known
practical methods as special cases, immediately providing strong convergence
guarantees.
[COMMENTS]
30 pages, 2 figures
[LINK]
http://arxiv.org/abs/2410.01249v1
[DATE]
2024-10-02 13:49:11+08:00
[CATEGORIES]
cs.LG
Generalized Gaussian Temporal Difference Error for Uncertainty-aware Reinforcement Learning
[AUTHORS]
Seyeon Kim, Joonhun Lee, Namhoon Cho, Sungjun Han, Wooseop Hwang
[ABSTRACT]
Conventional uncertainty-aware temporal difference (TD) learning methods
often rely on simplistic assumptions, typically including a zero-mean Gaussian
distribution for TD errors. Such oversimplification can lead to inaccurate
error representations and compromised uncertainty estimation. In this paper, we
introduce a novel framework for generalized Gaussian error modeling in deep
reinforcement learning, applicable to both discrete and continuous control
settings. Our framework enhances the flexibility of error distribution modeling
by incorporating an additional higher-order moment, specifically kurtosis, thereby
improving the estimation and mitigation of data-dependent noise, i.e.,
aleatoric uncertainty. We examine the influence of the shape parameter of the
generalized Gaussian distribution (GGD) on aleatoric uncertainty and provide a
closed-form expression that demonstrates an inverse relationship between
uncertainty and the shape parameter. Additionally, we propose a theoretically
grounded weighting scheme to fully leverage the GGD. To address epistemic
uncertainty, we enhance the batch inverse variance weighting by incorporating
bias reduction and kurtosis considerations, resulting in improved robustness.
Extensive experimental evaluations using policy gradient algorithms demonstrate
the consistent efficacy of our method, showcasing significant performance
improvements.
[LINK]
http://arxiv.org/abs/2408.02295v2
[DATE]
2024-10-02 13:46:06+08:00
[CATEGORIES]
cs.LG
Inference-Time Alignment of Diffusion Models with Direct Noise Optimization
[AUTHORS]
Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, Tsung-Hui Chang
[ABSTRACT]
In this work, we focus on the alignment problem of diffusion models with a
continuous reward function, which represents specific objectives for downstream
tasks, such as increasing darkness or improving the aesthetics of images. The
central goal of the alignment problem is to adjust the distribution learned by
diffusion models such that the generated samples maximize the target reward
function. We propose a novel alignment approach, named Direct Noise
Optimization (DNO), that optimizes the injected noise during the sampling
process of diffusion models. By design, DNO operates at inference-time, and
thus is tuning-free and prompt-agnostic, with the alignment occurring in an
online fashion during generation. We rigorously study the theoretical
properties of DNO and also propose variants to deal with non-differentiable
reward functions. Furthermore, we find that a naive implementation of DNO
occasionally suffers from the out-of-distribution reward hacking problem, where
optimized samples have high rewards but are no longer in the support of the
pretrained distribution. To remedy this issue, we leverage classical
high-dimensional statistics theory to develop an effective probability regularization
technique. We conduct extensive experiments on several important reward
functions and demonstrate that the proposed DNO approach can achieve
state-of-the-art reward scores within a reasonable time budget for generation.
[LINK]
http://arxiv.org/abs/2405.18881v3
[DATE]
2024-10-02 13:22:07+08:00
[CATEGORIES]
cs.LG
Predictive Low Rank Matrix Learning under Partial Observations: Mixed-Projection ADMM
[AUTHORS]
Dimitris Bertsimas, Nicholas A. G. Johnson
[ABSTRACT]
We study the problem of learning a partially observed matrix under the low
rank assumption in the presence of fully observed side information that depends
linearly on the true underlying matrix. This problem is an important
generalization of the Matrix Completion problem, a central problem in
Statistics, Operations Research and Machine Learning, that arises in
applications such as recommendation systems, signal processing, system
identification and image denoising. We formalize this problem as an
optimization problem with an objective that balances the strength of the fit of
the reconstruction to the observed entries with the ability of the
reconstruction to be predictive of the side information. We derive a
mixed-projection reformulation of the resulting optimization problem and
present a strong semidefinite cone relaxation. We design an efficient, scalable
alternating direction method of multipliers algorithm that produces high
quality feasible solutions to the problem of interest. Our numerical results
demonstrate that in the small rank regime ($k \leq 15$), our algorithm outputs
solutions that achieve on average $79\%$ lower objective value and $90.1\%$
lower $\ell_2$ reconstruction error than the solutions returned by the best
performing benchmark method on synthetic data. The runtime of our algorithm is
competitive with and often superior to that of the benchmark methods. Our
algorithm is able to solve problems with $n = 10000$ rows and $m = 10000$
columns in less than a minute. On large scale real world data, our algorithm
produces solutions that achieve $67\%$ lower out of sample error than benchmark
methods in $97\%$ less execution time.
[LINK]
http://arxiv.org/abs/2407.13731v2
[DATE]
2024-10-02 13:21:00+08:00
[CATEGORIES]
cs.LG
Equivariant score-based generative models provably learn distributions with symmetries efficiently
[AUTHORS]
Ziyu Chen, Markos A. Katsoulakis, Benjamin J. Zhang
[ABSTRACT]
Symmetry is ubiquitous in many real-world phenomena and tasks, such as
physics, images, and molecular simulations. Empirical studies have demonstrated
that incorporating symmetries into generative models can provide better
generalization and sampling efficiency when the underlying data distribution
has group symmetry. In this work, we provide the first theoretical analysis and
guarantees of score-based generative models (SGMs) for learning distributions
that are invariant with respect to some group symmetry and offer the first
quantitative comparison between data augmentation and adding equivariant
inductive bias. First, building on recent works on the Wasserstein-1
($\mathbf{d}_1$) guarantees of SGMs and empirical estimations of probability
divergences under group symmetry, we provide an improved $\mathbf{d}_1$
generalization bound when the data distribution is group-invariant. Second, we
describe the inductive bias of equivariant SGMs using Hamilton-Jacobi-Bellman
theory, and rigorously demonstrate that one can learn the score of a
symmetrized distribution using equivariant vector fields without data
augmentations through the analysis of the optimality and equivalence of
score-matching objectives. This also provides practical guidance that one does
not have to augment the dataset as long as the vector field or the neural
network parametrization is equivariant. Moreover, we quantify the impact of not
incorporating equivariant structure into the score parametrization, by showing
that non-equivariant vector fields can yield worse generalization bounds. This
can be viewed as a type of model-form error that describes the missing
structure of non-equivariant vector fields. Numerical simulations corroborate
our analysis and highlight that data augmentations cannot replace the role of
equivariant vector fields.
[LINK]
http://arxiv.org/abs/2410.01244v1
[DATE]
2024-10-02 13:14:28+08:00
[CATEGORIES]
cs.LG
Multilingual Diversity Improves Vision-Language Representations
[AUTHORS]
Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei Koh, Ranjay Krishna
[COMMENTS]
NeurIPS 2024 Spotlight paper
[LINK]
http://arxiv.org/abs/2405.16915v2
[DATE]
2024-10-02 13:04:10+08:00
[CATEGORIES]
cs.LG
Generative modeling of density regression through tree flows
[AUTHORS]
Zhuoqun Wang, Naoki Awaya, Li Ma
[ABSTRACT]
A common objective in the analysis of tabular data is estimating the
conditional distribution (in contrast to only producing predictions) of a set
of “outcome” variables given a set of “covariates”, which is sometimes referred
to as the “density regression” problem. Beyond estimating the conditional
distribution, the ability to draw synthetic samples from the learned
conditional distribution is also desirable, as it further widens the range
of applications. We propose a flow-based generative model tailored for the
density regression task on tabular data. Our flow applies a sequence of
tree-based piecewise-linear transforms on initial uniform noise to eventually
generate samples from complex conditional densities of (univariate or
multivariate) outcomes given the covariates and allows efficient analytical
evaluation of the fitted conditional density on any point in the sample space.
We introduce a training algorithm for fitting the tree-based transforms using a
divide-and-conquer strategy that transforms maximum likelihood training of the
tree-flow into training a collection of binary classifiers (one at each tree
split) under cross-entropy loss. We assess the performance of our method under
out-of-sample likelihood evaluation and compare it with a variety of
state-of-the-art conditional density learners on a range of simulated and real
benchmark tabular datasets. Our method consistently achieves comparable or
superior performance at a fraction of the training and sampling budget.
Finally, we demonstrate the utility of our method’s generative ability through
an application to generating synthetic longitudinal microbiome compositional
data based on training our flow on a publicly available microbiome study.
[COMMENTS]
24 pages, 9 figures
[LINK]
http://arxiv.org/abs/2406.05260v2
[DATE]
2024-10-02 12:43:50+08:00
[CATEGORIES]
cs.LG
Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices
[AUTHORS]
Ruiyang Qin, Dancheng Liu, Chenhui Xu, Zheyu Yan, Zhaoxuan Tan, Zhenge Jia, Amir Nassereldine, Jiajie Li, Meng Jiang, Ahmed Abbasi, Jinjun Xiong, Yiyu Shi
[ABSTRACT]
The scaling laws have become the de facto guidelines for designing large
language models (LLMs), but they were studied under the assumption of unlimited
computing resources for both training and inference. As LLMs are increasingly
used as personalized intelligent assistants, their customization (i.e.,
learning through fine-tuning) and deployment onto resource-constrained edge
devices will become more and more prevalent. An urgent but open question is how
a resource-constrained computing environment would affect the design choices
for a personalized LLM. We study this problem empirically in this work. In
particular, we consider the tradeoffs among a number of key design factors and
their intertwined impacts on learning efficiency and accuracy. The factors
include the learning methods for LLM customization, the amount of personalized
data used for learning customization, the types and sizes of LLMs, the
compression methods of LLMs, the amount of time afforded to learn, and the
difficulty levels of the target use cases. Through extensive experimentation
and benchmarking, we draw a number of surprisingly insightful guidelines for
deploying LLMs onto resource-constrained devices. For example, an optimal
choice between parameter learning and RAG may vary depending on the difficulty
of the downstream task, a longer fine-tuning time does not necessarily help
the model, and a compressed LLM may be a better choice than an uncompressed LLM
to learn from limited personalized data.
[COMMENTS]
Benchmarking paper
[LINK]
http://arxiv.org/abs/2406.03777v3
[DATE]
2024-10-02 12:14:21+08:00
[CATEGORIES]
cs.LG
See Me and Believe Me: Causality and Intersectionality in Testimonial Injustice in Healthcare
[AUTHORS]
Kenya S. Andrews, Mesrob I. Ohannessian, Elena Zheleva
[ABSTRACT]
In medical settings, it is critical that all who are in need of care are
correctly heard and understood. When this is not the case due to prejudices a
listener has, the speaker is experiencing \emph{testimonial injustice}, which,
building upon recent work, we quantify by the presence of several categories of
unjust vocabulary in medical notes. In this paper, we use FCI, a causal
discovery method, to study the degree to which certain demographic features
could lead to marginalization (e.g., age, gender, and race) by way of
contributing to testimonial injustice. To achieve this, we review physicians’
notes for each patient, where we identify occurrences of unjust vocabulary,
along with the demographic features present, and use causal discovery to build
a Structural Causal Model (SCM) relating those demographic features to
testimonial injustice. We analyze and discuss the resulting SCMs to show the
interaction of these factors and how they influence the experience of
injustice. Despite the potential presence of some confounding variables, we
observe how one contributing feature can make a person more prone to
experiencing another contributor of testimonial injustice. There is no single
root of injustice and thus intersectionality cannot be ignored. These results
call for considering more than singular or equalized attributes of who a person
is when analyzing and improving their experiences of bias and injustice. This
work is thus a first foray into using causal discovery to understand the nuanced
experiences of patients in medical settings, and its insights could be used to
guide design principles throughout healthcare, to build trust and promote
better patient care.
[LINK]
http://arxiv.org/abs/2410.01227v1
[DATE]
2024-10-02 12:10:55+08:00
[CATEGORIES]
cs.LG
Integrating Text-to-Music Models with Language Models: Composing Long Structured Music Pieces
[AUTHORS]
Lilac Atassi
[ABSTRACT]
Recent music generation methods based on transformers have a context window
of up to a minute. The music generated by these methods is largely
unstructured beyond the context window. With a longer context window, learning
long scale structures from musical data is a prohibitively challenging problem.
This paper proposes integrating a text-to-music model with a large language
model to generate music with form. We discuss our solutions to the challenges
of such integration. The experimental results show that the proposed method can
generate 2.5-minute-long music that is highly structured, strongly organized,
and cohesive.
[COMMENTS]
arXiv admin note: substantial text overlap with arXiv:2404.11976
[LINK]
http://arxiv.org/abs/2410.00344v2
[DATE]
2024-10-02 12:06:59+08:00
[CATEGORIES]
cs.LG
HybridFlow: A Flexible and Efficient RLHF Framework
[AUTHORS]
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu
[ABSTRACT]
Reinforcement Learning from Human Feedback (RLHF) is widely used in Large
Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow,
where each node represents computation of a neural network (NN) and each edge
denotes data dependencies between the NNs. RLHF complicates the dataflow by
expanding each node into a distributed LLM training or generation program, and
each edge into a many-to-many multicast. Traditional RL frameworks execute the
dataflow using a single controller to instruct both intra-node computation and
inter-node communication, which can be inefficient in RLHF due to large control
dispatch overhead for distributed intra-node computation. Existing RLHF systems
adopt a multi-controller paradigm, which can be inflexible due to nesting
distributed computation and data communication. We propose HybridFlow, which
combines single-controller and multi-controller paradigms in a hybrid manner to
enable flexible representation and efficient execution of the RLHF dataflow. We
carefully design a set of hierarchical APIs that decouple and encapsulate
computation and data dependencies in the complex RLHF dataflow, allowing
efficient operation orchestration to implement RLHF algorithms and flexible
mapping of the computation onto various devices. We further design a
3D-HybridEngine for efficient actor model resharding between training and
generation phases, with zero memory redundancy and significantly reduced
communication overhead. Our experimental results demonstrate a
1.53$\times$ to 20.57$\times$ throughput improvement when running various RLHF
algorithms using HybridFlow, as compared with state-of-the-art baselines.
HybridFlow source code will be available at https://github.com/volcengine/verl.
[LINK]
http://arxiv.org/abs/2409.19256v2
[DATE]
2024-10-02 12:01:47+08:00
[CATEGORIES]
cs.LG
Induced Covariance for Causal Discovery in Linear Sparse Structures
[AUTHORS]
Saeed Mohseni-Sehdeh, Walid Saad
[ABSTRACT]
Causal models seek to unravel the cause-effect relationships among variables
from observed data, rather than learning mere mappings among them, as traditional
regression models do. This paper introduces a novel causal discovery algorithm
designed for settings in which variables exhibit linearly sparse relationships.
In such scenarios, the causal links represented by directed acyclic graphs
(DAGs) can be encapsulated in a structural matrix. The proposed approach
leverages the structural matrix’s ability to reconstruct data and the
statistical properties it imposes on the data to identify the correct
structural matrix. This method does not rely on independence tests or graph
fitting procedures, making it suitable for scenarios with limited training
data. Simulation results demonstrate that the proposed method outperforms the
well-known PC, GES, BIC exact search, and LINGAM-based methods in recovering
linearly sparse causal structures.
[LINK]
http://arxiv.org/abs/2410.01221v1
[DATE]
2024-10-02 12:01:38+08:00
[CATEGORIES]
cs.LG
An uncertainty-aware Digital Shadow for underground multimodal CO2 storage monitoring
[AUTHORS]
Abhinav Prakash Gahlot, Rafael Orozco, Ziyi Yin, Felix J. Herrmann
[ABSTRACT]
Geological Carbon Storage (GCS) is arguably the only scalable net-negative CO2
emission technology available. While promising, subsurface complexities and
heterogeneity of reservoir properties demand a systematic approach to quantify
uncertainty when optimizing production and mitigating storage risks, which
include assurances of Containment and Conformance of injected supercritical
CO2. As a first step towards the design and implementation of a Digital Twin
for monitoring underground storage operations, a machine-learning-based
data-assimilation framework is introduced and validated on carefully designed
realistic numerical simulations. As our implementation is based on Bayesian
inference but does not yet support control and decision-making, we coin our
approach an uncertainty-aware Digital Shadow. To characterize the posterior
distribution for the state of CO2 plumes conditioned on multi-modal time-lapse
data, the envisioned Shadow combines techniques from Simulation-Based Inference
(SBI) and Ensemble Bayesian Filtering to establish probabilistic baselines and
assimilate multi-modal data for GCS problems that are challenged by large
degrees of freedom, nonlinear multi-physics, non-Gaussianity, and
computationally expensive-to-evaluate fluid-flow and seismic simulations. To
enable SBI for dynamic systems, a recursive scheme is proposed where the
Digital Shadow’s neural networks are trained on simulated ensembles for their
state and observed data (well and/or seismic). Once training is completed, the
system’s state is inferred when time-lapse field data becomes available. In
this computational study, we observe that a lack of knowledge on the
permeability field can be factored into the Digital Shadow’s uncertainty
quantification. To our knowledge, this work represents the first proof of
concept of an uncertainty-aware, in-principle scalable Digital Shadow.
[LINK]
http://arxiv.org/abs/2410.01218v1
[DATE]
2024-10-02 11:58:45+08:00
[CATEGORIES]
cs.LG
Tackling GenAI Copyright Issues: Originality Estimation and Genericization
[AUTHORS]
Hiroaki Chiba-Okabe, Weijie J. Su
[ABSTRACT]
The rapid progress of generative AI technology has sparked significant
copyright concerns, leading to numerous lawsuits filed against AI developers.
While various techniques for mitigating copyright issues have been studied,
significant risks remain. Here, we propose a genericization method that
modifies the outputs of a generative model to make them more generic and less
likely to infringe copyright. To achieve this, we introduce a metric for
quantifying the level of originality of data in a manner that is consistent
with the legal framework. This metric can be estimated by drawing samples from
a generative model, which is then used for the genericization process. As a
practical implementation, we introduce PREGen, which combines our
genericization method with an existing mitigation technique. Experiments
demonstrate that our genericization method successfully modifies the output of
a text-to-image generative model so that it produces more generic,
copyright-compliant images. Compared to the existing method, PREGen reduces the
likelihood of generating copyrighted characters by more than half when the
names of copyrighted characters are used as the prompt, dramatically improving
the performance. Additionally, while generative models can produce copyrighted
characters even when their names are not directly mentioned in the prompt,
PREGen almost entirely prevents the generation of such characters in these
cases.
[COMMENTS]
22 pages, 10 figures
[LINK]
http://arxiv.org/abs/2406.03341v5
[DATE]
2024-10-02 11:53:19+08:00
[CATEGORIES]
cs.LG
Explainable Molecular Property Prediction: Aligning Chemical Concepts with Predictions via Language Models
[AUTHORS]
Zhenzhong Wang, Zehui Lin, Wanyu Lin, Ming Yang, Minggang Zeng, Kay Chen Tan
[ABSTRACT]
Providing explainable molecular property predictions is critical for many
scientific domains, such as drug discovery and material science. Though
transformer-based language models have shown great potential in accurate
molecular property prediction, they neither provide chemically meaningful
explanations nor faithfully reveal the molecular structure-property
relationships. In this work, we develop a framework for explainable molecular
property prediction based on language models, dubbed Lamole, which can
provide chemical concepts-aligned explanations. We take a string-based
molecular representation – Group SELFIES – as input tokens to pretrain and
fine-tune Lamole, as it provides chemically meaningful semantics. By
disentangling the information flows of Lamole, we propose combining
self-attention weights and gradients for better quantification of each
chemically meaningful substructure’s impact on the model’s output. To make the
explanations more faithfully respect the structure-property relationship, we
then carefully craft a marginal loss to explicitly optimize the explanations to
be able to align with the chemists’ annotations. We bridge the manifold
hypothesis with the elaborated marginal loss to prove that the loss can align
the explanations with the tangent space of the data manifold, leading to
concept-aligned explanations. Experimental results over six mutagenicity
datasets and one hepatotoxicity dataset demonstrate Lamole can achieve
comparable classification accuracy and boost the explanation accuracy by up to
14.3%, being the state-of-the-art in explainable molecular property prediction.
[LINK]
http://arxiv.org/abs/2405.16041v3
[DATE]
2024-10-02 11:52:50+08:00
[CATEGORIES]
cs.LG
ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding
[AUTHORS]
Novendra Setyawan, Ghufron Wahyu Kurniawan, Chi-Chia Sun, Jun-Wei Hsieh, Jing-Ming Guo, Wen-Kai Kuo
[ABSTRACT]
Convolutional Neural Networks (CNNs) and Transformers have achieved
remarkable success in computer vision tasks. However, their deep architectures
often lead to high computational redundancy, making them less suitable for
resource-constrained environments, such as edge devices. This paper introduces
ParFormer, a novel vision transformer that addresses this challenge by
incorporating a Parallel Mixer and a Sparse Channel Attention Patch Embedding
(SCAPE). By combining convolutional and attention mechanisms, ParFormer
improves feature extraction. This makes spatial feature extraction more
efficient and cuts down on unnecessary computation. The SCAPE module further
reduces computational redundancy while preserving essential feature information
during down-sampling. Experimental results on the ImageNet-1K dataset show that
ParFormer-T achieves 78.9\% Top-1 accuracy with high GPU throughput,
outperforming other small models: its throughput is 2.56$\times$ higher than
MobileViT-S, 0.24\% higher than FasterNet-T2, and 1.79$\times$ higher than
EdgeNeXt-S. For edge device deployment, ParFormer-T excels with a throughput of
278.1 images/sec, which is 1.38$\times$ higher than EdgeNeXt-S and
2.36$\times$ higher than MobileViT-S, making it highly suitable for real-time
applications in resource-constrained settings. The larger variant, ParFormer-L,
reaches 83.5\% Top-1 accuracy, offering a balanced trade-off between accuracy
and efficiency, surpassing many state-of-the-art models. On COCO,
ParFormer-M achieves 40.7 AP for object detection and 37.6 AP for
instance segmentation, surpassing models like ResNet-50, PVT-S and
PoolFormer-S24 with significantly higher efficiency. These results validate
ParFormer as a highly efficient and scalable model for both high-performance
and resource-constrained scenarios, making it an ideal solution for edge-based
AI applications.
[COMMENTS]
Under review at IEEE Transactions on Cognitive and Developmental Systems
[LINK]
http://arxiv.org/abs/2403.15004v3
[DATE]
2024-10-02 11:46:17+08:00
[CATEGORIES]
cs.LG
Absolute State-wise Constrained Policy Optimization: High-Probability State-wise Constraints Satisfaction
[AUTHORS]
Weiye Zhao, Feihan Li, Yifan Sun, Yujie Wang, Rui Chen, Tianhao Wei, Changliu Liu
[ABSTRACT]
Enforcing state-wise safety constraints is critical for the application of
reinforcement learning (RL) in real-world problems, such as autonomous driving
and robot manipulation. However, existing safe RL methods only enforce
state-wise constraints in expectation or enforce hard state-wise constraints
with strong assumptions. The former does not exclude the possibility of safety
violations, while the latter is impractical. Our insight is that although it is
intractable to guarantee hard state-wise constraints in a model-free setting,
we can enforce state-wise safety with high probability while excluding strong
assumptions. To accomplish the goal, we propose Absolute State-wise Constrained
Policy Optimization (ASCPO), a novel general-purpose policy search algorithm
that guarantees high-probability state-wise constraint satisfaction for
stochastic systems. We demonstrate the effectiveness of our approach by
training neural network policies for extensive robot locomotion tasks, where
the agent must adhere to various state-wise safety constraints. Our results
show that ASCPO significantly outperforms existing methods in handling
state-wise constraints across challenging continuous control tasks,
highlighting its potential for real-world applications.
[COMMENTS]
submission to Journal of Machine Learning Research
[LINK]
http://arxiv.org/abs/2410.01212v1
[DATE]
2024-10-02 11:43:33+08:00
[CATEGORIES]
cs.LG
Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design
[AUTHORS]
Shuze Liu, Shangtong Zhang
[ABSTRACT]
Most reinforcement learning practitioners evaluate their policies with online
Monte Carlo estimators for either hyperparameter tuning or testing different
algorithmic design choices, where the policy is repeatedly executed in the
environment to get the average outcome. Such massive interactions with the
environment are prohibitive in many scenarios. In this paper, we propose novel
methods that improve the data efficiency of online Monte Carlo estimators while
maintaining their unbiasedness. We first propose a tailored closed-form
behavior policy that provably reduces the variance of an online Monte Carlo
estimator. We then design efficient algorithms to learn this closed-form
behavior policy from previously collected offline data. Theoretical analysis is
provided to characterize how the behavior policy learning error affects the
amount of reduced variance. Compared with previous works, our method achieves
better empirical performance in a broader set of environments, with fewer
requirements for offline data.
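
A minimal sketch of why the choice of behavior policy matters, assuming a hypothetical three-armed bandit with deterministic rewards (all numbers are made up for illustration). It computes the exact mean and variance of a single importance-weighted Monte Carlo sample. The tailored policy here is the classical importance-sampling choice proportional to target probability times reward, not the closed-form behavior policy derived in the paper.

```python
def is_estimator_stats(target, behavior, rewards):
    """Exact mean and variance of (pi(a) / mu(a)) * r(a) when a is drawn from mu."""
    samples = [(mu, (pi / mu) * r) for pi, mu, r in zip(target, behavior, rewards)]
    mean = sum(mu * value for mu, value in samples)
    second_moment = sum(mu * value ** 2 for mu, value in samples)
    return mean, second_moment - mean ** 2


# Hypothetical target policy and deterministic, positive rewards.
target = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 4.0]

# On-policy evaluation: the behavior policy equals the target policy.
on_policy_mean, on_policy_var = is_estimator_stats(target, target, rewards)

# Classical variance-minimizing choice: mu(a) proportional to pi(a) * r(a).
normalizer = sum(pi * r for pi, r in zip(target, rewards))
tailored = [pi * r / normalizer for pi, r in zip(target, rewards)]
tailored_mean, tailored_var = is_estimator_stats(target, tailored, rewards)

print(on_policy_mean, on_policy_var)
print(tailored_mean, tailored_var)
```

Both estimators have the same mean, so the tailored one stays unbiased while its variance collapses in this toy case. With stochastic rewards or unknown reward values the ideal policy is not available exactly, which is where learning it from offline data comes in.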
[LINK]
http://arxiv.org/abs/2301.13734v5
[DATE]
2024-10-02 11:41:22+08:00
[CATEGORIES]
cs.LG
Debiasing Federated Learning with Correlated Client Participation
[AUTHORS]
Zhenyu Sun, Ziyang Zhang, Zheng Xu, Gauri Joshi, Pranay Sharma, Ermin Wei
[ABSTRACT]
In cross-device federated learning (FL) with millions of mobile clients, only
a small subset of clients participate in training in every communication round,
and Federated Averaging (FedAvg) is the most popular algorithm in practice.
Existing analyses of FedAvg usually assume the participating clients are
independently sampled in each round from a uniform distribution, which does not
reflect real-world scenarios. This paper introduces a theoretical framework
that models client participation in FL as a Markov chain to study optimization
convergence when clients have non-uniform and correlated participation across
rounds. We apply this framework to analyze a more general and practical
pattern: every client must wait a minimum number of $R$ rounds (minimum
separation) before re-participating. We theoretically prove and empirically
observe that increasing minimum separation reduces the bias induced by
intrinsic non-uniformity of client availability in cross-device FL systems.
Furthermore, we develop an effective debiasing algorithm for FedAvg that
provably converges to the unbiased optimal solution under arbitrary minimum
separation and unknown client availability distribution.
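
The minimum-separation participation pattern can be sketched as a small simulation. This is an illustration under stated assumptions, not the paper's model or its debiasing algorithm: the function name, the availability weights, and the convention that a client chosen in round t is next eligible in round t + R + 1 are all hypothetical.

```python
import random


def simulate_participation(availability, num_rounds, cohort_size, min_separation, seed=0):
    """Sample a cohort per round; a client sits out min_separation rounds after participating."""
    rng = random.Random(seed)
    last_seen = {}  # client id -> most recent round in which it participated
    history = []
    for t in range(num_rounds):
        # A client is eligible if it never participated or has waited long enough.
        pool = [c for c in range(len(availability))
                if c not in last_seen or t - last_seen[c] > min_separation]
        cohort = []
        while pool and len(cohort) < cohort_size:
            # Non-uniform availability: more available clients are picked more often.
            weights = [availability[c] for c in pool]
            pick = rng.choices(pool, weights=weights, k=1)[0]
            cohort.append(pick)
            pool.remove(pick)
        for c in cohort:
            last_seen[c] = t
        history.append(cohort)
    return history


availability = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.1]
history = simulate_participation(availability, num_rounds=50, cohort_size=2, min_separation=3)
print(history[:5])
```

Because participation in one round changes who is eligible in the next, the cohorts are correlated across rounds, which is the dependence the Markov-chain view is meant to capture.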
[LINK]
http://arxiv.org/abs/2410.01209v1
[DATE]
2024-10-02 11:30:53+08:00
[CATEGORIES]
cs.LG
Were RNNs All We Needed?
[AUTHORS]
Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, Hossein Hajimirsadegh
[ABSTRACT]
The scalability limitations of Transformers regarding sequence length have
renewed interest in recurrent sequence models that are parallelizable during
training. As a result, many novel recurrent architectures, such as S4, Mamba,
and Aaren, have been proposed that achieve comparable performance. In this
work, we revisit traditional recurrent neural networks (RNNs) from over a
decade ago: LSTMs (1997) and GRUs (2014). While these models were slow because
they required backpropagation through time (BPTT), we show that by removing the
hidden state dependencies from their input, forget, and update gates, LSTMs and
GRUs no longer need BPTT and can be efficiently trained in parallel.
Building on this, we introduce minimal versions (minLSTMs and minGRUs) that (1)
use significantly fewer parameters than their traditional counterparts and (2)
are fully parallelizable during training (175x faster for a sequence of length
512). Lastly, we show that these stripped-down versions of decade-old RNNs
match the empirical performance of recent sequence models.
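
A minimal scalar sketch of the idea, assuming a GRU-style update whose gate and candidate depend only on the current input (the exact parameterization and the weight names below are assumptions for illustration). Once the gates no longer see the hidden state, the recurrence is linear in the hidden state, so every step can be written in closed form instead of being computed strictly one after another.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gate_coefficients(xs, w_z, b_z, w_h, b_h):
    """Per-step coefficients of h_t = a_t * h_{t-1} + b_t, computed from inputs only."""
    a, b = [], []
    for x in xs:
        z = sigmoid(w_z * x + b_z)   # update gate: no dependence on h_{t-1}
        candidate = w_h * x + b_h    # candidate state: no dependence on h_{t-1}
        a.append(1.0 - z)
        b.append(z * candidate)
    return a, b


def run_sequential(a, b, h0):
    """Reference implementation: step through time one element at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def run_unrolled(a, b, h0):
    """Closed form for each step; each entry is independent of the others."""
    out = []
    for t in range(len(a)):
        total = h0 * math.prod(a[: t + 1])
        for k in range(t + 1):
            total += math.prod(a[k + 1 : t + 1]) * b[k]
        out.append(total)
    return out


xs = [0.5, -1.0, 2.0, 0.0, 1.5]
a, b = gate_coefficients(xs, w_z=0.8, b_z=-0.1, w_h=1.2, b_h=0.3)
sequential = run_sequential(a, b, h0=0.0)
unrolled = run_unrolled(a, b, h0=0.0)
print(sequential)
print(unrolled)
```

The unrolled form is quadratic in sequence length and only serves to show that the steps decouple; a practical implementation would use a parallel prefix scan over the same coefficients.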
[LINK]
http://arxiv.org/abs/2410.01201v1
[DATE]
2024-10-02 11:06:49+08:00
[CATEGORIES]
cs.LG
Diverse Expected Improvement (DEI): Diverse Bayesian Optimization of Expensive Computer Simulators
[AUTHORS]
John Joshua Miller, Simon Mak, Benny Sun, Sai Ranjeet Narayanan, Suo Yang, Zongxuan Sun, Kenneth S. Kim, Chol-Bum Mike Kweon
[ABSTRACT]
The optimization of expensive black-box simulators arises in a myriad of
modern scientific and engineering applications. Bayesian optimization provides
an appealing solution, by leveraging a fitted surrogate model to guide the
selection of subsequent simulator evaluations. In practice, however, the
objective is often not to obtain a single good solution, but rather a
“basket” of good solutions from which users can choose for downstream
decision-making. This need arises in our motivating application for real-time
control of internal combustion engines for flight propulsion, where a diverse
set of control strategies is essential for stable flight control. There has
been little work on this front for Bayesian optimization. We thus propose a new
Diverse Expected Improvement (DEI) method that searches for diverse
“$\epsilon$-optimal” solutions: locally optimal solutions within a tolerance
level $\epsilon > 0$ from a global optimum. We show that DEI yields a
closed-form acquisition function under a Gaussian process surrogate model,
which facilitates efficient sequential queries via automatic differentiation.
This closed form further reveals a novel exploration-exploitation-diversity
trade-off, which incorporates the desired diversity property within the
well-known exploration-exploitation trade-off. We demonstrate the improvement
of DEI over existing methods in a suite of numerical experiments, then explore
the DEI in two applications on rover trajectory optimization and engine control
for flight propulsion.
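
For context, the sketch below evaluates the standard (non-diverse) expected improvement in closed form under a Gaussian posterior, written for maximization. It is the well-known baseline acquisition function, not the DEI acquisition proposed in the paper; the function name and example values are illustrative.

```python
import math


def expected_improvement(mean, std, best):
    """Closed-form E[max(f - best, 0)] for f ~ Normal(mean, std**2)."""
    if std <= 0.0:
        # Degenerate posterior: the improvement is known exactly.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


# Exploitation term (mean above the incumbent) versus exploration term (large std).
print(expected_improvement(mean=1.0, std=0.1, best=0.5))
print(expected_improvement(mean=0.0, std=1.0, best=0.5))
```

The two summands make the exploration-exploitation trade-off explicit, which is the trade-off the abstract says DEI extends with a diversity component.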
[LINK]
http://arxiv.org/abs/2410.01196v1
[DATE]
2024-10-02 10:59:42+08:00
[CATEGORIES]
cs.LG
Stochastic Gradient Descent with Adaptive Data
[AUTHORS]
Ethan Che, Jing Dong, Xin T. Tong
[ABSTRACT]
Stochastic gradient descent (SGD) is a powerful optimization technique that
is particularly useful in online learning scenarios. Its convergence analysis
is relatively well understood under the assumption that the data samples are
independent and identically distributed (iid). However, applying SGD to policy
optimization problems in operations research involves a distinct challenge: the
policy changes the environment and thereby affects the data used to update the
policy. The adaptively generated data stream involves samples that are
non-stationary, no longer independent from each other, and affected by previous
decisions. The influence of previous decisions on the data generated introduces
bias in the gradient estimate, which presents a potential source of instability
for online learning not present in the iid case. In this paper, we introduce
simple criteria for the adaptively generated data stream to guarantee the
convergence of SGD. We show that the convergence speed of SGD with adaptive
data is largely similar to the classical iid setting, as long as the mixing
time of the policy-induced dynamics is factored in. Our Lyapunov-function
analysis allows one to translate existing stability analysis of stochastic
systems studied in operations research into convergence rates for SGD, and we
demonstrate this for queueing and inventory management problems. We also
showcase how our result can be applied to study the sample complexity of an
actor-critic policy gradient algorithm.
[LINK]
http://arxiv.org/abs/2410.01195v1
[DATE]
2024-10-02 10:58:32+08:00
[CATEGORIES]
cs.LG
Ensemble and Mixture-of-Experts DeepONets For Operator Learning
[AUTHORS]
Ramansh Sharma, Varun Shankar
[ABSTRACT]
We present a novel deep operator network (DeepONet) architecture for operator
learning, the ensemble DeepONet, that allows for enriching the trunk network of
a single DeepONet with multiple distinct trunk networks. This trunk enrichment
allows for greater expressivity and generalization capabilities over a range of
operator learning problems. We also present a spatial mixture-of-experts (MoE)
DeepONet trunk network architecture that utilizes a partition-of-unity (PoU)
approximation to promote spatial locality and model sparsity in the operator
learning problem. We first prove that both the ensemble and PoU-MoE DeepONets
are universal approximators. We then demonstrate that ensemble DeepONets
containing a trunk ensemble of a standard trunk, the PoU-MoE trunk, and/or a
proper orthogonal decomposition (POD) trunk can achieve 2-4x lower relative
$\ell_2$ errors than standard DeepONets and POD-DeepONets on both standard and
challenging new operator learning problems involving partial differential
equations (PDEs) in two and three dimensions. Our new PoU-MoE formulation
provides a natural way to incorporate spatial locality and model sparsity into
any neural network architecture, while our new ensemble DeepONet provides a
powerful and general framework for incorporating basis enrichment in scientific
machine learning architectures for operator learning.
[LINK]
http://arxiv.org/abs/2405.11907v4
[DATE]
2024-10-02 10:44:55+08:00
[CATEGORIES]
cs.LG
Linear Projections of Teacher Embeddings for Few-Class Distillation
[AUTHORS]
Noel Loo, Fotis Iliopoulos, Wei Hu, Erik Vee
[ABSTRACT]
Knowledge Distillation (KD) has emerged as a promising approach for
transferring knowledge from a larger, more complex teacher model to a smaller
student model. Traditionally, KD involves training the student to mimic the
teacher’s output probabilities, while more advanced techniques have explored
guiding the student to adopt the teacher’s internal representations. Despite
its widespread success, the performance of KD in binary classification and
few-class problems has been less satisfactory. This is because the information
about the teacher model’s generalization patterns scales directly with the
number of classes. Moreover, several sophisticated distillation methods may not
be universally applicable or effective for data types beyond Computer Vision.
Consequently, effective distillation techniques remain elusive for a range of
key real-world applications, such as sentiment analysis, search query
understanding, and advertisement-query relevance assessment. Taking these
observations into account, we introduce a novel method for distilling knowledge
from the teacher’s model representations, which we term Learning Embedding
Linear Projections (LELP). Inspired by recent findings about the structure of
final-layer representations, LELP works by identifying informative linear
subspaces in the teacher’s embedding space, and splitting them into
pseudo-subclasses. The student model is then trained to replicate these
pseudo-classes. Our experimental evaluation on large-scale NLP benchmarks like
Amazon Reviews and Sentiment140 demonstrates that LELP is consistently
competitive with, and typically superior to, existing state-of-the-art
distillation algorithms for binary and few-class problems, where most KD
methods suffer.
[LINK]
http://arxiv.org/abs/2409.20449v2
[DATE]
2024-10-02 10:36:30+08:00
[CATEGORIES]
cs.LG
Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
[AUTHORS]
Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, Yan Yan
[LINK]
http://arxiv.org/abs/2409.12963v2
[DATE]
2024-10-02 09:56:08+08:00
[CATEGORIES]
cs.LG
ElastoGen: 4D Generative Elastodynamics
[AUTHORS]
Yutao Feng, Yintong Shang, Xiang Feng, Lei Lan, Shandian Zhe, Tianjia Shao, Hongzhi Wu, Kun Zhou, Hao Su, Chenfanfu Jiang, Yin Yang
[ABSTRACT]
We present ElastoGen, a knowledge-driven AI model that generates physically
accurate 4D elastodynamics. Unlike deep models that learn from video- or
image-based observations, ElastoGen leverages the principles of physics and
learns from established mathematical and optimization procedures. The core idea
of ElastoGen is converting the differential equation, corresponding to the
nonlinear force equilibrium, into a series of iterative local convolution-like
operations, which naturally fit deep architectures. We carefully build our
network module following this overarching design philosophy. ElastoGen is much
more lightweight in terms of both training requirements and network scale than
deep generative models. Because of its alignment with actual physical
procedures, ElastoGen efficiently generates accurate dynamics for a wide range
of hyperelastic materials and can be easily integrated with upstream and
downstream deep modules to enable end-to-end 4D generation.
[LINK]
http://arxiv.org/abs/2405.15056v2
[DATE]
2024-10-02 09:49:56+08:00
[CATEGORIES]
cs.LG
Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models
[AUTHORS]
Anshuman Chhabra, Bo Li, Jian Chen, Prasant Mohapatra, Hongfu Liu
[ABSTRACT]
A core data-centric learning challenge is the identification of training
samples that are detrimental to model performance. Influence functions serve as
a prominent tool for this task and offer a robust framework for assessing
training data influence on model predictions. Despite their widespread use,
the high computational cost associated with calculating the inverse of the
Hessian matrix poses constraints, particularly when analyzing large deep
models. In this paper, we establish a bridge between identifying detrimental
training samples via influence functions and outlier gradient detection. This
transformation not only presents a straightforward and Hessian-free formulation
but also provides insights into the role of the gradient in sample impact.
Through systematic empirical evaluations, we first validate the hypothesis of
our proposed outlier gradient analysis approach on synthetic datasets. We then
demonstrate its effectiveness in detecting mislabeled samples in vision models
and selecting data samples for improving performance of natural language
processing transformer models. We also extend its use to influential sample
identification for fine-tuning Large Language Models.
[LINK]
http://arxiv.org/abs/2405.03869v4
[DATE]
2024-10-02 09:38:15+08:00
[CATEGORIES]
cs.LG
A Deep Learning Approach for Imbalanced Tabular Data in Advertiser Prospecting: A Case of Direct Mail Prospecting
[AUTHORS]
Sadegh Farhang, William Hayes, Nick Murphy, Jonathan Neddenriep, Nicholas Tyris
[ABSTRACT]
Acquiring new customers is a vital process for growing businesses.
Prospecting is the process of identifying and marketing to potential customers
using methods ranging from online digital advertising, linear television, out
of home, and direct mail. Despite the rapid growth in digital advertising
(particularly social and search), research shows that direct mail remains one
of the most effective ways to acquire new customers. However, there is a
notable gap in the application of modern machine learning techniques within the
direct mail space, which could significantly enhance targeting and
personalization strategies. Methodologies deployed through direct mail are the
focus of this paper.
In this paper, we propose a supervised learning approach for identifying new
customers, i.e., prospecting, which comprises how we define labels for our data
and rank potential customers. The casting of prospecting to a supervised
learning problem leads to imbalanced tabular data. The current state-of-the-art
approach for tabular data is an ensemble of tree-based methods like random
forest and XGBoost. We propose a deep learning framework for tabular imbalanced
data. This framework is designed to tackle large imbalanced datasets with a vast
number of numerical and categorical features. Our framework comprises two
components: an autoencoder and a feed-forward neural network. We demonstrate
the effectiveness of our framework through a transparent real-world case study
of prospecting in direct mail advertising. Our results show that our proposed
deep learning framework outperforms the state-of-the-art tree-based random
forest approach when applied in the real world.
[COMMENTS]
Third KDD Workshop on End-to-End Customer Journey Optimization
[LINK]
http://arxiv.org/abs/2410.01157v1
[DATE]
2024-10-02 09:19:40+08:00
[CATEGORIES]
cs.LG
Text2PDE: Latent Diffusion Models for Accessible Physics Simulation
[AUTHORS]
Anthony Zhou, Zijie Li, Michael Schneier, John R Buchanan Jr, Amir Barati Farimani
[ABSTRACT]
Recent advances in deep learning have inspired numerous works on data-driven
solutions to partial differential equation (PDE) problems. These neural PDE
solvers can often be much faster than their numerical counterparts; however,
each presents its unique limitations and generally balances training cost,
numerical accuracy, and ease of applicability to different problem setups. To
address these limitations, we introduce several methods to apply latent
diffusion models to physics simulation. Firstly, we introduce a mesh
autoencoder to compress arbitrarily discretized PDE data, allowing for
efficient diffusion training across various physics. Furthermore, we
investigate full spatio-temporal solution generation to mitigate autoregressive
error accumulation. Lastly, we investigate conditioning on initial physical
quantities, as well as conditioning solely on a text prompt to introduce
text2PDE generation. We show that language can be a compact, interpretable, and
accurate modality for generating physics simulations, paving the way for more
usable and accessible PDE solvers. Through experiments on both uniform and
structured grids, we show that the proposed approach is competitive with
current neural PDE solvers in both accuracy and efficiency, with promising
scaling behavior up to $\sim$3 billion parameters. By introducing a scalable,
accurate, and usable physics simulator, we hope to bring neural PDE solvers
closer to practical use.
[COMMENTS]
25 pages, 7 figures
[LINK]
http://arxiv.org/abs/2410.01153v1
[DATE]
2024-10-02 09:09:47+08:00
[CATEGORIES]
cs.LG
Recovering Manifold Structure Using Ollivier-Ricci Curvature
[AUTHORS]
Tristan Luca Saidi, Abigail Hickok, Andrew J. Blumberg
[LINK]
http://arxiv.org/abs/2410.01149v1
[DATE]
2024-10-02 09:00:30+08:00
[CATEGORIES]
cs.LG
ProxiMix: Enhancing Fairness with Proximity Samples in Subgroups
[AUTHORS]
Jingyu Hu, Jun Hong, Mengnan Du, Weiru Liu
[ABSTRACT]
Many bias mitigation methods have been developed for addressing fairness
issues in machine learning. We found that using linear mixup alone, a data
augmentation technique, for bias mitigation, can still retain biases present in
dataset labels. Research presented in this paper aims to address this issue by
proposing a novel pre-processing strategy in which both an existing mixup
method and our new bias mitigation algorithm can be utilized to improve the
generation of labels of augmented samples, which are proximity aware.
Specifically, we proposed ProxiMix which keeps both pairwise and proximity
relationships for fairer data augmentation. We conducted thorough experiments
with three datasets, three ML models, and different hyperparameter settings.
Our experimental results showed the effectiveness of ProxiMix from both
fairness of predictions and fairness of recourse perspectives.
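
For reference, plain linear mixup can be sketched as below. This is the standard augmentation the abstract refers to, not ProxiMix itself; the function name and the toy inputs are illustrative. The mixed label is a convex combination of the two original labels, so any bias already present in those labels carries over to the augmented sample.

```python
def linear_mixup(x_i, y_i, x_j, y_j, lam):
    """Blend two samples and their labels with mixing weight lam in [0, 1]."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = lam * y_i + (1.0 - lam) * y_j
    return x_mix, y_mix


# Toy example: two 2-D samples with binary labels.
x_mix, y_mix = linear_mixup([1.0, 0.0], 1.0, [0.0, 1.0], 0.0, lam=0.25)
print(x_mix, y_mix)
```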
[LINK]
http://arxiv.org/abs/2410.01145v1
[DATE]
2024-10-02 08:47:03+08:00
[CATEGORIES]
cs.LG
Affordance-Guided Reinforcement Learning via Visual Prompting
[AUTHORS]
Olivia Y. Lee, Annie Xie, Kuan Fang, Karl Pertsch, Chelsea Finn
[ABSTRACT]
Robots equipped with reinforcement learning (RL) have the potential to learn
a wide range of skills solely from a reward signal. However, obtaining a robust
and dense reward signal for general manipulation tasks remains a challenge.
Existing learning-based approaches require significant data, such as human
demonstrations of success and failure, to learn task-specific reward functions.
Recently, there has also been growing adoption of large multi-modal foundation
models for robotics that can perform visual reasoning in physical contexts and
generate coarse robot motions for manipulation tasks. Motivated by this range
of capability, in this work, we present Keypoint-based Affordance Guidance for
Improvements (KAGI), a method leveraging rewards shaped by vision-language
models (VLMs) for autonomous RL. State-of-the-art VLMs have demonstrated
impressive reasoning about affordances through keypoints in zero-shot, and we
use these to define dense rewards that guide autonomous robotic learning. On
real-world manipulation tasks specified by natural language descriptions, KAGI
improves the sample efficiency of autonomous RL and enables successful task
completion in 20K online fine-tuning steps. Additionally, we demonstrate the
robustness of KAGI to reductions in the number of in-domain demonstrations used
for pre-training, reaching similar performance in 35K online fine-tuning steps.
Project website: https://sites.google.com/view/affordance-guided-rl
[COMMENTS]
8 pages, 6 figures. Robotics: Science and Systems (RSS) 2024, Task
Specification for General-Purpose Intelligent Robots & Lifelong Robot
Learning Workshops
[LINK]
http://arxiv.org/abs/2407.10341v3
[DATE]
2024-10-02 08:40:38+08:00
[CATEGORIES]
cs.LG
Explain Like I’m Five: Using LLMs to Improve PDE Surrogate Models with Text
[AUTHORS]
Cooper Lorsung, Amir Barati Farimani
[ABSTRACT]
Solving Partial Differential Equations (PDEs) is ubiquitous in science and
engineering. Computational complexity and difficulty in writing numerical
solvers have motivated the development of machine learning techniques to
generate solutions quickly. Many existing methods are purely data driven,
relying solely on numerical solution fields, rather than known system
information such as boundary conditions and governing equations. However, the
recent rise in popularity of Large Language Models (LLMs) has enabled easy
integration of text in multimodal machine learning models. In this work, we use
pretrained LLMs to integrate various amounts of known system information into PDE
learning. Our multimodal approach significantly outperforms our baseline model,
FactFormer, in both next-step prediction and autoregressive rollout performance
on the 2D Heat, Burgers, Navier-Stokes, and Shallow Water equations. Further
analysis shows that pretrained LLMs provide a highly structured latent space that
is consistent with the amount of system information provided through text.
[COMMENTS]
22 pages, 15 figures, 7 tables
[LINK]
http://arxiv.org/abs/2410.01137v1
[DATE]
2024-10-02 08:19:20+08:00
[CATEGORIES]
cs.LG
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
[AUTHORS]
Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, Boris Ginsburg
[ABSTRACT]
We propose a novel neural network architecture, the normalized Transformer
(nGPT) with representation learning on the hypersphere. In nGPT, all vectors
forming the embeddings, MLP, attention matrices and hidden states are unit norm
normalized. The input stream of tokens travels on the surface of a hypersphere,
with each layer contributing a displacement towards the target output
predictions. These displacements are defined by the MLP and attention blocks,
whose vector components also reside on the same hypersphere. Experiments show
that nGPT learns much faster, reducing the number of training steps required to
achieve the same accuracy by a factor of 4 to 20, depending on the sequence
length.
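
A rough sketch of what moving along the hypersphere could look like, assuming the hidden state is nudged toward a normalized block output and then projected back to unit norm. The update rule, the step size `alpha`, and the function names are assumptions made for illustration; the abstract does not spell out the exact form.

```python
import math


def normalize(vector):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def hypersphere_step(hidden, block_output, alpha):
    """Displace the hidden state toward the block's suggestion, then renormalize."""
    target = normalize(block_output)
    moved = [h + alpha * (t - h) for h, t in zip(hidden, target)]
    return normalize(moved)


hidden = normalize([3.0, 4.0])  # unit vector [0.6, 0.8]
hidden_next = hypersphere_step(hidden, block_output=[0.0, 2.0], alpha=0.5)
print(hidden_next)
```

Every intermediate state keeps unit norm, so the step size alone controls how far a layer moves the representation.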
[LINK]
http://arxiv.org/abs/2410.01131v1
[DATE]
2024-10-02 07:50:09+08:00
[CATEGORIES]
cs.LG
Broadening Target Distributions for Accelerated Diffusion Models via a Novel Analysis Approach
[AUTHORS]
Yuchen Liang, Peizhong Ju, Yingbin Liang, Ness Shroff
[ABSTRACT]
Accelerated diffusion models hold the potential to significantly enhance the
efficiency of standard diffusion processes. Theoretically, these models have
been shown to achieve faster convergence rates than the standard
$\mathcal{O}(1/\epsilon^2)$ rate of vanilla diffusion models, where $\epsilon$ denotes the
target accuracy. However, current theoretical studies have established the
acceleration advantage only for restrictive target distribution classes, such
as those with smoothness conditions imposed along the entire sampling path or
with bounded support. In this work, we significantly broaden the target
distribution classes with a novel accelerated stochastic DDPM sampler. In
particular, we show that it achieves accelerated performance for three broad
distribution classes not considered before. Our first class relies on the
smoothness condition posed only to the target density $q_0$, which is far more
relaxed than the existing smoothness conditions posed to all $q_t$ along the
entire sampling path. Our second class requires only a finite second moment
condition, allowing for a much wider class of target distributions than the
existing finite-support condition. Our third class is Gaussian mixture, for
which our result establishes the first acceleration guarantee. Moreover, among
accelerated DDPM type samplers, our results specialized for bounded-support
distributions show an improved dependency on the data dimension $d$. Our
analysis introduces a novel technique for establishing performance guarantees
via constructing a tilting factor representation of the convergence error and
utilizing Tweedie’s formula to handle Taylor expansion terms. This new
analytical framework may be of independent interest.
[LINK]
http://arxiv.org/abs/2402.13901v3
[DATE]
2024-10-02 07:39:30+08:00
[CATEGORIES]
cs.LG
DropEdge not Foolproof: Effective Augmentation Method for Signed Graph Neural Networks
[AUTHORS]
Zeyu Zhang, Lu Li, Shuyan Wan, Sijie Wang, Zhiyi Wang, Zhiyuan Lu, Dong Hao, Wanli Li
[ABSTRACT]
The paper discusses signed graphs, which model friendly or antagonistic
relationships using edges marked with positive or negative signs, focusing on
the task of link sign prediction. While Signed Graph Neural Networks (SGNNs)
have advanced, they face challenges like graph sparsity and unbalanced
triangles. The authors propose using data augmentation (DA) techniques to
address these issues, although many existing methods are not suitable for
signed graphs due to a lack of side information. They highlight that the random
DropEdge method, a rare DA approach applicable to signed graphs, does not
enhance link sign prediction performance. In response, they introduce the
Signed Graph Augmentation (SGA) framework, which includes a structure
augmentation module to identify candidate edges and a strategy for selecting
beneficial candidates, ultimately improving SGNN training. Experimental results
show that SGA significantly boosts the performance of SGNN models, with a
notable 32.3% improvement in F1-micro for SGCN on the Slashdot dataset.
[COMMENTS]
NeurIPS 2024
[LINK]
http://arxiv.org/abs/2409.19620v2
[DATE]
2024-10-02 07:15:48+08:00
[CATEGORIES]
cs.LG
FairCoT: Enhancing Fairness in Diffusion Models via Chain of Thought Reasoning of Multimodal Language Models
[AUTHORS]
Zahraa Al Sahili, Ioannis Patras, Matthew Purver
[ABSTRACT]
In the domain of text-to-image generative models, biases inherent in training
datasets often propagate into generated content, posing significant ethical
challenges, particularly in socially sensitive contexts. We introduce FairCoT,
a novel framework that enhances fairness in diffusion models through
Chain-of-Thought (CoT) reasoning within multimodal generative large language
models (LLMs). FairCoT employs iterative CoT refinement and attire-based
attribute prediction to systematically mitigate biases, ensuring diverse and
equitable representation in generated images. By integrating iterative
reasoning processes, FairCoT addresses the limitations of zero-shot CoT in
sensitive scenarios, balancing creativity with ethical responsibility.
Experimental evaluations across multiple models, including DALL-E and various
Stable Diffusion variants, demonstrate that FairCoT significantly improves
fairness and diversity metrics without compromising image quality or relevance.
Our approach advances ethical AI practices in generative modeling, promoting
socially responsible content generation and setting new standards for fairness
in AI-generated imagery.
[LINK]
http://arxiv.org/abs/2406.09070v2
[DATE]
2024-10-02 06:45:20+08:00
[CATEGORIES]
cs.LG
Almost Free: Self-concordance in Natural Exponential Families and an Application to Bandits
[AUTHORS]
Shuai Liu, Alex Ayoub, Flore Sentenac, Xiaoqi Tan, Csaba Szepesvári
[ABSTRACT]
We prove that single-parameter natural exponential families with
subexponential tails are self-concordant with polynomial-sized parameters. For
subgaussian natural exponential families we establish an exact characterization
of the growth rate of the self-concordance parameter. Applying these findings
to bandits allows us to fill gaps in the literature: We show that optimistic
algorithms for generalized linear bandits enjoy regret bounds that are both
second-order (scale with the variance of the optimal arm’s reward distribution)
and free of an exponential dependence on the bound of the problem parameter in
the leading term. To the best of our knowledge, ours is the first regret bound
for generalized linear bandits with subexponential tails, broadening the class
of problems to include Poisson, exponential and gamma bandits.
[COMMENTS]
Neural Information Processing Systems (NeurIPS) 2024
[LINK]
http://arxiv.org/abs/2410.01112v1
[DATE]
2024-10-02 06:42:19+08:00
[CATEGORIES]
cs.LG
softmax is not enough (for sharp out-of-distribution)
[AUTHORS]
Petar Veličković, Christos Perivolaropoulos, Federico Barbero, Razvan Pascanu
[ABSTRACT]
A key property of reasoning systems is the ability to make sharp decisions on
their input data. For contemporary AI systems, a key carrier of sharp behaviour
is the softmax function, with its capability to perform differentiable
query-key lookups. It is a common belief that the predictive power of networks
leveraging softmax arises from “circuits” which sharply perform certain kinds
of computations consistently across many diverse inputs. However, for these
circuits to be robust, they would need to generalise well to arbitrary valid
inputs. In this paper, we dispel this myth: even for tasks as simple as finding
the maximum key, any learned circuitry must disperse as the number of items
grows at test time. We attribute this to a fundamental limitation of the
softmax function to robustly approximate sharp functions, prove this phenomenon
theoretically, and propose adaptive temperature as an ad-hoc technique for
improving the sharpness of softmax at inference time.
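
The dispersion effect can be seen in a small numeric sketch, assuming one item whose logit exceeds all others by a fixed, bounded gap (the gap of 5 and the temperature of 0.1 are arbitrary illustrative values). As the number of items grows, the weight on the maximum shrinks even though the gap is unchanged. The fixed low temperature at the end is a hand-picked stand-in and is not the adaptive temperature scheme proposed in the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weight_on_max(num_items, gap, temperature=1.0):
    """Softmax weight of a single item that leads all others by `gap`."""
    logits = [gap] + [0.0] * (num_items - 1)
    return softmax(logits, temperature)[0]


small = weight_on_max(num_items=10, gap=5.0)
large = weight_on_max(num_items=10000, gap=5.0)
sharpened = weight_on_max(num_items=10000, gap=5.0, temperature=0.1)
print(small, large, sharpened)
```

With a bounded gap, the leading weight equals e^gap / (e^gap + n - 1), which tends to zero as n grows; lowering the temperature widens the effective gap and restores sharpness for a given n.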
[COMMENTS]
Comments welcome. 14 pages, 7 figures
[LINK]
http://arxiv.org/abs/2410.01104v1
[DATE]
2024-10-02 06:22:35+08:00
[CATEGORIES]
cs.LG
Exploiting Structure in Offline Multi-Agent RL: The Benefits of Low Interaction Rank
[AUTHORS]
Wenhao Zhan, Scott Fujimoto, Zheqing Zhu, Jason D. Lee, Daniel R. Jiang, Yonathan Efroni
[ABSTRACT]
We study the problem of learning an approximate equilibrium in the offline
multi-agent reinforcement learning (MARL) setting. We introduce a structural
assumption – the interaction rank – and establish that functions with low
interaction rank are significantly more robust to distribution shift compared
to general ones. Leveraging this observation, we demonstrate that utilizing
function classes with low interaction rank, when combined with regularization
and no-regret learning, admits decentralized, computationally and statistically
efficient learning in offline MARL. Our theoretical results are complemented by
experiments that showcase the potential of critic architectures with low
interaction rank in offline MARL, contrasting with commonly used single-agent
value decomposition architectures.
[LINK]
http://arxiv.org/abs/2410.01101v1
[DATE]
2024-10-02 06:16:22+08:00
[CATEGORIES]
cs.LG
High-dimensional logistic regression with missing data: Imputation, regularization, and universality
[AUTHORS]
Kabir Aladin Verchand, Andrea Montanari
[ABSTRACT]
We study high-dimensional, ridge-regularized logistic regression in a setting
in which the covariates may be missing or corrupted by additive noise. When
both the covariates and the additive corruptions are independent and normally
distributed, we provide exact characterizations of both the prediction error as
well as the estimation error. Moreover, we show that these characterizations
are universal: as long as the entries of the data matrix satisfy a set of
independence and moment conditions, our guarantees continue to hold.
Universality, in turn, enables the detailed study of several imputation-based
strategies when the covariates are missing completely at random. We ground our
study by comparing the performance of these strategies with the conjectured
performance – stemming from replica theory in statistical physics – of the
Bayes optimal procedure. Our analysis yields several insights including: (i) a
distinction between single imputation and a simple variant of multiple
imputation and (ii) that adding a simple ridge regularization term to
single-imputed logistic regression can yield an estimator whose prediction
error is nearly indistinguishable from the Bayes optimal prediction error. We
supplement our findings with extensive numerical experiments.
[LINK]
http://arxiv.org/abs/2410.01093v1
[DATE]
2024-10-02 05:41:21+08:00
[CATEGORIES]
cs.LG
Efficient and Private Marginal Reconstruction with Local Non-Negativity
[AUTHORS]
Brett Mullins, Miguel Fuentes, Yingtai Xiao, Daniel Kifer, Cameron Musco, Daniel Sheldon
[ABSTRACT]
Differential privacy is the dominant standard for formal and quantifiable
privacy and has been used in major deployments that impact millions of people.
Many differentially private algorithms for query release and synthetic data
contain steps that reconstruct answers to queries from answers to other queries
measured by the mechanism. Reconstruction is an important subproblem for such
mechanisms to economize the privacy budget, minimize error on reconstructed
answers, and allow for scalability to high-dimensional datasets. In this paper,
we introduce a principled and efficient postprocessing method ReM
(Residuals-to-Marginals) for reconstructing answers to marginal queries. Our
method builds on recent work on efficient mechanisms for marginal query
release, based on making measurements using a residual query basis that admits
efficient pseudoinversion, which is an important primitive used in
reconstruction. An extension GReM-LNN (Gaussian Residuals-to-Marginals with
Local Non-negativity) reconstructs marginals under Gaussian noise satisfying
consistency and non-negativity, which often reduces error on reconstructed
answers. We demonstrate the utility of ReM and GReM-LNN by applying them to
improve existing private query answering mechanisms: ResidualPlanner and MWEM.
[COMMENTS]
To appear at NeurIPS 2024
[LINK]
http://arxiv.org/abs/2410.01091v1
[DATE]
2024-10-02 05:39:28+08:00
[CATEGORIES]
cs.LG
Extracting Memorized Training Data via Decomposition
[AUTHORS]
Ellen Su, Anu Vellore, Amy Chang, Raffaele Mura, Blaine Nelson, Paul Kassianik, Amin Karbasi
[ABSTRACT]
The widespread use of Large Language Models (LLMs) in society creates new
information security challenges for developers, organizations, and end-users
alike. LLMs are trained on large volumes of data, and their susceptibility to
revealing the exact contents of the source training datasets poses security and
safety risks. Although current alignment procedures restrict common risky
behaviors, they do not completely prevent LLMs from leaking data. Prior work
demonstrated that LLMs may be tricked into divulging training data by using
out-of-distribution queries or adversarial techniques. In this paper, we
demonstrate a simple, query-based decompositional method to extract news
articles from two frontier LLMs. We use instruction decomposition techniques to
incrementally extract fragments of training data. Out of 3723 New York Times
articles, we extract at least one verbatim sentence from 73 articles, and over
20% of verbatim sentences from 6 articles. Our analysis demonstrates that this
method successfully induces the LLM to generate texts that are reliable
reproductions of news articles, meaning that they likely originate from the
source training dataset. This method is simple, generalizable, and does not
fine-tune or change the production model. If replicable at scale, this training
data extraction methodology could expose new LLM security and safety
vulnerabilities, including privacy risks and unauthorized data leaks. These
implications require careful consideration from model development to its
end-use.
[LINK]
http://arxiv.org/abs/2409.12367v2
[DATE]
2024-10-02 05:34:42+08:00
[CATEGORIES]
cs.LG
An Introduction to Deep Survival Analysis Models for Predicting Time-to-Event Outcomes
[AUTHORS]
George H. Chen
[ABSTRACT]
Many applications involve reasoning about time durations before a critical
event happens – also called time-to-event outcomes. When will a customer cancel
a subscription, a coma patient wake up, or a convicted criminal reoffend?
Time-to-event outcomes have been studied extensively within the field of
survival analysis primarily by the statistical, medical, and reliability
engineering communities, with textbooks already available in the 1970s and
’80s. This monograph aims to provide a reasonably self-contained modern
introduction to survival analysis. We focus on predicting time-to-event
outcomes at the individual data point level with the help of neural networks.
Our goal is to provide the reader with a working understanding of precisely
what the basic time-to-event prediction problem is, how it differs from
standard regression and classification, and how key “design patterns” have been
used time after time to derive new time-to-event prediction models, from
classical methods like the Cox proportional hazards model to modern deep
learning approaches such as deep kernel Kaplan-Meier estimators and neural
ordinary differential equation models. We further delve into two extensions of
the basic time-to-event prediction setup: predicting which of several critical
events will happen first along with the time until this earliest event happens
(the competing risks setting), and predicting time-to-event outcomes given a
time series that grows in length over time (the dynamic setting). We conclude
with a discussion of a variety of topics such as fairness, causal reasoning,
interpretability, and statistical guarantees. Our monograph comes with an
accompanying code repository that implements every model and evaluation metric
that we cover in detail.
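As a point of reference for how time-to-event prediction differs from ordinary regression, the classical Kaplan-Meier estimator handles censored observations explicitly. The sketch below is a plain-Python illustration written for this listing, not code from the monograph's repository; the function name and input layout are assumptions.

```python
def kaplan_meier(times, events):
    # times: observed durations; events: 1 if the event was observed, 0 if censored.
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same observed time.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit update: multiply by the fraction surviving past t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without lowering the curve.
        n_at_risk -= removed
    return curve

curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Censored subjects contribute to the at-risk count until their censoring time, which is the key difference from simply regressing on the observed durations.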
[COMMENTS]
Code is available at https://github.com/georgehc/survival-intro
[LINK]
http://arxiv.org/abs/2410.01086v1
[DATE]
2024-10-02 05:29:17+08:00
[CATEGORIES]
cs.LG
Inferring Kernel $ε$-Machines: Discovering Structure in Complex Systems
[AUTHORS]
Alexandra M. Jurgens, Nicolas Brodu
[ABSTRACT]
Previously, we showed that computational mechanics’ causal states –
predictively-equivalent trajectory classes for a stochastic dynamical system –
can be cast into a reproducing kernel Hilbert space. The result is a
widely-applicable method that infers causal structure directly from very
different kinds of observations and systems. Here, we expand this method to
explicitly introduce the causal diffusion components it produces. These encode
the kernel causal-state estimates as a set of coordinates in a reduced
dimension space. We show how each component extracts predictive features from
data and demonstrate their application on four examples: first, a simple
pendulum – an exactly solvable system; second, a molecular-dynamic trajectory
of $n$-butane – a high-dimensional system with a well-studied energy
landscape; third, the monthly sunspot sequence – the longest-running available
time series of direct observations; and fourth, multi-year observations of an
active crop field – a set of heterogeneous observations of the same ecosystem
taken for over a decade. In this way, we demonstrate that the empirical kernel
causal-states algorithm robustly discovers predictive structures for systems
with widely varying dimensionality and stochasticity.
[LINK]
http://arxiv.org/abs/2410.01076v1
[DATE]
2024-10-02 05:14:06+08:00
[CATEGORIES]
cs.LG
Convergent Privacy Loss of Noisy-SGD without Convexity and Smoothness
[AUTHORS]
Eli Chien, Pan Li
[ABSTRACT]
We study the Differential Privacy (DP) guarantee of hidden-state Noisy-SGD
algorithms over a bounded domain. Standard privacy analysis for Noisy-SGD
assumes all internal states are revealed, which leads to a divergent Rényi DP
bound with respect to the number of iterations. Ye & Shokri (2022) and
Altschuler & Talwar (2022) proved convergent bounds for smooth (strongly)
convex losses, and raise open questions about whether these assumptions can be
relaxed. We provide positive answers by proving a convergent Rényi DP bound for
non-convex non-smooth losses, where we show that requiring losses to have a
Hölder continuous gradient is sufficient. We also provide a strictly better
privacy bound compared to state-of-the-art results for smooth strongly convex
losses. Our analysis relies on the improvement of shifted divergence analysis
in multiple aspects, including forward Wasserstein distance tracking,
identifying the optimal shifts allocation, and the Hölder reduction lemma. Our
results further elucidate the benefit of hidden-state analysis for DP and its
applicability.
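The hidden-state setting over a bounded domain can be pictured with a toy loop: each step takes a gradient step, adds Gaussian noise, projects back onto a ball, and only the last iterate is released. This is a minimal sketch under assumptions (a full gradient stands in for the minibatch gradient; the noise scaling, names, and the quadratic loss are illustrative), not the algorithm variant analyzed in the paper.

```python
import numpy as np

def project_to_ball(w, radius):
    # Euclidean projection onto the ball of the given radius (the bounded domain).
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def noisy_sgd(grad_fn, w0, steps, lr, noise_std, radius, rng):
    w = project_to_ball(np.asarray(w0, dtype=float), radius)
    for _ in range(steps):
        noise = rng.normal(0.0, noise_std, size=w.shape)
        w = project_to_ball(w - lr * grad_fn(w) + noise, radius)
    # Hidden-state setting: intermediate iterates are never returned.
    return w

rng = np.random.default_rng(0)
target = np.array([0.5, -0.25])  # minimizer of the toy loss 0.5 * ||w - target||^2
w_final = noisy_sgd(lambda w: w - target, np.zeros(2), steps=200,
                    lr=0.1, noise_std=0.01, radius=1.0, rng=rng)
```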
[LINK]
http://arxiv.org/abs/2410.01068v1
[DATE]
2024-10-02 04:52:08+08:00
[CATEGORIES]
cs.LG
Structure-Preserving Operator Learning
[AUTHORS]
Nacime Bouziani, Nicolas Boullé
[ABSTRACT]
Learning complex dynamics driven by partial differential equations directly
from data holds great promise for fast and accurate simulations of complex
physical systems. In most cases, this problem can be formulated as an operator
learning task, where one aims to learn the operator representing the physics of
interest, which entails discretization of the continuous system. However,
preserving key continuous properties at the discrete level, such as boundary
conditions, and addressing physical systems with complex geometries is
challenging for most existing approaches. We introduce a family of operator
learning architectures, structure-preserving operator networks (SPONs), that
preserve key mathematical and physical properties of the continuous
system by leveraging finite element (FE) discretizations of the input-output
spaces. SPONs are encode-process-decode architectures that are end-to-end
differentiable, where the encoder and decoder follow from the discretizations
of the input-output spaces. SPONs can operate on complex geometries, enforce
certain boundary conditions exactly, and offer theoretical guarantees. Our
framework provides a flexible way of devising structure-preserving
architectures tailored to specific applications, and offers an explicit
trade-off between performance and efficiency, all thanks to the FE
discretization of the input-output spaces. Additionally, we introduce a
multigrid-inspired SPON architecture that yields improved performance at higher
efficiency. Finally, we release software to automate the design and training
of SPON architectures.
[LINK]
http://arxiv.org/abs/2410.01065v1
[DATE]
2024-10-02 04:46:16+08:00
[CATEGORIES]
cs.LG
Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation
[AUTHORS]
Quanting Xie, So Yeon Min, Tianyi Zhang, Aarav Bajaj, Ruslan Salakhutdinov, Matthew Johnson-Roberson, Yonatan Bisk
[ABSTRACT]
There is no limit to how much a robot might explore and learn, but all of
that knowledge needs to be searchable and actionable. Within language research,
retrieval augmented generation (RAG) has become the workhorse of large-scale
non-parametric knowledge; however, existing techniques do not directly transfer
to the embodied domain, which is multimodal, where data is highly correlated,
and where perception requires abstraction.
To address these challenges, we introduce Embodied-RAG, a framework that
enhances the foundational model of an embodied agent with a non-parametric
memory system capable of autonomously constructing hierarchical knowledge for
both navigation and language generation. Embodied-RAG handles a full range of
spatial and semantic resolutions across diverse environments and query types,
whether for a specific object or a holistic description of ambiance. At its
core, Embodied-RAG’s memory is structured as a semantic forest, storing
language descriptions at varying levels of detail. This hierarchical
organization allows the system to efficiently generate context-sensitive
outputs across different robotic platforms. We demonstrate that Embodied-RAG
effectively bridges RAG to the robotics domain, successfully handling over 200
explanation and navigation queries across 19 environments, highlighting its
promise as a general-purpose non-parametric system for embodied agents.
[COMMENTS]
Web: https://quanting-xie.github.io/Embodied-RAG-web/
[LINK]
http://arxiv.org/abs/2409.18313v2
[DATE]
2024-10-02 04:32:17+08:00
[CATEGORIES]
cs.LG
Uncertainty Modelling and Robust Observer Synthesis using the Koopman Operator
[AUTHORS]
Steven Dahdah, James Richard Forbes
[ABSTRACT]
This paper proposes a robust nonlinear observer synthesis method for a
population of systems modelled using the Koopman operator. The Koopman operator
allows nonlinear systems to be rewritten as infinite-dimensional linear
systems. A finite-dimensional approximation of the Koopman operator can be
identified directly from data, yielding an approximately linear model of a
nonlinear system. The proposed observer synthesis method is made possible by
this linearity that in turn allows uncertainty within a population of Koopman
models to be quantified in the frequency domain. Using this uncertainty model,
linear robust control techniques are used to synthesize robust nonlinear
Koopman observers. A population of several dozen motor drives is used to
experimentally demonstrate the proposed method. Manufacturing variation is
characterized in the frequency domain, and a robust Koopman observer is
synthesized using mixed $\mathcal{H}_2$-$\mathcal{H}_\infty$ optimal control.
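The step of identifying a finite-dimensional Koopman approximation from data can be sketched with a least-squares fit over lifted snapshots. The lifting functions, the scalar toy system, and all names below are assumptions for illustration; the sketch does not cover the paper's uncertainty modelling or observer synthesis.

```python
import numpy as np

def lift(x):
    # Hypothetical lifting functions: the state and its square.
    return np.stack([x, x ** 2])

def fit_koopman(snapshots):
    # Least-squares fit of K such that lift(x[k+1]) ~= K @ lift(x[k]).
    psi_now = lift(snapshots[:-1])
    psi_next = lift(snapshots[1:])
    return psi_next @ np.linalg.pinv(psi_now)

a = 0.9
states = 2.0 * a ** np.arange(20)  # trajectory of the toy system x[k+1] = a * x[k]
K = fit_koopman(states)            # close to diag(a, a**2) for this system
```

For this toy system the lifted dynamics are exactly linear, since the squared state evolves with factor a**2, so the fitted matrix is diagonal up to numerical error.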
[COMMENTS]
16 pages, 15 figures
[LINK]
http://arxiv.org/abs/2410.01057v1
[DATE]
2024-10-02 04:31:18+08:00
[CATEGORIES]
cs.LG
Simulation of Graph Algorithms with Looped Transformers
[AUTHORS]
Artur Back de Luca, Kimon Fountoulakis
[ABSTRACT]
The execution of graph algorithms using neural networks has recently
attracted significant interest due to promising empirical progress. This
motivates further understanding of how neural networks can replicate reasoning
steps with relational data. In this work, we study the ability of transformer
networks to simulate algorithms on graphs from a theoretical perspective. The
architecture we use is a looped transformer with extra attention heads that
interact with the graph. We prove by construction that this architecture can
simulate individual algorithms such as Dijkstra’s shortest path, Breadth- and
Depth-First Search, and Kosaraju’s strongly connected components, as well as
multiple algorithms simultaneously. The number of parameters in the networks
does not increase with the input graph size, which implies that the networks
can simulate the above algorithms for any graph. Despite this property, we show
a limit to simulation in our solution due to finite precision. Finally, we show
a Turing Completeness result with constant width when the extra attention heads
are utilized.
[COMMENTS]
55 pages, 3 figures
[LINK]
http://arxiv.org/abs/2402.01107v3
[DATE]
2024-10-02 04:30:37+08:00
[CATEGORIES]
cs.LG
Adaptive Cascading Network for Continual Test-Time Adaptation
[AUTHORS]
Kien X. Nguyen, Fengchun Qiao, Xi Peng
[ABSTRACT]
We study the problem of continual test-time adaptation where the goal is to
adapt a source pre-trained model to a sequence of unlabelled target domains at
test time. Existing methods on test-time training suffer from several
limitations: (1) Mismatch between the feature extractor and classifier; (2)
Interference between the main and self-supervised tasks; (3) Lack of the
ability to quickly adapt to the current distribution. In light of these
challenges, we propose a cascading paradigm that simultaneously updates the
feature extractor and classifier at test time, mitigating the mismatch between
them and enabling long-term model adaptation. The pre-training of our model is
structured within a meta-learning framework, thereby minimizing the
interference between the main and self-supervised tasks and encouraging fast
adaptation in the presence of limited unlabelled data. Additionally, we
introduce innovative evaluation metrics, average accuracy and forward transfer,
to effectively measure the model’s adaptation capabilities in dynamic,
real-world scenarios. Extensive experiments and ablation studies demonstrate
the superiority of our approach in a range of tasks including image
classification, text classification, and speech recognition.
[LINK]
http://arxiv.org/abs/2407.12240v2
[DATE]
2024-10-02 04:11:53+08:00
[CATEGORIES]
cs.LG
Spherical Analysis of Learning Nonlinear Functionals
[AUTHORS]
Zhenyu Yang, Shuo Huang, Han Feng, Ding-Xuan Zhou
[ABSTRACT]
In recent years, there has been growing interest in the field of functional
neural networks. They have been proposed and studied with the aim of
approximating continuous functionals defined on sets of functions on Euclidean
domains. In this paper, we consider functionals defined on sets of functions on
spheres. The approximation ability of deep ReLU neural networks is investigated
by novel spherical analysis using an encoder-decoder framework. An encoder
comes up first to accommodate the infinite-dimensional nature of the domain of
functionals. It utilizes spherical harmonics to help us extract the latent
finite-dimensional information of functions, which in turn facilitates the
next step of approximation analysis using fully connected neural networks.
Moreover, real-world objects are frequently sampled discretely and are often
corrupted by noise. Therefore, encoders with discrete input and those with
discrete and random noise input are constructed, respectively. The
approximation rates with different encoder structures are provided therein.
[LINK]
http://arxiv.org/abs/2410.01047v1
[DATE]
2024-10-02 04:10:00+08:00
[CATEGORIES]
cs.LG
Learning from Demonstration with Implicit Nonlinear Dynamics Models
[AUTHORS]
Peter David Fagan, Subramanian Ramamoorthy
[ABSTRACT]
Learning from Demonstration (LfD) is a useful paradigm for training policies
that solve tasks involving complex motions, such as those encountered in
robotic manipulation. In practice, the successful application of LfD requires
overcoming error accumulation during policy execution, i.e. the problem of
drift due to errors compounding over time and the consequent
out-of-distribution behaviours. Existing works seek to address this problem
through scaling data collection, correcting policy errors with a
human-in-the-loop, temporally ensembling policy predictions or through learning
a dynamical system model with convergence guarantees. In this work, we propose
and validate an alternative approach to overcoming this issue. Inspired by
reservoir computing, we develop a recurrent neural network layer that includes
a fixed nonlinear dynamical system with tunable dynamical properties for
modelling temporal dynamics. We validate the efficacy of our neural network
layer on the task of reproducing human handwriting motions using the LASA Human
Handwriting Dataset. Through empirical experiments we demonstrate that
incorporating our layer into existing neural network architectures addresses
the issue of compounding errors in LfD. Furthermore, we perform a comparative
evaluation against existing approaches including a temporal ensemble of policy
predictions and an Echo State Network (ESN) implementation. We find that our
approach yields greater policy precision and robustness on the handwriting task
while also generalising to multiple dynamics regimes and maintaining
competitive latency scores.
[COMMENTS]
21 pages, 9 figures
[LINK]
http://arxiv.org/abs/2409.18768v2
[DATE]
2024-10-02 04:05:35+08:00
[CATEGORIES]
cs.LG
Transformers as Transducers
[AUTHORS]
Lena Strobl, Dana Angluin, David Chiang, Jonathan Rawski, Ashish Sabharwal
[ABSTRACT]
We study the sequence-to-sequence mapping capacity of transformers by
relating them to finite transducers, and find that they can express
surprisingly large classes of transductions. We do so using variants of RASP, a
programming language designed to help people “think like transformers,” as an
intermediate representation. We extend the existing Boolean variant B-RASP to
sequence-to-sequence functions and show that it computes exactly the
first-order rational functions (such as string rotation). Then, we introduce
two new extensions. B-RASP[pos] enables calculations on positions (such as
copying the first half of a string) and contains all first-order regular
functions. S-RASP adds prefix sum, which enables additional arithmetic
operations (such as squaring a string) and contains all first-order polyregular
functions. Finally, we show that masked average-hard attention transformers can
simulate S-RASP.
[LINK]
http://arxiv.org/abs/2404.02040v2
[DATE]
2024-10-02 04:05:13+08:00
[CATEGORIES]
cs.LG
HYDRA-FL: Hybrid Knowledge Distillation for Robust and Accurate Federated Learning
[AUTHORS]
Momin Ahmad Khan, Yasra Chandio, Fatima Muhammad Anwar
[COMMENTS]
Annual Conference on Neural Information Processing Systems (NeurIPS),
2024
[LINK]
http://arxiv.org/abs/2409.19912v2
[DATE]
2024-10-02 04:03:37+08:00
[CATEGORIES]
cs.LG
OCTDL: Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods
[AUTHORS]
Mikhail Kulyabin, Aleksei Zhdanov, Anastasia Nikiforova, Andrey Stepichev, Anna Kuznetsova, Mikhail Ronkin, Vasilii Borisov, Alexander Bogachev, Sergey Korotkich, Paul A Constable, Andreas Maier
[ABSTRACT]
Optical coherence tomography (OCT) is a non-invasive imaging technique with
extensive clinical applications in ophthalmology. OCT enables the visualization
of the retinal layers, playing a vital role in the early detection and
monitoring of retinal diseases. OCT uses the principle of light wave
interference to create detailed images of the retinal microstructures, making
it a valuable tool for diagnosing ocular conditions. This work presents an
open-access OCT dataset (OCTDL) comprising over 2000 OCT images labeled
according to disease group and retinal pathology. The dataset consists of OCT
records of patients with Age-related Macular Degeneration (AMD), Diabetic
Macular Edema (DME), Epiretinal Membrane (ERM), Retinal Artery Occlusion (RAO),
Retinal Vein Occlusion (RVO), and Vitreomacular Interface Disease (VID). The
images were acquired with an Optovue Avanti RTVue XR using raster scanning
protocols with dynamic scan length and image resolution. Each retinal b-scan
was acquired by centering on the fovea and interpreted and cataloged by an
experienced retinal specialist. In this work, we applied Deep Learning
classification techniques to this new open-access dataset.
[LINK]
http://arxiv.org/abs/2312.08255v4
[DATE]
2024-10-02 03:59:21+08:00
[CATEGORIES]
cs.LG
Reinforcement learning-assisted quantum architecture search for variational quantum algorithms
[AUTHORS]
Akash Kundu
[ABSTRACT]
A significant hurdle in the noisy intermediate-scale quantum (NISQ) era is
identifying functional quantum circuits. These circuits must also adhere to the
constraints imposed by current quantum hardware limitations. Variational
quantum algorithms (VQAs), a class of quantum-classical optimization
algorithms, were developed to address these challenges in the currently
available quantum devices. However, the overall performance of VQAs depends on
the initialization strategy of the variational circuit, the structure of the
circuit (also known as ansatz), and the configuration of the cost function.
Focusing on the structure of the circuit, in this thesis, we improve the
performance of VQAs by automating the search for an optimal structure for the
variational circuits using reinforcement learning (RL). Within the thesis, the
optimality of a circuit is determined by evaluating its depth, the overall
count of gates and parameters, and its accuracy in solving the given problem.
The task of automating the search for optimal quantum circuits is known as
quantum architecture search (QAS). The majority of research in QAS is primarily
focused on a noiseless scenario. Yet, the impact of noise on the QAS remains
inadequately explored. In this thesis, we tackle the issue by introducing a
tensor-based quantum circuit encoding, restrictions on environment dynamics to
explore the search space of possible circuits efficiently, an episode halting
scheme to steer the agent to find shorter circuits, a double deep Q-network
(DDQN) with an $\epsilon$-greedy policy for better stability. The numerical
experiments on noiseless and noisy quantum hardware show that in dealing with
various VQAs, our RL-based QAS outperforms existing QAS. Meanwhile, the methods
we propose in the thesis can be readily adapted to address a wide range of
other VQAs.
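The two learning ingredients named above, an $\epsilon$-greedy policy and a double deep Q-network target, follow standard textbook definitions that can be sketched as below. The function names and array layouts are assumptions; the thesis' circuit encoding and environment are not reproduced here.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    # With probability epsilon explore uniformly, otherwise act greedily.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    # Double DQN: the online network selects the next action,
    # the target network evaluates it.
    if done:
        return float(reward)
    best_action = int(np.argmax(q_online_next))
    return float(reward + gamma * q_target_next[best_action])

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.1, 0.9, 0.3]), epsilon=0.0, rng=rng)
target = double_dqn_target(1.0, False, 0.5,
                           np.array([0.2, 0.8]), np.array([3.0, 1.0]))
```

Decoupling action selection from evaluation is what reduces the overestimation bias of a single Q-network, which is the usual reason double DQN is chosen for stability.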
[COMMENTS]
With many pages, figures and tables, I, Akash Kundu upload the final
version of my thesis! Including reviewers response and a kind of brief
overview of recent quantum architecture search methods
[LINK]
http://arxiv.org/abs/2402.13754v4
[DATE]
2024-10-02 03:58:40+08:00
[CATEGORIES]
cs.LG
Don’t Stop Me Now: Embedding Based Scheduling for LLMs
[AUTHORS]
Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, Michael Mitzenmacher
[ABSTRACT]
Efficient scheduling is crucial for interactive Large Language Model (LLM)
applications, where low request completion time directly impacts user
engagement. Size-based scheduling algorithms like Shortest Remaining Process
Time (SRPT) aim to reduce average request completion time by leveraging known
or estimated request sizes and allowing preemption by incoming jobs with
shorter service times. However, two main challenges arise when applying
size-based scheduling to LLM systems. First, accurately predicting output
lengths from prompts is challenging and often resource-intensive, making it
impractical for many systems. As a result, the state-of-the-art LLM systems
default to first-come, first-served scheduling, which can lead to head-of-line
blocking and reduced system efficiency. Second, preemption introduces extra
memory overhead to LLM systems as they must maintain intermediate states for
unfinished (preempted) requests. In this paper, we propose TRAIL, a method to
obtain output predictions from the target LLM itself. After generating each
output token, we recycle the embedding of its internal structure as input for a
lightweight classifier that predicts the remaining length for each running
request. Using these predictions, we propose a prediction-based SRPT variant
with limited preemption designed to account for memory overhead in LLM systems.
This variant allows preemption early in request execution when memory
consumption is low but restricts preemption as requests approach completion to
optimize resource utilization. On the theoretical side, we derive a closed-form
formula for this SRPT variant in an M/G/1 queue model, which demonstrates its
potential value. In our system, we implement this preemption policy alongside
our embedding-based prediction method.
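The head-of-line blocking problem with first-come, first-served scheduling, and the benefit of preempting in favour of shorter jobs, can be seen in a small single-server simulation. This is a sketch of plain SRPT with known job sizes; it does not implement TRAIL's embedding-based length prediction or its limited-preemption rule, and the function names and job list are assumptions.

```python
import heapq

def fcfs_mean_completion(jobs):
    # jobs: list of (arrival_time, size); served strictly in arrival order.
    clock, total = 0.0, 0.0
    for arrival, size in sorted(jobs):
        clock = max(clock, arrival) + size
        total += clock - arrival
    return total / len(jobs)

def srpt_mean_completion(jobs):
    # Preemptive shortest-remaining-time-first on a single server.
    pending = sorted(jobs)
    heap = []  # entries are (remaining_size, arrival_time)
    clock, total, i, n = 0.0, 0.0, 0, len(pending)
    while i < n or heap:
        if not heap:
            clock = max(clock, pending[i][0])  # idle until the next arrival
        while i < n and pending[i][0] <= clock:
            heapq.heappush(heap, (pending[i][1], pending[i][0]))
            i += 1
        remaining, arrival = heapq.heappop(heap)
        next_arrival = pending[i][0] if i < n else float("inf")
        run = min(remaining, next_arrival - clock)
        clock += run
        if run < remaining:
            heapq.heappush(heap, (remaining - run, arrival))  # preempted job returns
        else:
            total += clock - arrival
    return total / n

jobs = [(0, 10), (1, 1), (2, 1)]  # one long request followed by two short ones
fcfs = fcfs_mean_completion(jobs)
srpt = srpt_mean_completion(jobs)
```

In an LLM serving system each preempted request keeps its intermediate state in memory, which is the overhead that motivates restricting preemption late in a request's execution.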
[LINK]
http://arxiv.org/abs/2410.01035v1
[DATE]
2024-10-02 03:51:07+08:00
[CATEGORIES]
cs.LG
Single-Shot Learning of Stable Dynamical Systems for Long-Horizon Manipulation Tasks
[AUTHORS]
Alexandre St-Aubin, Amin Abyaneh, Hsiu-Chin Lin
[ABSTRACT]
Mastering complex sequential tasks continues to pose a significant challenge
in robotics. While there has been progress in learning long-horizon
manipulation tasks, most existing approaches lack rigorous mathematical
guarantees for ensuring reliable and successful execution. In this paper, we
extend previous work on learning long-horizon tasks and stable policies,
focusing on improving task success rates while reducing the amount of training
data needed. Our approach introduces a novel method that (1) segments
long-horizon demonstrations into discrete steps defined by waypoints and
subgoals, and (2) learns globally stable dynamical system policies to guide the
robot to each subgoal, even in the face of sensory noise and random
disturbances. We validate our approach through both simulation and real-world
experiments, demonstrating effective transfer from simulation to physical
robotic platforms. Code is available at
https://github.com/Alestaubin/stable-imitation-policy-with-waypoints
[COMMENTS]
7 pages, submitted to ICRA 2025
[LINK]
http://arxiv.org/abs/2410.01033v1
[DATE]
2024-10-02 03:49:56+08:00
[CATEGORIES]
cs.LG
GPTreeO: An R package for continual regression with dividing local Gaussian processes
[AUTHORS]
Timo Braun, Anders Kvellestad, Riccardo De Bin
[ABSTRACT]
We introduce GPTreeO, a flexible R package for scalable Gaussian process (GP)
regression, particularly tailored to continual learning problems. GPTreeO
builds upon the Dividing Local Gaussian Processes (DLGP) algorithm, in which a
binary tree of local GP regressors is dynamically constructed using a continual
stream of input data. In GPTreeO we extend the original DLGP algorithm by
allowing continual optimisation of the GP hyperparameters, incorporating
uncertainty calibration, and introducing new strategies for how the local
partitions are created. Moreover, the modular code structure allows users to
interface their favourite GP library to perform the local GP regression in
GPTreeO. The flexibility of GPTreeO gives the user fine-grained control of the
balance between computational speed, accuracy, stability and smoothness. We
conduct a sensitivity analysis to show how GPTreeO’s configurable features
impact the regression performance in a continual learning setting.
[LINK]
http://arxiv.org/abs/2410.01024v1
[DATE]
2024-10-02 03:33:39+08:00
[CATEGORIES]
cs.LG
Local convergence of simultaneous min-max algorithms to differential equilibrium on Riemannian manifold
[AUTHORS]
Sixin Zhang
[ABSTRACT]
We study min-max algorithms to solve zero-sum differential games on
Riemannian manifolds. Based on the notions of differential Stackelberg
equilibrium and differential Nash equilibrium on Riemannian manifolds, we
analyze the local convergence of two representative deterministic simultaneous
algorithms $\tau$-GDA and $\tau$-SGA to such equilibria. Sufficient conditions
are obtained to establish their linear convergence rates by the Ostrowski
theorem on manifolds and spectral analysis. The $\tau$-SGA algorithm is extended
from the symplectic gradient-adjustment method in Euclidean space to avoid strong
rotational dynamics in $\tau$-GDA. In some cases, we obtain a faster
convergence rate of $\tau$-SGA through an asymptotic analysis which is valid
when the learning rate ratio $\tau$ is large. We show numerically how the
insights obtained from the convergence analysis may improve the training of
orthogonal Wasserstein GANs using stochastic $\tau$-GDA and $\tau$-SGA on
simple benchmarks.
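The role of the learning rate ratio $\tau$ in simultaneous gradient descent-ascent can be seen on a flat, Euclidean toy game. The sketch below is an assumption-laden illustration (the quadratic objective, step size, and names are invented for this example) and does not involve Riemannian manifolds or the $\tau$-SGA correction.

```python
def tau_gda(x, y, lr, tau, steps):
    # Toy objective f(x, y) = 0.5 * x**2 + 2 * x * y - 0.5 * y**2,
    # minimized over x and maximized over y; its equilibrium is (0, 0).
    for _ in range(steps):
        grad_x = x + 2.0 * y
        grad_y = 2.0 * x - y
        # Simultaneous update: descent in x, ascent in y with step tau * lr.
        x, y = x - lr * grad_x, y + tau * lr * grad_y
    return x, y

x_final, y_final = tau_gda(1.0, 1.0, lr=0.1, tau=1.0, steps=200)
```

The cross term 2 * x * y is what introduces rotation into the dynamics; the iterates spiral towards the equilibrium rather than approaching it directly.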
[COMMENTS]
under review
[LINK]
http://arxiv.org/abs/2405.13392v2
[DATE]
2024-10-02 03:32:59+08:00
[CATEGORIES]
cs.LG
Stream-level flow matching from a Bayesian decision theoretic perspective
[AUTHORS]
Ganchao Wei, Li Ma
[ABSTRACT]
Flow matching (FM) is a family of training algorithms for fitting continuous
normalizing flows (CNFs). A standard approach to FM, called conditional flow
matching (CFM), exploits the fact that the marginal vector field of a CNF can
be learned by fitting least-square regression to the so-called conditional
vector field specified given one or both ends of the flow path. We show that
viewing CFM training from a Bayesian decision theoretic perspective on
parameter estimation opens the door to generalizations of CFM algorithms. We
propose one such extension by introducing a CFM algorithm based on defining
conditional probability paths given what we refer to as “streams”, instances
of latent stochastic paths that connect pairs of noise and observed data.
Further, we advocate modeling these latent streams using Gaussian
processes (GPs). The unique distributional properties of GPs, and in particular
the fact that the velocity of a GP is still a GP, allow drawing samples from
the resulting stream-augmented conditional probability path without simulating
the actual streams, and hence the “simulation-free” nature of CFM training is
preserved. We show that this generalization of the CFM can substantially reduce
the variance in the estimated marginal vector field at a moderate computational
cost, thereby improving the quality of the generated samples under common
metrics. Additionally, we show that adopting the GP on the streams allows for
flexibly linking multiple related training data points (e.g., time series) and
incorporating additional prior information. We empirically validate our claim
through both simulations and applications to two hand-written image datasets.
[LINK]
http://arxiv.org/abs/2409.20423v2
[DATE]
2024-10-02 03:05:37+08:00
[CATEGORIES]
cs.LG
Back to Bayesics: Uncovering Human Mobility Distributions and Anomalies with an Integrated Statistical and Neural Framework
[AUTHORS]
Minxuan Duan, Yinlong Qian, Lingyi Zhao, Zihao Zhou, Zeeshan Rasheed, Rose Yu, Khurram Shafique
[ABSTRACT]
Existing methods for anomaly detection often fall short due to their
inability to handle the complexity, heterogeneity, and high dimensionality
inherent in real-world mobility data. In this paper, we propose DeepBayesic, a
novel framework that integrates Bayesian principles with deep neural networks
to model the underlying multivariate distributions from sparse and complex
datasets. Unlike traditional models, DeepBayesic is designed to manage
heterogeneous inputs, accommodating both continuous and categorical data to
provide a more comprehensive understanding of mobility patterns. The framework
features customized neural density estimators and hybrid architectures,
allowing for flexibility in modeling diverse feature distributions and enabling
the use of specialized neural networks tailored to different data types. Our
approach also leverages agent embeddings for personalized anomaly detection,
enhancing its ability to distinguish between normal and anomalous behaviors for
individual agents. We evaluate our approach on several mobility datasets,
demonstrating significant improvements over state-of-the-art anomaly detection
methods. Our results indicate that incorporating personalization and advanced
sequence modeling techniques can substantially enhance the ability to detect
subtle and complex anomalies in spatiotemporal event sequences.
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2410.01011v1
[DATE]
2024-10-02 03:02:06+08:00
[CATEGORIES]
cs.LG
CktGen: Specification-Conditioned Analog Circuit Generation
[AUTHORS]
Yuxuan Hou, Jianrong Zhang, Hua Chen, Min Zhou, Faxin Yu, Hehe Fan, Yi Yang
[ABSTRACT]
Automatic synthesis of analog circuits presents significant challenges.
Existing methods usually treat the task as an optimization problem, which limits
their transferability and reusability for new requirements. To address this
limitation, we introduce a task that directly generates analog circuits from
given specifications, termed specification-conditioned analog circuit
generation. Specifically, we propose CktGen, a simple yet effective variational
autoencoder (VAE) model, that maps specifications and circuits into a joint
latent space, and reconstructs the circuit from the latent. Moreover, given
that a single specification can correspond to multiple distinct circuits,
simply minimizing the distance between the mapped latent representations of the
circuit and specification does not capture these one-to-many relationships. To
address this, we integrate contrastive learning and classifier guidance to
prevent model collapse. We conduct comprehensive experiments on the Open
Circuit Benchmark (OCB) and introduce new evaluation metrics for cross-model
consistency in the specification-to-circuit generation task. Experimental
results demonstrate substantial improvements over existing state-of-the-art
methods.
[LINK]
http://arxiv.org/abs/2410.00995v1
[DATE]
2024-10-02 02:35:44+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Y. Jennifer Sun, Zhou Lu
[ABSTRACT]
Unlike classical control theory, such as Linear Quadratic Control (LQC),
real-world control problems are highly complex. These problems often involve
adversarial perturbations, bandit feedback models, and non-quadratic,
adversarially chosen cost functions. A fundamental yet unresolved question is
whether optimal regret can be achieved for these general control problems. The
standard approach to addressing this problem involves a reduction to bandit
convex optimization with memory. In the bandit setting, constructing a gradient
estimator with low variance is challenging due to the memory structure and
non-quadratic loss functions.
In this paper, we provide an affirmative answer to this question. Our main
contribution is an algorithm that achieves an $\tilde{O}(\sqrt{T})$ optimal
regret for bandit non-stochastic control with strongly-convex and smooth cost
functions in the presence of adversarial perturbations, improving the
previously known $\tilde{O}(T^{2/3})$ regret bound from (Cassel and Koren,
[COMMENTS]
NeurIPS 2024
[LINK]
http://arxiv.org/abs/2410.00993v1
[DATE]
2024-10-02 02:35:08+08:00
[CATEGORIES]
cs.LG
Decentralized Optimization in Time-Varying Networks with Arbitrary Delays
[AUTHORS]
Tomas Ortega, Hamid Jafarkhani
[ABSTRACT]
We consider a decentralized optimization problem for networks affected by
communication delays. Examples of such networks include collaborative machine
learning, sensor networks, and multi-agent systems. To mimic communication
delays, we add virtual non-computing nodes to the network, resulting in
directed graphs. This motivates investigating decentralized optimization
solutions on directed graphs. Existing solutions assume nodes know their
out-degrees, resulting in limited applicability. To overcome this limitation,
we introduce a novel gossip-based algorithm, called DT-GO, that does not need
to know the out-degrees. The algorithm is applicable in general directed
networks, for example networks with delays or limited acknowledgment
capabilities. We derive convergence rates for both convex and non-convex
objectives, showing that our algorithm achieves the same complexity order as
centralized Stochastic Gradient Descent. In other words, the effects of the
graph topology and delays are confined to higher-order terms. Additionally, we
extend our analysis to accommodate time-varying network topologies. Numerical
simulations are provided to support our theoretical findings.
[COMMENTS]
arXiv admin note: text overlap with arXiv:2401.11344
[LINK]
http://arxiv.org/abs/2405.19513v2
[DATE]
2024-10-02 02:19:34+08:00
[CATEGORIES]
cs.LG
Tackling the Accuracy-Interpretability Trade-off in a Hierarchy of Machine Learning Models for the Prediction of Extreme Heatwaves
[AUTHORS]
Alessandro Lovo, Amaury Lancelin, Corentin Herbert, Freddy Bouchet
[ABSTRACT]
When performing predictions that use Machine Learning (ML), we are mainly
interested in performance and interpretability. This generates a natural
trade-off, where complex models generally have higher skill but are harder to
explain and thus to trust. Interpretability is particularly important in the
climate community, where we aim to gain a physical understanding of the
underlying phenomena, all the more so when the prediction concerns extreme weather
events with high impact on society. In this paper, we perform probabilistic
forecasts of extreme heatwaves over France, using a hierarchy of increasingly
complex ML models, which allows us to find the best compromise between accuracy
and interpretability. More precisely, we use models that range from a global
Gaussian Approximation (GA) to deep Convolutional Neural Networks (CNNs), with
the intermediate steps of a simple Intrinsically Interpretable Neural Network
(IINN) and a model using the Scattering Transform (ScatNet). Our findings
reveal that CNNs provide higher accuracy, but their black-box nature severely
limits interpretability, even when using state-of-the-art Explainable
Artificial Intelligence (XAI) tools. In contrast, ScatNet achieves similar
performance to CNNs while providing greater transparency, identifying key
scales and patterns in the data that drive predictions. This study underscores
the potential of interpretability in ML models for climate science,
demonstrating that simpler models can rival the performance of their more
complex counterparts, all the while being much easier to understand. This
gained interpretability is crucial for building trust in model predictions and
uncovering new scientific insights, ultimately advancing our understanding and
management of extreme weather events.
[LINK]
http://arxiv.org/abs/2410.00984v1
[DATE]
2024-10-02 02:15:04+08:00
[CATEGORIES]
cs.LG
Robust Guided Diffusion for Offline Black-Box Optimization
[AUTHORS]
Can Chen, Christopher Beckham, Zixuan Liu, Xue Liu, Christopher Pal
[ABSTRACT]
Offline black-box optimization aims to maximize a black-box function using an
offline dataset of designs and their measured properties. Two main approaches
have emerged: the forward approach, which learns a mapping from input to its
value, thereby acting as a proxy to guide optimization, and the inverse
approach, which learns a mapping from value to input for conditional
generation. (a) Although proxy-free (classifier-free) diffusion shows promise
in robustly modeling the inverse mapping, it lacks explicit guidance from
proxies, essential for generating high-performance samples beyond the training
distribution. Therefore, we propose proxy-enhanced sampling which
utilizes the explicit guidance from a trained proxy to bolster proxy-free
diffusion with enhanced sampling control. (b) Yet, the trained proxy is
susceptible to out-of-distribution issues. To address this, we devise the
module diffusion-based proxy refinement, which seamlessly integrates
insights from proxy-free diffusion back into the proxy for refinement. To sum
up, we propose Robust Guided Diffusion for
Offline Black-box Optimization (RGD), combining the advantages of
proxy (explicit guidance) and proxy-free diffusion (robustness) for effective
conditional generation. RGD achieves state-of-the-art results on various
design-bench tasks, underscoring its efficacy. Our code is at
https://anonymous.4open.science/r/RGD-27A5/README.md.
[COMMENTS]
21 pages
[LINK]
http://arxiv.org/abs/2410.00983v1
[DATE]
2024-10-02 02:14:25+08:00
[CATEGORIES]
cs.LG
Subspace Node Pruning
[AUTHORS]
Joshua Offergeld, Marcel van Gerven, Nasir Ahmad
[ABSTRACT]
Efficiency of neural network inference is undeniably important in a time
where commercial use of AI models increases daily. Node pruning is the art of
removing computational units such as neurons, filters, attention heads, or even
entire layers to significantly reduce inference time while retaining network
performance. In this work, we propose the projection of unit activations to an
orthogonal subspace in which there is no redundant activity and within which we
may prune nodes while simultaneously recovering the impact of lost units via
linear least squares. We identify that, for effective node pruning, this
subspace must be constructed using a triangular transformation matrix, a
transformation which is equivalent to an unnormalized Gram-Schmidt
orthogonalization. We furthermore show that the order in which units are
orthogonalized can be optimised to maximally reduce node activations in our
subspace and thereby form a more optimal ranking of nodes. Finally, we leverage
these orthogonal subspaces to automatically determine layer-wise pruning ratios
based upon the relative scale of node activations in our subspace, equivalent
to cumulative variance. Our proposed method reaches state of the art when
pruning ImageNet trained VGG-16 and rivals more complex state of the art
methods when pruning ResNet-50 networks across a range of pruning ratios.
[COMMENTS]
16 pages, 6 figures, 5 tables
[LINK]
http://arxiv.org/abs/2405.17506v2
[DATE]
2024-10-02 02:07:37+08:00
[CATEGORIES]
cs.LG
Paths to Equilibrium in Games
[AUTHORS]
Bora Yongacoglu, Gürdal Arslan, Lacra Pavel, Serdar Yüksel
[ABSTRACT]
In multi-agent reinforcement learning (MARL) and game theory, agents
repeatedly interact and revise their strategies as new data arrives, producing
a sequence of strategy profiles. This paper studies sequences of strategies
satisfying a pairwise constraint inspired by policy updating in reinforcement
learning, where an agent who is best responding in one period does not switch
its strategy in the next period. This constraint merely requires that
optimizing agents do not switch strategies, but does not constrain the
non-optimizing agents in any way, and thus allows for exploration. Sequences
with this property are called satisficing paths, and arise naturally in many
MARL algorithms. A fundamental question about strategic dynamics is the following: for a
given game and initial strategy profile, is it always possible to construct a
satisficing path that terminates at an equilibrium? The resolution of this
question has implications about the capabilities or limitations of a class of
MARL algorithms. We answer this question in the affirmative for normal-form
games. Our analysis reveals the counterintuitive insight that
reward-deteriorating strategic updates are key to driving play to equilibrium along a
satisficing path.
[COMMENTS]
Accepted to NeurIPS 2024
[LINK]
http://arxiv.org/abs/2403.18079v2
[DATE]
2024-10-02 01:33:13+08:00
[CATEGORIES]
cs.LG
Compressing Recurrent Neural Networks for FPGA-accelerated Implementation in Fluorescence Lifetime Imaging
[AUTHORS]
Ismail Erbas, Vikas Pandey, Aporva Amarnath, Naigang Wang, Karthik Swaminathan, Stefan T. Radev, Xavier Intes
[ABSTRACT]
Fluorescence lifetime imaging (FLI) is an important technique for studying
cellular environments and molecular interactions, but its real-time application
is limited by slow data acquisition, which requires capturing large
time-resolved images and complex post-processing using iterative fitting
algorithms. Deep learning (DL) models enable real-time inference, but can be
computationally demanding due to complex architectures and large matrix
operations. This makes DL models ill-suited for direct implementation on
field-programmable gate array (FPGA)-based camera hardware. Model compression
is thus crucial for practical deployment for real-time inference generation. In
this work, we focus on compressing recurrent neural networks (RNNs), which are
well-suited for FLI time-series data processing, to enable deployment on
resource-constrained FPGA boards. We perform an empirical evaluation of various
compression techniques, including weight reduction, knowledge distillation
(KD), post-training quantization (PTQ), and quantization-aware training (QAT),
to reduce model size and computational load while preserving inference
accuracy. Our compressed RNN model, Seq2SeqLite, achieves a balance between
computational efficiency and prediction accuracy, particularly at 8-bit
precision. By applying KD, the model parameter size was reduced by 98% while
retaining performance, making it suitable for concurrent real-time FLI analysis
on FPGA during data capture. This work represents a big step towards
integrating hardware-accelerated real-time FLI analysis for fast biological
processes.
[COMMENTS]
8 pages, 2 figures
[LINK]
http://arxiv.org/abs/2410.00948v1
[DATE]
2024-10-02 01:23:26+08:00
[CATEGORIES]
cs.LG
Empirical Perturbation Analysis of Linear System Solvers from a Data Poisoning Perspective
[AUTHORS]
Yixin Liu, Arielle Carr, Lichao Sun
[ABSTRACT]
The perturbation analysis of linear solvers applied to systems arising
broadly in machine learning settings – for instance, when using linear
regression models – offers an important perspective when these analyses
are reframed through the lens of a data poisoning attack. By analyzing solvers’
responses to such attacks, this work aims to contribute to the development of
more robust linear solvers and provide insights into poisoning attacks on
linear solvers. In particular, we investigate how the errors in the input data
will affect the fitting error and accuracy of the solution from a linear
system-solving algorithm under perturbations common in adversarial attacks. We
propose data perturbation through two distinct knowledge levels, developing a
poisoning optimization and studying two methods of perturbation: Label-guided
Perturbation (LP) and Unconditioning Perturbation (UP). Existing works mainly
focus on deriving the worst-case perturbation bound from a theoretical
perspective, and the analysis is often limited to specific kinds of linear
system solvers. When the data is intentionally perturbed
– as is the case with data poisoning – we seek to understand how different
kinds of solvers react to these perturbations, identifying those algorithms
most impacted by different types of adversarial attacks.
[COMMENTS]
18 pages
[LINK]
http://arxiv.org/abs/2410.00878v1
[DATE]
2024-10-02 01:14:05+08:00
[CATEGORIES]
cs.LG
Generative Expansion of Small Datasets: An Expansive Graph Approach
[AUTHORS]
Vahid Jebraeeli, Bo Jiang, Hamid Krim, Derya Cansever
[ABSTRACT]
Limited data availability in machine learning significantly impacts
performance and generalization. Traditional augmentation methods enhance
moderately sufficient datasets. GANs struggle with convergence when generating
diverse samples. Diffusion models, while effective, have high computational
costs. We introduce an Expansive Synthesis model generating large-scale,
information-rich datasets from minimal samples. It uses expander graph mappings
and feature interpolation to preserve data distribution and feature
relationships. The model leverages neural networks’ non-linear latent space,
captured by a Koopman operator, to create a linear feature space for dataset
expansion. An autoencoder with self-attention layers and optimal transport
refines distributional consistency. We validate by comparing classifiers
trained on generated data to those trained on original datasets. Results show
comparable performance, demonstrating the model’s potential to augment training
data effectively. This work advances data generation, addressing scarcity in
machine learning applications.
[COMMENTS]
5 pages, 3 figures and 2 tables. Under review in ICASSP 2025
[LINK]
http://arxiv.org/abs/2406.17238v2
[DATE]
2024-10-02 01:12:57+08:00
[CATEGORIES]
cs.LG
Replacing Paths with Connection-Biased Attention for Knowledge Graph Completion
[AUTHORS]
Sharmishtha Dutta, Alex Gittens, Mohammed J. Zaki, Charu C. Aggarwal
[ABSTRACT]
Knowledge graph (KG) completion aims to identify additional facts that can be
inferred from the existing facts in the KG. Recent developments in this field
have explored this task in the inductive setting, where at test time one sees
entities that were not present during training; the most performant models in
the inductive setting have employed path encoding modules in addition to
standard subgraph encoding modules. This work similarly focuses on KG
completion in the inductive setting, without the explicit use of path
encodings, which can be time-consuming and introduce several hyperparameters
that require costly hyperparameter optimization. Our approach uses a
Transformer-based subgraph encoding module only; we introduce connection-biased
attention and entity role embeddings into the subgraph encoding module to
eliminate the need for an expensive and time-consuming path encoding module.
Evaluations on standard inductive KG completion benchmark datasets demonstrate
that our Connection-Biased Link Prediction (CBLiP) model has superior
performance to models that do not use path information. Compared to models that
utilize path information, CBLiP shows competitive or superior performance while
being faster. Additionally, to show that the effectiveness of connection-biased
attention and entity role embeddings also holds in the transductive setting, we
compare CBLiP’s performance on the relation prediction task in the transductive
setting.
[LINK]
http://arxiv.org/abs/2410.00876v1
[DATE]
2024-10-02 01:12:41+08:00
[CATEGORIES]
cs.LG
Review of blockchain application with Graph Neural Networks, Graph Convolutional Networks and Convolutional Neural Networks
[AUTHORS]
Amy Ancelotti, Claudia Liason
[ABSTRACT]
This paper reviews the applications of Graph Neural Networks (GNNs), Graph
Convolutional Networks (GCNs), and Convolutional Neural Networks (CNNs) in
blockchain technology. As the complexity and adoption of blockchain networks
continue to grow, traditional analytical methods are proving inadequate in
capturing the intricate relationships and dynamic behaviors of decentralized
systems. To address these limitations, deep learning models such as GNNs, GCNs,
and CNNs offer robust solutions by leveraging the unique graph-based and
temporal structures inherent in blockchain architectures. GNNs and GCNs, in
particular, excel in modeling the relational data of blockchain nodes and
transactions, making them ideal for applications such as fraud detection,
transaction verification, and smart contract analysis. Meanwhile, CNNs can be
adapted to analyze blockchain data when represented as structured matrices,
revealing hidden temporal and spatial patterns in transaction flows. This paper
explores how these models enhance the efficiency, security, and scalability of
both linear blockchains and Directed Acyclic Graph (DAG)-based systems,
providing a comprehensive overview of their strengths and future research
directions. By integrating advanced neural network techniques, we aim to
demonstrate the potential of these models in revolutionizing blockchain
analytics, paving the way for more sophisticated decentralized applications and
improved network performance.
[LINK]
http://arxiv.org/abs/2410.00875v1
[DATE]
2024-10-02 01:11:22+08:00
[CATEGORIES]
cs.LG
Inference Optimization of Foundation Models on AI Accelerators
[AUTHORS]
Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis
[ABSTRACT]
Powerful foundation models, including large language models (LLMs), with
Transformer architectures have ushered in a new era of Generative AI across
various industries. Industry and the research community have witnessed a large
number of new applications based on those foundation models. Such applications
include question answering, customer service, image and video generation, and
code completion, among others. However, as the number of model parameters
reaches hundreds of billions, their deployment incurs prohibitive inference
costs and high latency in real-world scenarios. As a result, the demand for
cost-effective and fast inference using AI accelerators is higher than ever. To
this end, our tutorial offers a comprehensive discussion on complementary
inference optimization techniques using AI accelerators. Beginning with an
overview of basic Transformer architectures and deep learning system
frameworks, we dive deep into system optimization techniques for fast and
memory-efficient attention computations and discuss how they can be implemented
efficiently on AI accelerators. Next, we describe architectural elements that
are key for fast transformer inference. Finally, we examine various model
compression and fast decoding strategies in the same context.
[COMMENTS]
[v2] Tutorial website added [v1] Tutorial published at KDD 2024.
Camera-ready version
[LINK]
http://arxiv.org/abs/2407.09111v2
[DATE]
2024-10-02 01:10:07+08:00
[CATEGORIES]
cs.LG
Timber! Poisoning Decision Trees
[AUTHORS]
Stefano Calzavara, Lorenzo Cazzaro, Massimo Vettori
[ABSTRACT]
We present Timber, the first white-box poisoning attack targeting decision
trees. Timber is based on a greedy attack strategy leveraging sub-tree
retraining to efficiently estimate the damage performed by poisoning a given
training instance. The attack relies on a tree annotation procedure which
enables sorting training instances so that they are processed in increasing
order of computational cost of sub-tree retraining. This sorting yields a
variant of Timber supporting an early stopping criterion designed to make
poisoning attacks more efficient and feasible on larger datasets. We also
discuss an extension of Timber to traditional random forest models, which is
useful because decision trees are normally combined into ensembles to improve
their predictive power. Our experimental evaluation on public datasets shows
that our attacks outperform existing baselines in terms of effectiveness,
efficiency or both. Moreover, we show that two representative defenses can
mitigate the effect of our attacks, but fail at effectively thwarting them.
[COMMENTS]
18 pages, 7 figures, 5 tables
[LINK]
http://arxiv.org/abs/2410.00862v1
[DATE]
2024-10-02 00:58:54+08:00
[CATEGORIES]
cs.LG
Improved Sample Complexity of Imitation Learning for Barrier Model Predictive Control
[AUTHORS]
Daniel Pfrommer, Swati Padmanabhan, Kwangjun Ahn, Jack Umenberger, Tobia Marcucci, Zakaria Mhammedi, Ali Jadbabaie
[ABSTRACT]
Recent work in imitation learning has shown that having an expert controller
that is both suitably smooth and stable enables stronger guarantees on the
performance of the learned controller. However, constructing such smoothed
expert controllers for arbitrary systems remains challenging, especially in the
presence of input and state constraints. As our primary contribution, we show
how such a smoothed expert can be designed for a general class of systems using
a log-barrier-based relaxation of a standard Model Predictive Control (MPC)
optimization problem.
Improving upon our previous work, we show that barrier MPC achieves
theoretically optimal error-to-smoothness tradeoff along some direction. At the
core of this theoretical guarantee on smoothness is an improved lower bound we
prove on the optimality gap of the analytic center associated with a convex
Lipschitz function, which we believe could be of independent interest. We
validate our theoretical findings via experiments, demonstrating the merits of
our smoothing approach over randomized smoothing.
[COMMENTS]
36 pages, 3 figures. This work extends our previous result in
arXiv:2306.01914, which has been accepted for publication in CDC 2024. An
earlier version of this manuscript was submitted as part of DP’s Master’s
thesis
[LINK]
http://arxiv.org/abs/2410.00859v1
[DATE]
2024-10-02 00:52:23+08:00
[CATEGORIES]
cs.LG
Spectral Graph Sample Weighting for Interpretable Sub-cohort Analysis in Predictive Models for Neuroimaging
[AUTHORS]
Magdalini Paschali, Jiang Yu Hang, Spencer Siegel, Camila Gonzalez, Kilian Pohl, Akshay Chaudhari, Qingyu Zhao
[ABSTRACT]
Recent advancements in medicine have confirmed that brain disorders often
comprise multiple subtypes of mechanisms, developmental trajectories, or
severity levels. Such heterogeneity is often associated with demographic
aspects (e.g., sex) or disease-related contributors (e.g., genetics). Thus, the
predictive power of machine learning models used for symptom prediction varies
across subjects based on such factors. To model this heterogeneity, one can
assign each training sample a factor-dependent weight, which modulates the
subject’s contribution to the overall objective loss function. To this end, we
propose to model the subject weights as a linear combination of the eigenbases
of a spectral population graph that captures the similarity of factors across
subjects. In doing so, the learned weights smoothly vary across the graph,
highlighting sub-cohorts with high and low predictability. Our proposed sample
weighting scheme is evaluated on two tasks. First, we predict initiation of
heavy alcohol drinking in young adulthood from imaging and neuropsychological
measures from the National Consortium on Alcohol and NeuroDevelopment in
Adolescence (NCANDA). Next, we detect Dementia vs. Mild Cognitive Impairment
(MCI) using imaging and demographic measurements in subjects from the
Alzheimer’s Disease Neuroimaging Initiative (ADNI). Compared to existing sample
weighting schemes, our sample weights improve interpretability and highlight
sub-cohorts with distinct characteristics and varying model accuracy.
[LINK]
http://arxiv.org/abs/2410.00946v1
[DATE]
2024-10-02 00:48:15+08:00
[CATEGORIES]
cs.LG
Dynamic Pricing in Securities Lending Market: Application in Revenue Optimization for an Agent Lender Portfolio
[AUTHORS]
Jing Xu, Yung-Cheng Hsu, William Biscarri
[ABSTRACT]
Securities lending is an important part of the financial market structure,
where agent lenders help long-term institutional investors lend out their
securities to short sellers in exchange for a lending fee. Agent lenders within
the market seek to optimize revenue by lending out securities at the highest
rate possible. Typically, this rate is set by hard-coded business rules or
standard supervised machine learning models. These approaches are often
difficult to scale and are not adaptive to changing market conditions. Unlike a
traditional stock exchange with a centralized limit order book, the securities
lending market is organized similarly to an e-commerce marketplace, where agent
lenders and borrowers can transact at any agreed price in a bilateral fashion.
This similarity suggests that the use of typical methods for addressing dynamic
pricing problems in e-commerce could be effective in the securities lending
market. We show that existing contextual bandit frameworks can be successfully
utilized in the securities lending market. Using offline evaluation on real
historical data, we show that the contextual bandit approach can consistently
outperform typical approaches by at least 15% in terms of total revenue
generated.
[COMMENTS]
7 pages, 8 figures
[LINK]
http://arxiv.org/abs/2407.13687v3
[DATE]
2024-10-02 00:33:36+08:00
[CATEGORIES]
cs.LG
Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown
[AUTHORS]
Xingzhou Lou, Dong Yan, Wei Shen, Yuzi Yan, Jian Xie, Junge Zhang
[ABSTRACT]
Reward models (RM) play a critical role in aligning generations of large
language models (LLM) to human expectations. However, prevailing RMs fail to
capture the stochasticity within human preferences and cannot effectively
evaluate the reliability of reward predictions. To address these issues, we
propose Uncertainty-aware RM (URM) and Uncertainty-aware RM Ensemble (URME) to
incorporate and manage uncertainty in reward modeling. URM can model the
distribution of disentangled attributes within human preferences, while URME
quantifies uncertainty through discrepancies in the ensemble, thereby
identifying potential lack of knowledge during reward evaluation. Experimental
results indicate that the proposed URM achieves state-of-the-art performance
among models of the same size, demonstrating the effectiveness of
modeling uncertainty within human preferences. Furthermore, empirical results
show that through uncertainty quantification, URM and URME can identify
unreliable predictions to improve the quality of reward evaluations.
[LINK]
http://arxiv.org/abs/2410.00847v1
[DATE]
2024-10-02 00:29:59+08:00
[CATEGORIES]
cs.LG
Learning Stochastic Dynamics from Snapshots through Regularized Unbalanced Optimal Transport
[AUTHORS]
Zhenyi Zhang, Tiejun Li, Peijie Zhou
[ABSTRACT]
Reconstructing dynamics using samples from sparsely time-resolved snapshots
is an important problem in both natural sciences and machine learning. Here, we
introduce a new deep learning approach for solving regularized unbalanced
optimal transport (RUOT) and inferring continuous unbalanced stochastic
dynamics from observed snapshots. Based on the RUOT form, our method models
these dynamics without requiring prior knowledge of growth and death processes
or additional information, allowing them to be learnt directly from data.
Theoretically, we explore the connections between the RUOT and Schrödinger
bridge problem and discuss the key challenges and potential solutions. The
effectiveness of our method is demonstrated with a synthetic gene regulatory
network. Compared with other methods, our approach accurately identifies growth
and transition patterns, eliminates false transitions, and constructs the
Waddington developmental landscape.
[LINK]
http://arxiv.org/abs/2410.00844v1
[DATE]
2024-10-02 00:25:03+08:00
[CATEGORIES]
cs.LG
Solving High-Dimensional Partial Integral Differential Equations: The Finite Expression Method
[AUTHORS]
Gareth Hardwick, Senwei Liang, Haizhao Yang
[ABSTRACT]
In this paper, we introduce a new finite expression method (FEX) to solve
high-dimensional partial integro-differential equations (PIDEs). This approach
builds upon the original FEX and its inherent advantages with new advances: 1)
A novel method of parameter grouping is proposed to reduce the number of
coefficients in high-dimensional function approximation; 2) A Taylor series
approximation method is implemented to significantly improve the computational
efficiency and accuracy of the evaluation of the integral terms of PIDEs. The
new FEX based method, denoted FEX-PG to indicate the addition of the parameter
grouping (PG) step to the algorithm, provides both high accuracy and
interpretable numerical solutions, with the outcome being an explicit equation
that facilitates intuitive understanding of the underlying solution structures.
These features are often absent in traditional methods, such as finite element
methods (FEM) and finite difference methods, as well as in deep learning-based
approaches. To benchmark our method against recent advances, we apply the new
FEX-PG to solve benchmark PIDEs in the literature. In high-dimensional
settings, FEX-PG exhibits strong and robust performance, achieving relative
errors on the order of single precision machine epsilon.
[COMMENTS]
18 pages, 10 figures
[LINK]
http://arxiv.org/abs/2410.00835v1
[DATE]
2024-10-02 00:16:42+08:00
[CATEGORIES]
cs.LG
Short vs. Long-term Coordination of Drones: When Distributed Optimization Meets Deep Reinforcement Learning
[AUTHORS]
Chuhao Qin, Evangelos Pournaras
[ABSTRACT]
Swarms of autonomous interactive drones can provide compelling sensing
capabilities in Smart City applications, such as traffic monitoring. This paper
focuses on the task assignment problem for large-scale spatio-temporal sensing
by a drone swarm. However, existing approaches have distinct challenges:
distributed evolutionary optimization, such as collective learning, lacks
long-term adaptability in dynamic environments, while deep reinforcement
learning (DRL) struggles to scale effectively due to the curse of
dimensionality. Therefore, this paper proposes a novel synergetic optimization
approach by integrating long-term DRL and short-term collective learning.
Through this approach, each drone independently and proactively determines its
flying direction and recharging location using DRL, while evolving their
navigation and sensing policies through collective learning based on a
structured tree communication model. Extensive experiments with datasets
generated from realistic urban mobility demonstrate an outstanding performance
of the proposed solution in complex scenarios. New insights show that this
approach provides a win-win synthesis of short-term and long-term strategies
for drone-based traffic monitoring, with short-term methods addressing training
complexity and energy management, while long-term methods preserve high
sensing performance.
[LINK]
http://arxiv.org/abs/2311.09852v7
[DATE]
2024-10-02 00:11:27+08:00
[CATEGORIES]
cs.LG
Clustering Three-Way Data with Outliers
[AUTHORS]
Katharine M. Clark, Paul D. McNicholas
[ABSTRACT]
Matrix-variate distributions are a recent addition to the model-based
clustering field, thereby making it possible to analyze data in matrix form
with complex structure such as images and time series. Due to their recent
appearance, there is limited literature on matrix-variate data, with even less
on dealing with outliers in these models. An approach for clustering
matrix-variate normal data with outliers is discussed. The approach, which uses
the distribution of subset log-likelihoods, extends the OCLUST algorithm to
matrix-variate normal data and uses an iterative approach to detect and trim
outliers.
[LINK]
http://arxiv.org/abs/2310.05288v3
[DATE]
2024-10-02 00:08:52+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Rinor Cakaj, Jens Mehnert, Bin Yang
[ABSTRACT]
Convolutional Neural Networks (CNNs) are important for many machine learning
tasks. They are built with different types of layers: convolutional layers that
detect features, dropout layers that help to avoid over-reliance on any single
neuron, and residual layers that allow the reuse of features. However, CNNs
lack a dynamic feature retention mechanism similar to the human brain’s memory,
limiting their ability to use learned information in new contexts. To bridge
this gap, we introduce the “Squeeze-and-Remember” (SR) block, a novel
architectural unit that gives CNNs dynamic memory-like functionalities. The SR
block selectively memorizes important features during training, and then
adaptively re-applies these features during inference. This improves the
network’s ability to make contextually informed predictions. Empirical results
on ImageNet and Cityscapes datasets demonstrate the SR block’s efficacy:
integration into ResNet50 improved top-1 validation accuracy on ImageNet by
0.52% over dropout2d alone, and its application in DeepLab v3 increased mean
Intersection over Union in Cityscapes by 0.20%. These improvements are achieved
with minimal computational overhead. This shows the SR block’s potential to
enhance the capabilities of CNNs in image processing tasks.
[COMMENTS]
Accepted by The International Conference on Machine Learning and
Applications (ICMLA) 2024
[LINK]
http://arxiv.org/abs/2410.00823v1
[DATE]
2024-10-02 00:06:31+08:00
[CATEGORIES]
cs.LG
Who is better at math, Jenny or Jingzhen? Uncovering Stereotypes in Large Language Models
[AUTHORS]
Zara Siddique, Liam D. Turner, Luis Espinosa-Anke
[COMMENTS]
Accepted to EMNLP Main 2024
[LINK]
http://arxiv.org/abs/2407.06917v2
[DATE]
2024-10-01 23:50:06+08:00
[CATEGORIES]
cs.CL
Atomic Inference for NLI with Generated Facts as Atoms
[AUTHORS]
Joe Stacey, Pasquale Minervini, Haim Dubossarsky, Oana-Maria Camburu, Marek Rei
[ABSTRACT]
With recent advances, neural models can achieve human-level performance on
various natural language tasks. However, there are no guarantees that any
explanations from these models are faithful, i.e. that they reflect the inner
workings of the model. Atomic inference overcomes this issue, providing
interpretable and faithful model decisions. This approach involves making
predictions for different components (or atoms) of an instance, before using
interpretable and deterministic rules to derive the overall prediction based on
the individual atom-level predictions. We investigate the effectiveness of
using LLM-generated facts as atoms, decomposing Natural Language Inference
premises into lists of facts. While directly using generated facts in atomic
inference systems can result in worse performance, with 1) a multi-stage fact
generation process, and 2) a training regime that incorporates the facts, our
fact-based method outperforms other approaches.
[COMMENTS]
Accepted at EMNLP 2024
[LINK]
http://arxiv.org/abs/2305.13214v2
[DATE]
2024-10-01 23:48:32+08:00
[CATEGORIES]
cs.CL
The Use of Large Language Models (LLM) for Cyber Threat Intelligence (CTI) in Cybercrime Forums
[AUTHORS]
Vanessa Clairoux-Trepanier, Isa-May Beauchamp, Estelle Ruellan, Masarah Paquet-Clouston, Serge-Olivier Paquette, Eric Clay
[ABSTRACT]
Large language models (LLMs) can be used to analyze cyber threat intelligence
(CTI) data from cybercrime forums, which contain extensive information and key
discussions about emerging cyber threats. However, to date, the level of
accuracy and efficiency of LLMs for such critical tasks has yet to be
thoroughly evaluated. Hence, this study assesses the performance of an LLM
system built on the OpenAI GPT-3.5-turbo model [8] to extract CTI information.
To do so, a random sample of more than 700 daily conversations from three
cybercrime forums - XSS, Exploit_in, and RAMP - was extracted, and the LLM
system was instructed to summarize the conversations and predict 10 key CTI
variables, such as whether a large organization and/or a critical
infrastructure is being targeted, with only simple human-language instructions.
Then, two coders reviewed each conversation and evaluated whether the
information extracted by the LLM was accurate. The LLM system performed well,
with an average accuracy score of 96.23%, an average precision of 90% and an
average recall of 88.2%. Various ways to enhance the model were uncovered, such
as the need to help the LLM distinguish between stories and past events, as
well as being careful with verb tenses in prompts. Nevertheless, the results of
this study highlight the relevance of using LLMs for cyber threat intelligence.
[LINK]
http://arxiv.org/abs/2408.03354v3
[DATE]
2024-10-01 23:41:22+08:00
[CATEGORIES]
cs.CL
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents
[AUTHORS]
Liyan Tang, Philippe Laban, Greg Durrett
[ABSTRACT]
Recognizing if LLM output can be grounded in evidence is central to many
tasks in NLP: retrieval-augmented generation, summarization, document-grounded
dialogue, and more. Current approaches to this kind of fact-checking are based
on verifying each piece of a model generation against potential evidence using
an LLM. However, this process can be very computationally expensive, requiring
many calls to a model to check a single response. In this work, we show how to
build small fact-checking models that have GPT-4-level performance but for 400x
lower cost. We do this by constructing synthetic training data with GPT-4,
which involves creating realistic yet challenging instances of factual errors
via a structured generation procedure. Training on this data teaches models to
check each fact in the claim and recognize synthesis of information across
sentences. For evaluation, we unify datasets from recent work on fact-checking
and grounding LLM generations into a new benchmark, LLM-AggreFact. Our best
system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable
size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data
synthesis, and models.
[COMMENTS]
EMNLP 2024
[LINK]
http://arxiv.org/abs/2404.10774v2
[DATE]
2024-10-01 23:39:48+08:00
[CATEGORIES]
cs.CL
Decoding Hate: Exploring Language Models’ Reactions to Hate Speech
[AUTHORS]
Paloma Piot, Javier Parapar
[ABSTRACT]
Hate speech is a harmful form of online expression, often manifesting as
derogatory posts. It is a significant risk in digital environments. With the
rise of Large Language Models (LLMs), there is concern about their potential to
replicate hate speech patterns, given their training on vast amounts of
unmoderated internet data. Understanding how LLMs respond to hate speech is
crucial for their responsible deployment. However, the behaviour of LLMs
towards hate speech has received limited study. This paper investigates the
reactions of seven state-of-the-art LLMs (LLaMA 2, Vicuna, LLaMA 3, Mistral,
GPT-3.5, GPT-4, and Gemini Pro) to hate speech. Through qualitative analysis,
we aim to reveal the spectrum of responses these models produce, highlighting
their capacity to handle hate speech inputs. We also discuss strategies to
mitigate hate speech generation by LLMs, particularly through fine-tuning and
guideline guardrailing. Finally, we explore the models’ responses to hate
speech framed in politically correct language.
[LINK]
http://arxiv.org/abs/2410.00775v1
[DATE]
2024-10-01 23:16:20+08:00
[CATEGORIES]
cs.CL
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data
[AUTHORS]
Xuwu Wang, Qiwen Cui, Yunzhe Tao, Yiran Wang, Ziwei Chai, Xiaotian Han, Boyi Liu, Jianbo Yuan, Jing Su, Guoyin Wang, Tingkai Liu, Liyu Chen, Tianyi Liu, Tao Sun, Yufeng Zhang, Sirui Zheng, Quanzeng You, Yang Yang, Hongxia Yang
[ABSTRACT]
Large language models (LLMs) have become increasingly pivotal across various
domains, especially in handling complex data types. This includes structured
data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal
unstructured data processing as seen in Visual Question Answering (VQA). These
areas have attracted significant attention from both industry and academia.
Despite this, there remains a lack of unified evaluation methodologies for
these diverse data handling scenarios. In response, we introduce BabelBench, an
innovative benchmark framework that evaluates the proficiency of LLMs in
managing multimodal multistructured data with code execution. BabelBench
incorporates a dataset comprising 247 meticulously curated problems that
challenge the models with tasks in perception, commonsense reasoning, logical
reasoning, and so on. Besides the basic capabilities of multimodal
understanding, structured data processing as well as code generation, these
tasks demand advanced capabilities in exploration, planning, reasoning and
debugging. Our experimental findings on BabelBench indicate that even
cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
The insights derived from our comprehensive analysis offer valuable guidance
for future research within the community. The benchmark data can be found at
https://github.com/FFD8FFE/babelbench.
[LINK]
http://arxiv.org/abs/2410.00773v1
[DATE]
2024-10-01 23:11:24+08:00
[CATEGORIES]
cs.CL
Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting
[AUTHORS]
Chen Cai, Zheng Wang, Jianjun Gao, Wenyang Liu, Ye Lu, Runzhong Zhang, Kim-Hui Yap
[ABSTRACT]
In recent years, the rapid increase in online video content has underscored
the limitations of static Video Question Answering (VideoQA) models trained on
fixed datasets, as they struggle to adapt to new questions or tasks posed by
newly available content. In this paper, we explore the novel challenge of
VideoQA within a continual learning framework, and empirically identify a
critical issue: fine-tuning a large language model (LLM) for a sequence of
tasks often results in catastrophic forgetting. To address this, we propose
Collaborative Prompting (ColPro), which integrates specific question constraint
prompting, knowledge acquisition prompting, and visual temporal awareness
prompting. These prompts aim to capture textual question context, visual
content, and video temporal dynamics in VideoQA, a perspective underexplored in
prior research. Experimental results on the NExT-QA and DramaQA datasets show
that ColPro achieves superior performance compared to existing approaches,
achieving 55.14% accuracy on NExT-QA and 71.24% accuracy on DramaQA,
highlighting its practical relevance and effectiveness.
[COMMENTS]
Accepted by main EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.00771v1
[DATE]
2024-10-01 23:07:07+08:00
[CATEGORIES]
cs.CL
OLAPH: Improving Factuality in Biomedical Long-form Question Answering
[AUTHORS]
Minbyul Jeong, Hyeon Hwang, Chanwoong Yoon, Taewhoo Lee, Jaewoo Kang
[ABSTRACT]
In the medical domain, numerous scenarios necessitate the long-form
generation ability of large language models (LLMs). Specifically, when
addressing patients’ questions, it is essential that the model’s response
conveys factual claims, highlighting the need for an automated method to
evaluate those claims. Thus, we introduce MedLFQA, a benchmark dataset
reconstructed using long-form question-answering datasets related to the
biomedical domain. We use MedLFQA to facilitate cost-effective automatic
evaluation of factuality. We also propose OLAPH, a simple and novel framework
that utilizes cost-effective and multifaceted automatic evaluation to construct
a synthetic preference set and answers questions in our preferred manner. Our
framework leads us to train LLMs step-by-step to reduce hallucinations and
include crucial medical claims. We highlight that, even on evaluation metrics
not used during training, LLMs trained with our OLAPH framework demonstrate
significant performance improvement in factuality. Our findings reveal that a
7B LLM trained with our OLAPH framework can provide long answers comparable to
the medical experts’ answers in terms of factuality. We believe that our work
could shed light on gauging the long-text generation ability of LLMs in the
medical domain. Our code and datasets are available.
[LINK]
http://arxiv.org/abs/2405.12701v2
[DATE]
2024-10-01 23:03:14+08:00
[CATEGORIES]
cs.CL
Thinking Outside of the Differential Privacy Box: A Case Study in Text Privatization with Language Model Prompting
[AUTHORS]
Stephen Meisenbacher, Florian Matthes
[ABSTRACT]
The field of privacy-preserving Natural Language Processing has risen in
popularity, particularly at a time when concerns about privacy grow with the
proliferation of Large Language Models. One solution consistently appearing in
recent literature has been the integration of Differential Privacy (DP) into
NLP techniques. In this paper, we take a critical view of these approaches,
discussing the restrictions that DP integration imposes and bringing to
light the challenges that such restrictions entail. To accomplish this, we
focus on DP-Prompt, a recent method for text privatization
leveraging language models to rewrite texts. In particular, we explore this
rewriting task in multiple scenarios, both with DP and without DP. To drive the
discussion on the merits of DP in NLP, we conduct empirical utility and privacy
experiments. Our results demonstrate the need for more discussion on the
usability of DP in NLP and its benefits over non-DP approaches.
[COMMENTS]
10 pages, 3 tables, Accepted to EMNLP 2024 (Main)
[LINK]
http://arxiv.org/abs/2410.00751v1
[DATE]
2024-10-01 22:46:15+08:00
[CATEGORIES]
cs.CL
Optimizing Token Usage on Large Language Model Conversations Using the Design Structure Matrix
[AUTHORS]
Ramon Maria Garcia Alarcia, Alessandro Golkar
[ABSTRACT]
As Large Language Models become ubiquitous in many sectors and tasks, there
is a need to reduce token usage, overcoming challenges such as short context
windows, limited output sizes, and costs associated with token intake and
generation, especially in API-served LLMs. This work brings the Design
Structure Matrix from the engineering design discipline into LLM conversation
optimization. Applied to a use case in which the LLM conversation is about the
design of a spacecraft and its subsystems, the DSM, with its analysis tools
such as clustering and sequencing, proves to be an effective tool to
organize the conversation, minimizing the number of tokens sent to or retrieved
from the LLM at once, as well as grouping chunks that can be allocated to
different context windows. Hence, this work broadens the current set of
methodologies for token usage optimization and opens new avenues for the
integration of engineering design practices into LLMs.
[COMMENTS]
10 pages, 26th International Dependency and Structure Modelling
Conference, DSM 2024
[LINK]
http://arxiv.org/abs/2410.00749v1
[DATE]
2024-10-01 22:38:36+08:00
[CATEGORIES]
cs.CL
VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
[AUTHORS]
Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, Lianwen Jin
[ABSTRACT]
Contrastive Language-Image Pre-training (CLIP) has been widely studied and
applied in numerous applications. However, the emphasis on brief summary texts
during pre-training prevents CLIP from understanding long descriptions. This
issue is particularly acute regarding videos given that videos often contain
abundant detailed contents. In this paper, we propose the VideoCLIP-XL (eXtra
Length) model, which aims to unleash the long-description understanding
capability of video CLIP models. Firstly, we establish an automatic data
collection system and gather a large-scale VILD pre-training dataset with VIdeo
and Long-Description pairs. Then, we propose Text-similarity-guided Primary
Component Matching (TPCM) to better learn the distribution of feature space
while expanding the long description capability. We also introduce two new
tasks namely Detail-aware Description Ranking (DDR) and Hallucination-aware
Description Ranking (HDR) for further understanding improvement. Finally, we
construct a Long Video Description Ranking (LVDR) benchmark for evaluating the
long-description capability more comprehensively. Extensive experimental
results on widely-used text-video retrieval benchmarks with both short and long
descriptions and our LVDR benchmark can fully demonstrate the effectiveness of
our method.
[COMMENTS]
EMNLP 2024 Main conference
[LINK]
http://arxiv.org/abs/2410.00741v1
[DATE]
2024-10-01 22:33:22+08:00
[CATEGORIES]
cs.CL
Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation
[AUTHORS]
Jiyoon Myung, Jihyeon Park, Jungki Son, Kyungro Lee, Joohyung Han
[ABSTRACT]
This paper addresses the challenge of accurately translating technical terms,
which are crucial for clear communication in specialized fields. We introduce
the Parenthetical Terminology Translation (PTT) task, designed to mitigate
potential inaccuracies by displaying the original term in parentheses alongside
its translation. To implement this approach, we generated a representative PTT
dataset using a collaborative approach with large language models and applied
knowledge distillation to fine-tune traditional Neural Machine Translation
(NMT) models and small-sized Large Language Models (sLMs). Additionally, we
developed a novel evaluation metric to assess both overall translation accuracy
and the correct parenthetical presentation of terms. Our findings indicate that
sLMs did not consistently outperform NMT models, with fine-tuning proving more
effective than few-shot prompting, particularly in models with continued
pre-training in the target language. These insights contribute to the
advancement of more reliable terminology translation methodologies.
[COMMENTS]
Paper accepted in EMNLPW 2024
[LINK]
http://arxiv.org/abs/2410.00683v1
[DATE]
2024-10-01 21:40:28+08:00
[CATEGORIES]
cs.CL
Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training
[AUTHORS]
Tongkun Su, Jun Li, Xi Zhang, Haibo Jin, Hao Chen, Qiong Wang, Faqin Lv, Baoliang Zhao, Yin Hu
[ABSTRACT]
Multimodal pre-training demonstrates its potential in the medical domain,
which learns medical visual representations from paired medical reports.
However, many pre-training tasks require extra annotations from clinicians, and
most of them fail to explicitly guide the model to learn the desired features
of different pathologies. In this paper, we utilize Visual Question Answering
(VQA) for multimodal pre-training to guide the framework focusing on targeted
pathological features. We leverage descriptions in medical reports to design
multi-granular question-answer pairs associated with different diseases, which
assist the framework in pre-training without requiring extra annotations from
experts. We also propose a novel pre-training framework with a quasi-textual
feature transformer, a module designed to transform visual features into a
quasi-textual space closer to the textual domain via a contrastive learning
strategy. This narrows the vision-language gap and facilitates modality
alignment. Our framework is applied to four downstream tasks: report
generation, classification, segmentation, and detection across five datasets.
Extensive experiments demonstrate the superiority of our framework compared to
other state-of-the-art methods. Our code is available at
https://github.com/MoramiSu/QFT-MICCAI2024.
[COMMENTS]
Accepted by MICCAI2024
[LINK]
http://arxiv.org/abs/2404.00226v3
[DATE]
2024-10-01 21:36:38+08:00
[CATEGORIES]
cs.CL
Merging LoRAs like Playing LEGO: Pushing the Modularity of LoRA to Extremes Through Rank-Wise Clustering
[AUTHORS]
Ziyu Zhao, Tao Shen, Didi Zhu, Zexi Li, Jing Su, Xuwu Wang, Kun Kuang, Fei Wu
[ABSTRACT]
Low-Rank Adaptation (LoRA) has emerged as a popular technique for fine-tuning
large language models (LLMs) to various domains due to its modular design and
widespread availability on platforms like Huggingface. This modularity has
sparked interest in combining multiple LoRAs to enhance LLM capabilities.
However, existing methods for LoRA composition primarily focus on task-specific
adaptations that require additional training, and current model merging
techniques often fail to fully leverage LoRA’s modular nature, leading to
parameter interference and performance degradation. In this paper, we
investigate the feasibility of disassembling and reassembling multiple LoRAs at
a finer granularity, analogous to assembling LEGO blocks. We introduce the
concept of Minimal Semantic Units (MSUs), where the parameters corresponding to
each rank in LoRA function as independent units. These MSUs demonstrate
permutation invariance and concatenation-summation equivalence properties,
enabling flexible combinations to create new LoRAs. Building on these insights,
we propose the LoRA-LEGO framework. This framework conducts rank-wise parameter
clustering by grouping MSUs from different LoRAs into $k$ clusters. The
centroid of each cluster serves as a representative MSU, enabling the assembly
of a merged LoRA with an adjusted rank of $k$. Additionally, we apply a dual
reweighting strategy to optimize the scale of the merged LoRA. Experiments
across various benchmarks demonstrate that our method outperforms existing
approaches in LoRA merging.
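
The two MSU properties and the clustering step described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the usual LoRA update `B @ A`, treats rank `i` (row `i` of `A` plus column `i` of `B`) as one MSU, uses a plain Lloyd's k-means written here for self-containment, and omits the dual reweighting step. The names `msus`, `kmeans`, and `merge_loras` are hypothetical.

```python
import numpy as np

def msus(A, B):
    """Stack each rank's parameters into one vector per rank.

    A: (r, d_in), B: (d_out, r). Returns shape (r, d_in + d_out).
    """
    return np.concatenate([A, B.T], axis=1)

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def merge_loras(loras, k):
    """Cluster the MSUs of several (A, B) pairs into a rank-k LoRA."""
    d_in = loras[0][0].shape[1]
    X = np.vstack([msus(A, B) for A, B in loras])
    C = kmeans(X, k)
    return C[:, :d_in], C[:, d_in:].T  # merged A: (k, d_in), B: (d_out, k)

rng = np.random.default_rng(1)
A1, B1 = rng.normal(size=(4, 6)), rng.normal(size=(5, 4))
A2, B2 = rng.normal(size=(4, 6)), rng.normal(size=(5, 4))

# Permutation invariance: reordering ranks leaves the update unchanged.
perm = rng.permutation(4)
same_update = np.allclose(B1[:, perm] @ A1[perm, :], B1 @ A1)

# Concatenation-summation equivalence: stacking ranks sums the updates.
A_cat, B_cat = np.vstack([A1, A2]), np.hstack([B1, B2])
sums_match = np.allclose(B_cat @ A_cat, B1 @ A1 + B2 @ A2)

A_merged, B_merged = merge_loras([(A1, B1), (A2, B2)], k=3)
```

Because each rank contributes an independent outer product to the update, the ranks can be pooled across LoRAs and clustered; each centroid then stands in for the ranks assigned to it.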
[LINK]
http://arxiv.org/abs/2409.16167v2
[DATE]
2024-10-01 21:16:45+08:00
[CATEGORIES]
cs.LG
cs.CL
AutoTM 2.0: Automatic Topic Modeling Framework for Documents Analysis
[AUTHORS]
Maria Khodorchenko, Nikolay Butakov, Maxim Zuev, Denis Nasonov
[ABSTRACT]
In this work, we present AutoTM 2.0, a framework for optimizing additively
regularized topic models. Compared to the previous version, this version
includes valuable improvements such as a novel optimization pipeline, LLM-based
quality metrics, and a distributed mode.
AutoTM 2.0 is a convenient tool for specialists as well as non-specialists to
work with text documents, conduct exploratory data analysis, or perform
clustering tasks on an interpretable set of features. Quality evaluation is based
on specially developed metrics such as coherence and GPT-4-based approaches.
Researchers and practitioners can easily integrate new optimization algorithms
and adapt novel metrics to enhance modeling quality and extend their
experiments.
We show that AutoTM 2.0 achieves better performance compared to the previous
AutoTM by providing results on 5 datasets with different features and in two
different languages.
[LINK]
http://arxiv.org/abs/2410.00655v1
[DATE]
2024-10-01 21:13:15+08:00
[CATEGORIES]
cs.LG
cs.CL
LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data
[AUTHORS]
Grigor Bezirganyan, Sana Sellami, Laure Berti-Équille, Sébastien Fournier
[ABSTRACT]
Multimodal Deep Learning enhances decision-making by integrating diverse
information sources, such as texts, images, audio, and videos. To develop
trustworthy multimodal approaches, it is essential to understand how
uncertainty impacts these models. We propose LUMA, a unique benchmark dataset,
featuring audio, image, and textual data from 50 classes, for learning from
uncertain and multimodal data. It extends the well-known CIFAR 10/100 dataset
with audio samples extracted from three audio corpora, and text data generated
using the Gemma-7B Large Language Model (LLM). The LUMA dataset enables the
controlled injection of varying types and degrees of uncertainty to achieve and
tailor specific experiments and benchmarking initiatives. LUMA is also
available as a Python package including the functions for generating multiple
variants of the dataset while controlling the diversity of the data, the amount
of noise for each modality, and adding out-of-distribution samples. A baseline
pre-trained model is also provided alongside three uncertainty quantification
methods: Monte-Carlo Dropout, Deep Ensemble, and Reliable Conflictive
Multi-View Learning. This comprehensive dataset and its benchmarking tools are
intended to promote and support the development, evaluation, and benchmarking
of trustworthy and robust multimodal deep learning approaches. We anticipate
that the LUMA dataset will help the ICLR community to design more trustworthy
and robust machine learning approaches for safety critical applications.
[LINK]
http://arxiv.org/abs/2406.09864v2
[DATE]
2024-10-01 21:07:02+08:00
[CATEGORIES]
cs.LG
cs.CL
Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach
[AUTHORS]
Siqi Li, Danni Liu, Jan Niehues
[ABSTRACT]
Direct speech translation (ST) models often struggle with rare words.
Incorrect translation of these words can have severe consequences, impacting
translation quality and user trust. While rare word translation is inherently
challenging for neural models due to sparse learning signals, real-world
scenarios often allow access to translations of past recordings on similar
topics. To leverage these valuable resources, we propose a
retrieval-and-demonstration approach to enhance rare word translation accuracy
in direct ST models. First, we adapt existing ST models to incorporate
retrieved examples for rare word translation, which allows the model to benefit
from prepended examples, similar to in-context learning. We then develop a
cross-modal (speech-to-speech, speech-to-text, text-to-text) retriever to
locate suitable examples. We demonstrate that standard ST models can be
effectively adapted to leverage examples for rare word translation, improving
rare word translation accuracy over the baseline by 17.6% with gold examples
and 8.5% with retrieved examples. Moreover, our speech-to-speech retrieval
approach outperforms other modalities and exhibits higher robustness to unseen
speakers. Our code is publicly available
(https://github.com/SiqiLii/Retrieve-and-Demonstration-ST).
[COMMENTS]
EMNLP 2024
[LINK]
http://arxiv.org/abs/2409.09009v2
[DATE]
2024-10-01 21:06:20+08:00
[CATEGORIES]
cs.CL
Enhancing High-order Interaction Awareness in LLM-based Recommender Model
[AUTHORS]
Xinfeng Wang, Jin Cui, Fumiyo Fukumoto, Yoshimi Suzuki
[ABSTRACT]
Large language models (LLMs) have demonstrated prominent reasoning
capabilities in recommendation tasks by transforming them into text-generation
tasks. However, existing approaches either disregard or ineffectively model the
user-item high-order interactions. To this end, this paper presents an enhanced
LLM-based recommender (ELMRec). We enhance whole-word embeddings to
substantially enhance LLMs’ interpretation of graph-constructed interactions
for recommendations, without requiring graph pre-training. This finding may
inspire endeavors to incorporate rich knowledge graphs into LLM-based
recommenders via whole-word embedding. We also found that LLMs often recommend
items based on users’ earlier interactions rather than recent ones, and present
a reranking solution. Our ELMRec outperforms state-of-the-art (SOTA) methods in
both direct and sequential recommendations.
[COMMENTS]
Long paper accepted to EMNLP 2024 Main. 16 pages
[LINK]
http://arxiv.org/abs/2409.19979v2
[DATE]
2024-10-01 21:04:55+08:00
[CATEGORIES]
cs.CL
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
[AUTHORS]
Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, Jonathan Herzig
[COMMENTS]
Accepted as a long paper at EMNLP 2024
[LINK]
http://arxiv.org/abs/2405.05904v3
[DATE]
2024-10-01 20:08:23+08:00
[CATEGORIES]
cs.CL
Unveiling Implicit Table Knowledge with Question-Then-Pinpoint Reasoner for Insightful Table Summarization
[AUTHORS]
Kwangwook Seo, Jinyoung Yeo, Dongha Lee
[ABSTRACT]
Implicit knowledge hidden within the explicit table cells, such as data
insights, is the key to generating a high-quality table summary. However,
unveiling such implicit knowledge is a non-trivial task. Due to the complex
nature of structured tables, it is challenging even for large language models
(LLMs) to mine the implicit knowledge in an insightful and faithful manner. To
address this challenge, we propose a novel table reasoning framework
Question-then-Pinpoint. Our work focuses on building a plug-and-play table
reasoner that can self-question the insightful knowledge and answer it by
faithfully pinpointing evidence on the table to provide explainable guidance
for the summarizer. To train a reliable reasoner, we collect table knowledge by
guiding a teacher LLM to follow the coarse-to-fine reasoning paths and refine
it through two quality enhancement strategies to selectively distill the
high-quality knowledge to the reasoner. Extensive experiments on two table
summarization datasets, including our newly proposed InsTaSumm, validate the
general effectiveness of our framework.
[COMMENTS]
Accepted to EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2406.12269v2
[DATE]
2024-10-01 19:26:11+08:00
[CATEGORIES]
cs.CL
Style-Specific Neurons for Steering LLMs in Text Style Transfer
[AUTHORS]
Wen Lai, Viktor Hangya, Alexander Fraser
[COMMENTS]
Accepted at EMNLP 2024 main conference. The code is publicly
available at https://github.com/wenlai-lavine/sNeuron-TST
[LINK]
http://arxiv.org/abs/2410.00593v1
[DATE]
2024-10-01 19:25:36+08:00
[CATEGORIES]
cs.CL
DiffuCOMET: Contextual Commonsense Knowledge Diffusion
[AUTHORS]
Silin Gao, Mete Ismayilzada, Mengjie Zhao, Hiromi Wakaki, Yuki Mitsufuji, Antoine Bosselut
[ABSTRACT]
Inferring contextually-relevant and diverse commonsense to understand
narratives remains challenging for knowledge models. In this work, we develop a
series of knowledge models, DiffuCOMET, that leverage diffusion to learn to
reconstruct the implicit semantic connections between narrative contexts and
relevant commonsense knowledge. Across multiple diffusion steps, our method
progressively refines a representation of commonsense facts that is anchored to
a narrative, producing contextually-relevant and diverse commonsense inferences
for an input context. To evaluate DiffuCOMET, we introduce new metrics for
commonsense inference that more closely measure knowledge diversity and
contextual relevance. Our results on two different benchmarks, ComFact and
WebNLG+, show that knowledge generated by DiffuCOMET achieves a better
trade-off between commonsense diversity, contextual relevance and alignment to
known gold references, compared to baseline knowledge models.
[LINK]
http://arxiv.org/abs/2402.17011v2
[DATE]
2024-10-01 18:38:25+08:00
[CATEGORIES]
cs.CL
Think Twice: A Human-like Two-stage Conversational Agent for Emotional Response Generation
[AUTHORS]
Yushan Qian, Bo Wang, Shangzhao Ma, Wu Bin, Shuo Zhang, Dongming Zhao, Kun Huang, Yuexian Hou
[ABSTRACT]
Towards human-like dialogue systems, current emotional dialogue approaches
jointly model emotion and semantics with a unified neural network. This
strategy tends to generate safe responses due to the mutual restriction between
emotion and semantics, and requires a rare emotion-annotated large-scale dialogue
corpus. Inspired by the “think twice” behavior in human dialogue, we propose a
two-stage conversational agent for the generation of emotional dialogue.
Firstly, a dialogue model trained without the emotion-annotated dialogue corpus
generates a prototype response that meets the contextual semantics. Secondly,
the first-stage prototype is modified by a controllable emotion refiner with
the empathy hypothesis. Experimental results on the DailyDialog and
EmpatheticDialogues datasets demonstrate that the proposed conversational
agent outperforms the comparison models in emotion generation and maintains the
semantic performance in automatic and human evaluations.
[COMMENTS]
Accepted to AAMAS 2023
[LINK]
http://arxiv.org/abs/2301.04907v3
[DATE]
2024-10-01 18:36:20+08:00
[CATEGORIES]
cs.CL
Zero-Shot Multi-Hop Question Answering via Monte-Carlo Tree Search with Large Language Models
[AUTHORS]
Seongmin Lee, Jaewook Shin, Youngjin Ahn, Seokin Seo, Ohjoon Kwon, Kee-Eung Kim
[ABSTRACT]
Recent advances in large language models (LLMs) have significantly impacted
the domain of multi-hop question answering (MHQA), where systems are required
to aggregate information and infer answers from disparate pieces of text.
However, the autoregressive nature of LLMs inherently poses a challenge as
errors may accumulate if mistakes are made in the intermediate reasoning steps.
This paper introduces Monte-Carlo tree search for Zero-shot multi-hop Question
Answering (MZQA), a framework based on Monte-Carlo tree search (MCTS) to
identify optimal reasoning paths in MHQA tasks, mitigating the error
propagation from sequential reasoning processes. Unlike previous works, we
propose a zero-shot prompting method, which relies solely on instructions
without the support of hand-crafted few-shot examples that typically require
domain expertise. We also introduce a behavioral cloning approach (MZQA-BC)
trained on self-generated MCTS inference trajectories, achieving an over
10-fold increase in reasoning speed with little compromise in performance. The
efficacy of our method is validated on standard benchmarks such as HotpotQA,
2WikiMultihopQA, and MuSiQue, demonstrating that it outperforms existing
frameworks.
[COMMENTS]
Work in Progress
[LINK]
http://arxiv.org/abs/2409.19382v2
[DATE]
2024-10-01 18:28:32+08:00
[CATEGORIES]
cs.CL
AMR-Evol: Adaptive Modular Response Evolution Elicits Better Knowledge Distillation for Large Language Models in Code Generation
[AUTHORS]
Ziyang Luo, Xin Li, Hongzhan Lin, Jing Ma, Lidong Bing
[ABSTRACT]
The impressive performance of proprietary LLMs like GPT4 in code generation
has led to a trend to replicate these capabilities in open-source models
through knowledge distillation (e.g. Code Evol-Instruct). However, these
efforts often neglect the crucial aspect of response quality, relying heavily
on teacher models for direct response distillation. This paradigm, especially
for complex instructions, can degrade the quality of synthesized data,
compromising the knowledge distillation process. To this end, our study
introduces the Adaptive Modular Response Evolution (AMR-Evol) framework, which
employs a two-stage process to refine response distillation. The first stage,
modular decomposition, breaks down the direct response into more manageable
sub-modules. The second stage, adaptive response evolution, automatically
evolves the response with the related function modules. Our experiments with
three popular code benchmarks (HumanEval, MBPP, and EvalPlus) attest to the
superiority of the AMR-Evol framework over baseline response distillation
methods. By comparing with the open-source Code LLMs trained on a similar scale
of data, we observed performance enhancements: more than +3.0 points on
HumanEval-Plus and +1.0 points on MBPP-Plus, which underscores the
effectiveness of our framework. Our codes are available at
https://github.com/ChiYeungLaw/AMR-Evol.
[COMMENTS]
EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.00558v1
[DATE]
2024-10-01 18:12:38+08:00
[CATEGORIES]
cs.CL
Developments in Sheaf-Theoretic Models of Natural Language Ambiguities
[AUTHORS]
Kin Ian Lo, Mehrnoosh Sadrzadeh, Shane Mansfield
[ABSTRACT]
Sheaves are mathematical objects consisting of a base which constitutes a
topological space and the data associated with each open set thereof, e.g.
continuous functions defined on the open sets. Sheaves were originally
used in algebraic topology and logic. Recently, they have also modelled events
such as physical experiments and natural language disambiguation processes. We
extend the latter models from lexical ambiguities to discourse ambiguities
arising from anaphora. To begin, we calculated a new measure of contextuality
for a dataset of basic anaphoric discourses, resulting in a higher proportion
of contextual models (82.9%) compared to previous work, which only yielded 3.17%
contextual models. Then, we show how an extension of the natural language
processing challenge, known as the Winograd Schema, which involves anaphoric
ambiguities, can be modelled on the Bell-CHSH scenario with a contextual
fraction of 0.096.
[COMMENTS]
In Proceedings DCM 2023, arXiv:2409.19298
[LINK]
http://arxiv.org/abs/2402.04505v2
[DATE]
2024-10-01 17:54:00+08:00
[CATEGORIES]
cs.CL
What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study
[AUTHORS]
Beatrice Savoldi, Sara Papi, Matteo Negri, Ana Guerberof, Luisa Bentivogli
[ABSTRACT]
Gender bias in machine translation (MT) is recognized as an issue that can
harm people and society. And yet, advancements in the field rarely involve
people, the final MT users, or inform how they might be impacted by biased
technologies. Current evaluations are often restricted to automatic methods,
which offer an opaque estimate of what the downstream impact of gender
disparities might be. We conduct an extensive human-centered study to examine
if and to what extent bias in MT brings harms with tangible costs, such as
quality of service gaps across women and men. To this aim, we collect
behavioral data from 90 participants, who post-edited MT outputs to ensure
correct gender translation. Across multiple datasets, languages, and types of
users, our study shows that feminine post-editing demands significantly more
technical and temporal effort, also corresponding to higher financial costs.
Existing bias measurements, however, fail to reflect the found disparities. Our
findings advocate for human-centered approaches that can inform the societal
impact of bias.
[COMMENTS]
Accepted at EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.00545v1
[DATE]
2024-10-01 17:38:34+08:00
[CATEGORIES]
cs.CL
Efficient Controlled Language Generation with Low-Rank Autoregressive Reward Models
[AUTHORS]
Sergey Troshin, Vlad Niculae, Antske Fokkens
[ABSTRACT]
Language models trained on large amounts of data are known to produce
inappropriate content in some cases and require careful tuning to be used in
the real world. We revisit the reward augmented decoding (RAD) approach to
control the generation from a language model using the scores from a
task-specific reward model. We investigate the training objective of RAD, and
reformulate it as a task of learning a reward matrix. We show that RAD is
designed to support high flexibility when representing the reward matrices,
which leads to higher computational costs during decoding. However, we
demonstrate that RAD does not use its full flexibility. Motivated by this, we
propose a simpler but more efficient low-rank parametrization of the reward
model enabling fast and effective guided decoding. For the detoxification and
sentiment control tasks, we show that our low-rank reward model performs on par
with the more flexible RAD parametrization, while requiring only a single
reward model call per generated token.
[LINK]
http://arxiv.org/abs/2407.04615v2
[DATE]
2024-10-01 17:23:32+08:00
[CATEGORIES]
cs.CL
Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models
[AUTHORS]
Jiaming Li, Lei Zhang, Yunshui Li, Ziqiang Liu, Yuelin Bai, Run Luo, Longze Chen, Min Yang
[ABSTRACT]
The instruction-following ability of large language models enables humans to
interact with AI agents in a natural way. However, when required to generate
responses of a specific length, large language models often struggle to meet
users’ needs due to their inherent difficulty in accurately perceiving
numerical constraints. To explore the ability of large language models to
control the length of generated responses, we propose the Target Length
Generation Task (TLG) and design two metrics, Precise Match (PM) and Flexible
Match (FM) to evaluate the model’s performance in adhering to specified
response lengths. Furthermore, we introduce a novel, model-agnostic approach
called Ruler, which employs Meta Length Tokens (MLTs) to enhance the
instruction-following ability of large language models under length-constrained
instructions. Specifically, Ruler equips LLMs with the ability to generate
responses of a specified length based on length constraints within the
instructions. Moreover, Ruler can automatically generate appropriate MLT when
length constraints are not explicitly provided, demonstrating excellent
versatility and generalization. Comprehensive experiments show the
effectiveness of Ruler across different LLMs on the Target Length Generation Task,
e.g., at All Level, an average gain of 27.97 on PM and 29.57 on FM. In
addition, we conduct extensive ablation experiments to further substantiate the
efficacy and generalization of Ruler. Our code and data are available at
https://github.com/Geaming2002/Ruler.
[LINK]
http://arxiv.org/abs/2409.18943v2
[DATE]
2024-10-01 17:20:58+08:00
[CATEGORIES]
cs.CL
Exploring the Learning Capabilities of Language Models using LEVERWORLDS
[AUTHORS]
Eitan Wagner, Amir Feder, Omri Abend
[ABSTRACT]
Learning a model of a stochastic setting often involves learning both general
structure rules and specific properties of the instance. This paper
investigates the interplay between learning the general and the specific in
various learning methods, with emphasis on sample efficiency. We design a
framework called LeverWorlds, which allows the generation of simple
physics-inspired worlds that follow a similar generative process with different
distributions, and their instances can be expressed in natural language. These
worlds allow for controlled experiments to assess the sample complexity of
different learning methods. We experiment with classic learning algorithms as
well as Transformer language models, both with fine-tuning and In-Context
Learning (ICL). Our general finding is that (1) Transformers generally succeed
in the task; but (2) they are considerably less sample efficient than classic
methods that make stronger assumptions about the structure, such as Maximum
Likelihood Estimation and Logistic Regression. This finding is in tension with
the recent tendency to use Transformers as general-purpose estimators. We
propose an approach that leverages the ICL capabilities of contemporary
language models to apply simple algorithms for this type of data. Our
experiments show that models currently struggle with the task but show
promising potential.
[LINK]
http://arxiv.org/abs/2410.00519v1
[DATE]
2024-10-01 17:02:13+08:00
[CATEGORIES]
cs.CL
Cross-lingual Back-Parsing: Utterance Synthesis from Meaning Representation for Zero-Resource Semantic Parsing
[AUTHORS]
Deokhyung Kang, Seonjeong Hwang, Yunsu Kim, Gary Geunbae Lee
[ABSTRACT]
Recent efforts have aimed to utilize multilingual pretrained language models
(mPLMs) to extend semantic parsing (SP) across multiple languages without
requiring extensive annotations. However, achieving zero-shot cross-lingual
transfer for SP remains challenging, leading to a performance gap between
source and target languages. In this study, we propose Cross-Lingual
Back-Parsing (CBP), a novel data augmentation methodology designed to enhance
cross-lingual transfer for SP. Leveraging the representation geometry of the
mPLMs, CBP synthesizes target language utterances from source meaning
representations. Our methodology effectively performs cross-lingual data
augmentation in challenging zero-resource settings, by utilizing only labeled
data in the source language and monolingual corpora. Extensive experiments on
two cross-language SP benchmarks (Mschema2QA and Xspider) demonstrate that CBP
brings substantial gains in the target language. Further analysis of the
synthesized utterances shows that our method successfully generates target
language utterances with high slot value alignment rates while preserving
semantic integrity. Our codes and data are publicly available at
https://github.com/deokhk/CBP.
[COMMENTS]
Accepted to EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.00513v1
[DATE]
2024-10-01 16:53:38+08:00
[CATEGORIES]
cs.CL
Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity
[AUTHORS]
Sergey Berezin, Reza Farahbakhsh, Noel Crespi
[ABSTRACT]
We introduce a novel family of adversarial attacks that exploit the inability
of language models to interpret ASCII art. To evaluate these attacks, we
propose the ToxASCII benchmark and develop two custom ASCII art fonts: one
leveraging special tokens and another using text-filled letter shapes. Our
attacks achieve a perfect 1.0 Attack Success Rate across ten models, including
OpenAI’s o1-preview and LLaMA 3.1.
Warning: this paper contains examples of toxic language used for research
purposes.
[LINK]
http://arxiv.org/abs/2409.18708v3
[DATE]
2024-10-01 16:50:01+08:00
[CATEGORIES]
cs.CL
FlipGuard: Defending Preference Alignment against Update Regression with Constrained Optimization
[AUTHORS]
Mingye Zhu, Yi Liu, Quan Wang, Junbo Guo, Zhendong Mao
[COMMENTS]
Accepted by EMNLP 2024 Main track
[LINK]
http://arxiv.org/abs/2410.00508v1
[DATE]
2024-10-01 16:46:59+08:00
[CATEGORIES]
cs.CL
Multi-Target Cross-Lingual Summarization: a novel task and a language-neutral approach
[AUTHORS]
Diogo Pernes, Gonçalo M. Correia, Afonso Mendes
[ABSTRACT]
Cross-lingual summarization aims to bridge language barriers by summarizing
documents in different languages. However, ensuring semantic coherence across
languages is an overlooked challenge and can be critical in several contexts.
To fill this gap, we introduce multi-target cross-lingual summarization as the
task of summarizing a document into multiple target languages while ensuring
that the produced summaries are semantically similar. We propose a principled
re-ranking approach to this problem and a multi-criteria evaluation protocol to
assess semantic coherence across target languages, marking a first step that
will hopefully stimulate further research on this problem.
[COMMENTS]
Accepted to EMNLP 2024 (Findings)
[LINK]
http://arxiv.org/abs/2410.00502v1
[DATE]
2024-10-01 16:33:57+08:00
[CATEGORIES]
cs.CL
cs.LG
Self-Updatable Large Language Models with Parameter Integration
[AUTHORS]
Yu Wang, Xinshuang Liu, Xiusi Chen, Sean O’Brien, Junda Wu, Julian McAuley
[ABSTRACT]
Despite significant advancements in large language models (LLMs), the rapid
and frequent integration of small-scale experiences, such as interactions with
surrounding objects, remains a substantial challenge. Two critical factors in
assimilating these experiences are (1) Efficacy: the ability to accurately
remember recent events; (2) Retention: the capacity to recall long-past
experiences. Current methods either embed experiences within model parameters
using continual learning, model editing, or knowledge distillation techniques,
which often struggle with rapid updates and complex interactions, or rely on
external storage to achieve long-term retention, thereby increasing storage
requirements. In this paper, we propose SELF-PARAM (Self-Updatable Large
Language Models with Parameter Integration). SELF-PARAM requires no extra
parameters while ensuring near-optimal efficacy and long-term retention. Our
method employs a training objective that minimizes the Kullback-Leibler (KL)
divergence between the predictions of an original model (with access to
contextual information) and a target model (without such access). By generating
diverse question-answer pairs related to the knowledge and minimizing the KL
divergence across this dataset, we update the target model to internalize the
knowledge seamlessly within its parameters. Evaluations on question-answering
and conversational recommendation tasks demonstrate that SELF-PARAM
significantly outperforms existing methods, even when accounting for non-zero
storage requirements. This advancement paves the way for more efficient and
scalable integration of experiences in large language models by embedding
knowledge directly into model parameters.
[LINK]
http://arxiv.org/abs/2410.00487v1
[DATE]
2024-10-01 16:18:17+08:00
[CATEGORIES]
cs.CL
Adversarial Suffixes May Be Features Too!
[AUTHORS]
Wei Zhao, Zhe Li, Yige Li, Jun Sun
[ABSTRACT]
Despite significant ongoing efforts in safety alignment, large language
models (LLMs) such as GPT-4 and LLaMA 3 remain vulnerable to jailbreak attacks
that can induce harmful behaviors, including those triggered by adversarial
suffixes. Building on prior research, we hypothesize that these adversarial
suffixes are not mere bugs but may represent features that can dominate the
LLM’s behavior. To evaluate this hypothesis, we conduct several experiments.
First, we demonstrate that benign features can be effectively made to function
as adversarial suffixes, i.e., we develop a feature extraction method to
extract sample-agnostic features from benign dataset in the form of suffixes
and show that these suffixes may effectively compromise safety alignment.
Second, we show that adversarial suffixes generated from jailbreak attacks may
contain meaningful features, i.e., appending the same suffix to different
prompts results in responses exhibiting specific characteristics. Third, we
show that such benign-yet-safety-compromising features can be easily introduced
through fine-tuning using only benign datasets, i.e., even in the absence of
harmful content. This highlights the critical risk posed by dominating benign
features in the training data and calls for further research to reinforce LLM
safety alignment. Our code and data are available at
https://github.com/anonymous.
[LINK]
http://arxiv.org/abs/2410.00451v1
[DATE]
2024-10-01 15:11:55+08:00
[CATEGORIES]
cs.CL
UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs
[AUTHORS]
Yuho Lee, Taewon Yun, Jason Cai, Hang Su, Hwanjun Song
[COMMENTS]
Accepted at EMNLP-Findings 2024
[LINK]
http://arxiv.org/abs/2409.19898v2
[DATE]
2024-10-01 15:11:44+08:00
[CATEGORIES]
cs.CL
Exploring the Limits of Fine-grained LLM-based Physics Inference via Premise Removal Interventions
[AUTHORS]
Jordan Meadows, Tamsin James, Andre Freitas
[ABSTRACT]
Language models (LMs) can hallucinate when performing complex mathematical
reasoning. Physics provides a rich domain for assessing their mathematical
capabilities, where physical context requires that any symbolic manipulation
satisfies complex semantics (e.g., units, tensorial order). In this
work, we systematically remove crucial context from prompts to force instances
where model inference may be algebraically coherent, yet unphysical. We assess
LM capabilities in this domain using a curated dataset encompassing multiple
notations and Physics subdomains. Further, we improve zero-shot scores using
synthetic in-context examples, and demonstrate non-linear degradation of
derivation quality with perturbation strength via the progressive omission of
supporting premises. We find that the models’ mathematical reasoning is not
physics-informed in this setting, where physical context is predominantly
ignored in favour of reverse-engineering solutions.
[COMMENTS]
EMNLP 2024 (Findings)
[LINK]
http://arxiv.org/abs/2404.18384v2
[DATE]
2024-10-01 14:17:52+08:00
[CATEGORIES]
cs.CL
Are LLMs Aware that Some Questions are not Open-ended?
[AUTHORS]
Dongjie Yang, Hai Zhao
[ABSTRACT]
Large Language Models (LLMs) have shown the impressive capability of
answering questions in a wide range of scenarios. However, when LLMs face
different types of questions, it is worth exploring whether LLMs are aware that
some questions have limited answers and need to respond more deterministically
but some do not. We refer to this as question awareness of LLMs. The lack of
question awareness in LLMs leads to two phenomena: LLMs are (1) too casual
when answering non-open-ended questions or (2) too boring when answering open-ended
questions. In this paper, we first evaluate the question awareness in LLMs. The
experimental results show that LLMs lack awareness of questions in certain
domains, e.g., factual knowledge, resulting in
hallucinations during generation. To mitigate these, we propose a method
called Question Awareness Temperature Sampling (QuATS). This method enhances
the question awareness of LLMs by adaptively adjusting the output distributions
based on question features. The automatic adjustment in QuATS eliminates the
need for manual temperature tuning in text generation and consistently improves
model performance in various benchmarks.
[COMMENTS]
Accepted by EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.00423v1
[DATE]
2024-10-01 14:07:00+08:00
[CATEGORIES]
cs.CL
Instance-adaptive Zero-shot Chain-of-Thought Prompting
[AUTHORS]
Xiaosong Yuan, Chen Shen, Shaotian Yan, Xiaofeng Zhang, Liang Xie, Wenxiao Wang, Renchu Guan, Ying Wang, Jieping Ye
[ABSTRACT]
Zero-shot Chain-of-Thought (CoT) prompting emerges as a simple and effective
strategy for enhancing the performance of large language models (LLMs) in
real-world reasoning tasks. Nonetheless, the efficacy of a singular, task-level
prompt uniformly applied across all instances is inherently limited
since one prompt cannot be a good partner for all; a more appropriate approach
should consider the interaction between the prompt and each instance
meticulously. This work introduces an instance-adaptive prompting algorithm as
an alternative zero-shot CoT reasoning scheme by adaptively differentiating
good and bad prompts. Concretely, we first employ analysis on LLMs through the
lens of information flow to detect the mechanism under zero-shot CoT reasoning,
in which we discover that information flows from question to prompt and
question to rationale jointly influence the reasoning results most. We notice
that better zero-shot CoT reasoning needs the prompt to obtain semantic
information from the question, after which the rationale aggregates sufficient
information from the question directly and via the prompt indirectly. On the
contrary, lacking any of those would probably lead to a bad one. Stemming from
that, we further propose an instance-adaptive prompting strategy (IAP) for
zero-shot CoT reasoning. Experiments conducted with LLaMA-2, LLaMA-3, and Qwen
on math, logic, and commonsense reasoning tasks (e.g., GSM8K, MMLU, Causal
Judgement) obtain consistent improvement, demonstrating that the
instance-adaptive zero-shot CoT prompting performs better than other task-level
methods with some curated prompts or sophisticated procedures, showing the
significance of our findings in the zero-shot CoT reasoning mechanism.
[COMMENTS]
13 pages, 6 figures
[LINK]
http://arxiv.org/abs/2409.20441v2
[DATE]
2024-10-01 14:03:22+08:00
[CATEGORIES]
cs.CL
Semantic Parsing with Candidate Expressions for Knowledge Base Question Answering
[AUTHORS]
Daehwan Nam, Gary Geunbae Lee
[ABSTRACT]
Semantic parsers convert natural language to logical forms, which can be
evaluated on knowledge bases (KBs) to produce denotations. Recent semantic
parsers have been developed with sequence-to-sequence (seq2seq) pre-trained
language models (PLMs) or large language models, where the models treat logical
forms as sequences of tokens. For syntactic and semantic validity, the semantic
parsers use grammars that enable constrained decoding. However, the grammars
lack the ability to utilize the large amount of information in KBs, although logical forms
contain representations of KB elements, such as entities or relations. In this
work, we propose a grammar augmented with candidate expressions for semantic
parsing on a large KB with a seq2seq PLM. The grammar defines actions as
production rules, and our semantic parser predicts actions during inference
under the constraints by types and candidate expressions. We apply the grammar
to knowledge base question answering, where the constraints by candidate
expressions assist a semantic parser to generate valid KB elements. In
experiments on two benchmarks, KQA Pro and Overnight, the constraints by
candidate expressions increased the accuracy of our semantic parser, whether it
was trained with strong supervision or weak supervision. Our semantic parser
achieved state-of-the-art accuracies on KQA Pro and Overnight.
[LINK]
http://arxiv.org/abs/2410.00414v1
[DATE]
2024-10-01 13:46:22+08:00
[CATEGORIES]
cs.CL
TPN: Transferable Proto-Learning Network towards Few-shot Document-Level Relation Extraction
[AUTHORS]
Yu Zhang, Zhao Kang
[ABSTRACT]
Few-shot document-level relation extraction suffers from poor performance due
to the challenging cross-domain transferability of NOTA (none-of-the-above)
relation representation. In this paper, we introduce a Transferable
Proto-Learning Network (TPN) to address the challenging issue. It comprises
three core components: Hybrid Encoder hierarchically encodes semantic content
of input text combined with attention information to enhance the relation
representations. As a plug-and-play module for Out-of-Domain (OOD) Detection,
Transferable Proto-Learner computes NOTA prototype through an adaptive
learnable block, effectively mitigating NOTA bias across various domains.
Dynamic Weighting Calibrator detects relation-specific classification
confidence, serving as dynamic weights to calibrate the NOTA-dominant loss
function. Finally, to bolster the model’s cross-domain performance, we
complement it with virtual adversarial training (VAT). We conduct extensive
experimental analyses on FREDo and ReFREDo, demonstrating the superiority of
TPN. Compared to state-of-the-art methods, our approach achieves competitive
performance with approximately half the parameter size. Data and code are
available at https://github.com/EchoDreamer/TPN.
[COMMENTS]
Few shot document-level relation extraction
[LINK]
http://arxiv.org/abs/2410.00412v1
[DATE]
2024-10-01 13:37:31+08:00
[CATEGORIES]
cs.CL
Weak-to-Strong Reasoning
[AUTHORS]
Yuqing Yang, Yan Ma, Pengfei Liu
[ABSTRACT]
When large language models (LLMs) exceed human-level capabilities, it becomes
increasingly challenging to provide full-scale and accurate supervision for
these models. Weak-to-strong learning, which leverages a less capable model to
unlock the latent abilities of a stronger model, proves valuable in this
context. Yet, the efficacy of this approach for complex reasoning tasks is
still untested. Furthermore, tackling reasoning tasks under the weak-to-strong
setting currently lacks efficient methods to avoid blindly imitating the weak
supervisor including its errors. In this paper, we introduce a progressive
learning framework that enables the strong model to autonomously refine its
training data, without requiring input from either a more advanced model or
human-annotated data. This framework begins with supervised fine-tuning on a
selective small but high-quality dataset, followed by preference optimization
on contrastive samples identified by the strong model itself. Extensive
experiments on the GSM8K and MATH datasets demonstrate that our method
significantly enhances the reasoning capabilities of Llama2-70b using three
separate weak models. This method is further validated in a forward-looking
experimental setup, where Llama3-8b-instruct effectively supervises Llama3-70b
on the highly challenging OlympicArena dataset. This work paves the way for a
more scalable and sophisticated strategy to enhance AI reasoning powers. All
relevant code and resources are available at
https://github.com/GAIR-NLP/weak-to-strong-reasoning.
[COMMENTS]
EMNLP Findings 2024
[LINK]
http://arxiv.org/abs/2407.13647v2
[DATE]
2024-10-01 13:28:54+08:00
[CATEGORIES]
cs.CL
AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference
[AUTHORS]
Yang Han, Yiming Wang, Rui Wang, Lu Chen, Kai Yu
[ABSTRACT]
Text summarization tasks commonly employ Pre-trained Language Models (PLMs)
to fit diverse standard datasets. While these PLMs excel in automatic
evaluations, they frequently underperform in human evaluations, indicating a
deviation between their generated summaries and human summarization
preferences. This discrepancy is likely due to the low quality of fine-tuning
datasets and the limited availability of high-quality human-annotated data that
reflect true human preference. To address this challenge, we introduce a novel
human summarization preference alignment framework AlignSum. This framework
consists of three parts. First, we construct a Data Pyramid with extractive,
abstractive, and human-annotated summary data. Second, we apply
Gaussian Resampling to remove summaries with extreme lengths. Finally, we
implement two-stage hierarchical fine-tuning with the Data Pyramid after
Gaussian Resampling. We apply AlignSum to PLMs on the human-annotated
CNN/DailyMail and BBC XSum datasets. Experiments show that with AlignSum, PLMs
like BART-Large surpass 175B GPT-3 in both automatic and human evaluations.
This demonstrates that AlignSum significantly enhances the alignment of
language models with human summarization preferences.
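The abstract does not spell out how Gaussian Resampling is carried out. As a minimal sketch of one plausible reading, the filter below treats summary lengths as roughly Gaussian and keeps only summaries whose word count lies within a chosen number of standard deviations of the mean. The function name `gaussian_length_filter`, the whitespace tokenization, and the `num_std` threshold are all assumptions made for illustration, not details from the paper.

```python
import statistics


def gaussian_length_filter(summaries, num_std=2.0):
    """Keep summaries whose word count is within num_std std-devs of the mean.

    Hypothetical helper: the threshold and the word-count length measure
    are illustrative assumptions, not the authors' settings.
    """
    if not summaries:
        return []
    lengths = [len(s.split()) for s in summaries]
    mean = statistics.fmean(lengths)
    std = statistics.pstdev(lengths)  # population std-dev of the lengths
    low, high = mean - num_std * std, mean + num_std * std
    return [s for s, n in zip(summaries, lengths) if low <= n <= high]
```

A smaller `num_std` trims more aggressively; the right value would depend on the length distribution of the dataset at hand.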
[COMMENTS]
EMNLP2024 Findings, code at: https://github.com/csyanghan/AlignSum
[LINK]
http://arxiv.org/abs/2410.00409v1
[DATE]
2024-10-01 13:14:48+08:00
[CATEGORIES]
cs.CL
RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation
[AUTHORS]
Shuting Wang, Xin Yu, Mang Wang, Weipeng Chen, Yutao Zhu, Zhicheng Dou
[ABSTRACT]
Retrieval-augmented generation (RAG) effectively addresses issues of static
knowledge and hallucination in large language models. Existing studies mostly
focus on question scenarios with clear user intents and concise answers.
However, it is prevalent that users issue broad, open-ended queries with
diverse sub-intents, for which they desire rich and long-form answers covering
multiple relevant aspects. To tackle this important yet underexplored problem,
we propose a novel RAG framework, namely RichRAG. It includes a sub-aspect
explorer to identify potential sub-aspects of input questions, a multi-faceted
retriever to build a candidate pool of diverse external documents related to
these sub-aspects, and a generative list-wise ranker, which is a key module to
provide the top-k most valuable documents for the final generator. These ranked
documents sufficiently cover various query aspects and are aware of the
generator’s preferences, hence incentivizing it to produce rich and
comprehensive responses for users. The training of our ranker involves a
supervised fine-tuning stage to ensure the basic coverage of documents, and a
reinforcement learning stage to align downstream LLM’s preferences to the
ranking of documents. Experimental results on two publicly available datasets
prove that our framework effectively and efficiently provides comprehensive and
satisfying responses to users.
[LINK]
http://arxiv.org/abs/2406.12566v3
[DATE]
2024-10-01 12:42:48+08:00
[CATEGORIES]
cs.CL
Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation
[AUTHORS]
Bhargav Shandilya, Alexis Palmer
[ABSTRACT]
The data and compute requirements of current language modeling technology
pose challenges for the processing and analysis of low-resource languages.
Declarative linguistic knowledge has the potential to partially bridge this
data scarcity gap by providing models with useful inductive bias in the form of
language-specific rules. In this paper, we propose a retrieval augmented
generation (RAG) framework backed by a large language model (LLM) to correct
the output of a smaller model for the linguistic task of morphological
glossing. We leverage linguistic information to make up for the lack of data
and trainable parameters, while allowing for inputs from written descriptive
grammars interpreted and distilled through an LLM.
The results demonstrate that significant leaps in performance and efficiency
are possible with the right combination of: a) linguistic inputs in the form of
grammars, b) the interpretive power of LLMs, and c) the trainability of smaller
token classification networks. We show that a compact, RAG-supported model is
highly effective in data-scarce settings, achieving a new state-of-the-art for
this task and our target languages. Our work also offers documentary linguists
a more reliable and more usable tool for morphological glossing by providing
well-reasoned explanations and confidence scores for each output.
[COMMENTS]
13 pages, 1 figure, 5 tables, submitted to COLING 2025
[LINK]
http://arxiv.org/abs/2410.00387v1
[DATE]
2024-10-01 12:20:14+08:00
[CATEGORIES]
cs.CL
Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs
[AUTHORS]
Qingru Zhang, Chandan Singh, Liyuan Liu, Xiaodong Liu, Bin Yu, Jianfeng Gao, Tuo Zhao
[COMMENTS]
The 12th International Conference on Learning Representations (ICLR
2024)
[LINK]
http://arxiv.org/abs/2311.02262v2
[DATE]
2024-10-01 12:10:34+08:00
[CATEGORIES]
cs.CL
cs.LG
Block-Attention for Efficient RAG
[AUTHORS]
East Sun, Yan Wang, Lan Tian
[ABSTRACT]
We introduce Block-Attention, an attention mechanism designed to address the
increased inference latency and cost in Retrieval-Augmented Generation (RAG)
scenarios. Traditional approaches often encode the entire context. Instead,
Block-Attention divides retrieved documents into discrete blocks, with each
block independently calculating key-value (KV) states except for the final
block. In RAG scenarios, by defining each passage as a block, Block-Attention
enables us to reuse the KV states of passages that have been seen before,
thereby significantly reducing the latency and the computation overhead during
inference. The implementation of Block-Attention involves block segmentation,
position re-encoding, and fine-tuning the LLM to adapt to the Block-Attention
mechanism. Experiments on four RAG benchmarks demonstrate that after block
fine-tuning, the Block-Attention model achieves performance comparable to
self-attention models (68.4% vs 67.9% on Llama3) or even superior performance
(62.8% vs 59.6% on Mistral). Notably, Block-Attention significantly reduces
the time to first token (TTFT) and floating point operations (FLOPs) to a very
low level. It only takes 45 ms to output the first token for an input sequence
with a total length of 32K. Compared to the self-attention models, the time
consumption and corresponding FLOPs are reduced by 98.7% and 99.8%,
respectively.
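The reuse idea described above can be sketched as a cache keyed by passage content: each passage is encoded independently once, and only the final block (the query) is encoded with access to all passage states. This is a toy illustration, not the authors' implementation. `BlockKVCache`, `prefill`, `encode_block`, and `encode_final_block` are hypothetical names, the KV states are opaque placeholders, and position re-encoding and fine-tuning are omitted.

```python
import hashlib


class BlockKVCache:
    """Caches per-passage KV states so repeated passages are encoded once."""

    def __init__(self, encode_block):
        # encode_block: hypothetical callable mapping passage text to KV states.
        self._encode_block = encode_block
        self._store = {}
        self.misses = 0

    def get(self, passage):
        key = hashlib.sha256(passage.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: encode this passage independently of all others.
            self._store[key] = self._encode_block(passage)
            self.misses += 1
        return self._store[key]


def prefill(cache, passages, query, encode_final_block):
    """Reuse cached passage states; only the final block sees every block."""
    passage_states = [cache.get(p) for p in passages]
    return encode_final_block(passage_states, query)
```

With this layout, a second request that shares passages with an earlier one pays only for the passages it has not seen plus its own final block.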
[LINK]
http://arxiv.org/abs/2409.15355v3
[DATE]
2024-10-01 11:40:08+08:00
[CATEGORIES]
cs.LG
cs.CL
Self-controller: Controlling LLMs with Multi-round Step-by-step Self-awareness
[AUTHORS]
Xiao Peng, Xufan Geng
[ABSTRACT]
The applications of large language models (LLMs) have been widely spread
across all domains. However, the basic abilities such as the controllability of
LLMs are still limited. To address this, we propose “Self-controller”, a novel
agentic framework bringing self-awareness into LLMs’ reasoning logic. The core
idea of this work is to maintain states based on the LLM’s response, letting
the LLM become self-aware of current status and think step by step in a
multi-round chain-of-thought paradigm. Our experiment on the state of textual
length has shown the controllability and effectiveness of the Self-controller.
We further implement a binary search algorithm to accelerate the generation
process based on the linearity and monotonicity of the textual length state.
Another advantage of the Self-controller comes with DeepSeek’s Context Caching
technology, which significantly saves computational token consumption when a
cluster of conversations shares the same prefix of context. Theoretically, we
prove that in this scenario the extra time complexity is $O(c \log n)$. Results
of the back-of-the-envelope estimation suggest that the token consumption of
our method is no more than twice as much as that of the trivial single-round
generation. Furthermore, our ablation study on word constraints demonstrates
the Self-controller’s consistent controllability across all foundation models.
[COMMENTS]
10 pages, 6 figures
[LINK]
http://arxiv.org/abs/2410.00359v1
[DATE]
2024-10-01 11:14:12+08:00
[CATEGORIES]
cs.CL
Privacy Evaluation Benchmarks for NLP Models
[AUTHORS]
Wei Huang, Yinggui Wang, Cen Chen
[COMMENTS]
Findings of EMNLP 2024
[LINK]
http://arxiv.org/abs/2409.15868v3
[DATE]
2024-10-01 11:12:35+08:00
[CATEGORIES]
cs.CL
cs.LG
NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian
[AUTHORS]
Peng Liu, Lemei Zhang, Terje Farup, Even W. Lauvrak, Jon Espen Ingvaldsen, Simen Eide, Jon Atle Gulla, Zhirong Yang
[ABSTRACT]
Norwegian, spoken by a population of only 5 million, is underrepresented
in the most impressive breakthroughs in NLP tasks. To the best of our
knowledge, there has not yet been a comprehensive evaluation of the existing
language models (LMs) on Norwegian generation tasks during the article writing
process. To fill this gap, we 1) compiled the existing Norwegian dataset and
pre-trained 4 Norwegian Open Language Models varying in parameter scale and
architecture, collectively called NorGLM; 2) introduced a comprehensive
benchmark, NLEBench, for evaluating natural language generation capabilities in
Norwegian, encompassing translation and human annotation. Based on the
investigation, we find that: 1) the mainstream, English-dominated LM GPT-3.5
has limited capability in understanding the Norwegian context; 2) the increase
in model parameter scales demonstrates limited impact on the performance of
downstream tasks when the pre-training dataset is constrained in size; 3)
smaller models also demonstrate the reasoning capability through
Chain-of-Thought; 4) a multi-task dataset that includes synergy tasks can be
used to verify the generalizability of LLMs on natural language understanding
and, meanwhile, test the interconnectedness of these NLP tasks. We share our
resources and code for reproducibility under a CC BY-NC 4.0 license.
[COMMENTS]
Accepted at EMNLP 2024 Main Conference. Code available at
https://github.com/Smartmedia-AI/NorGLM/
[LINK]
http://arxiv.org/abs/2312.01314v2
[DATE]
2024-10-01 10:56:30+08:00
[CATEGORIES]
cs.CL
Sparse Attention Decomposition Applied to Circuit Tracing
[AUTHORS]
Gabriel Franco, Mark Crovella
[ABSTRACT]
Many papers have shown that attention heads work in conjunction with each
other to perform complex tasks. It’s frequently assumed that communication
between attention heads is via the addition of specific features to token
residuals. In this work we seek to isolate and identify the features used to
effect communication and coordination among attention heads in GPT-2 small. Our
key leverage on the problem is to show that these features are very often
sparsely coded in the singular vectors of attention head matrices. We
characterize the dimensionality and occurrence of these signals across the
attention heads in GPT-2 small when used for the Indirect Object Identification
(IOI) task. The sparse encoding of signals, as provided by attention head
singular vectors, allows for efficient separation of signals from the residual
background and straightforward identification of communication paths between
attention heads. We explore the effectiveness of this approach by tracing
portions of the circuits used in the IOI task. Our traces reveal considerable
detail not present in previous studies, shedding light on the nature of
redundant paths present in GPT-2. And our traces go beyond previous work by
identifying features used to communicate between attention heads when
performing IOI.
[LINK]
http://arxiv.org/abs/2410.00340v1
[DATE]
2024-10-01 10:34:08+08:00
[CATEGORIES]
cs.LG
cs.CL
Preserving Generalization of Language models in Few-shot Continual Relation Extraction
[AUTHORS]
Quyen Tran, Nguyen Xuan Thanh, Nguyen Hoang Anh, Nam Le Hai, Trung Le, Linh Van Ngo, Thien Huu Nguyen
[ABSTRACT]
Few-shot Continual Relation Extraction (FCRE) is an emerging and dynamic
area of study where models can sequentially integrate knowledge from new
relations with limited labeled data while circumventing catastrophic forgetting
and preserving prior knowledge from pre-trained backbones. In this work, we
introduce a novel method that leverages often-discarded language model heads.
By employing these components via a mutual information maximization strategy,
our approach helps maintain prior knowledge from the pre-trained backbone and
strategically aligns the primary classification head, thereby enhancing model
performance. Furthermore, we explore the potential of Large Language Models
(LLMs), renowned for their wealth of knowledge, in addressing FCRE challenges.
Our comprehensive experimental results underscore the efficacy of the proposed
method and offer valuable insights for future work.
[COMMENTS]
Accepted to EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.00334v1
[DATE]
2024-10-01 10:22:34+08:00
[CATEGORIES]
cs.CL
PointAD: Comprehending 3D Anomalies from Points and Pixels for Zero-shot 3D Anomaly Detection
[AUTHORS]
Qihang Zhou, Jiangtao Yan, Shibo He, Wenchao Meng, Jiming Chen
[COMMENTS]
NeurIPS 2024
[LINK]
http://arxiv.org/abs/2410.00320v1
[DATE]
2024-10-01 09:40:22+08:00
[CATEGORIES]
cs.CL
See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses
[AUTHORS]
Yulong Chen, Yang Liu, Jianhao Yan, Xuefeng Bai, Ming Zhong, Yinghao Yang, Ziyi Yang, Chenguang Zhu, Yue Zhang
[ABSTRACT]
The impressive performance of Large Language Models (LLMs) has consistently
surpassed numerous human-designed benchmarks, presenting new challenges in
assessing the shortcomings of LLMs. Designing tasks and finding LLMs’
limitations are becoming increasingly important. In this paper, we investigate
the question of whether an LLM can discover its own limitations from the errors
it makes. To this end, we propose a Self-Challenge evaluation framework with
human-in-the-loop. Starting from seed instances that GPT-4 fails to answer, we
prompt GPT-4 to summarize error patterns that can be used to generate new
instances and incorporate human feedback on them to refine these patterns for
generating more challenging data, iteratively. We end up with 8 diverse
patterns, such as text manipulation and questions with assumptions. We then
build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4
using these patterns, with human-annotated gold responses. The SC-G4 serves as
a challenging benchmark that allows for a detailed assessment of LLMs’
abilities. Our results show that only 44.96% of instances in SC-G4 can be
answered correctly by GPT-4. Interestingly, our pilot study indicates that
these error patterns also challenge other LLMs, such as Claude-3 and Llama-3,
and cannot be fully resolved through fine-tuning. Our work takes the first step
to demonstrate that LLMs can autonomously identify their inherent flaws and
provide insights for future dynamic and automatic evaluation.
[COMMENTS]
COLM 2024
[LINK]
http://arxiv.org/abs/2408.08978v2
[DATE]
2024-10-01 09:40:14+08:00
[CATEGORIES]
cs.CL
EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control
[AUTHORS]
Haozhe Chen, Run Chen, Julia Hirschberg
[COMMENTS]
EMNLP 2024 Main
[LINK]
http://arxiv.org/abs/2410.00316v1
[DATE]
2024-10-01 09:29:54+08:00
[CATEGORIES]
cs.CL
How You Prompt Matters! Even Task-Oriented Constraints in Instructions Affect LLM-Generated Text Detection
[AUTHORS]
Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki
[COMMENTS]
EMNLP 2024 Findings camera ready. Dataset available at
https://github.com/ryuryukke/HowYouPromptMatters
[LINK]
http://arxiv.org/abs/2311.08369v4
[DATE]
2024-10-01 09:24:05+08:00
[CATEGORIES]
cs.CL
HEART-felt Narratives: Tracing Empathy and Narrative Style in Personal Stories with LLMs
[AUTHORS]
Jocelyn Shen, Joel Mire, Hae Won Park, Cynthia Breazeal, Maarten Sap
[COMMENTS]
Accepted to EMNLP 2024
[LINK]
http://arxiv.org/abs/2405.17633v2
[DATE]
2024-10-01 08:17:41+08:00
[CATEGORIES]
cs.CL
Outcome-Constrained Large Language Models for Countering Hate Speech
[AUTHORS]
Lingzi Hong, Pengcheng Luo, Eduardo Blanco, Xiaoying Song
[ABSTRACT]
Automatic counterspeech generation methods have been developed to assist
efforts in combating hate speech. Existing research focuses on generating
counterspeech with linguistic attributes such as being polite, informative, and
intent-driven. However, the real impact of counterspeech in online environments
is seldom considered. This study aims to develop methods for generating
counterspeech constrained by conversation outcomes and evaluate their
effectiveness. We experiment with large language models (LLMs) to incorporate
into the text generation process two desired conversation outcomes: low
conversation incivility and non-hateful hater reentry. Specifically, we
experiment with instruction prompts, LLM finetuning, and LLM reinforcement
learning (RL). Evaluation results show that our methods effectively steer the
generation of counterspeech toward the desired outcomes. Our analyses, however,
show that there are differences in the quality and style depending on the
model.
[COMMENTS]
Accepted for presentation at the EMNLP 2024 main conference
[LINK]
http://arxiv.org/abs/2403.17146v2
[DATE]
2024-10-01 08:09:49+08:00
[CATEGORIES]
cs.CL
Efficient In-Domain Question Answering for Resource-Constrained Environments
[AUTHORS]
Isaac Chung, Phat Vo, Arman Kizilkale, Aaron Reite
[ABSTRACT]
Retrieval Augmented Generation (RAG) is a common method for integrating
external knowledge into pretrained Large Language Models (LLMs) to enhance
accuracy and relevancy in question answering (QA) tasks. However, prompt
engineering and resource efficiency remain significant bottlenecks in
developing optimal and robust RAG solutions for real-world QA applications.
Recent studies have shown success in using fine tuning to address these
problems; in particular, Retrieval Augmented Fine Tuning (RAFT) applied to
smaller 7B models has demonstrated superior performance compared to RAG setups
with much larger models such as GPT-3.5. The combination of RAFT with
parameter-efficient fine tuning (PEFT) techniques, such as Low-Rank Adaptation
(LoRA), promises an even more efficient solution, yet remains an unexplored
area. In this work, we combine RAFT with LoRA to reduce fine tuning and storage
requirements and gain faster inference times while maintaining comparable RAG
performance. This results in a more compute-efficient RAFT, or CRAFT, which is
particularly useful for knowledge-intensive QA tasks in resource-constrained
environments where internet access may be restricted and hardware resources
limited.
[COMMENTS]
6 pages, 2 tables
[LINK]
http://arxiv.org/abs/2409.17648v2
[DATE]
2024-10-01 06:52:18+08:00
[CATEGORIES]
cs.CL
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
[AUTHORS]
Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao
[ABSTRACT]
Key-value (KV) caching has become the de facto technique for accelerating
generation in large language model (LLM) inference. However, the growing cache
demand with increasing sequence length has turned LLM inference into a
memory-bound problem, significantly constraining system throughput. Existing
methods rely on dropping unimportant tokens or quantizing all entries
uniformly. Such methods, however, often incur high approximation errors when
representing the compressed matrices. The autoregressive decoding process further
compounds the error of each step, resulting in critical deviation in model
generation and deterioration of performance. To tackle this challenge, we
propose GEAR, an efficient KV cache compression framework that achieves
near-lossless high-ratio compression. GEAR first quantizes the
majority of entries, which have similar magnitudes, to ultra-low precision. It then
employs a low-rank matrix to approximate the quantization error, and a sparse
matrix to remedy individual errors from outlier entries. By integrating these
three techniques, GEAR is able to fully exploit their synergistic potential.
Our experiments demonstrate that compared to alternatives, GEAR achieves
near-lossless 4-bit KV cache compression with up to 2.38x throughput
improvement, while reducing peak-memory size up to 2.29x. Our code is publicly
available at https://github.com/HaoKang-Timmy/GEAR.
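The three-part decomposition described above (quantized bulk, low-rank correction of the quantization error, sparse outliers) can be illustrated with a small pure-Python sketch. This is a toy under stated assumptions, not the GEAR implementation: it uses uniform min-max quantization, a rank-1 correction found by power iteration, and a fixed number of largest-magnitude entries as outliers. All function names here are hypothetical.

```python
import math
import random


def quantize(matrix, bits=4):
    """Uniform min-max quantization; values are returned already dequantized."""
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [[lo + round((x - lo) / scale) * scale for x in row] for row in matrix]


def split_outliers(matrix, k):
    """Move the k largest-magnitude entries into a sparse dict; zero them in the dense part."""
    ranked = sorted(
        ((abs(x), i, j) for i, row in enumerate(matrix) for j, x in enumerate(row)),
        reverse=True,
    )[:k]
    sparse = {(i, j): matrix[i][j] for _, i, j in ranked}
    dense = [[0.0 if (i, j) in sparse else x for j, x in enumerate(row)]
             for i, row in enumerate(matrix)]
    return dense, sparse


def rank_one(residual, iters=20):
    """Power iteration: returns unit u and w = residual^T u, so residual ~ u w^T."""
    rows, cols = len(residual), len(residual[0])
    v = [1.0 / math.sqrt(cols)] * cols
    u, w = [0.0] * rows, [0.0] * cols
    for _ in range(iters):
        u = [sum(residual[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        if norm_u == 0.0:
            return [0.0] * rows, [0.0] * cols
        u = [x / norm_u for x in u]
        w = [sum(residual[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            break
        v = [x / norm_w for x in w]
    return u, w


def gear_like(matrix, bits=4, num_outliers=2, use_low_rank=True):
    """Reconstruct matrix as quantized dense part + rank-1 residual + sparse outliers."""
    dense, sparse = split_outliers(matrix, num_outliers)
    quantized = quantize(dense, bits)
    residual = [[d - q for d, q in zip(dr, qr)] for dr, qr in zip(dense, quantized)]
    u, w = rank_one(residual) if use_low_rank else ([0.0] * len(matrix), [0.0] * len(matrix[0]))
    return [[quantized[i][j] + u[i] * w[j] + sparse.get((i, j), 0.0)
             for j in range(len(matrix[0]))] for i in range(len(matrix))]


def frobenius_error(a, b):
    """Frobenius norm of the element-wise difference between two matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

Pulling outliers out before quantizing keeps the quantization range tight for the remaining entries, and the rank-1 term is a projection of the residual, so it cannot increase the Frobenius error of the reconstruction.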
[LINK]
http://arxiv.org/abs/2403.05527v4
[DATE]
2024-10-01 06:44:58+08:00
[CATEGORIES]
cs.LG
cs.CL
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
[AUTHORS]
Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan
[ABSTRACT]
Recent advancements in 3D Large Language Models (3DLLMs) have highlighted
their potential in building general-purpose agents in the 3D real world, yet
challenges remain due to the lack of high-quality robust instruction-following
data, leading to limited discriminative power and generalization of 3DLLMs. In
this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale
instruction-following data generated by our novel data engine, the Robust
Instruction Generation (RIG) engine. RIG generates two key kinds of instruction data: 1)
the Adversarial Instruction-following data, which features mixed negative and
positive samples to enhance the model’s discriminative understanding; and 2) the
Diverse Instruction-following data, which contains various instruction styles
to enhance the model’s generalization. As a result, we construct 1 million
instruction-following samples, consisting of 344K Adversarial samples, 508K
Diverse samples, and 165K benchmark training set samples. To better handle
these complex instructions, Robin3D first incorporates Relation-Augmented
Projector to enhance spatial understanding, and then strengthens the object
referring and grounding ability through ID-Feature Bonding. Robin3D
consistently outperforms previous methods across five widely-used 3D multimodal
learning benchmarks, without the need for task-specific fine-tuning. Notably,
we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9%
improvement in the captioning task (Scan2Cap).
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2410.00255v1
[DATE]
2024-10-01 05:55:38+08:00
[CATEGORIES]
cs.CL
MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans
[AUTHORS]
Anna Deichler, Jim O’Regan, Jonas Beskow
[ABSTRACT]
In this paper, we present a novel dataset captured using a VR headset to
record conversations between participants within a physics simulator
(AI2-THOR). Our primary objective is to extend the field of co-speech gesture
generation by incorporating rich contextual information within referential
settings. Participants engaged in various conversational scenarios, all based
on referential communication tasks. The dataset provides a rich set of
multimodal recordings such as motion capture, speech, gaze, and scene graphs.
This comprehensive dataset aims to enhance the understanding and development of
gesture generation models in 3D scenes by providing diverse and contextually
rich data.
[LINK]
http://arxiv.org/abs/2410.00253v1
[DATE]
2024-10-01 05:51:30+08:00
[CATEGORIES]
cs.CL
Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
[AUTHORS]
Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, Ning Zhang
[ABSTRACT]
Recent advancements in generative AI have enabled ubiquitous access to large
language models (LLMs). Empowered by their exceptional capabilities to
understand and generate human-like text, these models are being increasingly
integrated into our society. At the same time, there are also concerns about the
potential misuse of this powerful technology, prompting defensive measures from
service providers. To overcome such protection, jailbreaking prompts have
recently emerged as one of the most effective mechanisms to circumvent security
restrictions and elicit harmful content originally designed to be prohibited.
Due to the rapid development of LLMs and their ease of access via natural
languages, the frontline of jailbreak prompts is largely seen in online forums
and among hobbyists. To gain a better understanding of the threat landscape of
semantically meaningful jailbreak prompts, we systematized existing prompts and
measured their jailbreak effectiveness empirically. Further, we conducted a
user study involving 92 participants with diverse backgrounds to unveil the
process of manually creating jailbreak prompts. We observed that users often
succeeded in jailbreak prompt generation regardless of their expertise in
LLMs. Building on the insights from the user study, we also developed a system
using AI as the assistant to automate the process of jailbreak prompt
generation.
[COMMENTS]
Accepted by USENIX Security 2024
[LINK]
http://arxiv.org/abs/2403.17336v2
[DATE]
2024-10-01 05:25:23+08:00
[CATEGORIES]
cs.CL
TTQA-RS- A break-down prompting approach for Multi-hop Table-Text Question Answering with Reasoning and Summarization
[AUTHORS]
Jayetri Bardhan, Bushi Xiao, Daisy Zhe Wang
[ABSTRACT]
Question answering (QA) over tables and text has gained much popularity over
the years. Multi-hop table-text QA requires multiple hops between the table and
text, making it a challenging QA task. Although several works have attempted to
solve the table-text QA task, most involve training the models and requiring
labeled data. In this paper, we have proposed a Retrieval Augmented Generation
(RAG) based model - TTQA-RS: A break-down prompting approach for Multi-hop
Table-Text Question Answering with Reasoning and Summarization. Our model uses
an enhanced retriever for table-text information retrieval and uses augmented
knowledge, including table-text summary with decomposed sub-questions with
answers for a reasoning-based table-text QA. Using open-source language models,
our model outperformed all existing prompting methods for table-text QA tasks
on existing table-text QA datasets, such as HybridQA and OTT-QA’s development
set. Our experiments demonstrate the potential of prompt-based approaches using
open-source LLMs. Additionally, by using LLaMA3-70B, our model achieved
state-of-the-art performance for prompting-based methods on multi-hop
table-text QA.
[LINK]
http://arxiv.org/abs/2406.14732v2
[DATE]
2024-10-01 05:25:22+08:00
[CATEGORIES]
cs.CL
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
[AUTHORS]
Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber
[ABSTRACT]
Despite many recent works on Mixture of Experts (MoEs) for resource-efficient
Transformer language models, existing methods mostly focus on MoEs for
feedforward layers. Previous attempts at extending MoE to the self-attention
layer fail to match the performance of the parameter-matched baseline. Our
novel SwitchHead is an effective MoE method for the attention layer that
successfully reduces both the compute and memory requirements, achieving
wall-clock speedup, while matching the language modeling performance of the
baseline Transformer. Our novel MoE mechanism allows SwitchHead to compute up
to 8 times fewer attention matrices than the standard Transformer. SwitchHead
can also be combined with MoE feedforward layers, resulting in fully-MoE
“SwitchAll” Transformers. For our 262M parameter model trained on C4,
SwitchHead matches the perplexity of standard models with only 44% compute and
27% memory usage. Zero-shot experiments on downstream tasks confirm the
performance of SwitchHead, e.g., achieving more than 3.5% absolute improvements
on BLiMP compared to the baseline with an equal compute resource.
[COMMENTS]
Accepted to NeurIPS 2024
[LINK]
http://arxiv.org/abs/2312.07987v3
[DATE]
2024-10-01 05:19:29+08:00
[CATEGORIES]
cs.LG
cs.CL
How Far Are We on the Decision-Making of LLMs? Evaluating LLMs’ Gaming Ability in Multi-Agent Environments
[AUTHORS]
Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael R. Lyu
[ABSTRACT]
Decision-making is a complex process requiring diverse abilities, making it
an excellent framework for evaluating Large Language Models (LLMs). Researchers
have examined LLMs’ decision-making through the lens of Game Theory. However,
existing evaluations mainly focus on two-player scenarios where an LLM competes
against another. Additionally, previous benchmarks suffer from test set leakage
due to their static design. We introduce GAMA($\gamma$)-Bench, a new framework
for evaluating LLMs’ Gaming Ability in Multi-Agent environments. It includes
eight classical game theory scenarios and a dynamic scoring scheme specially
designed to quantitatively assess LLMs’ performance. $\gamma$-Bench allows
flexible game settings and adapts the scoring system to different game
parameters, enabling comprehensive evaluation of robustness, generalizability,
and strategies for improvement. Our results indicate that GPT-3.5 demonstrates
strong robustness but limited generalizability, which can be enhanced using
methods like Chain-of-Thought. We also evaluate twelve LLMs from six model
families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2.
Gemini-1.5-Pro outperforms others, scoring $68.1$ out of $100$, followed by
LLaMA-3.1-70B ($64.5$) and Mixtral-8x22B ($61.4$). All code and experimental
results are publicly available via https://github.com/CUHK-ARISE/GAMABench.
[COMMENTS]
11 pages of main text; 19 pages of appendices. Included models:
GPT-3.5-{0613, 1106, 0125}, GPT-4-0125, Gemini-{1.0, 1.5}-Pro, LLaMA-3.1-{7,
70, 405}B, Mixtral-8x{7, 22}B, Qwen-2-72B
[LINK]
http://arxiv.org/abs/2403.11807v4
[DATE]
2024-10-01 04:57:58+08:00
[CATEGORIES]
cs.CL
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
[AUTHORS]
David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Joseph Bloom
[ABSTRACT]
Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose
the activations of Large Language Models (LLMs) into human-interpretable
latents. In this paper, we pose two questions. First, to what extent do SAEs
extract monosemantic and interpretable latents? Second, to what extent does
varying the sparsity or the size of the SAE affect monosemanticity /
interpretability? By investigating these questions in the context of a simple
first-letter identification task where we have complete access to ground truth
labels for all tokens in the vocabulary, we are able to provide more detail
than prior investigations. Critically, we identify a problematic form of
feature-splitting we call feature absorption where seemingly monosemantic
latents fail to fire in cases where they clearly should. Our investigation
suggests that varying SAE size or sparsity is insufficient to solve this issue,
and that there are deeper conceptual issues in need of resolution.
[LINK]
http://arxiv.org/abs/2409.14507v4
[DATE]
2024-10-01 04:42:22+08:00
[CATEGORIES]
cs.CL
Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling
[AUTHORS]
Matúš Pikuliak, Andrea Hrckova, Stefan Oresko, Marián Šimko
[ABSTRACT]
We present GEST – a new manually created dataset designed to measure
gender-stereotypical reasoning in language models and machine translation
systems. GEST contains samples for 16 gender stereotypes about men and women
(e.g., Women are beautiful, Men are leaders) that are compatible with the
English language and 9 Slavic languages. The definition of said stereotypes was
informed by gender experts. We used GEST to evaluate English and Slavic masked
LMs, English generative LMs, and machine translation systems. We discovered
significant and consistent amounts of gender-stereotypical reasoning in almost
all the evaluated models and languages. Our experiments confirm the previously
postulated hypothesis that the larger the model, the more stereotypical it
usually is.
[COMMENTS]
EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2311.18711v3
[DATE]
2024-10-01 04:34:19+08:00
[CATEGORIES]
cs.CL
UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function
[AUTHORS]
Zhichao Wang, Bin Bi, Can Huang, Shiva Kumar Pentyala, Zixu James Zhu, Sitaram Asur, Na Claire Cheng
[ABSTRACT]
An LLM is pretrained on trillions of tokens, but the pretrained LLM may still
generate undesired responses. To solve this problem, alignment techniques such
as RLHF, DPO and KTO are proposed. However, these alignment techniques have
limitations. For example, RLHF requires training the reward model and policy
separately, which is complex, time-consuming, memory intensive and unstable
during training processes. DPO proposes a mapping between an optimal policy and
a reward, greatly simplifying the training process of RLHF. However, it cannot
take full advantage of a reward model, and it is limited to pairwise preference
data.
In this paper, we propose UNified Alignment (UNA), which
unifies RLHF/PPO, DPO and KTO. Firstly, we mathematically prove that given the
classical RLHF objective, the optimal policy is induced by a generalized
implicit reward function. With this novel mapping between a reward model and an
optimal policy, UNA can 1. unify RLHF/PPO, DPO and KTO into a supervised
learning of minimizing the difference between an implicit reward and an
explicit reward; 2. outperform RLHF/PPO while simplifying, stabilizing, speeding
up, and reducing the memory burden of the RL fine-tuning process; 3. accommodate different
feedback types including pairwise, binary and scalar feedback. Downstream
experiments show UNA outperforms DPO, KTO and RLHF.
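As background for the reward-to-policy mapping mentioned above, the following is the standard closed-form result for the KL-regularized RLHF objective, as popularized by DPO. It is a sketch for orientation only: the symbols $\beta$, $\pi_{\mathrm{ref}}$, and $Z(x)$ are notation assumed here, and the exact generalized implicit reward that UNA derives is given in the paper, not in this abstract.

```latex
\begin{align}
  % Classical RLHF objective: maximize reward while staying close to a reference policy.
  \max_{\pi}\;
    & \mathbb{E}_{x \sim \mathcal{D}}\Big[
        \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
        - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
      \Big] \\
  % Its maximizer is a reweighting of the reference policy by the exponentiated reward.
  \pi^{*}(y \mid x)
    &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big),
  \qquad
  Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big) \\
  % Solving for the reward gives the implicit reward expressed through the policy.
  r(x, y)
    &= \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
\end{align}
```

Read this way, an implicit reward is computed from policy log-ratios, and training can minimize its difference from an explicit reward signal, which is the supervised formulation the abstract describes.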
[LINK]
http://arxiv.org/abs/2408.15339v2
[DATE]
2024-10-01 04:18:27+08:00
[CATEGORIES]
cs.LG
cs.CL
DreamStruct: Understanding Slides and User Interfaces via Synthetic Data Generation
[AUTHORS]
Yi-Hao Peng, Faria Huq, Yue Jiang, Jason Wu, Amanda Xin Yue Li, Jeffrey Bigham, Amy Pavel
[ABSTRACT]
Enabling machines to understand structured visuals like slides and user
interfaces is essential for making them accessible to people with disabilities.
However, achieving such understanding computationally has required manual data
collection and annotation, which is time-consuming and labor-intensive. To
overcome this challenge, we present a method to generate synthetic, structured
visuals with target labels using code generation. Our method allows people to
create datasets with built-in labels and train models with a small number of
human-annotated examples. We demonstrate performance improvements in three
tasks for understanding slides and UIs: recognizing visual elements, describing
visual content, and classifying visual content types.
[COMMENTS]
ECCV 2024
[LINK]
http://arxiv.org/abs/2410.00201v1
[DATE]
2024-10-01 03:55:54+08:00
[CATEGORIES]
cs.CL
KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head
[AUTHORS]
Isaac Rehg
[ABSTRACT]
Context lengths of Large Language Models (LLMs) have exploded in recent
years, with 128k-token context becoming a standard and million-token context
becoming a reality. Efficiently supporting long-context inference remains
challenging as the memory that must be allocated in key-value (KV) cache for a
generation scales with its context length, limiting the number of long-context
requests that can be served concurrently under a given memory budget. KV cache
compression can mitigate this issue by removing under-utilized KVs from each
attention head’s cache and reducing its memory footprint. Higher theoretical
compression rates can be achieved when the number of removed KVs varies across
attention heads, but application of such a strategy within existing inference
frameworks adds fragmentation and cannot realize the theoretical compression
rates in physical memory. We introduce KV-Compress, a novel compression method
that evicts contiguous KV blocks within a PagedAttention framework, reducing
the memory footprint of the KV cache proportionally to this theoretical
compression rate. Our method achieves state-of-the-art performance on LongBench
for both Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct while lowering the
total number of compressed KVs by 4x compared with prior methods. Evaluations
on Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct-FP8 achieve compression
rates up to 8x with negligible impact on performance, and up to 64x while
retaining over 90% of full-cache performance for all but three of the suite’s
subsets. We benchmark an integration of our method with vLLM that increases
total throughput by up to 5.18x by enabling larger decoding batches.
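To make the memory scaling described above concrete, here is a minimal back-of-the-envelope sketch of KV cache size as a function of context length. The function names are hypothetical, the model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit values) are assumed to match the published Llama-3.1-8B configuration, and the uniform division by the compression rate is an idealization rather than KV-Compress's actual allocator.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to hold the keys and values of one sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def compressed_bytes(full_bytes, compression_rate):
    """Idealized footprint if memory shrinks in proportion to the compression rate."""
    return full_bytes / compression_rate


# Assumed Llama-3.1-8B dimensions (not stated in the abstract).
LLAMA_3_1_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

full = kv_cache_bytes(seq_len=128 * 1024, **LLAMA_3_1_8B)
GIB = 2**30
print(f"full cache at 128k tokens: {full / GIB:.2f} GiB")
for rate in (8, 64):
    print(f"{rate}x compression: {compressed_bytes(full, rate) / GIB:.2f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so a single 128k-token request occupies 16 GiB before compression, which is why the number of concurrent long-context requests is memory-bound.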
[LINK]
http://arxiv.org/abs/2410.00161v1
[DATE]
2024-10-01 03:09:13+08:00
[CATEGORIES]
cs.CL
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
[AUTHORS]
Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, Mengnan Du
[ABSTRACT]
Probing learned concepts in large language models (LLMs) is crucial for
understanding how semantic knowledge is encoded internally. Training linear
classifiers on probing tasks is a principal approach to denote the vector of a
certain concept in the representation space. However, the single vector
identified for a concept varies with both data and training, making it less
robust and weakening its effectiveness in real-world applications. To address
this challenge, we propose an approach to approximate the subspace representing
a specific concept. Built on linear probing classifiers, we extend the concept
vectors into Gaussian Concept Subspace (GCS). We demonstrate GCS’s
effectiveness through measuring its faithfulness and plausibility across
multiple LLMs with different sizes and architectures. Additionally, we use
representation intervention tasks to showcase its efficacy in real-world
applications such as emotion steering. Experimental results indicate that GCS
concept vectors have the potential to balance steering performance and
maintaining the fluency in natural language generation tasks.
[COMMENTS]
28 pages, 9 figures
[LINK]
http://arxiv.org/abs/2410.00153v1
[DATE]
2024-10-01 02:52:53+08:00
[CATEGORIES]
cs.CL
cs.LG
Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems
[AUTHORS]
Stephen Miner, Yoshiki Takashima, Simeng Han, Ferhat Erata, Timos Antonopoulos, Ruzica Piskac, Scott J Shapiro
[ABSTRACT]
Benchmarks are critical for measuring progress of math reasoning abilities of
Large Language Models (LLMs). However, existing widely-used benchmarks such as
GSM8K have been rendered less useful as multiple cutting-edge LLMs achieve over
94% accuracy. While harder benchmarks have been proposed, their creation is
often manual and expensive. We present Scheherazade, an automated approach for
producing challenging mathematical reasoning benchmarks by logically chaining
mathematical reasoning problems. We propose two different chaining methods,
forward chaining and backward chaining, which require reasoning forward and
backward through the chain respectively. We apply Scheherazade on GSM8K to
create GSM8K-Scheherazade and evaluate 3 frontier LLMs and OpenAI’s o1-preview
on it. We show that while frontier models’ performance declines precipitously
at only a few questions chained, a preliminary evaluation suggests o1-preview
performance persists up to 5 questions chained backwards. In addition, while
all other models perform worse when problems are chained backwards, o1-preview
performs better on backward-chained benchmarks. We will release the dataset and
code publicly.
[LINK]
http://arxiv.org/abs/2410.00151v1
[DATE]
2024-10-01 02:48:34+08:00
[CATEGORIES]
cs.CL
Semantic-Driven Topic Modeling Using Transformer-Based Embeddings and Clustering Algorithms
[AUTHORS]
Melkamu Abay Mersha, Mesay Gemeda yigezu, Jugal Kalita
[ABSTRACT]
Topic modeling is a powerful technique to discover hidden topics and patterns
within a collection of documents without prior knowledge. Traditional topic
modeling and clustering-based techniques encounter challenges in capturing
contextual semantic information. This study introduces an innovative end-to-end
semantic-driven topic modeling technique for the topic extraction process,
utilizing advanced word and document embeddings combined with a powerful
clustering algorithm. This semantic-driven approach represents a significant
advancement in topic modeling methodologies. It leverages contextual semantic
information to extract coherent and meaningful topics. Specifically, our model
generates document embeddings using pre-trained transformer-based language
models, reduces the dimensions of the embeddings, clusters the embeddings based
on semantic similarity, and generates coherent topics for each cluster.
Compared to ChatGPT and traditional topic modeling algorithms, our model
provides more coherent and meaningful topics.
[LINK]
http://arxiv.org/abs/2410.00134v1
[DATE]
2024-10-01 02:15:31+08:00
[CATEGORIES]
cs.CL
Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models
[AUTHORS]
Ji Liu, Jiaxiang Ren, Ruoming Jin, Zijie Zhang, Yang Zhou, Patrick Valduriez, Dejing Dou
[ABSTRACT]
As a promising paradigm to collaboratively train models with decentralized
data, Federated Learning (FL) can be exploited to fine-tune Large Language
Models (LLMs). Because LLMs are huge and the scale of the training data
increases significantly, fine-tuning incurs tremendous computation and
communication costs. The training data is generally non-Independent and
Identically Distributed (non-IID), which requires adaptive data processing
within each device. Although Low Rank Adaptation (LoRA) can significantly
reduce the scale of parameters to update in the fine-tuning process, it still
takes unaffordable time to transfer the low-rank parameters of all the layers
in LLMs. In this paper, we propose a Fisher Information-based Efficient
Curriculum Federated Learning framework (FibecFed) with two novel methods,
i.e., adaptive federated curriculum learning and efficient sparse parameter
update. First, we propose a Fisher information-based method to adaptively
sample data within each device to improve the effectiveness of the FL
fine-tuning process. Second, we dynamically select the proper layers for global
aggregation and sparse parameters for local update with LoRA so as to improve
the efficiency of the FL fine-tuning process. Extensive experimental results
based on 10 datasets demonstrate that FibecFed yields excellent performance (up
to 45.35% in terms of accuracy) and superb fine-tuning speed (up to 98.61%
faster) compared with 17 baseline approaches.
[COMMENTS]
27 pages, 8 figures, 14 tables, to appear in EMNLP 2024
[LINK]
http://arxiv.org/abs/2410.00131v1
[DATE]
2024-10-01 02:12:18+08:00
[CATEGORIES]
cs.LG
cs.CL
Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments
[AUTHORS]
Iker De la Iglesia, Iakes Goenaga, Johanna Ramirez-Romero, Jose Maria Villa-Gonzalez, Josu Goikoetxea, Ander Barrena
[ABSTRACT]
Evaluating LLM-generated text has become a key challenge, especially in
domain-specific contexts like the medical field. This work introduces a novel
evaluation methodology for LLM-generated medical explanatory arguments, relying
on Proxy Tasks and rankings to closely align results with human evaluation
criteria, overcoming the biases typically seen in LLMs used as judges. We
demonstrate that the proposed evaluators are robust against adversarial
attacks, including the assessment of non-argumentative text. Additionally, the
human-crafted arguments needed to train the evaluators are minimized to just
one example per Proxy Task. By examining multiple LLM-generated arguments, we
establish a methodology for determining whether a Proxy Task is suitable for
evaluating LLM-generated medical explanatory arguments, requiring only five
examples and two human experts.
[LINK]
http://arxiv.org/abs/2409.20565v1
[DATE]
2024-10-01 01:59:33+08:00
[CATEGORIES]
cs.CL
LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation
[AUTHORS]
Ziyao Zhang, Yanlin Wang, Chong Wang, Jiachi Chen, Zibin Zheng
[ABSTRACT]
Code generation aims to automatically generate code from input requirements,
significantly enhancing development efficiency. Recent approaches based on
large language models (LLMs) have shown promising results and revolutionized the
code generation task. Despite this promising performance, LLMs often generate
content with hallucinations, especially in code generation scenarios that
require handling complex contextual dependencies in practical development
processes. Although a previous study has analyzed hallucinations in
LLM-powered code generation, that study is limited to standalone function
generation. In this paper, we conduct an empirical study of the
phenomena, mechanism, and mitigation of LLM hallucinations within more
practical and complex development contexts in repository-level generation
scenarios. First, we manually examine the code generation results from six
mainstream LLMs to establish a hallucination taxonomy of LLM-generated code.
Next, we elaborate on the phenomenon of hallucinations and analyze their
distribution across different models. We then analyze causes of hallucinations
and identify four potential factors contributing to hallucinations. Finally, we
propose an RAG-based mitigation method, which demonstrates consistent
effectiveness in all studied LLMs. The replication package including code,
data, and experimental results is available at
https://github.com/DeepSoftwareAnalytics/LLMCodingHallucination
[COMMENTS]
11 pages, 13 figures
[LINK]
http://arxiv.org/abs/2409.20550v1
[DATE]
2024-10-01 01:51:15+08:00
[CATEGORIES]
cs.CL
Can Large Language Models Address Open-Target Stance Detection?
[AUTHORS]
Abu Ubaida Akash, Ahmed Fahmy, Amine Trabelsi
[ABSTRACT]
Stance detection (SD) identifies a text’s position towards a target,
typically labeled as favor, against, or none. We introduce Open-Target Stance
Detection (OTSD), the most realistic task where targets are neither seen during
training nor provided as input. We evaluate Large Language Models (LLMs)
GPT-4o, GPT-3.5, Llama-3, and Mistral, comparing their performance to the only
existing work, Target-Stance Extraction (TSE), which benefits from predefined
targets. Unlike TSE, OTSD removes the dependency on a predefined list, making
target generation and evaluation more challenging. We also provide a metric for
evaluating target quality that correlates well with human judgment. Our
experiments reveal that LLMs outperform TSE in target generation both when the
real target is explicitly mentioned in the text and when it is not. Likewise, for
stance detection, LLMs excel in explicit cases and show comparable performance
in non-explicit cases in general.
[COMMENTS]
14 pages; currently under submission
[LINK]
http://arxiv.org/abs/2409.00222v4
[DATE]
2024-10-01 01:37:16+08:00
[CATEGORIES]
cs.CL
Health-LLM: Personalized Retrieval-Augmented Disease Prediction System
[AUTHORS]
Mingyu Jin, Qinkai Yu, Dong Shu, Chong Zhang, Lizhou Fan, Wenyue Hua, Suiyuan Zhu, Yanda Meng, Zhenting Wang, Mengnan Du, Yongfeng Zhang
[ABSTRACT]
Recent advancements in artificial intelligence (AI), especially large
language models (LLMs), have significantly advanced healthcare applications and
demonstrated potential in intelligent medical treatment. However, there are
conspicuous challenges such as vast data volumes and inconsistent symptom
characterization standards, preventing full integration of healthcare AI
systems with individual patients’ needs. To promote professional and
personalized healthcare, we propose an innovative framework, Health-LLM, which
combines large-scale feature extraction and medical knowledge trade-off
scoring. Compared to traditional health management applications, our system has
three main advantages: (1) It integrates health reports and medical knowledge
into a large model to pose relevant questions to a large language model for
disease prediction; (2) It leverages a retrieval augmented generation (RAG)
mechanism to enhance feature extraction; (3) It incorporates a semi-automated
feature updating framework that can merge and delete features to improve
accuracy of disease prediction. We experiment on a large number of health
reports to assess the effectiveness of the Health-LLM system. The results indicate
that the proposed system surpasses the existing ones and has the potential to
significantly advance disease prediction and personalized health management.
[LINK]
http://arxiv.org/abs/2402.00746v7
[DATE]
2024-10-01 01:22:01+08:00
[CATEGORIES]
cs.CL
Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties
[AUTHORS]
Keunwoo Peter Yu, Zheyuan Zhang, Fengyuan Hu, Shane Storks, Joyce Chai
[ABSTRACT]
A major reason behind the recent success of large language models (LLMs) is
their in-context learning capability, which makes it possible to
rapidly adapt them to downstream text-based tasks by prompting them with a
small number of relevant demonstrations. While large vision-language models
(VLMs) have recently been developed for tasks requiring both text and images,
they largely lack in-context learning over visual information, especially in
understanding and generating text about videos. In this work, we implement
Emergent In-context Learning on Videos
(EILeV), a novel training paradigm that induces in-context learning over
video and text by capturing key properties of pre-training data found by prior
work to be essential for in-context learning in transformers. In our
experiments, we show that EILeV-trained models outperform other off-the-shelf
VLMs in few-shot video narration for novel, rare actions. Furthermore, we
demonstrate that these key properties of bursty distributions, skewed marginal
distributions, and dynamic meaning each contribute to varying degrees to VLMs’
in-context learning capability in narrating procedural videos. Our results,
analysis, and EILeV-trained models yield numerous insights about the
emergence of in-context learning over video and text, creating a foundation for
future work to optimize and scale VLMs for open-domain video understanding and
reasoning. Our code and demo are available at
https://github.com/yukw777/EILEV.
[COMMENTS]
16 pages, LaTeX; Accepted to EMNLP 2024 Main
[LINK]
http://arxiv.org/abs/2311.17041v3
[DATE]
2024-10-01 01:12:39+08:00
[CATEGORIES]
cs.CL
Enhancing Romanian Offensive Language Detection through Knowledge Distillation, Multi-Task Learning, and Data Augmentation
[AUTHORS]
Vlad-Cristian Matei, Iulian-Marius Tăiatu, Răzvan-Alexandru Smădu, Dumitru-Clementin Cercel
[ABSTRACT]
This paper highlights the significance of natural language processing (NLP)
within artificial intelligence, underscoring its pivotal role in comprehending
and modeling human language. Recent advancements in NLP, particularly in
conversational bots, have garnered substantial attention and adoption among
developers. This paper explores advanced methodologies for attaining smaller
and more efficient NLP models. Specifically, we employ three key approaches:
(1) training a Transformer-based neural network to detect offensive language,
(2) employing data augmentation and knowledge distillation techniques to
increase performance, and (3) incorporating multi-task learning with knowledge
distillation and teacher annealing using diverse datasets to enhance
efficiency. The culmination of these methods has yielded demonstrably improved
outcomes.
[COMMENTS]
Accepted by NLDB2024
[LINK]
http://arxiv.org/abs/2409.20498v1
[DATE]
2024-10-01 00:59:48+08:00
[CATEGORIES]
cs.CL
Text Clustering as Classification with LLMs
[AUTHORS]
Chen Huang, Guoxiu He
[ABSTRACT]
Text clustering remains valuable in real-world applications where manual
labeling is cost-prohibitive. It facilitates efficient organization and
analysis of information by grouping similar texts based on their
representations. However, implementing this approach necessitates fine-tuned
embedders for downstream data and sophisticated similarity metrics. To address
this issue, this study presents a novel framework for text clustering that
effectively leverages the in-context learning capacity of Large Language Models
(LLMs). Instead of fine-tuning embedders, we propose to transform text
clustering into a classification task via an LLM. First, we prompt the LLM to generate
potential labels for a given dataset. Second, after integrating similar labels
generated by the LLM, we prompt the LLM to assign the most appropriate label to
each sample in the dataset. Our framework has been experimentally proven to
achieve comparable or superior performance to state-of-the-art clustering
methods that employ embeddings, without requiring complex fine-tuning or
clustering algorithms. We make our code available to the public for utilization
at https://anonymous.4open.science/r/Text-Clustering-via-LLM-E500.
[COMMENTS]
12 pages, 3 figures
[LINK]
http://arxiv.org/abs/2410.00927v1
[DATE]
2024-10-01 00:57:34+08:00
[CATEGORIES]
cs.CL
Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface
[AUTHORS]
Wenyue Hua, Mengting Wan, Shashank Vadrevu, Ryan Nadel, Yongfeng Zhang, Chi Wang
[ABSTRACT]
Agents, as user-centric tools, are increasingly deployed for human task
delegation, assisting with a broad spectrum of requests by generating thoughts,
engaging with user proxies, and producing action plans. However, agents based
on large language models (LLMs) often face substantial planning latency due to
two primary factors: the efficiency limitations of the underlying LLMs due to
their large size and high demand, and the structural complexity of the agents
due to the extensive generation of intermediate thoughts to produce the final
output. Given that inefficiency in service provision can undermine the value of
automation for users, this paper presents a human-centered efficient agent
planning method – Interactive Speculative Planning – aiming at enhancing the
efficiency of agent planning through both system design and human-AI
interaction. Our approach advocates for the co-design of the agent system and
user interface, underscoring the importance of an agent system that can fluidly
manage user interactions and interruptions. By integrating human interruptions
as a fundamental component of the system, we not only make it more user-centric
but also expedite the entire process by leveraging human-in-the-loop
interactions to provide accurate intermediate steps. Code and data will be
released.
[COMMENTS]
27 pages, 22 figures
[LINK]
http://arxiv.org/abs/2410.00079v1
[DATE]
2024-10-01 00:52:51+08:00
[CATEGORIES]
cs.CL
cs.LG
The African Woman is Rhythmic and Soulful: An Investigation of Implicit Biases in LLM Open-ended Text Generation
[AUTHORS]
Serene Lim, María Pérez-Ortiz
[ABSTRACT]
This paper investigates the subtle and often concealed biases present in
Large Language Models (LLMs), focusing on implicit biases that may remain
despite passing explicit bias tests. Implicit biases are significant because
they influence the decisions made by these systems, potentially perpetuating
stereotypes and discrimination, even when LLMs appear to function fairly.
Traditionally, explicit bias tests or embedding-based methods are employed to
detect bias, but these approaches can overlook more nuanced, implicit forms of
bias. To address this, we introduce two novel psychologically inspired
methodologies: the LLM Implicit Association Test (IAT) Bias and the LLM
Decision Bias, designed to reveal and measure implicit biases through
prompt-based and decision-making tasks. Additionally, open-ended generation
tasks with thematic analysis of word generations and storytelling provide
qualitative insights into the model’s behavior. Our findings demonstrate that
the LLM IAT Bias correlates with traditional methods and more effectively
predicts downstream behaviors, as measured by the LLM Decision Bias, offering a
more comprehensive framework for detecting subtle biases in AI systems. This
research advances the field of AI ethics by proposing new methods to
continually assess and mitigate biases in LLMs, highlighting the importance of
qualitative and decision-focused evaluations to address challenges that
previous approaches have not fully captured.
[LINK]
http://arxiv.org/abs/2407.01270v2
[DATE]
2024-10-01 00:39:51+08:00
[CATEGORIES]
cs.CL
A Weakly Supervised Data Labeling Framework for Machine Lexical Normalization in Vietnamese Social Media
[AUTHORS]
Dung Ha Nguyen, Anh Thi Hoang Nguyen, Kiet Van Nguyen
[ABSTRACT]
This study introduces an innovative automatic labeling framework to address
the challenges of lexical normalization in social media texts for low-resource
languages like Vietnamese. Social media data is rich and diverse, but the
evolving and varied language used in these contexts makes manual labeling
labor-intensive and expensive. To tackle these issues, we propose a framework
that integrates semi-supervised learning with weak supervision techniques. This
approach enhances the quality of the training dataset and expands its size while
minimizing manual labeling efforts. Our framework automatically labels raw
data, converting non-standard vocabulary into standardized forms, thereby
improving the accuracy and consistency of the training data. Experimental
results demonstrate the effectiveness of our weak supervision framework in
normalizing Vietnamese text, especially when utilizing Pre-trained Language
Models. The proposed framework achieves an impressive F1-score of 82.72% and
maintains vocabulary integrity with an accuracy of up to 99.22%. Additionally,
it effectively handles undiacritized text under various conditions. This
framework significantly enhances natural language normalization quality and
improves the accuracy of various NLP tasks, leading to an average accuracy
increase of 1-3%.
[LINK]
http://arxiv.org/abs/2409.20467v1
[DATE]
2024-10-01 00:26:40+08:00
[CATEGORIES]
cs.CL
GAMMA-PD: Graph-based Analysis of Multi-Modal Motor Impairment Assessments in Parkinson’s Disease
[AUTHORS]
Favour Nerrise, Alice Louise Heiman, Ehsan Adeli
[ABSTRACT]
The rapid advancement of medical technology has led to an exponential
increase in multi-modal medical data, including imaging, genomics, and
electronic health records (EHRs). Graph neural networks (GNNs) have been widely
used to represent this data due to their prominent performance in capturing
pairwise relationships. However, the heterogeneity and complexity of
multi-modal medical data still pose significant challenges for standard GNNs,
which struggle with learning higher-order, non-pairwise relationships. This
paper proposes GAMMA-PD (Graph-based Analysis of Multi-modal Motor Impairment
Assessments in Parkinson’s Disease), a novel heterogeneous hypergraph fusion
framework for multi-modal clinical data analysis. GAMMA-PD integrates imaging
and non-imaging data into a “hypernetwork” (patient population graph) by
preserving higher-order information and similarity between patient profiles and
symptom subtypes. We also design a feature-based attention-weighted mechanism
to interpret feature-level contributions towards downstream decision tasks. We
evaluate our approach with clinical data from the Parkinson’s Progression
Markers Initiative (PPMI) and a private dataset. We demonstrate gains in
predicting motor impairment symptoms in Parkinson’s disease. Our end-to-end
framework also learns associations between subsets of patient characteristics
to generate clinically relevant explanations for disease and symptom profiles.
The source code is available at https://github.com/favour-nerrise/GAMMA-PD.
[COMMENTS]
Accepted by the 6th Workshop on GRaphs in biomedicAl Image anaLysis
(GRAIL) at the 27th International Conference on Medical Image Computing and
Computer Assisted Intervention (MICCAI 2024). 12 pages, 3 figures, 2 tables,
Source Code: https://github.com/favour-nerrise/GAMMA-PD
[LINK]
http://arxiv.org/abs/2410.00944v1
[DATE]
2024-10-01 23:51:33+08:00
[CATEGORIES]
cs.LG
Mitigating Shortcut Learning with Diffusion Counterfactuals and Diverse Ensembles
[AUTHORS]
Luca Scimeca, Alexander Rubinstein, Damien Teney, Seong Joon Oh, Armand Mihai Nicolicioiu, Yoshua Bengio
[ABSTRACT]
Spurious correlations in the data, where multiple cues are predictive of the
target labels, often lead to a phenomenon known as shortcut learning, where a
model relies on erroneous, easy-to-learn cues while ignoring reliable ones. In
this work, we propose DiffDiv, an ensemble diversification framework exploiting
Diffusion Probabilistic Models (DPMs) to mitigate this form of bias. We show
that at particular training intervals, DPMs can generate images with novel
feature combinations, even when trained on samples displaying correlated input
features. We leverage this crucial property to generate synthetic
counterfactuals to increase model diversity via ensemble disagreement. We show
that DPM-guided diversification is sufficient to remove dependence on shortcut
cues, without a need for additional supervised signals. We further empirically
quantify its efficacy on several diversification objectives, and finally show
improved generalization and diversification on par with prior work that relies
on auxiliary data collection.
[COMMENTS]
arXiv admin note: substantial text overlap with arXiv:2310.02230
[LINK]
http://arxiv.org/abs/2311.16176v4
[DATE]
2024-10-01 23:50:57+08:00
[CATEGORIES]
cs.LG
Fast and Reliable $N-k$ Contingency Screening with Input-Convex Neural Networks
[AUTHORS]
Nicolas Christianson, Wenqi Cui, Steven Low, Weiwei Yang, Baosen Zhang
[ABSTRACT]
Power system operators must ensure that dispatch decisions remain feasible in
case of grid outages or contingencies to prevent cascading failures and ensure
reliable operation. However, checking the feasibility of all $N - k$
contingencies – every possible simultaneous failure of $k$ grid components –
is computationally intractable for even small $k$, requiring system operators
to resort to heuristic screening methods. Because of the increase in
uncertainty and changes in system behaviors, heuristic lists might not include
all relevant contingencies, generating false negatives in which unsafe
scenarios are misclassified as safe. In this work, we propose to use
input-convex neural networks (ICNNs) for contingency screening. We show that
ICNN reliability can be determined by solving a convex optimization problem,
and by scaling model weights using this problem as a differentiable
optimization layer during training, we can learn an ICNN classifier that is
both data-driven and has provably guaranteed reliability. Namely, our method
can ensure a zero false negative rate. We empirically validate this methodology
in a case study on the IEEE 39-bus test network, observing that it yields
substantial (10-20x) speedups while having excellent classification accuracy.
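The intractability claim above follows from simple counting: with $N$ components there are $\binom{N}{k}$ ways for exactly $k$ of them to fail at once. The short sketch below tabulates that growth; the helper name and the value N = 46 are illustrative assumptions, not figures taken from the abstract.

```python
from math import comb


def num_contingencies(n_components, k):
    """Number of distinct scenarios in which exactly k of n components fail."""
    return comb(n_components, k)


# Hypothetical component count, chosen only for illustration.
N = 46
for k in range(1, 5):
    print(f"k={k}: {num_contingencies(N, k):,} contingencies")
```

Even for this small assumed network, the count grows from 46 scenarios at k = 1 to 163,185 at k = 4, and each scenario requires its own feasibility check, which is what motivates a fast learned screening step.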
[COMMENTS]
11 pages, 4 figures
[LINK]
http://arxiv.org/abs/2410.00796v1
[DATE]
2024-10-01 23:38:09+08:00
[CATEGORIES]
cs.LG
NeuroPath: A Neural Pathway Transformer for Joining the Dots of Human Connectomes
[AUTHORS]
Ziquan Wei, Tingting Dan, Jiaqi Ding, Guorong Wu
[ABSTRACT]
Although modern imaging technologies allow us to study connectivity between
two distinct brain regions in-vivo, an in-depth understanding of how anatomical
structure supports brain function and how spontaneous functional fluctuations
give rise to remarkable cognition is still elusive. Meanwhile, tremendous efforts
have been made in the realm of machine learning to establish the nonlinear
mapping between neuroimaging data and phenotypic traits. However, the absence
of neuroscience insight in the current approaches poses significant challenges
in understanding cognitive behavior from transient neural activities. To
address this challenge, we put the spotlight on the coupling mechanism of
structural connectivity (SC) and functional connectivity (FC) by formulating
this network neuroscience question as an expressive graph representation
learning problem for high-order topology. Specifically, we introduce the
concept of topological detour to characterize how a ubiquitous instance of FC
(direct link) is supported by neural pathways (detour) physically wired by SC,
which forms a cyclic loop in which brain structure and function interact. In the
cliché of machine learning, the multi-hop detour pathway underlying SC-FC
coupling allows us to devise a novel multi-head self-attention mechanism within
Transformer to capture multi-modal feature representation from paired graphs of
SC and FC. Taken together, we propose a biologically inspired deep model, coined
NeuroPath, to find putative connectomic feature representations from the
unprecedented amount of neuroimages, which can be plugged into various
downstream applications such as task recognition and disease diagnosis. We have
evaluated NeuroPath on large-scale public datasets including HCP and UK Biobank
under supervised and zero-shot learning, where the state-of-the-art performance
by our NeuroPath indicates great potential in network neuroscience.
[COMMENTS]
Accepted by NeurIPS 2024
[LINK]
http://arxiv.org/abs/2409.17510v2
[DATE]
2024-10-01 23:23:56+08:00
[CATEGORIES]
cs.LG
Divide And Conquer: Learning Chaotic Dynamical Systems With Multistep Penalty Neural Ordinary Differential Equations
[AUTHORS]
Dibyajyoti Chakraborty, Seung Whan Chung, Troy Arcomano, Romit Maulik
[ABSTRACT]
Forecasting high-dimensional dynamical systems is a fundamental challenge in
various fields, such as geosciences and engineering. Neural Ordinary
Differential Equations (NODEs), which combine the power of neural networks and
numerical solvers, have emerged as a promising algorithm for forecasting
complex nonlinear dynamical systems. However, classical techniques used for
NODE training are ineffective for learning chaotic dynamical systems. In this
work, we propose a novel NODE-training approach that allows for robust learning
of chaotic dynamical systems. Our method addresses the challenges of
non-convexity and exploding gradients associated with underlying chaotic
dynamics. Training data trajectories from such systems are split into multiple,
non-overlapping time windows. In addition to the deviation from the training
data, the optimization loss term further penalizes the discontinuities of the
predicted trajectory between the time windows. The window size is selected
based on the fastest Lyapunov time scale of the system. The multistep penalty
(MP) method is first demonstrated on the Lorenz equation to illustrate how it
improves the loss landscape and thereby accelerates optimization convergence.
The MP method can optimize chaotic systems in a manner similar to least-squares
shadowing with significantly lower computational costs. Our proposed algorithm,
denoted the Multistep Penalty NODE (MP-NODE), is applied to chaotic systems such
as the Kuramoto-Sivashinsky equation, the two-dimensional Kolmogorov flow, and
ERA5 reanalysis data for the atmosphere. It is observed that MP-NODE provides viable
performance for such chaotic systems, not only for short-term trajectory
predictions but also for invariant statistics that are hallmarks of the chaotic
nature of these dynamics.
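
The windowed loss described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses explicit Euler integration, a quadratic penalty with a fixed weight `mu`, and treats the start state of each window (`starts`) as a free variable. The names `rollout` and `multistep_penalty_loss` are hypothetical.

```python
import numpy as np

def rollout(f, x0, n_steps, dt):
    """Explicit Euler rollout of dx/dt = f(x); returns states at steps 0..n_steps."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * f(xs[-1]))
    return np.stack(xs)

def multistep_penalty_loss(f, data, starts, window_len, dt, mu):
    """Data mismatch inside each window plus a penalty on jumps between windows.

    data   : array of shape (n_windows * window_len, d), the training trajectory
    starts : array of shape (n_windows, d), start state of each window
    mu     : penalty weight on discontinuities (an assumed hyperparameter)
    """
    n_windows = len(starts)
    data_term, penalty = 0.0, 0.0
    for k in range(n_windows):
        pred = rollout(f, starts[k], window_len, dt)
        target = data[k * window_len:(k + 1) * window_len]
        # Deviation from the training data within window k.
        data_term += np.sum((pred[:window_len] - target) ** 2)
        # Discontinuity between the end of window k and the start of window k+1.
        if k + 1 < n_windows:
            penalty += np.sum((pred[window_len] - starts[k + 1]) ** 2)
    return data_term + mu * penalty
```

When the vector field `f` reproduces the data exactly and the window starts lie on the true trajectory, both terms vanish. Perturbing a window start makes both the data term and the penalty positive.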
[COMMENTS]
25 pages, 17 Figures, submitted to Computer Methods in Applied
Mechanics and Engineering
[LINK]
http://arxiv.org/abs/2407.00568v4
[DATE]
2024-10-01 23:19:42+08:00
[CATEGORIES]
cs.LG
Adaptive Motion Generation Using Uncertainty-Driven Foresight Prediction
[AUTHORS]
Hyogo Hiruma, Hiroshi Ito, Tetusya Ogata
[ABSTRACT]
Uncertainty of environments has long been a difficult characteristic to
handle when performing real-world robot tasks. This is because the uncertainty
produces unexpected observations that cannot be covered by manual scripting.
Learning-based robot control methods are a promising approach for
generating flexible motions in unknown situations, but still tend to
suffer under uncertainty due to their deterministic nature. In order to
adaptively perform the target task under such conditions, the robot control
model must be able to accurately understand the possible uncertainty, and to
exploratively derive the optimal action that minimizes such uncertainty. This
paper extends an existing predictive-learning-based robot control method,
which employs foresight prediction using dynamic internal simulation. The
foresight module refines the model’s hidden states by sampling multiple
possible futures and replacing them with the one that leads to the lowest future
uncertainty. The adaptiveness of the model was evaluated on a door opening
task. The door can be opened either by pushing, pulling, or sliding, but the robot
cannot visually distinguish which way, and is required to adapt on the fly. The
results showed that the proposed model adaptively diverged its motion through
interaction with the door, whereas conventional methods failed to stably
diverge. The models were analyzed on Lyapunov exponents of RNN hidden states
which reflect the possible divergence at each time step during task execution.
The result indicated that the foresight module biased the model to consider
future consequences, which leads to embedding uncertainties in the policy of the
robot controller rather than in the resultant observation. This is beneficial for
implementing adaptive behaviors, as it induces the derivation of diverse motions
during exploration.
[LINK]
http://arxiv.org/abs/2410.00774v1
[DATE]
2024-10-01 23:13:27+08:00
[CATEGORIES]
cs.LG
Targeted synthetic data generation for tabular data via hardness characterization
[AUTHORS]
Tommaso Ferracci, Leonie Tabea Goldmann, Anton Hinel, Francesco Sanna Passino
[ABSTRACT]
Synthetic data generation has been proven successful in improving model
performance and robustness in the context of scarce or low-quality data. Using
the data valuation framework to statistically identify beneficial and
detrimental observations, we introduce a novel augmentation pipeline that
generates only high-value training points based on hardness characterization.
We first demonstrate via benchmarks on real data that Shapley-based data
valuation methods perform comparably with learning-based methods in hardness
characterization tasks, while offering significant theoretical and
computational advantages. Then, we show that synthetic data generators trained
on the hardest points outperform non-targeted data augmentation on simulated
data and on a large scale credit default prediction task. In particular, our
approach improves the quality of out-of-sample predictions and it is
computationally more efficient compared to non-targeted methods.
[LINK]
http://arxiv.org/abs/2410.00759v1
[DATE]
2024-10-01 22:54:26+08:00
[CATEGORIES]
cs.LG
RisingBALLER: A player is a token, a match is a sentence, A path towards a foundational model for football players data analytics
[AUTHORS]
Akedjou Achraff Adjileye
[ABSTRACT]
In this paper, I introduce RisingBALLER, the first publicly available
approach that leverages a transformer model trained on football match data to
learn match-specific player representations. Drawing inspiration from advances
in language modeling, RisingBALLER treats each football match as a unique
sequence in which players serve as tokens, with their embeddings shaped by the
specific context of the match. Through the use of masked player prediction
(MPP) as a pre-training task, RisingBALLER learns foundational features for
football player representations, similar to how language models learn semantic
features for text representations. As a downstream task, I introduce next match
statistics prediction (NMSP) to showcase the effectiveness of the learned
player embeddings. The NMSP model surpasses a strong baseline commonly used for
performance forecasting within the community. Furthermore, I conduct an
in-depth analysis to demonstrate how the learned embeddings by RisingBALLER can
be used in various football analytics tasks, such as producing meaningful
positional features that capture the essence and variety of player roles beyond
rigid x,y coordinates, team cohesion estimation, and similar player retrieval
for more effective data-driven scouting. More than a simple machine learning
model, RisingBALLER is a comprehensive framework designed to transform football
data analytics by learning high-level foundational features for players, taking
into account the context of each match. It offers a deeper understanding of
football players beyond individual statistics.
[COMMENTS]
18 pages, 6 figures. The paper will be presented at the StatsBomb
Conference 2024 (https://statsbomb.com/events/statsbomb-conference-2024/)
[LINK]
http://arxiv.org/abs/2410.00943v1
[DATE]
2024-10-01 22:39:22+08:00
[CATEGORIES]
cs.LG
FELRec: Efficient Handling of Item Cold-Start With Dynamic Representation Learning in Recommender Systems
[AUTHORS]
Kuba Weimann, Tim O. F. Conrad
[ABSTRACT]
Recommender systems suffer from the cold-start problem whenever a new user
joins the platform or a new item is added to the catalog. To address item
cold-start, we propose to replace the embedding layer in sequential
recommenders with a dynamic storage that has no learnable weights and can keep
an arbitrary number of representations. In this paper, we present FELRec, a
large embedding network that refines the existing representations of users and
items in a recursive manner, as new information becomes available. In contrast
to similar approaches, our model represents new users and items without side
information and time-consuming finetuning; instead, it runs a single forward
pass over a sequence of existing representations. During item cold-start, our
method outperforms similar methods by 29.50%-47.45%. Further, our proposed model
generalizes well to previously unseen datasets in zero-shot settings. The
source code is publicly available at https://github.com/kweimann/FELRec .
[LINK]
http://arxiv.org/abs/2210.16928v2
[DATE]
2024-10-01 22:39:12+08:00
[CATEGORIES]
cs.LG
WALINET: A water and lipid identification convolutional Neural Network for nuisance signal removal in 1H MR Spectroscopic Imaging
[AUTHORS]
Paul Weiser, Georg Langs, Stanislav Motyka, Wolfgang Bogner, Sébastien Courvoisier, Malte Hoffmann, Antoine Klauser, Ovidiu C. Andronesi
[ABSTRACT]
Purpose. Proton Magnetic Resonance Spectroscopic Imaging (1H-MRSI) provides
non-invasive spectral-spatial mapping of metabolism. However, long-standing
problems in whole-brain 1H-MRSI are spectral overlap of metabolite peaks with
large lipid signal from scalp, and overwhelming water signal that distorts
spectra. Fast and effective methods are needed for high-resolution 1H-MRSI to
accurately remove lipid and water signals while preserving the metabolite
signal. The potential of supervised neural networks for this task remains
unexplored, despite their success for other MRSI processing.
Methods. We introduce a deep-learning method based on a modified Y-NET
network for water and lipid removal in whole-brain 1H-MRSI. The WALINET (WAter
and LIpid neural NETwork) was compared to conventional methods such as the
state-of-the-art lipid L2 regularization and Hankel-Lanczos singular value
decomposition (HLSVD) water suppression. Methods were evaluated on simulated
and in-vivo whole-brain MRSI using NRMSE, SNR, CRLB, and FWHM metrics.
Results. WALINET is significantly faster and needs 8s for high-resolution
whole-brain MRSI, compared to 42 minutes for conventional HLSVD+L2.
Quantitative analysis shows WALINET has better performance than HLSVD+L2: 1)
more lipid removal with 41% lower NRMSE, 2) better metabolite signal
preservation with 71% lower NRMSE in simulated data, 155% higher SNR and 50%
lower CRLB in in-vivo data. Metabolic maps obtained by WALINET in healthy
subjects and patients show better gray/white-matter contrast with more visible
structural details.
Conclusions. WALINET has superior performance for nuisance signal removal and
metabolite quantification on whole-brain 1H-MRSI compared to conventional
state-of-the-art techniques. This represents a new application of deep-learning
for MRSI processing, with potential for automated high-throughput workflow.
[LINK]
http://arxiv.org/abs/2410.00746v1
[DATE]
2024-10-01 22:37:55+08:00
[CATEGORIES]
cs.LG
MobileMEF: Fast and Efficient Method for Multi-Exposure Fusion
[AUTHORS]
Lucas Nedel Kirsten, Zhicheng Fu, Nikhil Ambha Madhusudhana
[ABSTRACT]
Recent advances in camera design and imaging technology have enabled the
capture of high-quality images using smartphones. However, due to the limited
dynamic range of digital cameras, photographs captured in environments with
highly imbalanced lighting are often of poor quality. To address this issue,
most devices capture multi-exposure frames and
then use some multi-exposure fusion method to merge those frames into a final
fused image. Nevertheless, most traditional and current deep learning
approaches are unsuitable for real-time applications on mobile devices due to
their heavy computational and memory requirements. We propose a new method for
multi-exposure fusion based on an encoder-decoder deep learning architecture
with efficient building blocks tailored for mobile devices. This efficient
design makes our model capable of processing 4K resolution images in less than
2 seconds on mid-range smartphones. Our method outperforms state-of-the-art
techniques regarding full-reference quality measures and computational
efficiency (runtime and memory usage), making it ideal for real-time
applications on hardware-constrained devices. Our code is available at:
https://github.com/LucasKirsten/MobileMEF.
[LINK]
http://arxiv.org/abs/2408.07932v2
[DATE]
2024-10-01 22:26:16+08:00
[CATEGORIES]
cs.LG
Improved Generation of Synthetic Imaging Data Using Feature-Aligned Diffusion
[AUTHORS]
Lakshmi Nair
[ABSTRACT]
Synthetic data generation is an important application of machine learning in
the field of medical imaging. While existing approaches have successfully
applied fine-tuned diffusion models for synthesizing medical images, we explore
potential improvements to this pipeline through feature-aligned diffusion. Our
approach aligns intermediate features of the diffusion model to the output
features of an expert, and our preliminary findings show an improvement of 9%
in generation accuracy and ~0.12 in SSIM diversity. Our approach is also
synergistic with existing methods, and easily integrated into diffusion
training pipelines for improvements. We make our code available at
https://github.com/lnairGT/Feature-Aligned-Diffusion.
[COMMENTS]
Accepted to First International Workshop on Vision-Language Models
for Biomedical Applications (VLM4Bio 2024) at the 32nd ACM-Multimedia
conference
[LINK]
http://arxiv.org/abs/2410.00731v1
[DATE]
2024-10-01 22:18:09+08:00
[CATEGORIES]
cs.LG
Simplified priors for Object-Centric Learning
[AUTHORS]
Vihang Patil, Andreas Radler, Daniel Klotz, Sepp Hochreiter
[ABSTRACT]
Humans excel at abstracting data and constructing reusable concepts, a
capability lacking in current continual learning systems. The field of
object-centric learning addresses this by developing abstract representations,
or slots, from data without human supervision. Different methods have been
proposed to tackle this task for images, but most are overly complex,
non-differentiable, or poorly scalable. In this paper, we introduce a
conceptually simple, fully-differentiable, non-iterative, and scalable method
called SAMP (Simplified Slot Attention with Max Pool Priors). It is
implementable using only Convolution and MaxPool layers and an Attention layer.
Our method encodes the input image with a Convolutional Neural Network and then
uses a branch of alternating Convolution and MaxPool layers to create
specialized sub-networks and extract primitive slots. These primitive slots are
then used as queries for a Simplified Slot Attention over the encoded image.
Despite its simplicity, our method is competitive with or outperforms previous
methods on standard benchmarks.
[LINK]
http://arxiv.org/abs/2410.00728v1
[DATE]
2024-10-01 22:16:13+08:00
[CATEGORIES]
cs.LG
Enhancing GANs with Contrastive Learning-Based Multistage Progressive Finetuning SNN and RL-Based External Optimization
[AUTHORS]
Osama Mustafa
[ABSTRACT]
The application of deep learning in cancer research, particularly in early
diagnosis, case understanding, and treatment strategy design, emphasizes the
need for high-quality data. Generative AI, especially Generative Adversarial
Networks (GANs), has emerged as a leading solution to challenges like class
imbalance, robust learning, and model training, while addressing issues
stemming from patient privacy and the scarcity of real data. Despite their
promise, GANs face several challenges, both inherent and specific to
histopathology data. Inherent issues include training imbalance, mode collapse,
linear learning from insufficient discriminator feedback, and hard boundary
convergence due to stringent feedback. Histopathology data presents a unique
challenge with its complex representation, high spatial resolution, and
multiscale features. To address these challenges, we propose a framework
consisting of two components. First, we introduce a contrastive learning-based
Multistage Progressive Finetuning Siamese Neural Network (MFT-SNN) for
assessing the similarity between histopathology patches. Second, we implement a
Reinforcement Learning-based External Optimizer (RL-EO) within the GAN training
loop, serving as a reward signal generator. The modified discriminator loss
function incorporates a weighted reward, guiding the GAN to maximize this
reward while minimizing loss. This approach offers an external optimization
guide to the discriminator, preventing generator overfitting and ensuring
smooth convergence. Our proposed solution has been benchmarked against
state-of-the-art (SOTA) GANs and a Denoising Diffusion Probabilistic model,
outperforming previous SOTA across various metrics, including FID score, KID
score, Perceptual Path Length, and downstream classification tasks.
[LINK]
http://arxiv.org/abs/2409.20340v2
[DATE]
2024-10-01 22:14:32+08:00
[CATEGORIES]
cs.LG
On the Geometry and Optimization of Polynomial Convolutional Networks
[AUTHORS]
Vahid Shahverdi, Giovanni Luca Marchetti, Kathlén Kohn
[ABSTRACT]
We study convolutional neural networks with monomial activation functions.
Specifically, we prove that their parameterization map is regular and is an
isomorphism almost everywhere, up to rescaling the filters. By leveraging
tools from algebraic geometry, we explore the geometric properties of the image
in function space of this map – typically referred to as the neuromanifold. In
particular, we compute the dimension and the degree of the neuromanifold, which
measure the expressivity of the model, and describe its singularities.
Moreover, for a generic large dataset, we derive an explicit formula that
quantifies the number of critical points arising in the optimization of a
regression loss.
[LINK]
http://arxiv.org/abs/2410.00722v1
[DATE]
2024-10-01 22:13:05+08:00
[CATEGORIES]
cs.LG
Pseudo-Non-Linear Data Augmentation via Energy Minimization
[AUTHORS]
Pingbang Hu, Mahito Sugiyama
[ABSTRACT]
We propose a novel and interpretable data augmentation method based on
energy-based modeling and principles from information geometry. Unlike
black-box generative models, which rely on deep neural networks, our approach
replaces these non-interpretable transformations with explicit, theoretically
grounded ones, ensuring interpretability and strong guarantees such as energy
minimization. Central to our method is the introduction of the backward
projection algorithm, which reverses dimension reduction to generate new data.
Empirical results demonstrate that our method achieves competitive performance
with black-box generative models while offering greater transparency and
interpretability.
[LINK]
http://arxiv.org/abs/2410.00718v1
[DATE]
2024-10-01 22:08:22+08:00
[CATEGORIES]
cs.LG
Low-Energy On-Device Personalization for MCUs
[AUTHORS]
Yushan Huang, Ranya Aloufi, Xavier Cadet, Yuchen Zhao, Payam Barnaghi, Hamed Haddadi
[ABSTRACT]
Microcontroller Units (MCUs) are ideal platforms for edge applications due to
their low cost and energy consumption, and are widely used in various
applications, including personalized machine learning tasks, where customized
models can enhance the task adaptation. However, existing approaches for local
on-device personalization mostly support simple ML architectures or require
complex local pre-training/training, leading to high energy consumption and
negating the low-energy advantage of MCUs. In this paper, we introduce
MicroT, an efficient and low-energy MCU personalization approach. MicroT
includes a robust, general, but tiny feature extractor, developed through
self-supervised knowledge distillation, which trains a task-specific head to
enable independent on-device personalization with minimal energy and
computational requirements. MicroT implements an MCU-optimized early-exit
inference mechanism called stage-decision to further reduce energy costs. This
mechanism allows for user-configurable exit criteria (stage-decision ratio) to
adaptively balance energy cost with model performance. We evaluated MicroT
using two models, three datasets, and two MCU boards. MicroT outperforms
traditional transfer learning (TTL) and two SOTA approaches by 2.12 - 11.60%
across two models and three datasets. Targeting widely used energy-aware edge
devices, MicroT’s on-device training requires no additional complex operations,
reducing the energy cost compared to SOTA approaches by up to 2.28X while
keeping SRAM usage below 1MB. During local inference, MicroT reduces energy
cost by 14.17% compared to TTL across two boards and two datasets, highlighting
its suitability for long-term use on energy-aware resource-constrained MCUs.
[COMMENTS]
Accepted to The 9th ACM/IEEE Symposium on Edge Computing (SEC 2024)
[LINK]
http://arxiv.org/abs/2403.08040v4
[DATE]
2024-10-01 22:08:10+08:00
[CATEGORIES]
cs.LG
AR-Sieve Bootstrap for the Random Forest and a simulation-based comparison with rangerts time series prediction
[AUTHORS]
Cabrel Teguemne Fokam, Carsten Jentsch, Michel Lang, Markus Pauly
[ABSTRACT]
The Random Forest (RF) algorithm can be applied to a broad spectrum of
problems, including time series prediction. However, neither the classical IID
(Independent and Identically distributed) bootstrap nor block bootstrapping
strategies (as implemented in rangerts) completely account for the nature of
the Data Generating Process (DGP) while resampling the observations. We propose
the combination of RF with a residual bootstrapping technique where we replace
the IID bootstrap with the AR-Sieve Bootstrap (ARSB), which assumes the DGP to
be an autoregressive process. To assess the new model’s predictive performance,
we conduct a simulation study using synthetic data generated from different
types of DGPs. It turns out that ARSB provides more variation amongst the trees
in the forest. Moreover, RF with ARSB shows greater accuracy compared to RF
with other bootstrap strategies. However, these improvements are achieved at
some efficiency costs.
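
For readers unfamiliar with the resampling step, a residual bootstrap based on a fitted autoregressive model can be sketched as below. This is only an illustration under simplifying assumptions: the AR order `p` is fixed by the caller (a sieve would let it grow with the sample size, for example via an information criterion), the coefficients are fitted by ordinary least squares, and the coupling with the Random Forest is not shown. The function name `ar_sieve_bootstrap` is hypothetical and does not refer to any package API.

```python
import numpy as np

def ar_sieve_bootstrap(y, p, rng):
    """Return one bootstrap replicate of y and the fitted AR(p) coefficients."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    mean = y.mean()
    yc = y - mean
    # Lagged design matrix: the row for time t holds yc[t-1], ..., yc[t-p].
    X = np.column_stack([yc[p - j - 1:n - j - 1] for j in range(p)])
    target = yc[p:]
    phi, *_ = np.linalg.lstsq(X, target, rcond=None)
    # Centered residuals of the fitted AR(p) model.
    resid = target - X @ phi
    resid = resid - resid.mean()
    # Draw residuals with replacement and rebuild the series recursively.
    eps = rng.choice(resid, size=n, replace=True)
    y_star = np.empty(n)
    y_star[:p] = yc[:p]  # reuse the first p observations as start values
    for t in range(p, n):
        y_star[t] = phi @ y_star[t - p:t][::-1] + eps[t]
    return y_star + mean, phi
```

Each tree in the forest would then be grown on a different replicate `y_star` instead of an IID resample of the observations.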
[LINK]
http://arxiv.org/abs/2410.00942v1
[DATE]
2024-10-01 22:07:58+08:00
[CATEGORIES]
cs.LG
NECOMIMI: Neural-Cognitive Multimodal EEG-informed Image Generation with Diffusion Models
[AUTHORS]
Chi-Sheng Chen
[ABSTRACT]
NECOMIMI (NEural-COgnitive MultImodal EEG-Informed Image Generation with
Diffusion Models) introduces a novel framework for generating images directly
from EEG signals using advanced diffusion models. Unlike previous works that
focused solely on EEG-image classification through contrastive learning,
NECOMIMI extends this task to image generation. The proposed NERV EEG encoder
demonstrates state-of-the-art (SoTA) performance across multiple zero-shot
classification tasks, including 2-way, 4-way, and 200-way, and achieves top
results in our newly proposed Category-based Assessment Table (CAT) Score,
which evaluates the quality of EEG-generated images based on semantic concepts.
A key discovery of this work is that the model tends to generate abstract or
generalized images, such as landscapes, rather than specific objects,
highlighting the inherent challenges of translating noisy and low-resolution
EEG data into detailed visual outputs. Additionally, we introduce the CAT Score
as a new metric tailored for EEG-to-image evaluation and establish a benchmark
on the ThingsEEG dataset. This study underscores the potential of EEG-to-image
generation while revealing the complexities and challenges that remain in
bridging neural activity with visual representation.
[LINK]
http://arxiv.org/abs/2410.00712v1
[DATE]
2024-10-01 22:05:30+08:00
[CATEGORIES]
cs.LG
Hybrid Quantum Neural Network based Indoor User Localization using Cloud Quantum Computing
[AUTHORS]
Sparsh Mittal, Yash Chand, Neel Kanth Kundu
[ABSTRACT]
This paper proposes a hybrid quantum neural network (HQNN) for indoor user
localization using received signal strength indicator (RSSI) values. We use
publicly available RSSI datasets for indoor localization using WiFi, Bluetooth,
and Zigbee to test the performance of the proposed HQNN. We also compare the
performance of the HQNN with the recently proposed quantum fingerprinting-based
user localization method. Our results show that the proposed HQNN performs
better than the quantum fingerprinting algorithm since the HQNN has trainable
parameters in the quantum circuits, whereas the quantum fingerprinting
algorithm uses a fixed quantum circuit to calculate the similarity between the
test data point and the fingerprint dataset. Unlike prior works, we also test
the performance of the HQNN and quantum fingerprint algorithm on a real IBM
quantum computer using cloud quantum computing services. Therefore, this paper
examines the performance of the HQNN on noisy intermediate-scale quantum (NISQ)
devices using real-world RSSI localization datasets. The novelty of our
approach lies in the use of simple feature maps and ansatz with fewer neurons,
alongside testing on actual quantum hardware using real-world data,
demonstrating practical applicability in real-world scenarios.
[COMMENTS]
This work has been accepted for presentation at the IEEE TENSYMP 2024
conference
[LINK]
http://arxiv.org/abs/2410.00708v1
[DATE]
2024-10-01 21:59:59+08:00
[CATEGORIES]
cs.LG
Investigating the Impact of Model Complexity in Large Language Models
[AUTHORS]
Jing Luo, Huiyuan Wang, Weiran Huang
[ABSTRACT]
Large Language Models (LLMs) based on the pre-training and fine-tuning paradigm
have become pivotal in solving natural language processing tasks, consistently
achieving state-of-the-art performance. Nevertheless, the theoretical
understanding of how model complexity influences fine-tuning performance
remains challenging and has not been well explored yet. In this paper, we focus
on autoregressive LLMs and propose to employ Hidden Markov Models (HMMs) to
model them. Based on the HMM modeling, we investigate the relationship between
model complexity and the generalization capability in downstream tasks.
Specifically, we consider a popular tuning paradigm for downstream tasks, head
tuning, where all pre-trained parameters are frozen and only individual heads
are trained atop pre-trained LLMs. Our theoretical analysis reveals that the
risk initially increases and then decreases with rising model complexity,
showcasing a “double descent” phenomenon. In this case, the initial “descent”
is degenerate, signifying that the “sweet spot” where bias and variance are
balanced occurs when the model size is zero. Reaching the conclusion presented
in this study involves several challenges, primarily revolving around
effectively modeling autoregressive LLMs and downstream tasks, as well as
conducting a comprehensive risk analysis for multivariate regression. Our
research is substantiated by experiments conducted on data generated from HMMs,
which provide empirical support for and align with our theoretical insights.
[LINK]
http://arxiv.org/abs/2410.00699v1
[DATE]
2024-10-01 21:53:44+08:00
[CATEGORIES]
cs.LG
Optimizing Photoplethysmography-Based Sleep Staging Models by Leveraging Temporal Context for Wearable Devices Applications
[AUTHORS]
Joseph A. P. Quino, Diego A. C. Cardenas, Marcelo A. F. Toledo, Felipe M. Dias, Estela Ribeiro, Jose E. Krieger, Marco A. Gutierrez
[ABSTRACT]
Accurate sleep stage classification is crucial for diagnosing sleep disorders
and evaluating sleep quality. While polysomnography (PSG) remains the gold
standard, photoplethysmography (PPG) is more practical due to its affordability
and widespread use in wearable devices. However, state-of-the-art sleep staging
methods often require prolonged continuous signal acquisition, making them
impractical for wearable devices due to high energy consumption. Shorter signal
acquisitions are more feasible but less accurate. Our work proposes an adapted
sleep staging model based on top-performing state-of-the-art methods and
evaluates its performance with different PPG segment sizes. We concatenate
30-second PPG segments over 15-minute intervals to leverage longer segment
contexts. This approach achieved an accuracy of 0.75, a Cohen’s Kappa of 0.60,
an F1-Weighted score of 0.74, and an F1-Macro score of 0.60. Although reducing
segment size decreased sensitivity for deep and REM stages, our strategy
outperformed single 30-second window methods, particularly for these stages.
[COMMENTS]
11 pages, 5 figures, 1 table
[LINK]
http://arxiv.org/abs/2410.00693v1
[DATE]
2024-10-01 21:47:42+08:00
[CATEGORIES]
cs.LG
Creative Problem Solving in Large Language and Vision Models – What Would it Take?
[AUTHORS]
Lakshmi Nair, Evana Gizzi, Jivko Sinapov
[COMMENTS]
Accepted to EMNLP 2024 Findings
[LINK]
http://arxiv.org/abs/2405.01453v3
[DATE]
2024-10-01 21:46:04+08:00
[CATEGORIES]
cs.LG
Beyond Minimax Rates in Group Distributionally Robust Optimization via a Novel Notion of Sparsity
[AUTHORS]
Quan Nguyen, Nishant A. Mehta, Cristóbal Guzmán
[ABSTRACT]
The minimax sample complexity of group distributionally robust optimization
(GDRO) has been determined up to a $\log(K)$ factor, for $K$ the number of
groups. In this work, we venture beyond the minimax perspective via a novel
notion of sparsity that we dub $(\lambda, \beta)$-sparsity. In short, this
condition means that at any parameter $\theta$, there is a set of at most
$\beta$ groups whose risks at $\theta$ all are at least $\lambda$ larger than
the risks of the other groups. To find an $\epsilon$-optimal $\theta$, we show
via a novel algorithm and analysis that the $\epsilon$-dependent term in the
sample complexity can swap a linear dependence on $K$ for a linear dependence
on the potentially much smaller $\beta$. This improvement leverages recent
progress in sleeping bandits, showing a fundamental connection between the
two-player zero-sum game optimization framework for GDRO and per-action regret
bounds in sleeping bandits. The aforementioned result assumes having a
particular $\lambda$ as input. Perhaps surprisingly, we next show an adaptive
algorithm which, up to log factors, gets sample complexity that adapts to the
best $(\lambda, \beta)$-sparsity condition that holds. Finally, for a
particular input $\lambda$, we also show how to get a dimension-free sample
complexity result.
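
The sparsity condition described above can be illustrated at a single parameter value with a short sketch. This is not the authors' algorithm: the function name `is_sparse_at` is hypothetical, and the handling of edge cases (the set must be non-empty, and it may contain every group) is an assumption of this sketch, since the abstract does not specify it.

```python
def is_sparse_at(risks, lam, beta):
    """Check the (lam, beta)-sparsity condition for one risk vector.

    `risks` holds the per-group risks at a fixed parameter theta.
    Returns True if some non-empty set S of at most `beta` groups has
    every risk in S at least `lam` above every risk outside S.
    Assumption: S may contain all groups, in which case the condition
    is vacuously true because there are no other groups.
    """
    ordered = sorted(risks, reverse=True)
    num_groups = len(ordered)
    if num_groups == 0:
        return False
    if beta >= num_groups:
        return True
    # Any valid S must consist of the k largest risks, so it suffices
    # to look for a gap of at least `lam` after the k-th largest risk.
    return any(ordered[k - 1] - ordered[k] >= lam for k in range(1, beta + 1))


example_risks = [0.9, 0.85, 0.3, 0.2, 0.1]
two_groups_suffice = is_sparse_at(example_risks, lam=0.5, beta=2)
one_group_suffices = is_sparse_at(example_risks, lam=0.5, beta=1)
```

In this example the two highest-risk groups are separated from the rest by a gap of 0.55, so the condition holds for beta = 2 but not for beta = 1. The paper requires the condition at every parameter value; the sketch checks only one.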
[COMMENTS]
38 pages
[LINK]
http://arxiv.org/abs/2410.00690v1
[DATE]
2024-10-01 21:45:55+08:00
[CATEGORIES]
cs.LG
Classifier-free graph diffusion for molecular property targeting
[AUTHORS]
Matteo Ninniri, Marco Podda, Davide Bacciu
[ABSTRACT]
This work focuses on the task of property targeting: that is, generating
molecules conditioned on target chemical properties to expedite candidate
screening for novel drug and materials development. DiGress is a recent
diffusion model for molecular graphs whose distinctive feature is allowing
property targeting through classifier-based (CB) guidance. While CB guidance
may work to generate molecular-like graphs, we hint at the fact that its
assumptions apply poorly to the chemical domain. Based on this insight we
propose a classifier-free DiGress (FreeGress), which works by directly
injecting the conditioning information into the training process. CF guidance
is convenient given its less stringent assumptions and because it does not
require training an auxiliary property regressor, thus halving the number of
trainable parameters in the model. We empirically show that our model yields up
to 79% improvement in Mean Absolute Error with respect to DiGress on property
targeting tasks on QM9 and ZINC-250k benchmarks. As an additional contribution,
we propose a simple yet powerful approach to improve chemical validity of
generated samples, based on the observation that certain chemical properties
such as molecular weight correlate with the number of atoms in molecules.
[COMMENTS]
Proceedings of ECML PKDD 2024
[LINK]
http://arxiv.org/abs/2312.17397v2
[DATE]
2024-10-01 21:45:04+08:00
[CATEGORIES]
cs.LG
Advanced Arabic Alphabet Sign Language Recognition Using Transfer Learning and Transformer Models
[AUTHORS]
Mazen Balat, Rewaa Awaad, Hend Adel, Ahmed B. Zaky, Salah A. Aly
[ABSTRACT]
This paper presents an Arabic Alphabet Sign Language recognition approach,
using deep learning methods in conjunction with transfer learning and
transformer-based models. We study the performance of the different variants on
two publicly available datasets, namely ArSL2018 and AASL. We make
full use of state-of-the-art CNN architectures like ResNet50, MobileNetV2, and
EfficientNetB7, and the latest transformer models such as Google ViT and
Microsoft Swin Transformer. These pre-trained models have been fine-tuned on
the above datasets in an attempt to capture some unique features of Arabic sign
language motions. Experimental results show that the suggested
methodology achieves high recognition accuracy, up to 99.6% and
99.43% on ArSL2018 and AASL, respectively. That is far beyond the previously
reported state-of-the-art approaches. This performance opens up even more
avenues for communication that may be more accessible to Arabic-speaking deaf
and hard-of-hearing, and thus encourages an inclusive society.
[COMMENTS]
6 pages, 8 figures
[LINK]
http://arxiv.org/abs/2410.00681v1
[DATE]
2024-10-01 21:39:26+08:00
[CATEGORIES]
cs.LG
Counterfactual Explanations for Medical Image Classification and Regression using Diffusion Autoencoder
[AUTHORS]
Matan Atad, David Schinz, Hendrik Moeller, Robert Graf, Benedikt Wiestler, Daniel Rueckert, Nassir Navab, Jan S. Kirschke, Matthias Keicher
[ABSTRACT]
Counterfactual explanations (CEs) aim to enhance the interpretability of
machine learning models by illustrating how alterations in input features would
affect the resulting predictions. Common CE approaches require an additional
model and are typically constrained to binary counterfactuals. In contrast, we
propose a novel method that operates directly on the latent space of a
generative model, specifically a Diffusion Autoencoder (DAE). This approach
offers inherent interpretability by enabling the generation of CEs and the
continuous visualization of the model’s internal representation across decision
boundaries.
Our method leverages the DAE’s ability to encode images into a semantically
rich latent space in an unsupervised manner, eliminating the need for labeled
data or separate feature extraction models. We show that these latent
representations are helpful for medical condition classification and the
ordinal regression of severity pathologies, such as vertebral compression
fractures (VCF) and diabetic retinopathy (DR). Beyond binary CEs, our method
supports the visualization of ordinal CEs using a linear model, providing
deeper insights into the model’s decision-making process and enhancing
interpretability.
Experiments across various medical imaging datasets demonstrate the method’s
advantages in interpretability and versatility. The linear manifold of the
DAE’s latent space allows for meaningful interpolation and manipulation, making
it a powerful tool for exploring medical image properties. Our code is
available at https://doi.org/10.5281/zenodo.13859266.
[COMMENTS]
Accepted for publication at the Journal of Machine Learning for
Biomedical Imaging (MELBA) https://melba-journal.org/2024:024. arXiv admin
note: text overlap with arXiv:2303.12031
[LINK]
http://arxiv.org/abs/2408.01571v2
[DATE]
2024-10-01 21:34:36+08:00
[CATEGORIES]
cs.LG
Gradient-Free Training of Recurrent Neural Networks using Random Perturbations
[AUTHORS]
Jesus Garcia Fernandez, Sander Keemink, Marcel van Gerven
[ABSTRACT]
Recurrent neural networks (RNNs) hold immense potential for computations due
to their Turing completeness and sequential processing capabilities, yet
existing methods for their training encounter efficiency challenges.
Backpropagation through time (BPTT), the prevailing method, extends the
backpropagation (BP) algorithm by unrolling the RNN over time. However, this
approach suffers from significant drawbacks, including the need to interleave
forward and backward phases and store exact gradient information. Furthermore,
BPTT has been shown to struggle to propagate gradient information for long
sequences, leading to vanishing gradients. An alternative strategy to using
gradient-based methods like BPTT involves stochastically approximating
gradients through perturbation-based methods. This learning approach is
exceptionally simple, necessitating only forward passes in the network and a
global reinforcement signal as feedback. Despite its simplicity, the random
nature of its updates typically leads to inefficient optimization, limiting its
effectiveness in training neural networks. In this study, we present a new
approach to perturbation-based learning in RNNs whose performance is
competitive with BPTT, while maintaining the inherent advantages over
gradient-based learning. To this end, we extend the recently introduced
activity-based node perturbation (ANP) method to operate in the time domain,
leading to more efficient learning and generalization. We subsequently conduct
a range of experiments to validate our approach. Our results show similar
performance, convergence time and scalability compared to BPTT, strongly
outperforming standard node and weight perturbation methods. These findings
suggest that perturbation-based learning methods offer a versatile alternative
to gradient-based methods for training RNNs which can be ideally suited for
neuromorphic computing applications.
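
To make the idea of perturbation-based learning concrete, here is a minimal sketch of plain weight perturbation, one of the standard baselines the abstract mentions. It is not the ANP method and involves no recurrent network: the quadratic loss, the target vector, and the values of `sigma` and `lr` are all illustrative assumptions. It uses only forward evaluations of the loss and a scalar feedback signal.

```python
import random


def loss(weights, target):
    """Toy quadratic loss standing in for a network's task loss."""
    return sum((w - t) ** 2 for w, t in zip(weights, target))


def perturbation_step(weights, target, rng, sigma=1e-3, lr=0.05):
    """One weight-perturbation update using two forward evaluations."""
    noise = [rng.gauss(0.0, 1.0) for _ in weights]
    base = loss(weights, target)
    perturbed = loss([w + sigma * n for w, n in zip(weights, noise)], target)
    # Scalar feedback: how much the loss changed per unit of perturbation.
    signal = (perturbed - base) / sigma
    # Stepping against signal * noise follows a stochastic gradient estimate.
    return [w - lr * signal * n for w, n in zip(weights, noise)]


def train(steps=200, seed=0):
    """Run repeated perturbation updates; return (initial, final) loss."""
    rng = random.Random(seed)
    target = [1.0, -2.0, 0.5]
    weights = [0.0, 0.0, 0.0]
    initial = loss(weights, target)
    for _ in range(steps):
        weights = perturbation_step(weights, target, rng)
    return initial, loss(weights, target)


initial_loss, final_loss = train()
```

The update never computes an explicit gradient, which is what makes this family of methods attractive for hardware where backpropagation is awkward. The price is a noisy estimate whose variance grows with the number of parameters.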
[LINK]
http://arxiv.org/abs/2405.08967v3
[DATE]
2024-10-01 21:33:09+08:00
[CATEGORIES]
cs.LG
HUMAP: Hierarchical Uniform Manifold Approximation and Projection
[AUTHORS]
Wilson E. Marcílio-Jr, Danilo M. Eler, Fernando V. Paulovich, Rafael M. Martins
[LINK]
http://arxiv.org/abs/2106.07718v4
[DATE]
2024-10-01 21:22:32+08:00
[CATEGORIES]
cs.LG
TAVRNN: Temporal Attention-enhanced Variational Graph RNN Captures Neural Dynamics and Behavior
[AUTHORS]
Moein Khajehnejad, Forough Habibollahi, Ahmad Khajehnejad, Brett J. Kagan, Adeel Razi
[ABSTRACT]
We introduce Temporal Attention-enhanced Variational Graph Recurrent Neural
Network (TAVRNN), a novel framework for analyzing the evolving dynamics of
neuronal connectivity networks in response to external stimuli and behavioral
feedback. TAVRNN captures temporal changes in network structure by modeling
sequential snapshots of neuronal activity, enabling the identification of key
connectivity patterns. Leveraging temporal attention mechanisms and variational
graph techniques, TAVRNN uncovers how connectivity shifts align with behavior
over time. We validate TAVRNN on two datasets: in vivo calcium imaging data
from freely behaving rats and novel in vitro electrophysiological data from the
DishBrain system, where biological neurons control a simulated environment
during a game of Pong. We show that TAVRNN outperforms previous baseline
models in classification, clustering tasks and computational efficiency while
accurately linking connectivity changes to performance variations. Crucially,
TAVRNN reveals that high game performance in the DishBrain system correlates
with the alignment of sensory and motor subregion channels, a relationship not
evident in earlier models. This framework represents the first application of
dynamic graph representation to electrophysiological (neuronal) data from the
DishBrain system, providing insights into the reorganization of neuronal
networks during learning. TAVRNN’s ability to differentiate between neuronal
states associated with successful and unsuccessful learning outcomes offers
significant implications for real-time monitoring and manipulation of
biological neuronal systems.
[COMMENTS]
31 pages, 6 figures, 4 supplemental figures, 4 tables, 8 supplemental
tables
[LINK]
http://arxiv.org/abs/2410.00665v1
[DATE]
2024-10-01 21:19:51+08:00
[CATEGORIES]
cs.LG
Enhancing the analysis of murine neonatal ultrasonic vocalizations: Development, evaluation, and application of different mathematical models
[AUTHORS]
Rudolf Herdt, Louisa Kinzel, Johann Georg Maaß, Marvin Walther, Henning Fröhlich, Tim Schubert, Peter Maass, Christian Patrick Schaaf
[ABSTRACT]
Rodents employ a broad spectrum of ultrasonic vocalizations (USVs) for social
communication. As these vocalizations offer valuable insights into affective
states, social interactions, and developmental stages of animals, various deep
learning approaches have aimed to automate both the quantitative (detection)
and qualitative (classification) analysis of USVs. Here, we present the first
systematic evaluation of different types of neural networks for USV
classification. We assessed various feedforward networks, including a
custom-built, fully-connected network and convolutional neural network,
different residual neural networks (ResNets), an EfficientNet, and a Vision
Transformer (ViT). Paired with a refined, entropy-based detection algorithm
(achieving recall of 94.9% and precision of 99.3%), the best architecture
(achieving 86.79% accuracy) was integrated into a fully automated pipeline
capable of analyzing extensive USV datasets with high reliability.
Additionally, users can specify an individual minimum accuracy threshold based
on their research needs. In this semi-automated setup, the pipeline selectively
classifies calls with high pseudo-probability, leaving the rest for manual
inspection. Our study focuses exclusively on neonatal USVs. As part of an
ongoing phenotyping study, our pipeline has proven to be a valuable tool for
identifying key differences in USVs produced by mice with autism-like
behaviors.
[LINK]
http://arxiv.org/abs/2405.12957v3
[DATE]
2024-10-01 21:18:54+08:00
[CATEGORIES]
cs.LG
Enhancing Fairness through Reweighting: A Path to Attain the Sufficiency Rule
[AUTHORS]
Xuan Zhao, Klaus Broelemann, Salvatore Ruggieri, Gjergji Kasneci
[ABSTRACT]
We introduce an innovative approach to enhancing the empirical risk
minimization (ERM) process in model training through a refined reweighting
scheme of the training data to enhance fairness. This scheme aims to uphold the
sufficiency rule in fairness by ensuring that optimal predictors maintain
consistency across diverse sub-groups. We employ a bilevel formulation to
address this challenge, wherein we explore sample reweighting strategies.
Unlike conventional methods that hinge on model size, our formulation bases
generalization complexity on the space of sample weights. We discretize the
weights to improve training speed. Empirical validation of our method showcases
its effectiveness and robustness, revealing a consistent improvement in the
balance between prediction performance and fairness metrics across various
experiments.
[COMMENTS]
accepted at ECAI 2024
[LINK]
http://arxiv.org/abs/2408.14126v2
[DATE]
2024-10-01 21:18:35+08:00
[CATEGORIES]
cs.LG
Stabilizing the Kumaraswamy Distribution
[AUTHORS]
Max Wasserman, Gonzalo Mateos
[ABSTRACT]
Large-scale latent variable models require expressive continuous
distributions that support efficient sampling and low-variance differentiation,
achievable through the reparameterization trick. The Kumaraswamy (KS)
distribution is both expressive and supports the reparameterization trick with
a simple closed-form inverse CDF. Yet, its adoption remains limited. We
identify and resolve numerical instabilities in the inverse CDF and log-pdf,
exposing issues in libraries like PyTorch and TensorFlow. We then introduce
simple and scalable latent variable models based on the KS, improving
exploration-exploitation trade-offs in contextual multi-armed bandits and
enhancing uncertainty quantification for link prediction with graph neural
networks. Our results support the stabilized KS distribution as a core
component in scalable variational models for bounded latent variables.
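
The kind of instability at issue can be seen in the closed-form inverse CDF, x = (1 - (1 - u)^(1/b))^(1/a). The sketch below compares a direct evaluation with one common stabilization based on `log1p` and `expm1`. This is an illustration under stated assumptions, not necessarily the authors' exact fix; the function names are hypothetical, and inputs are assumed to satisfy 0 < u < 1 with a, b > 0.

```python
import math


def ks_cdf(x, a, b):
    """Kumaraswamy CDF, F(x) = 1 - (1 - x**a)**b, for 0 < x < 1."""
    return -math.expm1(b * math.log1p(-(x ** a)))


def ks_inverse_cdf_naive(u, a, b):
    """Direct evaluation; loses all precision when u is tiny."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)


def ks_inverse_cdf_stable(u, a, b):
    """Same formula evaluated in log space, assuming 0 < u < 1."""
    log_one_minus_u = math.log1p(-u)           # log(1 - u) without cancellation
    inner = -math.expm1(log_one_minus_u / b)   # 1 - (1 - u)**(1/b)
    return math.exp(math.log(inner) / a)


tiny_u = 1e-20
naive_sample = ks_inverse_cdf_naive(tiny_u, a=2.0, b=3.0)
stable_sample = ks_inverse_cdf_stable(tiny_u, a=2.0, b=3.0)
round_trip = ks_cdf(ks_inverse_cdf_stable(0.3, a=2.0, b=5.0), a=2.0, b=5.0)
```

For very small u, the naive version rounds 1 - u to exactly 1 and collapses the sample to 0, while the log-space version returns a small positive value. Samples at exactly 0 are a problem for reparameterized gradients and for log-pdf evaluation.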
[LINK]
http://arxiv.org/abs/2410.00660v1
[DATE]
2024-10-01 21:15:43+08:00
[CATEGORIES]
cs.LG
BMFT: Achieving Fairness via Bias-based Weight Masking Fine-tuning
[AUTHORS]
Yuyang Xue, Junyu Yan, Raman Dutt, Fasih Haider, Jingshuai Liu, Steven McDonagh, Sotirios A. Tsaftaris
[ABSTRACT]
Developing models with robust group fairness properties is paramount,
particularly in ethically sensitive domains such as medical diagnosis. Recent
approaches to achieving fairness in machine learning require a substantial
amount of training data and depend on model retraining, which may not be
practical in real-world scenarios. To mitigate these challenges, we propose
Bias-based Weight Masking Fine-Tuning (BMFT), a novel post-processing method
that enhances the fairness of a trained model in significantly fewer epochs
without requiring access to the original training data. BMFT produces a mask
over model parameters, which efficiently identifies the weights contributing
the most towards biased predictions. Furthermore, we propose a two-step
debiasing strategy, wherein the feature extractor undergoes initial fine-tuning
on the identified bias-influenced weights, succeeded by a fine-tuning phase on
a reinitialised classification layer to uphold discriminative performance.
Extensive experiments across four dermatological datasets and two sensitive
attributes demonstrate that BMFT outperforms existing state-of-the-art (SOTA)
techniques in both diagnostic accuracy and fairness metrics. Our findings
underscore the efficacy and robustness of BMFT in advancing fairness across
various out-of-distribution (OOD) settings. Our code is available at:
https://github.com/vios-s/BMFT
[COMMENTS]
Accepted by MICCAI 2024 FAIMI Workshop Oral
[LINK]
http://arxiv.org/abs/2408.06890v2
[DATE]
2024-10-01 21:10:40+08:00
[CATEGORIES]
cs.LG
LASMP: Language Aided Subset Sampling Based Motion Planner
[AUTHORS]
Saswati Bhattacharjee, Anirban Sinha, Chinwe Ekenna
[ABSTRACT]
This paper presents the Language Aided Subset Sampling Based Motion Planner
(LASMP), a system that helps mobile robots plan their movements by using
natural language instructions. LASMP uses a modified version of the Rapidly
Exploring Random Tree (RRT) method, which is guided by user-provided commands
processed through a language model (RoBERTa). The system improves efficiency by
focusing on specific areas of the robot’s workspace based on these
instructions, making it faster and less resource-intensive. Compared to
traditional RRT methods, LASMP reduces the number of nodes needed by 55% and
cuts random sample queries by 80%, while still generating safe, collision-free
paths. Tested in both simulated and real-world environments, LASMP has shown
better performance in handling complex indoor scenarios. The results highlight
the potential of combining language processing with motion planning to make
robot navigation more efficient.
[COMMENTS]
8 pages, 9 figures
[LINK]
http://arxiv.org/abs/2410.00649v1
[DATE]
2024-10-01 21:03:15+08:00
[CATEGORIES]
cs.LG
ICL-TSVD: Bridging Theory and Practice in Continual Learning with Pre-trained Models
[AUTHORS]
Liangzu Peng, Juan Elenter, Joshua Agterberg, Alejandro Ribeiro, René Vidal
[ABSTRACT]
The goal of continual learning (CL) is to train a model that can solve
multiple tasks presented sequentially. Recent CL approaches have achieved
strong performance by leveraging large pre-trained models that generalize well
to downstream tasks. However, such methods lack theoretical guarantees, making
them prone to unexpected failures. Conversely, principled CL approaches often
fail to achieve competitive performance. In this work, we bridge this gap
between theory and practice by integrating an empirically strong approach
(RanPAC) into a principled framework, Ideal Continual Learner (ICL), designed
to prevent forgetting. Specifically, we lift pre-trained features into a higher
dimensional space and formulate an over-parametrized minimum-norm least-squares
problem. We find that the lifted features are highly ill-conditioned,
potentially leading to large training errors (numerical instability) and
increased generalization errors (double descent). We address these challenges
by continually truncating the singular value decomposition (SVD) of the lifted
features. Our approach, termed ICL-TSVD, is stable with respect to the choice
of hyperparameters, can handle hundreds of tasks, and outperforms
state-of-the-art CL methods on multiple datasets. Importantly, our method
satisfies a recurrence relation throughout its continual learning process,
which allows us to prove it maintains small training and generalization errors
by appropriately truncating a fraction of SVD factors. This results in a stable
continual learning method with strong empirical performance and theoretical
guarantees.
[COMMENTS]
45 pages, 19 figures, 14 tables (Preprint, Oct 1, 2024)
[LINK]
http://arxiv.org/abs/2410.00645v1
[DATE]
2024-10-01 20:58:37+08:00
[CATEGORIES]
cs.LG
SDC-HSDD-NDSA: Structure Detecting Cluster by Hierarchical Secondary Directed Differential with Normalized Density and Self-Adaption
[AUTHORS]
Hao Shu
[ABSTRACT]
Density-based clustering could be the most popular clustering algorithm since
it can identify clusters of arbitrary shape as long as they are separated by
low-density regions. However, a high-density region that is not separated by
low-density ones might also have different structures belonging to multiple
clusters. As far as we know, all previous density-based clustering algorithms
fail to detect such structures. In this paper, we provide a novel density-based
clustering scheme that can not only detect clusters separated by low-density
regions but also detect structures in high-density regions not separated by
low-density ones. The algorithm employs secondary directed differential,
hierarchy, normalized density, as well as the self-adaption coefficient, and
thus is called Structure Detecting Cluster by Hierarchical Secondary Directed
Differential with Normalized Density and Self-Adaption, dubbed
SDC-HSDD-NDSA. The algorithm is run on several datasets to verify its
effectiveness, robustness, and granularity independence, and the results
demonstrate that it has capabilities that previous algorithms lack. The Python
code is available at https://github.com/Hao-B-Shu/SDC-HSDD-NDSA.
[COMMENTS]
35 pages
[LINK]
http://arxiv.org/abs/2307.00677v3
[DATE]
2024-10-01 20:45:01+08:00
[CATEGORIES]
cs.LG
On The Planning Abilities of OpenAI’s o1 Models: Feasibility, Optimality, and Generalizability
[AUTHORS]
Kevin Wang, Junbo Li, Neel P. Bhatt, Yihan Xi, Qiang Liu, Ufuk Topcu, Zhangyang Wang
[ABSTRACT]
Recent advancements in Large Language Models (LLMs) have showcased their
ability to perform complex reasoning tasks, but their effectiveness in planning
remains underexplored. In this study, we evaluate the planning capabilities of
OpenAI’s o1 models across a variety of benchmark tasks, focusing on three key
aspects: feasibility, optimality, and generalizability. Through empirical
evaluations on constraint-heavy tasks (e.g., $\textit{Barman}$,
$\textit{Tyreworld}$) and spatially complex environments (e.g.,
$\textit{Termes}$, $\textit{Floortile}$), we highlight o1-preview’s strengths
in self-evaluation and constraint-following, while also identifying bottlenecks
in decision-making and memory management, particularly in tasks requiring
robust spatial reasoning. Our results reveal that o1-preview outperforms GPT-4
in adhering to task constraints and managing state transitions in structured
environments. However, the model often generates suboptimal solutions with
redundant actions and struggles to generalize effectively in spatially complex
tasks. This pilot study provides foundational insights into the planning
limitations of LLMs, offering key directions for future research on improving
memory management, decision-making, and generalization in LLM-based planning.
[COMMENTS]
Updated link to code repository
[LINK]
http://arxiv.org/abs/2409.19924v2
[DATE]
2024-10-01 20:43:09+08:00
[CATEGORIES]
cs.LG
Model-independent variable selection via the rule-based variable priority
[AUTHORS]
Min Lu, Hemant Ishwaran
[ABSTRACT]
While achieving high prediction accuracy is a fundamental goal in machine
learning, an equally important task is finding a small number of features with
high explanatory power. One popular selection technique is permutation
importance, which assesses a variable’s impact by measuring the change in
prediction error after permuting the variable. However, this can be problematic
due to the need to create artificial data, a problem shared by other methods as
well. Another problem is that variable selection methods can be limited by
being model-specific. We introduce a new model-independent approach, Variable
Priority (VarPro), which works by utilizing rules without the need to generate
artificial data or evaluate prediction error. The method is relatively easy to
use, requiring only the calculation of sample averages of simple statistics,
and can be applied to many data settings, including regression, classification,
and survival. We investigate the asymptotic properties of VarPro and show,
among other things, that VarPro has a consistent filtering property for noise
variables. Empirical studies using synthetic and real-world data show the
method achieves a balanced performance and compares favorably to many
state-of-the-art procedures currently used for variable selection.
[LINK]
http://arxiv.org/abs/2409.09003v3
[DATE]
2024-10-01 20:42:24+08:00
[CATEGORIES]
cs.LG
Statistical signatures of abstraction in deep neural networks
[AUTHORS]
Carlo Orientale Caputo, Matteo Marsili
[ABSTRACT]
We study how abstract representations emerge in a Deep Belief Network (DBN)
trained on benchmark datasets. Our analysis targets the principles of learning
in the early stages of information processing, starting from the “primordial
soup” of the under-sampling regime. As the data is processed by deeper and
deeper layers, features are detected and removed, transferring more and more
“context-invariant” information to deeper layers. We show that the
representation approaches a universal model – the Hierarchical Feature Model
(HFM) – determined by the principle of maximal relevance. Relevance quantifies
the uncertainty on the model of the data, thus suggesting that “meaning” –
i.e. syntactic information – is that part of the data which is not yet
captured by a model. Our analysis shows that shallow layers are well described
by pairwise Ising models, which provide a representation of the data in terms
of generic, low order features. We also show that plasticity increases with
depth, in a similar way as it does in the brain. These findings suggest that
DBNs are capable of extracting a hierarchy of features from the data which is
consistent with the principle of maximal relevance.
[COMMENTS]
The estimate of the Kullback-Leibler distance used in the paper is
affected by strong sampling errors. Additional statistical analysis is needed
[LINK]
http://arxiv.org/abs/2407.01656v2
[DATE]
2024-10-01 20:39:15+08:00
[CATEGORIES]
cs.LG
Measuring Orthogonality in Representations of Generative Models
[AUTHORS]
Robin C. Geyer, Alessandro Torcinovich, João B. Carvalho, Alexander Meyer, Joachim M. Buhmann
[ABSTRACT]
In unsupervised representation learning, models aim to distill essential
features from high-dimensional data into lower-dimensional learned
representations, guided by inductive biases. Understanding the characteristics
that make a good representation remains a topic of ongoing research.
Disentanglement of independent generative processes has long been credited with
producing high-quality representations. However, focusing solely on
representations that adhere to the stringent requirements of most
disentanglement metrics may result in overlooking many high-quality
representations that are well suited for various downstream tasks. These metrics often
demand that generative factors be encoded in distinct, single dimensions
aligned with the canonical basis of the representation space.
Motivated by these observations, we propose two novel metrics:
Importance-Weighted Orthogonality (IWO) and Importance-Weighted Rank (IWR).
These metrics evaluate the mutual orthogonality and rank of generative factor
subspaces. Across extensive experiments on common downstream tasks, over
several benchmark datasets and models, IWO and IWR consistently show stronger
correlations with downstream task performance than traditional disentanglement
metrics. Our findings suggest that representation quality is more closely related to
the orthogonality of independent generative processes rather than their
disentanglement, offering a new direction for evaluating and improving
unsupervised learning models.
[LINK]
http://arxiv.org/abs/2407.03728v2
[DATE]
2024-10-01 20:26:24+08:00
[CATEGORIES]
cs.LG
Differentiable Interacting Multiple Model Particle Filtering
[AUTHORS]
John-Joseph Brady, Yuhui Luo, Wenwu Wang, Víctor Elvira, Yunpeng Li
[ABSTRACT]
We propose a sequential Monte Carlo algorithm for parameter learning when the
studied model exhibits random discontinuous jumps in behaviour. To facilitate
the learning of high-dimensional parameter sets, such as those associated with
neural networks, we adopt the emerging framework of differentiable particle
filtering, wherein parameters are trained by gradient descent. We design a new
differentiable interacting multiple model particle filter to be capable of
learning the individual behavioural regimes and the model which controls the
jumping simultaneously. In contrast to previous approaches, our algorithm
allows control of the computational effort assigned per regime whilst using the
probability of being in a given regime to guide sampling. Furthermore, we
develop a new gradient estimator that has a lower variance than established
approaches and remains fast to compute, for which we prove consistency. We
establish new theoretical results of the presented algorithms and demonstrate
superior numerical performance compared to the previous state-of-the-art
algorithms.
[LINK]
http://arxiv.org/abs/2410.00620v1
[DATE]
2024-10-01 20:05:18+08:00
[CATEGORIES]
cs.LG
Radio Foundation Models: Pre-training Transformers for 5G-based Indoor Localization
[AUTHORS]
Jonathan Ott, Jonas Pirkl, Maximilian Stahlke, Tobias Feigl, Christopher Mutschler
[ABSTRACT]
Artificial Intelligence (AI)-based radio fingerprinting (FP) outperforms
classic localization methods in propagation environments with strong multipath
effects. However, the model and data orchestration of FP are time-consuming and
costly, as it requires many reference positions and extensive measurement
campaigns for each environment. Instead, modern unsupervised and
self-supervised learning schemes require less reference data for localization,
but either their accuracy is low or they require additional sensor information,
rendering them impractical. In this paper we propose a self-supervised learning
framework that pre-trains a general transformer (TF) neural network on 5G
channel measurements that we collect on-the-fly without expensive equipment.
Our novel pretext task randomly masks and drops input information to learn to
reconstruct it. So, it implicitly learns the spatiotemporal patterns and
information of the propagation environment that enable FP-based localization.
Most interestingly, when we optimize this pre-trained model for localization in
a given environment, it achieves the accuracy of state-of-the-art methods but
requires ten times less reference data and significantly reduces the time from
training to operation.
[LINK]
http://arxiv.org/abs/2410.00617v1
[DATE]
2024-10-01 20:03:32+08:00
[CATEGORIES]
cs.LG
Is Tokenization Needed for Masked Particle Modelling?
[AUTHORS]
Matthew Leigh, Samuel Klein, François Charton, Tobias Golling, Lukas Heinrich, Michael Kagan, Inês Ochoa, Margarita Osadchy
[ABSTRACT]
In this work, we significantly enhance masked particle modeling (MPM), a
self-supervised learning scheme for constructing highly expressive
representations of unordered sets relevant to developing foundation models for
high-energy physics. In MPM, a model is trained to recover the missing elements
of a set, a learning objective that requires no labels and can be applied
directly to experimental data. We achieve significant performance improvements
over previous work on MPM by addressing inefficiencies in the implementation
and incorporating a more powerful decoder. We compare several pre-training
tasks and introduce new reconstruction methods that utilize conditional
generative models without data tokenization or discretization. We show that
these new methods outperform the tokenized learning objective from the original
MPM on a new test bed for foundation models for jets, which includes using a
wide variety of downstream tasks relevant to jet physics, such as
classification, secondary vertex finding, and track identification.
[LINK]
http://arxiv.org/abs/2409.12589v2
[DATE]
2024-10-01 19:40:11+08:00
[CATEGORIES]
cs.LG
Towards Symbolic XAI – Explanation Through Human Understandable Logical Relationships Between Features
[AUTHORS]
Thomas Schnake, Farnoush Rezaei Jafari, Jonas Lederer, Ping Xiong, Shinichi Nakajima, Stefan Gugler, Grégoire Montavon, Klaus-Robert Müller
[ABSTRACT]
Explainable Artificial Intelligence (XAI) plays a crucial role in fostering
transparency and trust in AI systems, where traditional XAI approaches
typically offer one level of abstraction for explanations, often in the form of
heatmaps highlighting single or multiple input features. However, we ask
whether abstract reasoning or problem-solving strategies of a model may also be
relevant, as these align more closely with how humans approach solutions to
problems. We propose a framework, called Symbolic XAI, that attributes
relevance to symbolic queries expressing logical relationships between input
features, thereby capturing the abstract reasoning behind a model’s
predictions. The methodology is built upon a simple yet general multi-order
decomposition of model predictions. This decomposition can be specified using
higher-order propagation-based relevance methods, such as GNN-LRP, or
perturbation-based explanation methods commonly used in XAI. The effectiveness
of our framework is demonstrated in the domains of natural language processing
(NLP), vision, and quantum chemistry (QC), where abstract symbolic domain
knowledge is abundant and of significant interest to users. The Symbolic XAI
framework provides an understanding of the model’s decision-making process that
is both flexible for customization by the user and human-readable through
logical formulas.
[LINK]
http://arxiv.org/abs/2408.17198v2
[DATE]
2024-10-01 19:35:49+08:00
[CATEGORIES]
cs.LG
FLeNS: Federated Learning with Enhanced Nesterov-Newton Sketch
[AUTHORS]
Sunny Gupta, Mohit Jindal, Pankhi Kashyap, Pranav Jeevan, Amit Sethi
[ABSTRACT]
Federated learning faces a critical challenge in balancing communication
efficiency with rapid convergence, especially for second-order methods. While
Newton-type algorithms achieve linear convergence in communication rounds,
transmitting full Hessian matrices is often impractical due to quadratic
complexity. We introduce Federated Learning with Enhanced Nesterov-Newton
Sketch (FLeNS), a novel method that harnesses both the acceleration
capabilities of Nesterov’s method and the dimensionality reduction benefits of
Hessian sketching. FLeNS approximates the centralized Newton’s method without
relying on the exact Hessian, significantly reducing communication overhead. By
combining Nesterov’s acceleration with adaptive Hessian sketching, FLeNS
preserves crucial second-order information while retaining the rapid
convergence characteristics. Our theoretical analysis, grounded in statistical
learning, demonstrates that FLeNS achieves super-linear convergence rates in
communication rounds - a notable advancement in federated optimization. We
provide rigorous convergence guarantees and characterize tradeoffs between
acceleration, sketch size, and convergence speed. Extensive empirical
evaluation validates our theoretical findings, showcasing FLeNS’s
state-of-the-art performance with reduced communication requirements,
particularly in privacy-sensitive and edge-computing scenarios. The code is
available at https://github.com/sunnyinAI/FLeNS
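
As a rough illustration of the Hessian-sketching idea mentioned in this abstract, the following minimal sketch takes one Newton step on a least-squares objective using a sketched Hessian. It is not the FLeNS algorithm: Nesterov acceleration, federated aggregation, and adaptive sketch selection are all omitted, and the function name `sketched_newton_step`, the objective, and the sketch matrix `S` are assumptions made for the example.

```python
import numpy as np

def sketched_newton_step(A, b, w, S):
    """One Newton step on f(w) = 0.5 * ||A w - b||^2 with a sketched Hessian.

    A: (n, d) data matrix, b: (n,) targets, w: (d,) current iterate,
    S: (m, n) sketch matrix (hypothetical; e.g. Gaussian scaled by 1/sqrt(m)).
    """
    grad = A.T @ (A @ w - b)      # exact gradient of the objective
    SA = S @ A                    # sketched data matrix, shape (m, d)
    H_sketch = SA.T @ SA          # approximates the exact Hessian A^T A
    return w - np.linalg.solve(H_sketch, grad)

# Example data for illustration only.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
b = A @ w_true

# With the identity as the "sketch", the step is the exact Newton step,
# which solves a noiseless least-squares problem in one iteration.
w_exact = sketched_newton_step(A, b, np.zeros(3), np.eye(200))

# A Gaussian sketch with m = 50 rows only needs the (50, 3) product S @ A
# to form the approximate Hessian.
S = rng.normal(size=(50, 200)) / np.sqrt(50)
w_sketch = sketched_newton_step(A, b, np.zeros(3), S)
```

With a random sketch the step is only approximate, so in practice it would be iterated; the trade-off between sketch size and convergence speed is what the abstract says the paper characterizes.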
[COMMENTS]
10 pages, 3 figures, 2 Tables
[LINK]
http://arxiv.org/abs/2409.15216v2
[DATE]
2024-10-01 19:20:53+08:00
[CATEGORIES]
cs.LG
CompassDock: Comprehensive Accurate Assessment Approach for Deep Learning-Based Molecular Docking in Inference and Fine-Tuning
[AUTHORS]
Ahmet Sarigun, Vedran Franke, Bora Uyar, Altuna Akalin
[ABSTRACT]
Datasets used for molecular docking, such as PDBBind, contain technical
variability - they are noisy. Although the origins of the noise have been
discussed, a comprehensive analysis of the physical, chemical, and bioactivity
characteristics of the datasets is still lacking. To address this gap, we
introduce the Comprehensive Accurate Assessment (Compass). Compass integrates
two key components: PoseCheck, which examines ligand strain energy,
protein-ligand steric clashes, and interactions, and AA-Score, a new empirical
scoring function for calculating binding affinity energy. Together, these form
a unified workflow that assesses both the physical/chemical properties and
bioactivity favorability of ligands and protein-ligand interactions. Our
analysis of the PDBBind dataset using Compass reveals substantial noise in the
ground truth data. Additionally, we propose CompassDock, which incorporates the
Compass module with DiffDock, the state-of-the-art deep learning-based
molecular docking method, to enable accurate assessment of docked ligands
during inference. Finally, we present a new paradigm for enhancing molecular
docking model performance by fine-tuning with Compass Scores, which encompass
binding affinity energy, strain energy, and the number of steric clashes
identified by Compass. Our results show that, while fine-tuning without Compass
improves the percentage of docked poses with RMSD < 2 Å, it leads to a
decrease in physical/chemical and bioactivity favorability. In contrast,
fine-tuning with Compass shows a limited improvement in RMSD < 2 Å but
enhances the physical/chemical and bioactivity favorability of the ligand
conformation. The source code is available publicly at
https://github.com/BIMSBbioinfo/CompassDock.
[LINK]
http://arxiv.org/abs/2406.06841v2
[DATE]
2024-10-01 19:14:40+08:00
[CATEGORIES]
cs.LG
Enhancing Image Classification in Small and Unbalanced Datasets through Synthetic Data Augmentation
[AUTHORS]
Neil De La Fuente, Mireia Majó, Irina Luzko, Henry Córdova, Gloria Fernández-Esparrach, Jorge Bernal
[ABSTRACT]
Accurate and robust medical image classification is a challenging task,
especially in application domains where available annotated datasets are small
and present high imbalance between target classes. Considering that data
acquisition is not always feasible, especially for underrepresented classes,
our approach introduces a novel synthetic augmentation strategy using
class-specific Variational Autoencoders (VAEs) and latent space interpolation
to improve discrimination capabilities.
By generating realistic, varied synthetic data that fills feature space gaps,
we address issues of data scarcity and class imbalance. The method presented in
this paper relies on the interpolation of latent representations within each
class, thus enriching the training set and improving the model’s
generalizability and diagnostic accuracy. The proposed strategy was tested in a
small dataset of 321 images created to train and validate an automatic method
for assessing the quality of cleanliness of esophagogastroduodenoscopy images.
By combining real and synthetic data, an increase of over 18% in the accuracy
of the most challenging underrepresented class was observed. The proposed
strategy not only benefited the underrepresented class but also led to a
general improvement in other metrics, including a 6% increase in global
accuracy and precision.
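
The within-class latent interpolation described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class-specific VAE encoder and decoder are omitted, the function name `interpolate_latents` is hypothetical, and simple linear interpolation between two latent codes is assumed.

```python
import numpy as np

def interpolate_latents(z_a, z_b, num_steps=5):
    """Linearly interpolate between two latent vectors of the same class.

    Returns an array of shape (num_steps, latent_dim) containing only
    interior points, so the two real samples are not duplicated.
    """
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    # Interior mixing weights strictly between 0 and 1.
    alphas = np.linspace(0.0, 1.0, num_steps + 2)[1:-1]
    return np.stack([(1.0 - a) * z_a + a * z_b for a in alphas])

# Two hypothetical 2-D latent codes from the same class.
z_new = interpolate_latents([0.0, 0.0], [4.0, 8.0], num_steps=3)
```

Each row of `z_new` would then be passed through that class's decoder to produce a synthetic image for the training set.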
[COMMENTS]
MICCAI 2024 (CLIP Workshop)
[LINK]
http://arxiv.org/abs/2409.10286v2
[DATE]
2024-10-01 19:08:24+08:00
[CATEGORIES]
cs.LG
Scalable Data Assimilation with Message Passing
[AUTHORS]
Oscar Key, So Takao, Daniel Giles, Marc Peter Deisenroth
[ABSTRACT]
Data assimilation is a core component of numerical weather prediction
systems. The large quantity of data processed during assimilation requires the
computation to be distributed across increasingly many compute nodes, yet
existing approaches suffer from synchronisation overhead in this setting. In
this paper, we exploit the formulation of data assimilation as a Bayesian
inference problem and apply a message-passing algorithm to solve the spatial
inference problem. Since message passing is inherently based on local
computations, this approach lends itself to parallel and distributed
computation. In combination with a GPU-accelerated implementation, we can scale
the algorithm to very large grid sizes while retaining good accuracy and
reasonable compute and memory requirements.
[LINK]
http://arxiv.org/abs/2404.12968v2
[DATE]
2024-10-01 19:01:37+08:00
[CATEGORIES]
cs.LG
Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining
[AUTHORS]
Jie Cheng, Ruixi Qiao, Gang Xiong, Qinghai Miao, Yingwei Ma, Binhua Li, Yongbin Li, Yisheng Lv
[ABSTRACT]
A significant aspiration of offline reinforcement learning (RL) is to develop
a generalist agent with high capabilities from large and heterogeneous
datasets. However, prior approaches that scale offline RL either rely heavily
on expert trajectories or struggle to generalize to diverse unseen tasks.
Inspired by the excellent generalization of world models in conditional video
generation, we explore the potential of image observation-based world models for
scaling offline RL and enhancing generalization on novel tasks. In this paper,
we introduce JOWA: Jointly-Optimized World-Action model, an offline model-based
RL agent pretrained on multiple Atari games to learn general-purpose
representation and decision-making ability. Our method jointly optimizes a
world-action model through a shared transformer backbone, which stabilizes
temporal difference learning with large models during pretraining. Moreover, we
propose a provably efficient and parallelizable planning algorithm to
compensate for the Q-value estimation error and thus search out better
policies. Experimental results indicate that our largest agent, with 150
million parameters, achieves 78.9% human-level performance on pretrained games
using only 10% subsampled offline data, outperforming existing state-of-the-art
large-scale offline RL baselines by 31.6% on average. Furthermore, JOWA scales
favorably with model capacity and can sample-efficiently transfer to novel
games using only 5k offline fine-tuning data corresponding to about 4
trajectories per game, which demonstrates superior generalization of JOWA. We
will release the code at https://github.com/CJReinforce/JOWA.
[LINK]
http://arxiv.org/abs/2410.00564v1
[DATE]
2024-10-01 18:25:03+08:00
[CATEGORIES]
cs.LG
Best Practices for Multi-Fidelity Bayesian Optimization in Materials and Molecular Research
[AUTHORS]
Víctor Sabanza-Gil, Riccardo Barbano, Daniel Pacheco Gutiérrez, Jeremy S. Luterbacher, José Miguel Hernández-Lobato, Philippe Schwaller, Loïc Roch
[ABSTRACT]
Multi-fidelity Bayesian Optimization (MFBO) is a promising framework to speed
up materials and molecular discovery as sources of information of different
accuracies are at hand at increasing cost. Despite its potential use in
chemical tasks, there is a lack of systematic evaluation of the many parameters
playing a role in MFBO. In this work, we provide guidelines and recommendations
to decide when to use MFBO in experimental settings. We investigate MFBO
methods applied to molecules and materials problems. First, we test two
different families of acquisition functions in two synthetic problems and study
the effect of the informativeness and cost of the approximate function. We use
our implementation and guidelines to benchmark three real discovery problems
and compare them against their single-fidelity counterparts. Our results may
help guide future efforts to implement MFBO as a routine tool in the chemical
sciences.
[LINK]
http://arxiv.org/abs/2410.00544v1
[DATE]
2024-10-01 17:37:36+08:00
[CATEGORIES]
cs.LG
Differentially Private Active Learning: Balancing Effective Data Selection and Privacy
[AUTHORS]
Kristian Schwethelm, Johannes Kaiser, Jonas Kuntzer, Mehmet Yigitsoy, Daniel Rueckert, Georgios Kaissis
[ABSTRACT]
Active learning (AL) is a widely used technique for optimizing data labeling
in machine learning by iteratively selecting, labeling, and training on the
most informative data. However, its integration with formal privacy-preserving
methods, particularly differential privacy (DP), remains largely underexplored.
While some works have explored differentially private AL for specialized
scenarios like online learning, the fundamental challenge of combining AL with
DP in standard learning settings has remained unaddressed, severely limiting
AL’s applicability in privacy-sensitive domains. This work addresses this gap
by introducing differentially private active learning (DP-AL) for standard
learning settings. We demonstrate that naively integrating DP-SGD training into
AL presents substantial challenges in privacy budget allocation and data
utilization. To overcome these challenges, we propose step amplification, which
leverages individual sampling probabilities in batch creation to maximize data
point participation in training steps, thus optimizing data utilization.
Additionally, we investigate the effectiveness of various acquisition functions
for data selection under privacy constraints, revealing that many commonly used
functions become impractical. Our experiments on vision and natural language
processing tasks show that DP-AL can improve performance for specific datasets
and model architectures. However, our findings also highlight the limitations
of AL in privacy-constrained environments, emphasizing the trade-offs between
privacy, model accuracy, and data selection accuracy.
[LINK]
http://arxiv.org/abs/2410.00542v1
[DATE]
2024-10-01 17:34:06+08:00
[CATEGORIES]
cs.LG
Arges: Spatio-Temporal Transformer for Ulcerative Colitis Severity Assessment in Endoscopy Videos
[AUTHORS]
Krishna Chaitanya, Pablo F. Damasceno, Shreyas Fadnavis, Pooya Mobadersany, Chaitanya Parmar, Emily Scherer, Natalia Zemlianskaia, Lindsey Surace, Louis R. Ghanem, Oana Gabriela Cula, Tommaso Mansi, Kristopher Standish
[ABSTRACT]
Accurate assessment of disease severity from endoscopy videos in ulcerative
colitis (UC) is crucial for evaluating drug efficacy in clinical trials.
Severity is often measured by the Mayo Endoscopic Subscore (MES) and Ulcerative
Colitis Endoscopic Index of Severity (UCEIS) score. However, expert MES/UCEIS
annotation is time-consuming and susceptible to inter-rater variability,
factors addressable by automation. Automation attempts with frame-level labels
face challenges in fully-supervised solutions due to the prevalence of
video-level labels in clinical trials. CNN-based weakly-supervised models (WSL)
with end-to-end (e2e) training lack generalization to new disease scores and
ignore spatio-temporal information crucial for accurate scoring. To address
these limitations, we propose “Arges”, a deep learning framework that utilizes
a transformer with positional encoding to incorporate spatio-temporal
information from frame features to estimate disease severity scores in
endoscopy video. Extracted features are derived from a foundation model
(ArgesFM), pre-trained on a large diverse dataset from multiple clinical trials
(61M frames, 3927 videos). We evaluate four UC disease severity scores,
including MES and three UCEIS component scores. Test set evaluation indicates
significant improvements, with F1 scores increasing by 4.1% for MES and 18.8%,
6.6%, 3.8% for the three UCEIS component scores compared to state-of-the-art
methods. Prospective validation on previously unseen clinical trial data
further demonstrates the model’s successful generalization.
[COMMENTS]
12 pages, 2 figures, 5 tables, accepted at MLMI, MICCAI
[LINK]
http://arxiv.org/abs/2410.00536v1
[DATE]
2024-10-01 17:23:14+08:00
[CATEGORIES]
cs.LG
Deep Model Interpretation with Limited Data : A Coreset-based Approach
[AUTHORS]
Hamed Behzadi-Khormouji, José Oramas
[ABSTRACT]
Model Interpretation aims at the extraction of insights from the internals of
a trained model. A common approach to address this task is the characterization
of relevant features internally encoded in the model that are critical for its
proper operation. Despite recent progress of these methods, they come with the
weakness of being computationally expensive due to the dense evaluation of
datasets that they require. As a consequence, research on the design of these
methods has focused on smaller data subsets, which may lead to reduced insights.
To address these computational costs, we propose a coreset-based interpretation
framework that utilizes coreset selection methods to sample a representative
subset of the large dataset for the interpretation task. Towards this goal, we
propose a similarity-based evaluation protocol to assess the robustness of
model interpretation methods towards the amount of data they take as input.
Experiments considering several interpretation methods, DNN models, and coreset
selection methods show the effectiveness of the proposed framework.
[LINK]
http://arxiv.org/abs/2410.00524v1
[DATE]
2024-10-01 17:07:24+08:00
[CATEGORIES]
cs.LG
The Uniqueness of LLaMA3-70B Series with Per-Channel Quantization
[AUTHORS]
Minghai Qin
[ABSTRACT]
We have observed a distinctive quantization-related behavior in the
LLaMA3/3.1-70B models that is absent in both the LLaMA2-70B and
LLaMA3/3.1/3.2-1B/3B/8B/405B models. Quantization is a crucial technique for
deploying large language models (LLMs) efficiently. The impact of W8A8
post-training quantization on model accuracy, especially on the recently
released LLaMA3/3.1 model series, remains contentious. In this paper, we
explore three key questions: What makes the LLaMA3-70B model series uniquely
vulnerable to quantization? Why is this the case? And how can the issue be
addressed? We empirically investigate multiple LLMs featured on an open LLM
leaderboard, discovering that the LLaMA3-70B model series has a unique
accuracy degradation behavior with W8A8 per-channel post-training quantization.
In contrast, other model series such as LLaMA2, LLaMA3/3.1-8B, LLaMA3.2, Qwen,
Mixtral, Mistral, Phi-3, and Falcon demonstrate robust performance with W8A8.
Contrary to previous assertions attributing degradation to the large dynamic
range of activations, our findings indicate that the weight distribution