Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning
[AUTHORS]
Tianduo Wang, Shichen Li, Wei Lu
[ABSTRACT]
Effective training of language models (LMs) for mathematical reasoning tasks
demands high-quality supervised fine-tuning data. Besides obtaining annotations
from human experts, a common alternative is sampling from larger and more
powerful LMs. However, this knowledge distillation approach can be costly and
unstable, particularly when relying on closed-source, proprietary LMs like
GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate
that the reasoning abilities of small-scale LMs can be enhanced through
self-training, a process where models learn from their own outputs. We also
show that conventional self-training can be further augmented by a
preference learning algorithm called Direct Preference Optimization (DPO). By
integrating DPO into self-training, we leverage preference data to guide LMs
towards more accurate and diverse chain-of-thought reasoning. We evaluate our
method across various mathematical reasoning tasks using different base models.
Our experiments show that this approach not only improves LMs’ reasoning
performance but also offers a more cost-effective and scalable solution
compared to relying on large proprietary LMs.
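
As a rough illustration of the preference-learning step mentioned above, the standard DPO objective for a single preference pair can be sketched as follows. This is a minimal sketch, not the authors' training code: the function name `dpo_loss`, the default `beta=0.1`, and the use of pre-computed sequence log-probabilities are assumptions made for illustration.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    All four inputs are sequence log-probabilities: the policy (pi_*) and
    the frozen reference model (ref_*) scored on the chosen and rejected
    responses. beta is a hypothetical default, not a value from the paper.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the policy's log-probability of the chosen response (or lowering that of the rejected one) reduces the loss.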
[COMMENTS]
ACL 2024. Code and data are available at
https://github.com/TianduoWang/DPO-ST
[LINK]
http://arxiv.org/abs/2407.18248v1
[DATE]
2024-07-26 01:59:16+08:00
[CATEGORIES]
cs.CL
LoRA-Pro: Are Low-Rank Adapters Properly Optimized?
[AUTHORS]
Zhengbo Wang, Jian Liang
[ABSTRACT]
Low-Rank Adaptation, also known as LoRA, has emerged as a prominent method
for parameter-efficient fine-tuning of foundation models by re-parameterizing the
original matrix into the product of two low-rank matrices. Despite its
efficiency, LoRA often yields inferior performance compared to full
fine-tuning. In this paper, we propose LoRA-Pro to bridge this performance gap.
Firstly, we delve into the optimization processes in LoRA and full fine-tuning.
We reveal that while LoRA employs low-rank approximation, it neglects to
approximate the optimization process of full fine-tuning. To address this, we
introduce a novel concept called the “equivalent gradient.” This virtual
gradient makes the optimization process on the re-parameterized matrix
equivalent to LoRA, which can be used to quantify the differences between LoRA
and full fine-tuning. The equivalent gradient is derived from the gradients of
matrices $A$ and $B$. To narrow the performance gap, our approach minimizes the
differences between the equivalent gradient and the gradient obtained from full
fine-tuning during the optimization process. By solving this objective, we
derive optimal closed-form solutions for updating matrices $A$ and $B$. Our
method constrains the optimization process, shrinking the performance gap
between LoRA and full fine-tuning. Extensive experiments on natural language
processing tasks validate the effectiveness of our method.
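
One way to see how a gradient on the re-parameterized matrix can arise from the gradients of $A$ and $B$ is the first-order expansion below. It is a sketch under stated assumptions rather than the paper's derivation: the scaling factor $s$, the plain gradient-descent step with learning rate $\eta$, and the symbols $g$, $g^A$, $g^B$, $\tilde g$ are notation introduced here for illustration.

```latex
\begin{align}
W &= W_0 + s\,BA, \qquad B\in\mathbb{R}^{m\times r},\ A\in\mathbb{R}^{r\times n},\\
g^A &= \frac{\partial L}{\partial A} = s\,B^\top g, \qquad
g^B = \frac{\partial L}{\partial B} = s\,g\,A^\top, \qquad
g = \frac{\partial L}{\partial W},\\
\mathrm{d}W &= s\,(\mathrm{d}B\,A + B\,\mathrm{d}A)
            = -\eta\,s\,(g^B A + B\,g^A)
            \quad \text{(to first order in } \eta\text{)},\\
\tilde g &:= s\,(g^B A + B\,g^A) = s^2\,(g\,A^\top A + B B^\top g).
\end{align}
```

Under these assumptions, a gradient step on $A$ and $B$ moves $W$ as if it had taken a step along $\tilde g$, which generally differs from the full fine-tuning gradient $g$.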
[LINK]
http://arxiv.org/abs/2407.18242v1
[DATE]
2024-07-26 01:57:12+08:00
[CATEGORIES]
cs.LG
cs.CL
Block Verification Accelerates Speculative Decoding
[AUTHORS]
Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Ahmad Beirami, Jae Hun Ro, Ananda Theertha Suresh
[ABSTRACT]
Speculative decoding is an effective method for lossless acceleration of
large language models during inference. It uses a fast model to draft a block
of tokens which are then verified in parallel by the target model, and provides
a guarantee that the output is distributed identically to a sample from the
target model. In prior works, draft verification is performed independently
token-by-token. Surprisingly, we show that this approach is not optimal. We
propose Block Verification, a simple draft verification algorithm that verifies
the entire block jointly and provides additional wall-clock speedup. We prove
that the proposed mechanism is optimal in the expected number of tokens
produced each iteration and specifically is never worse than the standard
token-level verification. Empirically, block verification provides modest but
consistent wall-clock speedups over the standard token verification algorithm
of 5%-8% in a range of tasks and datasets. Given that block verification does
not increase code complexity, maintains the strong lossless guarantee of the
standard speculative decoding verification algorithm, cannot deteriorate
performance, and, in fact, consistently improves it, it can be used as a good
default in speculative decoding implementations.
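
For context, the standard token-by-token verification that the abstract uses as its baseline can be sketched as below. This illustrates the baseline only, not the proposed Block Verification algorithm. The function names, the list-based probability tables, and the convention that `p_dists` carries one extra entry for the bonus token are assumptions made for this sketch.

```python
import random


def residual(p, q):
    """Normalized max(0, p - q): the distribution used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p == q the residual is empty, but then a rejection cannot occur.
    return [d / total for d in diff] if total > 0 else list(p)


def verify_tokens(draft, p_dists, q_dists, rng):
    """Token-level verification of a drafted block.

    draft:   drafted token ids, each sampled from the draft model
    q_dists: draft-model distribution at each drafted position
    p_dists: target-model distribution at each drafted position, plus one
             extra entry for the position after the block (bonus token)
    """
    out = []
    for i, tok in enumerate(draft):
        p, q = p_dists[i], q_dists[i]
        # Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            # Reject: resample this position from the residual and stop.
            out.append(rng.choices(range(len(p)), weights=residual(p, q))[0])
            return out
    # Every drafted token was accepted: sample one bonus token from the target.
    last = p_dists[-1]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

Each position is accepted or rejected on its own, and the first rejection discards the rest of the block; this is the per-token independence that the abstract says is not optimal.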
[LINK]
http://arxiv.org/abs/2403.10444v2
[DATE]
2024-07-26 01:51:50+08:00
[CATEGORIES]
cs.LG
cs.CL
Recursive Introspection: Teaching Language Model Agents How to Self-Improve
[AUTHORS]
Yuxiao Qu, Tianjun Zhang, Naman Garg, Aviral Kumar
[ABSTRACT]
A central piece in enabling intelligent agentic behavior in foundation models
is to make them capable of introspecting upon their behavior, reasoning, and
correcting their mistakes as more computation or interaction is available. Even
the strongest proprietary large language models (LLMs) do not quite exhibit the
ability to continually improve their responses sequentially, even in
scenarios where they are explicitly told that they are making a mistake. In
this paper, we develop RISE: Recursive IntroSpEction, an approach for
fine-tuning LLMs to introduce this capability, despite prior work hypothesizing
that this capability may not be possible to attain. Our approach prescribes an
iterative fine-tuning procedure, which attempts to teach the model how to alter
its response after having executed previously unsuccessful attempts to solve a
hard test-time problem, optionally with additional environment feedback. RISE
poses fine-tuning for a single-turn prompt as solving a multi-turn Markov
decision process (MDP), where the initial state is the prompt. Inspired by
principles in online imitation learning and reinforcement learning, we propose
strategies for multi-turn data collection and training so as to imbue an LLM
with the capability to recursively detect and correct its previous mistakes in
subsequent iterations. Our experiments show that RISE enables Llama2, Llama3,
and Mistral models to improve themselves with more turns on math reasoning
tasks, outperforming several single-turn strategies given an equal amount of
inference-time computation. We also find that RISE scales well, often attaining
larger benefits with more capable models. Our analysis shows that RISE makes
meaningful improvements to responses to arrive at the correct solution for
challenging prompts, without disrupting one-turn abilities as a result of
expressing more complex distributions.
[LINK]
http://arxiv.org/abs/2407.18219v1
[DATE]
2024-07-26 01:35:59+08:00
[CATEGORIES]
cs.LG
cs.CL
Exploring Scaling Trends in LLM Robustness
[AUTHORS]
Nikolhaus Howe, Michał Zajac, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Pierre-Luc Bacon, Adam Gleave
[ABSTRACT]
Language model capabilities predictably improve from scaling a model’s size
and training data. Motivated by this, increasingly large language models have
been trained, yielding an array of impressive capabilities. Yet these models
are vulnerable to adversarial prompts, such as “jailbreaks” that hijack models
to perform undesired behaviors, posing a significant risk of misuse. Prior work
indicates that computer vision models become more robust with model and data
scaling, raising the question: does language model robustness also improve with
scale? We study this question empirically, finding that larger models respond
substantially better to adversarial training, but there is little to no benefit
from model scale in the absence of explicit defenses.
[COMMENTS]
31 pages
[LINK]
http://arxiv.org/abs/2407.18213v1
[DATE]
2024-07-26 01:26:41+08:00
[CATEGORIES]
cs.LG
cs.CL
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
[AUTHORS]
Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Celine Lin
[ABSTRACT]
Large language models (LLMs) have shown impressive performance on language
tasks but face challenges when deployed on resource-constrained devices due to
their extensive parameters and reliance on dense multiplications, resulting in
high memory demands and latency bottlenecks. Shift-and-add reparameterization
offers a promising solution by replacing costly multiplications with
hardware-friendly primitives in both the attention and multi-layer perceptron
(MLP) layers of an LLM. However, current reparameterization techniques require
training from scratch or full parameter fine-tuning to restore accuracy, which
is resource-intensive for LLMs. To address this, we propose accelerating
pretrained LLMs through post-training shift-and-add reparameterization,
creating efficient multiplication-free models, dubbed ShiftAddLLM.
Specifically, we quantize each weight matrix into binary matrices paired with
group-wise scaling factors. The associated multiplications are reparameterized
into (1) shifts between activations and scaling factors and (2) queries and
adds according to the binary matrices. To reduce accuracy loss, we present a
multi-objective optimization method to minimize both weight and output
activation reparameterization errors. Additionally, based on varying
sensitivity across layers to reparameterization, we develop an automated bit
allocation strategy to further reduce memory usage and latency. Experiments on
five LLM families and eight tasks consistently validate the effectiveness of
ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points
at comparable or lower latency compared to the most competitive quantized LLMs
at 3 and 2 bits, respectively, and more than 80% memory and energy reductions
over the original LLMs. Codes and models are available at
https://github.com/GATECH-EIC/ShiftAddLLM.
[LINK]
http://arxiv.org/abs/2406.05981v3
[DATE]
2024-07-26 01:20:48+08:00
[CATEGORIES]
cs.LG
cs.CL
When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
[AUTHORS]
Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan Celine Lin
[ABSTRACT]
Autoregressive Large Language Models (LLMs) have achieved impressive
performance in language tasks but face two significant bottlenecks: (1)
quadratic complexity in the attention module as the number of tokens increases,
and (2) limited efficiency due to the sequential processing nature of
autoregressive LLMs during generation. While linear attention and speculative
decoding offer potential solutions, their applicability and synergistic
potential for enhancing autoregressive LLMs remain uncertain. We conduct the
first comprehensive study on the efficacy of existing linear attention methods
for autoregressive LLMs, integrating them with speculative decoding. We
introduce an augmentation technique for linear attention that ensures
compatibility with speculative decoding, enabling more efficient training and
serving of LLMs. Extensive experiments and ablation studies involving seven
existing linear attention models and five encoder/decoder-based LLMs
consistently validate the effectiveness of our augmented linearized LLMs.
Notably, our approach achieves up to a 6.67 reduction in perplexity on the
LLaMA model and up to a 2$\times$ speedup during generation compared to prior
linear attention methods. Codes and models are available at
https://github.com/GATECH-EIC/Linearized-LLM.
[COMMENTS]
Accepted by ICML 2024; 17 pages; 10 figures; 16 tables
[LINK]
http://arxiv.org/abs/2406.07368v2
[DATE]
2024-07-26 01:18:01+08:00
[CATEGORIES]
cs.CL
cs.LG
A Unified Framework for Model Editing
[AUTHORS]
Akshat Gupta, Dev Sajnani, Gopala Anumanchipalli
[COMMENTS]
Under review. To appear as poster at KnowledgeableLM Workshop
co-located with ACL 2024
[LINK]
http://arxiv.org/abs/2403.14236v4
[DATE]
2024-07-26 00:52:15+08:00
[CATEGORIES]
cs.LG
cs.CL
Regurgitative Training: The Value of Real Data in Training Large Language Models
[AUTHORS]
Jinghui Zhang, Dandan Qiao, Mochen Yang, Qiang Wei
[ABSTRACT]
What happens if we train a new Large Language Model (LLM) using data that are
at least partially generated by other LLMs? The explosive success of LLMs means
that a substantial amount of content online will be generated by LLMs rather
than humans, which will inevitably enter the training datasets of
next-generation LLMs. We evaluate the implications of such “regurgitative
training” on LLM performance. Through fine-tuning GPT-3.5 with data generated
either by itself or by other LLMs in a machine translation task, we find strong
evidence that regurgitative training clearly handicaps the performance of LLMs.
The same performance loss of regurgitative training is observed on transformer
models that we train from scratch. We find suggestive evidence that the
performance disadvantage of regurgitative training can be attributed to at
least two mechanisms: (1) higher error rates and (2) lower lexical diversity in
LLM-generated data as compared to real data. Based on these mechanisms, we
propose and evaluate three different strategies to mitigate the performance
loss of regurgitative training. First, we devise data-driven metrics to gauge
the quality of each LLM-generated data instance, and then carry out an ordered
training process where high-quality data are added before low-quality ones.
Second, we combine data generated by multiple different LLMs (as an attempt to
increase lexical diversity). Third, we train an AI detection classifier to
differentiate between LLM- and human-generated data, and include LLM-generated
data in the order of resemblance to human-generated data. All three strategies
can improve the performance of regurgitative training to some extent but are
not always able to fully close the gap from training with real data. Our
results highlight the value of real, human-generated data in training LLMs,
which cannot be easily substituted by synthetic, LLM-generated data.
[LINK]
http://arxiv.org/abs/2407.12835v2
[DATE]
2024-07-26 00:50:58+08:00
[CATEGORIES]
cs.CL
Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models
[AUTHORS]
Kenza Benkirane, Laura Gongas, Shahar Pelles, Naomi Fuchs, Joshua Darmon, Pontus Stenetorp, David Ifeoluwa Adelani, Eduardo Sánchez
[ABSTRACT]
Recent advancements in massively multilingual machine translation systems
have significantly enhanced translation accuracy; however, even the best
performing systems still generate hallucinations, severely impacting user
trust. Detecting hallucinations in Machine Translation (MT) remains a critical
challenge, particularly since existing methods excel with High-Resource
Languages (HRLs) but exhibit substantial limitations when applied to
Low-Resource Languages (LRLs). This paper evaluates hallucination detection
approaches using Large Language Models (LLMs) and semantic similarity within
massively multilingual embeddings. Our study spans 16 language directions,
covering HRLs and LRLs with diverse scripts. We find that the choice of model is
essential for performance. On average, for HRLs, Llama3-70B outperforms the
previous state of the art by as much as 0.16 MCC (Matthews Correlation
Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other
LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can
achieve performance comparable to or even better than previously proposed models,
despite not being explicitly trained for any machine translation task. However,
their advantage is less significant for LRLs.
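
Since the results above are reported in MCC, a small sketch of how the Matthews Correlation Coefficient is computed from a binary confusion matrix may help in reading the numbers. The function name `mcc` and the convention of returning 0.0 when the denominator vanishes are choices made here, not details taken from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect detection), so a gain of 0.16 covers a sizeable share of the usable range.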
[COMMENTS]
Authors Kenza Benkirane and Laura Gongas contributed equally to this
work
[LINK]
http://arxiv.org/abs/2407.16470v2
[DATE]
2024-07-26 00:31:39+08:00
[CATEGORIES]
cs.CL
Harmonic LLMs are Trustworthy
[AUTHORS]
Nicholas S. Kersting, Mohammad Rahman, Suchismitha Vedala, Yang Wang
[ABSTRACT]
We introduce an intuitive method to test the robustness (stability and
explainability) of any black-box LLM in real-time via its local deviation from
harmoniticity, denoted as $\gamma$. To the best of our knowledge this is the
first completely model-agnostic and unsupervised method of measuring the
robustness of any given response from an LLM, based upon the model itself
conforming to a purely mathematical standard. To show general application and
immediacy of results, we measure $\gamma$ in 10 popular LLMs (ChatGPT,
Claude-2.1, Claude3.0, GPT-4, GPT-4o, Smaug-72B, Mixtral-8x7B, Llama2-7B,
Mistral-7B and MPT-7B) across thousands of queries in three objective domains:
WebQA, ProgrammingQA, and TruthfulQA. Across all models and domains tested,
human annotation confirms that $\gamma \to 0$ indicates trustworthiness, and
conversely, searching for higher values of $\gamma$ easily exposes examples of
hallucination, a fact that enables efficient adversarial prompt generation
through stochastic gradient ascent in $\gamma$. The low-$\gamma$ leaders among
the models in the respective domains are GPT-4o, GPT-4, and Smaug-72B,
providing evidence that mid-size open-source models can win out against large
commercial models.
[COMMENTS]
15 pages, 2 figures, 16 tables; added Claude-3.0, GPT-4o, Mistral-7B,
Mixtral-8x7B, and more annotation for other models
[LINK]
http://arxiv.org/abs/2404.19708v2
[DATE]
2024-07-26 00:16:46+08:00
[CATEGORIES]
cs.LG
cs.CL
Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical Analysis
[AUTHORS]
Cristian-Alexandru Botocan, Raphael Meier, Ljiljana Dolamic
[ABSTRACT]
Assessing the robustness of multimodal models against adversarial examples is
an important aspect for the safety of their users. We craft L0-norm perturbation
attacks on the preprocessed input images. We launch them in a black-box setup
against four multimodal models and two unimodal DNNs, considering both targeted
and untargeted misclassification. Our attacks target less than 0.04% of
perturbed image area and integrate different spatial positioning of perturbed
pixels: sparse positioning and pixels arranged in different contiguous shapes
(row, column, diagonal, and patch). To the best of our knowledge, we are the
first to assess the robustness of three state-of-the-art multimodal models
(ALIGN, AltCLIP, GroupViT) against different sparse and contiguous pixel
distribution perturbations. The obtained results indicate that unimodal DNNs
are more robust than multimodal models. Furthermore, models using a CNN-based
image encoder are more vulnerable than models with a ViT: for untargeted
attacks, we obtain a 99% success rate by perturbing less than 0.02% of the
image area.
[LINK]
http://arxiv.org/abs/2407.18251v1
[DATE]
2024-07-26 01:59:48+08:00
[CATEGORIES]
cs.LG
VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads
[AUTHORS]
Orest Kupyn, Eugene Khvedchenia, Christian Rupprecht
[ABSTRACT]
Human head detection, keypoint estimation, and 3D head model fitting are
important tasks with many applications. However, traditional real-world
datasets often suffer from bias, privacy, and ethical concerns, and they have
been recorded in laboratory environments, which makes it difficult for trained
models to generalize. Here, we introduce VGGHeads, a large-scale synthetic
dataset generated with diffusion models for human head detection and 3D mesh
estimation. Our dataset comprises over 1 million high-resolution images, each
annotated with detailed 3D head meshes, facial landmarks, and bounding boxes.
Using this dataset, we introduce a new model architecture capable of
simultaneous head detection and head mesh reconstruction from a single image
in a single step. Through extensive experimental evaluations, we demonstrate
that models trained on our synthetic data achieve strong performance on real
images. Furthermore, the versatility of our dataset makes it applicable across
a broad spectrum of tasks, offering a general and comprehensive representation
of human heads. Additionally, we provide detailed information about the
synthetic data generation pipeline, enabling it to be re-used for other tasks
and domains.
[LINK]
http://arxiv.org/abs/2407.18245v1
[DATE]
2024-07-26 01:58:17+08:00
[CATEGORIES]
cs.LG
Numerical Literals in Link Prediction: A Critical Examination of Models and Datasets
[AUTHORS]
Moritz Blum, Basil Ell, Hannes Ill, Philipp Cimiano
[ABSTRACT]
Link Prediction (LP) is an essential task over Knowledge Graphs (KGs),
traditionally focussed on using and predicting the relations between entities.
Textual entity descriptions have already been shown to be valuable, but models
that incorporate numerical literals have shown minor improvements on existing
benchmark datasets. It is unclear whether a model is actually better at using
numerical literals or simply better at utilizing the graph structure. This
raises doubts about the effectiveness of these methods and about the
suitability of the existing benchmark datasets.
We propose a methodology to evaluate LP models that incorporate numerical
literals. We propose i) a new synthetic dataset to better understand how well
these models use numerical literals and ii) dataset ablation strategies to
investigate potential difficulties with the existing datasets. We identify a
prevalent trend: many models underutilize literal information and potentially
rely on additional parameters for performance gains. Our investigation
highlights the need for more extensive evaluations when releasing new models
and datasets.
[LINK]
http://arxiv.org/abs/2407.18241v1
[DATE]
2024-07-26 01:55:33+08:00
[CATEGORIES]
cs.LG
Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
[AUTHORS]
Matteo Gioele Collu, Tom Janssen-Groesbeek, Stefanos Koffas, Mauro Conti, Stjepan Picek
[ABSTRACT]
Recently, we have witnessed a rise in the use of Large Language Models
(LLMs), especially in applications like chatbot assistants. Safety mechanisms
and specialized training procedures are implemented to prevent improper
responses from these assistants. In this work, we bypass these measures for
ChatGPT and Gemini (and, to some extent, Bing chat) by making them impersonate
complex personas with personality characteristics that are not aligned with a
truthful assistant. We start by creating elaborate biographies of these
personas, which we then use in a new session with the same chatbots. Our
conversations then follow a role-play style to elicit prohibited responses.
Using personas, we show that prohibited responses are actually provided, making
it possible to obtain unauthorized, illegal, or harmful information. This work
shows that by using adversarial personas, one can overcome safety mechanisms
set out by ChatGPT and Gemini. We also introduce several ways of activating
such adversarial personas, which show that both chatbots are vulnerable to this
kind of attack. With the same principle, we introduce two defenses that push
the model to interpret trustworthy personalities and make it more robust
against such attacks.
[LINK]
http://arxiv.org/abs/2312.03853v4
[DATE]
2024-07-26 01:54:12+08:00
[CATEGORIES]
cs.LG
Can time series forecasting be automated? A benchmark and analysis
[AUTHORS]
Anvitha Thirthapura Sreedhara, Joaquin Vanschoren
[ABSTRACT]
In the field of machine learning and artificial intelligence, time series
forecasting plays a pivotal role across various domains such as finance,
healthcare, and weather. However, selecting the most suitable
forecasting method for a given dataset is complex due to the diversity
of data patterns and characteristics. This research aims to address this
challenge by proposing a comprehensive benchmark for evaluating and ranking
time series forecasting methods across a wide range of datasets. This study
investigates the comparative performance of many methods from two prominent
time series forecasting frameworks, AutoGluon-Timeseries and sktime, to shed
light on their applicability in different real-world scenarios. This research
contributes to the field of time series forecasting by providing a robust
benchmarking methodology and facilitating informed decision-making when
choosing forecasting methods for achieving optimal prediction.
[LINK]
http://arxiv.org/abs/2407.16445v2
[DATE]
2024-07-26 01:53:38+08:00
[CATEGORIES]
cs.LG
Automated Ensemble Multimodal Machine Learning for Healthcare
[AUTHORS]
Fergus Imrie, Stefan Denner, Lucas S. Brunschwig, Klaus Maier-Hein, Mihaela van der Schaar
[ABSTRACT]
The application of machine learning in medicine and healthcare has led to the
creation of numerous diagnostic and prognostic models. However, despite their
success, current approaches generally issue predictions using data from a
single modality. This stands in stark contrast with clinician decision-making
which employs diverse information from multiple sources. While several
multimodal machine learning approaches exist, significant challenges in
developing multimodal systems remain that are hindering clinical adoption. In
this paper, we introduce a multimodal framework, AutoPrognosis-M, that enables
the integration of structured clinical (tabular) data and medical imaging using
automated machine learning. AutoPrognosis-M incorporates 17 imaging models,
including convolutional neural networks and vision transformers, and three
distinct multimodal fusion strategies. In an illustrative application using a
multimodal skin lesion dataset, we highlight the importance of multimodal
machine learning and the power of combining multiple fusion strategies using
ensemble learning. We have open-sourced our framework as a tool for the
community and hope it will accelerate the uptake of multimodal machine learning
in healthcare and spur further innovation.
[LINK]
http://arxiv.org/abs/2407.18227v1
[DATE]
2024-07-26 01:46:38+08:00
[CATEGORIES]
cs.LG
ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer
[AUTHORS]
Haoran You, Huihong Shi, Yipin Guo, Yingyan Celine Lin
[ABSTRACT]
Vision Transformers (ViTs) have shown impressive performance and have become
a unified backbone for multiple vision tasks. However, both the attention
mechanism and multi-layer perceptrons (MLPs) in ViTs are not sufficiently
efficient due to dense multiplications, leading to costly training and
inference. To this end, we propose to reparameterize pre-trained ViTs with a
mixture of multiplication primitives, e.g., bitwise shifts and additions,
towards a new type of multiplication-reduced model, dubbed
$\textbf{ShiftAddViT}$, which aims to achieve end-to-end inference speedups on
GPUs without requiring training from scratch. Specifically, all
$\texttt{MatMuls}$ among queries, keys, and values are reparameterized using
additive kernels, after mapping queries and keys to binary codes in Hamming
space. The remaining MLPs or linear layers are then reparameterized with shift
kernels. We utilize TVM to implement and optimize those customized kernels for
practical hardware deployment on GPUs. We find that such a reparameterization
on attention maintains model accuracy, while inevitably leading to accuracy
drops when being applied to MLPs. To marry the best of both worlds, we further
propose a new mixture of experts (MoE) framework to reparameterize MLPs by
taking multiplication or its primitives as experts, e.g., multiplication and
shift, and designing a new latency-aware load-balancing loss. Such a loss helps
to train a generic router for assigning a dynamic amount of input tokens to
different experts according to their latency. Extensive experiments on various
2D/3D Transformer-based vision tasks consistently validate the effectiveness of
our proposed ShiftAddViT, achieving up to $\textbf{5.18$\times$}$ latency
reductions on GPUs and $\textbf{42.9}$% energy savings, while maintaining
accuracy comparable to the original or efficient ViTs.
[COMMENTS]
Accepted by NeurIPS 2023
[LINK]
http://arxiv.org/abs/2306.06446v6
[DATE]
2024-07-26 01:19:31+08:00
[CATEGORIES]
cs.LG
Differentiable Quantum Architecture Search in Asynchronous Quantum Reinforcement Learning
[AUTHORS]
Samuel Yen-Chi Chen
[ABSTRACT]
The emergence of quantum reinforcement learning (QRL) is propelled by
advancements in quantum computing (QC) and machine learning (ML), particularly
through quantum neural networks (QNN) built on variational quantum circuits
(VQC). These advancements have proven successful in addressing sequential
decision-making tasks. However, constructing effective QRL models demands
significant expertise due to challenges in designing quantum circuit
architectures, including data encoding and parameterized circuits, which
profoundly influence model performance. In this paper, we propose addressing
this challenge with differentiable quantum architecture search (DiffQAS),
enabling trainable circuit parameters and structure weights using
gradient-based optimization. Furthermore, we enhance training efficiency
through asynchronous reinforcement learning (RL) methods facilitating parallel
training. Through numerical simulations, we demonstrate that our proposed
DiffQAS-QRL approach achieves performance comparable to manually-crafted
circuit architectures across considered environments, showcasing stability
across diverse scenarios. This methodology offers a pathway for designing QRL
models without extensive quantum knowledge, ensuring robust performance and
fostering broader application of QRL.
[COMMENTS]
Accepted by IEEE International Conference on Quantum Computing and
Engineering - QCE 2024
[LINK]
http://arxiv.org/abs/2407.18202v1
[DATE]
2024-07-26 01:11:00+08:00
[CATEGORIES]
cs.LG
Sparse Incremental Aggregation in Multi-Hop Federated Learning
[AUTHORS]
Sourav Mukherjee, Nasrin Razmi, Armin Dekorsy, Petar Popovski, Bho Matthiesen
[ABSTRACT]
This paper investigates federated learning (FL) in a multi-hop communication
setup, such as in constellations with inter-satellite links. In this setup,
some of the FL clients are responsible for forwarding other clients’ results to
the parameter server. Instead of using conventional routing, the communication
efficiency can be improved significantly by using in-network model aggregation
at each intermediate hop, known as incremental aggregation (IA). Prior works
[1] have indicated diminishing gains for IA under gradient sparsification. Here
we study this issue and propose several novel correlated sparsification methods
for IA. Numerical results show that, for some of these algorithms, the full
potential of IA is still available under sparsification without impairing
convergence. We demonstrate a 15x improvement in communication efficiency over
conventional routing and an 11x improvement over state-of-the-art (SoA) sparse
IA.
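
The difference between conventional routing and incremental aggregation can be sketched on a simple chain of clients leading to the parameter server. This is a dense toy model for intuition only: it does not implement the sparsification methods proposed in the paper, and the chain topology, function names, and per-vector transmission count are assumptions made for this sketch.

```python
def route_all(updates):
    """Conventional routing on a chain (farthest client first).

    Every node forwards all vectors it has received plus its own, so the
    node at hop i sends i vectors toward the server.
    """
    transmissions = 0
    in_flight = []
    for update in updates:
        in_flight.append(update)
        transmissions += len(in_flight)  # one transmission per forwarded vector
    total = [sum(column) for column in zip(*in_flight)]
    return total, transmissions


def incremental_aggregation(updates):
    """In-network aggregation: each hop adds its update to the running sum
    and forwards a single vector."""
    transmissions = 0
    partial = [0.0] * len(updates[0])
    for update in updates:
        partial = [a + b for a, b in zip(partial, update)]
        transmissions += 1  # only the partial aggregate is forwarded
    return partial, transmissions
```

For a chain of n clients the server receives the same sum in both cases, but routing needs n(n+1)/2 vector transmissions while incremental aggregation needs n.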
[COMMENTS]
This paper is accepted for the 25th IEEE International Workshop on
Signal Processing Advances in Wireless Communications (SPAWC) conference
[LINK]
http://arxiv.org/abs/2407.18200v1
[DATE]
2024-07-26 01:09:22+08:00
[CATEGORIES]
cs.LG
Wasserstein approximation schemes based on Voronoi partitions
[AUTHORS]
Keaton Hamm, Varun Khurana
[ABSTRACT]
We consider structured approximation of measures in Wasserstein space
$\mathrm{W}_p(\mathbb{R}^d)$ for $p\in[1,\infty)$ using general measure
approximants compactly supported on Voronoi regions derived from a scaled
Voronoi partition of $\mathbb{R}^d$. We show that if a full rank lattice
$\Lambda$ is scaled by a factor of $h\in(0,1]$, then approximation of a measure
based on the Voronoi partition of $h\Lambda$ is $O(h)$ regardless of $d$ or
$p$. We then use a covering argument to show that $N$-term approximation of
compactly supported measures is $O(N^{-\frac1d})$, which matches known rates for
optimal quantizers and empirical measure approximation in most instances.
Additionally, we generalize our construction to nonuniform Voronoi partitions,
highlighting the flexibility and robustness of our approach for various measure
approximation scenarios. Finally, we extend these results to noncompactly
supported measures with sufficient decay. Our findings are pertinent to
applications in computer vision and machine learning where measures are used to
represent structured data such as images.
[LINK]
http://arxiv.org/abs/2310.09149v2
[DATE]
2024-07-26 01:05:37+08:00
[CATEGORIES]
cs.LG
AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction
[AUTHORS]
Chunan Liu, Lilian Denzler, Yihong Chen, Andrew Martin, Brooks Paige
[ABSTRACT]
Epitope identification is vital for antibody design yet challenging due to
the inherent variability in antibodies. While many deep learning methods have
been developed for general protein binding site prediction tasks, whether they
work for epitope prediction remains an understudied research question. The
challenge is also heightened by the lack of a consistent evaluation pipeline
with sufficient dataset size and epitope diversity. We introduce a filtered
antibody-antigen complex structure dataset, AsEP (Antibody-specific Epitope
Prediction). AsEP is the largest of its kind and provides clustered epitope
groups, allowing the community to develop and test novel epitope prediction
methods. AsEP comes with an easy-to-use interface in Python and pre-built graph
representations of each antibody-antigen complex while also supporting
customizable embedding methods. Based on this new dataset, we benchmark
various representative general protein-binding site prediction methods and find
that their performances are not satisfactory as expected for epitope
prediction. We thus propose a new method, WALLE, that leverages both protein
language models and graph neural networks. WALLE demonstrates about a 5X
performance gain over existing methods. Our empirical findings evidence that
epitope prediction benefits from combining sequential embeddings provided by
language models and geometrical information from graph representations,
providing a guideline for future method design. In addition, we reformulate the
task as bipartite link prediction, allowing easy model performance attribution
and interpretability. We open-source our data and code at
https://github.com/biochunan/AsEP-dataset.
[LINK]
http://arxiv.org/abs/2407.18184v1
[DATE]
2024-07-26 00:43:56+08:00
[CATEGORIES]
cs.LG
Gene Regulatory Network Inference from Pre-trained Single-Cell Transcriptomics Transformer with Joint Graph Learning
[AUTHORS]
Sindhura Kommu, Yizhi Wang, Yue Wang, Xuan Wang
[ABSTRACT]
Inferring gene regulatory networks (GRNs) from single-cell RNA sequencing
(scRNA-seq) data is a complex challenge that requires capturing the intricate
relationships between genes and their regulatory interactions. In this study,
we tackle this challenge by leveraging the single-cell BERT-based pre-trained
transformer model (scBERT), trained on extensive unlabeled scRNA-seq data, to
augment structured biological knowledge from existing GRNs. We introduce a
novel joint graph learning approach that combines the rich contextual
representations learned by pre-trained single-cell language models with the
structured knowledge encoded in GRNs using graph neural networks (GNNs). By
integrating these two modalities, our approach effectively reasons over both the
gene expression level constraints provided by the scRNA-seq data and the
structured biological knowledge inherent in GRNs. We evaluate our method on
human cell benchmark datasets from the BEELINE study with cell type-specific
ground truth networks. The results demonstrate superior performance over
current state-of-the-art baselines, offering a deeper understanding of cellular
regulatory mechanisms.
[COMMENTS]
Accepted into the ICML 2024 AI for Science workshop
[LINK]
http://arxiv.org/abs/2407.18181v1
[DATE]
2024-07-26 00:42:08+08:00
[CATEGORIES]
cs.LG
Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers
[AUTHORS]
Zhengang Li, Alec Lu, Yanyue Xie, Zhenglun Kong, Mengshu Sun, Hao Tang, Zhong Jia Xue, Peiyan Dong, Caiwen Ding, Yanzhi Wang, Xue Lin, Zhenman Fang
[ABSTRACT]
Vision transformers (ViTs) have demonstrated their superior accuracy for
computer vision tasks compared to convolutional neural networks (CNNs).
However, ViT models are often too computation-intensive for efficient deployment on
resource-limited edge devices. This work proposes Quasar-ViT, a
hardware-oriented quantization-aware architecture search framework for ViTs, to
design efficient ViT models for hardware implementation while preserving the
accuracy. First, Quasar-ViT trains a supernet using our row-wise flexible
mixed-precision quantization scheme, mixed-precision weight entanglement, and
supernet layer scaling techniques. Then, it applies an efficient
hardware-oriented search algorithm, integrated with hardware latency and
resource modeling, to determine a series of optimal subnets from the supernet under
different inference latency targets. Finally, we propose a series of
model-adaptive designs on the FPGA platform to support the architecture search
and mitigate the gap between the theoretical computation reduction and the
practical inference speedup. Our searched models achieve 101.5, 159.6, and
251.6 frames-per-second (FPS) inference speed on the AMD/Xilinx ZCU102 FPGA
with 80.4%, 78.6%, and 74.9% top-1 accuracy, respectively, for the ImageNet
dataset, consistently outperforming prior works.
[COMMENTS]
Accepted by ICS 2024
[LINK]
http://arxiv.org/abs/2407.18175v1
[DATE]
2024-07-26 00:35:46+08:00
[CATEGORIES]
cs.LG
RIDA: A Robust Attack Framework on Incomplete Graphs
[AUTHORS]
Jianke Yu, Hanchen Wang, Chen Chen, Xiaoyang Wang, Wenjie Zhang, Ying Zhang
[ABSTRACT]
Graph Neural Networks (GNNs) are vital in data science but are increasingly
susceptible to adversarial attacks. To help researchers develop more robust GNN
models, it’s essential to focus on designing strong attack models as
foundational benchmarks and guiding references. Among adversarial attacks,
gray-box poisoning attacks are noteworthy due to their effectiveness and fewer
constraints. These attacks exploit GNNs’ need for retraining on updated data,
thereby impacting their performance by perturbing these datasets. However,
current research overlooks the real-world scenario of incomplete graphs. To
address this gap, we introduce the Robust Incomplete Deep Attack Framework
(RIDA). It is the first algorithm for robust gray-box poisoning attacks on
incomplete graphs. The approach innovatively aggregates distant vertex
information and ensures powerful data utilization. Extensive tests against 9
SOTA baselines on 3 real-world datasets demonstrate RIDA’s superiority in
handling incompleteness and high attack performance on the incomplete graph.
[LINK]
http://arxiv.org/abs/2407.18170v1
[DATE]
2024-07-26 00:33:35+08:00
[CATEGORIES]
cs.LG
Light Curve Classification with DistClassiPy: a new distance-based classifier
[AUTHORS]
Siddharth Chaini, Ashish Mahabal, Ajit Kembhavi, Federica B. Bianco
[ABSTRACT]
The rise of synoptic sky surveys has ushered in an era of big data in
time-domain astronomy, making data science and machine learning essential tools
for studying celestial objects. While tree-based models (e.g. Random Forests)
and deep learning models dominate the field, we explore the use of different
distance metrics to aid in the classification of astrophysical objects. We
developed DistClassiPy, a new distance metric based classifier. The direct use
of distance metrics is unexplored in time-domain astronomy, but distance-based
methods can help make classification more interpretable and decrease
computational costs. In particular, we applied DistClassiPy to classify light
curves of variable stars, comparing the distances between objects of different
classes. Using 18 distance metrics on a catalog of 6,000 variable stars across
10 classes, we demonstrate classification and dimensionality reduction. Our
classifier meets state-of-the-art performance but has lower computational
requirements and improved interpretability. Additionally, DistClassiPy can be
tailored to specific objects by identifying the most effective distance metric
for that classification. To facilitate broader applications within and beyond
astronomy, we have made DistClassiPy open-source and available at
https://pypi.org/project/distclassipy/.
[COMMENTS]
Accepted for publication in Astronomy and Computing (2024). 24 pages,
19 figures
[LINK]
http://arxiv.org/abs/2403.12120v2
[DATE]
2024-07-26 00:27:49+08:00
[CATEGORIES]
cs.LG
Longhorn: State Space Models are Amortized Online Learners
[AUTHORS]
Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, Qiang Liu
[ABSTRACT]
The most fundamental capability of modern AI methods such as Large Language
Models (LLMs) is the ability to predict the next token in a long sequence of
tokens, known as “sequence modeling.” Although the Transformer model is the
current dominant approach to sequence modeling, its quadratic computational
cost with respect to sequence length is a significant drawback. State-space
models (SSMs) offer a promising alternative due to their linear decoding
efficiency and high parallelizability during training. However, existing SSMs
often rely on seemingly ad hoc linear recurrence designs. In this work, we
explore SSM design through the lens of online learning, conceptualizing SSMs as
meta-modules for specific online learning problems. This approach links SSM
design to formulating precise online learning objectives, with state transition
rules derived from optimizing these objectives. Based on this insight, we
introduce a novel deep SSM architecture based on the implicit update for
optimizing an online regression objective. Our experimental results show that
our models outperform state-of-the-art SSMs, including the Mamba model, on
standard sequence modeling benchmarks and language modeling tasks.
[LINK]
http://arxiv.org/abs/2407.14207v2
[DATE]
2024-07-26 00:24:59+08:00
[CATEGORIES]
cs.LG
Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models
[AUTHORS]
Sanae Lotfi, Yilun Kuang, Brandon Amos, Micah Goldblum, Marc Finzi, Andrew Gordon Wilson
[ABSTRACT]
Large language models (LLMs) with billions of parameters excel at predicting
the next token in a sequence. Recent work computes non-vacuous
compression-based generalization bounds for LLMs, but these bounds are vacuous
for large models at the billion-parameter scale. Moreover, these bounds are
obtained through restrictive compression techniques, bounding compressed models
that generate low-quality text. Additionally, the tightness of these existing
bounds depends on the number of IID documents in a training set rather than the
much larger number of non-IID constituent tokens, leaving untapped potential
for tighter bounds. In this work, we instead use properties of martingales to
derive generalization bounds that benefit from the vast number of tokens in LLM
training sets. Since a dataset contains far more tokens than documents, our
generalization bounds not only tolerate but actually benefit from far less
restrictive compression schemes. With Monarch matrices, Kronecker
factorizations, and post-training quantization, we achieve non-vacuous
generalization bounds for LLMs as large as LLaMA2-70B. Unlike previous
approaches, our work achieves the first non-vacuous bounds for models that are
deployed in practice and generate high-quality text.
[LINK]
http://arxiv.org/abs/2407.18158v1
[DATE]
2024-07-26 00:13:58+08:00
[CATEGORIES]
cs.LG
No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO
[AUTHORS]
Skander Moalla, Andrea Miele, Razvan Pascanu, Caglar Gulcehre
[ABSTRACT]
Reinforcement learning (RL) is inherently rife with non-stationarity since
the states and rewards the agent observes during training depend on its
changing policy. Therefore, networks in deep RL must be capable of adapting to
new observations and fitting new targets. However, previous works have observed
that networks in off-policy deep value-based methods exhibit a decrease in
representation rank, often correlated with an inability to continue learning or
a collapse in performance. Although this phenomenon has generally been
attributed to neural network learning under non-stationarity, it has been
overlooked in on-policy policy optimization methods which are often thought
capable of training indefinitely. In this work, we empirically study
representation dynamics in Proximal Policy Optimization (PPO) on the Atari and
MuJoCo environments, revealing that PPO agents are also affected by feature
rank deterioration and loss of plasticity. We show that this is aggravated with
stronger non-stationarity, ultimately driving the actor’s performance to
collapse, regardless of the performance of the critic. We ask why the trust
region, specific to methods like PPO, cannot alleviate or prevent the collapse.
We find that there is a connection between representation collapse and the
degradation of the trust region, one exacerbating the other, and present
Proximal Feature Optimization (PFO), a novel auxiliary loss that, along with
other interventions, shows that regularizing the representation dynamics
improves the performance of PPO agents.
[COMMENTS]
ICML ARLET workshop version. Code and run histories are available at
https://github.com/CLAIRE-Labo/no-representation-no-trust
[LINK]
http://arxiv.org/abs/2405.00662v2
[DATE]
2024-07-26 00:04:49+08:00
[CATEGORIES]
cs.LG
Evaluating the design space of diffusion-based generative models
[AUTHORS]
Yuqing Wang, Ye He, Molei Tao
[ABSTRACT]
Most existing theoretical investigations of the accuracy of diffusion models,
albeit significant, assume the score function has been approximated to a
certain accuracy, and then use this a priori bound to control the error of
generation. This article instead provides a first quantitative understanding of
the whole generation process, i.e., both training and sampling. More precisely,
it conducts a non-asymptotic convergence analysis of denoising score matching
under gradient descent. In addition, a refined sampling error analysis for
variance exploding models is also provided. The combination of these two
results yields a full error analysis, which elucidates (again, but this time
theoretically) how to design the training and sampling processes for effective
generation. For instance, our theory implies a preference toward noise
distribution and loss weighting in training that qualitatively agree with the
ones used in [Karras et al. 2022]. It also provides perspectives on the choices
of time and variance schedules in sampling: when the score is well trained, the
design in [Song et al. 2020] is preferable, but when it is less trained,
the design in [Karras et al. 2022] becomes preferable.
[COMMENTS]
Comments are welcome. Out of admiration we titled our paper after
EDM, and hoped theorists’ humor is not too corny
[LINK]
http://arxiv.org/abs/2406.12839v2
[DATE]
2024-07-26 00:01:04+08:00
[CATEGORIES]
cs.LG
The FIGNEWS Shared Task on News Media Narratives
[AUTHORS]
Wajdi Zaghouani, Mustafa Jarrar, Nizar Habash, Houda Bouamor, Imed Zitouni, Mona Diab, Samhaa R. El-Beltagy, Muhammed AbuOdeh
[COMMENTS]
18 pages, 10 tables, 1 figure, accepted to ArabicNLP 2024 co-located
with ACL 2024
[LINK]
http://arxiv.org/abs/2407.18147v1
[DATE]
2024-07-25 23:58:19+08:00
[CATEGORIES]
cs.CL
Tracking linguistic information in transformer-based sentence embeddings through targeted sparsification
[AUTHORS]
Vivi Nastase, Paola Merlo
[ABSTRACT]
Analyses of transformer-based models have shown that they encode a variety of
linguistic information from their textual input. While these analyses have shed
light on the relation between linguistic information on one side, and
internal architecture and parameters on the other, a question remains
unanswered: how is this linguistic information reflected in sentence
embeddings? Using datasets consisting of sentences with known structure, we
test to what degree information about chunks (in particular noun, verb or
prepositional phrases), such as grammatical number, or semantic role, can be
localized in sentence embeddings. Our results show that such information is not
distributed over the entire sentence embedding, but rather it is encoded in
specific regions. Understanding how the information from an input text is
compressed into sentence embeddings helps understand current transformer models
and helps build future explainable neural models.
[COMMENTS]
12 pages, 9 figures, 1 table, published in RepL4NLP 2024
[LINK]
http://arxiv.org/abs/2407.18119v1
[DATE]
2024-07-25 23:27:08+08:00
[CATEGORIES]
cs.CL
PEFT-U: Parameter-Efficient Fine-Tuning for User Personalization
[AUTHORS]
Christopher Clarke, Yuzhao Heng, Lingjia Tang, Jason Mars
[ABSTRACT]
The recent emergence of Large Language Models (LLMs) has heralded a new era
of human-AI interaction. These sophisticated models, exemplified by ChatGPT
and its successors, have exhibited remarkable capabilities in language
understanding. However, as these LLMs have undergone exponential growth, a
crucial dimension that remains understudied is the personalization of these
models. Large foundation models such as GPT-3 etc. focus on creating a
universal model that serves a broad range of tasks and users. This approach
emphasizes the model’s generalization capabilities, treating users as a
collective rather than as distinct individuals. While practical for many common
applications, this one-size-fits-all approach often fails to address the rich
tapestry of human diversity and individual needs. To explore this issue we
introduce the PEFT-U Benchmark: a new dataset for building and evaluating NLP
models for user personalization. PEFT-U consists of a series of
user-centered tasks containing diverse and individualized expressions where the
preferences of users can potentially differ for the same input. Using PEFT-U,
we explore the challenge of efficiently personalizing LLMs to accommodate
user-specific preferences in the context of diverse user-centered tasks.
[LINK]
http://arxiv.org/abs/2407.18078v1
[DATE]
2024-07-25 22:36:18+08:00
[CATEGORIES]
cs.CL
I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition
[AUTHORS]
Yannis Vasilakis, Rachel Bittner, Johan Pauwels
[ABSTRACT]
Music two-tower multimodal systems integrate audio and text modalities into a
joint audio-text space, enabling direct comparison between songs and their
corresponding labels. These systems enable new approaches for classification
and retrieval, leveraging both modalities. Despite the promising results they
have shown for zero-shot classification and retrieval tasks, closer inspection
of the embeddings is needed. This paper evaluates the inherent zero-shot
properties of joint audio-text spaces for the case study of instrument
recognition. We present an evaluation and analysis of two-tower systems for
zero-shot instrument recognition and a detailed analysis of the properties of
the pre-joint and joint embedding spaces. Our findings suggest that audio
encoders alone demonstrate good quality, while challenges remain within the
text encoder or joint space projection. Specifically, two-tower systems exhibit
sensitivity towards specific words, favoring generic prompts over musically
informed ones. Despite the large size of textual encoders, they do not yet
leverage additional textual context or infer instruments accurately from their
descriptions. Lastly, a novel approach for quantifying the semantic
meaningfulness of the textual space leveraging an instrument ontology is
proposed. This method reveals deficiencies in the systems’ understanding of
instruments and provides evidence of the need for fine-tuning text encoders on
musical data.
[COMMENTS]
Accepted to ISMIR 2024
[LINK]
http://arxiv.org/abs/2407.18058v1
[DATE]
2024-07-25 22:15:05+08:00
[CATEGORIES]
cs.CL
cs.LG
RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models
[AUTHORS]
Haoyu Chen, Wenbo Li, Jinjin Gu, Jingjing Ren, Sixiang Chen, Tian Ye, Renjing Pei, Kaiwen Zhou, Fenglong Song, Lei Zhu
[ABSTRACT]
Natural images captured by mobile devices often suffer from multiple types of
degradation, such as noise, blur, and low light. Traditional image restoration
methods require manual selection of specific tasks, algorithms, and execution
sequences, which is time-consuming and may yield suboptimal results. All-in-one
models, though capable of handling multiple tasks, typically support only a
limited range and often produce overly smooth, low-fidelity outcomes due to
their broad data distribution fitting. To address these challenges, we first
define a new pipeline for restoring images with multiple degradations, and then
introduce RestoreAgent, an intelligent image restoration system leveraging
multimodal large language models. RestoreAgent autonomously assesses the type
and extent of degradation in input images and performs restoration through (1)
determining the appropriate restoration tasks, (2) optimizing the task
sequence, (3) selecting the most suitable models, and (4) executing the
restoration. Experimental results demonstrate the superior performance of
RestoreAgent in handling complex degradation, surpassing human experts.
Furthermore, the system’s modular design facilitates the fast integration of new
tasks and models, enhancing its flexibility and scalability for various
applications.
[LINK]
http://arxiv.org/abs/2407.18035v1
[DATE]
2024-07-25 21:29:37+08:00
[CATEGORIES]
cs.CL
PATCH! Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Proficiency in 8th Grade Mathematics
[AUTHORS]
Qixiang Fang, Daniel L. Oberski, Dong Nguyen
[ABSTRACT]
Many existing benchmarks of large (multimodal) language models (LLMs) focus
on measuring LLMs’ academic proficiency, often with an additional interest in
comparing model performance with human test takers. While these benchmarks have
proven key to the development of LLMs, they suffer from several limitations,
including questionable measurement quality (e.g., Do they measure what they are
supposed to in a reliable way?), lack of quality assessment on the item level
(e.g., Are some items more important or difficult than others?) and unclear
human population reference (e.g., To whom can the model be compared?). In
response to these challenges, we propose leveraging knowledge from
psychometrics - a field dedicated to the measurement of latent variables like
academic proficiency - into LLM benchmarking. We make three primary
contributions. First, we introduce PATCH: a novel framework for
Psychometrics-AssisTed benCHmarking of LLMs. PATCH addresses the
aforementioned limitations, presenting a new direction for LLM benchmark
research. Second, we implement PATCH by measuring GPT-4 and Gemini-Pro-Vision’s
proficiency in 8th grade mathematics against 56 human populations. We show that
adopting a psychometrics-based approach yields evaluation outcomes that diverge
from those based on existing benchmarking practices. Third, we release 4
high-quality datasets to support measuring and comparing LLM proficiency in
grade school mathematics and science against human populations.
[LINK]
http://arxiv.org/abs/2404.01799v2
[DATE]
2024-07-25 21:12:47+08:00
[CATEGORIES]
cs.CL
Resolving Discrepancies in Compute-Optimal Scaling of Language Models
[AUTHORS]
Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon
[ABSTRACT]
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the
optimal model size as a function of the compute budget, but these laws yield
substantially different predictions. We explain the discrepancy by reproducing
the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and
identifying three factors causing the difference: last layer computational
cost, warmup duration, and scale-dependent optimizer tuning. With these factors
corrected, we obtain excellent agreement with the Hoffmann et al. (i.e.,
“Chinchilla”) scaling law. Counter to a hypothesis of Hoffmann et al., we find
that careful learning rate decay is not essential for the validity of their
scaling law. As a secondary result, we derive scaling laws for the optimal
learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter
is essential at lower batch sizes.
[COMMENTS]
Fixing bug in small models with tuned LR
[LINK]
http://arxiv.org/abs/2406.19146v2
[DATE]
2024-07-25 21:09:18+08:00
[CATEGORIES]
cs.LG
cs.CL
Keep the Cost Down: A Review on Methods to Optimize LLM’s KV-Cache Consumption
[AUTHORS]
Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai
[ABSTRACT]
Large Language Models (LLMs), epitomized by ChatGPT’s release in late 2022,
have revolutionized various industries with their advanced language
comprehension. However, their efficiency is challenged by the Transformer
architecture’s struggle with handling long texts. KV-Cache has emerged as a
pivotal solution to this issue, converting the time complexity of token
generation from quadratic to linear, albeit with increased GPU memory overhead
proportional to conversation length. With the development of the LLM community
and academia, various KV-Cache compression methods have been proposed. In this
review, we dissect the various properties of KV-Cache and elaborate on various
methods currently used to optimize the KV-Cache space usage of LLMs. These
methods span the pre-training phase, deployment phase, and inference phase, and
we summarize the commonalities and differences among these methods.
Additionally, we list some metrics for evaluating the long-text capabilities of
large language models, from both efficiency and capability perspectives. Our
review thus sheds light on the evolving landscape of LLM optimization, offering
insights into future advancements in this dynamic field.
[COMMENTS]
to be published in CoLM 2024
[LINK]
http://arxiv.org/abs/2407.18003v1
[DATE]
2024-07-25 20:56:22+08:00
[CATEGORIES]
cs.CL
On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures
[AUTHORS]
Nick Rossenbach, Benedikt Hilmes, Ralf Schlüter
[ABSTRACT]
In this work we evaluate the utility of synthetic data for training automatic
speech recognition (ASR). We use the ASR training data to train a
text-to-speech (TTS) system similar to FastSpeech-2. With this TTS we reproduce
the original training data, training ASR systems solely on synthetic data. For
ASR, we use three different architectures, attention-based encoder-decoder,
hybrid deep neural network hidden Markov model and a Gaussian mixture hidden
Markov model, showing the different sensitivity of the models to synthetic data
generation. In order to extend previous work, we present a number of ablation
studies on the effectiveness of synthetic vs. real training data for ASR. In
particular we focus on how the gap between training on synthetic and real data
changes by varying the speaker embedding or by scaling the model size. For the
latter we show that the TTS models generalize well, even when training scores
indicate overfitting.
[COMMENTS]
Accepted at the SynData4GenAI 2024 workshop
[LINK]
http://arxiv.org/abs/2407.17997v1
[DATE]
2024-07-25 20:44:45+08:00
[CATEGORIES]
cs.CL
cs.LG
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
[AUTHORS]
Michael Hassid, Tal Remez, Jonas Gehring, Roy Schwartz, Yossi Adi
[ABSTRACT]
It is a common belief that large language models (LLMs) are better than
smaller-sized ones. However, larger models also require significantly more time
and compute during inference. This raises the question: what happens when both
models operate under the same budget (e.g., compute, run-time)? To address
this question, we analyze code generation LLMs of various sizes and make
comparisons such as running a 70B model once vs. generating five outputs from a
13B model. We consider a standard unit-test setup, which can be used to select
the correct output from the smaller model. Our findings reveal that the
repeated use of smaller models can yield consistent improvements, with gains of
up to 15% across five tasks. On the other hand, in scenarios where unit-tests
are unavailable, a ranking-based selection of candidates from the smaller model
falls short of the performance of a single output from larger ones. Our results
highlight the potential of using smaller models instead of larger ones, and the
importance of studying approaches for ranking LLM outputs.
[COMMENTS]
COLM 2024
[LINK]
http://arxiv.org/abs/2404.00725v2
[DATE]
2024-07-25 19:37:54+08:00
[CATEGORIES]
cs.CL
cs.LG
Positive Text Reframing under Multi-strategy Optimization
[AUTHORS]
Shutong Jia, Biwei Cao, Qingqing Gao, Jiuxin Cao, Bo Liu
[ABSTRACT]
Differing from sentiment transfer, positive reframing seeks to substitute
negative perspectives with positive expressions while preserving the original
meaning. With the emergence of pre-trained language models (PLMs), it is
possible to achieve acceptable results by fine-tuning PLMs. Nevertheless,
generating fluent, diverse and task-constrained reframing text remains a
significant challenge. To tackle this issue, a multi-strategy
optimization framework (MSOF) is proposed in this paper.
Starting from the objective of positive reframing, we first design positive
sentiment reward and content preservation reward to encourage the model to
transform the negative expressions of the original text while ensuring the
integrity and consistency of the semantics. Then, different decoding
optimization approaches are introduced to improve the quality of text
generation. Finally, based on the modeling formula of positive reframing, we
propose a multi-dimensional re-ranking method that further selects candidate
sentences from three dimensions: strategy consistency, text similarity and
fluency. Extensive experiments on two Seq2Seq PLMs, BART and T5, demonstrate
our framework achieves significant improvements on unconstrained and controlled
positive reframing tasks.
[LINK]
http://arxiv.org/abs/2407.17940v1
[DATE]
2024-07-25 18:58:42+08:00
[CATEGORIES]
cs.CL
The Power of Combining Data and Knowledge: GPT-4o is an Effective Interpreter of Machine Learning Models in Predicting Lymph Node Metastasis of Lung Cancer
[AUTHORS]
Danqing Hu, Bing Liu, Xiaofeng Zhu, Nan Wu
[ABSTRACT]
Lymph node metastasis (LNM) is a crucial factor in determining the initial
treatment for patients with lung cancer, yet accurate preoperative diagnosis of
LNM remains challenging. Recently, large language models (LLMs) have garnered
significant attention due to their remarkable text generation capabilities.
Leveraging the extensive medical knowledge learned from vast corpora, LLMs can
estimate probabilities for clinical problems, though their performance has
historically been inferior to data-driven machine learning models. In this
paper, we propose a novel ensemble method that combines the medical knowledge
acquired by LLMs with the latent patterns identified by machine learning models
to enhance LNM prediction performance. Initially, we developed machine learning
models using patient data. We then designed a prompt template to integrate the
patient data with the predicted probability from the machine learning model.
Subsequently, we instructed GPT-4o, the most advanced LLM developed by OpenAI,
to estimate the likelihood of LNM based on patient data and then adjust the
estimate using the machine learning output. Finally, we collected three outputs
from GPT-4o using the same prompt and ensembled these results as the final
prediction. Using the proposed method, our models achieved an AUC value of
0.765 and an AP value of 0.415 for LNM prediction, significantly improving
predictive performance compared to baseline machine learning models. The
experimental results indicate that GPT-4o can effectively leverage its medical
knowledge and the probabilities predicted by machine learning models to achieve
more accurate LNM predictions. These findings demonstrate that LLMs can perform
well in clinical risk prediction tasks, offering a new paradigm for integrating
medical knowledge and patient data in clinical predictions.
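The final ensembling step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the three GPT-4o outputs are combined, so the plain mean, the function name `ensemble_probability`, and the example values are all assumptions.

```python
from statistics import mean


def ensemble_probability(estimates):
    """Combine several probability estimates into one prediction.

    Assumption: a simple arithmetic mean; the combination rule actually
    used in the paper is not stated in the abstract.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    for p in estimates:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return mean(estimates)


# Hypothetical values standing in for three responses to the same prompt.
final_estimate = ensemble_probability([0.30, 0.40, 0.50])
```

Averaging repeated samples is one common way to reduce the run-to-run variance of a stochastic model; a median or a vote would be equally plausible choices here.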
[LINK]
http://arxiv.org/abs/2407.17900v1
[DATE]
2024-07-25 17:42:24+08:00
[CATEGORIES]
cs.CL
cs.LG
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
[AUTHORS]
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
[COMMENTS]
ICML2024
[LINK]
http://arxiv.org/abs/2402.02750v2
[DATE]
2024-07-25 17:16:05+08:00
[CATEGORIES]
cs.CL
cs.LG
A Large-Scale Sensitivity Analysis on Latent Embeddings and Dimensionality Reductions for Text Spatializations
[AUTHORS]
Daniel Atzberger, Tim Cech, Willy Scheibel, Jürgen Döllner, Michael Behrisch, Tobias Schreck
[ABSTRACT]
The semantic similarity between documents of a text corpus can be visualized
using map-like metaphors based on two-dimensional scatterplot layouts. These
layouts result from a dimensionality reduction on the document-term matrix or a
representation within a latent embedding, including topic models. The
resulting layout therefore depends on the input data and the hyperparameters of
the dimensionality reduction and is affected by changes in them. However, such
changes to the layout require additional cognitive effort from the user. In
this work, we
present a sensitivity study that analyzes the stability of these layouts
concerning (1) changes in the text corpora, (2) changes in the hyperparameters,
and (3) randomness in the initialization. Our approach has two stages: data
measurement and data analysis. First, we derived layouts for the combination of
three text corpora and six text embeddings and a grid-search-inspired
hyperparameter selection of the dimensionality reductions. Afterward, we
quantified the similarity of the layouts through ten metrics, concerning local
and global structures and class separation. Second, we analyzed the resulting
42817 tabular data points in a descriptive statistical analysis. From this, we
derived guidelines for informed decisions on the layout algorithm and highlighted
specific hyperparameter settings. We provide our implementation as a Git
repository at
https://github.com/hpicgs/Topic-Models-and-Dimensionality-Reduction-Sensitivity-Study
and results as a Zenodo archive at https://doi.org/10.5281/zenodo.12772898.
[COMMENTS]
To be published at IEEE VIS 2024 conference
[LINK]
http://arxiv.org/abs/2407.17876v1
[DATE]
2024-07-25 16:46:49+08:00
[CATEGORIES]
cs.CL
cs.LG
Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions
[AUTHORS]
Jiwon Suh, Injae Na, Woohwan Jung
[ABSTRACT]
End-to-end automatic speech recognition (E2E ASR) systems have significantly
improved speech recognition through training on extensive datasets. Despite
these advancements, they still struggle to accurately recognize domain-specific
words, such as proper nouns and technical terminology. To address this
problem, we propose a method to utilize the state-of-the-art Whisper without
modifying its architecture, preserving its generalization performance while
enabling it to leverage descriptions effectively. Moreover, we propose two
additional training techniques to improve domain-specific ASR: decoder
fine-tuning and context perturbation. We also propose a method that uses a
Large Language Model (LLM) to generate descriptions from simple metadata when
descriptions are unavailable. Our experiments demonstrate that the proposed
methods notably enhance domain-specific ASR accuracy on real-life datasets, with
LLM-generated descriptions outperforming human-crafted ones in effectiveness.
[COMMENTS]
Accepted to INTERSPEECH 2024
[LINK]
http://arxiv.org/abs/2407.17874v1
[DATE]
2024-07-25 16:44:04+08:00
[CATEGORIES]
cs.CL
Is the Digital Forensics and Incident Response Pipeline Ready for Text-Based Threats in LLM Era?
[AUTHORS]
Avanti Bhandarkar, Ronald Wilson, Anushka Swarup, Mengdi Zhu, Damon Woodard
[ABSTRACT]
In the era of generative AI, the widespread adoption of Neural Text
Generators (NTGs) presents new cybersecurity challenges, particularly within
the realms of Digital Forensics and Incident Response (DFIR). These challenges
primarily involve the detection and attribution of sources behind advanced
attacks like spearphishing and disinformation campaigns. As NTGs evolve, the
task of distinguishing between human and NTG-authored texts becomes critically
complex. This paper rigorously evaluates the DFIR pipeline tailored for
text-based security systems, specifically focusing on the challenges of
detecting and attributing authorship of NTG-authored texts. By introducing a
novel human-NTG co-authorship text attack, termed CS-ACT, our study uncovers
significant vulnerabilities in traditional DFIR methodologies, highlighting
discrepancies between ideal scenarios and real-world conditions. Utilizing 14
diverse datasets and 43 unique NTGs, up to the latest GPT-4, our research
identifies substantial vulnerabilities in the forensic profiling phase,
particularly in attributing authorship to NTGs. Our comprehensive evaluation
points to factors such as model sophistication and the lack of distinctive
style within NTGs as significant contributors to these vulnerabilities. Our
findings underscore the necessity for more sophisticated and adaptable
strategies, such as incorporating adversarial learning, stylizing NTGs, and
implementing hierarchical attribution through the mapping of NTG lineages to
enhance source attribution. This sets the stage for future research and the
development of more resilient text-based security systems.
[COMMENTS]
This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible
[LINK]
http://arxiv.org/abs/2407.17870v1
[DATE]
2024-07-25 16:42:53+08:00
[CATEGORIES]
cs.CL
Exploring Description-Augmented Dataless Intent Classification
[AUTHORS]
Ruoyu Hu, Foaad Khosmood, Abbas Edalat
[ABSTRACT]
In this work, we introduce several schemes to leverage description-augmented
embedding similarity for dataless intent classification using current
state-of-the-art (SOTA) text embedding models. We report results of our methods
on four commonly used intent classification datasets and compare against
previous works of a similar nature. Our work shows promising results for
dataless classification scaling to a large number of unseen intents. We show
competitive results and significant improvements (+6.12% Avg.) over strong
zero-shot baselines, all without training on labelled or task-specific data.
Furthermore, we provide qualitative error analysis of the shortfalls of this
methodology to help guide future research in this area.
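The general idea of description-augmented embedding similarity can be sketched as below. This is only an illustration under stated assumptions: the bag-of-words `embed` function is a toy stand-in for a real text embedding model, and the intent names and descriptions are hypothetical rather than taken from the paper or its datasets.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real text embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(utterance, intent_descriptions):
    """Pick the intent whose description is most similar to the utterance.

    No labelled training examples are used: only the descriptions.
    """
    query = embed(utterance)
    return max(
        intent_descriptions,
        key=lambda intent: cosine(query, embed(intent_descriptions[intent])),
    )


# Hypothetical intents and descriptions, for illustration only.
INTENTS = {
    "check_balance": "the user wants to know how much money is in their account",
    "book_flight": "the user wants to reserve a seat on a flight to a destination",
}

predicted = classify("how much money do I have in my account", INTENTS)
```

Swapping `embed` for a pretrained sentence encoder would leave the rest of the sketch unchanged, which is what makes this style of classification "dataless".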
[COMMENTS]
Accepted to the 6th NLP for Conversational AI Workshop at ACL
2024 (NLP4ConvAI)
[LINK]
http://arxiv.org/abs/2407.17862v1
[DATE]
2024-07-25 16:31:57+08:00
[CATEGORIES]
cs.CL
Shapley Value-based Contrastive Alignment for Multimodal Information Extraction
[AUTHORS]
Wen Luo, Yu Xia, Shen Tianshu, Sujian Li
[ABSTRACT]
The rise of social media and the exponential growth of multimodal
communication necessitate advanced techniques for Multimodal Information
Extraction (MIE). However, existing methodologies primarily rely on direct
Image-Text interactions, a paradigm that often faces significant challenges due
to semantic and modality gaps between images and text. In this paper, we
introduce a new paradigm of Image-Context-Text interaction, where large
multimodal models (LMMs) are utilized to generate descriptive textual context
to bridge these gaps. In line with this paradigm, we propose a novel Shapley
Value-based Contrastive Alignment (Shap-CA) method, which aligns both
context-text and context-image pairs. Shap-CA initially applies the Shapley
value concept from cooperative game theory to assess the individual
contribution of each element in the set of contexts, texts and images towards
total semantic and modality overlaps. Following this quantitative evaluation, a
contrastive learning strategy is employed to enhance the interactive
contribution within context-text/image pairs, while minimizing the influence
across these pairs. Furthermore, we design an adaptive fusion module for
selective cross-modal fusion. Extensive experiments across four MIE datasets
demonstrate that our method significantly outperforms existing state-of-the-art
methods.
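For readers unfamiliar with the Shapley value mentioned above, the following sketch computes exact Shapley values by averaging marginal contributions over all player orderings. It illustrates the game-theoretic concept only and is not the Shap-CA method: the three "players" and the `toy_value` characteristic function are hypothetical choices made for this example.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values via averaging marginal contributions.

    `value` maps a frozenset of players to a real-valued payoff.
    Cost grows factorially, so this is only practical for small sets.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payoff: overlap arises only when context joins a modality."""
    score = 0.0
    if "context" in coalition and "text" in coalition:
        score += 1.0
    if "context" in coalition and "image" in coalition:
        score += 0.5
    return score


contributions = shapley_values(["context", "text", "image"], toy_value)
```

In this toy game the values always sum to the payoff of the full set (the efficiency property), and "context" receives the largest share because it participates in both pairings.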
[COMMENTS]
Accepted at ACM Multimedia 2024
[LINK]
http://arxiv.org/abs/2407.17854v1
[DATE]
2024-07-25 16:15:43+08:00
[CATEGORIES]
cs.CL
Scaling A Simple Approach to Zero-Shot Speech Recognition
[AUTHORS]
Jinming Zhao, Vineel Pratap, Michael Auli
[COMMENTS]
9 pages
[LINK]
http://arxiv.org/abs/2407.17852v1
[DATE]
2024-07-25 16:08:55+08:00
[CATEGORIES]
cs.CL
Identifying Semantic Induction Heads to Understand In-Context Learning
[AUTHORS]
Jie Ren, Qipeng Guo, Hang Yan, Dongrui Liu, Quanshi Zhang, Xipeng Qiu, Dahua Lin
[ABSTRACT]
Although large language models (LLMs) have demonstrated remarkable
performance, the lack of transparency in their inference logic raises concerns
about their trustworthiness. To gain a better understanding of LLMs, we conduct
a detailed analysis of the operations of attention heads and aim to better
understand the in-context learning of LLMs. Specifically, we investigate
whether attention heads encode two types of relationships between tokens
present in natural languages: the syntactic dependency parsed from sentences
and the relation within knowledge graphs. We find that certain attention heads
exhibit a pattern where, when attending to head tokens, they recall tail tokens
and increase the output logits of those tail tokens. More crucially, the
formulation of such semantic induction heads has a close correlation with the
emergence of the in-context learning ability of language models. The study of
semantic attention heads advances our understanding of the intricate operations
of attention heads in transformers, and further provides new insights into the
in-context learning of LLMs.
[LINK]
http://arxiv.org/abs/2402.13055v2
[DATE]
2024-07-25 16:07:39+08:00
[CATEGORIES]
cs.CL
SAFETY-J: Evaluating Safety with Critique
[AUTHORS]
Yixiu Liu, Yuxiang Zheng, Shijie Xia, Yuan Guo, Jiajun Li, Yi Tu, Chaoling Song, Pengfei Liu
[ABSTRACT]
The deployment of Large Language Models (LLMs) in content generation raises
significant safety concerns, particularly regarding the transparency and
interpretability of content evaluations. Current methods, primarily focused on
binary safety classifications, lack mechanisms for detailed critique, limiting
their utility for model improvement and user trust. To address these
limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for
English and Chinese with critique-based judgment. SAFETY-J utilizes a robust
training dataset that includes diverse dialogues and augmented query-response
pairs to assess safety across various scenarios comprehensively. We establish
an automated meta-evaluation benchmark that objectively assesses the quality of
critiques with minimal human intervention, facilitating scalable and continuous
improvement. Additionally, SAFETY-J employs an iterative preference learning
technique to dynamically refine safety assessments based on meta-evaluations
and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced
and accurate safety evaluations, thereby enhancing both critique quality and
predictive reliability in complex content scenarios. To facilitate further
research and application, we open-source SAFETY-J’s training protocols,
datasets, and code at https://github.com/GAIR-NLP/Safety-J.
[LINK]
http://arxiv.org/abs/2407.17075v2
[DATE]
2024-07-25 15:50:46+08:00
[CATEGORIES]
cs.CL
Unified Lexical Representation for Interpretable Visual-Language Alignment
[AUTHORS]
Yifan Li, Yikai Wang, Yanwei Fu, Dongyu Ru, Zheng Zhang, Tong He
[ABSTRACT]
Visual-Language Alignment (VLA) has gained a lot of attention since CLIP’s
groundbreaking work. Although CLIP performs well, the typical direct latent
feature alignment lacks clarity in its representation and similarity scores. On
the other hand, a lexical representation, a vector whose elements represent the
similarity between the sample and each word in the vocabulary, is naturally
sparse and interpretable, providing exact matches for individual
words. However, lexical representations are difficult to learn because of the
lack of ground-truth supervision and false-discovery issues, and thus require
complex designs to train effectively. In this paper, we introduce LexVLA, a more
interpretable VLA framework by learning a unified lexical representation for
both modalities without complex design. We use DINOv2 as our visual model for
its local-inclined features and Llama 2, a generative language model, to
leverage its in-context lexical prediction ability. To avoid false
discovery, we propose an overuse penalty that discourages the lexical
representation from frequently and falsely activating meaningless words. We
demonstrate that these two pre-trained uni-modal models can be well aligned by
fine-tuning on a modest multi-modal dataset, avoiding intricate training
configurations. On cross-modal
retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset,
outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those
trained from scratch on even bigger datasets (e.g., 1.1B data, including
CC-12M). We conduct extensive experiments to analyze LexVLA.
[LINK]
http://arxiv.org/abs/2407.17827v1
[DATE]
2024-07-25 15:35:27+08:00
[CATEGORIES]
cs.CL
cs.LG
Demystifying Verbatim Memorization in Large Language Models
[AUTHORS]
Jing Huang, Diyi Yang, Christopher Potts
[ABSTRACT]
Large Language Models (LLMs) frequently memorize long sequences verbatim,
often with serious legal and privacy implications. Much prior work has studied
such verbatim memorization using observational data. To complement such work,
we develop a framework to study verbatim memorization in a controlled setting
by continuing pre-training from Pythia checkpoints with injected sequences. We
find that (1) non-trivial amounts of repetition are necessary for verbatim
memorization to happen; (2) later (and presumably better) checkpoints are more
likely to verbatim memorize sequences, even for out-of-distribution sequences;
(3) the generation of memorized sequences is triggered by distributed model
states that encode high-level features and makes important use of general
language modeling capabilities. Guided by these insights, we develop stress
tests to evaluate unlearning methods and find they often fail to remove the
verbatim memorized information, while also degrading the LM. Overall, these
findings challenge the hypothesis that verbatim memorization stems from
specific model weights or mechanisms. Rather, verbatim memorization is
intertwined with the LM’s general capabilities and thus will be very difficult
to isolate and suppress without degrading model quality.
[LINK]
http://arxiv.org/abs/2407.17817v1
[DATE]
2024-07-25 15:10:31+08:00
[CATEGORIES]
cs.CL
cs.LG
Brand Network Booster: A new system for improving brand connectivity
[AUTHORS]
J. Cancellieri, W. Didimo, A. Fronzetti Colladon, F. Montecchiani, R. Vestrelli
[ABSTRACT]
This paper presents a new decision support system offered for an in-depth
analysis of semantic networks, which can provide insights for a better
exploration of a brand’s image and the improvement of its connectivity. In
terms of network analysis, we show that this goal is achieved by solving an
extended version of the Maximum Betweenness Improvement problem, which includes
the possibility of considering adversarial nodes, constrained budgets, and
weighted networks - where connectivity improvement can be obtained by adding
links or increasing the weight of existing connections. Our contribution
includes a new algorithmic framework and the integration of this framework into
a software system called Brand Network Booster (BNB), which supports brand
connectivity evaluation and improvement. We present this new system together
with three case studies, and we also discuss its performance. Our tool and
approach are valuable both to network scholars and for facilitating the
strategic decision-making of marketing and communication managers across
various sectors, public or private.
[LINK]
http://arxiv.org/abs/2309.16228v2
[DATE]
2024-07-25 15:05:30+08:00
[CATEGORIES]
cs.CL
Automatic Textual Normalization for Hate Speech Detection
[AUTHORS]
Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Nguyet Thi Nguyen, Khanh Thanh-Duy Ho, Kiet Van Nguyen
[ABSTRACT]
Social media data is a valuable resource for research, yet it contains a wide
range of non-standard words (NSW). These irregularities hinder the effective
operation of NLP tools. Current state-of-the-art methods for the Vietnamese
language address this issue as a problem of lexical normalization, involving
the creation of manual rules or the implementation of multi-staged deep
learning frameworks, which necessitate extensive efforts to craft intricate
rules. In contrast, our approach is straightforward, employing solely a
sequence-to-sequence (Seq2Seq) model. In this research, we provide a dataset
for textual normalization, comprising 2,181 human-annotated comments with an
inter-annotator agreement of 0.9014. By leveraging the Seq2Seq model for
textual normalization, our results reveal that the accuracy achieved falls
slightly short of 70%. Nevertheless, textual normalization enhances the
accuracy of the Hate Speech Detection (HSD) task by approximately 2%,
demonstrating its potential to improve the performance of complex NLP tasks.
Our dataset is accessible for research purposes.
[COMMENTS]
2023 International Conference on Intelligent Systems Design and
Applications (ISDA2023)
[LINK]
http://arxiv.org/abs/2311.06851v4
[DATE]
2024-07-25 14:41:43+08:00
[CATEGORIES]
cs.CL
KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models
[AUTHORS]
Eunice Yiu, Maan Qraitem, Charlie Wong, Anisa Noor Majhi, Yutong Bai, Shiry Ginosar, Alison Gopnik, Kate Saenko
[ABSTRACT]
This paper investigates visual analogical reasoning in large multimodal
models (LMMs) compared to human adults and children. A “visual analogy” is an
abstract rule inferred from one image and applied to another. While benchmarks
exist for testing visual reasoning in LMMs, they require advanced skills and
omit basic visual analogies that even young children can make. Inspired by
developmental psychology, we propose a new benchmark of 1,400 visual
transformations of everyday objects to test LMMs on visual analogical reasoning
and compare them to children and adults. We structure the evaluation into three
stages: identifying what changed (e.g., color, number, etc.), how it changed
(e.g., added one object), and applying the rule to new scenarios. Our findings
show that while models like GPT-4V, LLaVA-1.5, and MANTIS identify the “what”
effectively, they struggle with quantifying the “how” and extrapolating this
rule to new objects. In contrast, children and adults exhibit much stronger
analogical reasoning at all three stages. Additionally, the strongest tested
model, GPT-4V, performs better in tasks involving simple visual attributes like
color and size, correlating with quicker human adult response times.
Conversely, more complex tasks such as number, rotation, and reflection, which
necessitate extensive cognitive processing and understanding of the 3D physical
world, present more significant challenges. Altogether, these findings
highlight the limitations of training models on data that primarily consists of
2D images and text.
[COMMENTS]
9 pages. For the KiVA benchmark, see https://github.com/ey242/KiVA
[LINK]
http://arxiv.org/abs/2407.17773v1
[DATE]
2024-07-25 13:02:39+08:00
[CATEGORIES]
cs.CL
cs.LG
ERIT Lightweight Multimodal Dataset for Elderly Emotion Recognition and Multimodal Fusion Evaluation
[AUTHORS]
Rita Frieske, Bertrand E. Shi
[ABSTRACT]
ERIT is a novel multimodal dataset designed to facilitate research in
lightweight multimodal fusion. It contains text and image data collected from
videos of elderly individuals reacting to various situations, as well as seven
emotion labels for each data sample. Because it uses labeled images of
elderly users reacting emotionally, it also facilitates research on emotion
recognition in an age group that is underrepresented in machine learning
research on visual emotion recognition. The dataset is validated through
comprehensive experiments indicating its importance in neural multimodal fusion
research.
[LINK]
http://arxiv.org/abs/2407.17772v1
[DATE]
2024-07-25 13:02:27+08:00
[CATEGORIES]
cs.CL
Banyan: Improved Representation Learning with Explicit Structure
[AUTHORS]
Mattia Opper, N. Siddharth
[ABSTRACT]
We present Banyan, an improved model to learn semantic representations by
inducing explicit structure over data. In contrast to prior approaches using
structure spanning single sentences, Banyan learns by resolving multiple
constituent structures into a shared one explicitly incorporating global
context. Combined with an improved message-passing scheme inspired by Griffin,
Banyan learns significantly better representations, avoids spurious false
negatives with contrastive learning, and drastically improves memory efficiency
in such explicit-structured models. Using the Self-StrAE framework, we show
that Banyan (a) outperforms baselines using sentential structure across various
settings, (b) matches or outperforms unstructured baselines like GloVe
(+augmentations) and a RoBERTa medium (+simcse) pre-trained on 100M tokens,
despite having just a handful of (non-embedding) parameters, and (c) also
learns effective representations across several low resource (Asian and
African) languages as measured on SemRel tasks.
[COMMENTS]
First Draft
[LINK]
http://arxiv.org/abs/2407.17771v1
[DATE]
2024-07-25 12:58:08+08:00
[CATEGORIES]
cs.CL
BotEval: Facilitating Interactive Human Evaluation
[AUTHORS]
Hyundong Cho, Thamme Gowda, Yuyang Huang, Zixun Lu, Tianli Tong, Jonathan May
[COMMENTS]
ACL 2024 SDT, 10 pages
[LINK]
http://arxiv.org/abs/2407.17770v1
[DATE]
2024-07-25 12:57:31+08:00
[CATEGORIES]
cs.CL
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
[AUTHORS]
Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa
[COMMENTS]
ACL 2024
[LINK]
http://arxiv.org/abs/2308.16884v2
[DATE]
2024-07-25 12:30:15+08:00
[CATEGORIES]
cs.CL
cs.LG
Beyond Entity Alignment: Towards Complete Knowledge Graph Alignment via Entity-Relation Synergy
[AUTHORS]
Xiaohan Fang, Chaozhuo Li, Yi Zhao, Qian Zang, Litian Zhang, Jiquan Peng, Xi Zhang, Jibing Gong
[ABSTRACT]
Knowledge Graph Alignment (KGA) aims to integrate knowledge from multiple
sources to address the limitations of individual Knowledge Graphs (KGs) in
terms of coverage and depth. However, current KGA models fall short in
achieving a “complete” knowledge graph alignment. Existing models primarily
emphasize the linkage of cross-graph entities but overlook aligning relations
across KGs, thereby providing only a partial solution to KGA. The semantic
correlations embedded in relations are largely overlooked, potentially
restricting a comprehensive understanding of cross-KG signals. In this paper,
we propose to conceptualize relation alignment as an independent task and
conduct KGA by decomposing it into two distinct but highly correlated
sub-tasks: entity alignment and relation alignment. To capture the mutually
reinforcing correlations between these objectives, we propose a novel
Expectation-Maximization-based model, EREM, which iteratively optimizes both
sub-tasks. Experimental results on real-world datasets demonstrate that EREM
consistently outperforms state-of-the-art models in both entity alignment and
relation alignment tasks.
[LINK]
http://arxiv.org/abs/2407.17745v1
[DATE]
2024-07-25 11:40:09+08:00
[CATEGORIES]
cs.CL
CCoE: A Compact LLM with Collaboration of Experts
[AUTHORS]
Shaomang Huang, Jianfeng Pan, Hanzhong Zheng
[ABSTRACT]
Large Language Models (LLMs) demonstrate significant capabilities in natural
language understanding and generation. With the growing need to apply LLMs in
various domains, an open research question is how to efficiently train and
build a model that has expertise in different domains at a low training cost.
We propose the CCoE architecture, a framework that easily couples multiple
strong domain experts and fuses them into one large LLM, providing a collective
way of utilizing the different domain expert LLMs. Training a large
collaboration of multiple expert LLMs places high demands on training
resources; CCoE bypasses this problem by isolating the experts from one another
and training each expert separately. CCoE assembles multiple expert LLMs
through the CoE (Collaboration of Experts) layer. Each CoE layer can have one
or more expert LLMs. Expert LLMs have different numbers of layers and have been
well trained for different domain tasks. Each expert is fine-tuned to achieve
results comparable to SOTA domain LLMs. We start from 5 experts in the domains
of Code, Math, Law, text-to-SQL and Medical. The results indicate that our CCoE
framework can easily and efficiently boost performance by nearly 10%-20% over
the original base model in different domains while using fewer resources for
training as well as inference.
[LINK]
http://arxiv.org/abs/2407.11686v3
[DATE]
2024-07-25 11:34:56+08:00
[CATEGORIES]
cs.CL
Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models
[AUTHORS]
Hye Sun Yun, David Pogrebitskiy, Iain J. Marshall, Byron C. Wallace
[ABSTRACT]
Meta-analyses statistically aggregate the findings of different randomized
controlled trials (RCTs) to assess treatment effectiveness. Because this yields
robust estimates of treatment effectiveness, results from meta-analyses are
considered the strongest form of evidence. However, rigorous evidence syntheses
are time-consuming and labor-intensive, requiring manual extraction of data
from individual trials to be synthesized. Ideally, language technologies would
permit fully automatic meta-analysis, on demand. This requires accurately
extracting numerical results from individual trials, which has been beyond the
capabilities of natural language processing (NLP) models to date. In this work,
we evaluate whether modern large language models (LLMs) can reliably perform
this task. We annotate (and release) a modest but granular evaluation dataset
of clinical trial reports with numerical findings attached to interventions,
comparators, and outcomes. Using this dataset, we evaluate the performance of
seven LLMs applied zero-shot for the task of conditionally extracting numerical
findings from trial reports. We find that massive LLMs that can accommodate
lengthy inputs are tantalizingly close to realizing fully automatic
meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality).
However, LLMs – including ones trained on biomedical texts – perform poorly
when the outcome measures are complex and tallying the results requires
inference. This work charts a path toward fully automatic meta-analysis of RCTs
via LLMs, while also highlighting the limitations of existing models for this
aim.
[COMMENTS]
25 pages, 7 figures, 6 tables, MLHC 2024
[LINK]
http://arxiv.org/abs/2405.01686v2
[DATE]
2024-07-25 11:29:09+08:00
[CATEGORIES]
cs.CL
Towards the Law of Capacity Gap in Distilling Language Models
[AUTHORS]
Chen Zhang, Dawei Song, Zheyu Ye, Yan Gao
[ABSTRACT]
Language model (LM) distillation is a trending area that aims to distil the
knowledge residing in a large teacher LM to a small student one. While various
methods have been proposed to maximize the effectiveness of the distillation,
significant challenges persist, particularly when there is a substantial
capacity gap between the teacher and student LMs. This issue, often referred to
as the curse of capacity gap, suggests that a larger teacher does not
necessarily result in a superior student compared to one distilled from a
smaller teacher. In other words, there is likely an optimal teacher yielding
the best student along the scaling course of the teacher. However, the curse of
capacity gap cannot be tackled without notable compute overhead, as indicated
in previous studies. In the context of large LMs (LLMs), previously viable
approaches become much less meaningful, as it is an impossible triangle to
distil an expected student from an optimal teacher with small compute
overhead. Fortunately, the impossible triangle becomes possible given an
induced law of capacity gap. In this paper, in the spirit of scaling laws, we
reveal that the optimal teacher scale almost
consistently follows a linear scaling with the student scale across different
model architectures and data scales. The law later guides us to distil a 3B
student LM (termed MiniMA) from LLaMA2-7B. MiniMA is
demonstrated to outperform a wide range of 3B competitors and could even
compete with several 7B models.
[COMMENTS]
32 pages, 10 figures, 15 tables, work in progress. Code and
checkpoints are available at https://github.com/GeneZC/MiniMA
[LINK]
http://arxiv.org/abs/2311.07052v3
[DATE]
2024-07-25 11:20:15+08:00
[CATEGORIES]
cs.CL
cs.LG
Cost-effective Instruction Learning for Pathology Vision and Language Analysis
[AUTHORS]
Kaitao Chen, Mianxin Liu, Fang Yan, Lei Ma, Xiaoming Shi, Lilong Wang, Xiaosong Wang, Lifeng Zhu, Zhe Wang, Mu Zhou, Shaoting Zhang
[ABSTRACT]
The advent of vision-language models fosters interactive conversations
between AI-enabled models and humans. Yet applying these models in clinics
means dealing with daunting challenges around large-scale training data and
financial and computational resources. Here we propose a cost-effective
instruction learning framework for conversational pathology named CLOVER. CLOVER only
trains a lightweight module and uses instruction tuning while freezing the
parameters of the large language model. Instead of using costly GPT-4, we
propose well-designed prompts on GPT-3.5 for building generation-based
instructions, emphasizing the utility of pathological knowledge derived from
the Internet source. To augment the use of instructions, we construct a
high-quality set of template-based instructions in the context of digital
pathology. On two benchmark datasets, our findings reveal the strength of
hybrid-form instructions for visual question answering in pathology. Extensive
results show the cost-effectiveness of CLOVER in answering both open-ended and
closed-ended questions, where CLOVER outperforms strong baselines that possess
37 times more training parameters and use instruction data generated from
GPT-4. Through instruction tuning, CLOVER exhibits robust few-shot
learning on an external clinical dataset. These findings demonstrate that
the cost-effective modeling of CLOVER could accelerate the adoption of rapid
conversational applications in the landscape of digital pathology.
[LINK]
http://arxiv.org/abs/2407.17734v1
[DATE]
2024-07-25 11:12:57+08:00
[CATEGORIES]
cs.CL
Adapting Large Language Models to Domains via Reading Comprehension
[AUTHORS]
Daixuan Cheng, Shaohan Huang, Furu Wei
[COMMENTS]
ICLR 2024 Conference
[LINK]
http://arxiv.org/abs/2309.09530v4
[DATE]
2024-07-25 11:08:18+08:00
[CATEGORIES]
cs.CL
Are Large Language Models Possible to Conduct Cognitive Behavioral Therapy?
[AUTHORS]
Hao Shen, Zihan Li, Minqiang Yang, Minghui Ni, Yongfeng Tao, Zhengyang Yu, Weihao Zheng, Chen Xu, Bin Hu
[ABSTRACT]
In contemporary society, the issue of psychological health has become
increasingly prominent, characterized by the diversification, complexity, and
universality of mental disorders. Cognitive Behavioral Therapy (CBT), currently
the most influential and clinically effective psychological treatment method
with no side effects, has limited coverage and poor quality in most countries.
In recent years, research on the recognition and intervention of emotional
disorders using large language models (LLMs) has been validated, providing new
possibilities for psychological assistance therapy. However, are LLMs truly
possible to conduct cognitive behavioral therapy? Many concerns have been
raised by mental health experts regarding the use of LLMs for therapy. Seeking
to answer this question, we collected a real CBT corpus from online video
websites, designed and conducted a targeted automatic evaluation framework
involving the evaluation of emotion tendency of generated text, structured
dialogue pattern and proactive inquiry ability. For emotion tendency, we
calculate the emotion tendency score of the CBT dialogue text generated by each
model. For structured dialogue pattern, we use a diverse range of automatic
evaluation metrics to compare speaking style, the ability to maintain
consistency of topic and the use of technology in CBT between different models.
As for inquiring to guide the patient, we utilize the PQA (Proactive Questioning
Ability) metric. We also evaluated the CBT ability of the LLM after integrating
a CBT knowledge base to explore the help of introducing additional knowledge to
enhance the model’s CBT counseling ability. Four LLM variants with excellent
performance on natural language processing are evaluated, and the experimental
result shows the great potential of LLMs in the psychological counseling realm,
especially after combining with other technological means.
[LINK]
http://arxiv.org/abs/2407.17730v1
[DATE]
2024-07-25 11:01:47+08:00
[CATEGORIES]
cs.CL
Chain-of-Layer: Iteratively Prompting Large Language Models for Taxonomy Induction from Limited Examples
[AUTHORS]
Qingkai Zeng, Yuyang Bai, Zhaoxuan Tan, Shangbin Feng, Zhenwen Liang, Zhihan Zhang, Meng Jiang
[ABSTRACT]
Automatic taxonomy induction is crucial for web search, recommendation
systems, and question answering. Manual curation of taxonomies is expensive in
terms of human effort, making automatic taxonomy construction highly desirable.
In this work, we introduce Chain-of-Layer which is an in-context learning
framework designed to induce taxonomies from a given set of entities.
Chain-of-Layer breaks down the task into selecting relevant candidate entities
in each layer and gradually building the taxonomy from top to bottom. To
minimize errors, we introduce the Ensemble-based Ranking Filter to reduce the
hallucinated content generated at each iteration. Through extensive
experiments, we demonstrate that Chain-of-Layer achieves state-of-the-art
performance on four real-world benchmarks.
[LINK]
http://arxiv.org/abs/2402.07386v2
[DATE]
2024-07-25 10:46:50+08:00
[CATEGORIES]
cs.CL
Describe Where You Are: Improving Noise-Robustness for Speech Emotion Recognition with Text Description of the Environment
[AUTHORS]
Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso
[ABSTRACT]
Speech emotion recognition (SER) systems often struggle in real-world
environments, where ambient noise severely degrades their performance. This
paper explores a novel approach that exploits prior knowledge of testing
environments to maximize SER performance under noisy conditions. To address
this task, we propose a text-guided, environment-aware training where an SER
model is trained with contaminated speech samples and their paired noise
description. We use a pre-trained text encoder to extract the text-based
environment embedding and then fuse it to a transformer-based SER model during
training and inference. We demonstrate the effectiveness of our approach
through our experiment with the MSP-Podcast corpus and real-world additive
noise samples collected from the Freesound repository. Our experiment indicates
that the text-based environment descriptions processed by a large language
model (LLM) produce representations that improve the noise-robustness of the
SER system. In addition, our proposed approach with an LLM yields better
performance than our environment-agnostic baselines, especially in low
signal-to-noise ratio (SNR) conditions. When testing at -5dB SNR level, our
proposed method shows better performance than our best baseline model by 31.8%
(arousal), 23.5% (dominance), and 9.5% (valence).
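The SNR conditions mentioned above can be made concrete with a small sketch. It assumes the usual definition SNR_dB = 10 log10(P_speech / P_noise) and scales the noise before adding it to the speech. The helper name `mix_at_snr` and the toy sine signals are hypothetical illustrations, not taken from the paper.

```python
import math

def power(x):
    """Mean squared amplitude of a signal given as a list of floats."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain

# Toy signals standing in for a speech clip and an ambient-noise clip.
speech = [math.sin(0.1 * t) for t in range(1000)]
noise = [math.sin(1.7 * t + 0.3) for t in range(1000)]

mixed, gain = mix_at_snr(speech, noise, snr_db=-5.0)

# Check the SNR that was actually achieved after scaling the noise.
achieved = 10 * math.log10(power(speech) / power([gain * n for n in noise]))
```

At a negative SNR such as -5 dB the scaled noise carries more power than the speech, which is why this is the hardest setting reported.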
[LINK]
http://arxiv.org/abs/2407.17716v1
[DATE]
2024-07-25 10:30:40+08:00
[CATEGORIES]
cs.CL
cs.LG
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
[AUTHORS]
Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang
[ABSTRACT]
The rapid evolution of artificial intelligence (AI) through developments in
Large Language Models (LLMs) and Vision-Language Models (VLMs) has brought
significant advancements across various technological domains. While these
models enhance capabilities in natural language processing and visual
interactive tasks, their growing adoption raises critical concerns regarding
security and ethical alignment. This survey provides an extensive review of the
emerging field of jailbreaking–deliberately circumventing the ethical and
operational boundaries of LLMs and VLMs–and the consequent development of
defense mechanisms. Our study categorizes jailbreaks into seven distinct types
and elaborates on defense strategies that address these vulnerabilities.
Through this comprehensive examination, we identify research gaps and propose
directions for future studies to enhance the security frameworks of LLMs and
VLMs. Our findings underscore the necessity for a unified perspective that
integrates both jailbreak strategies and defensive solutions to foster a
robust, secure, and reliable environment for the next generation of language
models. More details can be found on our website:
https://chonghan-chen.com/llm-jailbreak-zoo-survey/.
[COMMENTS]
45 pages
[LINK]
http://arxiv.org/abs/2407.01599v2
[DATE]
2024-07-25 10:25:11+08:00
[CATEGORIES]
cs.CL
cs.LG
Exploring Semantic Perturbations on Grover
[AUTHORS]
Ziqing Ji, Pranav Kulkarni, Marko Neskovic, Kevin Nolan, Yan Xu
[ABSTRACT]
With news and information being as easy to access as they currently are, it
is more important than ever to ensure that people are not misled by what they
read. Recently, the rise of neural fake news (AI-generated fake news) and its
demonstrated effectiveness at fooling humans has prompted the development of
models to detect it. One such model is the Grover model, which can both detect
neural fake news to prevent it, and generate it to demonstrate how a model
could be misused to fool human readers. In this work we explore the Grover
model’s fake news detection capabilities by performing targeted attacks through
perturbations on input news articles. Through this we test Grover’s resilience
to these adversarial attacks and expose some potential vulnerabilities which
should be addressed in further iterations to ensure it can detect all types of
fake news accurately.
[LINK]
http://arxiv.org/abs/2302.00509v2
[DATE]
2024-07-25 09:09:57+08:00
[CATEGORIES]
cs.LG
cs.CL
Transformers on Markov Data: Constant Depth Suffices
[AUTHORS]
Nived Rajaraman, Marco Bondaschi, Kannan Ramchandran, Michael Gastpar, Ashok Vardhan Makkuva
[ABSTRACT]
Attention-based transformers have been remarkably successful at modeling
generative processes across various domains and modalities. In this paper, we
study the behavior of transformers on data drawn from $k$-th order Markov processes,
where the conditional distribution of the next symbol in a sequence depends on
the previous $k$ symbols observed. We observe a surprising phenomenon
empirically which contradicts previous findings: when trained for sufficiently
long, a transformer with a fixed depth and $1$ head per layer is able to
achieve low test loss on sequences drawn from $k$-th order Markov sources, even as $k$
grows. Furthermore, this low test loss is achieved by the transformer’s ability
to represent and learn the in-context conditional empirical distribution. On
the theoretical side, our main result is that a transformer with a single head
and three layers can represent the in-context conditional empirical
distribution for $k$-th order Markov sources, concurring with our empirical
observations. Along the way, we prove that attention-only transformers
with $O(\log_2(k))$ layers can represent the in-context conditional empirical
distribution by composing induction heads to track the previous $k$ symbols in
the sequence. These results provide more insight into our current understanding
of the mechanisms by which transformers learn to capture context, by
understanding their behavior on Markov sources.
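As an illustration of the quantity the abstract refers to, the in-context conditional empirical distribution can be computed by counting which symbols followed earlier occurrences of the current length-$k$ context within the same sequence. This is only a sketch of the statistic itself, not of the transformer construction; the function name `in_context_conditional` is hypothetical.

```python
from collections import Counter

def in_context_conditional(seq, k):
    """Empirical distribution of the next symbol given the last k symbols of seq,
    estimated only from earlier occurrences of that same k-gram inside seq."""
    context = tuple(seq[-k:])
    counts = Counter()
    # Stop before the final k-gram: it is the context itself and has no successor yet.
    for i in range(len(seq) - k):
        if tuple(seq[i:i + k]) == context:
            counts[seq[i + k]] += 1
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()} if total else {}

seq = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
dist = in_context_conditional(seq, k=2)   # context is (0, 1)
unseen = in_context_conditional([0, 0, 0, 1], k=2)   # context never seen before
```

Here the context (0, 1) occurs three times earlier in the sequence, followed by 0 twice and by 1 once, so the estimate puts 2/3 on 0 and 1/3 on 1.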
[COMMENTS]
29 pages, 10 figures
[LINK]
http://arxiv.org/abs/2407.17686v1
[DATE]
2024-07-25 09:07:09+08:00
[CATEGORIES]
cs.LG
cs.CL
Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads
[AUTHORS]
Xihui Lin, Yunan Zhang, Suyu Ge, Barun Patra, Vishrav Chaudhary, Xia Song
[ABSTRACT]
Existing LLM training and inference frameworks struggle to boost
efficiency with sparsity while maintaining the integrity of context and model
architecture. Inspired by the sharding concept in databases and the fact that
attention parallelizes over heads on accelerators, we propose Sparsely-Sharded
(S2) Attention, an attention algorithm that allocates heterogeneous context
partitions for different attention heads to divide and conquer. S2-Attention
enforces each attention head to only attend to a partition of contexts
following a strided sparsity pattern, while the full context is preserved as
the union of all the shards. As attention heads are processed in separate
thread blocks, the context reduction for each head can thus produce end-to-end
speed-up and memory reduction. At inference, LLMs trained with S2-Attention can
then take the KV cache reduction as a free lunch with guaranteed preservation of
model quality. In experiments, we show S2-Attention can provide as much as (1) 25.3X
wall-clock attention speed-up over FlashAttention-2, resulting in 6X reduction
in end-to-end training time and 10X inference latency, (2) on-par model
training quality compared to default attention, (3) perfect needle retrieval
accuracy over 32K context window. On top of the algorithm, we build DKernel, an
LLM training and inference kernel library that allows users to customize
sparsity patterns for their own models. We open-source DKernel and make it
compatible with Megatron, PyTorch, and vLLM.
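The idea of giving each head its own strided shard of the context, with the shards jointly covering every position, can be sketched as follows. The exact sparsity pattern used by S2-Attention is not specified in the abstract, so the modulo-based assignment below is an assumption chosen for simplicity, and the function names are hypothetical.

```python
def strided_shard(head, num_heads, context_len):
    """Key positions assigned to one head: every num_heads-th position, offset by head.
    (Assumed pattern for illustration; the paper's actual pattern may differ.)"""
    return [j for j in range(context_len) if j % num_heads == head]

def causal_keys(head, num_heads, query_pos):
    """Keys a head may attend to for a query at query_pos (keys at or before it)."""
    return strided_shard(head, num_heads, query_pos + 1)

num_heads, context_len = 4, 16
shards = [strided_shard(h, num_heads, context_len) for h in range(num_heads)]

# The union of all shards should recover the full context.
union = sorted(j for shard in shards for j in shard)
```

Under this assumed pattern each head stores only `context_len / num_heads` keys, which is where a per-head KV cache reduction would come from.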
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2407.17678v1
[DATE]
2024-07-25 08:27:07+08:00
[CATEGORIES]
cs.CL
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
[AUTHORS]
Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith
[ABSTRACT]
The pretraining data of today’s strongest language models is opaque; in
particular, little is known about the proportions of various domains or
languages represented. In this work, we tackle a task which we call data
mixture inference, which aims to uncover the distributional make-up of training
data. We introduce a novel attack based on a previously overlooked source of
information – byte-pair encoding (BPE) tokenizers, used by the vast majority
of modern language models. Our key insight is that the ordered list of merge
rules learned by a BPE tokenizer naturally reveals information about the token
frequencies in its training data: the first merge is the most common byte pair,
the second is the most common pair after merging the first token, and so on.
Given a tokenizer’s merge list along with data samples for each category of
interest, we formulate a linear program that solves for the proportion of each
category in the tokenizer’s training set. Importantly, to the extent to which
tokenizer training data is representative of the pretraining data, we
indirectly learn about pretraining data. In controlled experiments, we show
that our attack recovers mixture ratios with high precision for tokenizers
trained on known mixtures of natural languages, programming languages, and data
sources. We then apply our approach to off-the-shelf tokenizers released with
recent LMs. We confirm much publicly disclosed information about these models,
and also make several new inferences: GPT-4o’s tokenizer is much more
multilingual than its predecessors, training on 39% non-English data; Llama3
extends GPT-3.5’s tokenizer primarily for multilingual (48%) use; GPT-3.5’s and
Claude’s tokenizers are trained on predominantly code (~60%). We hope our work
sheds light on current design practices for pretraining data, and inspires
continued research into data mixture inference for LMs.
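The key insight, that the order of BPE merges reflects pair frequencies in the tokenizer's training data, can be seen in a toy trainer. This is a minimal sketch, not the authors' linear-program attack: the function `learn_merges`, the tiny word lists, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Learn BPE merges from a list of words; return the ordered merge list."""
    corpus = Counter(tuple(w) for w in words)  # word as tuple of symbols -> count
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself to stay deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges

english = ["the", "then", "there"] * 5   # stand-in for natural-language data
code = ["def", "defer"] * 5              # stand-in for source-code data

mostly_english = learn_merges(english * 3 + code, 1)
mostly_code = learn_merges(english + code * 3, 1)
```

Changing only the mixture proportions changes the first merge, from a pair typical of the English-like words to one typical of the code-like words. The paper's attack inverts this relationship across the whole merge list.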
[COMMENTS]
19 pages, 5 figures
[LINK]
http://arxiv.org/abs/2407.16607v2
[DATE]
2024-07-25 07:34:21+08:00
[CATEGORIES]
cs.CL
cs.LG
Understanding and Mitigating the Threat of Vec2Text to Dense Retrieval Systems
[AUTHORS]
Shengyao Zhuang, Bevan Koopman, Xiaoran Chu, Guido Zuccon
[ABSTRACT]
The emergence of Vec2Text – a method for text embedding inversion – has
raised serious privacy concerns for dense retrieval systems which use text
embeddings, such as those offered by OpenAI and Cohere. This threat comes from
the ability for a malicious attacker with access to embeddings to reconstruct
the original text. In this paper, we investigate various factors related to
embedding models that may impact text recoverability via Vec2Text. We explore
factors such as distance metrics, pooling functions, bottleneck pre-training,
training with noise addition, embedding quantization, and embedding dimensions,
which were not considered in the original Vec2Text paper. Through a
comprehensive analysis of these factors, our objective is to gain a deeper
understanding of the key elements that affect the trade-offs between the text
recoverability and retrieval effectiveness of dense retrieval systems, offering
insights for practitioners designing privacy-aware dense retrieval systems. We
also propose a simple embedding transformation fix that guarantees equal
ranking effectiveness while mitigating the recoverability risk. Overall, this
study reveals that Vec2Text could pose a threat to current dense retrieval
systems, but there are some effective methods to patch such systems.
[LINK]
http://arxiv.org/abs/2402.12784v2
[DATE]
2024-07-25 07:00:50+08:00
[CATEGORIES]
cs.CL
Can GPT-4 learn to analyze moves in research article abstracts?
[AUTHORS]
Danni Yu, Marina Bondi, Ken Hyland
[ABSTRACT]
One of the most powerful and enduring ideas in written discourse analysis is
that genres can be described in terms of the moves which structure a writer’s
purpose. Considerable research has sought to identify these distinct
communicative acts, but analyses have been beset by problems of subjectivity,
reliability and the time-consuming need for multiple coders to confirm
analyses. In this paper we employ the affordances of GPT-4 to automate the
annotation process by using natural language prompts. Focusing on abstracts
from articles in four applied linguistics journals, we devise prompts which
enable the model to identify moves effectively. The annotated outputs of these
prompts were evaluated by two assessors with a third addressing disagreements.
The results show that an 8-shot prompt was more effective than one using two,
confirming that the inclusion of examples illustrating areas of variability can
enhance GPT-4’s ability to recognize multiple moves in a single sentence and
reduce bias related to textual position. We suggest that GPT-4 offers
considerable potential in automating this annotation process, when human actors
with domain specific linguistic expertise inform the prompting process.
[LINK]
http://arxiv.org/abs/2407.15612v2
[DATE]
2024-07-25 05:10:24+08:00
[CATEGORIES]
cs.CL
IgnitionInnovators at “Discharge Me!”: Chain-of-Thought Instruction Finetuning Large Language Models for Discharge Summaries
[AUTHORS]
An Quang Tang, Xiuzhen Zhang, Minh Ngoc Dinh
[ABSTRACT]
This paper presents our proposed approach to the Discharge Me! shared task,
collocated with the 23rd Workshop on Biomedical Natural Language Processing
(BioNLP). In this work, we develop an LLM-based framework for solving the
Discharge Summary Documentation (DSD) task, i.e., generating the two critical
target sections “Brief Hospital Course” and “Discharge Instructions” in the
discharge summary. By streamlining the recent instruction-finetuning process on
LLMs, we explore several prompting strategies for optimally adapting LLMs to
the specific generation task of DSD. Experimental results show that providing a
clear output structure, complemented by a set of comprehensive
Chain-of-Thought (CoT) questions, effectively improves the model’s reasoning
capability, thereby enhancing the structural correctness and faithfulness
of clinical information in the generated text. Source code is available at:
https://github.com/antangrocket1312/Discharge_LLM
[COMMENTS]
Accepted by BioNLP2024 Workshop
[LINK]
http://arxiv.org/abs/2407.17636v1
[DATE]
2024-07-25 05:02:53+08:00
[CATEGORIES]
cs.CL
Cascaded Cross-Modal Transformer for Audio-Textual Classification
[AUTHORS]
Nicolae-Catalin Ristea, Andrei Anghel, Radu Tudor Ionescu
[ABSTRACT]
Speech classification tasks often require powerful language understanding
models to grasp useful features, which becomes problematic when limited
training data is available. To attain superior classification performance, we
propose to harness the inherent value of multimodal representations by
transcribing speech using automatic speech recognition (ASR) models and
translating the transcripts into different languages via pretrained translation
models. We thus obtain an audio-textual (multimodal) representation for each
data sample. Subsequently, we combine language-specific Bidirectional Encoder
Representations from Transformers (BERT) with Wav2Vec2.0 audio features via a
novel cascaded cross-modal transformer (CCMT). Our model is based on two
cascaded transformer blocks. The first one combines text-specific features from
distinct languages, while the second one combines acoustic features with
multilingual features previously learned by the first transformer block. We
employed our system in the Requests Sub-Challenge of the ACM Multimedia 2023
Computational Paralinguistics Challenge. CCMT was declared the winning
solution, obtaining an unweighted average recall (UAR) of 65.41% and 85.87% for
complaint and request detection, respectively. Moreover, we applied our
framework on the Speech Commands v2 and HarperValleyBank dialog data sets,
surpassing previous studies reporting results on these benchmarks. Our code is
freely available for download at: https://github.com/ristea/ccmt.
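For readers unfamiliar with the metric, unweighted average recall (UAR) is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The sketch below uses made-up labels purely to show how UAR can differ from plain accuracy on imbalanced data; it is not the challenge's scoring script.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally regardless of size."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical imbalanced labels: 8 "request" samples, 2 "other" samples.
y_true = ["request"] * 8 + ["other"] * 2
y_pred = ["request"] * 8 + ["request", "other"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case accuracy is 0.9 while UAR is 0.75, because the minority class is recalled only half the time.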
[COMMENTS]
Accepted for publication in Artificial Intelligence Review
[LINK]
http://arxiv.org/abs/2401.07575v2
[DATE]
2024-07-25 04:50:04+08:00
[CATEGORIES]
cs.CL
cs.LG
Coupling Speech Encoders with Downstream Text Models
[AUTHORS]
Ciprian Chelba, Johan Schalkwyk
[ABSTRACT]
We present a modular approach to building cascade speech translation (AST)
models that guarantees that the resulting model performs no worse than the
1-best cascade baseline while preserving state-of-the-art speech recognition
(ASR) and text translation (MT) performance for a given task. Our novel
contribution is the use of an “exporter” layer that is trained under L2-loss
to ensure a strong match between ASR embeddings and the MT token embeddings for
the 1-best sequence. The “exporter” output embeddings are fed directly to the
MT model in lieu of 1-best token embeddings, thus guaranteeing that the
resulting model performs no worse than the 1-best cascade baseline, while
allowing back-propagation gradient to flow from the MT model into the ASR
components. The matched-embeddings cascade architecture provides a significant
improvement over its 1-best counterpart in scenarios where incremental training
of the MT model is not an option and yet we seek to improve quality by
leveraging (speech, transcription, translated transcription) data provided with
the AST task. The gain disappears when the MT model is incrementally trained on
the parallel text data available with the AST task. The approach holds promise
for other scenarios that seek to couple ASR encoders and immutable text models,
such as large language models (LLMs).
[LINK]
http://arxiv.org/abs/2407.17605v1
[DATE]
2024-07-25 03:29:13+08:00
[CATEGORIES]
cs.CL
cs.LG
Distilling Robustness into Natural Language Inference Models with Domain-Targeted Augmentation
[AUTHORS]
Joe Stacey, Marek Rei
[COMMENTS]
Accepted at ACL Findings 2024
[LINK]
http://arxiv.org/abs/2305.13067v3
[DATE]
2024-07-25 02:54:53+08:00
[CATEGORIES]
cs.CL
cs.LG
Multi-step Inference over Unstructured Data
[AUTHORS]
Aditya Kalyanpur, Kailash Karthik Saravanakumar, Victor Barres, CJ McFate, Lori Moon, Nati Seifu, Maksim Eremeev, Jose Barrera, Abraham Bautista-Castillo, Eric Brown, David Ferrucci
[ABSTRACT]
The advent of Large Language Models (LLMs) and Generative AI has
revolutionized natural language applications across various domains. However,
high-stakes decision-making tasks in fields such as medicine, law, and finance
require a level of precision, comprehensiveness, and logical consistency that
pure LLM or Retrieval-Augmented-Generation (RAG) approaches often fail to
deliver. At Elemental Cognition (EC), we have developed a neuro-symbolic AI
platform to tackle these problems. The platform integrates fine-tuned LLMs for
knowledge extraction and alignment with a robust symbolic reasoning engine for
logical inference, planning and interactive constraint solving. We describe
Cora, a Collaborative Research Assistant built on this platform, that is
designed to perform complex research and discovery tasks in high-stakes
domains. This paper discusses the multi-step inference challenges inherent in
such domains, critiques the limitations of existing LLM-based methods, and
demonstrates how Cora’s neuro-symbolic approach effectively addresses these
issues. We provide an overview of the system architecture, key algorithms for
knowledge extraction and formal reasoning, and present preliminary evaluation
results that highlight Cora’s superior performance compared to well-known LLM
and RAG baselines.
[LINK]
http://arxiv.org/abs/2406.17987v4
[DATE]
2024-07-25 02:38:51+08:00
[CATEGORIES]
cs.CL
I Could’ve Asked That: Reformulating Unanswerable Questions
[AUTHORS]
Wenting Zhao, Ge Gao, Claire Cardie, Alexander M. Rush
[LINK]
http://arxiv.org/abs/2407.17469v1
[DATE]
2024-07-25 01:59:07+08:00
[CATEGORIES]
cs.CL
WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries
[AUTHORS]
Wenting Zhao, Tanya Goyal, Yu Ying Chiu, Liwei Jiang, Benjamin Newman, Abhilasha Ravichander, Khyathi Chandu, Ronan Le Bras, Claire Cardie, Yuntian Deng, Yejin Choi
[LINK]
http://arxiv.org/abs/2407.17468v1
[DATE]
2024-07-25 01:59:05+08:00
[CATEGORIES]
cs.CL
CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models
[AUTHORS]
Jiawei Gu, Zacc Yang, Chuanghao Ding, Rui Zhao, Fei Tan
[ABSTRACT]
Large Language Models (LLMs) excel in diverse tasks but often underperform in
specialized fields due to limited domain-specific or proprietary corpus.
Continual pre-training (CPT) enhances LLM capabilities by imbuing new
domain-specific or proprietary knowledge while replaying general corpus to
prevent catastrophic forgetting. The data mixture ratio of general corpus and
domain-specific corpus, however, has been chosen heuristically, leading to
sub-optimal training efficiency in practice. In this context, we attempt to
re-visit the scaling behavior of LLMs under the hood of CPT, and discover a
power-law relationship between loss, mixture ratio, and training tokens scale.
We formalize the trade-off between general and domain-specific capabilities,
leading to a well-defined Critical Mixture Ratio (CMR) of general and domain
data. By striking the balance, CMR maintains the model’s general ability and
achieves the desired domain transfer, ensuring the highest utilization of
available resources. Therefore, if we value the balance between efficiency and
effectiveness, CMR can be considered the optimal mixture ratio. Through
extensive experiments, we ascertain the predictability of CMR, propose the CMR
scaling law, and substantiate its generalization. These findings offer
practical guidelines for optimizing LLM training in specialized domains,
ensuring both general and domain-specific performance while efficiently
managing training resources.
[LINK]
http://arxiv.org/abs/2407.17467v1
[DATE]
2024-07-25 01:59:02+08:00
[CATEGORIES]
cs.CL
cs.LG
Exploring Domain Robust Lightweight Reward Models based on Router Mechanism
[AUTHORS]
Hyuk Namgoong, Jeesu Jung, Sangkeun Jung, Yoonhyung Roh
[ABSTRACT]
Recent advancements in large language models have heavily relied on the large
reward model from reinforcement learning from human feedback for fine-tuning.
However, the use of a single reward model across various domains may not always
be optimal, often requiring retraining from scratch when new domain data is
introduced. To address these challenges, we explore the utilization of small
language models operating in a domain-specific manner based on router
mechanisms. Our three approaches are: 1) utilizing a mixture of experts to form a
single reward model by modularizing an internal router and experts, 2)
employing an external router to select the appropriate reward model from multiple
domain-specific models, and 3) reducing parameter size by loading
reward models and router adapters onto a single small language model using
adapters. Experimental validation underscores the effectiveness of our
approach, demonstrating performance comparable to baseline methods while also
reducing the total parameter size.
[COMMENTS]
This paper is accepted for ACL 2024
[LINK]
http://arxiv.org/abs/2407.17546v1
[DATE]
2024-07-25 01:25:12+08:00
[CATEGORIES]
cs.LG
cs.CL
Fluent Student-Teacher Redteaming
[AUTHORS]
T. Ben Thompson, Michael Sklar
[ABSTRACT]
Many publicly available language models have been safety tuned to reduce the
likelihood of toxic or liability-inducing text. Users or security analysts
attempt to jailbreak or redteam these models with adversarial prompts which
cause compliance with requests. One attack method is to apply discrete
optimization techniques to the prompt. However, the resulting attack strings
are often gibberish text, easily filtered by defenders due to high measured
perplexity, and may fail for unseen tasks and/or well-tuned models. In this
work, we improve existing algorithms (primarily GCG and BEAST) to develop
powerful and fluent attacks on safety-tuned models like Llama-2 and Phi-3. Our
technique centers around a new distillation-based approach that encourages the
victim model to emulate a toxified finetune, either in terms of output
probabilities or internal activations. To encourage human-fluent attacks, we
add a multi-model perplexity penalty and a repetition penalty to the objective.
We also enhance optimizer strength by allowing token insertions, token swaps,
and token deletions and by using longer attack sequences. The resulting process
is able to reliably jailbreak the most difficult target models with prompts
that appear similar to human-written prompts. On Advbench we achieve attack
success rates $>93$% for Llama-2-7B, Llama-3-8B, and Vicuna-7B, while
maintaining model-measured perplexity $<33$; we achieve $95$% attack success
for Phi-3, though with higher perplexity. We also find a universally-optimized
single fluent prompt that induces $>88$% compliance on previously unseen tasks
across Llama-2-7B, Phi-3-mini and Vicuna-7B and transfers to other black-box
models.
[LINK]
http://arxiv.org/abs/2407.17447v1
[DATE]
2024-07-25 01:23:18+08:00
[CATEGORIES]
cs.CL
Dissecting Language Models: Machine Unlearning via Selective Pruning
[AUTHORS]
Nicholas Pochinkov, Nandi Schoots
[ABSTRACT]
Understanding and shaping the behaviour of Large Language Models (LLMs) is
increasingly important as applications become more powerful and more frequently
adopted. This paper introduces a machine unlearning method specifically
designed for LLMs. We introduce a selective pruning method for LLMs that
removes neurons based on their relative importance on a targeted capability
compared to overall network performance. This approach is a compute- and
data-efficient method for identifying and removing neurons that enable specific
behaviours. Our findings reveal that both feed-forward and attention neurons in
LLMs are specialized; that is, for specific tasks, certain neurons are more
crucial than others. Code from all experiments is available at
https://github.com/nickypro/selective-pruning
[LINK]
http://arxiv.org/abs/2403.01267v2
[DATE]
2024-07-25 01:13:55+08:00
[CATEGORIES]
cs.LG
cs.CL
Consent in Crisis: The Rapid Decline of the AI Data Commons
[AUTHORS]
Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, Joanna Materzynska, Kun Qian, Kush Tiwary, Lester Miranda, Manan Dey, Minnie Liang, Mohammed Hamdy, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Shrestha Mohanty, Vipul Gupta, Vivek Sharma, Vu Minh Chien, Xuhui Zhou, Yizhi Li, Caiming Xiong, Luis Villa, Stella Biderman, Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland
[ABSTRACT]
General-purpose artificial intelligence (AI) systems are built on massive
swathes of public web data, assembled into corpora such as C4, RefinedWeb, and
Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit
of the consent protocols for the web domains underlying AI training corpora.
Our audit of 14,000 web domains provides an expansive view of crawlable web
data and how codified data use preferences are changing over time. We observe a
proliferation of AI-specific clauses to limit use, acute differences in
restrictions on AI developers, as well as general inconsistencies between
websites’ expressed intentions in their Terms of Service and their robots.txt.
We diagnose these as symptoms of ineffective web protocols, not designed to
cope with the widespread re-purposing of the internet for AI. Our longitudinal
analyses show that in a single year (2023-2024) there has been a rapid
crescendo of data restrictions from web sources, rendering ~5%+ of all tokens
in C4, or 28%+ of the most actively maintained, critical sources in C4, fully
restricted from use. For Terms of Service crawling restrictions, a full 45% of
C4 is now restricted. If respected or enforced, these restrictions are rapidly
biasing the diversity, freshness, and scaling laws for general-purpose AI
systems. We hope to illustrate the emerging crises in data consent, for both
developers and creators. The foreclosure of much of the open web will impact
not only commercial AI, but also non-commercial AI and academic research.
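A robots.txt restriction aimed at AI crawlers can be checked with Python's standard `urllib.robotparser`. The robots.txt content, the URL, and the agent name `ExampleBrowser` below are made up for illustration, with `GPTBot` used as an example of an AI crawler user agent. This sketch shows only how a single file is interpreted, not the audit methodology of the paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

ai_ok = allowed(ROBOTS_TXT, "GPTBot", "https://example.com/articles/1")
other_ok = allowed(ROBOTS_TXT, "ExampleBrowser", "https://example.com/articles/1")
```

Note that robots.txt is advisory: a check like this tells a crawler what the site has requested, which is why the abstract qualifies its findings with "if respected or enforced".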
[COMMENTS]
41 pages (13 main), 5 figures, 9 tables
[LINK]
http://arxiv.org/abs/2407.14933v2
[DATE]
2024-07-25 00:52:51+08:00
[CATEGORIES]
cs.CL
cs.LG
Dependency Transformer Grammars: Integrating Dependency Structures into Transformer Language Models
[AUTHORS]
Yida Zhao, Chao Lou, Kewei Tu
[ABSTRACT]
Syntactic Transformer language models aim to achieve better generalization
through simultaneously modeling syntax trees and sentences. While prior work
has been focusing on adding constituency-based structures to Transformers, we
introduce Dependency Transformer Grammars (DTGs), a new class of Transformer
language model with explicit dependency-based inductive bias. DTGs simulate
dependency transition systems with constrained attention patterns by modifying
attention masks, incorporate the stack information through relative positional
encoding, and augment dependency arc representation with a combination of token
embeddings and operation embeddings. When trained on a dataset of sentences
annotated with dependency trees, DTGs achieve better generalization while
maintaining comparable perplexity with Transformer language model baselines.
DTGs also outperform recent constituency-based models, showing that dependency
can better guide Transformer language models. Our code is released at
https://github.com/zhaoyd1/Dep_Transformer_Grammars.
[LINK]
http://arxiv.org/abs/2407.17406v1
[DATE]
2024-07-25 00:38:38+08:00
[CATEGORIES]
cs.CL
CovScore: Evaluation of Multi-Document Abstractive Title Set Generation
[AUTHORS]
Itamar Trainin, Omri Abend
[ABSTRACT]
This paper introduces CovScore, an automatic reference-less methodology for
evaluating thematic title sets, extracted from a corpus of documents. While
such extraction methods are widely used, evaluating their effectiveness remains
an open question. Moreover, some existing practices heavily rely on slow and
laborious human annotation procedures. Inspired by recently introduced
LLM-based judge methods, we propose a novel methodology that decomposes quality
into five main metrics along different aspects of evaluation. This framing
simplifies and expedites the manual evaluation process and enables automatic
and independent LLM-based evaluation. As a test case, we apply our approach to
a corpus of Holocaust survivor testimonies, motivated both by its relevance to
title set extraction and by the moral significance of this pursuit. We validate
the methodology by experimenting with naturalistic and synthetic title set
generation systems and compare their performance with the methodology.
[LINK]
http://arxiv.org/abs/2407.17390v1
[DATE]
2024-07-25 00:14:15+08:00
[CATEGORIES]
cs.CL
A Comprehensive Approach to Misspelling Correction with BERT and Levenshtein Distance
[AUTHORS]
Amirreza Naziri, Hossein Zeinali
[ABSTRACT]
Writing, as an omnipresent form of human communication, permeates nearly
every aspect of contemporary life. Consequently, inaccuracies or errors in
written communication can lead to profound consequences, ranging from financial
losses to potentially life-threatening situations. Spelling mistakes, among the
most prevalent writing errors, are frequently encountered due to various
factors. This research aims to identify and rectify diverse spelling errors in
text using neural networks, specifically leveraging the Bidirectional Encoder
Representations from Transformers (BERT) masked language model. To achieve this
goal, we compiled a comprehensive dataset encompassing both non-real-word and
real-word errors after categorizing different types of spelling mistakes.
Subsequently, multiple pre-trained BERT models were employed. To ensure optimal
performance in correcting misspelling errors, we propose a combined approach
utilizing the BERT masked language model and Levenshtein distance. The results
from our evaluation data demonstrate that the system presented herein exhibits
remarkable capabilities in identifying and rectifying spelling mistakes, often
surpassing existing systems tailored for the Persian language.
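
A minimal sketch of how a masked language model's suggestions might be combined with Levenshtein distance: candidates are re-ranked by edit distance to the misspelled token, with the model score as a tie-breaker. The candidate list and scores are hypothetical stand-ins for masked-LM predictions, and the ranking rule is an assumption for illustration, not necessarily the authors' procedure.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rerank(misspelled: str, candidates: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Sort (word, score) pairs by edit distance, then by descending score."""
    return sorted(candidates, key=lambda c: (levenshtein(misspelled, c[0]), -c[1]))


if __name__ == "__main__":
    # Hypothetical masked-LM candidates for the token "speling".
    candidates = [("selling", 0.6), ("spelling", 0.3), ("sapling", 0.1)]
    print(rerank("speling", candidates))
```

Ranking by distance first favors candidates that look like the typed token even when the language model prefers another word in context; a weighted combination of the two signals would be an equally plausible design.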
[COMMENTS]
12 pages, 9 figures, 5 tables
[LINK]
http://arxiv.org/abs/2407.17383v1
[DATE]
2024-07-25 00:07:11+08:00
[CATEGORIES]
cs.CL
cs.LG
StraightLine: An End-to-End Resource-Aware Scheduler for Machine Learning Application Requests
[AUTHORS]
Cheng-Wei Ching, Boyuan Guan, Hailu Xu, Liting Hu
[ABSTRACT]
The life cycle of machine learning (ML) applications consists of two stages:
model development and model deployment. However, traditional ML systems (e.g.,
training-specific or inference-specific systems) focus on one particular stage
or phase of the life cycle of ML applications. These systems often aim at
optimizing model training or accelerating model inference, and they frequently
assume homogeneous infrastructure, which may not always reflect real-world
scenarios that include cloud data centers, local servers, containers, and
serverless platforms. We present StraightLine, an end-to-end resource-aware
scheduler that schedules the optimal resources (e.g., container, virtual
machine, or serverless) for different ML application requests in a hybrid
infrastructure. The key innovation is an empirical dynamic placing algorithm
that intelligently places requests based on their unique characteristics (e.g.,
request frequency, input data size, and data distribution). In contrast to
existing ML systems, StraightLine offers end-to-end resource-aware placement
and can thereby significantly reduce response time and failure rate for model
deployment when facing different computing resources in the hybrid
infrastructure.
[COMMENTS]
6 pages, 8 figures, to appear in AIoTC’24
[LINK]
http://arxiv.org/abs/2407.18148v1
[DATE]
2024-07-25 23:58:56+08:00
[CATEGORIES]
cs.LG
Maximum Entropy On-Policy Actor-Critic via Entropy Advantage Estimation
[AUTHORS]
Jean Seong Bjorn Choe, Jong-Kook Kim
[ABSTRACT]
Entropy Regularisation is a widely adopted technique that enhances policy
optimisation performance and stability. A notable form of entropy
regularisation is augmenting the objective with an entropy term, thereby
simultaneously optimising the expected return and the entropy. This framework,
known as maximum entropy reinforcement learning (MaxEnt RL), has shown
theoretical and empirical successes. However, its practical application in
straightforward on-policy actor-critic settings remains surprisingly
underexplored. We hypothesise that this is due to the difficulty of managing
the entropy reward in practice. This paper proposes a simple method of
separating the entropy objective from the MaxEnt RL objective, which
facilitates the implementation of MaxEnt RL in on-policy settings. Our
empirical evaluations demonstrate that extending Proximal Policy Optimisation
(PPO) and Trust Region Policy Optimisation (TRPO) within the MaxEnt framework
improves policy optimisation performance in both MuJoCo and Procgen tasks.
Additionally, our results highlight MaxEnt RL’s capacity to enhance
generalisation.
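
For reference, the standard discounted MaxEnt RL objective can be written as a sum of a return term and an entropy term, which is one way to read the separation described above. The notation ($\alpha$ for the temperature, $J_R$ and $J_H$ for the two terms) is an assumption introduced here and may differ from the paper's.

```latex
J(\pi)
  = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}
      \Big( r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big)\right]
  = \underbrace{\mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]}_{J_R(\pi)}
  + \alpha\, \underbrace{\mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,
      \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]}_{J_H(\pi)}
```

Because the expectation is linear, the two terms can in principle be estimated separately and combined afterwards.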
[LINK]
http://arxiv.org/abs/2407.18143v1
[DATE]
2024-07-25 23:48:24+08:00
[CATEGORIES]
cs.LG
$\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs
[AUTHORS]
Vlad Sobal, Mark Ibrahim, Randall Balestriero, Vivien Cabannes, Diane Bouchacourt, Pietro Astolfi, Kyunghyun Cho, Yann LeCun
[ABSTRACT]
Learning good representations involves capturing the diverse ways in which
data samples relate. Contrastive loss - an objective matching related samples -
underlies methods from self-supervised to multimodal learning. Contrastive
losses, however, can be viewed more broadly as modifying a similarity graph to
indicate how samples should relate in the embedding space. This view reveals a
shortcoming in contrastive learning: the similarity graph is binary, as only
one sample is the related positive sample. Crucially, similarities
\textit{across} samples are ignored. Based on this observation, we revise the
standard contrastive loss to explicitly encode how a sample relates to others.
We experiment with this new objective, called $\mathbb{X}$-Sample Contrastive,
to train vision models based on similarities in class or text caption
descriptions. Our study spans three scales: ImageNet-1k with 1 million, CC3M
with 3 million, and CC12M with 12 million samples. The representations learned
via our objective outperform both contrastive self-supervised and
vision-language models trained on the same data across a range of tasks. When
training on CC12M, we outperform CLIP by $0.6\%$ on both ImageNet and ImageNet
Real. Our objective appears to work particularly well in lower-data regimes,
with gains over CLIP of $16.8\%$ on ImageNet and $18.1\%$ on ImageNet Real when
training with CC3M. Finally, our objective seems to encourage the model to
learn representations that separate objects from their attributes and
backgrounds, with gains of $3.3$-$5.6$\% over CLIP on ImageNet9. We hope the
proposed solution takes a small step towards developing richer learning
objectives for understanding sample relations in foundation models.
[LINK]
http://arxiv.org/abs/2407.18134v1
[DATE]
2024-07-25 23:38:16+08:00
[CATEGORIES]
cs.LG
Looking at Model Debiasing through the Lens of Anomaly Detection
[AUTHORS]
Vito Paolo Pastore, Massimiliano Ciranni, Davide Marinelli, Francesca Odone, Vittorio Murino
[ABSTRACT]
It is widely recognized that deep neural networks are sensitive to bias in
the data. This means that during training these models are likely to learn
spurious correlations between data and labels, resulting in limited
generalization abilities and low performance. In this context, model debiasing
approaches can be devised aiming at reducing the model’s dependency on such
unwanted correlations, either leveraging the knowledge of bias information or
not. In this work, we focus on the latter and more realistic scenario, showing
the importance of accurately predicting the bias-conflicting and bias-aligned
samples to obtain compelling performance in bias mitigation. On this ground, we
propose to conceive the problem of model bias from an out-of-distribution
perspective, introducing a new bias identification method based on anomaly
detection. We claim that when data is mostly biased, bias-conflicting samples
can be regarded as outliers with respect to the bias-aligned distribution in
the feature space of a biased model, thus allowing for precisely detecting them
with an anomaly detection method. Coupling the proposed bias identification
approach with bias-conflicting data upsampling and augmentation in a two-step
strategy, we reach state-of-the-art performance on synthetic and real benchmark
datasets. Ultimately, our proposed approach shows that the data bias issue does
not necessarily require complex debiasing methods, given that an accurate bias
identification procedure is defined.
[COMMENTS]
15 pages, 7 figures
[LINK]
http://arxiv.org/abs/2407.17449v2
[DATE]
2024-07-25 23:33:00+08:00
[CATEGORIES]
cs.LG
Generative Learning of Continuous Data by Tensor Networks
[AUTHORS]
Alex Meiburg, Jing Chen, Jacob Miller, Raphaëlle Tihon, Guillaume Rabusseau, Alejandro Perdomo-Ortiz
[ABSTRACT]
Beyond their origin in modeling many-body quantum systems, tensor networks
have emerged as a promising class of models for solving machine learning
problems, notably in unsupervised generative learning. While possessing many
desirable features arising from their quantum-inspired nature, tensor network
generative models have previously been largely restricted to binary or
categorical data, limiting their utility in real-world modeling problems. We
overcome this by introducing a new family of tensor network generative models
for continuous data, which are capable of learning from distributions
containing continuous random variables. We develop our method in the setting of
matrix product states, first deriving a universal expressivity theorem proving
the ability of this model family to approximate any reasonably smooth
probability density function with arbitrary precision. We then benchmark the
performance of this model on several synthetic and real-world datasets, finding
that the model learns and generalizes well on distributions of continuous and
discrete variables. We develop methods for modeling different data domains, and
introduce a trainable compression layer which is found to increase model
performance given limited memory or computational resources. Overall, our
methods give important theoretical and empirical evidence of the efficacy of
quantum-inspired methods for the rapidly growing field of generative learning.
[COMMENTS]
21 pages, 15 figures
[LINK]
http://arxiv.org/abs/2310.20498v2
[DATE]
2024-07-25 23:25:27+08:00
[CATEGORIES]
cs.LG
Graph Neural Ordinary Differential Equations for Coarse-Grained Socioeconomic Dynamics
[AUTHORS]
James Koch, Pranab Roy Chowdhury, Heng Wan, Parin Bhaduri, Jim Yoon, Vivek Srikrishnan, W. Brent Daniel
[ABSTRACT]
We present a data-driven machine-learning approach for modeling space-time
socioeconomic dynamics. Through coarse-graining fine-scale observations, our
modeling framework simplifies these complex systems to a set of tractable
mechanistic relationships – in the form of ordinary differential equations –
while preserving critical system behaviors. This approach allows for expedited
‘what if’ studies and sensitivity analyses, essential for informed
policy-making. Our findings, from a case study of Baltimore, MD, indicate that
this machine learning-augmented coarse-grained model serves as a powerful
instrument for deciphering the complex interactions between social factors,
geography, and exogenous stressors, offering a valuable asset for system
forecasting and resilience planning.
[LINK]
http://arxiv.org/abs/2407.18108v1
[DATE]
2024-07-25 23:12:46+08:00
[CATEGORIES]
cs.LG
Fine-Tuning Large Language Models for Stock Return Prediction Using Newsflow
[AUTHORS]
Tian Guo, Emmanuel Hauptmann
[ABSTRACT]
Large language models (LLMs) and their fine-tuning techniques have
demonstrated superior performance in various language understanding and
generation tasks. This paper explores fine-tuning LLMs for stock return
forecasting with financial newsflow. In quantitative investing, return
forecasting is fundamental for subsequent tasks like stock picking, portfolio
optimization, etc. We formulate the model to include text representation and
forecasting modules. We propose to compare the encoder-only and decoder-only
LLMs, considering they generate text representations in distinct ways. The
impact of these different representations on forecasting performance remains an
open question. Meanwhile, we compare two simple methods of integrating LLMs’
token-level representations into the forecasting module. The experiments on
real news and investment universes reveal that: (1) aggregated representations
from LLMs’ token-level embeddings generally produce return predictions that
enhance the performance of long-only and long-short portfolios; (2) in the
relatively large investment universe, the decoder LLMs-based prediction model
leads to stronger portfolios, whereas in the small universes, there are no
consistent winners. Among the three LLMs studied (DeBERTa, Mistral, Llama),
Mistral performs more robustly across different universes; (3) return
predictions derived from LLMs’ text representations are a strong signal for
portfolio construction, outperforming conventional sentiment scores.
[LINK]
http://arxiv.org/abs/2407.18103v1
[DATE]
2024-07-25 23:07:35+08:00
[CATEGORIES]
cs.LG
Self-supervised learning of video representations from a child’s perspective
[AUTHORS]
A. Emin Orhan, Wentao Wang, Alex N. Wang, Mengye Ren, Brenden M. Lake
[ABSTRACT]
Children learn powerful internal models of the world around them from a few
years of egocentric visual experience. Can such internal models be learned from
a child’s visual experience with highly generic learning algorithms or do they
require strong inductive biases? Recent advances in collecting large-scale,
longitudinal, developmentally realistic video datasets and generic
self-supervised learning (SSL) algorithms are allowing us to begin to tackle
this nature vs. nurture question. However, existing work typically focuses on
image-based SSL algorithms and visual capabilities that can be learned from
static images (e.g. object recognition), thus ignoring temporal aspects of the
world. To close this gap, here we train self-supervised video models on
longitudinal, egocentric headcam recordings collected from a child over a two
year period in their early development (6-31 months). The resulting models are
highly effective at facilitating the learning of action concepts from a small
number of labeled examples; they have favorable data size scaling properties;
and they display emergent video interpolation capabilities. Video models also
learn more robust object representations than image-based models trained with
the exact same data. These results suggest that important temporal aspects of a
child’s internal model of the world may be learnable from their visual
experience using highly generic learning algorithms and without strong
inductive biases.
[COMMENTS]
Published as a conference paper at CogSci 2024; code & models
available from https://github.com/eminorhan/video-models
[LINK]
http://arxiv.org/abs/2402.00300v2
[DATE]
2024-07-25 22:48:34+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Sunny Duan, Mikail Khona, Abhiram Iyer, Rylan Schaeffer, Ila R Fiete
[ABSTRACT]
Frontier AI systems are making transformative impacts across society, but
such benefits are not without costs: models trained on web-scale datasets
containing personal and private data raise profound concerns about data privacy
and security. Language models are trained on extensive corpora including
potentially sensitive or proprietary information, and the risk of data leakage,
where the model response reveals pieces of such information, remains
inadequately understood. Prior work has investigated what factors drive
memorization and has identified that sequence complexity and the number of
repetitions drive memorization. Here, we focus on the evolution of memorization
over training. We begin by reproducing findings that the probability of
memorizing a sequence scales logarithmically with the number of times it is
present in the data. We next show that sequences which are apparently not
memorized after the first encounter can be “uncovered” throughout the course of
training even without subsequent encounters, a phenomenon we term “latent
memorization”. The presence of latent memorization presents a challenge for
data privacy as memorized sequences may be hidden at the final checkpoint of
the model but remain easily recoverable. To this end, we develop a diagnostic
test relying on the cross entropy loss to uncover latent memorized sequences
with high accuracy.
[LINK]
http://arxiv.org/abs/2406.14549v2
[DATE]
2024-07-25 22:33:33+08:00
[CATEGORIES]
cs.LG
Clustering with minimum spanning trees: How good can it be?
[AUTHORS]
Marek Gagolewski, Anna Cena, Maciej Bartoszuk, Łukasz Brzozowski
[ABSTRACT]
Minimum spanning trees (MSTs) provide a convenient representation of datasets
in numerous pattern recognition activities. Moreover, they are relatively fast
to compute. In this paper, we quantify the extent to which they are meaningful
in low-dimensional partitional data clustering tasks. By identifying the upper
bounds for the agreement between the best (oracle) algorithm and the expert
labels from a large battery of benchmark data, we discover that MST methods can
be very competitive. Next, we review, study, extend, and generalise a few
existing, state-of-the-art MST-based partitioning schemes. This leads to some
new noteworthy approaches. Overall, the Genie and the information-theoretic
methods often outperform the non-MST algorithms such as K-means, Gaussian
mixtures, spectral clustering, Birch, density-based, and classical hierarchical
agglomerative procedures. Nevertheless, we identify that there is still some
room for improvement, and thus the development of novel algorithms is
encouraged.
[LINK]
http://arxiv.org/abs/2303.05679v3
[DATE]
2024-07-25 22:32:51+08:00
[CATEGORIES]
cs.LG
Normalised clustering accuracy: An asymmetric external cluster validity measure
[AUTHORS]
Marek Gagolewski
[ABSTRACT]
There is no, nor will there ever be, single best clustering algorithm.
Nevertheless, we would still like to be able to distinguish between methods
that work well on certain task types and those that systematically
underperform. Clustering algorithms are traditionally evaluated using either
internal or external validity measures. Internal measures quantify different
aspects of the obtained partitions, e.g., the average degree of cluster
compactness or point separability. However, their validity is questionable
because the clusterings they endorse can sometimes be meaningless. External
measures, on the other hand, compare the algorithms’ outputs to fixed ground
truth groupings provided by experts. In this paper, we argue that the commonly
used classical partition similarity scores, such as the normalised mutual
information, Fowlkes-Mallows, or adjusted Rand index, miss some desirable
properties. In particular, they do not identify worst-case scenarios correctly,
nor are they easily interpretable. As a consequence, the evaluation of
clustering algorithms on diverse benchmark datasets can be difficult. To remedy
these issues, we propose and analyse a new measure: a version of the optimal
set-matching accuracy, which is normalised, monotonic with respect to some
similarity relation, scale-invariant, and corrected for the imbalancedness of
cluster sizes (but neither symmetric nor adjusted for chance).
[LINK]
http://arxiv.org/abs/2209.02935v4
[DATE]
2024-07-25 22:31:03+08:00
[CATEGORIES]
cs.LG
3D Diffuser Actor: Policy Diffusion with 3D Scene Representations
[AUTHORS]
Tsung-Wei Ke, Nikolaos Gkanatsios, Katerina Fragkiadaki
[ABSTRACT]
Diffusion policies are conditional diffusion models that learn robot action
distributions conditioned on the robot and environment state. They have
recently been shown to outperform both deterministic and alternative action
distribution learning formulations. 3D robot policies use 3D scene feature
representations aggregated from a single or multiple camera views using sensed
depth. They have been shown to generalize better than their 2D counterparts across
camera viewpoints. We unify these two lines of work and present 3D Diffuser
Actor, a neural policy equipped with a novel 3D denoising transformer that
fuses information from the 3D visual scene, a language instruction and
proprioception to predict the noise in noised 3D robot pose trajectories. 3D
Diffuser Actor sets a new state-of-the-art on RLBench with an absolute
performance gain of 18.1% over the current SOTA on a multi-view setup and an
absolute gain of 13.1% on a single-view setup. On the CALVIN benchmark, it
improves over the current SOTA by a 9% relative increase. It also learns to
control a robot manipulator in the real world from a handful of demonstrations.
Through thorough comparisons with the current SOTA policies and ablations of
our model, we show 3D Diffuser Actor’s design choices dramatically outperform
2D representations, regression and classification objectives, absolute
attentions, and holistic non-tokenized 3D scene embeddings.
[COMMENTS]
First two authors contributed equally
[LINK]
http://arxiv.org/abs/2402.10885v3
[DATE]
2024-07-25 22:30:22+08:00
[CATEGORIES]
cs.LG
Principal-Agent Reinforcement Learning
[AUTHORS]
Dima Ivanov, Paul Dütting, Inbal Talgam-Cohen, Tonghan Wang, David C. Parkes
[ABSTRACT]
Contracts are the economic framework which allows a principal to delegate a
task to an agent – despite misaligned interests, and even without directly
observing the agent’s actions. In many modern reinforcement learning settings,
self-interested agents learn to perform a multi-stage task delegated to them by
a principal. We explore the significant potential of utilizing contracts to
incentivize the agents. We model the delegated task as an MDP, and study a
stochastic game between the principal and agent where the principal learns what
contracts to use, and the agent learns an MDP policy in response. We present a
learning-based algorithm for optimizing the principal’s contracts, which
provably converges to the subgame-perfect equilibrium of the principal-agent
game. A deep RL implementation allows us to apply our method to very large MDPs
with unknown transition dynamics. We extend our approach to multiple agents,
and demonstrate its relevance to resolving a canonical sequential social
dilemma with minimal intervention to agent rewards.
[LINK]
http://arxiv.org/abs/2407.18074v1
[DATE]
2024-07-25 22:28:58+08:00
[CATEGORIES]
cs.LG
HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data
[AUTHORS]
A. Emin Orhan
[ABSTRACT]
We introduce Human-like Video Models (HVM-1), large-scale video models
pretrained with nearly 5000 hours of curated human-like video data (mostly
egocentric, temporally extended, continuous video recordings), using the
spatiotemporal masked autoencoder (ST-MAE) algorithm. We release two 633M
parameter models trained at spatial resolutions of 224x224 and 448x448 pixels.
We evaluate the performance of these models in downstream few-shot video and
image recognition tasks and compare them against a model pretrained with 1330
hours of short action-oriented video clips from YouTube (Kinetics-700). HVM-1
models perform competitively against the Kinetics-700 pretrained model in
downstream evaluations despite substantial qualitative differences between the
spatiotemporal characteristics of the corresponding pretraining datasets. HVM-1
models also learn more accurate and more robust object representations compared
to models pretrained with the image-based MAE algorithm on the same data,
demonstrating the potential benefits of learning to predict temporal
regularities in natural videos for learning better object representations.
[COMMENTS]
10 pages, 5 figures, 1 table; code & models available from
https://github.com/eminorhan/hvm-1
[LINK]
http://arxiv.org/abs/2407.18067v1
[DATE]
2024-07-25 22:21:50+08:00
[CATEGORIES]
cs.LG
Diagnosing and fixing common problems in Bayesian optimization for molecule design
[AUTHORS]
Austin Tripp, José Miguel Hernández-Lobato
[ABSTRACT]
Bayesian optimization (BO) is a principled approach to molecular design
tasks. In this paper we explain three pitfalls of BO which can cause poor
empirical performance: an incorrect prior width, over-smoothing, and inadequate
acquisition function maximization. We show that with these issues addressed,
even a basic BO setup is able to achieve the highest overall performance on the
PMO benchmark for molecule design (Gao et al 2022). These results suggest that
BO may benefit from more attention in the machine learning for molecules
community.
[COMMENTS]
8 pages, 4 figures. ICML 2024 AI for science workshop
(https://openreview.net/forum?id=V4aG4wsoIt). Code at:
https://github.com/AustinT/basic-mol-bo-workshop2024
[LINK]
http://arxiv.org/abs/2406.07709v2
[DATE]
2024-07-25 22:17:40+08:00
[CATEGORIES]
cs.LG
Cross-Vendor Reproducibility of Radiomics-based Machine Learning Models for Computer-aided Diagnosis
[AUTHORS]
Jatin Chaudhary, Ivan Jambor, Hannu Aronen, Otto Ettala, Jani Saunavaara, Peter Boström, Jukka Heikkonen, Rajeev Kanth, Harri Merisaari
[ABSTRACT]
Background: The reproducibility of machine-learning models in prostate cancer
detection across different MRI vendors remains a significant challenge.
Methods: This study investigates Support Vector Machines (SVM) and Random
Forest (RF) models trained on radiomic features extracted from T2-weighted MRI
images using Pyradiomics and MRCradiomics libraries. Feature selection was
performed using the maximum relevance minimum redundancy (MRMR) technique. We
aimed to enhance clinical decision support through multimodal learning and
feature fusion. Results: Our SVM model, utilizing combined features from
Pyradiomics and MRCradiomics, achieved an AUC of 0.74 on the Multi-Improd
dataset (Siemens scanner) but decreased to 0.60 on the Philips test set. The RF
model showed similar trends, with notable robustness for models using
Pyradiomics features alone (AUC of 0.78 on Philips). Conclusions: These
findings demonstrate the potential of multimodal feature integration to improve
the robustness and generalizability of machine-learning models for clinical
decision support in prostate cancer detection. This study marks a significant
step towards developing reliable AI-driven diagnostic tools that maintain
efficacy across various imaging platforms.
[LINK]
http://arxiv.org/abs/2407.18060v1
[DATE]
2024-07-25 22:16:02+08:00
[CATEGORIES]
cs.LG
Physics-informed nonlinear vector autoregressive models for the prediction of dynamical systems
[AUTHORS]
James H. Adler, Samuel Hocking, Xiaozhe Hu, Shafiqul Islam
[ABSTRACT]
Machine learning techniques have recently been of great interest for solving
differential equations. Training these models is classically a data-fitting
task, but knowledge of the expression of the differential equation can be used
to supplement the training objective, leading to the development of
physics-informed scientific machine learning. In this article, we focus on one
class of models called nonlinear vector autoregression (NVAR) to solve ordinary
differential equations (ODEs). Motivated by connections to numerical
integration and physics-informed neural networks, we explicitly derive the
physics-informed NVAR (piNVAR) which enforces the right-hand side of the
underlying differential equation regardless of NVAR construction. Because NVAR
and piNVAR completely share their learned parameters, we propose an augmented
procedure to jointly train the two models. Then, using both data-driven and
ODE-driven metrics, we evaluate the ability of the piNVAR model to predict
solutions to various ODE systems, such as the undamped spring, a Lotka-Volterra
predator-prey nonlinear model, and the chaotic Lorenz system.
[LINK]
http://arxiv.org/abs/2407.18057v1
[DATE]
2024-07-25 22:10:42+08:00
[CATEGORIES]
cs.LG
The Geometry of Queries: Query-Based Innovations in Retrieval-Augmented Generation
[AUTHORS]
Eric Yang, Jonathan Amar, Jong Ha Lee, Bhawesh Kumar, Yugang Jia
[ABSTRACT]
Digital health chatbots powered by Large Language Models (LLMs) have the
potential to significantly improve personal health management for chronic
conditions by providing accessible and on-demand health coaching and
question-answering. However, these chatbots risk providing unverified and
inaccurate information because LLMs generate responses based on patterns
learned from diverse internet data. Retrieval Augmented Generation (RAG) can
help mitigate hallucinations and inaccuracies in LLM responses by grounding them
in reliable content. However, efficiently and accurately retrieving the most
relevant set of content for real-time user questions remains a challenge. In
this work, we introduce Query-Based Retrieval Augmented Generation (QB-RAG), a
novel approach that pre-computes a database of potential queries from a content
base using LLMs. For an incoming patient question, QB-RAG efficiently matches
it against this pre-generated query database using vector search, improving
alignment between user questions and the content. We establish a theoretical
foundation for QB-RAG and provide a comparative analysis of existing retrieval
enhancement techniques for RAG systems. Finally, our empirical evaluation
demonstrates that QB-RAG significantly improves the accuracy of healthcare
question answering, paving the way for robust and trustworthy LLM applications
in digital health.
[COMMENTS]
22 pages
[LINK]
http://arxiv.org/abs/2407.18044v1
[DATE]
2024-07-25 21:47:01+08:00
[CATEGORIES]
cs.LG
Lifelong Graph Summarization with Neural Networks: 2012, 2022, and a Time Warp
[AUTHORS]
Jonatan Frank, Marcel Hoffmann, Nicolas Lell, David Richerby, Ansgar Scherp
[ABSTRACT]
Summarizing web graphs is challenging due to the heterogeneity of the modeled
information and its changes over time. We investigate the use of neural
networks for lifelong graph summarization. Assuming we observe the web graph at
a certain time, we train the networks to summarize graph vertices. We apply
this trained network to summarize the vertices of the changed graph at the next
point in time. Subsequently, we continue training and evaluating the network to
perform lifelong graph summarization. We use the GNNs Graph-MLP and GraphSAINT,
as well as an MLP baseline, to summarize the temporal graphs. We compare
$1$-hop and $2$-hop summaries. We investigate the impact of reusing parameters
from a previous snapshot by measuring the backward and forward transfer and the
forgetting rate of the neural networks. Our extensive experiments on ten weekly
snapshots of a web graph with over $100$M edges, sampled in 2012 and 2022, show
that all networks predominantly use $1$-hop information to determine the
summary, even when performing $2$-hop summarization. Due to the heterogeneity
of web graphs, in some snapshots, the $2$-hop summary produces over ten times
more vertex summaries than the $1$-hop summary. When using the network trained
on the last snapshot from 2012 and applying it to the first snapshot of 2022,
we observe a strong drop in accuracy. We attribute this drop over the ten-year
time warp to the strongly increased heterogeneity of the web graph in 2022.
[LINK]
http://arxiv.org/abs/2407.18042v1
[DATE]
2024-07-25 21:44:42+08:00
[CATEGORIES]
cs.LG
How to Train the Teacher Model for Effective Knowledge Distillation
[AUTHORS]
Shayan Mohajer Hamidi, Xizhen Deng, Renhao Tan, Linfeng Ye, Ahmed Hussein Salamah
[ABSTRACT]
Recently, it was shown that the role of the teacher in knowledge distillation
(KD) is to provide the student with an estimate of the true Bayes conditional
probability density (BCPD). Notably, the new findings propose that the
student’s error rate can be upper-bounded by the mean squared error (MSE)
between the teacher’s output and BCPD. Consequently, to enhance KD efficacy,
the teacher should be trained such that its output is close to BCPD in MSE
sense. This paper elucidates that training the teacher model with MSE loss
equates to minimizing the MSE between its output and BCPD, aligning with its
core responsibility of providing the student with a BCPD estimate closely
resembling it in MSE terms. In this respect, through a comprehensive set of
experiments, we demonstrate that substituting the conventional teacher trained
with cross-entropy loss with one trained using MSE loss in state-of-the-art KD
methods consistently boosts the student’s accuracy, resulting in improvements
of up to 2.6%.
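
The link between MSE training and the BCPD can be checked numerically. The sketch below assumes one-hot labels drawn from a known conditional distribution `p_true` (a hypothetical stand-in for the BCPD) and shows that the expected MSE loss of an output `q` exceeds its minimum by exactly the mean squared distance between `q` and `p_true`. All function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mse_to_one_hot(probs, label):
    """Mean squared error between an output vector and a one-hot label."""
    return sum(
        (p - (1.0 if i == label else 0.0)) ** 2 for i, p in enumerate(probs)
    ) / len(probs)


def expected_mse(probs, p_true):
    """Expected MSE loss when the label is drawn from p_true."""
    return sum(p_y * mse_to_one_hot(probs, y) for y, p_y in enumerate(p_true))


p_true = [0.7, 0.2, 0.1]            # assumed true conditional distribution
q = softmax([2.0, 1.0, 0.0])        # some teacher output

excess = expected_mse(q, p_true) - expected_mse(p_true, p_true)
distance = sum((a - b) ** 2 for a, b in zip(q, p_true)) / len(q)
print(excess, distance)             # the two values agree up to rounding
```

Because the excess loss equals the squared distance to `p_true`, minimizing the expected MSE loss drives the output toward the true conditional distribution.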
[COMMENTS]
The paper was accepted at ECCV2024
[LINK]
http://arxiv.org/abs/2407.18041v1
[DATE]
2024-07-25 21:39:11+08:00
[CATEGORIES]
cs.LG
Learning mental states estimation through self-observation: a developmental synergy between intentions and beliefs representations in a deep-learning model of Theory of Mind
[AUTHORS]
Francesca Bianco, Silvia Rigato, Maria Laura Filippetti, Dimitri Ognibene
[ABSTRACT]
Theory of Mind (ToM), the ability to attribute beliefs, intentions, or mental
states to others, is a crucial feature of human social interaction. In complex
environments, where the human sensory system reaches its limits, behaviour is
strongly driven by our beliefs about the state of the world around us.
Accessing others’ mental states, e.g., beliefs and intentions, allows for more
effective social interactions in natural contexts. Yet, these variables are not
directly observable, making understanding ToM a challenging quest of interest
for different fields, including psychology, machine learning and robotics. In
this paper, we contribute to this topic by showing a developmental synergy
between learning to predict low-level mental states (e.g., intentions, goals)
and attributing high-level ones (i.e., beliefs). Specifically, we assume that
learning beliefs attribution can occur by observing one’s own decision
processes involving beliefs, e.g., in a partially observable environment. Using
a simple feed-forward deep learning model, we show that, when learning to
predict others’ intentions and actions, more accurate predictions can be
acquired earlier if beliefs attribution is learnt simultaneously. Furthermore,
we show that the learning performance improves even when observed actors have a
different embodiment than the observer and the gain is higher when observing
beliefs-driven chunks of behaviour. We propose that our computational approach
can inform the understanding of human social cognitive development and be
relevant for the design of future adaptive social robots able to autonomously
understand, assist, and learn from human interaction partners in novel natural
environments and tasks.
[LINK]
http://arxiv.org/abs/2407.18022v1
[DATE]
2024-07-25 21:15:25+08:00
[CATEGORIES]
cs.LG
Quadratic Advantage with Quantum Randomized Smoothing Applied to Time-Series Analysis
[AUTHORS]
Nicola Franco, Marie Kempkes, Jakob Spiegelberg, Jeanette Miriam Lorenz
[ABSTRACT]
As quantum machine learning continues to develop at a rapid pace, the
importance of ensuring the robustness and efficiency of quantum algorithms
cannot be overstated. Our research presents an analysis of quantum randomized
smoothing and of how data encoding and perturbation modeling approaches can be
matched to achieve meaningful robustness certificates. By utilizing an
innovative approach integrating Grover’s algorithm, a quadratic sampling
advantage over classical randomized smoothing is achieved. This strategy
necessitates a basis state encoding, thus restricting the space of meaningful
perturbations. We show how constrained $k$-distant Hamming weight perturbations
are a suitable noise distribution here, and elucidate how they can be
constructed on a quantum computer. The efficacy of the proposed framework is
demonstrated on a time series classification task employing a Bag-of-Words
pre-processing solution. The advantage of quadratic sample reduction is
recovered especially in the regime with a large number of samples. This may allow
quantum computers to efficiently scale randomized smoothing to more complex
tasks beyond the reach of classical methods.
[COMMENTS]
Accepted at the IEEE International Conference on Quantum Computing
and Engineering (QCE)
[LINK]
http://arxiv.org/abs/2407.18021v1
[DATE]
2024-07-25 21:15:16+08:00
[CATEGORIES]
cs.LG
Self-Supervision Improves Diffusion Models for Tabular Data Imputation
[AUTHORS]
Yixin Liu, Thalaiyasingam Ajanthan, Hisham Husain, Vu Nguyen
[ABSTRACT]
The ubiquity of missing data has sparked considerable attention and focus on
tabular data imputation methods. Diffusion models, recognized as the
cutting-edge technique for data generation, demonstrate significant potential
in tabular data imputation tasks. However, in pursuit of diversity, vanilla
diffusion models often exhibit sensitivity to initialized noises, which hinders
the models from generating stable and accurate imputation results.
Additionally, the sparsity inherent in tabular data poses challenges for
diffusion models in accurately modeling the data manifold, impacting the
robustness of these models for data imputation. To tackle these challenges,
this paper introduces an advanced diffusion model named Self-supervised
imputation Diffusion Model (SimpDM for brevity), specifically tailored for
tabular data imputation tasks. To mitigate sensitivity to noise, we introduce a
self-supervised alignment mechanism that aims to regularize the model, ensuring
consistent and stable imputation predictions. Furthermore, we introduce a
carefully devised state-dependent data augmentation strategy within SimpDM,
enhancing the robustness of the diffusion model when dealing with limited data.
Extensive experiments demonstrate that SimpDM matches or outperforms
state-of-the-art imputation methods across various scenarios.
[COMMENTS]
10 pages, 5 figures. Accepted by CIKM 2024
[LINK]
http://arxiv.org/abs/2407.18013v1
[DATE]
2024-07-25 21:06:30+08:00
[CATEGORIES]
cs.LG
Equivariant Ensembles and Regularization for Reinforcement Learning in Map-based Path Planning
[AUTHORS]
Mirco Theile, Hongpeng Cao, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli
[ABSTRACT]
In reinforcement learning (RL), exploiting environmental symmetries can
significantly enhance efficiency, robustness, and performance. However,
ensuring that the deep RL policy and value networks are respectively
equivariant and invariant to exploit these symmetries is a substantial
challenge. Related works try to design networks that are equivariant and
invariant by construction, limiting them to a very restricted library of
components, which in turn hampers the expressiveness of the networks. This
paper proposes a method to construct equivariant policies and invariant value
functions without specialized neural network components, which we term
equivariant ensembles. We further add a regularization term for adding
inductive bias during training. In a map-based path planning case study, we
show how equivariant ensembles and regularization benefit sample efficiency and
performance.
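
One generic way to obtain an equivariant policy and an invariant value function from unconstrained networks is to average over the symmetry group. The sketch below illustrates that idea for a single left-right mirror symmetry; whether the paper's equivariant ensembles are built exactly this way is an assumption, and `base_policy` and `base_value` are hypothetical stand-ins for trained networks.

```python
ACTIONS = ["left", "right", "up", "down"]


def flip_state(grid):
    """Mirror a grid map left-to-right."""
    return [list(reversed(row)) for row in grid]


def flip_action_probs(probs):
    """Mirror the action distribution: swap 'left' and 'right'."""
    left, right, up, down = probs
    return [right, left, up, down]


def base_policy(grid):
    """Hypothetical non-equivariant policy that depends on raw cell positions."""
    flat = [cell for row in grid for cell in row]
    weighted = sum((i + 1) * cell for i, cell in enumerate(flat))
    scores = [1.0 + (((k + 1) * weighted) % 7) for k in range(len(ACTIONS))]
    total = sum(scores)
    return [s / total for s in scores]


def base_value(grid):
    """Hypothetical non-invariant value estimate."""
    flat = [cell for row in grid for cell in row]
    return float(sum((i + 1) * cell for i, cell in enumerate(flat)))


def ensemble_policy(grid):
    """Average the policy on the state and the mirrored-back policy on the mirrored state."""
    direct = base_policy(grid)
    mirrored = flip_action_probs(base_policy(flip_state(grid)))
    return [(a + b) / 2 for a, b in zip(direct, mirrored)]


def ensemble_value(grid):
    """Average the value over both group elements, which makes it invariant."""
    return (base_value(grid) + base_value(flip_state(grid))) / 2


grid = [[1, 0, 0], [0, 0, 2]]
print(ensemble_policy(flip_state(grid)))
print(flip_action_probs(ensemble_policy(grid)))  # matches the line above
```

The mirror is its own inverse, so mapping the action distribution back uses the same swap. Larger groups such as the four grid rotations would need the inverse transformation for each element.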
[COMMENTS]
Accepted at IROS 2024. A video can be found here:
https://youtu.be/L6NOdvU7n7s. The code is available at
https://github.com/theilem/uavSim
[LINK]
http://arxiv.org/abs/2403.12856v2
[DATE]
2024-07-25 20:56:35+08:00
[CATEGORIES]
cs.LG
Network Inversion of Convolutional Neural Nets
[AUTHORS]
Pirzada Suhail, Amit Sethi
[ABSTRACT]
Neural networks have emerged as powerful tools across various applications,
yet their decision-making process often remains opaque, leading to them being
perceived as “black boxes.” This opacity raises concerns about their
interpretability and reliability, especially in safety-critical scenarios.
Network inversion techniques offer a solution by allowing us to peek inside
these black boxes, revealing the features and patterns learned by the networks
behind their decision-making processes and thereby providing valuable insights
into how neural networks arrive at their conclusions, making them more
interpretable and trustworthy. This paper presents a simple yet effective
approach to network inversion using a carefully conditioned generator that
learns the data distribution in the input space of the trained neural network,
enabling the reconstruction of inputs that would most likely lead to the
desired outputs. To capture the diversity in the input space for a given
output, instead of simply revealing the conditioning labels to the generator,
we covertly encode the conditioning label information into vectors, further
exemplified by heavy dropout in the generation process and minimisation of
cosine similarity between the features corresponding to the generated images.
The paper concludes with immediate applications of Network Inversion including
in interpretability, explainability and generation of adversarial samples.
[LINK]
http://arxiv.org/abs/2407.18002v1
[DATE]
2024-07-25 20:53:21+08:00
[CATEGORIES]
cs.LG
iNNspector: Visual, Interactive Deep Model Debugging
[AUTHORS]
Thilo Spinner, Daniel Fürst, Mennatallah El-Assady
[ABSTRACT]
Deep learning model design, development, and debugging is a process driven by
best practices, guidelines, trial-and-error, and the personal experiences of
model developers. At multiple stages of this process, performance and internal
model data can be logged and made available. However, due to the sheer
complexity and scale of this data and process, model developers often resort to
evaluating their model performance based on abstract metrics like accuracy and
loss. We argue that a structured analysis of data along the model’s
architecture and at multiple abstraction levels can considerably streamline the
debugging process. Such a systematic analysis can further connect the
developer’s design choices to their impacts on the model behavior, facilitating
the understanding, diagnosis, and refinement of deep learning models. Hence, in
this paper, we (1) contribute a conceptual framework structuring the data space
of deep learning experiments. Our framework, grounded in literature analysis
and requirements interviews, captures design dimensions and proposes mechanisms
to make this data explorable and tractable. To operationalize our framework in
a ready-to-use application, we (2) present the iNNspector system. iNNspector
enables tracking of deep learning experiments and provides interactive
visualizations of the data on all levels of abstraction from multiple models to
individual neurons. Finally, we (3) evaluate our approach with three real-world
use-cases and a user study with deep learning developers and data analysts,
proving its effectiveness and usability.
[COMMENTS]
41 pages paper, 4 pages references, 3 pages appendix, 19 figures, 2
tables
[LINK]
http://arxiv.org/abs/2407.17998v1
[DATE]
2024-07-25 20:48:41+08:00
[CATEGORIES]
cs.LG
Amortized Active Learning for Nonparametric Functions
[AUTHORS]
Cen-You Li, Marc Toussaint, Barbara Rakitsch, Christoph Zimmer
[ABSTRACT]
Active learning (AL) is a sequential learning scheme aiming to select the
most informative data. AL reduces data consumption and avoids the cost of
labeling large amounts of data. However, AL trains the model and solves an
acquisition optimization for each selection. It becomes expensive when the
model training or acquisition optimization is challenging. In this paper, we
focus on active nonparametric function learning, where the gold standard
Gaussian process (GP) approaches suffer from cubic time complexity. We propose
an amortized AL method, where new data are suggested by a neural network which
is trained up-front without any real data (Figure 1). Our method avoids
repeated model training and requires no acquisition optimization during the AL
deployment. We (i) utilize GPs as function priors to construct an AL simulator,
(ii) train an AL policy that can zero-shot generalize from simulation to real
learning problems of nonparametric functions and (iii) achieve real-time data
selection and comparable learning performances to time-consuming baseline
methods.
[LINK]
http://arxiv.org/abs/2407.17992v1
[DATE]
2024-07-25 20:38:08+08:00
[CATEGORIES]
cs.LG
Expressivity and Generalization: Fragment-Biases for Molecular GNNs
[AUTHORS]
Tom Wollschläger, Niklas Kemper, Leon Hetzel, Johanna Sommer, Stephan Günnemann
[ABSTRACT]
Although recent advances in higher-order Graph Neural Networks (GNNs) improve
the theoretical expressiveness and molecular property predictive performance,
they often fall short of the empirical performance of models that explicitly
use fragment information as inductive bias. However, for these approaches,
there exists no theoretic expressivity study. In this work, we propose the
Fragment-WL test, an extension to the well-known Weisfeiler & Leman (WL) test,
which enables the theoretic analysis of these fragment-biased GNNs. Building on
the insights gained from the Fragment-WL test, we develop a new GNN
architecture and a fragmentation with infinite vocabulary that significantly
boosts expressiveness. We show the effectiveness of our model on synthetic and
real-world data where we outperform all GNNs on Peptides and have 12% lower
error than all GNNs on ZINC and 34% lower error than other fragment-biased
models. Furthermore, we show that our model exhibits superior generalization
capabilities compared to the latest transformer-based architectures,
positioning it as a robust solution for a range of molecular modeling tasks.
[LINK]
http://arxiv.org/abs/2406.08210v2
[DATE]
2024-07-25 20:23:26+08:00
[CATEGORIES]
cs.LG
Particle identification with machine learning from incomplete data in the ALICE experiment
[AUTHORS]
Maja Karwowska, Łukasz Graczykowski, Kamil Deja, Miłosz Kasak, Małgorzata Janik
[ABSTRACT]
The ALICE experiment at the LHC measures properties of the strongly
interacting matter formed in ultrarelativistic heavy-ion collisions. Such
studies require accurate particle identification (PID). ALICE provides PID
information via several detectors for particles with momentum from about 100
MeV/c up to 20 GeV/c. Traditionally, particles are selected with rectangular
cuts. A much better performance can be achieved with machine learning (ML)
methods. Our solution uses multiple neural networks (NN) serving as binary
classifiers. Moreover, we extended our particle classifier with Feature Set
Embedding and attention in order to train on data with incomplete samples. We
also present the integration of the ML project with the ALICE analysis
software, and we discuss domain adaptation, the ML technique needed to transfer
the knowledge between simulated and real experimental data.
[COMMENTS]
Proceedings of 3rd Artificial Intelligence for the Electron-Ion
Collider workshop – AI4EIC2023, 28.11-1.12.2023
[LINK]
http://arxiv.org/abs/2403.17436v3
[DATE]
2024-07-25 19:51:04+08:00
[CATEGORIES]
cs.LG
Relating the Seemingly Unrelated: Principled Understanding of Generalization for Generative Models in Arithmetic Reasoning Tasks
[AUTHORS]
Xingcheng Xu, Zibo Zhao, Haipeng Zhang, Yanqing Yang
[ABSTRACT]
Large language models (LLMs) have demonstrated impressive versatility across
numerous tasks, yet their generalization capabilities remain poorly understood.
To investigate these behaviors, arithmetic tasks serve as important venues. In
previous studies, seemingly unrelated mysteries still exist – (1) models with
appropriate positional embeddings can correctly perform longer unseen
arithmetic operations such as addition, but their effectiveness varies in more
complex tasks like multiplication; (2) models perform well for longer unseen
cases in modular addition under specific moduli (e.g., modulo 100) but struggle
under very close moduli (e.g., modulo 101), regardless of the positional
encoding used. We believe previous studies have been treating the symptoms
rather than addressing the root cause – they have paid excessive attention to
improving model components, while overlooking the differences in task
properties that may be the real drivers. This is confirmed by our unified
theoretical framework for different arithmetic scenarios. For example, unlike
multiplication, the digital addition task has the property of translation
invariance which naturally aligns with the relative positional encoding, and
this combination leads to successful generalization of addition to unseen
longer domains. The discrepancy in operations modulo 100 and 101 arises from
the base. Modulo 100, unlike 101, is compatible with the decimal system (base
10), such that unseen information in digits beyond the units digit and the tens
digit is actually not needed for the task. Extensive experiments with GPT-like
models validate our theoretical predictions. These findings deepen our
understanding of the generalization mechanisms, and facilitate more
data-efficient model training and objective-oriented AI alignment.
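
The base-compatibility argument for modulo 100 versus modulo 101 can be verified with plain arithmetic. The sketch below is an illustration only; the function names are hypothetical. It compares the true modular sum with a version that sees only the tens and units digits of each operand.

```python
def add_mod(a, b, m):
    """Reference modular addition using the full operands."""
    return (a + b) % m


def add_mod_last_two_digits(a, b, m):
    """Same task, but only the tens and units digits of each operand are visible."""
    return ((a % 100) + (b % 100)) % m


def truncation_is_lossless(m, limit=500):
    """Check whether dropping the higher digits ever changes the answer."""
    return all(
        add_mod(a, b, m) == add_mod_last_two_digits(a, b, m)
        for a in range(limit)
        for b in range(limit)
    )


print(truncation_is_lossless(100))  # True: higher digits are irrelevant
print(truncation_is_lossless(101))  # False: e.g. a=100, b=0 gives 100 vs 0
```

For modulus 100 the higher digits never matter, so a model that generalizes to longer operands needs no unseen information. For modulus 101 every digit contributes to the result.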
[LINK]
http://arxiv.org/abs/2407.17963v1
[DATE]
2024-07-25 19:35:22+08:00
[CATEGORIES]
cs.LG
Neural Networks for Generating Better Local Optima in Topology Optimization
[AUTHORS]
Leon Herrmann, Ole Sigmund, Viola Muning Li, Christian Vogl, Stefan Kollmannsberger
[ABSTRACT]
Neural networks have recently been employed as material discretizations
within adjoint optimization frameworks for inverse problems and topology
optimization. While advantageous regularization effects and better optima have
been found for some inverse problems, the benefit for topology optimization has
been limited – where the focus of investigations has been the compliance
problem. We demonstrate how neural network material discretizations can, under
certain conditions, find better local optima in more challenging optimization
problems, where we here specifically consider acoustic topology optimization.
The chances of identifying a better optimum can significantly be improved by
running multiple partial optimizations with different neural network
initializations. Furthermore, we show that the neural network material
discretization’s advantage comes from the interplay with the Adam optimizer and
emphasize its current limitations when competing with constrained and
higher-order optimization techniques. At the moment, this discretization has
only been shown to be beneficial for unconstrained first-order optimization.
[LINK]
http://arxiv.org/abs/2407.17957v1
[DATE]
2024-07-25 19:24:44+08:00
[CATEGORIES]
cs.LG
A unified law of robustness for Bregman divergence losses
[AUTHORS]
Santanu Das, Jatin Batra, Piyush Srivastava
[ABSTRACT]
In contemporary deep learning practice, models are often trained to near zero
loss i.e. to nearly interpolate the training data. However, the number of
parameters in the model is usually far more than the number of data points $n$,
the theoretical minimum needed for interpolation: a phenomenon referred to as
overparameterization. In an interesting piece of work that contributes to the
considerable research that has been devoted to understanding overparameterization,
Bubeck and Sellke showed that for a broad class of covariate distributions
(specifically those satisfying a natural notion of concentration of measure),
overparameterization is necessary for robust interpolation i.e. if the
interpolating function is required to be Lipschitz. However, their robustness
results were proved only in the setting of regression with square loss. In
practice, however, many other kinds of losses are used, e.g. cross-entropy loss
for classification. In this work, we generalize Bubeck and Sellke’s result to
Bregman divergence losses, which form a common generalization of square loss
and cross-entropy loss. Our generalization relies on identifying a
bias-variance type decomposition that lies at the heart of the proof of Bubeck
and Sellke.
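
To make the claim that Bregman divergences generalize both losses concrete, the sketch below evaluates the standard definition D(p, q) = phi(p) - phi(q) - <grad phi(q), p - q> for two generators. With the squared norm it yields squared Euclidean distance, and with negative entropy and a one-hot target it yields the cross-entropy loss. Function names are illustrative, and `q` is assumed strictly positive.

```python
import math


def bregman(phi, grad_phi, p, q):
    """Bregman divergence generated by the convex function phi."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - inner


def sq_norm(x):
    return sum(v * v for v in x)


def sq_norm_grad(x):
    return [2 * v for v in x]


def neg_entropy(x):
    # 0 * log(0) is taken as 0, so zero entries are skipped.
    return sum(v * math.log(v) for v in x if v > 0)


def neg_entropy_grad(x):
    # Gradient is only evaluated at q, which is assumed strictly positive.
    return [math.log(v) + 1 for v in x]


# Square loss: D(p, q) = ||p - q||^2
print(bregman(sq_norm, sq_norm_grad, [1.0, 2.0], [3.0, 5.0]))

# Cross-entropy: one-hot target p, predicted distribution q
one_hot = [0.0, 1.0, 0.0]
pred = [0.2, 0.5, 0.3]
print(bregman(neg_entropy, neg_entropy_grad, one_hot, pred), -math.log(0.5))
```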
[COMMENTS]
18 pages
[LINK]
http://arxiv.org/abs/2405.16639v2
[DATE]
2024-07-25 19:21:50+08:00
[CATEGORIES]
cs.LG
Scaling Training Data with Lossy Image Compression
[AUTHORS]
Katherine L. Mentzer, Andrea Montanari
[ABSTRACT]
Empirically-determined scaling laws have been broadly successful in
predicting the evolution of large machine learning models with training data
and number of parameters. As a consequence, they have been useful for
optimizing the allocation of limited resources, most notably compute time.
In certain applications, storage space is an important constraint, and data
format needs to be chosen carefully as a consequence. Computer vision is a
prominent example: images are inherently analog, but are always stored in a
digital format using a finite number of bits. Given a dataset of digital
images, the number of bits $L$ to store each of them can be further reduced
using lossy data compression. This, however, can degrade the quality of the
model trained on such images, since each example has lower resolution.
In order to capture this trade-off and optimize storage of training data, we
propose a ‘storage scaling law’ that describes the joint evolution of test
error with sample size and number of bits per image. We prove that this law
holds within a stylized model for image compression, and verify it empirically
on two computer vision tasks, extracting the relevant parameters. We then show
that this law can be used to optimize the lossy compression level. At given
storage, models trained on optimally compressed images present a significantly
smaller test error than models trained on the original data.
Finally, we investigate the potential benefits of randomizing the compression
level.
[COMMENTS]
21 pages, 27 figures
[LINK]
http://arxiv.org/abs/2407.17954v1
[DATE]
2024-07-25 19:19:55+08:00
[CATEGORIES]
cs.LG
Fast convergence of the Expectation Maximization algorithm under a logarithmic Sobolev inequality
[AUTHORS]
Rocco Caprio, Adam M Johansen
[ABSTRACT]
By utilizing recently developed tools for constructing gradient flows on
Wasserstein spaces, we extend an analysis technique commonly employed to
understand alternating minimization algorithms on Euclidean space to the
Expectation Maximization (EM) algorithm via its representation as
coordinate-wise minimization on the product of a Euclidean space and a space of
probability distributions due to Neal and Hinton (1998). In so doing we obtain
finite sample error bounds and exponential convergence of the EM algorithm
under a natural generalisation of a log-Sobolev inequality. We further
demonstrate that the analysis technique is sufficiently flexible to allow also
the analysis of several variants of the EM algorithm.
[LINK]
http://arxiv.org/abs/2407.17949v1
[DATE]
2024-07-25 19:08:53+08:00
[CATEGORIES]
cs.LG
Comparison of different Artificial Neural Networks for Bitcoin price forecasting
[AUTHORS]
Silas Baumann, Karl A. Busch, Hamza A. A. Gardi
[ABSTRACT]
This study investigates the impact of varying sequence lengths on the
accuracy of predicting cryptocurrency returns using Artificial Neural Networks
(ANNs). Utilizing the Mean Absolute Error (MAE) as a threshold criterion, we
aim to enhance prediction accuracy by excluding returns that are smaller than
this threshold, thus mitigating errors associated with minor returns. The
subsequent evaluation focuses on the accuracy of predicted returns that exceed
this threshold. We compare four sequence lengths: 168 hours (7 days), 72 hours
(3 days), 24 hours, and 12 hours, each with a return prediction interval of 2
hours. Our findings reveal the influence of sequence length on prediction
accuracy and underscore the potential for optimized sequence configurations in
financial forecasting models.
[COMMENTS]
9 pages, 8 figures, 2 tables
[LINK]
http://arxiv.org/abs/2407.17930v1
[DATE]
2024-07-25 18:39:50+08:00
[CATEGORIES]
cs.LG
Improving probabilistic forecasts of extreme wind speeds by training statistical post-processing models with weighted scoring rules
[AUTHORS]
Jakob Benjamin Wessel, Christopher A. T. Ferro, Gavin R. Evans, Frank Kwasniok
[ABSTRACT]
Accurate forecasts of extreme wind speeds are of high importance for many
applications. Such forecasts are usually generated by ensembles of numerical
weather prediction (NWP) models, which however can be biased and have errors in
dispersion, thus necessitating the application of statistical post-processing
techniques. In this work we aim to improve statistical post-processing models
for probabilistic predictions of extreme wind speeds. We do this by adjusting
the training procedure used to fit ensemble model output statistics (EMOS)
models - a commonly applied post-processing technique - and propose estimating
parameters using the so-called threshold-weighted continuous ranked probability
score (twCRPS), a proper scoring rule that places special emphasis on
predictions over a threshold. We show that training using the twCRPS leads to
improved extreme event performance of post-processing models for a variety of
thresholds. We find a distribution body-tail trade-off where improved
performance for probabilistic predictions of extreme events comes with worse
performance for predictions of the distribution body. However, we introduce
strategies to mitigate this trade-off based on weighted training and linear
pooling. Finally, we consider some synthetic experiments to explain the
training impact of the twCRPS and derive closed-form expressions of the twCRPS
for a number of distributions, giving the first such collection in the
literature. The results will enable researchers and practitioners alike to
improve the performance of probabilistic forecasting models for extremes and
other events of interest.
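
For intuition about the twCRPS, the sketch below computes it for an ensemble forecast using the indicator weight w(z) = 1{z >= t}. It relies on the known identity that this weighting is equivalent to applying the ordinary CRPS after clipping forecast and observation from below at t. This is an illustration with hypothetical function names, not the closed-form expressions or the EMOS training procedure from the paper.

```python
def crps_ensemble(members, y):
    """Standard ensemble estimator: E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    term1 = sum(abs(x - y) for x in members) / m
    term2 = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term1 - term2


def tw_crps_ensemble(members, y, threshold):
    """twCRPS with weight 1{z >= threshold}, via the chaining function max(z, threshold)."""
    def chain(z):
        return max(z, threshold)

    return crps_ensemble([chain(x) for x in members], chain(y))


wind_members = [1.0, 2.0, 3.0]   # toy ensemble forecast
observation = 2.0

print(crps_ensemble(wind_members, observation))
print(tw_crps_ensemble(wind_members, observation, threshold=2.5))
```

With a threshold below all values the score reduces to the ordinary CRPS, and with a threshold above all values it is zero, which reflects the emphasis on behaviour above the threshold.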
[LINK]
http://arxiv.org/abs/2407.15900v2
[DATE]
2024-07-25 18:39:15+08:00
[CATEGORIES]
cs.LG
Guided Latent Slot Diffusion for Object-Centric Learning
[AUTHORS]
Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth
[ABSTRACT]
Slot attention aims to decompose an input image into a set of meaningful
object files (slots). These latent object representations enable various
downstream tasks. Yet, these slots often bind to object parts, not objects
themselves, especially for real-world datasets. To address this, we introduce
Guided Latent Slot Diffusion - GLASS, an object-centric model that uses
generated captions as a guiding signal to better align slots with objects. Our
key insight is to learn the slot-attention module in the space of generated
images. This allows us to repurpose the pre-trained diffusion decoder model,
which reconstructs the images from the slots, as a semantic mask generator
based on the generated captions. GLASS learns an object-level representation
suitable for multiple tasks simultaneously, e.g., segmentation, image
generation, and property prediction, outperforming previous methods. For object
discovery, GLASS achieves relative mIoU improvements of approximately 35% and 10%
over the previous state-of-the-art (SOTA) method on the VOC and COCO datasets,
respectively, and establishes a new SOTA FID score for conditional image
generation amongst slot-attention-based methods. For the segmentation task,
GLASS surpasses SOTA weakly-supervised and language-based segmentation models,
which were specifically designed for the task.
[COMMENTS]
Project Page: https://guided-sa.github.io
[LINK]
http://arxiv.org/abs/2407.17929v1
[DATE]
2024-07-25 18:38:32+08:00
[CATEGORIES]
cs.LG
Detection of Correlated Random Vectors
[AUTHORS]
Dor Elimelech, Wasim Huleihel
[ABSTRACT]
In this paper, we investigate the problem of deciding whether two standard
normal random vectors $\mathsf{X}\in\mathbb{R}^{n}$ and
$\mathsf{Y}\in\mathbb{R}^{n}$ are correlated or not. This is formulated as a
hypothesis testing problem, where under the null hypothesis, these vectors are
statistically independent, while under the alternative, $\mathsf{X}$ and a
randomly and uniformly permuted version of $\mathsf{Y}$, are correlated with
correlation $\rho$. We analyze the thresholds at which optimal testing is
information-theoretically impossible and possible, as a function of $n$ and
$\rho$. To derive our information-theoretic lower bounds, we develop a novel
technique for evaluating the second moment of the likelihood ratio using an
orthogonal polynomials expansion, which among other things, reveals a
surprising connection to integer partition functions. We also study a
multi-dimensional generalization of the above setting, where rather than two
vectors we observe two databases/matrices, and furthermore allow for partial
correlations between these two.
[COMMENTS]
42 pages
[LINK]
http://arxiv.org/abs/2401.13429v3
[DATE]
2024-07-25 18:15:51+08:00
[CATEGORIES]
cs.LG
Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-Snapshots
[AUTHORS]
Wei Hung, Bo-Kai Huang, Ping-Chun Hsieh, Xi Liu
[ABSTRACT]
Many real-world continuous control problems face the dilemma of weighing
pros and cons; multi-objective reinforcement learning (MORL) serves as a
generic framework of learning control policies for different preferences over
objectives. However, the existing MORL methods either rely on multiple passes
of explicit search for finding the Pareto front and therefore are not
sample-efficient, or utilize a shared policy network for coarse knowledge
sharing among policies. To boost the sample efficiency of MORL, we propose
Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots
to jointly determine the policy update direction and thereby enables data
sharing at the policy level. We show that Q-Pensieve can be naturally
integrated with soft policy iteration with convergence guarantee. To
substantiate this concept, we propose the technique of Q replay buffer, which
stores the learned Q-networks from the past iterations, and arrive at a
practical actor-critic implementation. Through extensive experiments and an
ablation study, we demonstrate that with far fewer samples, the proposed
algorithm can outperform the benchmark MORL methods on a variety of MORL
benchmark tasks.
[COMMENTS]
20 pages, 15 figures
[LINK]
http://arxiv.org/abs/2212.03117v2
[DATE]
2024-07-25 18:11:29+08:00
[CATEGORIES]
cs.LG
Causal Deepsets for Off-policy Evaluation under Spatial or Spatio-temporal Interferences
[AUTHORS]
Runpeng Dai, Jianing Wang, Fan Zhou, Shikai Luo, Zhiwei Qin, Chengchun Shi, Hongtu Zhu
[ABSTRACT]
Off-policy evaluation (OPE) is widely applied in sectors such as
pharmaceuticals and e-commerce to evaluate the efficacy of novel products or
policies from offline datasets. This paper introduces a causal deepset
framework that relaxes several key structural assumptions, primarily the
mean-field assumption, prevalent in existing OPE methodologies that handle
spatio-temporal interference. These traditional assumptions frequently prove
inadequate in real-world settings, thereby restricting the capability of
current OPE methods to effectively address complex interference effects. In
response, we advocate for the implementation of the permutation invariance (PI)
assumption. This innovative approach enables the data-driven, adaptive learning
of the mean-field function, offering a more flexible estimation method beyond
conventional averaging. Furthermore, we present novel algorithms that
incorporate the PI assumption into OPE and thoroughly examine their theoretical
foundations. Our numerical analyses demonstrate that this novel approach yields
significantly more precise estimations than existing baseline algorithms,
thereby substantially improving the practical applicability and effectiveness
of OPE methodologies. A Python implementation of our proposed method is
available at https://github.com/BIG-S2/Causal-Deepsets.
[LINK]
http://arxiv.org/abs/2407.17910v1
[DATE]
2024-07-25 18:02:11+08:00
[CATEGORIES]
cs.LG
Separating Novel Features for Logical Anomaly Detection: A Straightforward yet Effective Approach
[AUTHORS]
Kangil Lee, Geonuk Kim
[ABSTRACT]
Vision-based inspection algorithms have significantly contributed to quality
control in industrial settings, particularly in addressing structural defects
such as dents and contamination, which are prevalent in mass production. Extensive
research efforts have led to the development of related benchmarks such as
MVTec AD (Bergmann et al., 2019). However, in industrial settings, there can be
instances of logical defects, where acceptable items are found in unsuitable
locations or product pairs do not match as expected. Recent methods tackling
logical defects effectively employ knowledge distillation to generate
difference maps. Knowledge distillation (KD) is used to learn normal data
distribution in an unsupervised manner. Despite their effectiveness, these methods
often overlook potential false negatives. Excessive similarity between the
teacher network and student network can hinder the generation of a suitable
difference map for logical anomaly detection. This technical report provides
insights on handling potential false negatives by utilizing a simple constraint
in KD-based logical anomaly detection methods. We select EfficientAD as a
state-of-the-art baseline and apply a margin-based constraint to its
unsupervised learning scheme. Applying this constraint, we can improve the
AUROC on MVTec LOCO AD by 1.3%.
[LINK]
http://arxiv.org/abs/2407.17909v1
[DATE]
2024-07-25 18:00:21+08:00
[CATEGORIES]
cs.LG
Amortized Posterior Sampling with Diffusion Prior Distillation
[AUTHORS]
Abbas Mammadov, Hyungjin Chung, Jong Chul Ye
[ABSTRACT]
We propose a variational inference approach to sample from the posterior
distribution for solving inverse problems. From a pre-trained diffusion model,
our approach trains a conditional flow model to minimize the divergence between
the proposal variational distribution and the posterior distribution implicitly
defined through the diffusion model. Once trained, the flow model is capable of
sampling from the posterior distribution with a single NFE, amortized with
respect to the measurement. The proposed method paves a new path for distilling
a diffusion prior for efficient posterior sampling. We show that our method is
applicable to standard signals in Euclidean space, as well as signals on
manifolds.
[LINK]
http://arxiv.org/abs/2407.17907v1
[DATE]
2024-07-25 17:53:12+08:00
[CATEGORIES]
cs.LG
Neural Fractional Differential Equations
[AUTHORS]
C. Coelho, M. Fernanda P. Costa, L. L. Ferrás
[ABSTRACT]
Fractional Differential Equations (FDEs) are essential tools for modelling
complex systems in science and engineering. They extend the traditional
concepts of differentiation and integration to non-integer orders, enabling a
more precise representation of processes characterised by non-local and
memory-dependent behaviours.
This property is useful in systems where variables do not respond to changes
instantaneously, but instead exhibit a strong memory of past interactions.
Having this in mind, and drawing inspiration from Neural Ordinary
Differential Equations (Neural ODEs), we propose the Neural FDE, a novel deep
neural network architecture that adjusts an FDE to the dynamics of data.
This work provides a comprehensive overview of the numerical method employed
in Neural FDEs and the Neural FDE architecture. The numerical outcomes suggest
that, despite being more computationally demanding, the Neural FDE may
outperform the Neural ODE in modelling systems with memory or dependencies on
past states, and it can effectively be applied to learn more intricate
dynamical systems.
[LINK]
http://arxiv.org/abs/2403.02737v2
[DATE]
2024-07-25 17:18:24+08:00
[CATEGORIES]
cs.LG
DAM: Towards A Foundation Model for Time Series Forecasting
[AUTHORS]
Luke Darlow, Qiwen Deng, Ahmed Hassan, Martin Asenov, Rajkarn Singh, Artjom Joosen, Adam Barker, Amos Storkey
[ABSTRACT]
It is challenging to scale time series forecasting models such that they
forecast accurately for multiple distinct domains and datasets, all with
potentially different underlying collection procedures (e.g., sample
resolution), patterns (e.g., periodicity), and prediction requirements (e.g.,
reconstruction vs. forecasting). We call this general task universal
forecasting. Existing methods usually assume that input data is regularly
sampled, and they forecast to pre-determined horizons, resulting in failure to
generalise outside of the scope of their training. We propose the DAM - a
neural model that takes randomly sampled histories and outputs an adjustable
basis composition as a continuous function of time for forecasting to non-fixed
horizons. It involves three key components: (1) a flexible approach for using
randomly sampled histories from a long-tail distribution, that enables an
efficient global perspective of the underlying temporal dynamics while
retaining focus on the recent history; (2) a transformer backbone that is
trained on these actively sampled histories to produce, as representational
output, (3) the basis coefficients of a continuous function of time. We show
that a single univariate DAM, trained on 25 time series datasets, either
outperformed or closely matched existing SoTA models at multivariate long-term
forecasting across 18 datasets, including 8 held-out for zero-shot transfer,
even though these models were trained to specialise for each dataset-horizon
combination. This single DAM excels at zero-shot transfer and very-long-term
forecasting, performs well at imputation, is interpretable via basis function
composition and attention, can be tuned for different inference-cost
requirements, and is robust to missing and irregularly sampled data by design.
[LINK]
http://arxiv.org/abs/2407.17880v1
[DATE]
2024-07-25 16:48:07+08:00
[CATEGORIES]
cs.LG
Batchless Normalization: How to Normalize Activations Across Instances with Minimal Memory Requirements
[AUTHORS]
Benjamin Berger, Victor Uc Cetina
[ABSTRACT]
In training neural networks, batch normalization has many benefits, not all
of them entirely understood. But it also has some drawbacks. Foremost is
arguably memory consumption, as computing the batch statistics requires all
instances within the batch to be processed simultaneously, whereas without
batch normalization it would be possible to process them one by one while
accumulating the weight gradients. Another drawback is that the distribution
parameters (mean and standard deviation) are unlike all other model parameters
in that they are not trained using gradient descent but require special
treatment, complicating implementation. In this paper, I show a simple and
straightforward way to address these issues. The idea, in short, is to add
terms to the loss that, for each activation, cause the minimization of the
negative log likelihood of a Gaussian distribution that is used to normalize
the activation. Among other benefits, this will hopefully contribute to the
democratization of AI research by means of lowering the hardware requirements
for training larger models.
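
The loss term described above can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the log-sigma parameterization, and the `weight` factor are assumptions made for illustration. Each activation is scored by the negative log-likelihood of a Gaussian whose mean and standard deviation are ordinary trainable values, and the same Gaussian is used to normalize the activation, so no batch statistics are required.

```python
import math


def gaussian_nll(x, mu, log_sigma):
    """Negative log-likelihood of a scalar x under N(mu, sigma^2).

    sigma is stored as log_sigma (an assumption of this sketch) so that
    it stays positive under unconstrained gradient updates.
    """
    z = (x - mu) / math.exp(log_sigma)
    return 0.5 * math.log(2.0 * math.pi) + log_sigma + 0.5 * z * z


def normalize(x, mu, log_sigma):
    """Normalize one activation with the learned Gaussian parameters."""
    return (x - mu) / math.exp(log_sigma)


def total_loss(task_loss, activations, mu, log_sigma, weight=1.0):
    """Add the mean per-activation NLL term to a task loss.

    `weight` is a hypothetical trade-off factor, not taken from the paper.
    """
    nll = sum(gaussian_nll(a, mu, log_sigma) for a in activations)
    return task_loss + weight * nll / len(activations)
```

Because the term is a sum over individual activations, instances can in principle be processed one at a time while gradients are accumulated, which is the memory benefit the abstract points to.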
[COMMENTS]
17 pages (12 without appendices), 12 figures, 5 tables
[LINK]
http://arxiv.org/abs/2212.14729v2
[DATE]
2024-07-25 16:34:58+08:00
[CATEGORIES]
cs.LG
Unsupervised Outlier Detection using Random Subspace and Subsampling Ensembles of Dirichlet Process Mixtures
[AUTHORS]
Dongwook Kim, Juyeon Park, Hee Cheol Chung, Seonghyun Jeong
[ABSTRACT]
Probabilistic mixture models are recognized as effective tools for
unsupervised outlier detection owing to their interpretability and global
characteristics. Among these, Dirichlet process mixture models stand out as a
strong alternative to conventional finite mixture models for both clustering
and outlier detection tasks. Unlike finite mixture models, Dirichlet process
mixtures are infinite mixture models that automatically determine the number of
mixture components based on the data. Despite their advantages, the adoption of
Dirichlet process mixture models for unsupervised outlier detection has been
limited by challenges related to computational inefficiency and sensitivity to
outliers in the construction of outlier detectors. Additionally, Dirichlet
process Gaussian mixtures struggle to effectively model non-Gaussian data with
discrete or binary features. To address these challenges, we propose a novel
outlier detection method that utilizes ensembles of Dirichlet process Gaussian
mixtures. This unsupervised algorithm employs random subspace and subsampling
ensembles to ensure efficient computation and improve the robustness of the
outlier detector. The ensemble approach further improves the suitability of the
proposed method for detecting outliers in non-Gaussian data. Furthermore, our
method uses variational inference for Dirichlet process mixtures, which ensures
both efficient and rapid computation. Empirical analyses using benchmark
datasets demonstrate that our method outperforms existing approaches in
unsupervised outlier detection.
[LINK]
http://arxiv.org/abs/2401.00773v3
[DATE]
2024-07-25 16:13:27+08:00
[CATEGORIES]
cs.LG
COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images
[AUTHORS]
Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman
[ABSTRACT]
Deep learning is dramatically transforming the field of medical imaging and
radiology, enabling the identification of pathologies in medical images,
including computed tomography (CT) and X-ray scans. However, the performance of
deep learning models, particularly in segmentation tasks, is often limited by
the need for extensive annotated datasets. To address this challenge, the
capabilities of weakly supervised semantic segmentation are explored through
the lens of Explainable AI and the generation of counterfactual explanations.
The scope of this research is the development of a novel counterfactual inpainting
approach (COIN) that flips the predicted classification label from abnormal to
normal by using a generative model. For instance, if the classifier deems an
input medical image X as abnormal, indicating the presence of a pathology, the
generative model aims to inpaint the abnormal region, thus reversing the
classifier’s original prediction label. The approach enables us to produce
precise segmentations for pathologies without depending on pre-existing
segmentation masks. Crucially, image-level labels are utilized, which are
substantially easier to acquire than creating detailed segmentation masks. The
effectiveness of the method is demonstrated by segmenting synthetic targets and
actual kidney tumors from CT images acquired from Tartu University Hospital in
Estonia. The findings indicate that COIN greatly surpasses established
attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an
alternative counterfactual explanation method introduced by Singla et al. This
evidence suggests that COIN is a promising approach for semantic segmentation
of tumors in CT images, and presents a step forward in making deep learning
applications more accessible and effective in healthcare, where annotated data
is scarce.
[COMMENTS]
This work has been accepted to be presented to The 2nd World
Conference on eXplainable Artificial Intelligence (xAI 2024), July 17-19,
2024 - Valletta, Malta
[LINK]
http://arxiv.org/abs/2404.12832v2
[DATE]
2024-07-25 16:09:12+08:00
[CATEGORIES]
cs.LG
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
[AUTHORS]
Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai
[COMMENTS]
18th USENIX Symposium on Operating Systems Design and Implementation
[LINK]
http://arxiv.org/abs/2401.14351v2
[DATE]
2024-07-25 16:08:11+08:00
[CATEGORIES]
cs.LG
Nyström Kernel Stein Discrepancy
[AUTHORS]
Florian Kalinke, Zoltan Szabo, Bharath K. Sriperumbudur
[ABSTRACT]
Kernel methods underpin many of the most successful approaches in data
science and statistics, and they allow representing probability measures as
elements of a reproducing kernel Hilbert space without loss of information.
Recently, the kernel Stein discrepancy (KSD), which combines Stein’s method
with kernel techniques, gained considerable attention. Through the Stein
operator, KSD allows the construction of powerful goodness-of-fit tests where
it is sufficient to know the target distribution up to a multiplicative
constant. However, the typical U- and V-statistic-based KSD estimators suffer
from a quadratic runtime complexity, which hinders their application in
large-scale settings. In this work, we propose a Nyström-based KSD
acceleration, with runtime $\mathcal{O}\left(mn+m^3\right)$ for $n$ samples
and $m\ll n$ Nyström points, show its $\sqrt{n}$-consistency under the
null with a classical sub-Gaussian assumption, and demonstrate its
applicability for goodness-of-fit testing on a suite of benchmarks.
[COMMENTS]
Update proof of Lemma B.3, milder Assumption 1, more experiments
[LINK]
http://arxiv.org/abs/2406.08401v2
[DATE]
2024-07-25 16:01:32+08:00
[CATEGORIES]
cs.LG
Enhancing Counterfactual Explanation Search with Diffusion Distance and Directional Coherence
[AUTHORS]
Marharyta Domnich, Raul Vicente
[ABSTRACT]
A pressing issue in the adoption of AI models is the increasing demand for
more human-centric explanations of their predictions. To advance towards more
human-centric explanations, understanding how humans produce and select
explanations has been beneficial. In this work, inspired by insights from human
cognition, we propose and test the incorporation of two novel biases to enhance
the search for effective counterfactual explanations. Central to our
methodology is the application of diffusion distance, which emphasizes data
connectivity and actionability in the search for feasible counterfactual
explanations. In particular, diffusion distance gives more weight to
points that are more interconnected by numerous short paths. This
approach brings closely connected points nearer to each other, identifying a
feasible path between them. We also introduce a directional coherence term that
allows the expression of a preference for the alignment between the joint and
marginal directional changes in feature space to reach a counterfactual. This
term enables the generation of counterfactual explanations that align with a
set of marginal predictions based on expectations of how the outcome of the
model varies by changing one feature at a time. We evaluate our method, named
Coherent Directional Counterfactual Explainer (CoDiCE), and the impact of the
two novel biases against existing methods such as DiCE, FACE, Prototypes, and
Growing Spheres. Through a series of ablation experiments on both synthetic and
real datasets with continuous and mixed-type features, we demonstrate the
effectiveness of our method.
[COMMENTS]
This work has been accepted to be presented to The 2nd World
Conference on eXplainable Artificial Intelligence (xAI 2024), July 17-19,
2024 - Valletta, Malta
[LINK]
http://arxiv.org/abs/2404.12810v2
[DATE]
2024-07-25 16:00:44+08:00
[CATEGORIES]
cs.LG
Long-term Fairness in Ride-Hailing Platform
[AUTHORS]
Yufan Kang, Jeffrey Chan, Wei Shao, Flora D. Salim, Christopher Leckie
[ABSTRACT]
Matching in two-sided markets such as ride-hailing has recently received
significant attention. However, existing studies on ride-hailing mainly focus
on optimising efficiency, and fairness issues in ride-hailing have been
neglected. Fairness issues in ride-hailing, including significant earning
differences between drivers and variance of passenger waiting times among
different locations, have potential impacts on economic and ethical aspects.
The recent studies that focus on fairness in ride-hailing exploit traditional
optimisation methods and the Markov Decision Process to balance efficiency and
fairness. However, there are several issues in these existing studies, such as
myopic short-term decision-making from traditional optimisation and instability
of fairness in a comparably longer horizon from both traditional optimisation
and Markov Decision Process-based methods. To address these issues, we propose
a dynamic Markov Decision Process model to alleviate fairness issues currently
faced by ride-hailing, and seek a balance between efficiency and fairness, with
two distinct characteristics: (i) a prediction module to predict the number of
requests that will be raised in the future from different locations to allow
the proposed method to consider long-term fairness based on the whole timeline
instead of considering fairness only based on historical and current data
patterns; (ii) a customised scalarisation function for multi-objective
multi-agent Q Learning that aims to balance efficiency and fairness. Extensive
experiments on a publicly available real-world dataset demonstrate that our
proposed method outperforms existing state-of-the-art methods.
[COMMENTS]
Accepted by ECML PKDD 2024
[LINK]
http://arxiv.org/abs/2407.17839v1
[DATE]
2024-07-25 15:54:07+08:00
[CATEGORIES]
cs.LG
Imperative Learning: A Self-supervised Neural-Symbolic Learning Framework for Robot Autonomy
[AUTHORS]
Chen Wang, Kaiyi Ji, Junyi Geng, Zhongqiang Ren, Taimeng Fu, Fan Yang, Yifan Guo, Haonan He, Xiangyu Chen, Zitong Zhan, Qiwei Du, Shaoshu Su, Bowen Li, Yuheng Qiu, Yi Du, Qihang Li, Yifan Yang, Xiao Lin, Zhipeng Zhao
[ABSTRACT]
Data-driven methods such as reinforcement and imitation learning have
achieved remarkable success in robot autonomy. However, their data-centric
nature still hinders them from generalizing well to ever-changing environments.
Moreover, collecting large datasets for robotic tasks is often impractical and
expensive. To overcome these challenges, we introduce a new self-supervised
neural-symbolic (NeSy) computational framework, imperative learning (IL), for
robot autonomy, leveraging the generalization abilities of symbolic reasoning.
The framework of IL consists of three primary components: a neural module, a
reasoning engine, and a memory system. We formulate IL as a special bilevel
optimization (BLO), which enables reciprocal learning over the three modules.
This overcomes the label-intensive obstacles associated with data-driven
approaches and takes advantage of symbolic reasoning concerning logical
reasoning, physical principles, geometric analysis, etc. We discuss several
optimization techniques for IL and verify their effectiveness in five distinct
robot autonomy tasks including path planning, rule induction, optimal control,
visual odometry, and multi-robot routing. Through various experiments, we show
that IL can significantly enhance robot autonomy capabilities and we anticipate
that it will catalyze further research across diverse domains.
[LINK]
http://arxiv.org/abs/2406.16087v3
[DATE]
2024-07-25 15:50:58+08:00
[CATEGORIES]
cs.LG
Image Segmentation via Divisive Normalization: dealing with environmental diversity
[AUTHORS]
Pablo Hernández-Cámara, Jorge Vila-Tomás, Paula Dauden-Oliver, Nuria Alabau-Bosque, Valero Laparra, Jesús Malo
[ABSTRACT]
Autonomous driving is a challenging scenario for image segmentation due to
the presence of uncontrolled environmental conditions and the potentially
catastrophic consequences of failures. Previous work suggested that a
biologically motivated computation, the so-called Divisive Normalization, could
be useful to deal with image variability, but its effects have not been
systematically studied over different data sources and environmental factors.
Here we put segmentation U-nets augmented with Divisive Normalization to work
far from training conditions to find where this adaptation is more critical. We
categorize the scenes according to their radiance level and dynamic range
(day/night), and according to their achromatic/chromatic contrasts. We also
consider video game (synthetic) images to broaden the range of environments. We
check the performance in the extreme percentiles of such categorization. Then,
we push the limits further by artificially modifying the images in
perceptually/environmentally relevant dimensions: luminance, contrasts and
spectral radiance. Results show that neural networks with Divisive
Normalization get better results in all the scenarios and their performance
remains more stable with regard to the considered environmental factors and
nature of the source. Finally, we explain the improvements in segmentation
performance in two ways: (1) by quantifying the invariance of the responses
that incorporate Divisive Normalization, and (2) by illustrating the adaptive
nonlinearity of the different layers that depends on the local activity.
[LINK]
http://arxiv.org/abs/2407.17829v1
[DATE]
2024-07-25 15:38:27+08:00
[CATEGORIES]
cs.LG
Node-like as a Whole: Structure-aware Searching and Coarsening for Graph Classification
[AUTHORS]
Xiaorui Qi, Qijie Bai, Yanlong Wen, Haiwei Zhang, Xiaojie Yuan
[ABSTRACT]
Graph Transformers (GTs) have made remarkable achievements in graph-level
tasks. However, most existing works regard graph structures as a form of
guidance or bias for enhancing node representations, which focuses on
node-central perspectives and lacks explicit representations of edges and
structures. One natural question is, can we treat graph structures node-like as
a whole to learn high-level features? Through experimental analysis, we explore
the feasibility of this assumption. Based on our findings, we propose a novel
multi-view graph representation learning model via structure-aware searching
and coarsening (GRLsc) on GT architecture for graph classification.
Specifically, we build three unique views, original, coarsening, and
conversion, to learn a thorough structural representation. We compress loops
and cliques via hierarchical heuristic graph coarsening and restrict them with
well-designed constraints, which builds the coarsening view to learn high-level
interactions between structures. We also introduce line graphs for edge
embeddings and switch to edge-central perspective to construct the conversion
view. Experiments on eight real-world datasets demonstrate the improvements of
GRLsc over 28 baselines from various architectures.
[LINK]
http://arxiv.org/abs/2404.11869v3
[DATE]
2024-07-25 15:29:02+08:00
[CATEGORIES]
cs.LG
Optimal Hessian/Jacobian-Free Nonconvex-PL Bilevel Optimization
[AUTHORS]
Feihu Huang
[ABSTRACT]
Bilevel optimization is widely applied in many machine learning tasks such as
hyper-parameter learning, meta learning and reinforcement learning. Although
many algorithms recently have been developed to solve the bilevel optimization
problems, they generally rely on the (strongly) convex lower-level problems.
More recently, some methods have been proposed to solve the nonconvex-PL
bilevel optimization problems, where their upper-level problems are possibly
nonconvex, and their lower-level problems are also possibly nonconvex while
satisfying the Polyak-Łojasiewicz (PL) condition. However, these methods still
have a high convergence complexity or a high computation complexity, such as
requiring the computation of expensive Hessian/Jacobian matrices and their
inverses. In this paper, we therefore propose an efficient Hessian/Jacobian-free method (i.e.,
HJFBiO) with the optimal convergence complexity to solve the nonconvex-PL
bilevel problems. Theoretically, under some mild conditions, we prove that our
HJFBiO method obtains an optimal convergence rate of $O(\frac{1}{T})$, where
$T$ denotes the number of iterations, and has an optimal gradient complexity of
$O(\epsilon^{-1})$ in finding an $\epsilon$-stationary solution. We conduct
some numerical experiments on the bilevel PL game and hyper-representation
learning task to demonstrate the efficiency of our proposed method.
[COMMENTS]
ICML 2024 (Oral). arXiv admin note: text overlap with
arXiv:2311.04520
[LINK]
http://arxiv.org/abs/2407.17823v1
[DATE]
2024-07-25 15:25:06+08:00
[CATEGORIES]
cs.LG
Advanced deep-reinforcement-learning methods for flow control: group-invariant and positional-encoding networks improve learning speed and quality
[AUTHORS]
Joogoo Jeon, Jean Rabault, Joel Vasanth, Francisco Alcántara-Ávila, Shilaj Baral, Ricardo Vinuesa
[ABSTRACT]
Flow control is key to maximizing energy efficiency in a wide range of
applications. However, traditional flow-control methods face significant
challenges in addressing non-linear systems and high-dimensional data, limiting
their application in realistic energy systems. This study advances
deep-reinforcement-learning (DRL) methods for flow control, particularly
focusing on integrating group-invariant networks and positional encoding into
DRL architectures. Our methods leverage multi-agent reinforcement learning
(MARL) to exploit policy invariance in space, in combination with
group-invariant networks to ensure local symmetry invariance. Additionally, a
positional encoding inspired by the transformer architecture is incorporated to
provide location information to the agents, mitigating action constraints from
strict invariance. The proposed methods are verified using a case study of
Rayleigh-Bénard convection, where the goal is to minimize the Nusselt number
Nu. The group-invariant neural networks (GI-NNs) show faster convergence
compared to the base MARL, achieving better average policy performance. The
GI-NNs not only cut DRL training time in half but also notably enhance learning
reproducibility. Positional encoding further enhances these results,
effectively reducing the minimum Nu and stabilizing convergence. Interestingly,
group invariant networks specialize in improving learning speed and positional
encoding specializes in improving learning quality. These results demonstrate
that choosing a suitable feature-representation method according to the purpose
as well as the characteristics of each control problem is essential. We believe
that the results of this study will not only inspire novel DRL methods with
invariant and unique representations, but also provide useful insights for
industrial applications.
[LINK]
http://arxiv.org/abs/2407.17822v1
[DATE]
2024-07-25 15:24:41+08:00
[CATEGORIES]
cs.LG
Spatial-Temporal Cross-View Contrastive Pre-training for Check-in Sequence Representation Learning
[AUTHORS]
Letian Gong, Huaiyu Wan, Shengnan Guo, Xiucheng Li, Yan Lin, Erwen Zheng, Tianyi Wang, Zeyu Zhou, Youfang Lin
[ABSTRACT]
The rapid growth of location-based services (LBS) has yielded massive amounts
of data on human mobility. Effectively extracting meaningful representations
for user-generated check-in sequences is pivotal for facilitating various
downstream services. However, the user-generated check-in data are
simultaneously influenced by the surrounding objective circumstances and the
user’s subjective intention. Specifically, the temporal uncertainty and spatial
diversity exhibited in check-in data make it difficult to capture the
macroscopic spatial-temporal patterns of users and to understand the semantics
of user mobility activities. Furthermore, the distinct characteristics of the
temporal and spatial information in check-in sequences call for an effective
fusion method to incorporate these two types of information. In this paper, we
propose a novel Spatial-Temporal Cross-view Contrastive Representation (STCCR)
framework for check-in sequence representation learning. Specifically, STCCR
addresses the above challenges by employing self-supervision from “spatial
topic” and “temporal intention” views, facilitating effective fusion of spatial
and temporal information at the semantic level. Besides, STCCR leverages
contrastive clustering to uncover users’ shared spatial topics from diverse
mobility activities, while employing angular momentum contrast to mitigate the
impact of temporal uncertainty and noise. We extensively evaluate STCCR on
three real-world datasets and demonstrate its superior performance across three
downstream tasks.
[COMMENTS]
This paper has been accepted as a regular paper at IEEE TKDE
[LINK]
http://arxiv.org/abs/2407.15899v3
[DATE]
2024-07-25 15:18:05+08:00
[CATEGORIES]
cs.LG
NC-NCD: Novel Class Discovery for Node Classification
[AUTHORS]
Yue Hou, Xueyuan Chen, He Zhu, Romei Liu, Bowen Shi, Jiaheng Liu, Junran Wu, Ke Xu
[ABSTRACT]
Novel Class Discovery (NCD) involves identifying new categories within
unlabeled data by utilizing knowledge acquired from previously established
categories. However, existing NCD methods often struggle to maintain a balance
between the performance of old and new categories. Discovering unlabeled new
categories in a class-incremental way is more practical but also more
challenging, as it is frequently hindered by either catastrophic forgetting of
old categories or an inability to learn new ones. Furthermore, the
implementation of NCD on continuously scalable graph-structured data remains an
under-explored area. In response to these challenges, we introduce for the
first time a more practical NCD scenario for node classification (i.e.,
NC-NCD), and propose a novel self-training framework with prototype replay and
distillation called SWORD, adapted to our NC-NCD setting. Our approach enables
the model to cluster unlabeled new category nodes after learning labeled nodes
while preserving performance on old categories without reliance on old category
nodes. SWORD achieves this by employing a self-training strategy to learn new
categories and preventing the forgetting of old categories through the joint
use of feature prototypes and knowledge distillation. Extensive experiments on
four common benchmarks demonstrate the superiority of SWORD over other
state-of-the-art methods.
[COMMENTS]
Accepted by CIKM’24
[LINK]
http://arxiv.org/abs/2407.17816v1
[DATE]
2024-07-25 15:10:08+08:00
[CATEGORIES]
cs.LG
Nested replicator dynamics, nested logit choice, and similarity-based learning
[AUTHORS]
Panayotis Mertikopoulos, William H. Sandholm
[ABSTRACT]
We consider a model of learning and evolution in games whose action sets are
endowed with a partition-based similarity structure intended to capture
exogenous similarities between strategies. In this model, revising agents have
a higher probability of comparing their current strategy with other strategies
that they deem similar, and they switch to the observed strategy with
probability proportional to its payoff excess. Because of this implicit bias
toward similar strategies, the resulting dynamics - which we call the nested
replicator dynamics - do not satisfy any of the standard monotonicity
postulates for imitative game dynamics; nonetheless, we show that they retain
the main long-run rationality properties of the replicator dynamics, albeit at
quantitatively different rates. We also show that the induced dynamics can be
viewed as a stimulus-response model in the spirit of Erev & Roth (1998), with
choice probabilities given by the nested logit choice rule of Ben-Akiva (1973)
and McFadden (1978). This result generalizes an existing relation between the
replicator dynamics and the exponential weights algorithm in online learning,
and provides an additional layer of interpretation to our analysis and results.
[COMMENTS]
37 pages, 9 figures
[LINK]
http://arxiv.org/abs/2407.17815v1
[DATE]
2024-07-25 15:09:53+08:00
[CATEGORIES]
cs.LG
CCVA-FL: Cross-Client Variations Adaptive Federated Learning for Medical Imaging
[AUTHORS]
Sunny Gupta, Amit Sethi
[ABSTRACT]
Federated Learning (FL) offers a privacy-preserving approach to train models
on decentralized data. Its potential in healthcare is significant, but
challenges arise due to cross-client variations in medical image data,
exacerbated by limited annotations. This paper introduces Cross-Client
Variations Adaptive Federated Learning (CCVA-FL) to address these issues.
CCVA-FL aims to minimize cross-client variations by transforming images into a
common feature space. It involves expert annotation of a subset of images from
each client, followed by the selection of a client with the least data
complexity as the target. Synthetic medical images are then generated using
Scalable Diffusion Models with Transformers (DiT) based on the target client’s
annotated images. These synthetic images, capturing diversity and representing
the original data, are shared with other clients. Each client then translates
its local images into the target image space using image-to-image translation.
The translated images are subsequently used in a federated learning setting to
develop a server model. Our results demonstrate that CCVA-FL outperforms
Vanilla Federated Averaging by effectively addressing data distribution
differences across clients without compromising privacy.
[COMMENTS]
10 pages, 6 figures
[LINK]
http://arxiv.org/abs/2407.11652v2
[DATE]
2024-07-25 15:04:32+08:00
[CATEGORIES]
cs.LG
Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent
[AUTHORS]
Da Yu, Gautam Kamath, Janardhan Kulkarni, Tie-Yan Liu, Jian Yin, Huishuai Zhang
[ABSTRACT]
Differentially private stochastic gradient descent (DP-SGD) is the workhorse
algorithm for recent advances in private deep learning. It provides a single
privacy guarantee to all datapoints in the dataset. We propose output-specific
$(\varepsilon,\delta)$-DP to characterize privacy guarantees for individual
examples when releasing models trained by DP-SGD. We also design an efficient
algorithm to investigate individual privacy across a number of datasets. We
find that most examples enjoy stronger privacy guarantees than the worst-case
bound. We further discover that the training loss and the privacy parameter of
an example are well-correlated. This implies groups that are underserved in
terms of model utility simultaneously experience weaker privacy guarantees. For
example, on CIFAR-10, the average $\varepsilon$ of the class with the lowest
test accuracy is 44.2\% higher than that of the class with the highest
accuracy.
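For context, a minimal sketch of one generic DP-SGD update (per-example
gradient clipping followed by Gaussian noise) is given below. It illustrates
the standard mechanism only, not the individual privacy accounting proposed in
the paper; the function name and all parameter names are assumptions made for
this example.

```python
import math
import random

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One generic DP-SGD update on a flat parameter list (illustrative)."""
    dim = len(params)
    total = [0.0] * dim
    for g in per_example_grads:
        # Clip each example's gradient to L2 norm at most clip_norm
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(dim):
            total[i] += g[i] * scale
    # Add Gaussian noise with std noise_multiplier * clip_norm to the sum
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [p - lr * (t + rng.gauss(0.0, sigma)) / n for p, t in zip(params, total)]

rng = random.Random(0)
new_params = dp_sgd_step([0.0, 0.0], [[3.0, 4.0], [0.0, 0.0]],
                         lr=1.0, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

With `noise_multiplier=0.0` the step is deterministic, which makes the effect
of clipping easy to inspect; a private run would use a positive value.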
[COMMENTS]
Add clarification about the applicability of Definition 4
[LINK]
http://arxiv.org/abs/2206.02617v7
[DATE]
2024-07-25 14:33:58+08:00
[CATEGORIES]
cs.LG
EEG-SSM: Leveraging State-Space Model for Dementia Detection
[AUTHORS]
Xuan-The Tran, Linh Le, Quoc Toan Nguyen, Thomas Do, Chin-Teng Lin
[ABSTRACT]
State-space models (SSMs) have garnered attention for effectively processing
long data sequences, reducing the need to segment time series into shorter
intervals for model training and inference. Traditionally, SSMs capture only
the temporal dynamics of time series data, omitting the equally critical
spectral features. This study introduces EEG-SSM, a novel state-space
model-based approach for dementia classification using EEG data. Our model
features two primary innovations: EEG-SSM temporal and EEG-SSM spectral
components. The temporal component is designed to efficiently process EEG
sequences of varying lengths, while the spectral component enhances the model
by integrating frequency-domain information from EEG signals. The synergy of
these components allows EEG-SSM to adeptly manage the complexities of
multivariate EEG data, significantly improving accuracy and stability across
different temporal resolutions. Demonstrating a remarkable 91.0 percent
accuracy in classifying Healthy Control (HC), Frontotemporal Dementia (FTD),
and Alzheimer’s Disease (AD) groups, EEG-SSM outperforms existing models on the
same dataset. The development of EEG-SSM represents an improvement in the use
of state-space models for screening dementia, offering more precise and
cost-effective tools for clinical neuroscience.
[LINK]
http://arxiv.org/abs/2407.17801v1
[DATE]
2024-07-25 14:20:03+08:00
[CATEGORIES]
cs.LG
Enhancing Diversity in Multi-objective Feature Selection
[AUTHORS]
Sevil Zanjani Miyandoab, Shahryar Rahnamayan, Azam Asilian Bidgoli, Sevda Ebrahimi, Masoud Makrehchi
[ABSTRACT]
Feature selection plays a pivotal role in the data preprocessing and
model-building pipeline, significantly enhancing model performance,
interpretability, and resource efficiency across diverse domains. In
population-based optimization methods, the generation of diverse individuals
holds utmost importance for adequately exploring the problem landscape,
particularly in highly multi-modal multi-objective optimization problems. Our
study reveals that, in line with findings from several prior research papers,
commonly employed crossover and mutation operations lack the capability to
generate high-quality diverse individuals and tend to become confined to
limited areas around various local optima. This paper introduces an
augmentation to the diversity of the population in the well-established
multi-objective scheme of the genetic algorithm, NSGA-II. This enhancement is
achieved through two key components: the genuine initialization method and the
substitution of the worst individuals with new randomly generated individuals
as a re-initialization approach in each generation. The proposed
multi-objective feature selection method undergoes testing on twelve real-world
classification problems, with the number of features ranging from 2,400 to
nearly 50,000. The results demonstrate that replacing the last front of the
population with an equivalent number of new random individuals generated using
the genuine initialization method and featuring a limited number of features
substantially improves the population’s quality and, consequently, enhances the
performance of the multi-objective algorithm.
[COMMENTS]
8 pages, 3 figures, accepted to be published in IEEE WCCI 2024
conference
[LINK]
http://arxiv.org/abs/2407.17795v1
[DATE]
2024-07-25 14:09:44+08:00
[CATEGORIES]
cs.LG
Variational Inference with Coverage Guarantees in Simulation-Based Inference
[AUTHORS]
Yash Patel, Declan McNamara, Jackson Loper, Jeffrey Regier, Ambuj Tewari
[ABSTRACT]
Amortized variational inference is an often employed framework in
simulation-based inference that produces a posterior approximation that can be
rapidly computed given any new observation. Unfortunately, there are few
guarantees about the quality of these approximate posteriors. We propose
Conformalized Amortized Neural Variational Inference (CANVI), a procedure that
is scalable, easily implemented, and provides guaranteed marginal coverage.
Given a collection of candidate amortized posterior approximators, CANVI
constructs conformalized predictors based on each candidate, compares the
predictors using a metric known as predictive efficiency, and returns the most
efficient predictor. CANVI ensures that the resulting predictor constructs
regions that contain the truth with a user-specified level of probability.
CANVI is agnostic to design decisions in formulating the candidate
approximators and only requires access to samples from the forward model,
permitting its use in likelihood-free settings. We prove lower bounds on the
predictive efficiency of the regions produced by CANVI and explore how the
quality of a posterior approximation relates to the predictive efficiency of
prediction regions based on that approximation. Finally, we demonstrate the
accurate calibration and high predictive efficiency of CANVI on a suite of
simulation-based inference benchmark tasks and an important scientific task:
analyzing galaxy emission spectra.
[LINK]
http://arxiv.org/abs/2305.14275v3
[DATE]
2024-07-25 13:53:46+08:00
[CATEGORIES]
cs.LG
Integrating Ensemble Kalman Filter with AI-based Weather Prediction Model ClimaX
[AUTHORS]
Shunji Kotsuki, Kenta Shiraishi, Atsushi Okazaki
[ABSTRACT]
Artificial intelligence (AI)-based weather prediction research is growing
rapidly and has been shown to be competitive with the advanced dynamic numerical
weather prediction models. However, research combining AI-based weather
prediction models with data assimilation remains limited, partly because
long-term sequential data assimilation cycles are required to evaluate data
assimilation systems. This study explores integrating the local ensemble
transform Kalman filter (LETKF) with an AI-based weather prediction model
ClimaX. Our experiments demonstrated that the ensemble data assimilation cycled
stably for the AI-based weather prediction model using covariance inflation and
localization techniques inside the LETKF. While ClimaX showed some limitations
in capturing flow-dependent error covariance compared to dynamical models, the
AI-based ensemble forecasts provided reasonable and beneficial error covariance
in sparsely observed regions. These findings highlight the potential of AI
models in weather forecasting and the importance of physical consistency and
accurate error growth representation in improving ensemble data assimilation.
[LINK]
http://arxiv.org/abs/2407.17781v1
[DATE]
2024-07-25 13:22:08+08:00
[CATEGORIES]
cs.LG
Online Learning for Autonomous Management of Intent-based 6G Networks
[AUTHORS]
Erciyes Karakaya, Ozgur Ercetin, Huseyin Ozkan, Mehmet Karaca, Elham Dehghan Biyar, Alexandros Palaios
[ABSTRACT]
The growing complexity of networks and the variety of future scenarios with
diverse and often stringent performance requirements call for a higher level of
automation. Intent-based management emerges as a solution to attain a high level
of automation, enabling human operators to communicate with the network solely
through high-level intents. The intents consist of targets in the form of
expectations (e.g., a latency expectation) from a service, and the required
network configurations should be made according to these expectations. It
is almost inevitable that when a network action is taken to fulfill one intent,
it can cause negative impacts on the performance of another intent, which
results in a conflict. In this paper, we aim to address the conflict issue and
autonomous management of intent-based networking, and propose an online
learning method based on the hierarchical multi-armed bandits approach for
effective management. Thanks to this hierarchical structure, it efficiently
explores and exploits network configurations with respect
to the dynamic network conditions. We show that our algorithm is an effective
approach regarding resource allocation and satisfaction of intent expectations.
[LINK]
http://arxiv.org/abs/2407.17767v1
[DATE]
2024-07-25 12:48:56+08:00
[CATEGORIES]
cs.LG
Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package in R and Python
[AUTHORS]
Giovanni Saraceno, Marianthi Markatou, Raktim Mukhopadhyay, Mojgan Golzy
[ABSTRACT]
We introduce the QuadratiK package that incorporates innovative data analysis
methodologies. The presented software, implemented in both R and Python, offers
a comprehensive set of goodness-of-fit tests and clustering techniques using
kernel-based quadratic distances, thereby bridging the gap between the
statistical and machine learning literatures. Our software implements one-, two-,
and k-sample tests for goodness of fit, providing an efficient and
mathematically sound way to assess the fit of probability distributions.
Expanded capabilities of our software include supporting tests for uniformity
on the d-dimensional sphere based on Poisson kernel densities. Particularly
noteworthy is the incorporation of a unique clustering algorithm specifically
tailored for spherical data that leverages a mixture of Poisson kernel-based
densities on the sphere. Alongside this, our software includes additional
graphical functions, aiding the users in validating, as well as visualizing and
representing clustering results. This enhances interpretability and usability
of the analysis. In summary, our R and Python packages serve as a powerful
suite of tools, offering researchers and practitioners the means to delve
deeper into their data, draw robust inferences, and conduct potentially
impactful analyses across a wide array of disciplines.
[COMMENTS]
36 pages, 9 figures
[LINK]
http://arxiv.org/abs/2402.02290v2
[DATE]
2024-07-25 12:43:32+08:00
[CATEGORIES]
cs.LG
SES: Bridging the Gap Between Explainability and Prediction of Graph Neural Networks
[AUTHORS]
Zhenhua Huang, Kunhao Li, Shaojie Wang, Zhaohong Jia, Wentao Zhu, Sharad Mehrotra
[ABSTRACT]
Despite the proficiency of Graph Neural Networks (GNNs) in analyzing graph
data, achieving high-accuracy and interpretable predictions remains
challenging. Existing GNN interpreters typically provide post-hoc explanations
disjointed from GNNs’ predictions, resulting in misrepresentations.
Self-explainable GNNs offer built-in explanations during the training process.
However, they cannot exploit the explanatory outcomes to augment prediction
performance, and they fail to provide high-quality explanations of node
features and require additional processes to generate explainable subgraphs,
which is costly. To address the aforementioned limitations, we propose a
self-explained and self-supervised graph neural network (SES) to bridge the gap
between explainability and prediction. SES comprises two processes: explainable
training and enhanced predictive learning. During explainable training, SES
employs a global mask generator co-trained with a graph encoder and directly
produces crucial structure and feature masks, reducing time consumption and
providing node feature and subgraph explanations. In the enhanced predictive
learning phase, mask-based positive-negative pairs are constructed utilizing
the explanations to compute a triplet loss and enhance the node representations
by contrastive learning.
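To make the triplet loss mentioned above concrete, here is a minimal sketch of
the generic margin-based form with Euclidean distances. The function name, the
margin value, and the choice of distance are assumptions for illustration; how
SES builds its mask-based positive and negative pairs is not shown here.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic margin-based triplet loss on plain vectors (illustrative)."""
    d_pos = math.dist(anchor, positive)  # distance to the positive sample
    d_neg = math.dist(anchor, negative)  # distance to the negative sample
    # Zero once the negative is at least `margin` farther away than the positive
    return max(0.0, d_pos - d_neg + margin)

satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The loss vanishes for `satisfied`, where the positive is already much closer
to the anchor than the negative, and is positive for `violated`.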
[COMMENTS]
Accepted as a conference paper at ICDE 2024
[LINK]
http://arxiv.org/abs/2407.11358v2
[DATE]
2024-07-25 12:20:12+08:00
[CATEGORIES]
cs.LG
Efficient Combinatorial Optimization via Heat Diffusion
[AUTHORS]
Hengyuan Ma, Wenlian Lu, Jianfeng Feng
[ABSTRACT]
Combinatorial optimization problems are widespread but inherently challenging
due to their discrete nature. The primary limitation of existing methods is
that they can only access a small fraction of the solution space at each
iteration, resulting in limited efficiency for searching the global optimum. To
overcome this challenge, diverging from conventional efforts of expanding the
solver’s search scope, we focus on enabling information to actively propagate
to the solver through heat diffusion. By transforming the target function while
preserving its optima, heat diffusion facilitates information flow from distant
regions to the solver, providing more efficient navigation. Utilizing heat
diffusion, we propose a framework for solving general combinatorial
optimization problems. The proposed methodology demonstrates superior
performance across a range of the most challenging and widely encountered
combinatorial optimization problems. Echoing recent advancements in harnessing
thermodynamics for generative artificial intelligence, our study further
reveals its significant potential in advancing combinatorial optimization.
[COMMENTS]
Code is available at https://github.com/AwakerMhy/HeO
[LINK]
http://arxiv.org/abs/2403.08757v3
[DATE]
2024-07-25 12:12:17+08:00
[CATEGORIES]
cs.LG
DualFed: Enjoying both Generalization and Personalization in Federated Learning via Hierachical Representations
[AUTHORS]
Guogang Zhu, Xuefeng Liu, Jianwei Niu, Shaojie Tang, Xinghao Wu, Jiayuan Zhang
[ABSTRACT]
In personalized federated learning (PFL), it is widely recognized that
achieving both high model generalization and effective personalization poses a
significant challenge due to their conflicting nature. As a result, existing
PFL methods can only manage a trade-off between these two objectives. This
raises an interesting question: Is it feasible to develop a model capable of
achieving both objectives simultaneously? Our paper presents an affirmative
answer, and the key lies in the observation that deep models inherently exhibit
hierarchical architectures, which produce representations with various levels
of generalization and personalization at different stages. A straightforward
approach stemming from this observation is to select multiple representations
from these layers and combine them to concurrently achieve generalization and
personalization. However, the number of candidate representations is commonly
huge, which makes this method infeasible due to high computational costs. To
address this problem, we propose DualFed, a new method that can directly yield
dual representations corresponding to generalization and personalization
respectively, thereby simplifying the optimization task. Specifically, DualFed
inserts a personalized projection network between the encoder and classifier.
The pre-projection representations are able to capture generalized information
shareable across clients, and the post-projection representations effectively
capture task-specific information on local clients. This design minimizes
the mutual interference between generalization and personalization, thereby
achieving a win-win situation. Extensive experiments show that DualFed can
outperform other FL methods. Code is available at
https://github.com/GuogangZhu/DualFed.
[COMMENTS]
Accepted by ACM Multimedia 2024
[LINK]
http://arxiv.org/abs/2407.17754v1
[DATE]
2024-07-25 12:09:12+08:00
[CATEGORIES]
cs.LG
A Survey on Hypergraph Neural Networks: An In-Depth and Step-By-Step Guide
[AUTHORS]
Sunwoo Kim, Soo Yong Lee, Yue Gao, Alessia Antelmi, Mirko Polato, Kijung Shin
[ABSTRACT]
Higher-order interactions (HOIs) are ubiquitous in real-world complex systems
and applications. Investigation of deep learning for HOIs, thus, has become a
valuable agenda for the data mining and machine learning communities. As
networks of HOIs are expressed mathematically as hypergraphs, hypergraph neural
networks (HNNs) have emerged as a powerful tool for representation learning on
hypergraphs. Given the emerging trend, we present the first survey dedicated to
HNNs, with an in-depth and step-by-step guide. Broadly, the present survey
overviews HNN architectures, training strategies, and applications. First, we
break existing HNNs down into four design components: (i) input features, (ii)
input structures, (iii) message-passing schemes, and (iv) training strategies.
Second, we examine how HNNs address and learn HOIs with each of their
components. Third, we overview the recent applications of HNNs in
recommendation, bioinformatics and medical science, time series analysis, and
computer vision. Lastly, we conclude with a discussion on limitations and
future directions.
[COMMENTS]
To appear in KDD 2024 (survey paper). The typo in Equation (5) has
been fixed
[LINK]
http://arxiv.org/abs/2404.01039v3
[DATE]
2024-07-25 11:35:48+08:00
[CATEGORIES]
cs.LG
A Priori Uncertainty Quantification of Reacting Turbulence Closure Models using Bayesian Neural Networks
[AUTHORS]
Graham Pash, Malik Hassanaly, Shashank Yellapantula
[ABSTRACT]
While many physics-based closure model forms have been posited for the
sub-filter scale (SFS) in large eddy simulation (LES), vast amounts of data
available from direct numerical simulation (DNS) create opportunities to
leverage data-driven modeling techniques. Albeit flexible, data-driven models
still depend on the dataset and the functional form of the model chosen.
Increased adoption of such models requires reliable uncertainty estimates both
in the data-informed and out-of-distribution regimes. In this work, we employ
Bayesian neural networks (BNNs) to capture both epistemic and aleatoric
uncertainties in a reacting flow model. In particular, we model the filtered
progress variable scalar dissipation rate which plays a key role in the
dynamics of turbulent premixed flames. We demonstrate that BNN models can
provide unique insights about the structure of uncertainty of the data-driven
closure models. We also propose a method for the incorporation of
out-of-distribution information in a BNN. The efficacy of the model is
demonstrated by a priori evaluation on a dataset consisting of a variety of
flame conditions and fuels.
[LINK]
http://arxiv.org/abs/2402.18729v2
[DATE]
2024-07-25 11:06:54+08:00
[CATEGORIES]
cs.LG
Optimal Trade and Industrial Policies in the Global Economy: A Deep Learning Framework
[AUTHORS]
Zi Wang, Xingcheng Xu, Yanqing Yang, Xiaodong Zhu
[ABSTRACT]
We propose a deep learning framework, DL-opt, designed to efficiently solve
for optimal policies in quantifiable general equilibrium trade models. DL-opt
integrates (i) a nested fixed point (NFXP) formulation of the optimization
problem, (ii) automatic implicit differentiation to enhance gradient descent
for solving unilateral optimal policies, and (iii) a best-response dynamics
approach for finding Nash equilibria. Utilizing DL-opt, we solve for
non-cooperative tariffs and industrial subsidies across 7 economies and 44
sectors, incorporating sectoral external economies of scale. Our quantitative
analysis reveals significant sectoral heterogeneity in Nash policies: Nash
industrial subsidies increase with scale elasticities, whereas Nash tariffs
decrease with trade elasticities. Moreover, we show that global dual
competition, involving both tariffs and industrial subsidies, results in lower
tariffs and higher welfare outcomes compared to a global tariff war. These
findings highlight the importance of considering sectoral heterogeneity and
policy combinations in understanding global economic competition.
[LINK]
http://arxiv.org/abs/2407.17731v1
[DATE]
2024-07-25 11:03:20+08:00
[CATEGORIES]
cs.LG
Your Graph Recommender is Provably a Single-view Graph Contrastive Learning
[AUTHORS]
Wenjie Yang, Shengzhong Zhang, Jiaxing Guo, Zengfeng Huang
[ABSTRACT]
Graph recommender (GR) is a type of graph neural network (GNN) encoder that
is customized for extracting information from the user-item interaction graph.
Due to its strong performance on the recommendation task, GR has gained
significant attention recently. Graph contrastive learning (GCL) is also a
popular research direction that aims to learn, often unsupervised, GNNs with
certain contrastive objectives. As a general graph representation learning
method, GCLs have been widely adopted with the supervised recommendation loss
for joint training of GRs. Despite the intersection of GR and GCL research,
theoretical understanding of the relationship between the two fields is
surprisingly sparse. This gap inevitably leads to inefficient scientific
research.
In this paper, we aim to bridge the gap between the field of GR and GCL from
the perspective of encoders and loss functions. With mild assumptions, we
theoretically show an astonishing fact that graph recommender is equivalent to
a commonly-used single-view graph contrastive model. Specifically, we find that
(1) the classic encoder in GR is essentially a linear graph convolutional
network with one-hot inputs, and (2) the loss function in GR is well bounded by
a single-view GCL loss with certain hyperparameters. The first observation
enables us to explain crucial designs of GR models, e.g., the removal of
self-loop and nonlinearity. And the second finding can easily prompt many
cross-field research directions. We empirically show a remarkable result that
the recommendation loss and the GCL loss can be used interchangeably. The fact
that we can train GR models solely with the GCL loss is particularly
insightful, since before this work, GCLs were typically viewed as unsupervised
methods that need fine-tuning. We also discuss some potential future work
inspired by our theory.
[LINK]
http://arxiv.org/abs/2407.17723v1
[DATE]
2024-07-25 10:53:11+08:00
[CATEGORIES]
cs.LG
Text-Driven Neural Collaborative Filtering Model for Paper Source Tracing
[AUTHORS]
Aobo Xu, Bingyu Chang, Qingpeng Liu, Ling Jian
[ABSTRACT]
Identifying significant references within the complex interrelations of a
citation knowledge graph, which encompasses connections through citations,
authorship, keywords, and other relational attributes, is challenging. The Paper
Source Tracing (PST) task seeks to automate the identification of pivotal
references for given scholarly articles utilizing advanced data mining
techniques. For KDD CUP 2024, we design a recommendation-based framework
tailored for the PST task. This framework employs the Neural Collaborative
Filtering (NCF) model to generate final predictions. To process the textual
attributes of the papers and extract input features for the model, we utilize
SciBERT, a pre-trained language model. According to the experimental results,
our method achieved a score of 0.37814 on the Mean Average Precision (MAP)
metric, outperforming baseline models and ranking 11th among all participating
teams. The source code is publicly available at
https://github.com/MyLove-XAB/KDDCupFinal.
[COMMENTS]
KDD CUP 2024 OAG-Challenges, Paper Source Tracing, Technical Report
of Team AoboSama @ KDD CUP 2024. August 25–29, 2024. Barcelona, Spain
[LINK]
http://arxiv.org/abs/2407.17722v1
[DATE]
2024-07-25 10:48:56+08:00
[CATEGORIES]
cs.LG
A Two-Stage Imaging Framework Combining CNN and Physics-Informed Neural Networks for Full-Inverse Tomography: A Case Study in Electrical Impedance Tomography (EIT)
[AUTHORS]
Xuanxuan Yang, Yangming Zhang, Haofeng Chen, Gang Ma, Xiaojie Wang
[ABSTRACT]
Physics-Informed Neural Networks (PINNs) are a machine learning technique for
solving partial differential equations (PDEs) by incorporating PDEs as loss
terms in neural networks and minimizing the loss function during training.
Tomographic imaging, a method to reconstruct internal properties from external
measurement data, is highly complex and ill-posed, making it an inverse
problem. Recently, PINNs have shown significant potential in computational
fluid dynamics (CFD) and have advantages in solving inverse problems. However,
existing research has primarily focused on semi-inverse Electrical Impedance
Tomography (EIT), where internal electric potentials are accessible. The
practical full inverse EIT problem, where only boundary voltage measurements
are available, remains challenging. To address this, we propose a two-stage
hybrid learning framework combining Convolutional Neural Networks (CNNs) and
PINNs to solve the full inverse EIT problem. This framework integrates
data-driven and model-driven approaches, combines supervised and unsupervised
learning, and decouples the forward and inverse problems within the PINN
framework in EIT. Stage I: a U-Net constructs an end-to-end mapping from
boundary voltage measurements to the internal potential distribution using
supervised learning. Stage II: a Multilayer Perceptron (MLP)-based PINN takes
the predicted internal potentials as input to solve for the conductivity
distribution through unsupervised learning.
[LINK]
http://arxiv.org/abs/2407.17721v1
[DATE]
2024-07-25 10:48:22+08:00
[CATEGORIES]
cs.LG
Robust experimental data assimilation for the Spalart-Allmaras turbulence model
[AUTHORS]
Deepinder Jot Singh Aulakh, Xiang Yang, Romit Maulik
[ABSTRACT]
This study presents a methodology focusing on the use of computational model
and experimental data fusion to improve the Spalart-Allmaras (SA) closure model
for Reynolds-averaged Navier-Stokes solutions. In particular, our goal is to
develop a technique that not only assimilates sparse experimental data to
improve turbulence model performance, but also preserves generalization for
unseen cases by recovering classical SA behavior. We achieve our goals using
data assimilation, namely the Ensemble Kalman filtering approach (EnKF), to
calibrate the coefficients of the SA model for separated flows. A holistic
calibration strategy is implemented via the parameterization of the production,
diffusion, and destruction terms. This calibration relies on the assimilation
of experimental data collected in the form of velocity profiles, skin friction,
and pressure coefficients. Despite using observational data from a single flow
condition around a backward-facing step (BFS), the recalibrated SA model
demonstrates generalization to other separated flows, including cases such as
the 2D NASA wall-mounted hump (2D-WMH) and modified BFS. Significant
improvement is observed in the quantities of interest, i.e., skin friction
coefficient ($C_f$) and pressure coefficient ($C_p$) for each flow tested.
Finally, it is also demonstrated that the newly proposed model recovers SA
proficiency for flows, such as a NACA-0012 airfoil and axisymmetric jet (ASJ),
and that the individually calibrated terms in the SA model target specific
flow physics: the calibrated production term improves the recirculation
zone, while the destruction term improves the recovery zone.
[LINK]
http://arxiv.org/abs/2309.06679v3
[DATE]
2024-07-25 10:30:32+08:00
[CATEGORIES]
cs.LG
Improving Online Algorithms via ML Predictions
[AUTHORS]
Ravi Kumar, Manish Purohit, Zoya Svitkina
[COMMENTS]
Conference version appeared in NeurIPS 2018
[LINK]
http://arxiv.org/abs/2407.17712v1
[DATE]
2024-07-25 10:17:53+08:00
[CATEGORIES]
cs.LG
Revisiting Machine Unlearning with Dimensional Alignment
[AUTHORS]
Seonguk Seo, Dongwan Kim, Bohyung Han
[ABSTRACT]
Machine unlearning, an emerging research topic focusing on compliance with
data privacy regulations, enables trained models to remove the information
learned from specific data. While many existing methods indirectly address this
issue by intentionally injecting incorrect supervision, they can drastically
and unpredictably alter the decision boundaries and feature spaces, leading to
training instability and undesired side effects. To fundamentally approach this
task, we first analyze the changes in latent feature spaces between original
and retrained models, and observe that the feature representations of samples
not involved in training are closely aligned with the feature manifolds of
previously seen samples in training. Based on these findings, we introduce a
novel evaluation metric for machine unlearning, coined dimensional alignment,
which measures the alignment between the eigenspaces of the forget and retain
set samples. We employ this metric as a regularizer loss to build a robust and
stable unlearning framework, which is further enhanced by integrating a
self-distillation loss and an alternating training scheme. Our framework
effectively eliminates information from the forget set and preserves knowledge
from the retain set. Lastly, we identify critical flaws in established
evaluation metrics for machine unlearning, and introduce new evaluation tools
that more accurately reflect the fundamental goals of machine unlearning.
[LINK]
http://arxiv.org/abs/2407.17710v1
[DATE]
2024-07-25 10:05:15+08:00
[CATEGORIES]
cs.LG
Hierarchical Classification of Research Fields in the “Web of Science” Using Deep Learning
[AUTHORS]
Susie Xi Rao, Peter H. Egger, Ce Zhang
[ABSTRACT]
This paper presents a hierarchical classification system that automatically
categorizes a scholarly publication using its abstract into a three-tier
hierarchical label set (discipline, field, subfield) in a multi-class setting.
This system enables a holistic categorization of research activities in the
mentioned hierarchy in terms of knowledge production through articles and
impact through citations, permitting those activities to fall into multiple
categories. The classification system distinguishes 44 disciplines, 718 fields
and 1,485 subfields among 160 million abstract snippets in Microsoft Academic
Graph (version 2018-05-17). We used batch training in a modularized and
distributed fashion to address and allow for interdisciplinary and interfield
classifications in single-label and multi-label settings. In total, we have
conducted 3,140 experiments in all considered models (Convolutional Neural
Networks, Recurrent Neural Networks, Transformers). The classification accuracy
is > 90% in 77.13% and 78.19% of the single-label and multi-label
classifications, respectively. We examine the advantages of our classification
by its ability to better align research texts and output with disciplines, to
adequately classify them in an automated way, and to capture the degree of
interdisciplinarity. The proposed system (a set of pre-trained models) can
serve as a backbone to an interactive system for indexing scientific
publications in the future.
[COMMENTS]
Under minor revision at QSS
[LINK]
http://arxiv.org/abs/2302.00390v3
[DATE]
2024-07-25 10:02:58+08:00
[CATEGORIES]
cs.LG
Investigating and Mitigating Barren Plateaus in Variational Quantum Circuits: A Survey
[AUTHORS]
Jack Cunningham, Jun Zhuang
[ABSTRACT]
In recent years, variational quantum circuits (VQCs) have been widely
explored to advance quantum circuits against classic models on various domains,
such as quantum chemistry and quantum machine learning. Similar to classic
machine-learning models, VQCs can be optimized through gradient-based
approaches. However, the gradient variance of VQCs may dramatically vanish as
the number of qubits or layers increases. This issue, a.k.a. Barren Plateaus
(BPs), seriously hinders the scaling of VQCs on large datasets. To mitigate the
exponential gradient vanishing, extensive efforts have been devoted to tackling
this issue through diverse strategies. In this survey, we conduct a systematic
literature review of recent works from both investigation and mitigation
perspectives. In addition, we propose a new taxonomy to categorize most existing
mitigation strategies. Finally, we provide an insightful discussion of future
directions for BPs.
[COMMENTS]
preprint, under review. Please feel free to reach out if your work
fits within our scope
[LINK]
http://arxiv.org/abs/2407.17706v1
[DATE]
2024-07-25 09:58:46+08:00
[CATEGORIES]
cs.LG
Context-aware knowledge graph framework for traffic speed forecasting using graph neural network
[AUTHORS]
Yatao Zhang, Yi Wang, Song Gao, Martin Raubal
[ABSTRACT]
Human mobility is intricately influenced by urban contexts spatially and
temporally, constituting essential domain knowledge in understanding traffic
systems. While existing traffic forecasting models primarily rely on raw
traffic data and advanced deep learning techniques, incorporating contextual
information remains underexplored due to the lack of effective integration
frameworks and the complexity of urban contexts. This study proposes a novel
context-aware knowledge graph (CKG) framework to enhance traffic speed
forecasting by effectively modeling spatial and temporal contexts. Employing a
relation-dependent integration strategy, the framework generates context-aware
representations from the spatial and temporal units of CKG to capture
spatio-temporal dependencies of urban contexts. A CKG-GNN model, combining the
CKG, dual-view multi-head self-attention (MHSA), and graph neural network
(GNN), is then designed to predict traffic speed using these context-aware
representations. Our experiments demonstrate that CKG’s configuration
significantly influences embedding performance, with ComplEx and KG2E emerging
as optimal for embedding spatial and temporal units, respectively. The CKG-GNN
model surpasses benchmark models, achieving an average MAE of $3.46\pm0.01$ and
a MAPE of $14.76\pm0.09\%$ for traffic speed predictions from 10 to 120
minutes. The dual-view MHSA analysis reveals the crucial role of
relation-dependent features from the context-based view and the model’s ability
to prioritize recent time slots in prediction from the sequence-based view. The
CKG framework’s model-agnostic nature suggests its potential applicability in
various applications of intelligent transportation systems. Overall, this study
underscores the importance of incorporating domain-specific contexts into
traffic forecasting and merging context-aware knowledge graphs with neural
networks to enhance accuracy.
[COMMENTS]
13 pages, 4 figures
[LINK]
http://arxiv.org/abs/2407.17703v1
[DATE]
2024-07-25 09:52:12+08:00
[CATEGORIES]
cs.LG
Superior Scoring Rules for Probabilistic Evaluation of Single-Label Multi-Class Classification Tasks
[AUTHORS]
Rouhollah Ahmadian, Mehdi Ghatee, Johan Wahlström
[ABSTRACT]
This study introduces novel superior scoring rules called Penalized Brier
Score (PBS) and Penalized Logarithmic Loss (PLL) to improve model evaluation
for probabilistic classification. Traditional scoring rules like Brier Score
and Logarithmic Loss sometimes assign better scores to misclassifications in
comparison with correct classifications. This discrepancy from the actual
preference for rewarding correct classifications can lead to suboptimal model
selection. By integrating penalties for misclassifications, PBS and PLL modify
traditional proper scoring rules to consistently assign better scores to
correct predictions. Formal proofs demonstrate that PBS and PLL satisfy
strictly proper scoring rule properties while also preferentially rewarding
accurate classifications. Experiments showcase the benefits of using PBS and
PLL for model selection, model checkpointing, and early stopping. PBS exhibits
a higher negative correlation with the F1 score compared to the Brier Score
during training. Thus, PBS more effectively identifies optimal checkpoints and
early stopping points, leading to improved F1 scores. Comparative analysis
verifies models selected by PBS and PLL achieve superior F1 scores. Therefore,
PBS and PLL address the gap between uncertainty quantification and accuracy
maximization by encapsulating both proper scoring principles and explicit
preference for true classifications. The proposed metrics can enhance model
evaluation and selection for reliable probabilistic classification.
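
The penalty idea described above can be illustrated with a minimal sketch. It
is not the authors' implementation: the function names are hypothetical, and
the default penalty of (c - 1) / c for c classes is an assumption chosen for
this example. The sketch shows a case where the plain Brier score prefers a
misclassification over a correct but unsure prediction, and how adding a fixed
penalty for a wrong argmax reverses that ordering.

```python
def brier_score(probs, true_idx):
    """Multi-class Brier score for one sample (lower is better)."""
    return sum((p - (1.0 if k == true_idx else 0.0)) ** 2
               for k, p in enumerate(probs))


def penalized_brier_score(probs, true_idx, penalty=None):
    """Brier score plus a fixed penalty when the argmax is not the true class.

    The default penalty (c - 1) / c is an assumption made for this sketch.
    """
    c = len(probs)
    if penalty is None:
        penalty = (c - 1) / c
    predicted = max(range(c), key=lambda k: probs[k])
    score = brier_score(probs, true_idx)
    return score + penalty if predicted != true_idx else score


# The true class is 0 in both cases.
correct_but_unsure = [0.34, 0.33, 0.33]  # argmax is class 0 (correct)
wrong_but_close = [0.49, 0.51, 0.00]     # argmax is class 1 (wrong)

for name, probs in [("correct", correct_but_unsure), ("wrong", wrong_but_close)]:
    print(name, brier_score(probs, 0), penalized_brier_score(probs, 0))
```

Under the plain Brier score the wrong prediction scores lower (better) than
the correct one; with the penalty added, the correct prediction scores better.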
[COMMENTS]
21 Pages, 3 Figures, 3 Tables
[LINK]
http://arxiv.org/abs/2407.17697v1
[DATE]
2024-07-25 09:46:05+08:00
[CATEGORIES]
cs.LG
Cheems: Wonderful Matrices More Efficient and More Effective Architecture
[AUTHORS]
Jingze Shi, Lu He, Yuhan Wang, Tianyu He, Bingheng Wu, Mingkun Hou
[ABSTRACT]
Recent studies have shown that relative position encoding performs well in
selective state space model scanning algorithms, that an architecture
balancing SSM and Attention enhances the efficiency and effectiveness of the
algorithm, and that the sparse activation of the mixture of experts reduces the
training cost. I studied the effectiveness of using different position
encodings in structured state space dual algorithms, and the more effective
SSD-Attn internal and external function mixing method, and designed a more
efficient cross domain mixture of experts. I found that the same matrix is very
wonderful in different algorithms, which allows us to establish a new hybrid
sparse architecture: Cheems. Compared with other hybrid architectures, it is
more efficient and more effective in language modeling tasks.
[LINK]
http://arxiv.org/abs/2407.16958v2
[DATE]
2024-07-25 09:34:13+08:00
[CATEGORIES]
cs.LG
Predicting the structure of dynamic graphs
[AUTHORS]
Sevvandi Kandanaarachchi, Ziqi Xu, Stefan Westerlund
[ABSTRACT]
Many aspects of graphs have been studied in depth. However, forecasting the
structure of a graph at future time steps incorporating unseen, new nodes and
edges has not gained much attention. In this paper, we present such an
approach. Using a time series of graphs, we forecast graphs at future time
steps. We use time series forecasting methods to predict the node degree at
future time points and combine these forecasts with flux balance analysis – a
linear programming method used in biochemistry – to obtain the structure of
future graphs. We evaluate this approach using synthetic and real-world
datasets and demonstrate its utility and applicability.
[LINK]
http://arxiv.org/abs/2401.04280v2
[DATE]
2024-07-25 09:31:45+08:00
[CATEGORIES]
cs.LG
SLADE: Detecting Dynamic Anomalies in Edge Streams without Labels via Self-Supervised Learning
[AUTHORS]
Jongha Lee, Sunwoo Kim, Kijung Shin
[ABSTRACT]
To detect anomalies in real-world graphs, such as social, email, and
financial networks, various approaches have been developed. While they
typically assume static input graphs, most real-world graphs grow over time,
naturally represented as edge streams. In this context, we aim to achieve three
goals: (a) instantly detecting anomalies as they occur, (b) adapting to
dynamically changing states, and (c) handling the scarcity of dynamic anomaly
labels. In this paper, we propose SLADE (Self-supervised Learning for Anomaly
Detection in Edge Streams) for rapid detection of dynamic anomalies in edge
streams, without relying on labels. SLADE detects the shifts of nodes into
abnormal states by observing deviations in their interaction patterns over
time. To this end, it trains a deep neural network to perform two
self-supervised tasks: (a) minimizing drift in node representations and (b)
generating long-term interaction patterns from short-term ones. Failure in
these tasks for a node signals its deviation from the norm. Notably, the neural
network and tasks are carefully designed so that all required operations can be
performed in constant time (w.r.t. the graph size) in response to each new edge
in the input stream. In dynamic anomaly detection across four real-world
datasets, SLADE outperforms nine competing methods, even those leveraging label
supervision.
[COMMENTS]
12 pages, 4 figures, To Appear in KDD 2024
[LINK]
http://arxiv.org/abs/2402.11933v3
[DATE]
2024-07-25 08:46:33+08:00
[CATEGORIES]
cs.LG
IncidentResponseGPT: Generating Traffic Incident Response Plans with Generative Artificial Intelligence
[AUTHORS]
Artur Grigorev, Adriana-Simona Mihaita, Khaled Saleh, Yuming Ou
[ABSTRACT]
We propose IncidentResponseGPT, a novel system that applies
generative artificial intelligence (AI) to potentially enhance the efficiency
and effectiveness of traffic incident response. The model synthesizes
region-specific incident response guidelines and generates incident response
plans adapted to a specific area, aiming to expedite decision-making for traffic
management authorities. This approach aims to shorten incident resolution
times by suggesting various recommendations (e.g. optimal rerouting strategies,
estimated resource needs) to minimize the overall impact on the urban traffic
network. The system suggests specific actions, including dynamic lane closures,
optimized rerouting, and dispatching appropriate emergency resources.
IncidentResponseGPT employs the Technique for Order Preference by Similarity to
Ideal Solution (TOPSIS) to rank generated response plans on criteria like
impact minimization and resource efficiency, according to their proximity to a
human-proposed solution.
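
For readers unfamiliar with TOPSIS, the ranking step can be sketched as
follows. This is textbook TOPSIS, not the paper's pipeline: it measures
closeness to the best value per criterion, whereas the abstract describes
proximity to a human-proposed solution. The plan values, weights, and
criterion names below are hypothetical.

```python
import math


def topsis(matrix, weights, benefit):
    """Return closeness scores in [0, 1]; higher means closer to the ideal.

    matrix:  rows are alternatives, columns are criteria
    weights: one weight per criterion
    benefit: True if larger is better for that criterion, False if smaller is
    """
    n_crit = len(weights)
    # Vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
         for row in matrix]
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_best = math.dist(row, ideal)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    return scores


# Hypothetical plans scored on (traffic impact, resources used); lower is better.
plans = [[30, 5], [20, 4], [40, 8]]
scores = topsis(plans, weights=[0.5, 0.5], benefit=[False, False])
ranking = sorted(range(len(plans)), key=lambda i: -scores[i])
print(ranking, scores)
```

To rank against a reference plan instead, one could replace the computed
ideal point with that plan's weighted, normalized criterion values.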
[LINK]
http://arxiv.org/abs/2404.18550v3
[DATE]
2024-07-25 07:51:38+08:00
[CATEGORIES]
cs.LG
Synthetic High-resolution Cryo-EM Density Maps with Generative Adversarial Networks
[AUTHORS]
Chenwei Zhang, Anne Condon, Khanh Dao Duc
[ABSTRACT]
Generating synthetic cryogenic electron microscopy (cryo-EM) 3D density maps
from molecular structures has potentially important applications in structural
biology. Yet existing simulation-based methods cannot mimic all the complex
features present in experimental maps, such as secondary structure elements. As
an alternative, we propose struc2mapGAN, a novel data-driven method that
employs a generative adversarial network (GAN) to produce high-resolution
experimental-like density maps from molecular structures. More specifically,
struc2mapGAN uses a U-Net++ architecture as the generator, with an additional
L1 loss term and further processing of raw experimental maps to enhance
learning efficiency. While struc2mapGAN can promptly generate maps after
training, we demonstrate that it outperforms existing simulation-based methods
for a wide array of tested maps and across various evaluation metrics. Our code
is available at https://github.com/chenwei-zhang/struc2mapGAN.
[LINK]
http://arxiv.org/abs/2407.17674v1
[DATE]
2024-07-25 07:47:05+08:00
[CATEGORIES]
cs.LG
Spiking Neural Networks in Vertical Federated Learning: Performance Trade-offs
[AUTHORS]
Maryam Abbasihafshejani, Anindya Maiti, Murtuza Jadliwala
[ABSTRACT]
Federated machine learning enables model training across multiple clients
while maintaining data privacy. Vertical Federated Learning (VFL) specifically
deals with instances where the clients have different feature sets of the same
samples. As federated learning models aim to improve efficiency and
adaptability, innovative neural network architectures like Spiking Neural
Networks (SNNs) are being leveraged to enable fast and accurate processing at
the edge. SNNs, known for their efficiency over Artificial Neural Networks
(ANNs), have not been analyzed for their applicability in VFL, thus far. In
this paper, we investigate the benefits and trade-offs of using SNN models in a
vertical federated learning setting. We implement two different federated
learning architectures – with model splitting and without model splitting –
that have different privacy and performance implications. We evaluate the setup
using CIFAR-10 and CIFAR-100 benchmark datasets along with SNN implementations
of VGG9 and ResNet classification models. Comparative evaluations demonstrate
that the accuracy of SNN models is comparable to that of traditional ANNs for
VFL applications, while SNNs are significantly more energy efficient.
[LINK]
http://arxiv.org/abs/2407.17672v1
[DATE]
2024-07-25 07:31:02+08:00
[CATEGORIES]
cs.LG
Natural Gradient Hybrid Variational Inference with Application to Deep Mixed Models
[AUTHORS]
Weiben Zhang, Michael Stanley Smith, Worapree Maneesoonthorn, Ruben Loaiza-Maya
[ABSTRACT]
Stochastic models with global parameters and latent variables are common, and
variational inference (VI) is popular for such models. However, existing methods
are often either slow or inaccurate in high dimensions. We suggest a fast and
accurate VI method for this case that employs a well-defined natural gradient
variational optimization that targets the joint posterior of the global
parameters and latent variables. It is a hybrid method, where at each step the
global parameters are updated using the natural gradient and the latent
variables are generated from their conditional posterior. A fast-to-compute
expression for the Tikhonov damped Fisher information matrix is used, along
with the re-parameterization trick, to provide a stable natural gradient. We
apply the approach to deep mixed models, which are an emerging class of
Bayesian neural networks with random output layer coefficients to allow for
heterogeneity. A range of simulations show that using the natural gradient is
substantially more efficient than using the ordinary gradient, and that the
approach is faster and more accurate than two cutting-edge natural gradient VI
methods. In a financial application we show that accounting for industry level
heterogeneity using the deep mixed model improves the accuracy of asset pricing
models. MATLAB code to implement the method can be found at:
https://github.com/WeibenZhang07/NG-HVI.
[LINK]
http://arxiv.org/abs/2302.13536v2
[DATE]
2024-07-25 07:23:48+08:00
[CATEGORIES]
cs.LG
Unsqueeze [CLS] Bottleneck to Learn Rich Representations
[AUTHORS]
Qing Su, Shihao Ji
[ABSTRACT]
Distillation-based self-supervised learning typically leads to more
compressed representations due to its radical clustering process and the
implementation of a sharper target distribution. To overcome this limitation
and preserve more information from input, we introduce UDI, conceptualized as
Unsqueezed Distillation-based self-supervised learning (SSL). UDI enriches the
learned representation by encouraging multimodal prediction distilled from a
consolidated profile of local predictions that are derived via stratified
sampling. Our evaluations show that UDI not only promotes semantically
meaningful representations at instance level, delivering superior or
competitive results to state-of-the-art SSL methods in image classification,
but also effectively preserves the nuisance of input, which yields significant
improvement in dense prediction tasks, including object detection and
segmentation. Additionally, UDI performs competitively in low-shot image
classification, improving the scalability of joint-embedding pipelines. Various
visualizations and ablation studies are presented to further elucidate the
mechanisms behind UDI. Our source code is available at
https://github.com/ISL-CV/udi.
[LINK]
http://arxiv.org/abs/2407.17671v1
[DATE]
2024-07-25 07:23:38+08:00
[CATEGORIES]
cs.LG
One-shot Generative Distribution Matching for Augmented RF-based UAV Identification
[AUTHORS]
Amir Kazemi, Salar Basiri, Volodymyr Kindratenko, Srinivasa Salapaka
[ABSTRACT]
This work addresses the challenge of identifying Unmanned Aerial Vehicles
(UAV) using radiofrequency (RF) fingerprinting in limited RF environments. The
complexity and variability of RF signals, influenced by environmental
interference and hardware imperfections, often render traditional RF-based
identification methods ineffective. To address these complications, the study
introduces the rigorous use of one-shot generative methods for augmenting
transformed RF signals, offering a significant improvement in UAV
identification. This approach shows promise in low-data regimes, outperforming
deep generative methods like conditional generative adversarial networks (GANs)
and variational auto-encoders (VAEs). The paper provides a theoretical
guarantee for the effectiveness of one-shot generative models in augmenting
limited data, setting a precedent for their application in limited RF
environments. This research contributes to learning techniques in low-data
regime scenarios, which may include atypical complex sequences beyond images
and videos. The code and links to datasets used in this study are available at
https://github.com/amir-kazemi/uav-rf-id.
[COMMENTS]
31 pages, 7 figures, 4 tables
[LINK]
http://arxiv.org/abs/2301.08403v4
[DATE]
2024-07-25 06:41:12+08:00
[CATEGORIES]
cs.LG
Tackling the Problem of Distributional Shifts: Correcting Misspecified, High-Dimensional Data-Driven Priors for Inverse Problems
[AUTHORS]
Gabriel Missael Barco, Alexandre Adam, Connor Stone, Yashar Hezaveh, Laurence Perreault-Levasseur
[ABSTRACT]
Bayesian inference for inverse problems hinges critically on the choice of
priors. In the absence of specific prior information, population-level
distributions can serve as effective priors for parameters of interest. With
the advent of machine learning, the use of data-driven population-level
distributions (encoded, e.g., in a trained deep neural network) as priors is
emerging as an appealing alternative to simple parametric priors in a variety
of inverse problems. However, in many astrophysical applications, it is often
difficult or even impossible to acquire independent and identically distributed
samples from the underlying data-generating process of interest to train these
models. In these cases, corrupted data or a surrogate, e.g. a simulator, is
often used to produce training samples, meaning that there is a risk of
obtaining misspecified priors. This, in turn, can bias the inferred posteriors
in ways that are difficult to quantify, which limits the potential
applicability of these models in real-world scenarios. In this work, we propose
addressing this issue by iteratively updating the population-level
distributions by retraining the model with posterior samples from different
sets of observations and showcase the potential of this method on the problem
of background image reconstruction in strong gravitational lensing when
score-based models are used as data-driven priors. We show that starting from a
misspecified prior distribution, the updated distribution becomes progressively
closer to the underlying population-level distribution, and the resulting
posterior samples exhibit reduced bias after several updates.
[COMMENTS]
17 pages, 15 figures, Submitted to The Astrophysical Journal
[LINK]
http://arxiv.org/abs/2407.17667v1
[DATE]
2024-07-25 06:39:27+08:00
[CATEGORIES]
cs.LG
Generative Learning for Simulation of US Army Vehicle Faults
[AUTHORS]
Patrick Kuiper, Sirui Lin, Jose Blanchet, Vahid Tarokh
[ABSTRACT]
We develop a novel generative model to simulate vehicle health and forecast
faults, conditioned on practical operational considerations. The model, trained
on data from the US Army’s Predictive Logistics program, aims to support
predictive maintenance. It forecasts faults far enough in advance to execute a
maintenance intervention before a breakdown occurs. The model incorporates
real-world factors that affect vehicle health. It also allows us to understand
the vehicle’s condition by analyzing operating data, and characterizing each
vehicle into discrete states. Importantly, the model predicts the time to first
fault with high accuracy. We compare its performance to other models and
demonstrate its successful training.
[LINK]
http://arxiv.org/abs/2407.17654v1
[DATE]
2024-07-25 05:46:39+08:00
[CATEGORIES]
cs.LG
Hopfield Networks for Asset Allocation
[AUTHORS]
Carlo Nicolini, Monisha Gopalan, Jacopo Staiano, Bruno Lepri
[ABSTRACT]
We present the first application of modern Hopfield networks to the problem
of portfolio optimization. We performed an extensive study based on
combinatorial purged cross-validation over several datasets and compared our
results to both traditional and deep-learning-based methods for portfolio
selection. Compared to state-of-the-art deep-learning methods such as
Long-Short Term Memory networks and Transformers, we find that the proposed
approach performs on par or better, while providing faster training times and
better stability. Our results show that Modern Hopfield Networks represent a
promising approach to portfolio optimization, allowing for an efficient,
scalable, and robust solution for asset allocation, risk management, and
dynamic rebalancing.
[COMMENTS]
12 pages, 4 figures
[LINK]
http://arxiv.org/abs/2407.17645v1
[DATE]
2024-07-25 05:24:00+08:00
[CATEGORIES]
cs.LG
SMA-Hyper: Spatiotemporal Multi-View Fusion Hypergraph Learning for Traffic Accident Prediction
[AUTHORS]
Xiaowei Gao, James Haworth, Ilya Ilyankou, Xianghui Zhang, Tao Cheng, Stephen Law, Huanfa Chen
[ABSTRACT]
Predicting traffic accidents is key to sustainable city management, which
requires effectively addressing the dynamic and complex spatiotemporal
characteristics of cities. Current data-driven models often struggle with data
sparsity and typically overlook the integration of diverse urban data sources
and the high-order dependencies within them. Additionally, they frequently rely
on predefined topologies or weights, limiting their adaptability in
spatiotemporal predictions. To address these issues, we introduce the
Spatiotemporal Multiview Adaptive HyperGraph Learning (SMA-Hyper) model, a
dynamic deep learning framework designed for traffic accident prediction.
Building on previous research, this innovative model incorporates dual adaptive
spatiotemporal graph learning mechanisms that enable high-order cross-regional
learning through hypergraphs and dynamic adaptation to evolving urban data. It
also utilises contrastive learning to enhance global and local data
representations in sparse datasets and employs an advanced attention mechanism
to fuse multiple views of accident data and urban functional features, thereby
enriching the contextual understanding of risk factors. Extensive testing on
the London traffic accident dataset demonstrates that the SMA-Hyper model
significantly outperforms baseline models across various temporal horizons and
multistep outputs, affirming the effectiveness of its multiview fusion and
adaptive learning strategies. The interpretability of the results further
underscores its potential to improve urban traffic management and safety by
leveraging complex spatiotemporal urban data, offering a scalable framework
adaptable to diverse urban environments.
[LINK]
http://arxiv.org/abs/2407.17642v1
[DATE]
2024-07-25 05:10:34+08:00
[CATEGORIES]
cs.LG
AgentKit: Structured LLM Reasoning with Dynamic Graphs
[AUTHORS]
Yue Wu, Yewen Fan, So Yeon Min, Shrimai Prabhumoye, Stephen McAleer, Yonatan Bisk, Ruslan Salakhutdinov, Yuanzhi Li, Tom Mitchell
[ABSTRACT]
We propose an intuitive LLM prompting framework (AgentKit) for
multifunctional agents. AgentKit offers a unified framework for explicitly
constructing a complex “thought process” from simple natural language prompts.
The basic building block in AgentKit is a node, containing a natural language
prompt for a specific subtask. The user then puts together chains of nodes,
like stacking LEGO pieces. The chains of nodes can be designed to explicitly
enforce a naturally structured “thought process”. For example, for the task of
writing a paper, one may start with the thought process of 1) identify a core
message, 2) identify prior research gaps, etc. The nodes in AgentKit can be
designed and combined in different ways to implement multiple advanced
capabilities including on-the-fly hierarchical planning, reflection, and
learning from interactions. In addition, owing to its modular nature and an
intuitive design that simulates an explicit human thought process, a basic agent
can be implemented as simply as a list of prompts for the subtasks and
can therefore be designed and tuned by someone without any programming
experience. Quantitatively, we show that agents designed through AgentKit
achieve SOTA performance on WebShop and Crafter. These advances underscore
AgentKit’s potential in making LLM agents effective and accessible for a wider
range of applications. https://github.com/holmeswww/AgentKit
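
The node-and-chain idea can be pictured with a small sketch. It does not use
the AgentKit API: `run_graph`, the node names, and the stub `echo_llm` are all
hypothetical, and a real agent would call a language model where the stub is.
Each node holds a prompt for one subtask, and nodes are evaluated in
dependency order so that later prompts see earlier outputs.

```python
from graphlib import TopologicalSorter


def run_graph(prompts, depends_on, llm):
    """Evaluate prompt nodes in dependency order.

    prompts:    node name -> prompt text for that subtask
    depends_on: node name -> set of nodes whose outputs it needs
    llm:        callable(str) -> str; a stand-in for a real model call
    """
    outputs = {}
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        # Prepend the outputs of prerequisite nodes as context.
        context = "\n".join(outputs[d] for d in sorted(depends_on.get(name, ())))
        outputs[name] = llm(f"{context}\n{prompts[name]}".strip())
    return order, outputs


prompts = {
    "core_message": "Identify a core message.",
    "research_gaps": "Identify prior research gaps.",
    "outline": "Draft an outline.",
}
depends_on = {
    "core_message": set(),
    "research_gaps": {"core_message"},
    "outline": {"core_message", "research_gaps"},
}


def echo_llm(prompt):
    # Stub model: echoes the last line of the prompt it was given.
    return f"<answer to: {prompt.splitlines()[-1]}>"


order, outputs = run_graph(prompts, depends_on, echo_llm)
print(order)
```

Because the graph is plain data, reordering or adding subtasks only means
editing the two dictionaries.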
[LINK]
http://arxiv.org/abs/2404.11483v2
[DATE]
2024-07-25 04:53:10+08:00
[CATEGORIES]
cs.LG
Neural Clamping: Joint Input Perturbation and Temperature Scaling for Neural Network Calibration
[AUTHORS]
Yung-Chen Tang, Pin-Yu Chen, Tsung-Yi Ho
[ABSTRACT]
Neural network calibration is an essential task in deep learning to ensure
consistency between the confidence of model prediction and the true correctness
likelihood. In this paper, we propose a new post-processing calibration method
called Neural Clamping, which employs a simple joint input-output
transformation on a pre-trained classifier via a learnable universal input
perturbation and an output temperature scaling parameter. Moreover, we provide
theoretical explanations on why Neural Clamping is provably better than
temperature scaling. Evaluated on BloodMNIST, CIFAR-100, and ImageNet image
recognition datasets and a variety of deep neural network models, our empirical
results show that Neural Clamping significantly outperforms state-of-the-art
post-processing calibration methods. The code is available at
github.com/yungchentang/NCToolkit, and the demo is available at
huggingface.co/spaces/TrustSafeAI/NCTV.
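
The joint input-output transformation can be sketched as
softmax(f(x + delta) / T), where f is the frozen classifier, delta is one
perturbation shared by all inputs, and T is the temperature. The code below
is an illustration only and does not use the NCToolkit API: the toy linear
classifier and the values of delta and T are hypothetical, and the step that
learns delta and T on a calibration set is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clamped_probs(logit_fn, x, delta, temperature):
    """softmax(f(x + delta) / T) with a shared perturbation and temperature."""
    perturbed = [xi + di for xi, di in zip(x, delta)]
    return softmax([z / temperature for z in logit_fn(perturbed)])


def toy_logits(x):
    # Hypothetical frozen classifier: a fixed linear map from 2 inputs to 3 logits.
    weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


x = [1.0, 0.0]
plain = clamped_probs(toy_logits, x, delta=[0.0, 0.0], temperature=1.0)
calibrated = clamped_probs(toy_logits, x, delta=[-0.1, 0.1], temperature=2.0)
print(max(plain), max(calibrated))
```

With delta set to zero this reduces to ordinary temperature scaling, which
makes the relationship between the two methods easy to see.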
[COMMENTS]
Transactions on Machine Learning Research
[LINK]
http://arxiv.org/abs/2209.11604v2
[DATE]
2024-07-25 04:47:55+08:00
[CATEGORIES]
cs.LG
BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning
[AUTHORS]
Partha Chakraborty, Mahmoud Alfadel, Meiyappan Nagappan
[ABSTRACT]
Software bugs require developers to exert significant effort to identify and
resolve them, often consuming about one-third of their time. Bug localization,
the process of pinpointing the exact source code files that need modification,
is crucial in reducing this effort. Existing bug localization tools, typically
reliant on deep learning techniques, face limitations in cross-project
applicability and effectiveness in multi-language environments. Recent
advancements with Large Language Models (LLMs) offer detailed representations
for bug localization. However, they encounter challenges with limited context
windows and mapping accuracy. To address these issues, we propose BLAZE, an
approach that employs dynamic chunking and hard example learning. First, BLAZE
dynamically segments source code to minimize continuity loss. Then, BLAZE
fine-tunes a GPT-based model using challenging bug cases, in order to enhance
cross-project and cross-language bug localization. To support the capability of
BLAZE, we create the BEETLEBOX dataset, which comprises 26,321 bugs from 29
large and thriving open-source projects across five different programming
languages (Java, C++, Python, Go, and JavaScript). Our evaluations of BLAZE on
three benchmark datasets (BEETLEBOX, SWE-Bench, and Ye et al.) demonstrate
substantial improvements compared to six state-of-the-art baselines.
Specifically, BLAZE achieves increases of up to 120% in Top 1 accuracy, 144%
in Mean Average Precision (MAP), and 100% in Mean Reciprocal Rank (MRR). An
extensive ablation study confirms the contributions of our pipeline components
to the overall performance enhancement.
[LINK]
http://arxiv.org/abs/2407.17631v1
[DATE]
2024-07-25 04:44:36+08:00
[CATEGORIES]
cs.LG
SAfEPaTh: A System-Level Approach for Efficient Power and Thermal Estimation of Convolutional Neural Network Accelerator
[AUTHORS]
Yukai Chen, Simei Yang, Debjyoti Bhattacharjee, Francky Catthoor, Arindam Mallik
[ABSTRACT]
The design of energy-efficient, high-performance, and reliable Convolutional
Neural Network (CNN) accelerators involves significant challenges due to
complex power and thermal management issues. This paper introduces SAfEPaTh, a
novel system-level approach for accurately estimating power and temperature in
tile-based CNN accelerators. By addressing both steady-state and
transient-state scenarios, SAfEPaTh effectively captures the dynamic effects of
pipeline bubbles in interlayer pipelines, utilizing real CNN workloads for
comprehensive evaluation. Unlike traditional methods, it eliminates the need
for circuit-level simulations or on-chip measurements. Our methodology
leverages TANIA, a cutting-edge hybrid digital-analog tile-based accelerator
featuring analog-in-memory computing cores alongside digital cores. Through
rigorous simulation results using the ResNet18 model, we demonstrate SAfEPaTh’s
capability to accurately estimate power and temperature within 500 seconds,
encompassing CNN model accelerator mapping exploration and detailed power and
thermal estimations. This efficiency and accuracy make SAfEPaTh an invaluable
tool for designers, enabling them to optimize performance while adhering to
stringent power and thermal constraints. Furthermore, SAfEPaTh’s adaptability
extends its utility across various CNN models and accelerator architectures,
underscoring its broad applicability in the field. This study contributes
significantly to the advancement of energy-efficient and reliable CNN
accelerator designs, addressing critical challenges in dynamic power and
thermal management.
[LINK]
http://arxiv.org/abs/2407.17623v1
[DATE]
2024-07-25 04:29:52+08:00
[CATEGORIES]
cs.LG
Towards Neural Network based Cognitive Models of Dynamic Decision-Making by Humans
[AUTHORS]
Changyu Chen, Shashank Reddy Chirra, Maria José Ferreira, Cleotilde Gonzalez, Arunesh Sinha, Pradeep Varakantham
[ABSTRACT]
Modelling human cognitive processes in dynamic decision-making tasks has been
an endeavor in AI for a long time. Some initial works have attempted to utilize
neural networks (and large language models) but often assume one common model
for all humans and aim to emulate human behavior in aggregate. However, the
behavior of each human is distinct and heterogeneous and relies on specific past
experiences in specific tasks. To that end, we build on a well known model of
cognition, namely Instance Based Learning (IBL), that posits that decisions are
made based on similar situations encountered in the past. We propose two new
attention based neural network models to model human decision-making in dynamic
settings. We experiment with two distinct datasets gathered from human subject
experiment data, one focusing on detection of phishing email by humans and
another where humans act as attackers in a cybersecurity setting and decide on
an attack option. We conduct extensive experiments with our two neural network
models, IBL, and GPT3.5, and demonstrate that one of our neural network models
achieves the best performance in representing human decision-making. We find an
interesting trend that all models predict a human’s decision better if that
human is better at the task. We also explore explanation of human decisions
based on what our model considers important in prediction. Overall, our work
yields promising results for further use of neural networks in cognitive
modelling of human decision making. Our code is available at
https://github.com/shshnkreddy/NCM-HDM.
[LINK]
http://arxiv.org/abs/2407.17622v1
[DATE]
2024-07-25 04:28:03+08:00
[CATEGORIES]
cs.LG
Pretraining a Neural Operator in Lower Dimensions
[AUTHORS]
AmirPouya Hemmasian, Amir Barati Farimani
[ABSTRACT]
There has recently been increasing attention towards developing foundational
neural Partial Differential Equation (PDE) solvers and neural operators through
large-scale pretraining. However, unlike vision and language models that make
use of abundant and inexpensive (unlabeled) data for pretraining, these neural
solvers usually rely on simulated PDE data, which can be costly to obtain,
especially for high-dimensional PDEs. In this work, we aim to Pretrain neural
PDE solvers on Lower Dimensional PDEs (PreLowD) where data collection is the
least expensive. We evaluated the effectiveness of this pretraining strategy in
similar PDEs in higher dimensions. We use the Factorized Fourier Neural
Operator (FFNO) due to having the necessary flexibility to be applied to PDE
data of arbitrary spatial dimensions and reuse trained parameters in lower
dimensions. In addition, our work sheds light on the effect of the fine-tuning
configuration to make the most of this pretraining strategy.
[LINK]
http://arxiv.org/abs/2407.17616v1
[DATE]
2024-07-25 04:06:12+08:00
[CATEGORIES]
cs.LG
Adaptive Training of Grid-Dependent Physics-Informed Kolmogorov-Arnold Networks
[AUTHORS]
Spyros Rigas, Michalis Papachristou, Theofilos Papadopoulos, Fotios Anagnostopoulos, Georgios Alexandridis
[ABSTRACT]
Physics-Informed Neural Networks (PINNs) have emerged as a robust framework
for solving Partial Differential Equations (PDEs) by approximating their
solutions via neural networks and imposing physics-based constraints on the
loss function. Traditionally, Multilayer Perceptrons (MLPs) are the neural
network of choice, and significant progress has been made in optimizing their
training. Recently, Kolmogorov-Arnold Networks (KANs) were introduced as a
viable alternative, with the potential of offering better interpretability and
efficiency while requiring fewer parameters. In this paper, we present a fast
JAX-based implementation of grid-dependent Physics-Informed Kolmogorov-Arnold
Networks (PIKANs) for solving PDEs. We propose an adaptive training scheme for
PIKANs, incorporating known MLP-based PINN techniques, introducing an adaptive
state transition scheme to avoid loss function peaks between grid updates, and
proposing a methodology for designing PIKANs with alternative basis functions.
Through comparative experiments we demonstrate that these adaptive features
significantly enhance training efficiency and solution accuracy. Our results
illustrate the effectiveness of PIKANs in improving performance for PDE
solutions, highlighting their potential as a superior alternative in scientific
and engineering applications.
[LINK]
http://arxiv.org/abs/2407.17611v1
[DATE]
2024-07-25 03:55:08+08:00
[CATEGORIES]
cs.LG
POCKET: Pruning Random Convolution Kernels for Time Series Classification from a Feature Selection Perspective
[AUTHORS]
Shaowu Chen, Weize Sun, Lei Huang, Xiaopeng Li, Qingyuan Wang, Deepu John
[ABSTRACT]
In recent years, two competitive time series classification models, namely,
ROCKET and MINIROCKET, have garnered considerable attention due to their low
training cost and high accuracy. However, they rely on a large number of random
1-D convolutional kernels to comprehensively capture features, which is
incompatible with resource-constrained devices. Despite the development of
heuristic algorithms designed to recognize and prune redundant kernels, the
inherent time-consuming nature of evolutionary algorithms hinders efficient
evaluation. To efficiently prune models, this paper eliminates feature groups
contributing minimally to the classifier, thereby discarding the associated
random kernels without direct evaluation. To this end, we incorporate both
group-level ($l_{2,1}$-norm) and element-level ($l_2$-norm) regularizations to
the classifier, formulating the pruning challenge as a group elastic net
classification problem. An ADMM-based algorithm is initially introduced to
solve the problem, but it is computationally intensive. Building on the
ADMM-based algorithm, we then propose our core algorithm, POCKET, which
significantly speeds up the process by dividing the task into two sequential
stages. In Stage 1, POCKET utilizes dynamically varying penalties to
efficiently achieve group sparsity within the classifier, removing features
associated with zero weights and their corresponding kernels. In Stage 2, the
remaining kernels and features are used to refit an $l_2$-regularized classifier
for enhanced performance. Experimental results on diverse time series datasets
show that POCKET prunes up to 60% of kernels without a significant reduction in
accuracy and performs 11$\times$ faster than its counterparts. Our code is
publicly available at https://github.com/ShaowuChen/POCKET.
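
To illustrate how a group-level penalty removes whole feature groups (and hence their kernels), here is a minimal sketch of the closed-form proximal step for a group elastic net penalty. It is not the paper's ADMM-based algorithm or the POCKET procedure. The function names, the grouping of weights, and the values of `step`, `lam_group`, and `lam_ridge` are all hypothetical.

```python
import math

def group_elastic_net_prox(groups, step, lam_group, lam_ridge):
    """Proximal step for step * (lam_group * sum_g ||w_g||_2 + lam_ridge / 2 * ||w||_2^2).

    `groups` is a list of weight groups (assumed: one group per kernel).
    """
    out = []
    for g in groups:
        norm = math.sqrt(sum(x * x for x in g))
        # Group soft-thresholding: a group whose norm is below the
        # threshold step * lam_group is set exactly to zero.
        shrink = max(0.0, 1.0 - step * lam_group / norm) if norm > 0 else 0.0
        # The ridge term only rescales the surviving groups.
        scale = shrink / (1.0 + step * lam_ridge)
        out.append([scale * x for x in g])
    return out

def surviving_groups(groups):
    """Indices of groups that still have at least one nonzero weight."""
    return [i for i, g in enumerate(groups) if any(x != 0.0 for x in g)]

# Hypothetical classifier weights, grouped by kernel.
weights = [[3.0, 4.0], [0.1, -0.2], [0.0, 1.5]]
pruned = group_elastic_net_prox(weights, step=1.0, lam_group=1.0, lam_ridge=0.5)
kept = surviving_groups(pruned)
print(kept)
```

In this toy example the second group has a small norm and is zeroed out as a whole, so under the stated assumption its kernel could be discarded without evaluating it directly.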
[LINK]
http://arxiv.org/abs/2309.08499v4
[DATE]
2024-07-25 03:48:04+08:00
[CATEGORIES]
cs.LG
Exploring Active Learning in Meta-Learning: Enhancing Context Set Labeling
[AUTHORS]
Wonho Bae, Jing Wang, Danica J. Sutherland
[ABSTRACT]
Most meta-learning methods assume that the (very small) context set used to
establish a new task at test time is passively provided. In some settings,
however, it is feasible to actively select which points to label; the potential
gain from a careful choice is substantial, but the setting requires major
differences from typical active learning setups. We clarify the ways in which
active meta-learning can be used to label a context set, depending on which
parts of the meta-learning process use active learning. Within this framework,
we propose a natural algorithm based on fitting Gaussian mixtures for selecting
which points to label; though simple, the algorithm also has theoretical
motivation. The proposed algorithm outperforms state-of-the-art active learning
methods when used with various meta-learning algorithms across several
benchmark datasets.
[COMMENTS]
Accepted to ECCV2024
[LINK]
http://arxiv.org/abs/2311.02879v3
[DATE]
2024-07-25 03:10:52+08:00
[CATEGORIES]
cs.LG
Quality Assured: Rethinking Annotation Strategies in Imaging AI
[AUTHORS]
Tim Rädsch, Annika Reinke, Vivienn Weru, Minu D. Tizabi, Nicholas Heller, Fabian Isensee, Annette Kopp-Schneider, Lena Maier-Hein
[ABSTRACT]
This paper does not describe a novel method. Instead, it studies an essential
foundation for reliable benchmarking and ultimately real-world application of
AI-based image analysis: generating high-quality reference annotations.
Previous research has focused on crowdsourcing as a means of outsourcing
annotations. However, little attention has so far been given to annotation
companies, specifically regarding their internal quality assurance (QA)
processes. Therefore, our aim is to evaluate the influence of QA employed by
annotation companies on annotation quality and devise methodologies for
maximizing data annotation efficacy. Based on a total of 57,648 instance
segmented images obtained from a total of 924 annotators and 34 QA workers from
four annotation companies and Amazon Mechanical Turk (MTurk), we derived the
following insights: (1) Annotation companies perform better both in terms of
quantity and quality compared to the widely used platform MTurk. (2) Annotation
companies’ internal QA only provides marginal improvements, if any. However,
improving labeling instructions instead of investing in QA can substantially
boost annotation performance. (3) The benefit of internal QA depends on
specific image characteristics. Our work could enable researchers to derive
substantially more value from a fixed annotation budget and change the way
annotation companies conduct internal QA.
[COMMENTS]
Accepted at ECCV 2024, preprint, Computer Vision, Data Annotation
[LINK]
http://arxiv.org/abs/2407.17596v1
[DATE]
2024-07-25 03:02:01+08:00
[CATEGORIES]
cs.LG
Explainable AI for Engineering Design: A Unified Approach of Systems Engineering and Component-Based Deep Learning Demonstrated by Energy-Efficient Building Design
[AUTHORS]
Philipp Geyer, Manav Mahan Singh, Xia Chen
[ABSTRACT]
Data-driven models created by machine learning are gaining importance in all
fields of design and engineering. They have high potential to assist
decision-makers in creating novel artefacts with better performance and
sustainability. However, limited generalization and the black-box nature of
these models lead to limited explainability and reusability. To overcome this
situation, we propose a component-based approach to create partial component
models by machine learning (ML). This component-based approach aligns deep
learning with systems engineering (SE). The key contribution of the
component-based method is that activations at interfaces between the
components are interpretable engineering quantities. In this way, the
hierarchical component system forms a deep neural network (DNN) that a priori
integrates information for engineering explainability. The approach adapts
the model structure to engineering methods of systems engineering and to
domain knowledge. We examine the performance of the approach in the field of
energy-efficient building design. First, we observe better generalization of
the component-based method by analyzing prediction accuracy outside the
training data. Especially for representative designs that differ in structure,
we observe a much higher accuracy ($R^2$ = 0.94) compared to conventional
monolithic methods ($R^2$ = 0.71). Second, we illustrate explainability by
demonstrating, with examples, how sensitivity information from SE and rules
from low-depth decision trees serve engineering. Third, we evaluate
explainability by qualitative and quantitative methods, demonstrating the
matching of preliminary knowledge and data-driven derived strategies, and show
the correctness of activations at component interfaces compared to white-box
simulation results (envelope components: $R^2$ = 0.92..0.99; zones:
$R^2$ = 0.78..0.93).
[COMMENTS]
20 pages
[LINK]
http://arxiv.org/abs/2108.13836v6
[DATE]
2024-07-25 02:42:07+08:00
[CATEGORIES]
cs.LG
Generating Explanations for Cellular Neural Networks
[AUTHORS]
Akshit Sinha, Sreeram Vennam, Charu Sharma, Ponnurangam Kumaraguru
[ABSTRACT]
Recent advancements in graph learning contributed to explaining predictions
generated by Graph Neural Networks. However, existing methodologies often fall
short when applied to real-world datasets. We introduce HOGE, a framework to
capture higher-order structures using cell complexes, which excel at modeling
higher-order relationships. In the real world, higher-order structures are
ubiquitous, as in molecules or social networks; thus our work significantly
enhances the practical applicability of graph explanations. HOGE produces
clearer and more accurate explanations compared to prior methods. Our method
can be integrated with all existing graph explainers, ensuring seamless
integration into current frameworks. Evaluated on GraphXAI benchmark
datasets, HOGE achieves improved or comparable performance with minimal
computational overhead. Ablation studies show that the performance gain
observed can be attributed to the higher-order structures that come from
introducing cell complexes.
[LINK]
http://arxiv.org/abs/2406.03253v3
[DATE]
2024-07-25 02:22:22+08:00
[CATEGORIES]
cs.LG
No Free Prune: Information-Theoretic Barriers to Pruning at Initialization
[AUTHORS]
Tanishq Kumar, Kevin Luo, Mark Sellke
[ABSTRACT]
The existence of “lottery tickets” arXiv:1803.03635 at or near initialization
raises the tantalizing question of whether large models are necessary in deep
learning, or whether sparse networks can be quickly identified and trained
without ever training the dense models that contain them. However, efforts to
find these sparse subnetworks without training the dense model (“pruning at
initialization”) have been broadly unsuccessful arXiv:2009.08576. We put
forward a theoretical explanation for this, based on the model’s effective
parameter count, $p_\text{eff}$, given by the sum of the number of non-zero
weights in the final network and the mutual information between the sparsity
mask and the data. We show the Law of Robustness of arXiv:2105.12806 extends to
sparse networks with the usual parameter count replaced by $p_\text{eff}$,
meaning a sparse neural network which robustly interpolates noisy data requires
a heavily data-dependent mask. We posit that pruning during and after training
outputs masks with higher mutual information than those produced by pruning at
initialization. Thus two networks may have the same sparsities, but differ in
effective parameter count based on how they were trained. This suggests that
pruning near initialization may be infeasible and explains why lottery tickets
exist, but cannot be found fast (i.e. without training the full network).
Experiments on neural networks confirm that information gained during training
may indeed affect model capacity.
[LINK]
http://arxiv.org/abs/2402.01089v2
[DATE]
2024-07-25 02:05:45+08:00
[CATEGORIES]
cs.LG
Traversing Pareto Optimal Policies: Provably Efficient Multi-Objective Reinforcement Learning
[AUTHORS]
Shuang Qiu, Dake Zhang, Rui Yang, Boxiang Lyu, Tong Zhang
[ABSTRACT]
This paper investigates multi-objective reinforcement learning (MORL), which
focuses on learning Pareto optimal policies in the presence of multiple reward
functions. Despite MORL’s significant empirical success, there is still a lack
of satisfactory understanding of various MORL optimization targets and
efficient learning algorithms. Our work offers a systematic analysis of several
optimization targets to assess their abilities to find all Pareto optimal
policies and controllability over learned policies by the preferences for
different objectives. We then identify Tchebycheff scalarization as a favorable
scalarization method for MORL. Considering the non-smoothness of Tchebycheff
scalarization, we reformulate its minimization problem into a new min-max-max
optimization problem. Then, for the stochastic policy class, we propose
efficient algorithms using this reformulation to learn Pareto optimal policies.
We first propose an online UCB-based algorithm to achieve an $\varepsilon$
learning error with an $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample
complexity for a single given preference. To further reduce the cost of
environment exploration under different preferences, we propose a
preference-free framework that first explores the environment without
pre-defined preferences and then generates solutions for any number of
preferences. We prove that it only requires an
$\tilde{\mathcal{O}}(\varepsilon^{-2})$ exploration complexity in the
exploration phase and demands no additional exploration afterward. Lastly, we
analyze the smooth Tchebycheff scalarization, an extension of Tchebycheff
scalarization, which is proved to be more advantageous in distinguishing the
Pareto optimal policies from other weakly Pareto optimal policies based on
entry values of preference vectors. Furthermore, we extend our algorithms and
theoretical analysis to accommodate this optimization target.
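
As an illustration of why the choice of optimization target matters for finding all Pareto optimal policies, here is a minimal sketch comparing linear scalarization with Tchebycheff scalarization. It assumes the standard form $\max_i \lambda_i (z_i - V_i)$ with a reference point $z$; the paper's exact definition may differ. The candidate value vectors, the reference point, and the preference weights are hypothetical.

```python
def tchebycheff(values, prefs, ideal):
    """Weighted worst-case gap to the reference point (to be minimized)."""
    return max(l * (z - v) for l, z, v in zip(prefs, ideal, values))

def linear(values, prefs):
    """Weighted sum of objective values (to be maximized)."""
    return sum(l * v for l, v in zip(prefs, values))

# Hypothetical value vectors of three policies under two reward functions.
# None dominates another, but C lies in a non-convex part of the front.
candidates = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}
ideal = (1.0, 1.0)   # assumed reference point: best achievable value per objective
prefs = (0.5, 0.5)   # assumed preference vector

best_tch = min(candidates, key=lambda k: tchebycheff(candidates[k], prefs, ideal))
best_lin = max(candidates, key=lambda k: linear(candidates[k], prefs))

# Sweep the preference weight: does linear scalarization ever select C?
linear_winners = set()
for w in range(11):
    p = (w / 10, 1 - w / 10)
    linear_winners.add(max(candidates, key=lambda k: linear(candidates[k], p)))

print(best_tch, best_lin, sorted(linear_winners))
```

In this toy setting no weighting in the sweep makes linear scalarization select policy C, while Tchebycheff scalarization with equal preferences does select it.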
[COMMENTS]
Initially submitted in May 2024
[LINK]
http://arxiv.org/abs/2407.17466v1
[DATE]
2024-07-25 01:58:49+08:00
[CATEGORIES]
cs.LG
u-$μ$P: The Unit-Scaled Maximal Update Parametrization
[AUTHORS]
Charlie Blake, Constantin Eichenberg, Josef Dean, Lukas Balles, Luke Y. Prince, Björn Deiseroth, Andres Felipe Cruz-Salinas, Carlo Luschi, Samuel Weinbach, Douglas Orr
[ABSTRACT]
The Maximal Update Parametrization ($\mu$P) aims to make the optimal
hyperparameters (HPs) of a model independent of its size, allowing them to be
swept using a cheap proxy model rather than the full-size target model. We
present a new scheme, u-$\mu$P, which improves upon $\mu$P by combining it with
Unit Scaling, a method for designing models that makes them easy to train in
low-precision. The two techniques have a natural affinity: $\mu$P ensures that
the scale of activations is independent of model size, and Unit Scaling ensures
that activations, weights and gradients begin training with a scale of one.
This synthesis opens the door to a simpler scheme, whose default values are
near-optimal. This in turn facilitates a more efficient sweeping strategy, with
u-$\mu$P models reaching a lower loss than comparable $\mu$P models and working
out-of-the-box in FP8.
[COMMENTS]
48 pages
[LINK]
http://arxiv.org/abs/2407.17465v1
[DATE]
2024-07-25 01:58:42+08:00
[CATEGORIES]
cs.LG
SoNIC: Safe Social Navigation with Adaptive Conformal Inference and Constrained Reinforcement Learning
[AUTHORS]
Jianpeng Yao, Xiaopan Zhang, Yu Xia, Zejin Wang, Amit K. Roy-Chowdhury, Jiachen Li
[ABSTRACT]
Reinforcement Learning (RL) has enabled social robots to generate
trajectories without human-designed rules or interventions, which makes it more
effective than hard-coded systems for generalizing to complex real-world
scenarios. However, social navigation is a safety-critical task that requires
robots to avoid collisions with pedestrians while previous RL-based solutions
fall short in safety performance in complex environments. To enhance the safety
of RL policies, to the best of our knowledge, we propose the first algorithm,
SoNIC, that integrates adaptive conformal inference (ACI) with constrained
reinforcement learning (CRL) to learn safe policies for social navigation. More
specifically, our method augments RL observations with ACI-generated
nonconformity scores and provides explicit guidance for agents to leverage the
uncertainty metrics to avoid safety-critical areas by incorporating safety
constraints with spatial relaxation. Our method outperforms state-of-the-art
baselines in terms of both safety and adherence to social norms by a large
margin and demonstrates much stronger robustness to out-of-distribution
scenarios. Our code and video demos are available on our project website:
https://sonic-social-nav.github.io/.
[COMMENTS]
Project website: https://sonic-social-nav.github.io/
[LINK]
http://arxiv.org/abs/2407.17460v1
[DATE]
2024-07-25 01:57:21+08:00
[CATEGORIES]
cs.LG
Hidden or Inferred: Fair Learning-To-Rank with Unknown Demographics
[AUTHORS]
Oluseun Olulana, Kathleen Cachel, Fabricio Murai, Elke Rundensteiner
[ABSTRACT]
As learning-to-rank models are increasingly deployed for decision-making in
areas with profound life implications, the FairML community has been developing
fair learning-to-rank (LTR) models. These models rely on the availability of
sensitive demographic features such as race or sex. However, in practice,
regulatory obstacles and privacy concerns protect this data from collection and
use. As a result, practitioners may either need to promote fairness despite the
absence of these features or turn to demographic inference tools to attempt to
infer them. Given that these tools are fallible, this paper aims to further
understand how errors in demographic inference impact the fairness performance
of popular fair LTR strategies. In which cases would it be better to keep such
demographic attributes hidden from models versus infer them? We examine a
spectrum of fair LTR strategies ranging from fair LTR with and without
demographic features hidden versus inferred to fairness-unaware LTR followed by
fair re-ranking. We conduct a controlled empirical investigation modeling
different levels of inference errors by systematically perturbing the inferred
sensitive attribute. We also perform three case studies with real-world
datasets and popular open-source inference methods. Our findings reveal that as
inference noise grows, LTR-based methods that incorporate fairness
considerations into the learning process may increase bias. In contrast, fair
re-ranking strategies are more robust to inference errors. All source code,
data, and experimental artifacts of our experimental study are available here:
https://github.com/sewen007/hoiltr.git
[COMMENTS]
This paper has been accepted by AAAI/AIES to the AIES 2024 conference
[LINK]
http://arxiv.org/abs/2407.17459v1
[DATE]
2024-07-25 01:54:07+08:00
[CATEGORIES]
cs.LG
Investigating Resource-efficient Neutron/Gamma Classification ML Models Targeting eFPGAs
[AUTHORS]
Jyothisraj Johnson, Billy Boxer, Tarun Prakash, Carl Grace, Peter Sorensen, Mani Tripathi
[ABSTRACT]
There has been considerable interest and resulting progress in implementing
machine learning (ML) models in hardware over the last several years from the
particle and nuclear physics communities. A big driver has been the release of
the Python package, hls4ml, which has enabled porting models specified and
trained using Python ML libraries to register transfer level (RTL) code. So
far, the primary end targets have been commercial FPGAs or synthesized custom
blocks on ASICs. However, recent developments in open-source embedded FPGA
(eFPGA) frameworks now provide an alternate, more flexible pathway for
implementing ML models in hardware. These customized eFPGA fabrics can be
integrated as part of an overall chip design. In general, the decision between
a fully custom, eFPGA, or commercial FPGA ML implementation will depend on the
details of the end-use application. In this work, we explored the parameter
space for eFPGA implementations of fully-connected neural network (fcNN) and
boosted decision tree (BDT) models using the task of neutron/gamma
classification with a specific focus on resource efficiency. We used data
collected using an AmBe sealed source incident on Stilbene, which was optically
coupled to an OnSemi J-series SiPM to generate training and test data for this
study. We investigated relevant input features and the effects of
bit-resolution and sampling rate as well as trade-offs in hyperparameters for
both ML architectures while tracking total resource usage. The performance
metric used to track model performance was the calculated neutron efficiency at
a gamma leakage of 10$^{-3}$. The results of the study will be used to aid the
specification of an eFPGA fabric, which will be integrated as part of a test
chip.
[LINK]
http://arxiv.org/abs/2404.14436v2
[DATE]
2024-07-25 01:26:07+08:00
[CATEGORIES]
cs.LG
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
[AUTHORS]
Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, Dahua Lin
[ABSTRACT]
Human image animation involves generating videos from a character photo,
allowing user control and unlocking potential for video and movie production.
While recent approaches yield impressive results using high-quality training
data, the inaccessibility of these datasets hampers fair and transparent
benchmarking. Moreover, these approaches prioritize 2D human motion and
overlook the significance of camera motions in videos, leading to limited
control and unstable video generation. To demystify the training data, we
present HumanVid, the first large-scale high-quality dataset tailored for human
image animation, which combines crafted real-world and synthetic data. For the
real-world data, we compile a vast collection of copyright-free real-world
videos from the internet. Through a carefully designed rule-based filtering
strategy, we ensure the inclusion of high-quality videos, resulting in a
collection of 20K human-centric videos in 1080P resolution. Human and camera
motion annotation is accomplished using a 2D pose estimator and a SLAM-based
method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets
to augment existing available 3D assets. Notably, we introduce a rule-based
camera trajectory generation method, enabling the synthetic pipeline to
incorporate diverse and precise camera motion annotation, which can rarely be
found in real-world data. To verify the effectiveness of HumanVid, we establish
a baseline model named CamAnimate, short for Camera-controllable Human
Animation, that considers both human and camera motions as conditions. Through
extensive experimentation, we demonstrate that such a simple baseline trained
on our HumanVid achieves state-of-the-art performance in controlling both human
pose and camera motions, setting a new benchmark. Code and data will be
publicly available at https://github.com/zhenzhiwang/HumanVid/.
[COMMENTS]
camera controllable human image animation, a dataset and a baseline
[LINK]
http://arxiv.org/abs/2407.17438v1
[DATE]
2024-07-25 01:15:58+08:00
[CATEGORIES]
cs.LG
Solving Deep Reinforcement Learning Tasks with Evolution Strategies and Linear Policy Networks
[AUTHORS]
Annie Wong, Jacob de Nobel, Thomas Bäck, Aske Plaat, Anna V. Kononova
[ABSTRACT]
Although deep reinforcement learning methods can learn effective policies for
challenging problems such as Atari games and robotics tasks, algorithms are
complex, and training times are often long. This study investigates how
Evolution Strategies perform compared to gradient-based deep reinforcement
learning methods. We use Evolution Strategies to optimize the weights of a
neural network via neuroevolution, performing direct policy search. We
benchmark both deep policy networks and networks consisting of a single linear
layer from observations to actions for three gradient-based methods, such as
Proximal Policy Optimization. These methods are evaluated against three
classical Evolution Strategies and Augmented Random Search, which all use
linear policy networks. Our results reveal that Evolution Strategies can find
effective linear policies for many reinforcement learning benchmark tasks,
unlike deep reinforcement learning methods that can only find successful
policies using much larger networks, suggesting that current benchmarks are
easier to solve than previously assumed. Interestingly, Evolution Strategies
also achieve results comparable to gradient-based deep reinforcement learning
algorithms for higher-complexity tasks. Furthermore, we find that by directly
accessing the memory state of the game, Evolution Strategies can find
successful policies in Atari that outperform the policies found by Deep
Q-Learning. Evolution Strategies also outperform Augmented Random Search in
most benchmarks, demonstrating superior sample efficiency and robustness in
training linear policy networks.
[LINK]
http://arxiv.org/abs/2402.06912v2
[DATE]
2024-07-25 01:15:44+08:00
[CATEGORIES]
cs.LG
Nerva: a Truly Sparse Implementation of Neural Networks
[AUTHORS]
Wieger Wesselink, Bram Grooten, Qiao Xiao, Cassio de Campos, Mykola Pechenizkiy
[ABSTRACT]
We introduce Nerva, a fast neural network library under development in C++.
It supports sparsity by using the sparse matrix operations of Intel’s Math
Kernel Library (MKL), which eliminates the need for binary masks. We show that
Nerva significantly decreases training time and memory usage while reaching
equivalent accuracy to PyTorch. We run static sparse experiments with an MLP on
CIFAR-10. On high sparsity levels like $99\%$, the runtime is reduced by a
factor of $4\times$ compared to a PyTorch model using masks. Similar to other
popular frameworks such as PyTorch and Keras, Nerva offers a Python interface
for users to work with.
[COMMENTS]
The Nerva library is available at https://github.com/wiegerw/nerva
[LINK]
http://arxiv.org/abs/2407.17437v1
[DATE]
2024-07-25 01:13:31+08:00
[CATEGORIES]
cs.LG
Proof-of-Collaborative-Learning: A Multi-winner Federated Learning Consensus Algorithm
[AUTHORS]
Amirreza Sokhankhosh, Sara Rouhani
[ABSTRACT]
Regardless of their variations, blockchains require a consensus mechanism to
validate transactions, supervise added blocks, maintain network security,
synchronize the network state, and distribute incentives. Proof-of-Work (PoW),
one of the most influential implementations of consensus mechanisms, consumes
an extraordinary amount of energy for a task that lacks direct productive
output. In this paper, we propose Proof-of-Collaborative-Learning (PoCL), a
multi-winner federated learning validated consensus mechanism that redirects
the computation power of blockchains to train federated learning models. In
addition, we present a novel evaluation mechanism to ensure the efficiency of
the locally trained models of miners. We evaluated the security of our
evaluation mechanism by introducing and conducting probable attacks. Moreover,
we present a novel reward distribution mechanism to incentivize winning miners
fairly, and demonstrate that our reward system is fair both within and across
all rounds.
[COMMENTS]
8 pages. Accepted at the 7th IEEE International Conference on
Blockchain (Blockchain 2024)
[LINK]
http://arxiv.org/abs/2407.13018v2
[DATE]
2024-07-25 01:04:35+08:00
[CATEGORIES]
cs.LG
The Elements of Differentiable Programming
[AUTHORS]
Mathieu Blondel, Vincent Roulet
[ABSTRACT]
Artificial intelligence has recently experienced remarkable advances, fueled
by large models, vast datasets, accelerated hardware, and, last but not least,
the transformative power of differentiable programming. This new programming
paradigm enables end-to-end differentiation of complex computer programs
(including those with control flows and data structures), making gradient-based
optimization of program parameters possible. As an emerging paradigm,
differentiable programming builds upon several areas of computer science and
applied mathematics, including automatic differentiation, graphical models,
optimization and statistics. This book presents a comprehensive review of the
fundamental concepts useful for differentiable programming. We adopt two main
perspectives, that of optimization and that of probability, with clear
analogies between the two. Differentiable programming is not merely the
differentiation of programs, but also the thoughtful design of programs
intended for differentiation. By making programs differentiable, we inherently
introduce probability distributions over their execution, providing a means to
quantify the uncertainty associated with program outputs.
[COMMENTS]
Draft version 2
[LINK]
http://arxiv.org/abs/2403.14606v2
[DATE]
2024-07-25 00:56:17+08:00
[CATEGORIES]
cs.LG
Efficient Unbiased Sparsification
[AUTHORS]
Leighton Barnes, Stephen Cameron, Timothy Chow, Emma Cohen, Keith Frankston, Benjamin Howard, Fred Kochman, Daniel Scheinerman, Jeffrey VanderKam
[ABSTRACT]
An unbiased $m$-sparsification of a vector $p\in \mathbb{R}^n$ is a random
vector $Q\in \mathbb{R}^n$ with mean $p$ that has at most $m<n$ nonzero
coordinates. Unbiased sparsification compresses the original vector without
introducing bias; it arises in various contexts, such as in federated learning
and sampling sparse probability distributions. Ideally, unbiased sparsification
should also minimize the expected value of a divergence function
$\mathsf{Div}(Q,p)$ that measures how far away $Q$ is from the original $p$. If
$Q$ is optimal in this sense, then we call it efficient. Our main results
describe efficient unbiased sparsifications for divergences that are either
permutation-invariant or additively separable. Surprisingly, the
characterization for permutation-invariant divergences is robust to the choice
of divergence function, in the sense that our class of optimal $Q$ for squared
Euclidean distance coincides with our class of optimal $Q$ for Kullback-Leibler
divergence, or indeed any of a wide variety of divergences.
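To make the definition concrete, here is a minimal sketch of a naive unbiased $m$-sparsification: keep $m$ coordinates chosen uniformly at random and rescale them by $n/m$. This is only a baseline that satisfies the mean-$p$ and at-most-$m$-nonzeros conditions; it is not the efficient (divergence-minimizing) construction the abstract refers to. The function names `uniform_sparsify` and `exact_mean` are hypothetical.

```python
import itertools
import random
from fractions import Fraction


def uniform_sparsify(p, m, rng=random):
    """Keep m coordinates chosen uniformly at random, rescaled by n/m.

    Naive baseline (an assumption for illustration), not the paper's method.
    """
    n = len(p)
    kept = set(rng.sample(range(n), m))
    return [p[i] * n / m if i in kept else 0.0 for i in range(n)]


def exact_mean(p, m):
    """Average the sparsified vector over every equally likely size-m subset."""
    n = len(p)
    subsets = list(itertools.combinations(range(n), m))
    total = [Fraction(0)] * n
    for kept in subsets:
        for i in kept:
            # Exact rational arithmetic, so unbiasedness can be checked exactly.
            total[i] += Fraction(p[i]) * Fraction(n, m)
    return [t / len(subsets) for t in total]
```

Each coordinate is kept with probability $m/n$ and scaled by $n/m$, so its expectation equals the original entry; `exact_mean` checks this by enumerating all subsets.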
[LINK]
http://arxiv.org/abs/2402.14925v2
[DATE]
2024-07-25 00:54:33+08:00
[CATEGORIES]
cs.LG
Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?
[AUTHORS]
Michael-Andrei Panaitescu-Liess, Zora Che, Bang An, Yuancheng Xu, Pankayaraj Pathmanathan, Souradip Chakraborty, Sicheng Zhu, Tom Goldstein, Furong Huang
[ABSTRACT]
Large Language Models (LLMs) have demonstrated impressive capabilities in
generating diverse and contextually rich text. However, concerns regarding
copyright infringement arise as LLMs may inadvertently produce copyrighted
material. In this paper, we first investigate the effectiveness of watermarking
LLMs as a deterrent against the generation of copyrighted texts. Through
theoretical analysis and empirical evaluation, we demonstrate that
incorporating watermarks into LLMs significantly reduces the likelihood of
generating copyrighted content, thereby addressing a critical concern in the
deployment of LLMs. Additionally, we explore the impact of watermarking on
Membership Inference Attacks (MIAs), which aim to discern whether a sample was
part of the pretraining dataset and may be used to detect copyright violations.
Surprisingly, we find that watermarking adversely affects the success rate of
MIAs, complicating the task of detecting copyrighted text in the pretraining
dataset. Finally, we propose an adaptive technique to improve the success rate
of a recent MIA under watermarking. Our findings underscore the importance of
developing adaptive methods to study critical problems in LLMs with potential
legal implications.
[COMMENTS]
21 pages, 6 figures
[LINK]
http://arxiv.org/abs/2407.17417v1
[DATE]
2024-07-25 00:53:09+08:00
[CATEGORIES]
cs.LG
Self-driving lab discovers principles for steering spontaneous emission
[AUTHORS]
Saaketh Desai, Sadhvikas Addamane, Jeffery Y. Tsao, Igal Brener, Remi Dingreville, Prasad P. Iyer
[ABSTRACT]
We developed an autonomous experimentation platform to accelerate
interpretable scientific discovery in ultrafast nanophotonics, targeting a
novel method to steer spontaneous emission from reconfigurable semiconductor
metasurfaces. Controlling spontaneous emission is crucial for clean-energy
solutions in illumination, thermal radiation engineering, and remote sensing.
Despite the potential of reconfigurable semiconductor metasurfaces with
embedded sources for spatiotemporal control, achieving arbitrary far-field
control remains challenging. Here, we present a self-driving lab (SDL) platform
that addresses this challenge by discovering the governing equations for
predicting the far-field emission profile from light-emitting metasurfaces. We
discover that both the spatial gradient (grating-like) and the curvature
(lens-like) of the local refractive index are key factors in steering
spontaneous emission. The SDL employs a machine-learning framework comprising:
(1) a variational autoencoder for generating complex spatial refractive index
profiles, (2) an active learning agent for guiding experiments with real-time
closed-loop feedback, and (3) a neural network-based equation learner to
uncover structure-property relationships. The SDL demonstrated a four-fold
enhancement in peak emission directivity (up to 77%) over a 72° field of
view within ~300 experiments. Our findings reveal that combinations of positive
gratings and lenses are as effective as negative lenses and gratings for all
emission angles, offering a novel strategy for controlling spontaneous emission
beyond conventional Fourier optics.
[COMMENTS]
25 pages, 4 figures in main text, 5 figures in supplementary
information
[LINK]
http://arxiv.org/abs/2407.16083v2
[DATE]
2024-07-25 00:45:29+08:00
[CATEGORIES]
cs.LG
Sparks of Quantum Advantage and Rapid Retraining in Machine Learning
[AUTHORS]
William Troy
[ABSTRACT]
The advent of quantum computing holds the potential to revolutionize various
fields by solving complex problems more efficiently than classical computers.
Despite this promise, practical quantum advantage is hindered by current
hardware limitations, notably the small number of qubits and high noise levels.
In this study, we leverage adiabatic quantum computers to optimize
Kolmogorov-Arnold Networks, a powerful neural network architecture for
representing complex functions with minimal parameters. By modifying the
network to use Bezier curves as the basis functions and formulating the
optimization problem into a Quadratic Unconstrained Binary Optimization
problem, we create a fixed-sized solution space, independent of the number of
training samples. Our approach demonstrates sparks of quantum advantage through
faster training times compared to classical optimizers such as Adam,
Stochastic Gradient Descent, Adaptive Gradient, and simulated annealing.
Additionally, we introduce a novel rapid retraining capability, enabling the
network to be retrained with new data without reprocessing old samples, thus
enhancing learning efficiency in dynamic environments. Experimental results on
initial training of classification and regression tasks validate the efficacy
of our approach, showcasing significant speedups and comparable performance to
classical methods. Experiments on retraining demonstrate a sixty-fold
speedup using adiabatic-quantum-computing-based optimization compared to
gradient-descent-based optimizers, with theoretical models allowing this
speedup to be even larger. Our findings suggest that with further advancements
in quantum hardware and algorithm optimization, quantum-optimized machine
learning models could have broad applications across various domains, with
initial focus on rapid retraining.
[COMMENTS]
Fixed figure 2 in v2
[LINK]
http://arxiv.org/abs/2407.16020v2
[DATE]
2024-07-25 00:23:55+08:00
[CATEGORIES]
cs.LG
MELTing point: Mobile Evaluation of Language Transformers
[AUTHORS]
Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi
[ABSTRACT]
Transformers have revolutionized the machine learning landscape, gradually
making their way into everyday tasks and equipping our computers with “sparks
of intelligence”. However, their runtime requirements have prevented them from
being broadly deployed on mobile. As personal devices become increasingly
powerful and prompt privacy becomes an ever more pressing issue, we explore the
current state of mobile execution of Large Language Models (LLMs). To achieve
this, we have created our own automation infrastructure, MELT, which supports
the headless execution and benchmarking of LLMs on device, supporting different
models, devices and frameworks, including Android, iOS and Nvidia Jetson
devices. We evaluate popular instruction fine-tuned LLMs and leverage different
frameworks to measure their end-to-end and granular performance, tracing their
memory and energy requirements along the way. Our analysis is the first
systematic study of on-device LLM execution, quantifying performance, energy
efficiency and accuracy across various state-of-the-art models and showcases
the state of on-device intelligence in the era of hyperscale models. Results
highlight the performance heterogeneity across targets and corroborate that
LLM inference is largely memory-bound. Quantization drastically reduces memory
requirements and renders execution viable, but at a non-negligible accuracy
cost. Drawing from its energy footprint and thermal behavior, the continuous
execution of LLMs remains elusive, as both factors negatively affect user
experience. Last, our experience shows that the ecosystem is still in its
infancy, and algorithmic as well as hardware breakthroughs can significantly
shift the execution cost. We expect NPU acceleration and framework-hardware
co-design to be the biggest bet towards efficient standalone execution, with
the alternative of offloading tailored towards edge deployments.
[COMMENTS]
Accepted at the 30th Annual International Conference On Mobile
Computing And Networking (MobiCom 2024)
[LINK]
http://arxiv.org/abs/2403.12844v3
[DATE]
2024-07-25 00:17:22+08:00
[CATEGORIES]
cs.LG
Five reasons against assuming a data-generating distribution in Machine Learning
[AUTHORS]
Benedikt Höltgen, Robert C. Williamson
[ABSTRACT]
Machine Learning research, as most of Statistics, heavily relies on the
concept of a data-generating probability distribution. As data points are
thought to be sampled from such a distribution, we can learn from observed data
about this distribution and, thus, predict future data points drawn from it
(with some probability of success). Drawing on scholarship across disciplines,
we here argue that this framework is not always a good model. Not only do such
true probability distributions not exist; the framework can also be misleading
and obscure both the choices made and the goals pursued in machine learning
practice. We suggest an alternative framework that focuses on finite
populations rather than abstract distributions; while classical learning theory
can be left almost unchanged, it opens new opportunities, especially to model
sampling. We compile these considerations into five reasons for modelling
machine learning – in some settings – with finite distributions rather than
generative distributions, both to be more faithful to practice and to provide
novel theoretical insights.
[COMMENTS]
Presented at the Humans, Algorithmic Decision-Making and Society
Workshop at ICML 2024
[LINK]
http://arxiv.org/abs/2407.17395v1
[DATE]
2024-07-25 00:17:14+08:00
[CATEGORIES]
cs.LG
Causal Discovery over High-Dimensional Structured Hypothesis Spaces with Causal Graph Partitioning
[AUTHORS]
Ashka Shah, Adela DePavia, Nathaniel Hudson, Ian Foster, Rick Stevens
[ABSTRACT]
The aim in many sciences is to understand the mechanisms that underlie the
observed distribution of variables, starting from a set of initial hypotheses.
Causal discovery allows us to infer mechanisms as sets of cause and effect
relationships in a generalized way – without necessarily tailoring to a
specific domain. Causal discovery algorithms search over a structured
hypothesis space, defined by the set of directed acyclic graphs, to find the
graph that best explains the data. For high-dimensional problems, however, this
search becomes intractable and scalable algorithms for causal discovery are
needed to bridge the gap. In this paper, we define a novel causal graph
partition that allows for divide-and-conquer causal discovery with theoretical
guarantees. We leverage the idea of a superstructure – a set of learned or
existing candidate hypotheses – to partition the search space. We prove under
certain assumptions that learning with a causal graph partition always yields
the Markov Equivalence Class of the true causal graph. We show our algorithm
achieves comparable accuracy and a faster time to solution for
biologically-tuned synthetic networks and networks up to ${10^4}$ variables.
This makes our method applicable to gene regulatory network inference and other
domains with high-dimensional structured hypothesis spaces.
[LINK]
http://arxiv.org/abs/2406.06348v2
[DATE]
2024-07-25 00:13:45+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Benedikt Höltgen, Robert C. Williamson
[COMMENTS]
Presented at the Humans, Algorithmic Decision-Making and Society
Workshop at ICML 2024
[LINK]
http://arxiv.org/abs/2407.17385v1
[DATE]
2024-07-25 00:07:57+08:00
[CATEGORIES]
cs.LG
MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms
[AUTHORS]
Yiqiao Jin, Minje Choi, Gaurav Verma, Jindong Wang, Srijan Kumar
[ABSTRACT]
Social media platforms are hubs for multimodal information exchange,
encompassing text, images, and videos, making it challenging for machines to
comprehend the information or emotions associated with interactions in online
spaces. Multimodal Large Language Models (MLLMs) have emerged as a promising
solution to these challenges, yet they struggle to accurately interpret human
emotions and complex content such as misinformation. This paper introduces
MM-Soc, a comprehensive benchmark designed to evaluate MLLMs’ understanding of
multimodal social media content. MM-Soc compiles prominent multimodal datasets
and incorporates a novel large-scale YouTube tagging dataset, targeting a range
of tasks including misinformation detection, hate speech detection, and social
context generation. Through our exhaustive evaluation on ten size-variants of
four open-source MLLMs, we have identified significant performance disparities,
highlighting the need for advancements in models’ social understanding
capabilities. Our analysis reveals that, in a zero-shot setting, various types
of MLLMs generally exhibit difficulties in handling social media tasks.
However, MLLMs demonstrate performance improvements post fine-tuning,
suggesting potential pathways for improvement. Our code and data are available
at https://github.com/claws-lab/MMSoc.git.
[COMMENTS]
In Proceedings of ACL 2024
[LINK]
http://arxiv.org/abs/2402.14154v2
[DATE]
2024-07-24 23:19:20+08:00
[CATEGORIES]
cs.CL
AMONGAGENTS: Evaluating Large Language Models in the Interactive Text-Based Social Deduction Game
[AUTHORS]
Yizhou Chi, Lingjun Mao, Zineng Tang
[COMMENTS]
Wordplay @ ACL 2024
[LINK]
http://arxiv.org/abs/2407.16521v2
[DATE]
2024-07-24 23:12:09+08:00
[CATEGORIES]
cs.CL
Description-Based Text Similarity
[AUTHORS]
Shauli Ravfogel, Valentina Pyatkin, Amir DN Cohen, Avshalom Manevich, Yoav Goldberg
[ABSTRACT]
Identifying texts with a given semantics is central for many information
seeking scenarios. Similarity search over vector embeddings appears to be
central to this ability, yet the similarity reflected in current text
embeddings is corpus-driven, and is inconsistent and sub-optimal for many use
cases. What, then, is a good notion of similarity for effective retrieval of
text?
We identify the need to search for texts based on abstract descriptions of
their content, and the corresponding notion of description-based
similarity. We demonstrate the inadequacy of current text embeddings and
propose an alternative model that significantly improves when used in standard
nearest neighbor search. The model is trained using positive and negative pairs
sourced through prompting an LLM, demonstrating how data from LLMs can be used
for creating new capabilities not immediately possible using the original
model.
[COMMENTS]
Accepted in COLM 2024
[LINK]
http://arxiv.org/abs/2305.12517v5
[DATE]
2024-07-24 23:10:41+08:00
[CATEGORIES]
cs.CL
cs.LG
Overview of AI-Debater 2023: The Challenges of Argument Generation Tasks
[AUTHORS]
Jiayu Lin, Guanrong Chen, Bojun Jin, Chenyang Li, Shutong Jia, Wancong Lin, Yang Sun, Yuhang He, Caihua Yang, Jianzhu Bao, Jipeng Wu, Wen Su, Jinglu Chen, Xinyi Li, Tianyu Chen, Mingjie Han, Shuaiwen Du, Zijian Wang, Jiyin Li, Fuzhong Suo, Hao Wang, Nuanchen Lin, Xuanjing Huang, Changjian Jiang, RuiFeng Xu, Long Zhang, Jiuxin Cao, Ting Jin, Zhongyu Wei
[ABSTRACT]
In this paper we present the results of the AI-Debater 2023 Challenge held by
the Chinese Conference on Affective Computing (CCAC 2023), and introduce the
related datasets. We organize two tracks to handle the argumentative generation
tasks in different scenarios, namely, Counter-Argument Generation (Track 1) and
Claim-based Argument Generation (Track 2). Each track is equipped with its
distinct dataset and baseline model respectively. In total, 32 competing teams
registered for the challenge, from which we received 11 successful submissions.
In this paper, we will present the results of the challenge and a summary of
the systems, highlighting commonalities and innovations among participating
systems. Datasets and baseline models of the AI-Debater 2023 Challenge have
already been released and can be accessed through the official website of the
challenge.
[LINK]
http://arxiv.org/abs/2407.14829v2
[DATE]
2024-07-24 23:09:29+08:00
[CATEGORIES]
cs.CL
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
[AUTHORS]
Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei
[ABSTRACT]
We introduce Q-Sparse, a simple yet effective approach to training
sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity
of activations in LLMs which can bring significant efficiency gains in
inference. This is achieved by applying top-K sparsification to the activations
and the straight-through-estimator to the training. We also introduce Block
Q-Sparse for batch training and inference. The key results from this work are:
(1) Q-Sparse can achieve results comparable to those of baseline LLMs while
being much more efficient at inference time; (2) we present an
inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is
effective in different settings, including training-from-scratch,
continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for
both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the
synergy of BitNet b1.58 and Q-Sparse (which can be equipped with MoE) provides the
cornerstone and a clear path to revolutionize the efficiency, including cost
and energy consumption, of future LLMs.
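The two ingredients named above, top-K sparsification of activations and a straight-through estimator, can be sketched as follows. This is a minimal illustration on plain Python lists under one common reading of the technique (the forward pass zeroes all but the K largest-magnitude entries; the backward pass passes the gradient through as if the masking were the identity). It is not the authors' implementation, and the function names are hypothetical.

```python
def top_k_mask(x, k):
    """Return a 0/1 mask selecting the k entries of x with the largest magnitude."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(x))]


def sparsify_forward(x, k):
    """Forward pass: zero out everything except the top-k activations."""
    mask = top_k_mask(x, k)
    return [xi * mi for xi, mi in zip(x, mask)]


def sparsify_backward_ste(grad_output):
    """Backward pass with a straight-through estimator.

    The masking is treated as the identity, so the incoming gradient is
    returned unchanged rather than being zeroed at masked positions.
    """
    return list(grad_output)
```

For example, `sparsify_forward([0.1, -2.0, 0.5, 3.0], 2)` keeps only the entries `-2.0` and `3.0`, while `sparsify_backward_ste` still delivers a gradient to all four positions.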
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2407.10969v3
[DATE]
2024-07-24 22:57:48+08:00
[CATEGORIES]
cs.CL
cs.LG
Large Language Models as Topological Structure Enhancers for Text-Attributed Graphs
[AUTHORS]
Shengyin Sun, Yuxiang Ren, Chen Ma, Xuecang Zhang
[ABSTRACT]
The latest advancements in large language models (LLMs) have revolutionized
the field of natural language processing (NLP). Inspired by the success of LLMs
in NLP tasks, some recent work has begun investigating the potential of
applying LLMs in graph learning tasks. However, most of the existing work
focuses on utilizing LLMs as powerful node feature augmenters, leaving
employing LLMs to enhance graph topological structures an understudied problem.
In this work, we explore how to leverage the information retrieval and text
generation capabilities of LLMs to refine/enhance the topological structure of
text-attributed graphs (TAGs) under the node classification setting. First, we
propose using LLMs to help remove unreliable edges and add reliable ones in the
TAG. Specifically, we first let the LLM output the semantic similarity between
node attributes through carefully designed prompts, and then perform edge deletion
and edge addition based on the similarity. Second, we propose using
pseudo-labels generated by the LLM to improve graph topology, that is, we
introduce the pseudo-label propagation as a regularization to guide the graph
neural network (GNN) in learning proper edge weights. Finally, we incorporate
the two aforementioned LLM-based methods for graph topological refinement into
the process of GNN training, and perform extensive experiments on four
real-world datasets. The experimental results demonstrate the effectiveness of
LLM-based graph topology refinement (achieving a 0.15%–2.47% performance gain
on public benchmarks).
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2311.14324v2
[DATE]
2024-07-24 21:34:14+08:00
[CATEGORIES]
cs.CL
cs.LG
Arrows of Time for Large Language Models
[AUTHORS]
Vassilis Papadopoulos, Jérémie Wenger, Clément Hongler
[ABSTRACT]
We study the probabilistic modeling performed by Autoregressive Large
Language Models (LLMs) through the angle of time directionality, addressing a
question first raised in (Shannon, 1951). For large enough models, we
empirically find a time asymmetry in their ability to learn natural language: a
difference in the average log-perplexity when trying to predict the next token
versus when trying to predict the previous one. This difference is at the same
time subtle and very consistent across various modalities (language, model
size, training time, …). Theoretically, this is surprising: from an
information-theoretic point of view, there should be no such difference. We
provide a theoretical framework to explain how such an asymmetry can appear
from sparsity and computational complexity considerations, and outline a number
of perspectives opened by our results.
[COMMENTS]
Corrected typos in Table 2. Added links. 12 figures, 20 pages
[LINK]
http://arxiv.org/abs/2401.17505v4
[DATE]
2024-07-24 20:57:56+08:00
[CATEGORIES]
cs.LG
cs.CL
Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis
[AUTHORS]
Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre
[ABSTRACT]
State-of-the-art LLMs often rely on scale with high computational costs,
which has sparked a research agenda to reduce parameter counts and costs
without significantly impacting performance. Our study focuses on
Transformer-based LLMs, specifically applying low-rank parametrization to the
computationally intensive feedforward networks (FFNs), which are less studied
than attention blocks. In contrast to previous works, (i) we explore low-rank
parametrization at scale, up to 1.3B parameters; (ii) within Transformer
language models rather than convolutional architectures; and (iii) training
from scratch. Experiments on the large RefinedWeb dataset show
that low-rank parametrization is both efficient (e.g., 2.6$\times$ FFN speed-up
with 32% of the parameters) and effective during training. Interestingly, these
structured FFNs exhibit steeper scaling curves than the original models.
Motivated by this finding, we develop wide and structured networks that
surpass current medium-sized and large-sized Transformers in perplexity
and throughput performance. Our code is available at
https://github.com/CLAIRE-Labo/StructuredFFN/tree/main.
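As a rough illustration of why a low-rank parametrization shrinks a feedforward layer, the sketch below replaces a dense weight matrix of shape d_out x d_in with a product U V, where U is d_out x r and V is r x d_in, and compares parameter counts. The dimensions in the usage note are hypothetical and are not the paper's configuration; the helper names are assumptions as well.

```python
def dense_params(d_in, d_out):
    """Parameter count of a dense d_out x d_in weight matrix (no bias)."""
    return d_in * d_out


def low_rank_params(d_in, d_out, rank):
    """Parameter count of the factorization U (d_out x r) times V (r x d_in)."""
    return rank * (d_in + d_out)


def param_fraction(d_in, d_out, rank):
    """Fraction of the dense parameter count used by the low-rank form."""
    return low_rank_params(d_in, d_out, rank) / dense_params(d_in, d_out)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def low_rank_apply(U, V, x):
    """Compute U (V x) without ever forming the dense product U V."""
    return matvec(U, matvec(V, x))
```

With the hypothetical sizes `d_in=1024`, `d_out=4096`, and `rank=256`, `param_fraction` gives 0.3125, that is, roughly a third of the dense parameter count.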
[COMMENTS]
Accepted by ICML 2024 Next Generation of Sequence Modeling
Architectures Workshop. Short version of arXiv:2406.16450
[LINK]
http://arxiv.org/abs/2407.09835v2
[DATE]
2024-07-24 20:43:33+08:00
[CATEGORIES]
cs.CL
Improving ICD coding using Chapter based Named Entities and Attentional Models
[AUTHORS]
Abhijith R. Beeravolu, Mirjam Jonkman, Sami Azam, Friso De Boer
[ABSTRACT]
Recent advancements in natural language processing (NLP) have led to
automation in various domains. However, clinical NLP often relies on benchmark
datasets that may not reflect real-world scenarios accurately. Automatic ICD
coding, a vital NLP task, typically uses outdated and imbalanced datasets like
MIMIC-III, with existing methods yielding micro-averaged F1 scores between 0.4
and 0.7 due to many false positives. Our research introduces an enhanced
approach to ICD coding that improves F1 scores by using chapter-based named
entities and attentional models. This method categorizes discharge summaries
into ICD-9 Chapters and develops attentional models with chapter-specific data,
eliminating the need to consider external data for code identification. For
categorization, we use Chapter-IV to de-bias and influence key entities and
weights without neural networks, creating accurate thresholds and providing
interpretability for human validation. Post-validation, we develop attentional
models for three frequent and three non-frequent codes from Chapter-IV using
Bidirectional-Gated Recurrent Units (GRUs) with Attention and Transformer with
Multi-head Attention architectures. The average Micro-F1 scores of 0.79 and
0.81 from these models demonstrate significant performance improvements in ICD
coding.
[COMMENTS]
10 Pages
[LINK]
http://arxiv.org/abs/2407.17230v1
[DATE]
2024-07-24 20:34:23+08:00
[CATEGORIES]
cs.CL
Tree-Planner: Efficient Close-loop Task Planning with Large Language Models
[AUTHORS]
Mengkang Hu, Yao Mu, Xinmiao Yu, Mingyu Ding, Shiguang Wu, Wenqi Shao, Qiguang Chen, Bin Wang, Yu Qiao, Ping Luo
[ABSTRACT]
This paper studies close-loop task planning, which refers to the process of
generating a sequence of skills (a plan) to accomplish a specific goal while
adapting the plan based on real-time observations. Recently, prompting Large
Language Models (LLMs) to generate actions iteratively has become a prevalent
paradigm due to its superior performance and user-friendliness. However, this
paradigm is plagued by two inefficiencies: high token consumption and redundant
error correction, both of which hinder its scalability for large-scale testing
and applications. To address these issues, we propose Tree-Planner, which
reframes task planning with LLMs into three distinct phases: plan sampling,
action tree construction, and grounded deciding. Tree-Planner starts by using
an LLM to sample a set of potential plans before execution, followed by the
aggregation of them to form an action tree. Finally, the LLM performs a
top-down decision-making process on the tree, taking into account real-time
environmental information. Experiments show that Tree-Planner achieves
state-of-the-art performance while maintaining high efficiency. By decomposing
LLM queries into a single plan-sampling call and multiple grounded-deciding
calls, a considerable part of the prompt is less likely to be repeatedly
consumed. As a result, token consumption is reduced by 92.2% compared to the
previously best-performing model. Additionally, by enabling backtracking on the
action tree as needed, the correction process becomes more flexible, leading to
a 40.5% decrease in error corrections.
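The aggregation step described above, merging sampled plans into an action tree, can be pictured as building a prefix tree in which plans that share their first actions share a path. The sketch below is only an assumed illustration using nested dictionaries; the action strings are made up and the paper's actual data structure may differ.

```python
def build_action_tree(plans):
    """Merge plans (lists of action strings) into a nested-dict prefix tree."""
    root = {}
    for plan in plans:
        node = root
        for action in plan:
            # Reuse the existing child if another plan already took this action.
            node = node.setdefault(action, {})
    return root


def count_nodes(tree):
    """Count the action nodes in the tree (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in tree.values())


# Hypothetical sampled plans that share a common prefix.
plans = [
    ["walk to kitchen", "open fridge", "grab milk"],
    ["walk to kitchen", "open fridge", "grab juice"],
    ["walk to kitchen", "open cabinet"],
]
tree = build_action_tree(plans)
```

Here the three plans contain eight actions in total, but the merged tree has only five nodes, because the shared prefixes are stored once. A decision procedure can then walk the tree from the root and move to a sibling branch when an action fails.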
[COMMENTS]
Published in ICLR 2024
[LINK]
http://arxiv.org/abs/2310.08582v2
[DATE]
2024-07-24 20:25:17+08:00
[CATEGORIES]
cs.CL
cs.LG
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
[AUTHORS]
Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, Can Huang
[ABSTRACT]
Recently, many studies have demonstrated that exclusively incorporating
OCR-derived text and spatial layouts with large language models (LLMs) can be
highly effective for document understanding tasks. However, existing methods
that integrate spatial layouts with text have limitations, such as producing
overly long text sequences or failing to fully leverage the autoregressive
traits of LLMs. In this work, we introduce Interleaving Layout and Text in a
Large Language Model (LayTextLLM) for document understanding. In particular,
LayTextLLM projects each bounding box to a single embedding and interleaves it
with text, efficiently avoiding long sequence issues while leveraging
autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction
of layout and textual data but also shows enhanced performance in Key
Information Extraction (KIE) and Visual Question Answering (VQA). Comprehensive
benchmark evaluations reveal significant improvements, with a 27.2% increase on
KIE tasks and 12.0% on VQA tasks compared to previous state-of-the-art document
understanding MLLMs, as well as a 15.1% improvement over other SOTA OCR-based
LLMs on KIE tasks.
[LINK]
http://arxiv.org/abs/2407.01976v2
[DATE]
2024-07-24 19:45:48+08:00
[CATEGORIES]
cs.CL
A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives
[AUTHORS]
Jan Lehečka, Josef V. Psutka, Luboš Šmídl, Pavel Ircing, Josef Psutka
[ABSTRACT]
In this paper, we compare monolingual Wav2Vec 2.0 models with various
multilingual models to see whether we could improve speech recognition
performance on a unique oral history archive containing a lot of mixed-language
sentences. Our main goal is to push forward research on this unique dataset,
which is an extremely valuable part of our cultural heritage. Our results
suggest that monolingual speech recognition models are, in most cases, superior
to multilingual models, even when processing the oral history archive full of
mixed-language sentences from non-native speakers. We also performed the same
experiments on the public CommonVoice dataset to verify our results. We are
contributing to the research community by releasing our pre-trained models to
the public.
[COMMENTS]
Accepted to INTERSPEECH2024
[LINK]
http://arxiv.org/abs/2407.17160v1
[DATE]
2024-07-24 19:03:47+08:00
[CATEGORIES]
cs.CL
Efficient Tuning and Inference for Large Language Models on Textual Graphs
[AUTHORS]
Yun Zhu, Yaoke Wang, Haizhou Shi, Siliang Tang
[ABSTRACT]
Rich textual and topological information of textual graphs needs to be modeled
in real-world applications such as webpages, e-commerce, and academic articles.
Practitioners have long followed the path of adopting a shallow text
encoder and a subsequent graph neural network (GNN) to solve this problem. In
light of recent advancements in large language models (LLMs), it is apparent
that integrating LLMs for enhanced textual encoding can substantially improve
the performance of textual graphs. Nevertheless, the efficiency of these
methods poses a significant challenge. In this paper, we propose ENGINE, a
parameter- and memory-efficient fine-tuning method for textual graphs with an
LLM encoder. The key insight is to combine the LLMs and GNNs through a tunable
side structure, which significantly reduces the training complexity without
impairing the joint model’s capacity. Extensive experiments on textual graphs
demonstrate our method’s effectiveness by achieving the best model performance
while incurring the lowest training cost compared to previous methods.
Moreover, we introduce two variants with caching and dynamic early exit to
further enhance training and inference speed. Specifically, caching accelerates
ENGINE’s training by 12x, and dynamic early exit achieves up to 5x faster
inference with a negligible performance drop (at most a 1.17% relative drop
across 7 datasets). Our code is available at:
https://github.com/ZhuYun97/ENGINE
[COMMENTS]
Accepted by IJCAI2024
[LINK]
http://arxiv.org/abs/2401.15569v2
[DATE]
2024-07-24 16:56:11+08:00
[CATEGORIES]
cs.CL
Learning a Patent-Informed Biomedical Knowledge Graph Reveals Technological Potential of Drug Repositioning Candidates
[AUTHORS]
Yongseung Jegal, Jaewoong Choi, Jiho Lee, Ki-Su Park, Seyoung Lee, Janghyeok Yoon
[ABSTRACT]
Drug repositioning, a promising strategy for discovering new therapeutic uses
for existing drugs, has been increasingly explored in the computational science
literature using biomedical databases. However, the technological potential of
drug repositioning candidates has often been overlooked. This study presents a
novel protocol to comprehensively analyse various sources such as
pharmaceutical patents and biomedical databases, and identify drug
repositioning candidates with both technological potential and scientific
evidence. To this end, first, we constructed a scientific biomedical knowledge
graph (s-BKG) comprising relationships between drugs, diseases, and genes
derived from biomedical databases. Our protocol involves identifying drugs that
exhibit limited association with the target disease but are closely located in
the s-BKG, as potential drug candidates. We constructed a patent-informed
biomedical knowledge graph (p-BKG) by adding pharmaceutical patent information.
Finally, we developed a graph embedding protocol to ascertain the structure of
the p-BKG, thereby calculating the relevance scores of those candidates with
target disease-related patents to evaluate their technological potential. Our
case study on Alzheimer’s disease demonstrates its efficacy and feasibility,
while the quantitative outcomes and systematic methods are expected to bridge
the gap between computational discoveries and successful market applications in
drug repositioning research.
[COMMENTS]
We are sorry to withdraw this paper. We found some critical errors in
the introduction and results sections. Specifically, the first author
wrongly inserted citations on background works and made mistakes in the
graph embedding methods, so the relevant results were wrongly calculated.
We are therefore revising the paper and withdrawing the current version.
Thank you
[LINK]
http://arxiv.org/abs/2309.03227v2
[DATE]
2024-07-24 16:31:21+08:00
[CATEGORIES]
cs.CL
cs.LG
A Survey Forest Diagram: Gain a Divergent Insight View on a Specific Research Topic
[AUTHORS]
Jinghong Li, Wen Gu, Koichi Ota, Shinobu Hasegawa
[ABSTRACT]
With the exponential growth in the number of papers and the trend of AI
research, the use of Generative AI for information retrieval and
question-answering has become popular for conducting research surveys. However,
novice researchers unfamiliar with a particular field may not significantly
improve their efficiency in interacting with Generative AI because they have
not developed divergent thinking in that field. This study aims to develop an
in-depth Survey Forest Diagram that guides novice researchers in divergent
thinking about the research topic by indicating the citation clues among
multiple papers, to help expand the survey perspective for novice researchers.
[COMMENTS]
This paper will be submitted to IEEE SMC 2024
[LINK]
http://arxiv.org/abs/2407.17081v1
[DATE]
2024-07-24 16:17:37+08:00
[CATEGORIES]
cs.CL
Building Intelligence Identification System via Large Language Model Watermarking: A Survey and Beyond
[AUTHORS]
Xuhong Wang, Haoyu Jiang, Yi Yu, Jingru Yu, Yilun Lin, Ping Yi, Yingchun Wang, Yu Qiao, Li Li, Fei-Yue Wang
[ABSTRACT]
Large Language Models (LLMs) are increasingly integrated into diverse
industries, posing substantial security risks due to unauthorized replication
and misuse. To mitigate these concerns, robust identification mechanisms are
widely acknowledged as an effective strategy. Identification systems for LLMs
now rely heavily on watermarking technology to manage and protect intellectual
property and ensure data security. However, previous studies have primarily
concentrated on the basic principles of algorithms and lacked a comprehensive
analysis of watermarking theory and practice from the perspective of
intelligent identification. To bridge this gap, firstly, we explore how a
robust identity recognition system can be effectively implemented and managed
within LLMs by various participants using watermarking technology. Secondly, we
propose a mathematical framework based on mutual information theory, which
systematizes the identification process to achieve more precise and customized
watermarking. Additionally, we present a comprehensive evaluation of
performance metrics for LLM watermarking, reflecting participant preferences
and advancing discussions on its identification applications. Lastly, we
outline the existing challenges in current watermarking technologies and
theoretical frameworks, and provide directional guidance to address these
challenges. Our systematic classification and detailed exposition aim to
enhance the comparison and evaluation of various methods, fostering further
research and development toward a transparent, secure, and equitable LLM
ecosystem.
[COMMENTS]
59 pages, 7 figures
[LINK]
http://arxiv.org/abs/2407.11100v3
[DATE]
2024-07-24 16:10:29+08:00
[CATEGORIES]
cs.CL
Artificial Agency and Large Language Models
[AUTHORS]
Maud van Lier, Gorka Muñoz-Gil
[ABSTRACT]
The arrival of Large Language Models (LLMs) has stirred up philosophical
debates about the possibility of realizing agency in an artificial manner. In
this work we contribute to the debate by presenting a theoretical model that
can be used as a threshold conception for artificial agents. The model defines
agents as systems whose actions and goals are always influenced by a dynamic
framework of factors that consists of the agent’s accessible history, its
adaptive repertoire and its external environment. This framework, in turn, is
influenced by the actions that the agent takes and the goals that it forms. We
show with the help of the model that state-of-the-art LLMs are not agents yet,
but that there are elements to them that suggest a way forward. The paper
argues that a combination of the agent architecture presented in Park et al.
(2023) together with the use of modules like the Coscientist in Boiko et al.
(2023) could potentially be a way to realize agency in an artificial manner. We
end the paper by reflecting on the obstacles one might face in building such an
artificial agent and by presenting possible directions for future research.
[COMMENTS]
Accepted for publication in journal Intellectica, special issue
“Philosophies of AI: thinking and writing with LLMs” (Intellectica, issue 81)
[LINK]
http://arxiv.org/abs/2407.16190v2
[DATE]
2024-07-24 15:32:25+08:00
[CATEGORIES]
cs.CL
RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models
[AUTHORS]
Jianhao Yan, Yun Luo, Yue Zhang
[ABSTRACT]
The application scope of large language models (LLMs) is increasingly
expanding. In practical use, users might provide feedback based on the model’s
output, hoping for a responsive model that can complete responses according to
their feedback. Whether the model can appropriately respond to users’ refuting
feedback and consistently follow through with execution has not been thoroughly
analyzed. In light of this, this paper proposes a comprehensive benchmark,
RefuteBench, covering tasks such as question answering, machine translation,
and email writing. The evaluation aims to assess whether models can positively
accept feedback in the form of refuting instructions and whether they can
consistently adhere to user demands throughout the conversation. We conduct
evaluations on numerous LLMs and find that LLMs are stubborn, i.e., exhibit an
inclination toward their internal knowledge, often failing to comply with user
feedback. Additionally, as the length of the conversation increases, models
gradually forget the user’s stated feedback and roll back to their own
responses. We further propose recall-and-repeat prompts as a simple and
effective way to enhance the model’s responsiveness to feedback.
[COMMENTS]
ACL 2024 final version
[LINK]
http://arxiv.org/abs/2402.13463v4
[DATE]
2024-07-24 14:50:18+08:00
[CATEGORIES]
cs.CL
From Internal Conflict to Contextual Adaptation of Language Models
[AUTHORS]
Sara Vera Marjanović, Haeun Yu, Pepa Atanasova, Maria Maistro, Christina Lioma, Isabelle Augenstein
[ABSTRACT]
Knowledge-intensive language understanding tasks require Language Models
(LMs) to integrate relevant context, mitigating their inherent weaknesses, such
as incomplete or outdated knowledge. Nevertheless, studies indicate that LMs
often ignore the provided context as it can conflict with the LM’s pre-existing
memory learned during pre-training. Moreover, conflicting knowledge can already
be present in the LM’s parameters, termed intra-memory conflict. Existing works
have studied the two types of knowledge conflicts only in isolation. We
conjecture that the (degree of) intra-memory conflicts can in turn affect the LM’s
handling of context-memory conflicts. To study this, we introduce the DYNAMICQA
dataset, which includes facts with a temporal dynamic nature where a fact can
change with a varying time frequency and disputable dynamic facts, which can
change depending on the viewpoint. DYNAMICQA is the first to include real-world
knowledge conflicts and provide context to study the link between the different
types of knowledge conflicts. With the proposed dataset, we assess the use of
uncertainty for measuring the intra-memory conflict and introduce a novel
Coherent Persuasion (CP) score to evaluate the context’s ability to sway the LM’s
semantic output. Our extensive experiments reveal that static facts, which are
unlikely to change, are more easily updated with additional context, relative
to temporal and disputable facts.
[COMMENTS]
22 pages, 15 figures
[LINK]
http://arxiv.org/abs/2407.17023v1
[DATE]
2024-07-24 14:06:07+08:00
[CATEGORIES]
cs.CL
Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism
[AUTHORS]
Anhao Zhao, Fanghua Ye, Jinlan Fu, Xiaoyu Shen
[ABSTRACT]
Large language models (LLMs) exhibit remarkable in-context learning (ICL)
capabilities. However, the underlying working mechanism of ICL remains poorly
understood. Recent research presents two conflicting views on ICL: One
attributes it to LLMs’ inherent ability of task recognition, deeming label
correctness and shot numbers of demonstrations as not crucial; the other
emphasizes the impact of similar examples in the demonstrations, stressing the
need for label correctness and more shots. In this work, we provide a
Two-Dimensional Coordinate System that unifies both views into a systematic
framework. The framework explains the behavior of ICL through two orthogonal
variables: whether LLMs can recognize the task and whether similar examples are
presented in the demonstrations. We propose the peak inverse rank metric to
detect the task recognition ability of LLMs and study LLMs’ reactions to
different definitions of similarity. Based on these, we conduct extensive
experiments to elucidate how ICL functions across each quadrant on multiple
representative classification tasks. Finally, we extend our analyses to
generation tasks, showing that our coordinate system can also be used to
interpret ICL for generation tasks effectively.
[LINK]
http://arxiv.org/abs/2407.17011v1
[DATE]
2024-07-24 13:26:52+08:00
[CATEGORIES]
cs.CL
A Voter-Based Stochastic Rejection-Method Framework for Asymptotically Safe Language Model Outputs
[AUTHORS]
Jake R. Watts, Joel Sokol
[ABSTRACT]
This paper proposes a new method for preventing unsafe or otherwise
low-quality large language model (LLM) outputs by leveraging the stochasticity of
LLMs. We propose a system whereby LLM checkers vote on the acceptability of a
generated output, regenerating it if a threshold of disapproval is reached,
until sufficient checkers approve. We further propose estimators for cost and
failure rate, and based on those estimators and experimental data tailored to
the application, we propose an algorithm that achieves a desired failure rate
at the least possible cost. We demonstrate that, under these models, failure
rate decreases exponentially as a function of cost when voter count and
threshold are chosen according to the algorithm, and that the models reasonably
estimate the actual performance of such a system in action, even with limited
data.
[COMMENTS]
7 pages, 2 figures
[LINK]
http://arxiv.org/abs/2407.16994v1
[DATE]
2024-07-24 12:27:55+08:00
[CATEGORIES]
cs.CL
cs.LG
A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks
[AUTHORS]
Shubham Vatsal, Harsh Dubey
[ABSTRACT]
Large language models (LLMs) have shown remarkable performance on many
different Natural Language Processing (NLP) tasks. Prompt engineering plays a
key role in adding more to the already existing abilities of LLMs to achieve
significant performance gains on various NLP tasks. Prompt engineering requires
composing natural language instructions called prompts to elicit knowledge from
LLMs in a structured way. Unlike previous state-of-the-art (SoTA) models,
prompt engineering does not require extensive parameter re-training or
fine-tuning based on the given NLP task and thus solely operates on the
embedded knowledge of LLMs. Additionally, LLM enthusiasts can intelligently
extract LLMs’ knowledge through a basic natural language conversational
exchange or prompt engineering, allowing more and more people, even those without a deep
mathematical machine learning background, to experiment with LLMs. With prompt
engineering gaining popularity in the last two years, researchers have come up
with numerous engineering techniques around designing prompts to improve
accuracy of information extraction from the LLMs. In this paper, we summarize
different prompting techniques and club them together based on different NLP
tasks that they have been used for. We further granularly highlight the
performance of these prompting strategies on various datasets belonging to that
NLP task, talk about the corresponding LLMs used, present a taxonomy diagram
and discuss the possible SoTA for specific datasets. In total, we read and
present a survey of 44 research papers which talk about 39 different prompting
methods on 29 different NLP tasks, most of which have been published in
the last two years.
[LINK]
http://arxiv.org/abs/2407.12994v2
[DATE]
2024-07-24 11:53:41+08:00
[CATEGORIES]
cs.CL
Generative artificial intelligence in dentistry: Current approaches and future challenges
[AUTHORS]
Fabián Villena, Claudia Véliz, Rosario García-Huidobro, Sebastián Aguayo
[ABSTRACT]
Artificial intelligence (AI) has become a commodity for people because of the
advent of generative AI (GenAI) models that bridge the usability gap of AI by
providing a natural language interface to interact with complex models. These
GenAI models range from text generation, such as two-way chat systems, to the
generation of images or video from textual descriptions input by a user. These
advancements in AI have impacted Dentistry in multiple aspects. In dental
education, the student now has the opportunity to solve a plethora of questions
by only prompting a GenAI model and have the answer in a matter of seconds.
GenAI models can help us deliver better patient healthcare by helping
practitioners gather knowledge quickly and efficiently. Finally, GenAI can also
be used in dental research, where the applications range from new drug
discovery to assistance in academic writing. In this review, we first define
GenAI models and describe their multiple generation modalities; then, we
explain and discuss their current and potential applications in Dentistry; and
finally, we describe the challenges these new technologies impose in our area.
[LINK]
http://arxiv.org/abs/2407.17532v1
[DATE]
2024-07-24 11:33:47+08:00
[CATEGORIES]
cs.CL
Towards Aligning Language Models with Textual Feedback
[AUTHORS]
Saüc Abadal Lloret, Shehzaad Dhuliawala, Keerthiram Murugesan, Mrinmaya Sachan
[ABSTRACT]
We present ALT (ALignment with Textual feedback), an approach that aligns
language models with user preferences expressed in text. We argue that text
offers greater expressiveness, enabling users to provide richer feedback than
simple comparative preferences and this richer feedback can lead to more
efficient and effective alignment. ALT aligns the model by conditioning its
generation on the textual feedback. Our method relies solely on language
modeling techniques and requires minimal hyper-parameter tuning, though it
still presents the main benefits of RL-based alignment algorithms and can
effectively learn from textual feedback. We explore the efficacy and efficiency
of textual feedback across different tasks such as toxicity reduction,
summarization, and dialog response generation. We find that ALT outperforms PPO
for the task of toxicity reduction while being able to match its performance on
summarization with only 20% of the samples. We also explore how ALT can be used
with feedback provided by an existing LLM where we explore an LLM providing
constrained and unconstrained textual feedback. We also outline future
directions to align models with natural language feedback.
[LINK]
http://arxiv.org/abs/2407.16970v1
[DATE]
2024-07-24 11:32:05+08:00
[CATEGORIES]
cs.CL
cs.LG
LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model
[AUTHORS]
Yulin Luo, Ruichuan An, Bocheng Zou, Yiming Tang, Jiaming Liu, Shanghang Zhang
[ABSTRACT]
The distribution of subpopulations is an important property hidden within a
dataset. Uncovering and analyzing the subpopulation distribution within
datasets provides a comprehensive understanding of the datasets, standing as a
powerful tool beneficial to various downstream tasks, including Dataset
Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite
its importance, there has been no work that systematically explores the
subpopulation distribution of datasets to our knowledge. To address the
limitation and solve all the mentioned tasks in a unified way, we introduce a
novel concept of subpopulation structures to represent, analyze, and utilize
subpopulation distributions within datasets. To characterize the structures in
an interpretable manner, we propose the Subpopulation Structure Discovery with
Large Language Models (SSD-LLM) framework, which employs world knowledge and
instruction-following capabilities of Large Language Models (LLMs) to
linguistically analyze informative image captions and summarize the structures.
Furthermore, we propose complete workflows to address downstream tasks, named
Task-specific Tuning, showcasing the application of the discovered structure to
a spectrum of subpopulation-related tasks, including dataset subpopulation
organization, subpopulation shift, and slice discovery.
[COMMENTS]
ECCV24 Camera Ready
[LINK]
http://arxiv.org/abs/2405.02363v2
[DATE]
2024-07-24 10:36:07+08:00
[CATEGORIES]
cs.CL
Early screening of potential breakthrough technologies with enhanced interpretability: A patent-specific hierarchical attention network model
[AUTHORS]
Jaewoong Choi, Janghyeok Yoon, Changyong Lee
[ABSTRACT]
Despite the usefulness of machine learning approaches for the early screening
of potential breakthrough technologies, their practicality is often hindered by
opaque models. To address this, we propose an interpretable machine learning
approach to predicting future citation counts from patent texts using a
patent-specific hierarchical attention network (PatentHAN) model. Central to
this approach are (1) a patent-specific pre-trained language model, capturing
the meanings of technical words in patent claims, (2) a hierarchical network
structure, enabling detailed analysis at the claim level, and (3) a claim-wise
self-attention mechanism, revealing pivotal claims during the screening
process. A case study of 35,376 pharmaceutical patents demonstrates the
effectiveness of our approach in early screening of potential breakthrough
technologies while ensuring interpretability. Furthermore, we conduct
additional analyses using different language models and claim types to examine
the robustness of the approach. It is expected that the proposed approach will
enhance expert-machine collaboration in identifying breakthrough technologies,
providing new insight derived from text mining into technological value.
[LINK]
http://arxiv.org/abs/2407.16939v1
[DATE]
2024-07-24 10:17:10+08:00
[CATEGORIES]
cs.CL
CHATATC: Large Language Model-Driven Conversational Agents for Supporting Strategic Air Traffic Flow Management
[AUTHORS]
Sinan Abdulhak, Wayne Hubbard, Karthik Gopalakrishnan, Max Z. Li
[ABSTRACT]
Generative artificial intelligence (AI) and large language models (LLMs) have
gained rapid popularity through publicly available tools such as ChatGPT. The
adoption of LLMs for personal and professional use is fueled by the natural
interactions between human users and computer applications such as ChatGPT,
along with powerful summarization and text generation capabilities. Given the
widespread use of such generative AI tools, in this work we investigate how
these tools can be deployed in a non-safety critical, strategic traffic flow
management setting. Specifically, we train an LLM, CHATATC, based on a large
historical data set of Ground Delay Program (GDP) issuances, spanning 2000-2023
and consisting of over 80,000 GDP implementations, revisions, and
cancellations. We test the query and response capabilities of CHATATC,
documenting successes (e.g., providing correct GDP rates, durations, and
reason) and shortcomings (e.g., superlative questions). We also detail the
design of a graphical user interface for future users to interact and
collaborate with the CHATATC conversational agent.
[COMMENTS]
8 pages, 5 figures; minor revisions to address reviewer feedback for
final submission to the 11th International Conference on Research in Air
Transportation (ICRAT)
[LINK]
http://arxiv.org/abs/2402.14850v2
[DATE]
2024-07-24 10:11:47+08:00
[CATEGORIES]
cs.CL
Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
[AUTHORS]
Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe
[ABSTRACT]
Convolutions have become essential in state-of-the-art end-to-end Automatic
Speech Recognition (ASR) systems due to their efficient modelling of local
context. Notably, their use in Conformers has led to superior performance
compared to vanilla Transformer-based ASR systems. While components other than
the convolution module in the Conformer have been reexamined, altering the
convolution module itself has been far less explored. Towards this, we
introduce Multi-Convformer that uses multiple convolution kernels within the
convolution module of the Conformer in conjunction with gating. This helps in
improved modeling of local dependencies at varying granularities. Our model
rivals existing Conformer variants such as CgMLP and E-Branchformer in
performance, while being more parameter efficient. We empirically compare our
approach with Conformer and its variants across four different datasets and
three different modelling paradigms and show up to 8% relative word error
rate (WER) improvements.
[COMMENTS]
Accepted to INTERSPEECH 2024
[LINK]
http://arxiv.org/abs/2407.03718v2
[DATE]
2024-07-24 10:03:47+08:00
[CATEGORIES]
cs.CL
cs.LG
Two-stage Generative Question Answering on Temporal Knowledge Graph Using Large Language Models
[AUTHORS]
Yifu Gao, Linbo Qiao, Zhigang Kan, Zhihua Wen, Yongquan He, Dongsheng Li
[ABSTRACT]
Temporal knowledge graph question answering (TKGQA) is a significantly
challenging task, due to the temporal constraints hidden in questions and the
answers sought from dynamic structured knowledge. Although large language
models (LLMs) have made considerable progress in their reasoning ability over
structured data, their application to the TKGQA task is a relatively unexplored
area. This paper first proposes a novel generative temporal knowledge graph
question answering framework, GenTKGQA, which guides LLMs to answer temporal
questions through two phases: Subgraph Retrieval and Answer Generation. First,
we exploit LLM’s intrinsic knowledge to mine temporal constraints and
structural links in the questions without extra training, thus narrowing down
the subgraph search space in both temporal and structural dimensions. Next, we
design virtual knowledge indicators to fuse the graph neural network signals of
the subgraph and the text representations of the LLM in a non-shallow way,
which helps the open-source LLM deeply understand the temporal order and
structural dependencies among the retrieved facts through instruction tuning.
Experimental results on two widely used datasets demonstrate the superiority of
our model.
[COMMENTS]
Accepted by ACL (Findings) 2024
[LINK]
http://arxiv.org/abs/2402.16568v2
[DATE]
2024-07-24 09:44:05+08:00
[CATEGORIES]
cs.CL
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation
[AUTHORS]
Jian Ma, Chen Chen, Qingsong Xie, Haonan Lu
[ABSTRACT]
Text-to-image diffusion models are well-known for their ability to generate
realistic images based on textual prompts. However, the existing works have
predominantly focused on English, lacking support for non-English text-to-image
models. The most commonly used translation methods cannot solve the generation
problem related to language culture, while training from scratch on a specific
language dataset is prohibitively expensive. In this paper, we are inspired to
propose a simple plug-and-play language transfer method based on knowledge
distillation. All we need to do is train a lightweight MLP-like
parameter-efficient adapter (PEA) with only 6M parameters under teacher
knowledge distillation along with a small parallel data corpus. We are
surprised to find that freezing the parameters of UNet can still achieve
remarkable performance on the language-specific prompt evaluation set,
demonstrating that PEA can stimulate the potential generation ability of the
original UNet. Additionally, it closely approaches the performance of the
English text-to-image model on a general prompt evaluation set. Furthermore,
our adapter can be used as a plugin to achieve significant results in
downstream tasks in cross-lingual text-to-image generation. Code will be
available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion
[COMMENTS]
ECCV 2024
[LINK]
http://arxiv.org/abs/2311.17086v2
[DATE]
2024-07-24 09:41:01+08:00
[CATEGORIES]
cs.CL
Train-Attention: Meta-Learning Where to Focus in Continual Knowledge Learning
[AUTHORS]
Yeongbin Seo, Dongha Lee, Jinyoung Yeo
[ABSTRACT]
Previous studies on continual knowledge learning (CKL) in large language
models (LLMs) have predominantly focused on approaches such as regularization,
architectural modifications, and rehearsal techniques to mitigate catastrophic
forgetting. However, these methods naively inherit the inefficiencies of
standard training procedures, indiscriminately applying uniform weight across
all tokens, which can lead to unnecessary parameter updates and increased
forgetting. To address these shortcomings, we propose a novel CKL approach
termed Train-Attention-Augmented Language Model (TAALM), which enhances
learning efficiency by dynamically predicting and applying weights to tokens
based on their usefulness. This method employs a meta-learning framework that
optimizes token importance predictions, facilitating targeted knowledge updates
and minimizing forgetting. Also, we observe that existing benchmarks do not
clearly exhibit the trade-off between learning and retaining; therefore, we
propose a new benchmark, LAMA-ckl, to address this issue. Through
experiments conducted on both newly introduced and established CKL benchmarks,
TAALM achieves state-of-the-art performance over the baselines, and also
shows synergistic compatibility when integrated with previous CKL approaches.
[LINK]
http://arxiv.org/abs/2407.16920v1
[DATE]
2024-07-24 09:04:34+08:00
[CATEGORIES]
cs.CL
Tailoring Vaccine Messaging with Common-Ground Opinions
[AUTHORS]
Rickard Stureborg, Sanxing Chen, Ruoyu Xie, Aayushi Patel, Christopher Li, Chloe Qinyu Zhu, Tingnan Hu, Jun Yang, Bhuwan Dhingra
[ABSTRACT]
One way to personalize chatbot interactions is by establishing common ground
with the intended reader. A domain where establishing mutual understanding
could be particularly impactful is vaccine concerns and misinformation. Vaccine
interventions are forms of messaging which aim to answer concerns expressed
about vaccination. Tailoring responses in this domain is difficult, since
opinions often have seemingly little ideological overlap. We define the task of
tailoring vaccine interventions to a Common-Ground Opinion (CGO). Tailoring
responses to a CGO involves meaningfully improving the answer by relating it to
an opinion or belief the reader holds. In this paper we introduce TAILOR-CGO, a
dataset for evaluating how well responses are tailored to provided CGOs. We
benchmark several major LLMs on this task, finding that GPT-4-Turbo performs
significantly better than others. We also build automatic evaluation metrics,
including an efficient and accurate BERT model that outperforms finetuned LLMs,
investigate how to successfully tailor vaccine messaging to CGOs, and provide
actionable recommendations from this investigation.
Code and model weights: https://github.com/rickardstureborg/tailor-cgo
Dataset: https://huggingface.co/datasets/DukeNLP/tailor-cgo
[COMMENTS]
NAACL Findings 2024
[LINK]
http://arxiv.org/abs/2405.10861v2
[DATE]
2024-07-24 08:10:04+08:00
[CATEGORIES]
cs.CL
Generation Constraint Scaling Can Mitigate Hallucination
[AUTHORS]
Georgios Kollias, Payel Das, Subhajit Chaudhury
[ABSTRACT]
Addressing the issue of hallucinations in large language models (LLMs) is a
critical challenge. As the cognitive mechanisms of hallucination have been
related to memory, here we explore hallucination in LLMs that are equipped with
explicit memory mechanisms. We empirically demonstrate that by simply scaling
the readout vector that constrains generation in a memory-augmented LLM
decoder, hallucination mitigation can be achieved in a training-free manner.
Our method is geometry-inspired and outperforms a state-of-the-art LLM editing
method on the task of generation of Wikipedia-like biography entries both in
terms of generation quality and runtime complexity.
[COMMENTS]
7 pages; accepted at ICML 2024 Workshop on Large Language Models and
Cognition
[LINK]
http://arxiv.org/abs/2407.16908v1
[DATE]
2024-07-24 07:58:19+08:00
[CATEGORIES]
cs.CL
cs.LG
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
[AUTHORS]
Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, Bo Li
[ABSTRACT]
Recent advancements in Large Language Models (LLMs) have showcased remarkable
capabilities across various tasks in different domains. However, the emergence
of biases and the potential for generating harmful content in LLMs,
particularly under malicious inputs, pose significant challenges. Current
mitigation strategies, while effective, are not resilient under adversarial
attacks. This paper introduces Resilient Guardrails for Large Language Models
(RigorLLM), a novel framework designed to efficiently and effectively moderate
harmful and unsafe inputs and outputs for LLMs. By employing a multi-faceted
approach that includes energy-based training data augmentation through Langevin
dynamics, optimizing a safe suffix for inputs via minimax optimization, and
integrating a fusion-based model combining robust KNN with LLMs based on our
data augmentation, RigorLLM offers a robust solution to harmful content
moderation. Our experimental evaluations demonstrate that RigorLLM not only
outperforms existing baselines like OpenAI API and Perspective API in detecting
harmful content but also exhibits unparalleled resilience to jailbreaking
attacks. The innovative use of constrained optimization and a fusion-based
guardrail approach represents a significant step forward in developing more
secure and reliable LLMs, setting a new standard for content moderation
frameworks in the face of evolving digital threats.
[LINK]
http://arxiv.org/abs/2403.13031v2
[DATE]
2024-07-24 06:56:13+08:00
[CATEGORIES]
cs.CL
cs.LG
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
[AUTHORS]
Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Michael Bendersky
[ABSTRACT]
Retrieval Augmented Generation (RAG) has been a powerful tool for Large
Language Models (LLMs) to efficiently process overly lengthy contexts. However,
recent LLMs like Gemini-1.5 and GPT-4 show exceptional capabilities to
understand long contexts directly. We conduct a comprehensive comparison
between RAG and long-context (LC) LLMs, aiming to leverage the strengths of
both. We benchmark RAG and LC across various public datasets using three of the latest
LLMs. Results reveal that when resourced sufficiently, LC consistently
outperforms RAG in terms of average performance. However, RAG’s significantly
lower cost remains a distinct advantage. Based on this observation, we propose
Self-Route, a simple yet effective method that routes queries to RAG or LC
based on model self-reflection. Self-Route significantly reduces the
computation cost while maintaining a comparable performance to LC. Our findings
provide a guideline for long-context applications of LLMs using RAG and LC.
[LINK]
http://arxiv.org/abs/2407.16833v1
[DATE]
2024-07-24 04:51:52+08:00
[CATEGORIES]
cs.CL
cs.LG
Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models
[AUTHORS]
Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
[ABSTRACT]
Large-scale vision-and-language models, such as CLIP, are typically trained
on web-scale data, which can introduce inappropriate content and lead to the
development of unsafe and biased behavior. This, in turn, hampers their
applicability in sensitive and trustworthy contexts and could raise significant
concerns in their adoption. Our research introduces a novel approach to
enhancing the safety of vision-and-language models by diminishing their
sensitivity to NSFW (not safe for work) inputs. In particular, our methodology
seeks to sever “toxic” linguistic and visual concepts, unlearning the linkage
between unsafe linguistic or visual items and unsafe regions of the embedding
space. We show how this can be done by fine-tuning a CLIP model on synthetic
data obtained from a large language model trained to convert between safe and
unsafe sentences, and a text-to-image generator. We conduct extensive
experiments on the resulting embedding space for cross-modal retrieval,
text-to-image, and image-to-text generation, where we show that our model can
be remarkably employed with pre-trained generative models. Our source code and
trained models are available at: https://github.com/aimagelab/safe-clip.
[COMMENTS]
ECCV 2024
[LINK]
http://arxiv.org/abs/2311.16254v3
[DATE]
2024-07-24 04:38:11+08:00
[CATEGORIES]
cs.CL
Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models
[AUTHORS]
Zhibo Hu, Chen Wang, Yanfeng Shu, Helen Paik, Liming Zhu
[ABSTRACT]
The robustness of large language models (LLMs) becomes increasingly important
as their use rapidly grows in a wide range of domains. Retrieval-Augmented
Generation (RAG) is considered as a means to improve the trustworthiness of
text generation from LLMs. However, how the outputs from RAG-based LLMs are
affected by slightly different inputs is not well studied. In this work, we
find that the insertion of even a short prefix to the prompt leads to the
generation of outputs far away from factually correct answers. We
systematically evaluate the effect of such prefixes on RAG by introducing a
novel optimization technique called Gradient Guided Prompt Perturbation (GGPP).
GGPP achieves a high success rate in steering outputs of RAG-based LLMs to
targeted wrong answers. It can also cope with instructions in the prompts
requesting to ignore irrelevant context. We also exploit LLMs’ neuron
activation difference between prompts with and without GGPP perturbations to
give a method that improves the robustness of RAG-based LLMs through a highly
effective detector trained on neuron activation triggered by GGPP generated
prompts. Our evaluation on open-sourced LLMs demonstrates the effectiveness of
our methods.
[COMMENTS]
12 pages, 9 figures
[LINK]
http://arxiv.org/abs/2402.07179v3
[DATE]
2024-07-24 03:41:05+08:00
[CATEGORIES]
cs.CL
Differentially Private Synthetic Data via Foundation Model APIs 2: Text
[AUTHORS]
Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin
[ABSTRACT]
Text data has become extremely valuable due to the emergence of machine
learning algorithms that learn from it. A lot of high-quality text data
generated in the real world is private and therefore cannot be shared or used
freely due to privacy concerns. Generating synthetic replicas of private text
data with a formal privacy guarantee, i.e., differential privacy (DP), offers a
promising and scalable solution. However, existing methods necessitate DP
finetuning of large language models (LLMs) on private data to generate DP
synthetic data. This approach is not viable for proprietary LLMs (e.g.,
GPT-3.5) and also demands considerable computational resources for open-source
LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE)
algorithm to generate DP synthetic images with only API access to diffusion
models. In this work, we propose an augmented PE algorithm, named Aug-PE, that
applies to the complex setting of text. We use API access to an LLM and
generate DP synthetic text without any model training. We conduct comprehensive
experiments on three benchmark datasets. Our results demonstrate that Aug-PE
produces DP synthetic text that yields competitive utility with the SOTA DP
finetuning baselines. This underscores the feasibility of relying solely on API
access of LLMs to produce high-quality DP synthetic texts, thereby facilitating
more accessible routes to privacy-preserving LLM applications. Our code and
data are available at https://github.com/AI-secure/aug-pe.
[COMMENTS]
ICML'24 Spotlight
[LINK]
http://arxiv.org/abs/2403.01749v2
[DATE]
2024-07-24 03:19:02+08:00
[CATEGORIES]
cs.CL
APPLS: Evaluating Evaluation Metrics for Plain Language Summarization
[AUTHORS]
Yue Guo, Tal August, Gondy Leroy, Trevor Cohen, Lucy Lu Wang
[ABSTRACT]
While there has been significant development of models for Plain Language
Summarization (PLS), evaluation remains a challenge. PLS lacks a dedicated
assessment metric, and the suitability of text generation evaluation metrics is
unclear due to the unique transformations involved (e.g., adding background
explanations, removing jargon). To address these questions, our study
introduces a granular meta-evaluation testbed, APPLS, designed to evaluate
metrics for PLS. We identify four PLS criteria from previous work –
informativeness, simplification, coherence, and faithfulness – and define a
set of perturbations corresponding to these criteria that sensitive metrics
should be able to detect. We apply these perturbations to extractive hypotheses
for two PLS datasets to form our testbed. Using APPLS, we assess performance of
14 metrics, including automated scores, lexical features, and LLM prompt-based
evaluations. Our analysis reveals that while some current metrics show
sensitivity to specific criteria, no single method captures all four criteria
simultaneously. We therefore recommend a suite of automated metrics be used to
capture PLS quality along all relevant criteria. This work contributes the
first meta-evaluation testbed for PLS and a comprehensive evaluation of
existing metrics. APPLS and our evaluation code are available at
https://github.com/LinguisticAnomalies/APPLS.
[LINK]
http://arxiv.org/abs/2305.14341v3
[DATE]
2024-07-24 02:28:43+08:00
[CATEGORIES]
cs.CL
Learning Task Decomposition to Assist Humans in Competitive Programming
[AUTHORS]
Jiaxin Wen, Ruiqi Zhong, Pei Ke, Zhihong Shao, Hongning Wang, Minlie Huang
[ABSTRACT]
When using language models (LMs) to solve complex problems, humans might
struggle to understand the LM-generated solutions and repair the flawed ones.
To assist humans in repairing them, we propose to automatically decompose
complex solutions into multiple simpler pieces that correspond to specific
subtasks. We introduce a novel objective for learning task decomposition,
termed assistive value (AssistV), which measures the feasibility and speed for
humans to repair the decomposed solution. We collect a dataset of human repair
experiences on different decomposed solutions. Utilizing the collected data as
in-context examples, we then learn to critique, refine, and rank decomposed
solutions to improve AssistV. We validate our method under competitive
programming problems: under 177 hours of human study, our method enables
non-experts to solve 33.3% more problems, speeds them up by 3.3x, and empowers
them to match unassisted experts.
[COMMENTS]
ACL 2024 Main Conference
[LINK]
http://arxiv.org/abs/2406.04604v3
[DATE]
2024-07-24 02:26:32+08:00
[CATEGORIES]
cs.CL
Cross-lingual Argument Mining in the Medical Domain
[AUTHORS]
Anar Yeginbergen, Rodrigo Agerri
[ABSTRACT]
Nowadays the medical domain is receiving more and more attention in
applications involving Artificial Intelligence as clinicians' decision-making is
increasingly dependent on dealing with enormous amounts of unstructured textual
data. In this context, Argument Mining (AM) helps to meaningfully structure
textual data by identifying the argumentative components in the text and
classifying the relations between them. However, as is the case for many
tasks in Natural Language Processing in general and in medical text processing
in particular, the large majority of the work on computational argumentation
has been focusing only on the English language. In this paper, we investigate
several strategies to perform AM in medical texts for a language such as
Spanish, for which no annotated data is available. Our work shows that
automatically translating and projecting annotations (data-transfer) from
English to a given target language is an effective way to generate annotated
data without costly manual intervention. Furthermore, and contrary to
conclusions from previous work for other sequence labelling tasks, our
experiments demonstrate that data-transfer outperforms methods based on the
crosslingual transfer capabilities of multilingual pre-trained language models
(model-transfer). Finally, we show how the automatically generated data in
Spanish can also be used to improve results in the original English monolingual
setting, thus providing a fully automatic data augmentation strategy.
[LINK]
http://arxiv.org/abs/2301.10527v3
[DATE]
2024-07-24 02:17:35+08:00
[CATEGORIES]
cs.CL
VisMin: Visual Minimal-Change Understanding
[AUTHORS]
Rabiul Awal, Saba Ahmadi, Le Zhang, Aishwarya Agrawal
[ABSTRACT]
Fine-grained understanding of objects, attributes, and relationships between
objects is crucial for visual-language models (VLMs). Existing benchmarks
primarily focus on evaluating VLMs’ capability to distinguish between two very
similar captions given an image. In this paper, we introduce a new,
challenging benchmark termed Visual Minimal-Change
Understanding (VisMin), which requires models to predict the correct
image-caption match given two images and two captions. The image pair and
caption pair contain minimal changes, i.e., only one aspect changes at a time
from among the following: object, attribute, count,
and spatial relation. These changes test the models’ understanding of
objects, attributes (such as color, material, shape), counts, and spatial
relationships between objects. We built an automatic framework using large
language models and diffusion models, followed by a rigorous 4-step
verification process by human annotators. Empirical experiments reveal that
current VLMs exhibit notable deficiencies in understanding spatial
relationships and counting abilities. We also generate a large-scale training
dataset to finetune CLIP and Idefics2, showing significant improvements in
fine-grained understanding across benchmarks and in CLIP’s general image-text
alignment. We release all resources, including the benchmark, training data,
and finetuned model checkpoints, at https://vismin.net/.
[COMMENTS]
Project URL at https://vismin.net/
[LINK]
http://arxiv.org/abs/2407.16772v1
[DATE]
2024-07-24 02:10:43+08:00
[CATEGORIES]
cs.CL
cs.LG
Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models
[AUTHORS]
Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, Boaz Barak
[ABSTRACT]
Watermarking generative models consists of planting a statistical signal
(watermark) in a model’s output so that it can be later verified that the
output was generated by the given model. A strong watermarking scheme satisfies
the property that a computationally bounded attacker cannot erase the watermark
without causing significant quality degradation. In this paper, we study the
(im)possibility of strong watermarking schemes. We prove that, under
well-specified and natural assumptions, strong watermarking is impossible to
achieve. This holds even in the private detection algorithm setting, where the
watermark insertion and detection algorithms share a secret key, unknown to the
attacker. To prove this result, we introduce a generic efficient watermark
attack; the attacker is not required to know the private key of the scheme or
even which scheme is used. Our attack is based on two assumptions: (1) The
attacker has access to a “quality oracle” that can evaluate whether a candidate
output is a high-quality response to a prompt, and (2) The attacker has access
to a “perturbation oracle” which can modify an output with a nontrivial
probability of maintaining quality, and which induces an efficiently mixing
random walk on high-quality outputs. We argue that both assumptions can be
satisfied in practice by an attacker with weaker computational capabilities
than the watermarked model itself, to which the attacker has only black-box
access. Furthermore, our assumptions will likely only be easier to satisfy over
time as models grow in capabilities and modalities. We demonstrate the
feasibility of our attack by instantiating it to attack three existing
watermarking schemes for large language models: Kirchenbauer et al. (2023),
Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully
removes the watermarks planted by all three schemes, with only minor quality
degradation.
[COMMENTS]
ICML 2024. Website: https://hanlin-zhang.com/impossibility-watermarks
[LINK]
http://arxiv.org/abs/2311.04378v4
[DATE]
2024-07-24 02:05:59+08:00
[CATEGORIES]
cs.LG
cs.CL
Stress-Testing Long-Context Language Models with Lifelong ICL and Task Haystack
[AUTHORS]
Xiaoyue Xu, Qinyuan Ye, Xiang Ren
[ABSTRACT]
We introduce Lifelong ICL, a problem setting that challenges long-context
language models (LMs) to learn from a sequence of language tasks through
in-context learning (ICL). We further introduce Task Haystack, an evaluation
suite dedicated to assessing and diagnosing how long-context LMs utilize
contexts in Lifelong ICL. When given a task instruction and test inputs,
long-context LMs are expected to leverage the relevant demonstrations in the
Lifelong ICL prompt, avoid distraction and interference from other tasks, and
achieve test accuracies that are not significantly worse than the Single-task
ICL baseline.
Task Haystack draws inspiration from the widely-adopted
“needle-in-a-haystack” (NIAH) evaluation, but presents new and unique
challenges. It demands that models (1) utilize the contexts with deeper
understanding, rather than resorting to simple copying and pasting; (2)
navigate through long streams of evolving topics and tasks, which closely
approximates the complexities of real-world usage of long-context LMs.
Additionally, Task Haystack inherits the controllability aspect of NIAH,
providing model developers with tools and visualizations to identify model
vulnerabilities effectively.
We benchmark 12 long-context LMs using Task Haystack. We find that
state-of-the-art closed models such as GPT-4o still struggle in this setting,
failing 15% of the cases on average, while all open-weight models we evaluate
further lag behind by a large margin, failing up to 61% of the cases. In our
controlled analysis, we identify factors such as distraction and recency bias
as contributors to these failure cases. Further, we observe declines in
performance when task instructions are paraphrased at test time or when ICL
demonstrations are repeated excessively, raising concerns about the robustness,
instruction understanding, and true context utilization of current long-context
LMs.
[COMMENTS]
Code: https://github.com/INK-USC/Lifelong-ICL; Website:
https://inklab.usc.edu/lifelong-icl/
[LINK]
http://arxiv.org/abs/2407.16695v1
[DATE]
2024-07-24 01:57:41+08:00
[CATEGORIES]
cs.CL
cs.LG
Explanation Regularisation through the Lens of Attributions
[AUTHORS]
Pedro Ferreira, Wilker Aziz, Ivan Titov
[COMMENTS]
18 pages, 7 figures, 8 tables
[LINK]
http://arxiv.org/abs/2407.16693v1
[DATE]
2024-07-24 01:56:32+08:00
[CATEGORIES]
cs.CL
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
[AUTHORS]
Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti
[ABSTRACT]
Transformers have emerged as the backbone of large language models (LLMs).
However, generation remains inefficient due to the need to store in memory a
cache of key-value representations for past tokens, whose size scales linearly
with the input sequence length and batch size. As a solution, we propose
Dynamic Memory Compression (DMC), a method for online key-value cache
compression at inference time. Most importantly, the model learns to apply
different compression ratios in different heads and layers. We retrofit
pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers,
achieving up to 7x throughput increase during auto-regressive inference on an
NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible
percentage of the original data without adding any extra parameters. DMC
preserves the original downstream performance with up to 4x cache compression,
outperforming up-trained grouped-query attention (GQA) and key-value eviction
policies (H2O, TOVA). GQA and DMC can even be combined to obtain compounded
gains. Hence, DMC can serve as a drop-in replacement for KV caching in existing
LLMs to fit longer contexts and larger batches within any given memory budget.
[LINK]
http://arxiv.org/abs/2403.09636v2
[DATE]
2024-07-24 01:55:30+08:00
[CATEGORIES]
cs.CL
OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
[AUTHORS]
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, Graham Neubig
[ABSTRACT]
Software is one of the most powerful tools that we humans have at our
disposal; it allows a skilled programmer to interact with the world in complex
and profound ways. At the same time, thanks to improvements in large language
models (LLMs), there has also been a rapid development in AI agents that
interact with and effect change in their surrounding environments. In this
paper, we introduce OpenDevin, a platform for the development of powerful and
flexible AI agents that interact with the world in similar ways to those of a
human developer: by writing code, interacting with a command line, and browsing
the web. We describe how the platform allows for the implementation of new
agents, safe interaction with sandboxed environments for code execution,
coordination between multiple agents, and incorporation of evaluation
benchmarks. Based on our currently incorporated benchmarks, we perform an
evaluation of agents over 15 challenging tasks, including software engineering
(e.g., SWE-Bench) and web browsing (e.g., WebArena), among others. Released
under the permissive MIT license, OpenDevin is a community project spanning
academia and industry with more than 1.3K contributions from over 160
contributors and will improve going forward.
[COMMENTS]
Code: https://github.com/OpenDevin/OpenDevin
[LINK]
http://arxiv.org/abs/2407.16741v1
[DATE]
2024-07-24 01:50:43+08:00
[CATEGORIES]
cs.CL
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
[AUTHORS]
Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, Kui Ren
[ABSTRACT]
Recently, advanced Large Language Models (LLMs) such as GPT-4 have been
integrated into many real-world applications like Code Copilot. These
applications have significantly expanded the attack surface of LLMs, exposing
them to a variety of threats. Among them, jailbreak attacks that induce toxic
responses through jailbreak prompts have raised critical safety concerns. To
identify these threats, a growing number of red teaming approaches simulate
potential adversarial scenarios by crafting jailbreak prompts to test the
target LLM. However, existing red teaming methods do not consider the unique
vulnerabilities of LLMs in different scenarios, making it difficult to adjust
the jailbreak prompts to find context-specific vulnerabilities. Meanwhile,
these methods are limited to refining jailbreak templates using a few mutation
operations, lacking the automation and scalability to adapt to different
scenarios. To enable context-aware and efficient red teaming, we abstract and
model existing attacks into a coherent concept called “jailbreak strategy” and
propose a multi-agent LLM system named RedAgent that leverages these strategies
to generate context-aware jailbreak prompts. By self-reflecting on contextual
feedback in an additional memory buffer, RedAgent continuously learns how to
leverage these strategies to achieve effective jailbreaks in specific contexts.
Extensive experiments demonstrate that our system can jailbreak most black-box
LLMs in just five queries, improving the efficiency of existing red teaming
methods by two times. Additionally, RedAgent can jailbreak customized LLM
applications more efficiently. By generating context-aware jailbreak prompts
towards applications on GPTs, we discover 60 severe vulnerabilities of these
real-world applications with only two queries per vulnerability. We have
reported all found issues and communicated with OpenAI and Meta for bug fixes.
[LINK]
http://arxiv.org/abs/2407.16667v1
[DATE]
2024-07-24 01:34:36+08:00
[CATEGORIES]
cs.CL
Towards scalable efficient on-device ASR with transfer learning
[AUTHORS]
Laxmi Pandey, Ke Li, Jinxi Guo, Debjyoti Paul, Arthur Guo, Jay Mahadeokar, Xuedong Zhang
[ABSTRACT]
Multilingual pretraining for transfer learning significantly boosts the
robustness of low-resource monolingual ASR models. This study systematically
investigates three main aspects: (a) the impact of transfer learning on model
performance during initial training or fine-tuning, (b) the influence of
transfer learning across dataset domains and languages, and (c) the effect on
rare-word recognition compared to non-rare words. Our findings suggest that
RNNT-loss pretraining, followed by monolingual fine-tuning with Minimum Word
Error Rate (MinWER) loss, consistently reduces Word Error Rates (WER) across
languages like Italian and French. WER Reductions (WERR) reach 36.2% and 42.8%
compared to monolingual baselines for MLS and in-house datasets. Out-of-domain
pretraining leads to 28% higher WERR than in-domain pretraining. Both rare and
non-rare words benefit, with rare words showing greater improvements with
out-of-domain pretraining, and non-rare words with in-domain pretraining.
[LINK]
http://arxiv.org/abs/2407.16664v1
[DATE]
2024-07-24 01:29:02+08:00
[CATEGORIES]
cs.CL
A Survey of Text Style Transfer: Applications and Ethical Implications
[AUTHORS]
Sourabrata Mukherjee, Mateusz Lango, Zdenek Kasner, Ondrej Dušek
[ABSTRACT]
Text style transfer (TST) is an important task in controllable text
generation, which aims to control selected attributes of language use, such as
politeness, formality, or sentiment, without altering the style-independent
content of the text. The field has received considerable research attention in
recent years and has already been covered in several reviews, but the focus has
mostly been on the development of new algorithms and learning from different
types of data (supervised, unsupervised, out-of-domain, etc.) and not so much
on the application side. However, TST-related technologies are gradually
reaching a production- and deployment-ready level, and therefore, the inclusion
of the application perspective in TST research becomes crucial. Similarly, the
often overlooked ethical considerations of TST technology have become a
pressing issue. This paper presents a comprehensive review of TST applications
that have been researched over the years, using both traditional linguistic
approaches and more recent deep learning methods. We discuss current
challenges, future research directions, and ethical implications of TST
applications in text generation. By providing a holistic overview of the
landscape of TST applications, we hope to stimulate further research and
contribute to a better understanding of the potential as well as ethical
considerations associated with TST.
[LINK]
http://arxiv.org/abs/2407.16737v1
[DATE]
2024-07-24 01:15:23+08:00
[CATEGORIES]
cs.CL
Semantic Change Characterization with LLMs using Rhetorics
[AUTHORS]
Jader Martins Camboim de Sá, Marcos Da Silveira, Cédric Pruski
[ABSTRACT]
Languages continually evolve in response to societal events, resulting in new
terms and shifts in meanings. These changes have significant implications for
computer applications, including automatic translation and chatbots, making it
essential to characterize them accurately. The recent development of LLMs has
notably advanced natural language understanding, particularly in sense
inference and reasoning. In this paper, we investigate the potential of LLMs in
characterizing three types of semantic change: dimension, relation, and
orientation. We achieve this by combining LLMs’ Chain-of-Thought with
rhetorical devices and conducting an experimental assessment of our approach
using newly created datasets. Our results highlight the effectiveness of LLMs
in capturing and analyzing semantic changes, providing valuable insights to
improve computational linguistic applications.
[LINK]
http://arxiv.org/abs/2407.16624v1
[DATE]
2024-07-24 00:32:49+08:00
[CATEGORIES]
cs.CL
Variation Spaces for Multi-Output Neural Networks: Insights on Multi-Task Learning and Network Compression
[AUTHORS]
Joseph Shenouda, Rahul Parhi, Kangwook Lee, Robert D. Nowak
[ABSTRACT]
This paper introduces a novel theoretical framework for the analysis of
vector-valued neural networks through the development of vector-valued
variation spaces, a new class of reproducing kernel Banach spaces. These spaces
emerge from studying the regularization effect of weight decay in training
networks with activations like the rectified linear unit (ReLU). This framework
offers a deeper understanding of multi-output networks and their function-space
characteristics. A key contribution of this work is the development of a
representer theorem for the vector-valued variation spaces. This representer
theorem establishes that shallow vector-valued neural networks are the
solutions to data-fitting problems over these infinite-dimensional spaces,
where the network widths are bounded by the square of the number of training
data. This observation reveals that the norm associated with these
vector-valued variation spaces encourages the learning of features that are
useful for multiple tasks, shedding new light on multi-task learning with
neural networks. Finally, this paper develops a connection between weight-decay
regularization and the multi-task lasso problem. This connection leads to novel
bounds for layer widths in deep networks that depend on the intrinsic
dimensions of the training data representations. This insight not only deepens
the understanding of the deep network architectural requirements, but also
yields a simple convex optimization method for deep neural network compression.
The performance of this compression procedure is evaluated on various
architectures.
[COMMENTS]
Updated to version published in JMLR
[LINK]
http://arxiv.org/abs/2305.16534v3
[DATE]
2024-07-24 23:45:58+08:00
[CATEGORIES]
cs.LG
An Experimental Study on the Rashomon Effect of Balancing Methods in Imbalanced Classification
[AUTHORS]
Mustafa Cavus, Przemysław Biecek
[ABSTRACT]
Predictive models may generate biased predictions when classifying imbalanced
datasets. This happens when the model favors the majority class, leading to low
performance in accurately predicting the minority class. To address this issue,
balancing or resampling methods are critical data-centric AI approaches in the
modeling process to improve prediction performance. However, there have been
debates and questions about the functionality of these methods in recent years.
In particular, many candidate models may exhibit very similar predictive
performance, called the Rashomon effect, in model selection, and they may even
produce different predictions for the same observations. Selecting one of these
models without considering the predictive multiplicity – which is the case of
yielding conflicting models’ predictions for any sample – can result in blind
selection. In this paper, the impact of balancing methods on predictive
multiplicity is examined using the Rashomon effect. This is crucial because
blindly selecting a model from a set of approximately equally accurate models
is risky in data-centric AI. This may lead to severe problems in model selection,
validation, and explanation. To tackle this matter, we conducted real dataset
experiments to observe the impact of balancing methods on predictive
multiplicity through the Rashomon effect by using a newly proposed metric,
obscurity, in addition to the existing ones: ambiguity and discrepancy. Our
findings showed that balancing methods inflate the predictive multiplicity and
yield varying results. To monitor the trade-off between the prediction
performance and predictive multiplicity for conducting the modeling process
responsibly, we proposed using the extended version of the performance-gain
plot when balancing the training data.
[COMMENTS]
16 pages, 6 figures
[LINK]
http://arxiv.org/abs/2405.01557v4
[DATE]
2024-07-24 23:43:49+08:00
[CATEGORIES]
cs.LG
Euler Characteristic Tools For Topological Data Analysis
[AUTHORS]
Olympio Hacquard, Vadim Lebovici
[COMMENTS]
39 pages - Version accepted in JMLR
[LINK]
http://arxiv.org/abs/2303.14040v3
[DATE]
2024-07-24 23:29:46+08:00
[CATEGORIES]
cs.LG
Gradient-based inference of abstract task representations for generalization in neural networks
[AUTHORS]
Ali Hummos, Felipe del Río, Brabeeba Mien Wang, Julio Hurtado, Cristian B. Calderon, Guangyu Robert Yang
[ABSTRACT]
Humans and many animals show remarkably adaptive behavior and can respond
differently to the same input depending on their internal goals. The brain not
only represents the intermediate abstractions needed to perform a computation
but also actively maintains a representation of the computation itself (task
abstraction). Such separation of the computation and its abstraction is
associated with faster learning, flexible decision-making, and broad
generalization capacity. We investigate if such benefits might extend to neural
networks trained with task abstractions. For such benefits to emerge, one needs
a task inference mechanism that possesses two crucial abilities: first, the
ability to infer abstract task representations when no longer explicitly
provided (task inference), and second, the ability to manipulate task representations to adapt
to novel problems (task recomposition). To tackle this, we cast task inference
as an optimization problem from a variational inference perspective and ground
our approach in an expectation-maximization framework. We show that gradients
backpropagated through a neural network to a task representation layer are an
efficient heuristic to infer current task demands, a process we refer to as
gradient-based inference (GBI). Further iterative optimization of the task
representation layer allows for recomposing abstractions to adapt to novel
situations. Using a toy example, a novel image classifier, and a language
model, we demonstrate that GBI provides higher learning efficiency and
generalization to novel tasks and limits forgetting. Moreover, we show that GBI
has unique advantages such as preserving information for uncertainty estimation
and detecting out-of-distribution samples.
[LINK]
http://arxiv.org/abs/2407.17356v1
[DATE]
2024-07-24 23:28:08+08:00
[CATEGORIES]
cs.LG
Scalify: scale propagation for efficient low-precision LLM training
[AUTHORS]
Paul Balança, Sam Hosegood, Carlo Luschi, Andrew Fitzgibbon
[ABSTRACT]
Low-precision formats such as float8 have been introduced in machine learning
accelerated hardware to improve computational efficiency for large language
model training and inference. Nevertheless, adoption by the ML community has
been slowed down by the complex, and sometimes brittle, techniques required to
match higher precision training accuracy. In this work, we present Scalify, a
end-to-end scale propagation paradigm for computational graphs, generalizing
and formalizing existing tensor scaling methods. Experiment results show that
Scalify supports out-of-the-box float8 matrix multiplication and gradients
representation, as well as float16 optimizer state storage. Our JAX
implementation of Scalify is open-sourced at
https://github.com/graphcore-research/jax-scalify
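To make the idea of scale propagation concrete, the sketch below pairs each value with an explicit scale and propagates the scale through a multiplication. It is a toy illustration only and does not use or reproduce the jax-scalify API: the `Scaled` class and the `scaled_mul` and `rescale` helpers are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scaled:
    """Hypothetical scaled value: the represented number is data * scale."""
    data: float
    scale: float

    def value(self) -> float:
        return self.data * self.scale


def scaled_mul(a: Scaled, b: Scaled) -> Scaled:
    """Multiply the data parts and propagate the scales separately."""
    return Scaled(a.data * b.data, a.scale * b.scale)


def rescale(a: Scaled, new_scale: float) -> Scaled:
    """Move magnitude between data and scale without changing the value."""
    return Scaled(a.data * a.scale / new_scale, new_scale)


# Powers of two keep the toy arithmetic exact in binary floating point.
a = Scaled(1.5, 2.0 ** -10)
b = Scaled(1.25, 2.0 ** -12)
c = scaled_mul(a, b)        # data stays near 1, scale carries the magnitude
d = rescale(c, 2.0 ** -20)  # same value, different data/scale split
```

In this toy setting the data part stays close to 1, which is the property that would make it a candidate for storage in a narrow format, while the scale absorbs the dynamic range.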
[COMMENTS]
11 pages, 5 figures, ICML 2024 WANT workshop
[LINK]
http://arxiv.org/abs/2407.17353v1
[DATE]
2024-07-24 23:26:01+08:00
[CATEGORIES]
cs.LG
Dataset Distribution Impacts Model Fairness: Single vs. Multi-Task Learning
[AUTHORS]
Ralf Raumanns, Gerard Schouten, Josien P. W. Pluim, Veronika Cheplygina
[ABSTRACT]
The influence of bias in datasets on the fairness of model predictions is a
topic of ongoing research in various fields. We evaluate the performance of
skin lesion classification using ResNet-based CNNs, focusing on patient sex
variations in training data and three different learning strategies. We present
a linear programming method for generating datasets with varying patient sex
and class labels, taking into account the correlations between these variables.
We evaluated the model performance using three different learning strategies: a
single-task model, a reinforcing multi-task model, and an adversarial learning
scheme. Our observations include: 1) sex-specific training data yields better
results, 2) single-task models exhibit sex bias, 3) the reinforcement approach
does not remove sex bias, 4) the adversarial model eliminates sex bias in cases
involving only female patients, and 5) datasets that include male patients
enhance model performance for the male subgroup, even when female patients are
the majority. To generalise these findings, in future research, we will examine
more demographic attributes, like age, and other possibly confounding factors,
such as skin colour and artefacts in the skin lesions. We make all data and
models available on GitHub.
[COMMENTS]
Submitted to MICCAI 2024
[LINK]
http://arxiv.org/abs/2407.17543v1
[DATE]
2024-07-24 23:23:26+08:00
[CATEGORIES]
cs.LG
Mathematical programming algorithms for convex hull approximation with a hyperplane budget
[AUTHORS]
Michele Barbato, Alberto Ceselli, Rosario Messana
[ABSTRACT]
We consider the following problem in computational geometry: given, in the
d-dimensional real space, a set of points marked as positive and a set of
points marked as negative, such that the convex hull of the positive set does
not intersect the negative set, find K hyperplanes that separate, if possible,
all the positive points from the negative ones. That is, we search for a convex
polyhedron with at most K faces, containing all the positive points and no
negative point. The problem is known in the literature for pure convex
polyhedral approximation; our interest stems from its possible applications in
constraint learning, where points are feasible or infeasible solutions of a
Mixed Integer Program, and the K hyperplanes are linear constraints to be
found. We cast the problem as an optimization one, minimizing the number of
negative points inside the convex polyhedron, whenever exact separation cannot
be achieved. We introduce models inspired by support vector machines and we
design two mathematical programming formulations with binary variables. We
exploit Dantzig-Wolfe decomposition to obtain extended formulations, and we
devise column generation algorithms with ad-hoc pricing routines. We compare
computing time and separation error values obtained by all our approaches on
synthetic datasets, with number of points from hundreds up to a few thousands,
showing our approaches to perform better than existing ones from the
literature. Furthermore, we observe that key computational differences arise,
depending on whether the budget K is sufficient to completely separate the
positive points from the negative ones or not. On instances of dimension 8 and
above, existing convex hull algorithms become computationally inapplicable, while
our algorithms identify good convex hull approximations in minutes of
computation.
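The separation objective described in this abstract, counting the negative points that fall inside a polyhedron defined by K hyperplanes, can be sketched as follows. This is an illustrative sketch and not the authors' formulation: the function names, the toy data, and the encoding of each hyperplane as a pair (a, b) meaning a.x <= b are assumptions.

```python
from typing import List, Sequence, Tuple

# Assumed encoding: (a, b) represents the half-space a.x <= b.
Hyperplane = Tuple[Sequence[float], float]


def inside(point: Sequence[float], planes: List[Hyperplane], tol: float = 1e-9) -> bool:
    """Return True if the point satisfies every half-space constraint."""
    return all(
        sum(ai * xi for ai, xi in zip(a, point)) <= b + tol
        for a, b in planes
    )


def separation_error(positives, negatives, planes) -> int:
    """Count negative points inside the polyhedron; all positives must be inside."""
    if not all(inside(p, planes) for p in positives):
        raise ValueError("polyhedron must contain every positive point")
    return sum(inside(q, planes) for q in negatives)


# Unit square in 2-D described by K = 4 hyperplanes (hypothetical toy data).
square = [((1, 0), 1.0), ((-1, 0), 0.0), ((0, 1), 1.0), ((0, -1), 0.0)]
pos = [(0.2, 0.3), (0.9, 0.9)]
neg = [(1.5, 0.5), (0.5, 0.5), (-0.1, 0.2)]
error = separation_error(pos, neg, square)
```

Here only the negative point (0.5, 0.5) lies inside the square, so the separation error is 1; a zero error would mean the budget K suffices for exact separation on this data.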
[LINK]
http://arxiv.org/abs/2407.17341v1
[DATE]
2024-07-24 23:08:52+08:00
[CATEGORIES]
cs.LG
QUACK: Quantum Aligned Centroid Kernel
[AUTHORS]
Kilian Tscharke, Sebastian Issel, Pascal Debus
[ABSTRACT]
Quantum computing (QC) seems to show potential for application in machine
learning (ML). In particular quantum kernel methods (QKM) exhibit promising
properties for use in supervised ML tasks. However, a major disadvantage of
kernel methods is their unfavorable quadratic scaling with the number of
training samples. Together with the limits imposed by currently available
quantum hardware (NISQ devices) with their low qubit coherence times, small
number of qubits, and high error rates, the use of QC in ML at an industrially
relevant scale is currently impossible. As a small step in improving the
potential applications of QKMs, we introduce QUACK, a quantum kernel algorithm
whose time complexity scales linearly with the number of samples during training,
and independent of the number of training samples in the inference stage. In
the training process, only the kernel entries for the samples and the centers
of the classes are calculated, i.e. the maximum shape of the kernel for n
samples and c classes is (n, c). During training, the parameters of the quantum
kernel and the positions of the centroids are optimized iteratively. In the
inference stage, for every new sample the circuit is only evaluated for every
centroid, i.e. c times. We show that the QUACK algorithm nevertheless provides
satisfactory results and can perform at a similar level as classical kernel
methods with quadratic scaling during training. In addition, our (simulated)
algorithm is able to handle high-dimensional datasets such as MNIST with 784
features without any dimensionality reduction.
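The scaling argument in this abstract (an (n, c) kernel during training and c evaluations per sample at inference) can be illustrated with a classical stand-in. The sketch below is not the QUACK algorithm: it assumes a classical RBF kernel in place of the quantum kernel and fixes the centroids to class means, whereas the abstract states that kernel parameters and centroid positions are optimized iteratively. All function names are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Classical stand-in for one quantum kernel evaluation (assumption)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def class_centroids(samples, labels):
    """Mean of each class; used here instead of optimized centroids."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {
        y: [sum(col) / len(pts) for col in zip(*pts)]
        for y, pts in groups.items()
    }


def centroid_kernel_matrix(samples, centroids):
    """n x c matrix: one kernel evaluation per (sample, centroid) pair."""
    classes = sorted(centroids)
    return [[rbf_kernel(x, centroids[c]) for c in classes] for x in samples]


def predict(x, centroids):
    """Inference needs c kernel evaluations, independent of training-set size."""
    return max(centroids, key=lambda c: rbf_kernel(x, centroids[c]))


X = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
y = [0, 0, 1, 1]
cents = class_centroids(X, y)
K = centroid_kernel_matrix(X, cents)  # shape (n, c) = (4, 2)
```

A full kernel method would instead build an n x n matrix, which is the quadratic cost the abstract refers to.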
[COMMENTS]
Accepted to IEEE International Conference on Quantum Computing and
Engineering (QCE) 2024
[LINK]
http://arxiv.org/abs/2405.00304v2
[DATE]
2024-07-24 23:01:45+08:00
[CATEGORIES]
cs.LG
Global and Local Confidence Based Fraud Detection Graph Neural Network
[AUTHORS]
Jiaxun Liu, Yue Tian, Guanjun Liu
[ABSTRACT]
This paper presents the Global and Local Confidence Graph Neural Network
(GLC-GNN), an innovative approach to graph-based anomaly detection that
addresses the challenges of heterophily and camouflage in fraudulent
activities. By introducing a prototype to encapsulate the global features of a
graph and calculating a Global Confidence (GC) value for each node, GLC-GNN
effectively distinguishes between benign and fraudulent nodes. The model
leverages GC to generate attention values for message aggregation, enhancing
its ability to capture both homophily and heterophily. Through extensive
experiments on four open datasets, GLC-GNN demonstrates superior performance
over state-of-the-art models in accuracy and convergence speed, while
maintaining a compact model size and expedited training process. The
integration of global and local confidence measures in GLC-GNN offers a robust
solution for detecting anomalies in graphs, with significant implications for
fraud detection across diverse domains.
[LINK]
http://arxiv.org/abs/2407.17333v1
[DATE]
2024-07-24 22:55:37+08:00
[CATEGORIES]
cs.LG
Low dimensional representation of multi-patient flow cytometry datasets using optimal transport for minimal residual disease detection in leukemia
[AUTHORS]
Erell Gachon, Jérémie Bigot, Elsa Cazelles, Aguirre Mimoun, Jean-Philippe Vial
[ABSTRACT]
Representing and quantifying Minimal Residual Disease (MRD) in Acute Myeloid
Leukemia (AML), a type of cancer that affects the blood and bone marrow, is
essential in the prognosis and follow-up of AML patients. As traditional
cytological analysis cannot detect leukemia cells below 5\%, the analysis of
flow cytometry datasets is expected to provide more reliable results. In this
paper, we explore statistical learning methods based on optimal transport (OT)
to achieve a relevant low-dimensional representation of multi-patient flow
cytometry measurements (FCM) datasets considered as high-dimensional
probability distributions. Using the framework of OT, we justify the use of the
K-means algorithm for dimensionality reduction of multiple large-scale point
clouds through mean measure quantization by merging all the data into a single
point cloud. After this quantization step, the visualization of the intra- and
inter-patient FCM variability is carried out by embedding low-dimensional
quantized probability measures into a linear space using either Wasserstein
Principal Component Analysis (PCA) through linearized OT or log-ratio PCA of
compositional data. Using a publicly available FCM dataset and an FCM dataset
from Bordeaux University Hospital, we demonstrate the benefits of our approach
over the popular kernel mean embedding technique for statistical learning from
multiple high-dimensional probability distributions. We also highlight the
usefulness of our methodology for low-dimensional projection and clustering
patient measurements according to their level of MRD in AML from FCM. In
particular, our OT-based approach allows a relevant and informative
two-dimensional representation of the results of the FlowSom algorithm, a
state-of-the-art method for the detection of MRD in AML using multi-patient
FCM.
[LINK]
http://arxiv.org/abs/2407.17329v1
[DATE]
2024-07-24 22:53:01+08:00
[CATEGORIES]
cs.LG
MoveLight: Enhancing Traffic Signal Control through Movement-Centric Deep Reinforcement Learning
[AUTHORS]
Junqi Shao, Chenhao Zheng, Yuxuan Chen, Yucheng Huang, Rui Zhang
[ABSTRACT]
This paper introduces MoveLight, a novel traffic signal control system that
enhances urban traffic management through movement-centric deep reinforcement
learning. By leveraging detailed real-time data and advanced machine learning
techniques, MoveLight overcomes the limitations of traditional traffic signal
control methods. It employs a lane-level control approach using the FRAP
algorithm to achieve dynamic and adaptive traffic signal control, optimizing
traffic flow, reducing congestion, and improving overall efficiency. Our
research demonstrates the scalability and effectiveness of MoveLight across
single intersections, arterial roads, and network levels. Experimental results
using real-world datasets from Cologne and Hangzhou show significant
improvements in metrics such as queue length, delay, and throughput compared to
existing methods. This study highlights the transformative potential of deep
reinforcement learning in intelligent traffic signal control, setting a new
standard for sustainable and efficient urban transportation systems.
[LINK]
http://arxiv.org/abs/2407.17303v1
[DATE]
2024-07-24 22:17:16+08:00
[CATEGORIES]
cs.LG
High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise
[AUTHORS]
Eduard Gorbunov, Abdurakhmon Sadiev, Marina Danilova, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, Peter Richtárik
[ABSTRACT]
High-probability analysis of stochastic first-order optimization methods
under mild assumptions on the noise has been gaining a lot of attention in
recent years. Typically, gradient clipping is one of the key algorithmic
ingredients to derive good high-probability guarantees when the noise is
heavy-tailed. However, if implemented naïvely, clipping can spoil the
convergence of the popular methods for composite and distributed optimization
(Prox-SGD/Parallel SGD) even in the absence of any noise. Due to this reason,
many works on high-probability analysis consider only unconstrained
non-distributed problems, and the existing results for composite/distributed
problems do not include some important special cases (like strongly convex
problems) and are not optimal. To address this issue, we propose new stochastic
methods for composite and distributed optimization based on the clipping of
stochastic gradient differences and prove tight high-probability convergence
results (including nearly optimal ones) for the new methods. Using similar
ideas, we also develop new methods for composite and distributed variational
inequalities and analyze the high-probability convergence of these methods.
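The contrast between clipping a gradient and clipping a gradient difference can be sketched with the standard norm-clipping operator. This is a simplified illustration, not the methods proposed in the paper: the reference vector `shift` and the helper `clipped_difference_estimate` are hypothetical, and the rule for updating the shift is omitted.

```python
import math


def clip(vec, level):
    """Standard norm clipping: scale vec so that its norm is at most level."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= level or norm == 0.0:
        return list(vec)
    return [v * level / norm for v in vec]


def clipped_difference_estimate(grad, shift, level):
    """Clip only the difference between the gradient and a reference vector."""
    diff = [g - s for g, s in zip(grad, shift)]
    clipped = clip(diff, level)
    return [s + c for s, c in zip(shift, clipped)]


g = [3.0, 4.0]
naive = clip(g, 1.0)  # direct clipping shrinks a noise-free gradient
estimate = clipped_difference_estimate(g, g, 1.0)  # exact shift: no distortion
```

When the reference vector tracks the gradient well, the difference is small and clipping leaves it untouched, so the estimate is not biased by the clipping level even though the gradient itself has a large norm.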
[COMMENTS]
ICML 2024; changes in version 2: minor corrections (typos were fixed
and the structure was modified)
[LINK]
http://arxiv.org/abs/2310.01860v2
[DATE]
2024-07-24 22:10:13+08:00
[CATEGORIES]
cs.LG
Enhanced SMC$^2$: Leveraging Gradient Information from Differentiable Particle Filters Within Langevin Proposals
[AUTHORS]
Conor Rosato, Joshua Murphy, Alessandro Varsi, Paul Horridge, Simon Maskell
[ABSTRACT]
Sequential Monte Carlo Squared (SMC$^2$) is a Bayesian method which can infer
the states and parameters of non-linear, non-Gaussian state-space models. The
standard random-walk proposal in SMC$^2$ faces challenges, particularly with
high-dimensional parameter spaces. This study outlines a novel approach by
harnessing first-order gradients derived from a Common Random Numbers -
Particle Filter (CRN-PF) using PyTorch. The resulting gradients can be
leveraged within a Langevin proposal without accept/reject. Including Langevin
dynamics within the proposal can result in a higher effective sample size and
more accurate parameter estimates when compared with the random-walk. The
resulting algorithm is parallelized on distributed memory using Message Passing
Interface (MPI) and runs in $\mathcal{O}(\log_2N)$ time complexity. Utilizing
64 computational cores we obtain a 51x speed-up when compared to a single core.
A GitHub link is given which provides access to the code.
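A Langevin proposal without an accept/reject step takes the generic form theta' = theta + (step^2 / 2) * grad log pi(theta) + step * xi, with xi drawn from a standard normal. The sketch below shows that form only. It assumes a standard normal target as a stand-in for the gradient that the paper obtains from a differentiable particle filter, and the function names are hypothetical.

```python
import random


def langevin_drift(theta, grad, step):
    """Deterministic part of the proposal: move along the gradient."""
    return [t + 0.5 * step ** 2 * g for t, g in zip(theta, grad)]


def langevin_proposal(theta, grad_log_target, step, rng):
    """Draw theta' = drift(theta) + step * N(0, I), with no accept/reject."""
    grad = grad_log_target(theta)
    mean = langevin_drift(theta, grad, step)
    return [m + step * rng.gauss(0.0, 1.0) for m in mean]


def grad_log_std_normal(theta):
    """Gradient of log N(0, I); stand-in for a particle-filter gradient."""
    return [-t for t in theta]


rng = random.Random(0)  # seeded for reproducibility
theta = [2.0, -1.0]
drift = langevin_drift(theta, grad_log_std_normal(theta), 0.1)
proposal = langevin_proposal(theta, grad_log_std_normal, 0.1, rng)
```

Setting the gradient to zero recovers a plain random-walk proposal, which is the baseline the abstract compares against.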
[COMMENTS]
8 pages, 3 images. Accepted to 2024 IEEE International Conference on
Multisensor Fusion and Integration (MFI 2024). https://mfi2024.org/. arXiv
admin note: text overlap with arXiv:2311.12973
[LINK]
http://arxiv.org/abs/2407.17296v1
[DATE]
2024-07-24 22:05:44+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Fabiano Belém, Washington Cunha, Celso França, Claudio Andrade, Leonardo Rocha, Marcos André Gonçalves
[ABSTRACT]
This is the first work to investigate the effectiveness of BERT-based
contextual embeddings in active learning (AL) tasks on cold-start scenarios,
where traditional fine-tuning is infeasible due to the absence of labeled data.
Our primary contribution is the proposal of a more robust fine-tuning pipeline,
DoTCAL, that diminishes the reliance on labeled data in AL using two steps:
(1) fully leveraging unlabeled data through domain adaptation of the embeddings
via masked language modeling and (2) further adjusting model weights using
labeled data selected by AL. Our evaluation contrasts BERT-based embeddings
with other prevalent text representation paradigms, including Bag of Words
(BoW), Latent Semantic Indexing (LSI), and FastText, at two critical stages of
the AL process: instance selection and classification. Experiments conducted on
eight ATC benchmarks with varying AL budgets (number of labeled instances) and
number of instances (about 5,000 to 300,000) demonstrate DoTCAL’s superior
effectiveness, achieving up to a 33% improvement in Macro-F1 while reducing
labeling efforts by half compared to the traditional one-step method. We also
found that in several tasks, BoW and LSI (due to information aggregation)
produce results superior (up to 59%) to BERT, especially in low-budget
scenarios and hard-to-classify tasks, which is quite surprising.
[COMMENTS]
11 pages, 4 figures, 2 Tables, and 1 algorithm
[LINK]
http://arxiv.org/abs/2407.17284v1
[DATE]
2024-07-24 21:50:21+08:00
[CATEGORIES]
cs.LG
Learning from Graphs with Heterophily: Progress and Future
[AUTHORS]
Chenghua Gong, Yao Cheng, Xiang Li, Caihua Shan, Siqiang Luo
[ABSTRACT]
Graphs are structured data that models complex relations between real-world
entities. Heterophilous graphs, where linked nodes are prone to be with
different labels or dissimilar features, have recently attracted significant
attention and found many applications. Meanwhile, increasing efforts have been
made to advance learning from heterophilous graphs. Although there exist
surveys on the relevant topic, they focus on heterophilous GNNs, which are only
sub-topics of heterophilous graph learning. In this survey, we comprehensively
overview existing works on learning from graphs with heterophily. First, we
collect over 180 publications and introduce the development of this field.
Then, we systematically categorize existing methods based on a hierarchical
taxonomy including learning strategies, model architectures and practical
applications. Finally, we discuss the primary challenges of existing studies
and highlight promising avenues for future research. More publication details
and corresponding open-source code can be accessed, and will be continuously
updated, at our repository:
https://github.com/gongchenghua/Papers-Graphs-with-Heterophily.
[LINK]
http://arxiv.org/abs/2401.09769v3
[DATE]
2024-07-24 21:49:13+08:00
[CATEGORIES]
cs.LG
Enhanced Feature Learning via Regularisation: Integrating Neural Networks and Kernel Methods
[AUTHORS]
Bertille Follain, Francis Bach
[ABSTRACT]
We propose a new method for feature learning and function estimation in
supervised learning via regularised empirical risk minimisation. Our approach
considers functions as expectations of Sobolev functions over all possible
one-dimensional projections of the data. This framework is similar to kernel
ridge regression, where the kernel is $\mathbb{E}_w ( k^{(B)}(w^\top x,w^\top
x^\prime))$, with $k^{(B)}(a,b) := \min(|a|, |b|)\mathbf{1}_{ab>0}$ the Brownian kernel,
and the distribution of the projections $w$ is learnt. This can also be viewed
as an infinite-width one-hidden layer neural network, optimising the first
layer’s weights through gradient descent and explicitly adjusting the
non-linearity and weights of the second layer. We introduce an efficient
computation method for the estimator, called Brownian Kernel Neural Network
(BKerNN), using particles to approximate the expectation. The optimisation is
principled due to the positive homogeneity of the Brownian kernel. Using
Rademacher complexity, we show that BKerNN’s expected risk converges to the
minimal risk with explicit high-probability rates of $O( \min((d/n)^{1/2},
n^{-1/6}))$ (up to logarithmic factors). Numerical experiments confirm our
optimisation intuitions, and BKerNN outperforms kernel ridge regression, and
favourably compares to a one-hidden layer neural network with ReLU activations
in various settings and real data sets.
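The Brownian kernel named in this abstract, min(|a|, |b|) when a and b share a sign and 0 otherwise, and its particle approximation of the expectation over projections can be sketched as follows. This is a minimal illustration and not the BKerNN implementation: the particles are fixed by hand rather than learnt, and the function names are hypothetical.

```python
def brownian_kernel(a, b):
    """min(|a|, |b|) if a and b have the same sign, else 0."""
    return min(abs(a), abs(b)) if a * b > 0 else 0.0


def particle_kernel(x, x_prime, particles):
    """Average the Brownian kernel over 1-D projections w.x and w.x'."""
    total = 0.0
    for w in particles:
        a = sum(wi * xi for wi, xi in zip(w, x))
        b = sum(wi * xi for wi, xi in zip(w, x_prime))
        total += brownian_kernel(a, b)
    return total / len(particles)


# Two hand-picked projection directions (assumption; BKerNN learns these).
particles = [(1.0, 0.0), (0.0, 1.0)]
value = particle_kernel((2.0, -1.0), (3.0, 4.0), particles)
```

With these inputs the first projection contributes min(2, 3) = 2 and the second contributes 0 because the projected values have opposite signs, giving an average of 1.0.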
[LINK]
http://arxiv.org/abs/2407.17280v1
[DATE]
2024-07-24 21:46:50+08:00
[CATEGORIES]
cs.LG
$Φ$-DVAE: Physics-Informed Dynamical Variational Autoencoders for Unstructured Data Assimilation
[AUTHORS]
Alex Glyn-Davies, Connor Duffin, Ö. Deniz Akyildiz, Mark Girolami
[ABSTRACT]
Incorporating unstructured data into physical models is a challenging problem
that is emerging in data assimilation. Traditional approaches focus on
well-defined observation operators whose functional forms are typically assumed
to be known. This prevents these methods from achieving a consistent model-data
synthesis in configurations where the mapping from data-space to model-space is
unknown. To address these shortcomings, in this paper we develop a
physics-informed dynamical variational autoencoder ($\Phi$-DVAE) to embed
diverse data streams into time-evolving physical systems described by
differential equations. Our approach combines a standard, possibly nonlinear,
filter for the latent state-space model and a VAE, to assimilate the
unstructured data into the latent dynamical system. Unstructured data, in our
example systems, comes in the form of video data and velocity field
measurements; however, the methodology is suitably generic to allow for
arbitrary unknown observation operators. A variational Bayesian framework is
used for the joint estimation of the encoding, latent states, and unknown
system parameters. To demonstrate the method, we provide case studies with the
Lorenz-63 ordinary differential equation, and the advection and Korteweg-de
Vries partial differential equations. Our results, with synthetic data, show
that $\Phi$-DVAE provides a data efficient dynamics encoding methodology which
is competitive with standard approaches. Unknown parameters are recovered with
uncertainty quantification, and unseen data are accurately predicted.
[COMMENTS]
29 pages, 9 figures, updated version
[LINK]
http://arxiv.org/abs/2209.15609v3
[DATE]
2024-07-24 21:31:07+08:00
[CATEGORIES]
cs.LG
Physics-informed Information Field Theory for Modeling Physical Systems with Uncertainty Quantification
[AUTHORS]
Alex Alberts, Ilias Bilionis
[ABSTRACT]
Data-driven approaches coupled with physical knowledge are powerful
techniques to model systems. The goal of such models is to efficiently solve
for the underlying field by combining measurements with known physical laws. As
many systems contain unknown elements, such as missing parameters, noisy data,
or incomplete physical laws, this is widely approached as an uncertainty
quantification problem. The common techniques to handle all the variables
typically depend on the numerical scheme used to approximate the posterior, and
it is desirable to have a method which is independent of any such
discretization. Information field theory (IFT) provides the tools necessary to
perform statistics over fields that are not necessarily Gaussian. We extend IFT
to physics-informed IFT (PIFT) by encoding the functional priors with
information about the physical laws which describe the field. The posteriors
derived from this PIFT remain independent of any numerical scheme and can
capture multiple modes, allowing for the solution of problems which are
ill-posed. We demonstrate our approach through an analytical example involving
the Klein-Gordon equation. We then develop a variant of stochastic gradient
Langevin dynamics to draw samples from the joint posterior over the field and
model parameters. We apply our method to numerical examples with various
degrees of model-form error and to inverse problems involving nonlinear
differential equations. As an addendum, the method is equipped with a metric
which allows the posterior to automatically quantify model-form uncertainty.
Because of this, our numerical experiments show that the method remains robust
to even an incorrect representation of the physics given sufficient data. We
numerically demonstrate that the method correctly identifies when the physics
cannot be trusted, in which case it automatically treats learning the field as
a regression problem.
[COMMENTS]
32 pages, 8 figures. Published in Journal of Computational Physics
[LINK]
http://arxiv.org/abs/2301.07609v5
[DATE]
2024-07-24 21:23:17+08:00
[CATEGORIES]
cs.LG
When Does Bottom-up Beat Top-down in Hierarchical Community Detection?
[AUTHORS]
Maximilien Dreveton, Daichi Kuroda, Matthias Grossglauser, Patrick Thiran
[ABSTRACT]
Hierarchical clustering of networks consists in finding a tree of
communities, such that lower levels of the hierarchy reveal finer-grained
community structures. There are two main classes of algorithms tackling this
problem. Divisive ($\textit{top-down}$) algorithms recursively partition the
nodes into two communities, until a stopping rule indicates that no further
split is needed. In contrast, agglomerative ($\textit{bottom-up}$) algorithms
first identify the smallest community structure and then repeatedly merge the
communities using a $\textit{linkage}$ method. In this article, we establish
theoretical guarantees for the recovery of the hierarchical tree and community
structure of a Hierarchical Stochastic Block Model by a bottom-up algorithm. We
also establish that this bottom-up algorithm attains the information-theoretic
threshold for exact recovery at intermediate levels of the hierarchy. Notably,
these recovery conditions are less restrictive compared to those existing for
top-down algorithms. This shows that bottom-up algorithms extend the feasible
region for achieving exact recovery at intermediate levels. Numerical
experiments on both synthetic and real data sets confirm the superiority of
bottom-up algorithms over top-down algorithms. We also observe that top-down
algorithms can produce dendrograms with inversions. These findings contribute
to a better understanding of hierarchical clustering techniques and their
applications in network analysis.
[LINK]
http://arxiv.org/abs/2306.00833v2
[DATE]
2024-07-24 21:13:20+08:00
[CATEGORIES]
cs.LG
Channel-Aware Low-Rank Adaptation in Time Series Forecasting
[AUTHORS]
Tong Nie, Yuewen Mei, Guoyang Qin, Jian Sun, Wei Ma
[ABSTRACT]
The balance between model capacity and generalization has been a key focus of
recent discussions in long-term time series forecasting. Two representative
channel strategies are closely associated with model expressivity and
robustness, including channel independence (CI) and channel dependence (CD).
The former adopts individual channel treatment and has been shown to be more
robust to distribution shifts, but lacks sufficient capacity to model
meaningful channel interactions. The latter is more expressive for representing
complex cross-channel dependencies, but is prone to overfitting. To balance the
two strategies, we present a channel-aware low-rank adaptation method to
condition CD models on identity-aware individual components. As a plug-in
solution, it is adaptable for a wide range of backbone architectures. Extensive
experiments show that it can consistently and significantly improve the
performance of both CI and CD models with demonstrated efficiency and
flexibility. The code is available at https://github.com/tongnie/C-LoRA.
[COMMENTS]
Accepted by CIKM 2024, short research paper track
[LINK]
http://arxiv.org/abs/2407.17246v1
[DATE]
2024-07-24 21:05:17+08:00
[CATEGORIES]
cs.LG
Adaptive Splitting of Reusable Temporal Monitors for Rare Traffic Violations
[AUTHORS]
Craig Innes, Subramanian Ramamoorthy
[ABSTRACT]
Autonomous Vehicles (AVs) are often tested in simulation to estimate the
probability they will violate safety specifications. Two common issues arise
when using existing techniques to produce this estimation: If violations occur
rarely, simple Monte-Carlo sampling techniques can fail to produce efficient
estimates; if simulation horizons are too long, importance sampling techniques
(which learn proposal distributions from past simulations) can fail to
converge. This paper addresses both issues by interleaving rare-event sampling
techniques with online specification monitoring algorithms. We use adaptive
multi-level splitting to decompose simulations into partial trajectories, then
calculate the distance of those partial trajectories to failure by leveraging
robustness metrics from Signal Temporal Logic (STL). By caching those partial
robustness metric values, we can efficiently re-use computations across
multiple sampling stages. Our experiments on an interstate lane-change scenario
show our method is viable for testing simulated AV-pipelines, efficiently
estimating failure probabilities for STL specifications based on real traffic
rules. We produce better estimates than Monte-Carlo and importance sampling in
fewer simulations.
[LINK]
http://arxiv.org/abs/2405.15771v2
[DATE]
2024-07-24 20:56:41+08:00
[CATEGORIES]
cs.LG
FreeCG: Free the Design Space of Clebsch-Gordan Transform for Machine Learning Force Fields
[AUTHORS]
Shihao Shao, Haoran Geng, Zun Wang, Qinghua Cui
[ABSTRACT]
The Clebsch-Gordan Transform (CG transform) effectively encodes many-body
interactions. Many studies have proven its accuracy in depicting atomic
environments, although this comes with high computational needs. The
computational burden of this challenge is hard to reduce due to the need for
permutation equivariance, which limits the design space of the CG transform
layer. We show that implementing the CG transform layer on
permutation-invariant inputs allows complete freedom in the design of this
layer without affecting symmetry. Developing further on this premise, our idea
is to create a CG transform layer that operates on permutation-invariant
abstract edges generated from real edge information. We bring in group CG
transform with sparse path, abstract edges shuffling, and attention enhancer to
form a powerful and efficient CG transform layer. Our method, known as FreeCG,
achieves state-of-the-art (SoTA) results in force prediction for MD17, rMD17,
MD22, and property prediction in QM9 datasets with notable enhancement. The
extensibility to other models is also examined. Molecular dynamics simulations
are carried out on MD17 and other periodic systems, including water and LiPS,
showcasing the capacity for real-world applications of FreeCG. It introduces a
novel paradigm for carrying out efficient and expressive CG transform in future
geometric neural network designs.
[COMMENTS]
29 pages, 8 tables, 10 figures
[LINK]
http://arxiv.org/abs/2407.02263v3
[DATE]
2024-07-24 20:36:41+08:00
[CATEGORIES]
cs.LG
A Hybrid Federated Kernel Regularized Least Squares Algorithm
[AUTHORS]
Celeste Damiani, Yulia Rodina, Sergio Decherchi
[ABSTRACT]
Federated learning is becoming an increasingly viable and accepted strategy
for building machine learning models in critical privacy-preserving scenarios
such as clinical settings. Often, the data involved is not limited to clinical
data but also includes additional omics features (e.g. proteomics).
Consequently, data is distributed not only across hospitals but also across
omics centers, which are labs capable of generating such additional features
from biosamples. This scenario leads to a hybrid setting where data is
scattered both in terms of samples and features. In this hybrid setting, we
present an efficient reformulation of the Kernel Regularized Least Squares
algorithm, introduce two variants and validate them using well-established
datasets. Lastly, we discuss security measures to defend against possible
attacks.
[LINK]
http://arxiv.org/abs/2407.17228v1
[DATE]
2024-07-24 20:32:08+08:00
[CATEGORIES]
cs.LG
Sublinear Regret for An Actor-Critic Algorithm in Continuous-Time Linear-Quadratic Reinforcement Learning
[AUTHORS]
Yilie Huang, Yanwei Jia, Xun Yu Zhou
[ABSTRACT]
We study reinforcement learning (RL) for a class of continuous-time
linear-quadratic (LQ) control problems for diffusions where volatility of the
state processes depends on both state and control variables. We apply a
model-free approach that relies neither on knowledge of model parameters nor on
their estimations, and devise an actor-critic algorithm to learn the optimal
policy parameter directly. Our main contributions include the introduction of a
novel exploration schedule and a regret analysis of the proposed algorithm. We
provide the convergence rate of the policy parameter to the optimal one, and
prove that the algorithm achieves a regret bound of $O(N^{\frac{3}{4}})$ up to
a logarithmic factor. We conduct a simulation study to validate the theoretical
results and demonstrate the effectiveness and reliability of the proposed
algorithm. We also perform numerical comparisons between our method and those
of the recent model-based stochastic LQ RL studies adapted to the state- and
control-dependent volatility setting, demonstrating a better performance of the
former in terms of regret bounds.
[COMMENTS]
42 pages, 4 figures
[LINK]
http://arxiv.org/abs/2407.17226v1
[DATE]
2024-07-24 20:26:21+08:00
[CATEGORIES]
cs.LG
An Adaptive Second-order Method for a Class of Nonconvex Nonsmooth Composite Optimization
[AUTHORS]
Hao Wang, Xiangyu Yang, Yichen Zhu
[ABSTRACT]
This paper explores a specific type of nonconvex sparsity-promoting
regularization problems, namely those involving $\ell_p$-norm regularization,
in conjunction with a twice continuously differentiable loss function. We
propose a novel second-order algorithm designed to effectively address this
class of challenging nonconvex and nonsmooth problems, showcasing several
innovative features: (i) The use of an alternating strategy to solve a
reweighted $\ell_1$ regularized subproblem and the subspace approximate Newton
step. (ii) The reweighted $\ell_1$ regularized subproblem relies on a convex
approximation to the nonconvex regularization term, enabling a closed-form
solution characterized by the soft-thresholding operator. This feature allows
our method to be applied to various nonconvex regularization problems. (iii)
Our algorithm ensures that the iterates maintain their sign values and that
nonzero components are kept away from 0 for a sufficient number of iterations,
eventually transitioning to a perturbed Newton method. (iv) We provide
theoretical guarantees of global convergence, local superlinear convergence in
the presence of the Kurdyka-Łojasiewicz (KL) property, and local quadratic
convergence when employing the exact Newton step in our algorithm. We also
showcase the effectiveness of our approach through experiments on a diverse set
of model prediction problems.
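As a rough illustration of the closed-form step mentioned in (ii), the sketch below applies a soft-thresholding update to a reweighted $\ell_1$ surrogate of $\lambda \sum_i |x_i|^p$. The weight formula, the step size, and all function names are assumptions chosen for illustration; they are not taken from the paper, and the subspace Newton step is not shown.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def reweighted_l1_step(x, grad, lam, p=0.5, eps=1e-3, step=1.0):
    """One proximal-gradient step on a reweighted l1 surrogate.

    Assumed weights: w_i = p * (|x_i| + eps)^(p - 1), a common
    linearization of (|x_i| + eps)^p around the current iterate.
    """
    out = []
    for xi, gi in zip(x, grad):
        w = p * (abs(xi) + eps) ** (p - 1)
        # Gradient step on the smooth loss, then closed-form shrinkage.
        out.append(soft_threshold(xi - step * gi, step * lam * w))
    return out
```

Components that are already zero receive a very large weight under this assumed formula, so they tend to stay at zero, while large components are shrunk only slightly.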
[LINK]
http://arxiv.org/abs/2407.17216v1
[DATE]
2024-07-24 20:15:59+08:00
[CATEGORIES]
cs.LG
Spectrum-Informed Multistage Neural Networks: Multiscale Function Approximators of Machine Precision
[AUTHORS]
Jakin Ng, Yongji Wang, Ching-Yao Lai
[ABSTRACT]
Deep learning frameworks have become powerful tools for approaching
scientific problems such as turbulent flow, which has wide-ranging
applications. In practice, however, existing scientific machine learning
approaches have difficulty fitting complex, multi-scale dynamical systems to
very high precision, as required in scientific contexts. We propose using the
novel multistage neural network approach with a spectrum-informed
initialization to learn the residue from the previous stage, utilizing the
spectral biases associated with neural networks to capture high frequency
features in the residue, and successfully tackle the spectral bias of neural
networks. This approach allows the neural network to fit target functions to
double floating-point machine precision $O(10^{-16})$.
[COMMENTS]
8 pages, 3 figures, ICML 2024 workshop (AI for Science: Scaling in AI
for Scientific Discovery)
[LINK]
http://arxiv.org/abs/2407.17213v1
[DATE]
2024-07-24 20:11:09+08:00
[CATEGORIES]
cs.LG
Take a Step and Reconsider: Sequence Decoding for Self-Improved Neural Combinatorial Optimization
[AUTHORS]
Jonathan Pirnay, Dominik G. Grimm
[ABSTRACT]
The constructive approach within Neural Combinatorial Optimization (NCO)
treats a combinatorial optimization problem as a finite Markov decision
process, where solutions are built incrementally through a sequence of
decisions guided by a neural policy network. To train the policy, recent
research is shifting toward a ‘self-improved’ learning methodology that
addresses the limitations of reinforcement learning and supervised approaches.
Here, the policy is iteratively trained in a supervised manner, with solutions
derived from the current policy serving as pseudo-labels. The way these
solutions are obtained from the policy determines the quality of the
pseudo-labels. In this paper, we present a simple and problem-independent
sequence decoding method for self-improved learning based on sampling sequences
without replacement. We incrementally follow the best solution found and repeat
the sampling process from intermediate partial solutions. By modifying the
policy to ignore previously sampled sequences, we force it to consider only
unseen alternatives, thereby increasing solution diversity. Experimental
results for the Traveling Salesman and Capacitated Vehicle Routing Problem
demonstrate its strong performance. Furthermore, our method outperforms
previous NCO approaches on the Job Shop Scheduling Problem.
[COMMENTS]
Accepted at ECAI-2024
[LINK]
http://arxiv.org/abs/2407.17206v1
[DATE]
2024-07-24 20:06:09+08:00
[CATEGORIES]
cs.LG
Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems
[AUTHORS]
Pierre-Cyril Aubin-Frankowski, Yohann De Castro, Axel Parmentier, Alessandro Rudi
[ABSTRACT]
A recent stream of structured learning approaches has improved the practical
state of the art for a range of combinatorial optimization problems with
complex objectives encountered in operations research. Such approaches train
policies that chain a statistical model with a surrogate combinatorial
optimization oracle to map any instance of the problem to a feasible solution.
The key idea is to exploit the statistical distribution over instances instead
of dealing with instances separately. However, learning such policies by risk
minimization is challenging because the empirical risk is piecewise constant in
the parameters, and few theoretical guarantees have been provided so far. In
this article, we investigate methods that smooth the risk by perturbing the
policy, which eases optimization and improves generalization. Our main
contribution is a generalization bound that controls the perturbation bias, the
statistical learning error, and the optimization error. Our analysis relies on
the introduction of a uniform weak property, which captures and quantifies the
interplay of the statistical model and the surrogate combinatorial optimization
oracle. This property holds under mild assumptions on the statistical model,
the surrogate optimization, and the instance data distribution. We illustrate
the result on a range of applications such as stochastic vehicle scheduling. In
particular, such policies are relevant for contextual stochastic optimization
and our results cover this case.
[COMMENTS]
10 pages main document, 3 pages supplement
[LINK]
http://arxiv.org/abs/2407.17200v1
[DATE]
2024-07-24 20:00:30+08:00
[CATEGORIES]
cs.LG
Surrogate-guided optimization in quantum networks
[AUTHORS]
Luise Prielinger, Álvaro G. Iñesta, Gayane Vardoyan
[ABSTRACT]
We propose an optimization algorithm to improve the design and performance of
quantum communication networks. When physical architectures become too complex
for analytical methods, numerical simulation becomes essential to study quantum
network behavior. Although highly informative, these simulations involve
complex numerical functions without known analytical forms, making traditional
optimization techniques that assume continuity, differentiability, or convexity
inapplicable. Additionally, quantum network simulations are computationally
demanding, rendering global approaches like Simulated Annealing or genetic
algorithms, which require extensive function evaluations, impractical. We introduce a
more efficient optimization workflow using machine learning models, which serve
as surrogates for a given objective function. We demonstrate the effectiveness
of our approach by applying it to three well-known optimization problems in
quantum networking: quantum memory allocation for multiple network nodes,
tuning an experimental parameter in all physical links of a quantum
entanglement switch, and finding efficient protocol settings within a large
asymmetric quantum network. The solutions found by our algorithm consistently
outperform those obtained with our baseline approaches – Simulated Annealing
and Bayesian optimization – in the allotted time limit by up to 18% and 20%,
respectively. Our framework thus allows for more comprehensive quantum network
studies, integrating surrogate-assisted optimization with existing quantum
network simulators.
[COMMENTS]
20 pages (including supplementary notes), 12 figures
[LINK]
http://arxiv.org/abs/2407.17195v1
[DATE]
2024-07-24 19:55:18+08:00
[CATEGORIES]
cs.LG
Discovering Dynamic Symbolic Policies with Genetic Programming
[AUTHORS]
Sigur de Vries, Sander Keemink, Marcel van Gerven
[ABSTRACT]
Artificial intelligence techniques are increasingly being applied to solve
control problems, but often rely on black-box methods without transparent
output generation. To improve the interpretability and transparency in control
systems, models can be defined as white-box symbolic policies described by
mathematical expressions. While current approaches to learn symbolic policies
focus on static policies that directly map observations to control signals,
these may fail in partially observable and volatile environments. We instead
consider dynamic symbolic policies with memory, optimised with genetic
programming. The resulting policies are robust and consist of easy-to-interpret
coupled differential equations. Our results show that dynamic
symbolic policies compare with black-box policies on a variety of control
tasks. Furthermore, the benefit of the memory in dynamic policies is
demonstrated on experiments where static policies fall short. Overall, we
present a method for evolving high-performing symbolic policies that offer
interpretability and transparency, which black-box models lack.
[COMMENTS]
19 pages including references and appendix, 5 figures, 1 algorithm, 5
tables
[LINK]
http://arxiv.org/abs/2406.02765v3
[DATE]
2024-07-24 19:35:26+08:00
[CATEGORIES]
cs.LG
Efficient Convex Optimization Requires Superlinear Memory
[AUTHORS]
Annie Marsden, Vatsal Sharan, Aaron Sidford, Gregory Valiant
[ABSTRACT]
We show that any memory-constrained, first-order algorithm which minimizes
$d$-dimensional, $1$-Lipschitz convex functions over the unit ball to
$1/\mathrm{poly}(d)$ accuracy using at most $d^{1.25 - \delta}$ bits of memory
must make at least $\tilde{\Omega}(d^{1 + (4/3)\delta})$ first-order queries
(for any constant $\delta \in [0, 1/4]$). Consequently, the performance of such
memory-constrained algorithms is a polynomial factor worse than the optimal
$\tilde{O}(d)$ query bound for this problem obtained by cutting plane methods
that use $\tilde{O}(d^2)$ memory. This resolves a COLT 2019 open problem of
Woodworth and Srebro.
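To make the trade-off concrete, the two endpoints of the stated range for $\delta$ can be substituted into the bounds above (a worked reading of the abstract's formula, not an additional result):

```latex
\begin{align*}
\delta = 0:\quad & \text{memory } d^{1.25},\quad \text{queries } \tilde{\Omega}(d), \\
\delta = \tfrac{1}{4}:\quad & \text{memory } d^{1.25 - 0.25} = d,\quad
  \text{queries } \tilde{\Omega}\big(d^{1 + \frac{4}{3}\cdot\frac{1}{4}}\big)
  = \tilde{\Omega}\big(d^{4/3}\big).
\end{align*}
```

That is, an algorithm restricted to roughly linear memory needs on the order of $d^{4/3}$ queries, compared with $\tilde{O}(d)$ queries for cutting plane methods that use $\tilde{O}(d^2)$ memory.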
[COMMENTS]
33 pages, 1 figure
[LINK]
http://arxiv.org/abs/2203.15260v2
[DATE]
2024-07-24 19:21:47+08:00
[CATEGORIES]
cs.LG
On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis
[AUTHORS]
Eklavya Sarkar, Mathew Magimai.-Doss
[ABSTRACT]
Marmoset monkeys encode vital information in their calls and serve as a
surrogate model for neuro-biologists to understand the evolutionary origins of
human vocal communication. While these calls were traditionally analyzed with
signal processing-based features, recent approaches have utilized self-supervised models pre-trained on
human speech for feature extraction, capitalizing on their ability to learn a
signal’s intrinsic structure independently of its acoustic domain. However, the
utility of such foundation models remains unclear for marmoset call analysis in
terms of multi-class classification, bandwidth, and pre-training domain. This
study assesses feature representations derived from speech and general audio
domains, across pre-training bandwidths of 4, 8, and 16 kHz for marmoset
call-type and caller classification tasks. Results show that models with higher
bandwidth improve performance, and pre-training on speech or general audio
yields comparable results, improving over a spectral baseline.
[COMMENTS]
Accepted at Interspeech 2024 satellite event (VIHAR 2024)
[LINK]
http://arxiv.org/abs/2407.16417v2
[DATE]
2024-07-24 19:19:22+08:00
[CATEGORIES]
cs.LG
Explainable Artificial Intelligence Techniques for Irregular Temporal Classification of Multidrug Resistance Acquisition in Intensive Care Unit Patients
[AUTHORS]
Óscar Escudero-Arnanz, Cristina Soguero-Ruiz, Joaquín Álvarez-Rodríguez, Antonio G. Marques
[ABSTRACT]
Antimicrobial Resistance represents a significant challenge in the Intensive
Care Unit (ICU), where patients are at heightened risk of Multidrug-Resistant
(MDR) infections, caused by pathogens resistant to multiple antimicrobial agents. This
study introduces a novel methodology that integrates Gated Recurrent Units
(GRUs) with advanced intrinsic and post-hoc interpretability techniques for
detecting the onset of MDR in patients across time. Within interpretability
methods, we propose Explainable Artificial Intelligence (XAI) approaches to
handle irregular Multivariate Time Series (MTS), introducing Irregular Time
Shapley Additive Explanations (IT-SHAP), a modification of Shapley Additive
Explanations designed for irregular MTS with Recurrent Neural Networks focused
on temporal outputs. Our methodology aims to identify specific risk factors
associated with MDR in ICU patients. A GRU with Hadamard attention demonstrated
high initial specificity and increasing sensitivity over time, correlating with
increased nosocomial infection risks during prolonged ICU stays. XAI analysis,
enhanced by Hadamard attention and IT-SHAP, identified critical factors such as
previous non-resistant cultures, specific antibiotic usage patterns, and
hospital environment dynamics. These insights suggest that early detection of
at-risk patients can inform interventions such as preventive isolation and
customized treatments, significantly improving clinical outcomes. The proposed
GRU model for temporal classification achieved an average Receiver Operating
Characteristic Area Under the Curve of 78.27 ± 1.26 over time, indicating
strong predictive performance. In summary, this study highlights the clinical
utility of our methodology, which combines predictive accuracy with
interpretability, thereby facilitating more effective healthcare interventions
by professionals.
[LINK]
http://arxiv.org/abs/2407.17165v1
[DATE]
2024-07-24 19:12:01+08:00
[CATEGORIES]
cs.LG
EXACT: How to Train Your Accuracy
[AUTHORS]
Ivan Karpukhin, Stanislav Dereka, Sergey Kolesnikov
[ABSTRACT]
Classification tasks are usually evaluated in terms of accuracy. However,
accuracy is discontinuous and cannot be directly optimized using gradient
ascent. Popular methods minimize cross-entropy, hinge loss, or other surrogate
losses, which can lead to suboptimal results. In this paper, we propose a new
optimization framework by introducing stochasticity to a model’s output and
optimizing expected accuracy, i.e. accuracy of the stochastic model. Extensive
experiments on linear models and deep image classification show that the
proposed optimization method is a powerful alternative to widely used
classification losses.
[COMMENTS]
Pattern Recognition Letters (2024)
[LINK]
http://arxiv.org/abs/2205.09615v5
[DATE]
2024-07-24 18:49:23+08:00
[CATEGORIES]
cs.LG
Automated transport separation using the neural shifted proper orthogonal decomposition
[AUTHORS]
Beata Zorawski, Shubhaditya Burela, Philipp Krah, Arthur Marmin, Kai Schneider
[ABSTRACT]
This paper presents a neural network-based methodology for the decomposition
of transport-dominated fields using the shifted proper orthogonal decomposition
(sPOD). Classical sPOD methods typically require a priori knowledge of the
transport operators to determine the co-moving fields. However, in many
real-life problems, such knowledge is difficult or even impossible to obtain,
limiting the applicability and benefits of the sPOD. To address this issue, our
approach estimates both the transport and co-moving fields simultaneously using
neural networks. This is achieved by training two sub-networks dedicated to
learning the transports and the co-moving fields, respectively. Applications to
synthetic data and a wildland fire model illustrate the capabilities and
efficiency of this neural sPOD approach, demonstrating its ability to separate
the different fields effectively.
[COMMENTS]
Proceedings not peer-reviewed yet. Code available:
https://github.com/MOR-transport/automated_NsPOD
[LINK]
http://arxiv.org/abs/2407.17539v1
[DATE]
2024-07-24 18:47:50+08:00
[CATEGORIES]
cs.LG
Towards Robust Continual Learning with Bayesian Adaptive Moment Regularization
[AUTHORS]
Jack Foster, Alexandra Brintrup
[ABSTRACT]
The pursuit of long-term autonomy mandates that machine learning models must
continuously adapt to their changing environments and learn to solve new tasks.
Continual learning seeks to overcome the challenge of catastrophic forgetting,
where learning to solve new tasks causes a model to forget previously learnt
information. Prior-based continual learning methods are appealing as they are
computationally efficient and do not require auxiliary models or data storage.
However, prior-based approaches typically fail on important benchmarks and are
thus limited in their potential applications compared to their memory-based
counterparts. We introduce Bayesian adaptive moment regularization (BAdam), a
novel prior-based method that better constrains parameter growth, reducing
catastrophic forgetting. Our method boasts a range of desirable properties such
as being lightweight and task label-free, converging quickly, and offering
calibrated uncertainty that is important for safe real-world deployment.
Results show that BAdam achieves state-of-the-art performance for prior-based
methods on challenging single-headed class-incremental experiments such as
Split MNIST and Split FashionMNIST, and does so without relying on task labels
or discrete task boundaries.
[LINK]
http://arxiv.org/abs/2309.08546v3
[DATE]
2024-07-24 18:16:59+08:00
[CATEGORIES]
cs.LG
Logistic regression models for patient-level prediction based on massive observational data: Do we need all data?
[AUTHORS]
Luis H. John, Jan A. Kors, Jenna M. Reps, Patrick B. Ryan, Peter R. Rijnbeek
[ABSTRACT]
Objective: Provide guidance on sample size considerations for developing
predictive models by empirically establishing the adequate sample size, which
balances the competing objectives of improving model performance and reducing
model complexity as well as computational requirements.
Materials and Methods: We empirically assess the effect of sample size on
prediction performance and model complexity by generating learning curves for
81 prediction problems (23 outcomes predicted in a depression cohort, 58
outcomes predicted in a hypertension cohort) in three large observational
health databases, requiring training of 17,248 prediction models. The adequate
sample size was defined as the sample size for which the performance of a model
equalled the maximum model performance minus a small threshold value.
Results: The adequate sample size achieves a median reduction of the number
of observations of 9.5%, 37.3%, 58.5%, and 78.5% for the thresholds of 0.001,
0.005, 0.01, and 0.02, respectively. The median reduction of the number of
predictors in the models was 8.6%, 32.2%, 48.2%, and 68.3% for the thresholds
of 0.001, 0.005, 0.01, and 0.02, respectively.
Discussion: Based on our results a conservative, yet significant, reduction
in sample size and model complexity can be estimated for future prediction
work. However, if a researcher is willing to generate a learning curve, a much
larger reduction of the model complexity may be possible as suggested by a
large outcome-dependent variability.
Conclusion: Our results suggest that in most cases only a fraction of the
available data was sufficient to produce a model close to the performance of
one developed on the full data set, but with a substantially reduced model
complexity.
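The definition of the adequate sample size given in Materials and Methods can be sketched as a small helper. This is a minimal illustration, assuming the learning curve is available as paired lists of sample sizes and performance values, and reading "equalled" as "first reaches at least"; the function name and the example numbers are hypothetical.

```python
def adequate_sample_size(sizes, performance, threshold):
    """Smallest sample size whose performance reaches max(performance) - threshold."""
    if len(sizes) != len(performance) or not sizes:
        raise ValueError("sizes and performance must be non-empty and equal length")
    target = max(performance) - threshold
    # Keep every sample size that already meets the target, then take the smallest.
    candidates = [n for n, perf in zip(sizes, performance) if perf >= target]
    return min(candidates)


# Hypothetical learning curve (illustrative values only).
sizes = [1000, 5000, 10000, 50000]
auc = [0.70, 0.74, 0.755, 0.76]
print(adequate_sample_size(sizes, auc, threshold=0.01))
```

A larger threshold tolerates a bigger performance loss and therefore selects a smaller sample size, which matches the trend reported in the Results.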
[LINK]
http://arxiv.org/abs/2008.07361v2
[DATE]
2024-07-24 17:56:06+08:00
[CATEGORIES]
cs.LG
SAE: Single Architecture Ensemble Neural Networks
[AUTHORS]
Martin Ferianc, Hongxiang Fan, Miguel Rodrigues
[ABSTRACT]
Ensembles of separate neural networks (NNs) have shown superior accuracy and
confidence calibration over single NNs across tasks. To improve the hardware
efficiency of ensembles of separate NNs, recent methods create ensembles within
a single network by adding early exits or considering multi-input multi-output
approaches. However, it is unclear which of these methods is the most effective
for a given task, needing a manual and separate search through each method. Our
novel Single Architecture Ensemble (SAE) framework enables an automatic and
joint search through the early exit and multi-input multi-output configurations
and their previously unobserved in-between combinations. SAE consists of two
parts: a scalable search space that generalises the previous methods and their
in-between configurations, and an optimisation objective that allows learning
the optimal configuration for a given task. Our image classification and
regression experiments show that with SAE we can automatically find diverse
configurations that fit the task, achieving competitive accuracy or confidence
calibration to baselines while reducing the compute operations or parameter
count by up to $1.5{\sim}3.7\times$.
[COMMENTS]
Accepted at BMVC’24
[LINK]
http://arxiv.org/abs/2402.06580v2
[DATE]
2024-07-24 17:38:49+08:00
[CATEGORIES]
cs.LG
Parameter-Efficient Fine-Tuning for Continual Learning: A Neural Tangent Kernel Perspective
[AUTHORS]
Jingren Liu, Zhong Ji, YunLong Yu, Jiale Cao, Yanwei Pang, Jungong Han, Xuelong Li
[ABSTRACT]
Parameter-efficient fine-tuning for continual learning (PEFT-CL) has shown
promise in adapting pre-trained models to sequential tasks while mitigating
the catastrophic forgetting problem. However, understanding the mechanisms that
dictate continual performance in this paradigm remains elusive. To tackle this
complexity, we undertake a rigorous analysis of PEFT-CL dynamics to derive
relevant metrics for continual scenarios using Neural Tangent Kernel (NTK)
theory. With the aid of NTK as a mathematical analysis tool, we recast the
challenge of test-time forgetting into the quantifiable generalization gaps
during training, identifying three key factors that influence these gaps and
the performance of PEFT-CL: training sample size, task-level feature
orthogonality, and regularization. To address these challenges, we introduce
NTK-CL, a novel framework that eliminates task-specific parameter storage while
adaptively generating task-relevant features. Aligning with theoretical
guidance, NTK-CL triples the feature representation of each sample,
theoretically and empirically reducing the magnitude of both task-interplay and
task-specific generalization gaps. Grounded in NTK analysis, our approach
imposes an adaptive exponential moving average mechanism and constraints on
task-level feature orthogonality, maintaining intra-task NTK forms while
attenuating inter-task NTK forms. Ultimately, by fine-tuning optimizable
parameters with appropriate regularization, NTK-CL achieves state-of-the-art
performance on established PEFT-CL benchmarks. This work provides a theoretical
foundation for understanding and improving PEFT-CL models, offering insights
into the interplay between feature representation, task orthogonality, and
generalization, contributing to the development of more efficient continual
learning systems.
[LINK]
http://arxiv.org/abs/2407.17120v1
[DATE]
2024-07-24 17:30:04+08:00
[CATEGORIES]
cs.LG
On the Federated Learning Framework for Cooperative Perception
[AUTHORS]
Zhenrong Zhang, Jianan Liu, Xi Zhou, Tao Huang, Qing-Long Han, Jingxin Liu, Hongbin Liu
[ABSTRACT]
Cooperative perception (CP) is essential to enhance the efficiency and safety of
future transportation systems, requiring extensive data sharing among vehicles
on the road, which raises significant privacy concerns. Federated learning
offers a promising solution by enabling data privacy-preserving collaborative
enhancements in perception, decision-making, and planning among connected and
autonomous vehicles (CAVs). However, federated learning is impeded by
significant challenges arising from data heterogeneity across diverse clients,
potentially diminishing model accuracy and prolonging convergence periods. This
study introduces a specialized federated learning framework for CP, termed the
federated dynamic weighted aggregation (FedDWA) algorithm, facilitated by
a dynamic adjusting loss (DALoss) function. This framework employs dynamic client
weighting to direct model convergence and integrates a novel loss function that
utilizes Kullback-Leibler divergence (KLD) to counteract the detrimental
effects of non-independent and identically distributed (non-IID) and
unbalanced data. Utilizing the BEV transformer as the primary model, our
rigorous testing on the OpenV2V dataset, augmented with FedBEVT data,
demonstrates significant improvements in the average intersection over union
(IoU). These results highlight the substantial potential of our federated
learning framework to address data heterogeneity challenges in CP, thereby
enhancing the accuracy of environmental perception models and facilitating more
robust and efficient collaborative learning solutions in the transportation
sector.
[LINK]
http://arxiv.org/abs/2404.17147v3
[DATE]
2024-07-24 17:28:11+08:00
[CATEGORIES]
cs.LG
EverAdapt: Continuous Adaptation for Dynamic Machine Fault Diagnosis Environments
[AUTHORS]
Edward, Mohamed Ragab, Yuecong Xu, Min Wu, Yuecong Xu, Zhenghua Chen, Abdulla Alseiari, Xiaoli Li
[ABSTRACT]
Unsupervised Domain Adaptation (UDA) has emerged as a key solution in
data-driven fault diagnosis, addressing domain shift where models underperform
in changing environments. However, in continually changing
environments, UDA tends to underperform on previously seen domains when
adapting to new ones - a problem known as catastrophic forgetting. To address
this limitation, we introduce the EverAdapt framework, specifically designed
for continuous model adaptation in dynamic environments. Central to EverAdapt
is a novel Continual Batch Normalization (CBN), which leverages source domain
statistics as a reference point to standardize feature representations across
domains. EverAdapt not only retains statistical information from previous
domains but also adapts effectively to new scenarios. Complementing CBN, we
design a class-conditional domain alignment module for effective integration of
target domains, and a Sample-efficient Replay strategy to reinforce memory
retention. Experiments on real-world datasets demonstrate EverAdapt's superiority
in maintaining robust fault diagnosis in dynamic environments. Our code is
available at https://github.com/mohamedr002/EverAdapt
[LINK]
http://arxiv.org/abs/2407.17117v1
[DATE]
2024-07-24 17:25:54+08:00
[CATEGORIES]
cs.LG
Neural Dueling Bandits
[AUTHORS]
Arun Verma, Zhongxiang Dai, Xiaoqiang Lin, Patrick Jaillet, Bryan Kian Hsiang Low
[ABSTRACT]
Contextual dueling bandits are used to model bandit problems in which a
learner’s goal is to find the best arm for a given context using observed noisy
preference feedback over the selected arms for the past contexts. However,
existing algorithms assume the reward function is linear, whereas it can be complex
and non-linear in many real-life applications like online recommendations or
ranking web search results. To overcome this challenge, we use a neural network
to estimate the reward function using preference feedback for the previously
selected arms. We propose upper confidence bound- and Thompson sampling-based
algorithms with sub-linear regret guarantees that efficiently select arms in
each round. We then extend our theoretical results to contextual bandit
problems with binary feedback, which is in itself a non-trivial contribution.
Experimental results on the problem instances derived from synthetic datasets
corroborate our theoretical results.
[COMMENTS]
Accepted at ICML 2024 Workshop on Foundations of Reinforcement
Learning and Control
[LINK]
http://arxiv.org/abs/2407.17112v1
[DATE]
2024-07-24 17:23:22+08:00
[CATEGORIES]
cs.LG
Improved Random Features for Dot Product Kernels
[AUTHORS]
Jonas Wacker, Motonobu Kanagawa, Maurizio Filippone
[ABSTRACT]
Dot product kernels, such as polynomial and exponential (softmax) kernels,
are among the most widely used kernels in machine learning, as they enable
modeling the interactions between input features, which is crucial in
applications like computer vision, natural language processing, and recommender
systems. We make several novel contributions for improving the efficiency of
random feature approximations for dot product kernels, to make these kernels
more useful in large scale learning. First, we present a generalization of
existing random feature approximations for polynomial kernels, such as
Rademacher and Gaussian sketches and TensorSRHT, using complex-valued random
features. We show empirically that the use of complex features can
significantly reduce the variances of these approximations. Second, we provide
a theoretical analysis for understanding the factors affecting the efficiency
of various random feature approximations, by deriving closed-form expressions
for their variances. These variance formulas elucidate conditions under which
certain approximations (e.g., TensorSRHT) achieve lower variances than others
(e.g., Rademacher sketches), and conditions under which the use of complex
features leads to lower variances than real features. Third, by using these
variance formulas, which can be evaluated in practice, we develop a data-driven
optimization approach to improve random feature approximations for general dot
product kernels, which is also applicable to the Gaussian kernel. We describe
the improvements brought by these contributions with extensive experiments on a
variety of tasks and datasets.
[LINK]
http://arxiv.org/abs/2201.08712v3
[DATE]
2024-07-24 17:17:56+08:00
[CATEGORIES]
cs.LG
Towards Robust Knowledge Tracing Models via k-Sparse Attention
[AUTHORS]
Shuyan Huang, Zitao Liu, Xiangyu Zhao, Weiqi Luo, Jian Weng
[ABSTRACT]
Knowledge tracing (KT) is the problem of predicting students’ future
performance based on their historical interaction sequences. With its advanced
capability of capturing contextual long-term dependencies, the attention mechanism
has become one of the essential components in many deep learning based KT (DLKT)
models. In spite of the impressive performance achieved by these attentional
DLKT models, many of them are vulnerable to overfitting,
especially on small-scale educational datasets. Therefore, in this paper, we
propose sparseKT, a simple yet effective framework to improve the
robustness and generalization of the attention based DLKT approaches.
Specifically, we incorporate a k-selection module to only pick items with the
highest attention scores. We propose two sparsification heuristics: (1)
soft-thresholding sparse attention and (2) top-$K$ sparse attention. We show
that our sparseKT is able to help attentional KT models get rid of
irrelevant student interactions and have comparable predictive performance when
compared to 11 state-of-the-art KT models on three publicly available
real-world educational datasets. To encourage reproducible research, we make
our data and code publicly available at
https://github.com/pykt-team/pykt-toolkit. We merged our model into the pyKT
benchmark at https://pykt.org/.
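The top-$K$ heuristic can be pictured with a small standalone sketch: keep the $K$ largest attention scores, mask the rest, and renormalize. This is only an illustration under assumptions (selection applied to raw scores before the softmax, pure-Python lists, a hypothetical function name); it is not the pyKT implementation.

```python
import math


def top_k_sparse_attention(scores, k):
    """Softmax over the k largest scores; all other positions get weight 0."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    # Indices of the k highest scores (ties broken by earlier position).
    keep = set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])
    m = max(scores[i] for i in keep)  # subtract the max for numerical stability
    exps = [math.exp(s - m) if i in keep else 0.0 for i, s in enumerate(scores)]
    total = sum(exps)
    return [e / total for e in exps]


weights = top_k_sparse_attention([2.0, 0.5, 1.0, -1.0], k=2)
print(weights)  # positions 1 and 3 receive zero weight
```

Interactions outside the top $K$ contribute nothing to the attention output, which is the sense in which irrelevant student interactions are discarded.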
[COMMENTS]
Accepted at SIGIR’2023 (revised version with additional results)
[LINK]
http://arxiv.org/abs/2407.17097v1
[DATE]
2024-07-24 16:49:18+08:00
[CATEGORIES]
cs.LG
Assessing Non-Nested Configurations of Multifidelity Machine Learning for Quantum-Chemical Properties
[AUTHORS]
Vivin Vinod, Peter Zaspel
[ABSTRACT]
Multifidelity machine learning (MFML) for quantum chemical (QC) properties
has seen strong development in recent years. The method has been shown to
reduce the cost of generating training data for high-accuracy low-cost ML
models. In such a set-up, the ML models are trained on molecular geometries and
some property of interest computed at various computational chemistry
accuracies, or fidelities. These are then combined in training the MFML models.
In some multifidelity models, the training data is required to be nested, that
is, the same molecular geometries are included to calculate the property across
all the fidelities. In these multifidelity models, the requirement of a nested
configuration restricts the kind of sampling that can be performed when
selecting training samples at different fidelities.
This work assesses the use of non-nested training data for two of these
multifidelity methods, namely MFML and optimized MFML (o-MFML). The assessment
is carried out for the prediction of ground state energies and first vertical
excitation energies of a diverse collection of molecules of the CheMFi dataset.
Results indicate that the MFML method still requires a nested structure of
training data across the fidelities. However, the o-MFML method shows promising
results for non-nested multifidelity training data with model errors comparable
to the nested configurations.
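To make the nested versus non-nested distinction concrete, here is a small sketch under stated assumptions: training geometries are represented by integer indices into a pool, and the sample sizes per fidelity are supplied by the caller. The helper names `nested_samples` and `non_nested_samples` are hypothetical and are not taken from the paper's code.

```python
import random


def nested_samples(pool, sizes, seed=0):
    """Draw index sets so that each higher fidelity is a subset of the one below.

    `sizes` is ordered from the lowest fidelity (largest set) to the highest.
    """
    rng = random.Random(seed)
    current = rng.sample(pool, sizes[0])
    levels = [current]
    for n in sizes[1:]:
        # Sample from the previous level only, which enforces nesting.
        current = rng.sample(current, n)
        levels.append(current)
    return levels


def non_nested_samples(pool, sizes, seed=0):
    """Draw each fidelity's index set independently from the whole pool."""
    rng = random.Random(seed)
    return [rng.sample(pool, n) for n in sizes]


pool = list(range(100))
nested = nested_samples(pool, [32, 16, 8])
free = non_nested_samples(pool, [32, 16, 8])
print([len(level) for level in nested], [len(level) for level in free])
```

In the nested case the geometries used at the highest fidelity are guaranteed to also appear at every lower fidelity, whereas the non-nested draw gives no such guarantee.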
[LINK]
http://arxiv.org/abs/2407.17087v1
[DATE]
2024-07-24 16:34:08+08:00
[CATEGORIES]
cs.LG
OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos
[AUTHORS]
Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Andrew Zisserman
[ABSTRACT]
We introduce a dataset of annotations of temporal repetitions in videos. The
dataset, OVR (pronounced as over), contains annotations for over 72K videos,
with each annotation specifying the number of repetitions, the start and end
time of the repetitions, and also a free-form description of what is repeating.
The annotations are provided for videos sourced from Kinetics and Ego4D, and
consequently cover both Exo and Ego viewing conditions, with a huge variety of
actions and activities. Moreover, OVR is almost an order of magnitude larger
than previous datasets for video repetition. We also propose a baseline
transformer-based counting model, OVRCounter, that can localise and count
repetitions in videos that are up to 320 frames long. The model is trained and
evaluated on the OVR dataset, and its performance assessed with and without
using text to specify the target class to count. The performance is also
compared to a prior repetition counting model. The dataset is available for
download at: https://sites.google.com/view/openvocabreps/
[LINK]
http://arxiv.org/abs/2407.17085v1
[DATE]
2024-07-24 16:22:49+08:00
[CATEGORIES]
cs.LG
MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs
[AUTHORS]
Quang H. Nguyen, Duy C. Hoang, Juliette Decugis, Saurav Manchanda, Nitesh V. Chawla, Khoa D. Doan
[ABSTRACT]
The rapid progress in machine learning (ML) has brought forth many large
language models (LLMs) that excel in various tasks and areas. These LLMs come
with different abilities and costs in terms of computation or pricing. Since
the demand for each query can vary, e.g., because of the queried domain or its
complexity, defaulting to one LLM in an application is not usually the best
choice, whether it is the biggest, priciest, or even the one with the best
average test performance. Consequently, picking the right LLM that is both
accurate and cost-effective for an application remains a challenge. In this
paper, we introduce MetaLLM, a framework that dynamically and intelligently
routes each query to the optimal LLM (among several available LLMs) for
classification tasks, achieving significantly improved accuracy and
cost-effectiveness. By framing the selection problem as a multi-armed bandit,
MetaLLM balances prediction accuracy and cost efficiency under uncertainty. Our
experiments, conducted on popular LLM platforms such as OpenAI’s GPT models,
Amazon’s Titan, Anthropic’s Claude, and Meta’s LLaMa, showcase MetaLLM’s
efficacy in real-world scenarios, laying the groundwork for future extensions
beyond classification tasks.
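The bandit framing can be sketched as follows. This is a simplified, non-contextual epsilon-greedy router and not MetaLLM itself: it ignores the query content, and the reward definition (correctness minus a weighted cost), the class name `EpsilonGreedyRouter`, and the parameters `cost_weight` and `epsilon` are all assumptions made for illustration.

```python
import random


class EpsilonGreedyRouter:
    """Toy router that picks one of several LLM "arms" (illustrative only)."""

    def __init__(self, arms, cost_weight=0.5, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.cost_weight = cost_weight  # assumed accuracy/cost trade-off
        self.epsilon = epsilon          # exploration probability
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.arms}
        self.mean_reward = {a: 0.0 for a in self.arms}

    def best_arm(self):
        """Arm with the highest estimated reward so far."""
        return max(self.arms, key=lambda a: self.mean_reward[a])

    def select(self):
        """Try every arm once, then explore with probability epsilon."""
        untried = [a for a in self.arms if self.counts[a] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return self.best_arm()

    def update(self, arm, correct, cost):
        """Record one outcome; reward = correctness - cost_weight * cost."""
        reward = (1.0 if correct else 0.0) - self.cost_weight * cost
        self.counts[arm] += 1
        # Incremental update of the running mean reward.
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.counts[arm]
        return reward


router = EpsilonGreedyRouter(["small-llm", "large-llm"])
router.update("small-llm", correct=True, cost=0.1)
router.update("large-llm", correct=True, cost=1.0)
print(router.best_arm())
```

When both models answer correctly, the cheaper one earns the higher reward, which is the trade-off the abstract describes.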
[LINK]
http://arxiv.org/abs/2407.10834v2
[DATE]
2024-07-24 16:14:50+08:00
[CATEGORIES]
cs.LG
Boosting Gradient Ascent for Continuous DR-submodular Maximization
[AUTHORS]
Qixin Zhang, Zongqi Wan, Zengde Deng, Zaiyi Chen, Xiaoming Sun, Jialin Zhang, Yu Yang
[ABSTRACT]
Projected Gradient Ascent (PGA) is the most commonly used optimization scheme
in machine learning and operations research areas. Nevertheless, numerous
studies and examples have shown that the PGA methods may fail to achieve the
tight approximation ratio for continuous DR-submodular maximization problems.
To address this challenge, we present a boosting technique in this paper, which
can efficiently improve the approximation guarantee of the standard PGA to
\emph{optimal} with only small modifications on the objective function. The
fundamental idea of our boosting technique is to exploit non-oblivious search
to derive a novel auxiliary function $F$, whose stationary points are excellent
approximations to the global maximum of the original DR-submodular objective
$f$. Specifically, when $f$ is monotone and $\gamma$-weakly DR-submodular, we
propose an auxiliary function $F$ whose stationary points can provide a better
$(1-e^{-\gamma})$-approximation than the
$(\gamma^2/(1+\gamma^2))$-approximation guaranteed by the stationary points of
$f$ itself. Similarly, for the non-monotone case, we devise another auxiliary
function $F$ whose stationary points can achieve an optimal
$\frac{1-\min_{\boldsymbol{x}\in\mathcal{C}}\|\boldsymbol{x}\|_{\infty}}{4}$-approximation
guarantee where $\mathcal{C}$ is a convex constraint set. In contrast, the
stationary points of the original non-monotone DR-submodular function can be
arbitrarily bad~\citep{chen2023continuous}. Furthermore, we demonstrate the
scalability of our boosting technique on four problems. In all four
problems, the resulting variants of the boosting PGA algorithm outperform the
standard PGA in several aspects, such as approximation ratio and efficiency.
Finally, we corroborate our theoretical findings with numerical experiments,
which demonstrate the effectiveness of our boosting PGA methods.
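The gap between the two monotone-case guarantees quoted above can be checked numerically. The sketch below only evaluates the two closed-form ratios from the abstract; it assumes $\gamma \in (0, 1]$, and the function names are hypothetical.

```python
import math


def boosted_ratio(gamma):
    """(1 - e^{-gamma}): guarantee at stationary points of the auxiliary F."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma is assumed to lie in (0, 1]")
    return 1.0 - math.exp(-gamma)


def stationary_ratio(gamma):
    """gamma^2 / (1 + gamma^2): guarantee at stationary points of f itself."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma is assumed to lie in (0, 1]")
    return gamma ** 2 / (1.0 + gamma ** 2)


for gamma in (0.25, 0.5, 0.75, 1.0):
    print(f"gamma={gamma:.2f}  "
          f"boosted={boosted_ratio(gamma):.3f}  "
          f"plain={stationary_ratio(gamma):.3f}")
```

For $\gamma = 1$ the ratios are roughly 0.632 versus 0.5, and the boosted ratio stays larger for every value printed.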
[COMMENTS]
74 pages, 6 figures and 9 tables. An extended version of Stochastic
Continuous Submodular Maximization: Boosting via Non-oblivious Function (ICML
2022)
[LINK]
http://arxiv.org/abs/2401.08330v2
[DATE]
2024-07-24 16:13:26+08:00
[CATEGORIES]
cs.LG
Surrogate Neural Networks Local Stability for Aircraft Predictive Maintenance
[AUTHORS]
Mélanie Ducoffe, Guillaume Povéda, Audrey Galametz, Ryma Boumazouza, Marion-Cécile Martin, Julien Baris, Derk Daverschot, Eugene O’Higgins
[ABSTRACT]
Surrogate Neural Networks are nowadays routinely used in industry as
substitutes for computationally demanding engineering simulations (e.g., in
structural analysis). They enable faster predictions, and thus faster
analyses, in industrial applications, e.g., during product design, testing, or
monitoring phases. Due to their performance and time-efficiency, these
surrogate models are now being developed for use in safety-critical
applications. Neural network verification and in particular the assessment of
their robustness (e.g., to perturbations) is the next critical step to allow
their inclusion in real-life applications and certification. We assess the
applicability and scalability of empirical and formal methods in the context of
aircraft predictive maintenance for surrogate neural networks designed to
predict the stress sustained by an aircraft part from external loads. The case
study covers a high-dimensional input and output space and the verification
process thus accommodates multi-objective constraints. We explore the
complementarity of verification methods in assessing the local stability
property of such surrogate models to input noise. We showcase the effectiveness
of sequentially combining methods in one verification ‘pipeline’ and
demonstrate the subsequent gain in runtime required to assess the targeted
property.
[COMMENTS]
Peer-reviewed and accepted at the 29th International Conference on
Formal Methods for Industrial Critical Systems (FMICS 2024) - 15 pages
[LINK]
http://arxiv.org/abs/2401.06821v4
[DATE]
2024-07-24 16:12:11+08:00
[CATEGORIES]
cs.LG
Contrastive Learning Is Not Optimal for Quasiperiodic Time Series
[AUTHORS]
Adrian Atienza, Jakob Bardram, Sadasivan Puthusserypady
[ABSTRACT]
Despite recent advancements in Self-Supervised Learning (SSL) for time series
analysis, a noticeable gap persists between the anticipated achievements and
actual performance. While these methods have demonstrated formidable
generalization capabilities with minimal labels in various domains, their
effectiveness in distinguishing between different classes based on a limited
number of annotated records is notably lacking. Our hypothesis attributes this
bottleneck to the prevalent use of Contrastive Learning, a shared training
objective in previous state-of-the-art (SOTA) methods. By mandating
distinctiveness between representations for negative pairs drawn from separate
records, this approach compels the model to encode unique record-based patterns
but simultaneously neglects changes occurring across the entire record. To
overcome this challenge, we introduce Distilled Embedding for Almost-Periodic
Time Series (DEAPS) in this paper, offering a non-contrastive method tailored
for quasiperiodic time series, such as electrocardiogram (ECG) data. By
avoiding the use of negative pairs, we not only mitigate the model’s blindness
to temporal changes but also enable the integration of a “Gradual Loss (Lgra)”
function. This function guides the model to effectively capture dynamic
patterns evolving throughout the record. The outcomes are promising, as DEAPS
demonstrates a notable improvement of +10% over existing SOTA methods when just
a few annotated records are presented to fit a Machine Learning (ML) model
based on the learned representation.
[COMMENTS]
Accepted to IJCAI 2024
[LINK]
http://arxiv.org/abs/2407.17073v1
[DATE]
2024-07-24 16:02:41+08:00
[CATEGORIES]
cs.LG
An Efficient Procedure for Computing Bayesian Network Structure Learning
[AUTHORS]
Hongming Huang, Joe Suzuki
[ABSTRACT]
We propose a globally optimal Bayesian network structure discovery algorithm
based on a progressively leveled scoring approach. Bayesian network structure
discovery is a fundamental yet NP-hard problem in the field of probabilistic
graphical models, and as the number of variables increases, memory usage grows
exponentially. The simple and effective method proposed by Silander and
Myllymäki has been widely applied in this field, as it incrementally
calculates local scores to achieve global optimality. However, existing methods
that utilize disk storage, while capable of handling networks with a larger
number of variables, introduce issues such as latency, fragmentation, and
additional overhead associated with disk I/O operations. To avoid these
problems, we explore how to further enhance computational efficiency and reduce
peak memory usage using only memory. We introduce an efficient hierarchical
computation method that requires only a single traversal of all local
structures, retaining only the data and information necessary for the current
computation, thereby improving efficiency and significantly reducing memory
requirements. Experimental results indicate that our method, when using only
memory, not only reduces peak memory usage but also improves computational
efficiency compared to existing methods, demonstrating good scalability for
handling larger networks and exhibiting stable experimental results.
Ultimately, we successfully processed a Bayesian network with 28 variables
using only memory.
[LINK]
http://arxiv.org/abs/2407.17072v1
[DATE]
2024-07-24 15:59:18+08:00
[CATEGORIES]
cs.LG
Curriculum Negative Mining For Temporal Networks
[AUTHORS]
Ziyue Chen, Tongya Zheng, Mingli Song
[ABSTRACT]
Temporal networks are effective in capturing the evolving interactions of
networks over time, such as social networks and e-commerce networks. In recent
years, researchers have primarily concentrated on developing specific model
architectures for Temporal Graph Neural Networks (TGNNs) in order to improve
the representation quality of temporal nodes and edges. However, limited
attention has been given to the quality of negative samples during the training
of TGNNs. When compared with static networks, temporal networks present two
specific challenges for negative sampling: positive sparsity and positive
shift. Positive sparsity refers to the presence of a single positive sample
amidst numerous negative samples at each timestamp, while positive shift
relates to the variations in positive samples across different timestamps. To
robustly address these challenges in training TGNNs, we introduce Curriculum
Negative Mining (CurNM), a model-aware curriculum learning framework that
adaptively adjusts the difficulty of negative samples. Within this framework,
we first establish a dynamically updated negative pool that balances random,
historical, and hard negatives to address the challenges posed by positive
sparsity. Secondly, we implement a temporal-aware negative selection module
that focuses on learning from the disentangled factors of recently active
edges, thus accurately capturing shifting preferences. Extensive experiments on
12 datasets and 3 TGNNs demonstrate that our method outperforms baseline
methods by a significant margin. Additionally, thorough ablation studies and
parameter sensitivity experiments verify the usefulness and robustness of our
approach. Our code is available at https://github.com/zziyue83/CurNM.
[LINK]
http://arxiv.org/abs/2407.17070v1
[DATE]
2024-07-24 15:55:49+08:00
[CATEGORIES]
cs.LG
A spatiotemporal deep learning framework for prediction of crack dynamics in heterogeneous solids: efficient mapping of concrete microstructures to its fracture properties
[AUTHORS]
Rasoul Najafi Koopas, Shahed Rezaei, Natalie Rauter, Richard Ostwald, Rolf Lammering
[ABSTRACT]
A spatiotemporal deep learning framework is proposed that is capable of 2D
full-field prediction of fracture in concrete mesostructures. This framework
not only predicts fractures but also captures the entire history of the
fracture process, from the crack initiation in the interfacial transition zone
to the subsequent propagation of the cracks in the mortar matrix. In addition,
a convolutional neural network is developed which can predict the averaged
stress-strain curve of the mesostructures. The UNet modeling framework, which
comprises an encoder-decoder section with skip connections, is used as the deep
learning surrogate model. Training and test data are generated from
high-fidelity fracture simulations of randomly generated concrete
mesostructures. These mesostructures include geometric variabilities such as
different aggregate particle geometrical features, spatial distribution, and
the total volume fraction of aggregates. The fracture simulations are carried
out in Abaqus, utilizing the cohesive phase-field fracture modeling technique
as the fracture modeling approach. In this work, to reduce the number of
training datasets, the spatial distribution of three sets of material
properties for three-phase concrete mesostructures, along with the spatial
phase-field damage index, are fed to the UNet to predict the corresponding
stress and spatial damage index at the subsequent step. It is shown that after
the training process using this methodology, the UNet model is capable of
accurately predicting damage on the unseen test dataset by using 470 datasets.
Moreover, another novel aspect of this work is the conversion of irregular
finite element data into regular grids using a developed pipeline. This
approach allows for the implementation of a less complex UNet architecture and
facilitates the integration of phase-field fracture equations into surrogate
models for future developments.
[LINK]
http://arxiv.org/abs/2407.15665v2
[DATE]
2024-07-24 15:51:20+08:00
[CATEGORIES]
cs.LG
Knowledge-augmented Graph Machine Learning for Drug Discovery: A Survey
[AUTHORS]
Zhiqiang Zhong, Anastasia Barkova, Davide Mottin
[ABSTRACT]
The integration of Artificial Intelligence (AI) into the field of drug
discovery has been a growing area of interdisciplinary scientific research.
However, conventional AI models are heavily limited in handling complex
biomedical structures (such as 2D or 3D protein and molecule structures) and
providing interpretations for outputs, which hinders their practical
application. As of late, Graph Machine Learning (GML) has gained considerable
attention for its exceptional ability to model graph-structured biomedical data
and investigate their properties and functional relationships. Despite
extensive efforts, GML methods still suffer from several deficiencies, such as
the limited ability to handle supervision sparsity and provide interpretability
in learning and inference processes, and their ineffectiveness in utilising
relevant domain knowledge. In response, recent studies have proposed
integrating external biomedical knowledge into the GML pipeline to realise more
precise and interpretable drug discovery with limited training instances.
However, a systematic definition for this burgeoning research direction is yet
to be established. This survey presents a comprehensive overview of
long-standing drug discovery principles, provides the foundational concepts and
cutting-edge techniques for graph-structured data and knowledge databases, and
formally summarises Knowledge-augmented Graph Machine Learning (KaGML) for drug
discovery. We present a thorough review of related KaGML works, collected
following a carefully designed search methodology, and organise them into four
categories following a newly defined taxonomy. To facilitate research in this
rapidly emerging field, we also share collected practical resources that are
valuable for intelligent drug discovery and provide an in-depth discussion of
the potential avenues for future advancements.
[LINK]
http://arxiv.org/abs/2302.08261v3
[DATE]
2024-07-24 15:26:59+08:00
[CATEGORIES]
cs.LG
DCoM: Active Learning for All Learners
[AUTHORS]
Inbal Mishal, Daphna Weinshall
[ABSTRACT]
Deep Active Learning (AL) techniques can be effective in reducing annotation
costs for training deep models. However, their effectiveness in low- and
high-budget scenarios seems to require different strategies, and achieving
optimal results across varying budget scenarios remains a challenge. In this
study, we introduce Dynamic Coverage & Margin mix (DCoM), a novel active
learning approach designed to bridge this gap. Unlike existing strategies, DCoM
dynamically adjusts its strategy, considering the competence of the current
model. Through theoretical analysis and empirical evaluations on diverse
datasets, including challenging computer vision tasks, we demonstrate DCoM’s
ability to overcome the cold start problem and consistently improve results
across different budgetary constraints. Thus DCoM achieves state-of-the-art
performance in both low- and high-budget regimes.
[LINK]
http://arxiv.org/abs/2407.01804v2
[DATE]
2024-07-24 15:19:00+08:00
[CATEGORIES]
cs.LG
Time Series Missing Imputation with Multivariate Radial Basis Function Neural Network
[AUTHORS]
Chanyoung Jung, Yun Jang
[ABSTRACT]
Researchers have been persistently working to address the issue of missing
values in time series data. Numerous models have been proposed, striving to
estimate the distribution of the data. The Radial Basis Functions Neural
Network (RBFNN) has recently exhibited exceptional performance in estimating
data distribution. In this paper, we propose a time series imputation model
based on RBFNN. Our imputation model learns local information from timestamps
to create a continuous function. Additionally, we incorporate time gaps to
facilitate learning information considering the missing terms of missing
values. We name this model the Missing Imputation Multivariate RBFNN
(MIM-RBFNN). However, MIM-RBFNN relies on a local information-based learning
approach, which presents difficulties in utilizing temporal information.
Therefore, we propose an extension called the Missing Value Imputation
Recurrent Neural Network with Continuous Function (MIRNN-CF) using the
continuous function generated by MIM-RBFNN. We evaluate the performance using
two real-world datasets with non-random missing and random missing patterns,
and conduct an ablation study comparing MIM-RBFNN and MIRNN-CF.
[LINK]
http://arxiv.org/abs/2407.17040v1
[DATE]
2024-07-24 15:02:16+08:00
[CATEGORIES]
cs.LG
Sparse Inducing Points in Deep Gaussian Processes: Enhancing Modeling with Denoising Diffusion Variational Inference
[AUTHORS]
Jian Xu, Delu Zeng, John Paisley
[ABSTRACT]
Deep Gaussian processes (DGPs) provide a robust paradigm for Bayesian deep
learning. In DGPs, a set of sparse integration locations called inducing points
are selected to approximate the posterior distribution of the model. This is
done to reduce computational complexity and improve model efficiency. However,
inferring the posterior distribution of inducing points is not straightforward.
Traditional variational inference approaches to posterior approximation often
lead to significant bias. To address this issue, we propose an alternative
method called Denoising Diffusion Variational Inference (DDVI) that uses a
denoising diffusion stochastic differential equation (SDE) to generate
posterior samples of inducing variables. We rely on score matching methods for
denoising diffusion models to approximate score functions with a neural network.
Furthermore, by combining classical mathematical theory of SDEs with the
minimization of KL divergence between the approximate and true processes, we
propose a novel explicit variational lower bound for the marginal likelihood
function of DGP. Through experiments on various datasets and comparisons with
baseline methods, we empirically demonstrate the effectiveness of DDVI for
posterior inference of inducing points for DGP models.
[LINK]
http://arxiv.org/abs/2407.17033v1
[DATE]
2024-07-24 14:39:58+08:00
[CATEGORIES]
cs.LG
Diversity-Preserving K-Armed Bandits, Revisited
[AUTHORS]
Hédi Hadiji, Sébastien Gerchinovitz, Jean-Michel Loubes, Gilles Stoltz
[ABSTRACT]
We consider the bandit-based framework for diversity-preserving
recommendations introduced by Celis et al. (2019), who approached it in the
case of a polytope mainly by a reduction to the setting of linear bandits. We
design a UCB algorithm using the specific structure of the setting and show
that it enjoys a bounded distribution-dependent regret in the natural cases
when the optimal mixed actions put some probability mass on all actions (i.e.,
when diversity is desirable). The regret lower bounds provided show that
otherwise, at least when the model is mean-unbounded, a $\ln T$ regret is
suffered. We also discuss an example beyond the special case of polytopes.
[LINK]
http://arxiv.org/abs/2010.01874v3
[DATE]
2024-07-24 14:25:27+08:00
[CATEGORIES]
cs.LG
Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance
[AUTHORS]
Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li
[ABSTRACT]
Large Language Models (LLMs) have demonstrated impressive performance across
various domains. However, the enormous number of model parameters makes
fine-tuning challenging, significantly limiting their application and
deployment. Existing solutions combine parameter quantization with Low-Rank
Adaptation (LoRA), greatly reducing memory usage but resulting in noticeable
performance degradation. In this paper, we identify an imbalance in fine-tuning
quantized pre-trained models: overly complex adapter inputs and outputs versus
low effective trainability of the adaptation. We propose Quantized LLMs with
Balanced-rank Adaptation (Q-BaRA), which simplifies the adapter inputs and
outputs while increasing the adapter’s rank to achieve a more suitable balance
for fine-tuning quantized LLMs. Additionally, for scenarios where fine-tuned
LLMs need to be deployed as low-precision inference models, we introduce
Quantization-Aware Fine-tuning with Higher Rank Adaptation (QA-HiRA), which
simplifies the adapter inputs and outputs to align with the pre-trained model’s
block-wise quantization while employing a single matrix to achieve a higher
rank. Both Q-BaRA and QA-HiRA are easily implemented and offer the following
optimizations: (i) Q-BaRA consistently achieves the highest accuracy compared
to baselines and other variants, requiring the same number of trainable
parameters and computational effort; (ii) QA-HiRA naturally merges adapter
parameters into the block-wise quantized model after fine-tuning, achieving the
highest accuracy compared to other methods. We apply our Q-BaRA and QA-HiRA to
the LLaMA and LLaMA2 model families and validate their effectiveness across
different fine-tuning datasets and downstream scenarios.
Code will be made available at
\url{https://github.com/xiaocaigou/qbaraqahira}
[LINK]
http://arxiv.org/abs/2407.17029v1
[DATE]
2024-07-24 14:16:37+08:00
[CATEGORIES]
cs.LG
Nonlinear Schrödinger Network
[AUTHORS]
Yiming Zhou, Callen MacPhee, Tingyi Zhou, Bahram Jalali
[ABSTRACT]
Deep neural networks (DNNs) have achieved exceptional performance across
various fields by learning complex nonlinear mappings from large-scale
datasets. However, they encounter challenges such as high computational costs
and limited interpretability. To address these issues, hybrid approaches that
integrate physics with AI are gaining interest. This paper introduces a novel
physics-based AI model called the “Nonlinear Schrödinger Network”, which
treats the Nonlinear Schrödinger Equation (NLSE) as a general-purpose
trainable model for learning complex patterns including nonlinear mappings and
memory effects from data. Existing physics-informed machine learning methods
use neural networks to approximate the solutions of partial differential
equations (PDEs). In contrast, our approach directly treats the PDE as a
trainable model to obtain general nonlinear mappings that would otherwise
require neural networks. As a type of physics-AI symbiosis, it offers a more
interpretable and parameter-efficient alternative to traditional black-box
neural networks, achieving comparable or better accuracy in some time series
classification tasks while significantly reducing the number of required
parameters. Notably, the trained Nonlinear Schrödinger Network is
interpretable, with all parameters having physical meanings as properties of a
virtual physical system that transforms the data to a more separable space.
This interpretability allows for insight into the underlying dynamics of the
data transformation process. Applications to time series forecasting have also
been explored. While our current implementation utilizes the NLSE, the proposed
method of using physics equations as trainable models to learn nonlinear
mappings from data is not limited to the NLSE and may be extended to other
master equations of physics.
[LINK]
http://arxiv.org/abs/2407.14504v2
[DATE]
2024-07-24 12:33:55+08:00
[CATEGORIES]
cs.LG
Fourier-MIONet: Fourier-enhanced multiple-input neural operators for multiphase modeling of geological carbon sequestration
[AUTHORS]
Zhongyi Jiang, Min Zhu, Lu Lu
[ABSTRACT]
Geologic carbon sequestration (GCS) is a safety-critical technology that aims
to reduce the amount of carbon dioxide in the atmosphere, which also places
high demands on reliability. Multiphase flow in porous media is essential to
understand CO$_2$ migration and pressure fields in the subsurface associated
with GCS. However, numerical simulation for such problems in 4D is
computationally challenging and expensive, due to the multiphysics and
multiscale nature of the highly nonlinear governing partial differential
equations (PDEs). It prevents us from considering multiple subsurface scenarios
and conducting real-time optimization. Here, we develop a Fourier-enhanced
multiple-input neural operator (Fourier-MIONet) to learn the solution operator
of the problem of multiphase flow in porous media. Fourier-MIONet utilizes the
recently developed framework of the multiple-input deep neural operators
(MIONet) and incorporates the Fourier neural operator (FNO) in the network
architecture. Once Fourier-MIONet is trained, it can predict the evolution of
saturation and pressure of the multiphase flow under various reservoir
conditions, such as permeability and porosity heterogeneity, anisotropy,
injection configurations, and multiphase flow properties. Compared to the
enhanced FNO (U-FNO), the proposed Fourier-MIONet has 90% fewer unknown
parameters, and it can be trained in significantly less time (about 3.5 times
faster) with much lower CPU memory ($<$ 15%) and GPU memory ($<$ 35%)
requirements, to achieve similar prediction accuracy. In addition to the lower
computational cost, Fourier-MIONet can be trained with only 6 snapshots of time
to predict the PDE solutions for 30 years. The excellent generalizability of
Fourier-MIONet is enabled by its adherence to the physical principle that the
solution to a PDE is continuous over time.
[LINK]
http://arxiv.org/abs/2303.04778v2
[DATE]
2024-07-24 12:10:16+08:00
[CATEGORIES]
cs.LG
Sparse Tensor PCA via Tensor Decomposition for Unsupervised Feature Selection
[AUTHORS]
Junjing Zheng, Xinyu Zhang, Weidong Jiang
[ABSTRACT]
Recently, introducing Tensor Decomposition (TD) methods into unsupervised
feature selection (UFS) has been a rising research point. A tensor structure is
beneficial for mining the relations between different modes and helps relieve
the computation burden. However, while existing methods exploit TD to minimize
the reconstruction error of a data tensor, they don’t fully utilize the
interpretable and discriminative information in the factor matrices. Moreover,
most methods require domain knowledge to perform feature selection. To solve
the above problems, we develop two Sparse Tensor Principal Component Analysis
(STPCA) models that utilize the projection directions in the factor matrices to
perform UFS. The first model extends Tucker Decomposition to a multiview sparse
regression form and is transformed into several alternately solved convex
subproblems. The second model formulates a sparse version of the family of
Tensor Singular Value Decomposition (T-SVDs) and is transformed into individual
convex subproblems. For both models, we prove the optimal solution of each
subproblem falls onto the Hermitian Positive Semidefinite Cone (HPSD).
Accordingly, we design two fast algorithms based on HPSD projection and prove
their convergence. According to the experimental results on two original
synthetic datasets (Orbit and Array Signal) and five real-world datasets, the
two proposed methods are suitable for handling different data tensor scenarios
and outperform the state-of-the-art UFS methods.
[LINK]
http://arxiv.org/abs/2407.16985v1
[DATE]
2024-07-24 12:04:56+08:00
[CATEGORIES]
cs.LG
scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM
[AUTHORS]
Shang-Jung Wen, Jia-Ming Chang, Fang Yu
[ABSTRACT]
High-dimensional single-cell data poses significant challenges in identifying
underlying biological patterns due to the complexity and heterogeneity of
cellular states. We propose a comprehensive gene-cell dependency visualization
via unsupervised clustering, Growing Hierarchical Self-Organizing Map (GHSOM),
specifically designed for analyzing high-dimensional single-cell data like
single-cell sequencing and CRISPR screens. GHSOM is applied to cluster samples
in a hierarchical structure such that the self-growth structure of clusters
satisfies the required variations between and within clusters. We propose a novel
Significant Attributes Identification Algorithm to identify features that
distinguish clusters. This algorithm pinpoints attributes with minimal
variation within a cluster but substantial variation between clusters. These
key attributes can then be used for targeted data retrieval and downstream
analysis. Furthermore, we present two innovative visualization tools: Cluster
Feature Map and Cluster Distribution Map. The Cluster Feature Map highlights
the distribution of specific features across the hierarchical structure of
GHSOM clusters. This allows for rapid visual assessment of cluster uniqueness
based on chosen features. The Cluster Distribution Map depicts leaf clusters as
circles on the GHSOM grid, with circle size reflecting cluster data size and
color customizable to visualize features like cell type or other attributes. We
apply our analysis to three single-cell datasets and one CRISPR dataset
(cell-gene database) and evaluate clustering methods with internal and external
CH and ARI scores. GHSOM performs well, being the best performer in internal
evaluation (CH=4.2). In external evaluation, GHSOM has the third-best
performance of all methods.
[COMMENTS]
Abstract presentation at BIOKDD@ACM KDD 2024
[LINK]
http://arxiv.org/abs/2407.16984v1
[DATE]
2024-07-24 12:01:09+08:00
[CATEGORIES]
cs.LG
On the Parameter Identifiability of Partially Observed Linear Causal Models
[AUTHORS]
Xinshuai Dong, Ignavier Ng, Biwei Huang, Yuewen Sun, Songyao Jin, Roberto Legaspi, Peter Spirtes, Kun Zhang
[ABSTRACT]
Linear causal models are important tools for modeling causal dependencies and
yet in practice, only a subset of the variables can be observed. In this paper,
we examine the parameter identifiability of these models by investigating
whether the edge coefficients can be recovered given the causal structure and
partially observed data. Our setting is more general than that of prior
research - we allow all variables, including both observed and latent ones, to
be flexibly related, and we consider the coefficients of all edges, whereas
most existing works focus only on the edges between observed variables.
Theoretically, we identify three types of indeterminacy for the parameters in
partially observed linear causal models. We then provide graphical conditions
that are sufficient for all parameters to be identifiable and show that some of
them are provably necessary. Methodologically, we propose a novel
likelihood-based parameter estimation method that addresses the variance
indeterminacy of latent variables in a specific way and can asymptotically
recover the underlying parameters up to trivial indeterminacy. Empirical
studies on both synthetic and real-world datasets validate our identifiability
theory and the effectiveness of the proposed method in the finite-sample
regime.
[LINK]
http://arxiv.org/abs/2407.16975v1
[DATE]
2024-07-24 11:43:55+08:00
[CATEGORIES]
cs.LG
CADC: Encoding User-Item Interactions for Compressing Recommendation Model Training Data
[AUTHORS]
Hossein Entezari Zarch, Abdulla Alshabanah, Chaoyi Jiang, Murali Annavaram
[ABSTRACT]
Deep learning recommendation models (DLRMs) are at the heart of the current
e-commerce industry. However, the amount of training data used to train these
large models is growing exponentially, leading to substantial training hurdles.
The training dataset contains two primary types of information: content-based
information (features of users and items) and collaborative information
(interactions between users and items). One approach to reduce the training
dataset is to remove user-item interactions. But that significantly diminishes
collaborative information, which is crucial for maintaining accuracy due to its
inclusion of interaction histories. This loss profoundly impacts DLRM
performance.
This paper makes an important observation that if one can capture the
user-item interaction history to enrich the user and item embeddings, then the
interaction history can be compressed without losing model accuracy. Thus, this
work, Collaborative Aware Data Compression (CADC), takes a two-step approach to
training dataset compression. In the first step, we use matrix factorization of
the user-item interaction matrix to create a novel embedding representation for
both the users and items. Once the user and item embeddings are enriched by the
interaction history information, the approach then applies uniform random
sampling of the training dataset to drastically reduce the training dataset
size while minimizing model accuracy drop. The source code of CADC is available
at https://anonymous.4open.science/r/DSS-RM-8C1D/README.md.
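The two-step idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it swaps in a truncated SVD as the matrix factorization, and the function names `factorize` and `uniform_subsample` as well as the `rank` and `keep_fraction` parameters are hypothetical.

```python
import numpy as np


def factorize(interactions: np.ndarray, rank: int):
    """Factor a user-item matrix into user and item embeddings.

    Assumption: truncated SVD stands in for the matrix factorization
    mentioned in the abstract.
    """
    u, s, vt = np.linalg.svd(interactions, full_matrices=False)
    root = np.sqrt(s[:rank])
    user_emb = u[:, :rank] * root        # shape (num_users, rank)
    item_emb = vt[:rank, :].T * root     # shape (num_items, rank)
    return user_emb, item_emb


def uniform_subsample(num_rows: int, keep_fraction: float, seed: int = 0) -> np.ndarray:
    """Return sorted indices of a uniform random subset of training rows."""
    rng = np.random.default_rng(seed)
    keep = max(1, int(round(num_rows * keep_fraction)))
    return np.sort(rng.choice(num_rows, size=keep, replace=False))


# Toy interaction matrix: 4 users, 3 items (1 = interaction observed).
interactions = np.array(
    [[1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
)
user_emb, item_emb = factorize(interactions, rank=3)
kept_rows = uniform_subsample(num_rows=100, keep_fraction=0.1)
```

The embeddings carry the collaborative signal, so in this sketch the subsampled rows would be paired with `user_emb` and `item_emb` as extra input features during training.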
[LINK]
http://arxiv.org/abs/2407.08108v2
[DATE]
2024-07-24 11:37:17+08:00
[CATEGORIES]
cs.LG
On the Trade-offs between Adversarial Robustness and Actionable Explanations
[AUTHORS]
Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju
[ABSTRACT]
As machine learning models are increasingly being employed in various
high-stakes settings, it becomes important to ensure that predictions of these
models are not only adversarially robust, but also readily explainable to
relevant stakeholders. However, it is unclear if these two notions can be
simultaneously achieved or if there exist trade-offs between them. In this
work, we make one of the first attempts at studying the impact of adversarially
robust models on actionable explanations which provide end users with a means
for recourse. We theoretically and empirically analyze the cost (ease of
implementation) and validity (probability of obtaining a positive model
prediction) of recourses output by state-of-the-art algorithms when the
underlying models are adversarially robust vs. non-robust. More specifically,
we derive theoretical bounds on the differences between the cost and the
validity of the recourses generated by state-of-the-art algorithms for
adversarially robust vs. non-robust linear and non-linear models. Our empirical
results with multiple real-world datasets validate our theoretical results and
show the impact of varying degrees of model robustness on the cost and validity
of the resulting recourses. Our analyses demonstrate that adversarially robust
models significantly increase the cost and reduce the validity of the resulting
recourses, thus shedding light on the inherent trade-offs between adversarial
robustness and actionable explanations.
[COMMENTS]
Accepted in the 7th AAAI Conference on AI, Ethics, and Society, 2024
[LINK]
http://arxiv.org/abs/2309.16452v2
[DATE]
2024-07-24 11:32:09+08:00
[CATEGORIES]
cs.LG
Stochastic Variance-Reduced Iterative Hard Thresholding in Graph Sparsity Optimization
[AUTHORS]
Derek Fox, Samuel Hernandez, Qianqian Tong
[ABSTRACT]
Stochastic optimization algorithms are widely used for large-scale data
analysis due to their low per-iteration costs, but they often suffer from slow
asymptotic convergence caused by inherent variance. Variance-reduced techniques
have therefore been used to address this issue in structured sparse models
utilizing sparsity-inducing norms or $\ell_0$-norms. However, these techniques
are not directly applicable to complex (non-convex) graph sparsity models,
which are essential in applications like disease outbreak monitoring and social
network analysis. In this paper, we introduce two stochastic variance-reduced
gradient-based methods to solve graph sparsity optimization: GraphSVRG-IHT and
GraphSCSG-IHT. We provide a general framework for theoretical analysis,
demonstrating that our methods enjoy a linear convergence speed. Extensive
experiments validate
[LINK]
http://arxiv.org/abs/2407.16968v1
[DATE]
2024-07-24 11:26:26+08:00
[CATEGORIES]
cs.LG
When AI Defeats Password Deception! A Deep Learning Framework to Distinguish Passwords and Honeywords
[AUTHORS]
Jimmy Dani, Brandon McCulloh, Nitesh Saxena
[ABSTRACT]
“Honeywords” have emerged as a promising defense mechanism for detecting data
breaches and foiling offline dictionary attacks (ODA) by deceiving attackers
with false passwords. In this paper, we propose PassFilter, a novel deep
learning (DL) based attack framework, fundamental in its ability to identify
passwords from a set of sweetwords associated with a user account, effectively
challenging a variety of honeywords generation techniques (HGTs). The DL model
in PassFilter is trained with a set of previously collected or adversarially
generated passwords and honeywords, and carefully orchestrated to predict
whether a sweetword is the password or a honeyword. Our model can compromise
the security of state-of-the-art, heuristics-based, and representation
learning-based HGTs proposed by Dionysiou et al. Specifically, our analysis
with nine publicly available password datasets shows that PassFilter
significantly outperforms the baseline random guessing success rate of 5%,
achieving 6.10% to 52.78% on the 1st guessing attempt, considering 20
sweetwords per account. This success rate rapidly increases with additional
login attempts before account lock-outs, often allowed on many real-world
online services to maintain reasonable usability. For example, it ranges from
41.78% to 96.80% for five attempts, and from 72.87% to 99.00% for ten attempts,
compared to 25% and 50% random guessing, respectively. We also examined
PassFilter against general-purpose language models used for honeyword
generation, like those proposed by Yu et al. These honeywords also proved
vulnerable to our attack, with success rates of 14.19% for 1st guessing
attempt, increasing to 30.23%, 41.70%, and 63.10% after 3rd, 5th, and 10th
guessing attempts, respectively. Our findings demonstrate the effectiveness of
the DL model deployed in PassFilter in breaching state-of-the-art HGTs and
compromising password security based on ODA.
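The random-guessing baselines quoted above (5%, 25%, and 50%) are consistent with an attacker who tries k distinct sweetwords out of 20, chosen uniformly at random. A short check of that arithmetic, where the function name `random_guess_success` is hypothetical:

```python
from fractions import Fraction


def random_guess_success(attempts: int, sweetwords: int = 20) -> Fraction:
    """Chance that the real password is among `attempts` distinct uniform guesses."""
    return Fraction(min(attempts, sweetwords), sweetwords)


# Baselines for 1, 5, and 10 login attempts with 20 sweetwords per account.
baselines = {k: random_guess_success(k) for k in (1, 5, 10)}
```

Under this assumption the baseline grows linearly in the number of attempts, which is the reference against which the reported PassFilter success rates are compared.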
[LINK]
http://arxiv.org/abs/2407.16964v1
[DATE]
2024-07-24 11:02:57+08:00
[CATEGORIES]
cs.LG
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
[AUTHORS]
Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt
[ABSTRACT]
Multimodal interleaved datasets featuring free-form interleaved sequences of
images and text are crucial for training frontier large multimodal models
(LMMs). Despite the rapid progression of open-source LMMs, there remains a
pronounced scarcity of large-scale, diverse open-source multimodal interleaved
datasets. In response, we introduce MINT-1T, the most extensive and diverse
open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one
trillion text tokens and 3.4 billion images, a 10x scale-up from existing
open-source datasets. Additionally, we include previously untapped sources such
as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires
substantial engineering effort, sharing the data curation process and releasing
the dataset greatly benefits the community. Our experiments show that LMMs
trained on MINT-1T rival the performance of models trained on the previous
leading dataset, OBELICS. Our data and code will be released at
https://github.com/mlfoundations/MINT-1T.
[LINK]
http://arxiv.org/abs/2406.11271v2
[DATE]
2024-07-24 10:59:40+08:00
[CATEGORIES]
cs.LG
Dynamic Graph Transformer with Correlated Spatial-Temporal Positional Encoding
[AUTHORS]
Zhe Wang, Sheng Zhou, Jiawei Chen, Zhen Zhang, Binbin Hu, Yan Feng, Chun Chen, Can Wang
[ABSTRACT]
Learning effective representations for Continuous-Time Dynamic Graphs (CTDGs)
has garnered significant research interest, largely due to its powerful
capabilities in modeling complex interactions between nodes. A fundamental and
crucial requirement for representation learning in CTDGs is the appropriate
estimation and preservation of proximity. However, due to the sparse and
evolving characteristics of CTDGs, the spatial-temporal properties inherent in
high-order proximity remain largely unexplored. Despite its importance, this
property presents significant challenges due to the computationally intensive
nature of personalized interaction intensity estimation and the dynamic
attributes of CTDGs. To this end, we propose a novel Correlated
Spatial-Temporal Positional encoding that incorporates a parameter-free
personalized interaction intensity estimation under the weak assumption of the
Poisson Point Process. Building on this, we introduce the Dynamic Graph
Transformer with Correlated Spatial-Temporal Positional Encoding (CorDGT),
which efficiently retains the evolving spatial-temporal high-order proximity
for effective node representation learning in CTDGs. Extensive experiments on
seven small and two large-scale datasets demonstrate the superior performance
and scalability of the proposed CorDGT.
[LINK]
http://arxiv.org/abs/2407.16959v1
[DATE]
2024-07-24 10:56:22+08:00
[CATEGORIES]
cs.LG
An Adaptive Gradient Regularization Method
[AUTHORS]
Huixiu Jiang, Yu Bao, Rutong Si
[ABSTRACT]
The optimizer plays an important role in training neural networks efficiently
and effectively. The weight update based on the gradient is the central part of
the optimizer. It has been shown that normalization and standardization
operations on weights and gradients can accelerate training and improve
performance; examples include Weight Standardization (WS), weight normalization
(WN), gradient normalization (GN), and gradient centralization (GC). In this
work, we introduce a new optimization technique based on the gradient magnitude
in a gradient vector, named adaptive gradient regularization (AGR), which
normalizes the gradient vector in all dimensions as a coefficient vector and
subtracts the product of the gradient and its coefficient vector from the
vanilla gradient. It can be viewed as an adaptive gradient clipping method. We
show that AGR can improve the Lipschitzness of the loss function, yielding a
more stable training process and better generalization performance. AGR is
simple to embed into vanilla optimizers such as Adan and AdamW with only three
lines of code. Our experiments in image generation, image classification, and
language representation show that AGR improves the training result.
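One possible reading of the AGR description above, as a minimal NumPy sketch. The abstract does not spell out the normalization, so dividing each absolute gradient entry by the sum of absolute entries is an assumption here, as are the function name `agr` and the `eps` guard.

```python
import numpy as np


def agr(grad: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Sketch of adaptive gradient regularization on one gradient tensor.

    Assumption: the coefficient vector is |g_i| / sum_j |g_j|.
    """
    abs_grad = np.abs(grad)
    coeff = abs_grad / (abs_grad.sum() + eps)  # coefficients in [0, 1]
    return grad - coeff * grad                 # shrink large entries the most


regularized = agr(np.array([3.0, -1.0]))
```

In this reading, the entry with the largest magnitude is scaled down the most, which is why the method can be seen as a soft, adaptive form of gradient clipping.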
[COMMENTS]
11 pages, 11 figures
[LINK]
http://arxiv.org/abs/2407.16944v1
[DATE]
2024-07-24 10:23:18+08:00
[CATEGORIES]
cs.LG
GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning
[AUTHORS]
Zehui Li, Vallijah Subasri, Guy-Bart Stan, Yiren Zhao, Bo Wang
[ABSTRACT]
Genetic variants (GVs) are defined as differences in the DNA sequences among
individuals and play a crucial role in diagnosing and treating genetic
diseases. The rapid decrease in next generation sequencing cost has led to an
exponential increase in patient-level GV data. This growth poses a challenge
for clinicians who must efficiently prioritize patient-specific GVs and
integrate them with existing genomic databases to inform patient management. To
address the interpretation of GVs, genomic foundation models (GFMs) have
emerged. However, these models lack standardized performance assessments,
leading to considerable variability in model evaluations. This poses the
question: How effectively do deep learning methods classify unknown GVs and
align them with clinically-verified GVs? We argue that representation learning,
which transforms raw data into meaningful feature spaces, is an effective
approach for addressing both indexing and classification challenges. We
introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring
variable-length contexts and detailed annotations, designed for deep learning
models to learn GV representations across various traits, diseases, tissue
types, and experimental contexts. Our contributions are three-fold: (i)
Construction of a comprehensive dataset with 7 million records, each labeled
with characteristics of the corresponding variants, alongside additional data
from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant
combinations, and 156 unique clinically verified GVs from real-world patients.
(ii) Analysis of the structure and properties of the dataset. (iii)
Experimentation of the dataset with pre-trained GFMs. The results show a
significant gap between GFMs' current capabilities and accurate GV
representation. We hope this dataset will help advance genomic deep learning to
bridge this gap.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2407.16940v1
[DATE]
2024-07-24 10:20:29+08:00
[CATEGORIES]
cs.LG
Synthetic Trajectory Generation Through Convolutional Neural Networks
[AUTHORS]
Jesse Merhi, Erik Buchholz, Salil S. Kanhere
[ABSTRACT]
Location trajectories provide valuable insights for applications from urban
planning to pandemic control. However, mobility data can also reveal sensitive
information about individuals, such as political opinions, religious beliefs,
or sexual orientations. Existing privacy-preserving approaches for publishing
this data face a significant utility-privacy trade-off. Releasing synthetic
trajectory data generated through deep learning offers a promising solution.
Due to the trajectories’ sequential nature, most existing models are based on
recurrent neural networks (RNNs). However, research in generative adversarial
networks (GANs) largely employs convolutional neural networks (CNNs) for image
generation. This discrepancy raises the question of whether advances in
computer vision can be applied to trajectory generation. In this work, we
introduce a Reversible Trajectory-to-CNN Transformation (RTCT) that adapts
trajectories into a format suitable for CNN-based models. We integrated this
transformation with the well-known DCGAN in a proof-of-concept (PoC) and
evaluated its performance against an RNN-based trajectory GAN using four
metrics across two datasets. The PoC was superior in capturing spatial
distributions compared to the RNN model but had difficulty replicating
sequential and temporal properties. Although the PoC’s utility is not
sufficient for practical applications, the results demonstrate the
transformation’s potential to facilitate the use of CNNs for trajectory
generation, opening up avenues for future research. To support continued
research, all source code has been made available under an open-source license.
[COMMENTS]
To appear in the proceedings of the 21st Annual International
Conference on Privacy, Security & Trust (PST 2024)
[LINK]
http://arxiv.org/abs/2407.16938v1
[DATE]
2024-07-24 10:16:52+08:00
[CATEGORIES]
cs.LG
Provable Benefit of Annealed Langevin Monte Carlo for Non-log-concave Sampling
[AUTHORS]
Wei Guo, Molei Tao, Yongxin Chen
[ABSTRACT]
We address the outstanding problem of sampling from an unnormalized density
that may be non-log-concave and multimodal. To enhance the performance of
simple Markov chain Monte Carlo (MCMC) methods, techniques of annealing type
have been widely used. However, quantitative theoretical guarantees of these
techniques are under-explored. This study takes a first step toward providing a
non-asymptotic analysis of annealed MCMC. Specifically, we establish, for the
first time, an oracle complexity of $\widetilde{O}\left(\frac{d\beta^2{\cal
A}^2}{\varepsilon^6}\right)$ for a simple annealed Langevin Monte Carlo algorithm
to achieve $\varepsilon^2$ accuracy in Kullback-Leibler divergence to the
target distribution $\pi\propto{\rm e}^{-V}$ on $\mathbb{R}^d$ with
$\beta$-smooth potential $V$. Here, ${\cal A}$ represents the action of a curve
of probability measures interpolating the target distribution $\pi$ and a
readily sampleable distribution.
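For orientation, a generic annealed Langevin Monte Carlo loop can be sketched as below. This is not the specific algorithm or interpolation curve analyzed in the paper: the linear interpolation of potentials between a standard Gaussian and the target, the schedule, the step size, and the name `annealed_langevin` are all assumptions made for illustration.

```python
import numpy as np


def annealed_langevin(grad_v, dim, num_particles=2000, num_levels=50,
                      steps_per_level=20, step_size=0.05, seed=0):
    """Run unadjusted Langevin steps along a path of intermediate potentials.

    Assumed path: V_lam(x) = (1 - lam) * |x|^2 / 2 + lam * V(x), lam from 0 to 1.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_particles, dim))  # exact samples at lam = 0
    for lam in np.linspace(0.0, 1.0, num_levels):
        for _ in range(steps_per_level):
            grad = (1.0 - lam) * x + lam * grad_v(x)
            noise = rng.standard_normal(x.shape)
            x = x - step_size * grad + np.sqrt(2.0 * step_size) * noise
    return x


# Toy target: Gaussian with mean 3 and unit covariance, so grad V(x) = x - 3.
samples = annealed_langevin(lambda x: x - 3.0, dim=2)
```

The toy target is log-concave, so it only exercises the mechanics; the setting of interest in the abstract is a multimodal target, where the gradual path is what helps particles reach separated modes.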
[LINK]
http://arxiv.org/abs/2407.16936v1
[DATE]
2024-07-24 10:15:48+08:00
[CATEGORIES]
cs.LG
Optimizer’s Information Criterion: Dissecting and Correcting Bias in Data-Driven Optimization
[AUTHORS]
Garud Iyengar, Henry Lam, Tianyu Wang
[ABSTRACT]
In data-driven optimization, the sample performance of the obtained decision
typically incurs an optimistic bias against the true performance, a phenomenon
commonly known as the Optimizer’s Curse and intimately related to overfitting
in machine learning. Common techniques to correct this bias, such as
cross-validation, require repeatedly solving additional optimization problems
and are therefore computationally expensive. We develop a general bias
correction approach, building on what we call Optimizer’s Information Criterion
(OIC), that directly approximates the first-order bias and does not require
solving any additional optimization problems. Our OIC generalizes the
celebrated Akaike Information Criterion to evaluate the objective performance
in data-driven optimization, which crucially involves not only model fitting
but also its interplay with the downstream optimization. As such it can be used
for decision selection instead of only model selection. We apply our approach
to a range of data-driven optimization formulations comprising empirical and
parametric models, their regularized counterparts, and furthermore contextual
optimization. Finally, we provide numerical validation on the superior
performance of our approach under synthetic and real-world datasets.
[LINK]
http://arxiv.org/abs/2306.10081v3
[DATE]
2024-07-24 10:08:25+08:00
[CATEGORIES]
cs.LG
Federated Automatic Latent Variable Selection in Multi-output Gaussian Processes
[AUTHORS]
Jingyi Gao, Seokhyun Chung
[ABSTRACT]
This paper explores a federated learning approach that automatically selects
the number of latent processes in multi-output Gaussian processes (MGPs). The
MGP has seen great success as a transfer learning tool when data is generated
from multiple sources/units/entities. A common approach in MGPs to transfer
knowledge across units involves gathering all data from each unit to a central
server and extracting common independent latent processes to express each unit
as a linear combination of the shared latent patterns. However, this approach
poses key challenges in (i) determining the adequate number of latent processes
and (ii) relying on centralized learning which leads to potential privacy risks
and significant computational burdens on the central server. To address these
issues, we propose a hierarchical model that places spike-and-slab priors on
the coefficients of each latent process. These priors help automatically select
only needed latent processes by shrinking the coefficients of unnecessary ones
to zero. To estimate the model while avoiding the drawbacks of centralized
learning, we propose a variational inference-based approach, that formulates
model inference as an optimization problem compatible with federated settings.
We then design a federated learning algorithm that allows units to jointly
select and infer the common latent processes without sharing their data. We
also discuss an efficient learning approach for a new unit within our proposed
federated framework. Simulation and case studies on Li-ion battery degradation
and air temperature data demonstrate the advantageous features of our proposed
approach.
[LINK]
http://arxiv.org/abs/2407.16935v1
[DATE]
2024-07-24 10:03:28+08:00
[CATEGORIES]
cs.LG
Deep Koopman-based Control of Quality Variation in Multistage Manufacturing Systems
[AUTHORS]
Zhiyi Chen, Harshal Maske, Devesh Upadhyay, Huanyi Shui, Xun Huan, Jun Ni
[ABSTRACT]
This paper presents a modeling-control synthesis to address the quality
control challenges in multistage manufacturing systems (MMSs). A new
feedforward control scheme is developed to minimize the quality variations
caused by process disturbances in MMSs. Notably, the control framework
leverages a stochastic deep Koopman (SDK) model to capture the quality
propagation mechanism in the MMSs, highlighted by its ability to transform the
nonlinear propagation dynamics into a linear one. Two roll-to-roll case studies
are presented to validate the proposed method and demonstrate its
effectiveness. The overall method is suitable for nonlinear MMSs and does not
require extensive expert knowledge.
[COMMENTS]
The paper appeared in the proceedings of the 2024 American Control Conference.
This submitted version addresses a minor correction to one equation (Eq. 14),
while the results and conclusions remain the same
[LINK]
http://arxiv.org/abs/2407.16933v1
[DATE]
2024-07-24 09:54:30+08:00
[CATEGORIES]
cs.LG
Enabling On-Device LLMs Personalization with Smartphone Sensing
[AUTHORS]
Shiquan Zhang, Ying Ma, Le Fang, Hong Jia, Simon D’Alfonso, Vassilis Kostakos
[ABSTRACT]
This demo presents a novel end-to-end framework that combines on-device large
language models (LLMs) with smartphone sensing technologies to achieve
context-aware and personalized services. The framework addresses critical
limitations of current personalization solutions via cloud LLMs, such as
privacy concerns, latency and cost, and limited personal information. To
achieve this, we innovatively proposed deploying LLMs on smartphones with
multimodal sensor data through context-aware sensing and customized prompt
engineering, ensuring privacy and enhancing personalization performance. A case
study involving a university student demonstrated the capability of the
framework to provide tailored recommendations. In addition, we show that the
framework achieves the best trade-off in privacy, performance, latency, cost,
battery and energy consumption between on-device and cloud LLMs. To the best of
our knowledge, this is the first framework to provide on-device LLMs
personalization with smartphone sensing. Future work will incorporate more
diverse sensor data and involve extensive user studies to enhance
personalization. Our proposed framework has the potential to substantially
improve user experiences across domains including healthcare, productivity, and
entertainment.
[COMMENTS]
5 pages, 3 figures, conference demo paper
[LINK]
http://arxiv.org/abs/2407.04418v2
[DATE]
2024-07-24 09:32:05+08:00
[CATEGORIES]
cs.LG
DeepCell: A Ubiquitous Accurate Provider-side Cellular-based Localization
[AUTHORS]
Ahmed Shokry, Moustafa Youssef
[ABSTRACT]
Although outdoor localization is already available to the general public and
businesses through the widespread use of GPS, it is not supported by
low-end phones, requires a direct line of sight to satellites and can drain
phone battery quickly. The current fingerprinting solutions can provide
high-accuracy localization but are based on the client side. This limits their
ubiquitous deployment and accuracy. In this paper, we introduce DeepCell: a
provider-side fingerprinting localization system that can provide high accuracy
localization for any cell phone. To build its fingerprint, DeepCell leverages
the unlabeled cellular measurements recorded by the cellular provider while
opportunistically synchronizing with selected client devices to get location
labels. The fingerprint is then used to train a deep neural network model that
is harnessed for localization. To achieve this goal, DeepCell needs to address a
number of challenges including using unlabeled data from the provider side,
handling noise and sparsity, scaling the data to large areas, and finally
providing enough data that is required for training deep models without
overhead. Evaluation of DeepCell in a typical realistic environment shows that
it can achieve a consistent median accuracy of 29m. This accuracy outperforms
the state-of-the-art client-based cellular-based systems by more than 75.4%. In
addition, the same accuracy is extended to low-end phones.
[COMMENTS]
arXiv admin note: substantial text overlap with arXiv:2106.13632
[LINK]
http://arxiv.org/abs/2407.16927v1
[DATE]
2024-07-24 09:28:04+08:00
[CATEGORIES]
cs.LG
Universally Harmonizing Differential Privacy Mechanisms for Federated Learning: Boosting Accuracy and Convergence
[AUTHORS]
Shuya Feng, Meisam Mohammady, Hanbin Hong, Shenao Yan, Ashish Kundu, Binghui Wang, Yuan Hong
[ABSTRACT]
Differentially private federated learning (DP-FL) is a promising technique
for collaborative model training while ensuring provable privacy for clients.
However, optimizing the tradeoff between privacy and accuracy remains a
critical challenge. To the best of our knowledge, we propose the first DP-FL framework
(namely UDP-FL), which universally harmonizes any randomization mechanism
(e.g., an optimal one) with the Gaussian Moments Accountant (viz. DP-SGD) to
significantly boost accuracy and convergence. Specifically, UDP-FL demonstrates
enhanced model performance by mitigating the reliance on Gaussian noise. The
key mediator variable in this transformation is the Rényi Differential
Privacy notion, which is carefully used to harmonize privacy budgets. We also
propose an innovative method to theoretically analyze the convergence for DP-FL
(including our UDP-FL) based on mode connectivity analysis. Moreover, we
evaluate our UDP-FL through extensive experiments benchmarked against
state-of-the-art (SOTA) methods, demonstrating superior performance on both
privacy guarantees and model performance. Notably, UDP-FL exhibits substantial
resilience against different inference attacks, indicating a significant
advance in safeguarding sensitive data in federated learning environments.
[LINK]
http://arxiv.org/abs/2407.14710v2
[DATE]
2024-07-24 09:15:40+08:00
[CATEGORIES]
cs.LG
AI-Driven Guided Response for Security Operation Centers with Microsoft Copilot for Security
[AUTHORS]
Scott Freitas, Jovan Kalajdjieski, Amir Gharib, Robert McCann
[ABSTRACT]
Security operation centers contend with a constant stream of security
incidents, ranging from straightforward to highly complex. To address this, we
developed Copilot Guided Response (CGR), an industry-scale ML architecture that
guides security analysts across three key tasks – (1) investigation, providing
essential historical context by identifying similar incidents; (2) triaging to
ascertain the nature of the incident – whether it is a true positive, false
positive, or benign positive; and (3) remediation, recommending tailored
containment actions. CGR is integrated into the Microsoft Defender XDR product
and deployed worldwide, generating millions of recommendations across thousands
of customers. Our extensive evaluation, incorporating internal evaluation,
collaboration with security experts, and customer feedback, demonstrates that
CGR delivers high-quality recommendations across all three tasks. We provide a
comprehensive overview of the CGR architecture, setting a precedent as the
first cybersecurity company to openly discuss these capabilities in such depth.
Additionally, we introduce GUIDE, the largest public collection of real-world security
incidents, spanning 13M evidences across 1M annotated incidents. By enabling
researchers and practitioners to conduct research on real-world data, GUIDE
advances the state of cybersecurity and supports the development of
next-generation machine learning systems.
[LINK]
http://arxiv.org/abs/2407.09017v3
[DATE]
2024-07-24 09:15:20+08:00
[CATEGORIES]
cs.LG
Guaranteed Trajectory Tracking under Learned Dynamics with Contraction Metrics and Disturbance Estimation
[AUTHORS]
Pan Zhao, Ziyao Guo, Yikun Cheng, Aditya Gahlawat, Hyungsoo Kang, Naira Hovakimyan
[ABSTRACT]
This paper presents an approach to trajectory-centric learning control based
on contraction metrics and disturbance estimation for nonlinear systems subject
to matched uncertainties. The approach uses deep neural networks to learn
uncertain dynamics while still providing guarantees of transient tracking
performance throughout the learning phase. Within the proposed approach, a
disturbance estimation law is adopted to estimate the pointwise value of the
uncertainty, with pre-computable estimation error bounds (EEBs). The learned
dynamics, the estimated disturbances, and the EEBs are then incorporated in a
robust Riemann energy condition to compute the control law that guarantees
exponential convergence of actual trajectories to desired ones throughout the
learning phase, even when the learned model is poor. On the other hand, with
improved accuracy, the learned model can help improve the robustness of the
tracking controller, e.g., against input delays, and can be incorporated to
plan better trajectories with improved performance, e.g., lower energy
consumption and shorter travel time. The proposed framework is validated on a
planar quadrotor example.
[COMMENTS]
18 pages, 8 figures
[LINK]
http://arxiv.org/abs/2112.08222v5
[DATE]
2024-07-24 09:06:36+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Ehsan Gharib-Nezhad, Natasha E. Batalha, Hamed Valizadegan, Miguel J. S. Martinho, Mahdi Habibi, Gopal Nookula
[ABSTRACT]
We are on the verge of a revolutionary era in space exploration, thanks to
advancements in telescopes such as the James Webb Space Telescope
(JWST). High-resolution, high signal-to-noise spectra from exoplanet
and brown dwarf atmospheres have been collected over the past few decades,
requiring the development of accurate and reliable pipelines and tools for
their analysis. Accurately and swiftly determining the spectroscopic parameters
from the observational spectra of these objects is crucial for understanding
their atmospheric composition and guiding future follow-up observations.
TelescopeML is a Python package developed to perform three main tasks:
[COMMENTS]
Please find the accepted paper with complete reference list at
https://joss.theoj.org/papers/10.21105/joss.06346
[LINK]
http://arxiv.org/abs/2407.16917v1
[DATE]
2024-07-24 08:44:52+08:00
[CATEGORIES]
cs.LG
A Library of Mirrors: Deep Neural Nets in Low Dimensions are Convex Lasso Models with Reflection Features
[AUTHORS]
Emi Zeger, Yifei Wang, Aaron Mishkin, Tolga Ergen, Emmanuel Candès, Mert Pilanci
[ABSTRACT]
We prove that training neural networks on 1-D data is equivalent to solving
convex Lasso problems with discrete, explicitly defined dictionary matrices. We
consider neural networks with piecewise linear activations and depths ranging
from 2 to an arbitrary but finite number of layers. We first show that
two-layer networks with piecewise linear activations are equivalent to Lasso
models using a discrete dictionary of ramp functions, with breakpoints
corresponding to the training data points. In certain general architectures
with absolute value or ReLU activations, a third layer surprisingly creates
features that reflect the training data about themselves. Additional layers
progressively generate reflections of these reflections. The Lasso
representation provides valuable insights into the analysis of globally optimal
networks, elucidating their solution landscapes and enabling closed-form
solutions in certain special cases. Numerical results show that reflections
also occur when optimizing standard deep networks using standard non-convex
optimizers. Additionally, we demonstrate our theory with autoregressive time
series models.
[LINK]
http://arxiv.org/abs/2403.01046v4
[DATE]
2024-07-24 08:32:35+08:00
[CATEGORIES]
cs.LG
Cross-Domain Policy Transfer by Representation Alignment via Multi-Domain Behavioral Cloning
[AUTHORS]
Hayato Watahiki, Ryo Iwase, Ryosuke Unno, Yoshimasa Tsuruoka
[ABSTRACT]
Transferring learned skills across diverse situations remains a fundamental
challenge for autonomous agents, particularly when agents are not allowed to
interact with an exact target setup. While prior approaches have predominantly
focused on learning domain translation, they often struggle with handling
significant domain gaps or out-of-distribution tasks. In this paper, we present
a simple approach for cross-domain policy transfer that learns a shared latent
representation across domains and a common abstract policy on top of it. Our
approach leverages multi-domain behavioral cloning on unaligned trajectories of
proxy tasks and employs maximum mean discrepancy (MMD) as a regularization term
to encourage cross-domain alignment. The MMD regularization better preserves
structures of latent state distributions than commonly used
domain-discriminative distribution matching, leading to higher transfer
performance. Moreover, our approach involves training only one multi-domain
policy, which makes extension easier than existing methods. Empirical
evaluations demonstrate the efficacy of our method across various domain
shifts, especially in scenarios where exact domain translation is challenging,
such as cross-morphology or cross-viewpoint settings. Our ablation studies
further reveal that multi-domain behavioral cloning implicitly contributes to
representation alignment alongside domain-adversarial regularization.
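As a rough illustration of the MMD regularization term mentioned above, the following minimal sketch computes a biased estimate of squared MMD between two sets of latent vectors using a Gaussian kernel. The kernel choice, the bandwidth, the function names, and the toy latent vectors are all assumptions made for illustration; they are not taken from the paper or its code.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical latent states encoded from two domains.
source_latents = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
target_aligned = [[0.1, 0.0], [-0.2, 0.1], [0.0, -0.1]]
target_shifted = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

# A smaller value indicates better-aligned latent distributions.
aligned_penalty = mmd_squared(source_latents, target_aligned)
shifted_penalty = mmd_squared(source_latents, target_shifted)
```

In a training loop, a term of this kind would typically be added to the behavioral cloning loss with a weighting coefficient, so that latent states from different domains are pulled toward the same distribution.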
[COMMENTS]
CoLLAs 2024 (Oral). Code:
https://github.com/hwatahiki/portable-latent-policy
[LINK]
http://arxiv.org/abs/2407.16912v1
[DATE]
2024-07-24 08:13:00+08:00
[CATEGORIES]
cs.LG
Trust Your Gut: Comparing Human and Machine Inference from Noisy Visualizations
[AUTHORS]
Ratanond Koonchanok, Michael E. Papka, Khairi Reda
[ABSTRACT]
People commonly utilize visualizations not only to examine a given dataset,
but also to draw generalizable conclusions about the underlying models or
phenomena. Prior research has compared human visual inference to that of an
optimal Bayesian agent, with deviations from rational analysis viewed as
problematic. However, human reliance on non-normative heuristics may prove
advantageous in certain circumstances. We investigate scenarios where human
intuition might surpass idealized statistical rationality. In two experiments,
we examine individuals’ accuracy in characterizing the parameters of known
data-generating models from bivariate visualizations. Our findings indicate
that, although participants generally exhibited lower accuracy compared to
statistical models, they frequently outperformed Bayesian agents, particularly
when faced with extreme samples. Participants appeared to rely on their
internal models to filter out noisy visualizations, thus improving their
resilience against spurious data. However, participants displayed
overconfidence and struggled with uncertainty estimation. They also exhibited
higher variance than statistical machines. Our findings suggest that analyst
gut reactions to visualizations may provide an advantage, even when departing
from rationality. These results carry implications for designing visual
analytics tools, offering new perspectives on how to integrate statistical
models and analyst intuition for improved inference and decision-making. The
data and materials for this paper are available at https://osf.io/qmfv6
[COMMENTS]
To appear in IEEE Transactions on Visualization and Computer Graphics
(Proceedings of IEEE VIS’24)
[LINK]
http://arxiv.org/abs/2407.16871v1
[DATE]
2024-07-24 06:39:57+08:00
[CATEGORIES]
cs.LG
From Text to Insight: Large Language Models for Materials Science Data Extraction
[AUTHORS]
Mara Schilling-Wilhelmi, Martiño Ríos-García, Sherjeel Shabih, María Victoria Gil, Santiago Miret, Christoph T. Koch, José A. Márquez, Kevin Maik Jablonka
[ABSTRACT]
The vast majority of materials science knowledge exists in unstructured
natural language, yet structured data is crucial for innovative and systematic
materials design. Traditionally, the field has relied on manual curation and
partial automation for data extraction for specific use cases. The advent of
large language models (LLMs) represents a significant shift, potentially
enabling efficient extraction of structured, actionable data from unstructured
text by non-experts. While applying LLMs to materials science data extraction
presents unique challenges, domain knowledge offers opportunities to guide and
validate LLM outputs. This review provides a comprehensive overview of
LLM-based structured data extraction in materials science, synthesizing current
knowledge and outlining future directions. We address the lack of standardized
guidelines and present frameworks for leveraging the synergy between LLMs and
materials science expertise. This work serves as a foundational resource for
researchers aiming to harness LLMs for data-driven materials research. The
insights presented here could significantly enhance how researchers across
disciplines access and utilize scientific information, potentially accelerating
the development of novel materials for critical societal needs.
[LINK]
http://arxiv.org/abs/2407.16867v1
[DATE]
2024-07-24 06:23:47+08:00
[CATEGORIES]
cs.LG
Balanced Multi-Relational Graph Clustering
[AUTHORS]
Zhixiang Shen, Haolan He, Zhao Kang
[ABSTRACT]
Multi-relational graph clustering has demonstrated remarkable success in
uncovering underlying patterns in complex networks. Representative methods
manage to align different views motivated by advances in contrastive learning.
Our empirical study finds the pervasive presence of imbalance in real-world
graphs, which is in principle contradictory to the motivation of alignment. In
this paper, we first propose a novel metric, the Aggregation Class Distance, to
empirically quantify structural disparities among different graphs. To address
the challenge of view imbalance, we propose Balanced Multi-Relational Graph
Clustering (BMGC), comprising unsupervised dominant view mining and dual
signals guided representation learning. It dynamically mines the dominant view
throughout the training process, synergistically improving clustering
performance with representation learning. Theoretical analysis ensures the
effectiveness of dominant view mining. Extensive experiments and in-depth
analysis on real-world and synthetic datasets showcase that BMGC achieves
state-of-the-art performance, underscoring its superiority in addressing the
view imbalance inherent in multi-relational graphs. The source code and
datasets are available at https://github.com/zxlearningdeep/BMGC.
[COMMENTS]
Accepted by ACM Multimedia 2024
[LINK]
http://arxiv.org/abs/2407.16863v1
[DATE]
2024-07-24 06:11:13+08:00
[CATEGORIES]
cs.LG
Unexpected Benefits of Self-Modeling in Neural Systems
[AUTHORS]
Vickram N. Premakumar, Michael Vaiana, Florin Pop, Judd Rosenblatt, Diogo Schwerz de Lucena, Kirsten Ziman, Michael S. A. Graziano
[ABSTRACT]
Self-models have been a topic of great interest for decades in studies of
human cognition and more recently in machine learning. Yet what benefits do
self-models confer? Here we show that when artificial networks learn to predict
their internal states as an auxiliary task, they change in a fundamental way.
To better perform the self-model task, the network learns to make itself
simpler, more regularized, more parameter-efficient, and therefore more
amenable to being predictively modeled. To test the hypothesis of
self-regularizing through self-modeling, we used a range of network
architectures performing three classification tasks across two modalities. In
all cases, adding self-modeling caused a significant reduction in network
complexity. The reduction was observed in two ways. First, the distribution of
weights was narrower when self-modeling was present. Second, a measure of
network complexity, the real log canonical threshold (RLCT), was smaller when
self-modeling was present. Not only were measures of complexity reduced, but
the reduction became more pronounced as greater training weight was placed on
the auxiliary task of self-modeling. These results strongly support the
hypothesis that self-modeling is more than simply a network learning to predict
itself. The learning has a restructuring effect, reducing complexity and
increasing parameter efficiency. This self-regularization may help explain some
of the benefits of self-models reported in recent machine learning literature,
as well as the adaptive value of self-models to biological systems. In
particular, these findings may shed light on the possible interaction between
the ability to model oneself and the ability to be more easily modeled by
others in a social or cooperative context.
[LINK]
http://arxiv.org/abs/2407.10188v2
[DATE]
2024-07-24 05:54:12+08:00
[CATEGORIES]
cs.LG
MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection
[AUTHORS]
Ali Behrouz, Michele Santacatterina, Ramin Zabih
[ABSTRACT]
Recent advances in deep learning have mainly relied on Transformers due to
their data dependency and ability to learn at scale. The attention module in
these architectures, however, exhibits quadratic time and space in input size,
limiting their scalability for long-sequence modeling. Despite recent attempts
to design an efficient and effective architecture backbone for multi-dimensional
data, such as images and multivariate time series, existing models are either
data independent, or fail to allow inter- and intra-dimension communication.
Recently, State Space Models (SSMs), and more specifically Selective State
Space Models, with efficient hardware-aware implementation, have shown
promising potential for long sequence modeling. Motivated by the success of
SSMs, we present MambaMixer, a new architecture with data-dependent weights
that uses a dual selection mechanism across tokens and channels, called
Selective Token and Channel Mixer. MambaMixer connects selective mixers using a
weighted averaging mechanism, allowing layers to have direct access to early
features. As a proof of concept, we design Vision MambaMixer (ViM2) and Time
Series MambaMixer (TSM2) architectures based on the MambaMixer block and
explore their performance in various vision and time series forecasting tasks.
Our results underline the importance of selective mixing across both tokens and
channels. In ImageNet classification, object detection, and semantic
segmentation tasks, ViM2 achieves competitive performance with well-established
vision models and outperforms SSM-based vision models. In time series
forecasting, TSM2 achieves outstanding performance compared to state-of-the-art
methods while demonstrating significantly improved computational cost. These
results show that while Transformers, cross-channel attention, and MLPs are
sufficient for good performance in time series forecasting, none of them is
necessary.
[LINK]
http://arxiv.org/abs/2403.19888v4
[DATE]
2024-07-24 05:33:06+08:00
[CATEGORIES]
cs.LG
SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention
[AUTHORS]
Ahan Gupta, Yueming Yuan, Devansh Jain, Yuhao Ge, David Aponte, Yanqi Zhou, Charith Mendis
[ABSTRACT]
Multi-head-self-attention (MHSA) mechanisms achieve state-of-the-art (SOTA)
performance across natural language processing and vision tasks. However, their
quadratic dependence on sequence lengths has bottlenecked inference speeds. To
circumvent this bottleneck, researchers have proposed various sparse-MHSA
models, where a subset of full attention is computed. Despite their promise,
current sparse libraries and compilers do not support high-performance
implementations for diverse sparse-MHSA patterns due to the underlying sparse
formats they operate on. These formats, which are typically designed for
high-performance & scientific computing applications, are either curated for
extreme amounts of random sparsity (<1% non-zero values), or specific sparsity
patterns. However, the sparsity patterns in sparse-MHSA are moderately sparse
(10-50% non-zero values) and varied, resulting in existing sparse-formats
trading off generality for performance.
We bridge this gap, achieving both generality and performance, by proposing a
novel sparse format: affine-compressed-sparse-row (ACSR) and supporting
code-generation scheme, SPLAT, that generates high-performance implementations
for diverse sparse-MHSA patterns on GPUs. Core to our proposed format and code
generation algorithm is the observation that common sparse-MHSA patterns have
uniquely regular geometric properties. These properties, which can be analyzed
just-in-time, expose novel optimizations and tiling strategies that SPLAT
exploits to generate high-performance implementations for diverse patterns. To
demonstrate SPLAT’s efficacy, we use it to generate code for various
sparse-MHSA models, achieving geomean speedups of 2.05x and 4.05x over
hand-written kernels written in Triton and TVM, respectively, on A100 GPUs.
Moreover, its interfaces are intuitive and easy to use with existing
implementations of MHSA in JAX.
[COMMENTS]
31 pages, 16 figures
[LINK]
http://arxiv.org/abs/2407.16847v1
[DATE]
2024-07-24 05:18:07+08:00
[CATEGORIES]
cs.LG
Active Learning of Piecewise Gaussian Process Surrogates
[AUTHORS]
Chiwoo Park, Robert Waelder, Bonggwon Kang, Benji Maruyama, Soondo Hong, Robert Gramacy
[ABSTRACT]
Active learning of Gaussian process (GP) surrogates has been useful for
optimizing experimental designs for physical/computer simulation experiments,
and for steering data acquisition schemes in machine learning. In this paper,
we develop a method for active learning of piecewise, Jump GP surrogates. Jump
GPs are continuous within, but discontinuous across, regions of a design space,
as required for applications spanning autonomous materials design,
configuration of smart factory systems, and many others. Although our active
learning heuristics are appropriated from strategies originally designed for
ordinary GPs, we demonstrate that additionally accounting for model bias, as
opposed to the usual model uncertainty, is essential in the Jump GP context.
Toward that end, we develop an estimator for bias and variance of Jump GP
models. Illustrations, and evidence of the advantage of our proposed methods,
are provided on a suite of synthetic benchmarks, and real-simulation
experiments of varying complexity.
[COMMENTS]
The main algorithm of this work is protected by a patent pending with
application number 18/532,296
[LINK]
http://arxiv.org/abs/2301.08789v2
[DATE]
2024-07-24 05:06:04+08:00
[CATEGORIES]
cs.LG
Privacy-preserving machine learning with tensor networks
[AUTHORS]
Alejandro Pozas-Kerstjens, Senaida Hernández-Santana, José Ramón Pareja Monturiol, Marco Castrillón López, Giannicola Scarpa, Carlos E. González-Guillén, David Pérez-García
[ABSTRACT]
Tensor networks, widely used for providing efficient representations of
low-energy states of local quantum many-body systems, have been recently
proposed as machine learning architectures which could present advantages with
respect to traditional ones. In this work we show that tensor network
architectures have especially prospective properties for privacy-preserving
machine learning, which is important in tasks such as the processing of medical
records. First, we describe a new privacy vulnerability that is present in
feedforward neural networks, illustrating it in synthetic and real-world
datasets. Then, we develop well-defined conditions to guarantee robustness to
such vulnerability, which involve the characterization of models equivalent
under gauge symmetry. We rigorously prove that such conditions are satisfied by
tensor-network architectures. In doing so, we define a novel canonical form for
matrix product states, which has a high degree of regularity and fixes the
residual gauge that is left in the canonical forms based on singular value
decompositions. We supplement the analytical findings with practical examples
where matrix product states are trained on datasets of medical records, which
show large reductions in the probability of an attacker extracting information
about the training dataset from the model’s parameters. Given the growing
expertise in training tensor-network architectures, these results imply that
one may not have to be forced to make a choice between accuracy in prediction
and ensuring the privacy of the information processed.
[COMMENTS]
16 pages, 2 figures. Quantumarticle 6.1. The computational appendix
is available at https://www.github.com/apozas/private-tn V3: Published
version
[LINK]
http://arxiv.org/abs/2202.12319v3
[DATE]
2024-07-24 04:47:04+08:00
[CATEGORIES]
cs.LG
Pareto Front Approximation for Multi-Objective Session-Based Recommender Systems
[AUTHORS]
Timo Wilm, Philipp Normann, Felix Stepprath
[ABSTRACT]
This work introduces MultiTRON, an approach that adapts Pareto front
approximation techniques to multi-objective session-based recommender systems
using a transformer neural network. Our approach optimizes trade-offs between
key metrics such as click-through and conversion rates by training on sampled
preference vectors. A significant advantage is that after training, a single
model can access the entire Pareto front, allowing it to be tailored to meet
the specific requirements of different stakeholders by adjusting an additional
input vector that weights the objectives. We validate the model’s performance
through extensive offline and online evaluation. For broader application and
research, the source code is made available at
https://github.com/otto-de/MultiTRON . The results confirm the model’s ability
to manage multiple recommendation objectives effectively, offering a flexible
tool for diverse business needs.
[LINK]
http://arxiv.org/abs/2407.16828v1
[DATE]
2024-07-24 04:38:23+08:00
[CATEGORIES]
cs.LG
Symplectic Structure-Aware Hamiltonian (Graph) Embeddings
[AUTHORS]
Jiaxu Liu, Xinping Yi, Tianle Zhang, Xiaowei Huang
[ABSTRACT]
In traditional Graph Neural Networks (GNNs), the assumption of a fixed
embedding manifold often limits their adaptability to diverse graph geometries.
Recently, Hamiltonian system-inspired GNNs have been proposed to address the
dynamic nature of such embeddings by incorporating physical laws into node
feature updates. We present Symplectic Structure-Aware Hamiltonian GNN
(SAH-GNN), a novel approach that generalizes Hamiltonian dynamics for more
flexible node feature updates. Unlike existing Hamiltonian approaches, SAH-GNN
employs Riemannian optimization on the symplectic Stiefel manifold to
adaptively learn the underlying symplectic structure, circumventing the
limitations of existing Hamiltonian GNNs that rely on a pre-defined form of
standard symplectic structure. This innovation allows SAH-GNN to automatically
adapt to various graph datasets without extensive hyperparameter tuning.
Moreover, it conserves energy during training, meaning the implicit Hamiltonian
system is physically meaningful. Finally, we empirically validate SAH-GNN’s
superiority and adaptability in node classification tasks across multiple types
of graph datasets.
[COMMENTS]
A Note
[LINK]
http://arxiv.org/abs/2309.04885v4
[DATE]
2024-07-24 04:10:42+08:00
[CATEGORIES]
cs.LG
Quantum Implicit Neural Representations
[AUTHORS]
Jiaming Zhao, Wenbo Qiao, Peng Zhang, Hui Gao
[ABSTRACT]
Implicit neural representations have emerged as a powerful paradigm to
represent signals such as images and sounds. This approach aims to utilize
neural networks to parameterize the implicit function of the signal. However,
when representing implicit functions, traditional neural networks such as
ReLU-based multilayer perceptrons face challenges in accurately modeling
high-frequency components of signals. Recent research has begun to explore the
use of Fourier Neural Networks (FNNs) to overcome this limitation. In this
paper, we propose Quantum Implicit Representation Network (QIREN), a novel
quantum generalization of FNNs. Furthermore, through theoretical analysis, we
demonstrate that QIREN possesses a quantum advantage over classical FNNs.
Lastly, we conducted experiments in signal representation, image
superresolution, and image generation tasks to show the superior performance of
QIREN compared to state-of-the-art (SOTA) models. Our work not only
incorporates quantum advantages into implicit neural representations but also
uncovers a promising application direction for Quantum Neural Networks.
[COMMENTS]
This paper was accepted by ICML 2024
[LINK]
http://arxiv.org/abs/2406.03873v2
[DATE]
2024-07-24 03:43:09+08:00
[CATEGORIES]
cs.LG
In Search for Architectures and Loss Functions in Multi-Objective Reinforcement Learning
[AUTHORS]
Mikhail Terekhov, Caglar Gulcehre
[ABSTRACT]
Multi-objective reinforcement learning (MORL) is essential for addressing the
intricacies of real-world RL problems, which often require trade-offs between
multiple utility functions. However, MORL is challenging due to unstable
learning dynamics with deep learning-based function approximators. The research
path most taken has been to explore different value-based loss functions for
MORL to overcome this issue. Our work empirically explores model-free policy
learning loss functions and the impact of different architectural choices. We
introduce two different approaches: Multi-objective Proximal Policy
Optimization (MOPPO), which extends PPO to MORL, and Multi-objective Advantage
Actor Critic (MOA2C), which acts as a simple baseline in our ablations. Our
proposed approach is straightforward to implement, requiring only small
modifications at the level of function approximator. We conduct comprehensive
evaluations on the MORL Deep Sea Treasure, Minecart, and Reacher environments
and show that MOPPO effectively captures the Pareto front. Our extensive
ablation studies and empirical analyses reveal the impact of different
architectural choices, underscoring the robustness and versatility of MOPPO
compared to popular MORL approaches like Pareto Conditioned Networks (PCN) and
Envelope Q-learning in terms of MORL metrics, including hypervolume and
expected utility.
[COMMENTS]
20 pages, 10 figures, 3 tables
[LINK]
http://arxiv.org/abs/2407.16807v1
[DATE]
2024-07-24 03:17:47+08:00
[CATEGORIES]
cs.LG
Fusion and Cross-Modal Transfer for Zero-Shot Human Action Recognition
[AUTHORS]
Abhi Kamboj, Anh Duy Nguyen, Minh Do
[ABSTRACT]
Despite living in a multi-sensory world, most AI models are limited to
textual and visual interpretations of human motion and behavior. Inertial
measurement units (IMUs) provide a salient signal to understand human motion;
however, they are challenging to use due to their uninterpretability and
scarcity of their data. We investigate a method to transfer knowledge between
visual and inertial modalities using the structure of an informative joint
representation space designed for human action recognition (HAR). We apply the
resulting Fusion and Cross-modal Transfer (FACT) method to a novel setup, where
the model does not have access to labeled IMU data during training and is able
to perform HAR with only IMU data during testing. Extensive experiments on a
wide range of RGB-IMU datasets demonstrate that FACT significantly outperforms
existing methods in zero-shot cross-modal transfer.
[LINK]
http://arxiv.org/abs/2407.16803v1
[DATE]
2024-07-24 03:06:44+08:00
[CATEGORIES]
cs.LG
Distribution-Aware Robust Learning from Long-Tailed Data with Noisy Labels
[AUTHORS]
Jae Soon Baik, In Young Yoon, Kun Hoon Kim, Jun Won Choi
[ABSTRACT]
Deep neural networks have demonstrated remarkable advancements in various
fields using large, well-annotated datasets. However, real-world data often
exhibit long-tailed distributions and label noise, significantly degrading
generalization performance. Recent studies addressing these issues have focused
on noisy sample selection methods that estimate the centroid of each class
based on high-confidence samples within each target class. The performance of
these methods is limited because they use only the training samples within each
class for class centroid estimation, making the quality of centroids
susceptible to long-tailed distributions and noisy labels. In this study, we
present a robust training framework called Distribution-aware Sample Selection
and Contrastive Learning (DaSC). Specifically, DaSC introduces a
Distribution-aware Class Centroid Estimation (DaCC) to generate enhanced class
centroids. DaCC performs weighted averaging of the features from all samples,
with weights determined based on model predictions. Additionally, we propose a
confidence-aware contrastive learning strategy to obtain balanced and robust
representations. The training samples are categorized into high-confidence and
low-confidence samples. Our method then applies Semi-supervised Balanced
Contrastive Loss (SBCL) using high-confidence samples, leveraging reliable
label information to mitigate class bias. For the low-confidence samples, our
method computes Mixup-enhanced Instance Discrimination Loss (MIDL) to improve
their representations in a self-supervised manner. Our experimental results on
CIFAR and real-world noisy-label datasets demonstrate the superior performance
of the proposed DaSC compared to previous approaches.
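The weighted averaging step described above can be sketched as follows. This is only one plausible reading of the abstract: it assumes the weight of each sample for a class is the model's predicted probability for that class, and that every class receives a positive total weight. The function name and the toy data are hypothetical and do not come from the paper.

```python
def weighted_class_centroids(features, probs):
    """Estimate one centroid per class from ALL samples.

    features: list of feature vectors (one per sample).
    probs: list of per-class predicted probabilities (one list per sample).
    Assumption: sample i contributes to class c with weight probs[i][c].
    """
    num_classes = len(probs[0])
    dim = len(features[0])
    centroids = []
    for c in range(num_classes):
        # Assumed to be positive for every class.
        total_weight = sum(p[c] for p in probs)
        centroid = [
            sum(p[c] * f[d] for f, p in zip(features, probs)) / total_weight
            for d in range(dim)
        ]
        centroids.append(centroid)
    return centroids


# Toy example: three samples, two feature dimensions, two classes.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
centroids = weighted_class_centroids(features, probs)
```

Because every sample contributes in proportion to its predicted probability, a tail class with few labeled samples can still draw on features from the whole dataset, which is the motivation the abstract gives for not restricting centroid estimation to samples within each class.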
[LINK]
http://arxiv.org/abs/2407.16802v1
[DATE]
2024-07-24 03:06:15+08:00
[CATEGORIES]
cs.LG
Wasserstein Distributionally Robust Shallow Convex Neural Networks
[AUTHORS]
Julien Pallage, Antoine Lesage-Landry
[ABSTRACT]
In this work, we propose Wasserstein distributionally robust shallow convex
neural networks (WaDiRo-SCNNs) to provide reliable nonlinear predictions when
subject to adverse and corrupted datasets. Our approach is based on a new
convex training program for ReLU shallow neural networks which allows us to
cast the problem as an exact, tractable reformulation of its order-1
Wasserstein distributionally robust equivalent. Our training procedure is
conservative by design, has low stochasticity, is solvable with open-source
solvers, and is scalable to large industrial deployments. We provide
out-of-sample performance guarantees and show that hard convex physical
constraints can be enforced in the training program. WaDiRo-SCNN aims to make
neural networks safer for critical applications, such as in the energy sector.
Finally, we numerically demonstrate the performance of our model on a synthetic
experiment and a real-world power system application, i.e., the prediction of
non-residential buildings’ hourly energy consumption. The experimental results
are convincing and showcase the strengths of the proposed model.
[LINK]
http://arxiv.org/abs/2407.16800v1
[DATE]
2024-07-24 03:01:53+08:00
[CATEGORIES]
cs.LG
Multi-Type Point Cloud Autoencoder: A Complete Equivariant Embedding for Molecule Conformation and Pose
[AUTHORS]
Michael Kilgour, Mark Tuckerman, Jutta Rogal
[ABSTRACT]
The point cloud is a flexible representation for a wide variety of data
types, and is a particularly natural fit for the 3D conformations of molecules.
Extant molecule embedding/representation schemes typically focus on internal
degrees of freedom, ignoring the global 3D orientation. For tasks that depend
on knowledge of both molecular conformation and 3D orientation, such as the
generation of molecular dimers, clusters, or condensed phases, we require a
representation which is provably complete in the types and positions of atomic
nuclei and roto-inversion equivariant with respect to the input point cloud. We
develop, train, and evaluate a new type of autoencoder, molecular O(3) encoding
net (Mo3ENet), for multi-type point clouds, for which we propose a new
reconstruction loss, capitalizing on a Gaussian mixture representation of the
input and output point clouds. Mo3ENet is end-to-end equivariant, meaning the
learned representation can be manipulated on O(3), a practical bonus for
downstream learning tasks. An appropriately trained Mo3ENet latent space
comprises a universal embedding for scalar and vector molecule property
prediction tasks, as well as other downstream tasks incorporating the 3D
molecular pose.
[COMMENTS]
16 pages, 8 figures, including main text, bibliography and
supplemental material
[LINK]
http://arxiv.org/abs/2405.13791v2
[DATE]
2024-07-24 02:24:02+08:00
[CATEGORIES]
cs.LG
Molecular Topological Profile (MOLTOP) – Simple and Strong Baseline for Molecular Graph Classification
[AUTHORS]
Jakub Adamczyk, Wojciech Czech
[ABSTRACT]
We revisit the effectiveness of topological descriptors for molecular graph
classification and design a simple, yet strong baseline. We demonstrate that a
simple approach to feature engineering - employing histogram aggregation of
edge descriptors and one-hot encoding for atomic numbers and bond types - when
combined with a Random Forest classifier, can establish a strong baseline for
Graph Neural Networks (GNNs). The novel algorithm, Molecular Topological
Profile (MOLTOP), integrates Edge Betweenness Centrality, Adjusted Rand Index
and SCAN Structural Similarity score. This approach proves to be remarkably
competitive when compared to modern GNNs, while also being simple, fast,
low-variance and hyperparameter-free. Our approach is rigorously tested on
MoleculeNet datasets using the fair evaluation protocol provided by Open Graph
Benchmark. We additionally show out-of-domain generation capabilities on
the peptide classification task from Long Range Graph Benchmark. The evaluations
across eleven benchmark datasets reveal MOLTOP’s strong discriminative
capabilities, surpassing the $1$-WL test and even $3$-WL test for some classes
of graphs. Our conclusion is that descriptor-based baselines, such as the one
we propose, are still crucial for accurately assessing advancements in the GNN
domain.
[LINK]
http://arxiv.org/abs/2407.12136v3
[DATE]
2024-07-24 01:58:52+08:00
[CATEGORIES]
cs.LG
Automatic Equalization for Individual Instrument Tracks Using Convolutional Neural Networks
[AUTHORS]
Florian Mockenhaupt, Joscha Simon Rieber, Shahan Nercessian
[ABSTRACT]
We propose a novel approach for the automatic equalization of individual
musical instrument tracks. Our method begins by identifying the instrument
present within a source recording in order to choose its corresponding ideal
spectrum as a target. Next, the spectral difference between the recording and
the target is calculated, and accordingly, an equalizer matching model is used
to predict settings for a parametric equalizer. To this end, we build upon a
differentiable parametric equalizer matching neural network, demonstrating
improvements relative to the previously established state of the art. Unlike past
approaches, we show how our system naturally allows real-world audio data to be
leveraged during the training of our matching model, effectively generating
suitably produced training targets in an automated manner mirroring conditions
at inference time. Consequently, we illustrate how fine-tuning our matching
model on such examples considerably improves parametric equalizer matching
performance in real-world scenarios, decreasing mean absolute error by 24%
relative to methods relying solely on random parameter sampling techniques as a
self-supervised learning strategy. We perform listening tests, and demonstrate
that our proposed automatic equalization solution subjectively enhances the
tonal characteristics for recordings of common instrument types.
[COMMENTS]
8 pages, 9 figures. Accepted to the 27th International Conference on
Digital Audio Effects (DAFx24)
[LINK]
http://arxiv.org/abs/2407.16691v1
[DATE]
2024-07-24 01:55:25+08:00
[CATEGORIES]
cs.LG
From Imitation to Refinement – Residual RL for Precise Visual Assembly
[AUTHORS]
Lars Ankile, Anthony Simeonov, Idan Shenfeld, Marcel Torne, Pulkit Agrawal
[ABSTRACT]
Behavior cloning (BC) currently stands as a dominant paradigm for learning
real-world visual manipulation. However, in tasks that require locally
corrective behaviors like multi-part assembly, learning robust policies purely
from human demonstrations remains challenging. Reinforcement learning (RL) can
mitigate these limitations by allowing policies to acquire locally corrective
behaviors through task reward supervision and exploration. This paper explores
the use of RL fine-tuning to improve upon BC-trained policies in precise
manipulation tasks. We analyze and overcome technical challenges associated
with using RL to directly train policy networks that incorporate modern
architectural components like diffusion models and action chunking. We propose
training residual policies on top of frozen BC-trained diffusion models using
standard policy gradient methods and sparse rewards, an approach we call ResiP
(Residual for Precise manipulation). Our experimental results demonstrate that
this residual learning framework can significantly improve success rates beyond
the base BC-trained models in high-precision assembly tasks by learning
corrective actions. We also show that by combining ResiP with teacher-student
distillation and visual domain randomization, our method can enable learning
real-world policies for robotic assembly directly from RGB images. Find videos
and code at https://residual-assembly.github.io.
[LINK]
http://arxiv.org/abs/2407.16677v1
[DATE]
2024-07-24 01:44:54+08:00
[CATEGORIES]
cs.LG
PLM-Net: Perception Latency Mitigation Network for Vision-Based Lateral Control of Autonomous Vehicles
[AUTHORS]
Aws Khalil, Jaerock Kwon
[ABSTRACT]
This study introduces the Perception Latency Mitigation Network (PLM-Net), a
novel deep learning approach for addressing perception latency in vision-based
Autonomous Vehicle (AV) lateral control systems. Perception latency is the
delay between capturing the environment through vision sensors (e.g., cameras)
and applying an action (e.g., steering). This issue is understudied in both
classical and neural-network-based control methods. Reducing this latency with
powerful GPUs and FPGAs is possible but impractical for automotive platforms.
PLM-Net comprises the Base Model (BM) and the Timed Action Prediction Model
(TAPM). BM represents the original Lane Keeping Assist (LKA) system, while TAPM
predicts future actions for different latency values. By integrating these
models, PLM-Net mitigates perception latency. The final output is determined
through linear interpolation of BM and TAPM outputs based on real-time latency.
This design addresses both constant and varying latency, improving driving
trajectories and steering control. Experimental results validate the efficacy
of PLM-Net across various latency conditions. Source code:
https://github.com/AwsKhalil/oscar/tree/devel-plm-net.
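The interpolation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that one steering prediction is available per fixed latency value (the first standing in for the BM output, the rest for TAPM outputs) and that a measured latency outside the covered range is clamped. The function name, the latency grid, and the numbers in the usage note are hypothetical.

```python
from bisect import bisect_right


def interpolate_steering(latency_s, latencies, predictions):
    """Linearly interpolate a steering command for a measured latency.

    latencies: sorted latency values in seconds (hypothetical grid).
    predictions: one steering value per latency value; hypothetical
    stand-ins for the BM and TAPM outputs.
    """
    if not latencies or len(latencies) != len(predictions):
        raise ValueError("need equally sized, non-empty inputs")
    # Clamp to the covered latency range (an assumption of this sketch).
    if latency_s <= latencies[0]:
        return predictions[0]
    if latency_s >= latencies[-1]:
        return predictions[-1]
    # Locate the two grid points that bracket the measured latency.
    upper = bisect_right(latencies, latency_s)
    lower = upper - 1
    span = latencies[upper] - latencies[lower]
    weight = (latency_s - latencies[lower]) / span
    return (1.0 - weight) * predictions[lower] + weight * predictions[upper]
```

For example, with latencies `[0.0, 0.1, 0.2]` and predictions `[0.0, 0.2, 0.6]`, a measured latency of 0.15 s falls halfway between the last two grid points, so the result is the midpoint of 0.2 and 0.6.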
[COMMENTS]
13 pages excluding the appendixes. 19 pages including appendixes
[LINK]
http://arxiv.org/abs/2407.16740v1
[DATE]
2024-07-24 01:41:13+08:00
[CATEGORIES]
cs.LG
Forecasting Automotive Supply Chain Disruption with Heterogeneous Time Series
[AUTHORS]
Bach Viet Do, Xingyu Li, Chaoye Pan, Oleg Gusikhin
[ABSTRACT]
Operational disruptions can significantly impact companies' performance. Ford,
with its 37 plants globally, uses 17 billion parts annually to manufacture six
million cars and trucks. With up to ten tiers of suppliers between the company
and raw materials, any extended disruption in this supply chain can cause
substantial financial losses. Therefore, the ability to forecast and identify
such disruptions early is crucial for maintaining seamless operations. In this
study, we demonstrate how we construct a dataset consisting of many
multivariate time series to forecast first-tier supply chain disruptions,
utilizing features related to capacity, inventory, utilization, and processing,
as outlined in the classical Factory Physics framework. This dataset is
technically challenging due to its vast scale of over five hundred thousand
time series. Furthermore, these time series, while exhibiting certain
similarities, also display heterogeneity within specific subgroups. To address
these challenges, we propose a novel methodology that integrates an enhanced
Attention Sequence to Sequence Deep Learning architecture, using Neural Network
Embeddings to model group effects, with a Survival Analysis model. This model
is designed to learn intricate heterogeneous data patterns related to
operational disruptions. Our model has demonstrated strong performance,
achieving 0.85 precision and 0.8 recall during the Quality Assurance (QA) phase
across Ford’s five North American plants. Additionally, to address the common
criticism of Machine Learning models as black boxes, we show how the SHAP
framework can be used to generate feature importance from the model
predictions. It offers valuable insights that can lead to actionable strategies
and highlights the potential of advanced machine learning for managing and
mitigating supply chain risks in the automotive industry.
[LINK]
http://arxiv.org/abs/2407.16739v1
[DATE]
2024-07-24 01:28:10+08:00
[CATEGORIES]
cs.LG
S-E Pipeline: A Vision Transformer (ViT) based Resilient Classification Pipeline for Medical Imaging Against Adversarial Attacks
[AUTHORS]
Neha A S, Vivek Chaturvedi, Muhammad Shafique
[ABSTRACT]
Vision Transformer (ViT) is becoming widely popular in automating accurate
disease diagnosis in medical imaging owing to its robust self-attention
mechanism. However, ViTs remain vulnerable to adversarial attacks that may
thwart the diagnosis process by leading it to intentional misclassification of
critical disease. In this paper, we propose a novel image classification
pipeline, namely, S-E Pipeline, that performs multiple pre-processing steps
that allow ViT to be trained on critical features so as to reduce the impact of
input perturbations by adversaries. Our method uses a combination of
segmentation and image enhancement techniques such as Contrast Limited Adaptive
Histogram Equalization (CLAHE), Unsharp Masking (UM), and High-Frequency
Emphasis filtering (HFE) as preprocessing steps to identify critical features
that remain intact even after adversarial perturbations. The experimental study
demonstrates that our novel pipeline helps in reducing the effect of
adversarial attacks by 72.22% for the ViT-b32 model and 86.58% for the ViT-l32
model. Furthermore, we have shown an end-to-end deployment of our proposed
method on the NVIDIA Jetson Orin Nano board to demonstrate its practical use
case in modern hand-held devices that are usually resource-constrained.
[LINK]
http://arxiv.org/abs/2407.17587v1
[DATE]
2024-07-24 01:20:40+08:00
[CATEGORIES]
cs.LG
Deep-Graph-Sprints: Accelerated Representation Learning in Continuous-Time Dynamic Graphs
[AUTHORS]
Ahmad Naser Eddin, Jacopo Bono, David Aparício, Hugo Ferreira, Pedro Ribeiro, Pedro Bizarro
[ABSTRACT]
Continuous-time dynamic graphs (CTDGs) are essential for modeling
interconnected, evolving systems. Traditional methods for extracting knowledge
from these graphs often depend on feature engineering or deep learning. Feature
engineering is limited by the manual and time-intensive nature of crafting
features, while deep learning approaches suffer from high inference latency,
making them impractical for real-time applications. This paper introduces
Deep-Graph-Sprints (DGS), a novel deep learning architecture designed for
efficient representation learning on CTDGs with low-latency inference
requirements. We benchmark DGS against state-of-the-art feature engineering and
graph neural network methods using five diverse datasets. The results indicate
that DGS achieves competitive performance while improving inference speed by up to
12x compared to other deep learning approaches on our tested benchmarks. Our
method effectively bridges the gap between deep representation learning and
low-latency application requirements for CTDGs.
[LINK]
http://arxiv.org/abs/2407.07712v2
[DATE]
2024-07-24 01:01:12+08:00
[CATEGORIES]
cs.LG
Synthesizer Sound Matching Using Audio Spectrogram Transformers
[AUTHORS]
Fred Bruford, Frederik Blang, Shahan Nercessian
[ABSTRACT]
Systems for synthesizer sound matching, which automatically set the
parameters of a synthesizer to emulate an input sound, have the potential to
make the process of synthesizer programming faster and easier for novice and
experienced musicians alike, whilst also affording new means of interaction
with synthesizers. Considering the enormous variety of synthesizers in the
marketplace, and the complexity of many of them, general-purpose sound matching
systems that function with minimal knowledge or prior assumptions about the
underlying synthesis architecture are particularly desirable. With this in
mind, we introduce a synthesizer sound matching model based on the Audio
Spectrogram Transformer. We demonstrate the viability of this model by training
on a large synthetic dataset of randomly generated samples from the popular
Massive synthesizer. We show that this model can reconstruct parameters of
samples generated from a set of 16 parameters, highlighting its improved
fidelity relative to multi-layer perceptron and convolutional neural network
baselines. We also provide audio examples demonstrating the out-of-domain model
performance in emulating vocal imitations, and sounds from other synthesizers
and musical instruments.
[COMMENTS]
4 pages, 1 figure. Accepted to the 27th International Conference on
Digital Audio Effects (DAFx24)
[LINK]
http://arxiv.org/abs/2407.16643v1
[DATE]
2024-07-24 00:58:14+08:00
[CATEGORIES]
cs.LG
World Model on Million-Length Video And Language With Blockwise RingAttention
[AUTHORS]
Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel
[ABSTRACT]
Current language models fall short in understanding aspects of the world not
easily described in words, and struggle with complex, long-form tasks. Video
sequences offer valuable temporal information absent in language and static
images, making them attractive for joint modeling with language. Such models
could develop an understanding of both human textual knowledge and the physical
world, enabling broader AI capabilities for assisting humans. However, learning
from millions of tokens of video and language sequences poses challenges due to
memory constraints, computational complexity, and limited datasets. To address
these challenges, we curate a large dataset of diverse videos and books,
utilize the Blockwise RingAttention technique to scalably train on long
sequences, and gradually increase context size from 4K to 1M tokens. This paper
makes the following contributions: (a) Largest context size neural network: We
train one of the largest context size transformers on long video and language
sequences, setting new benchmarks in difficult retrieval tasks and long video
understanding. (b) Solutions for overcoming vision-language training
challenges, including using masked sequence packing for mixing different
sequence lengths, loss weighting to balance language and vision, and
model-generated QA dataset for long sequence chat. (c) A highly-optimized
implementation with RingAttention, Blockwise Transformers, masked sequence
packing, and other key features for training on millions-length multimodal
sequences. (d) Fully open-sourced a family of 7B parameter models capable of
processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM,
LWM-Chat) of over 1M tokens. This work paves the way for training on massive
datasets of long video and language to develop understanding of both human
knowledge and the multimodal world, and broader capabilities.
[LINK]
http://arxiv.org/abs/2402.08268v3
[DATE]
2024-07-24 00:57:26+08:00
[CATEGORIES]
cs.LG
Aggregation of expert advice, revisited
[AUTHORS]
Aryeh Kontorovich
[ABSTRACT]
We revisit the classic problem of aggregating binary advice from
conditionally independent experts, also known as the Naive Bayes setting. Our
quantity of interest is the error probability of the optimal decision rule. In
the symmetric case (sensitivity = specificity), reasonably tight bounds on the
optimal error probability are known. In the general asymmetric case, we are not
aware of any nontrivial estimates on this quantity. Our contribution consists
of sharp upper and lower bounds on the optimal error probability in the general
case, which recover and sharpen the best known results in the symmetric special
case. Since this amounts to estimating the total variation distance between two
product distributions, our results also have bearing on this important and
challenging problem.
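For a handful of experts, the quantity of interest can be computed exactly by enumeration, which may help make the setting concrete. The sketch below assumes a uniform prior over the binary label (an assumption, not stated in the abstract); under that prior the optimal error equals half of one minus the total variation distance between the two product distributions over expert votes. The function names are hypothetical, and the cost grows as 2^n in the number of experts.

```python
from itertools import product


def optimal_error(sensitivities, specificities):
    """Exact optimal error under a uniform label prior (sketch assumption).

    sensitivities[i] = P(expert i votes 1 | label = 1)
    specificities[i] = P(expert i votes 0 | label = 0)
    """
    n = len(sensitivities)
    error = 0.0
    for votes in product((0, 1), repeat=n):
        p_given_1 = 1.0  # P(votes | label = 1)
        p_given_0 = 1.0  # P(votes | label = 0)
        for vote, sens, spec in zip(votes, sensitivities, specificities):
            p_given_1 *= sens if vote == 1 else (1.0 - sens)
            p_given_0 *= (1.0 - spec) if vote == 1 else spec
        # The optimal rule picks the likelier label, so it errs with the
        # smaller of the two joint probabilities for this vote pattern.
        error += 0.5 * min(p_given_0, p_given_1)
    return error


def total_variation(sensitivities, specificities):
    """Total variation distance between the two vote distributions."""
    return 1.0 - 2.0 * optimal_error(sensitivities, specificities)
```

With a single expert of sensitivity 0.9 and specificity 0.7, the two vote distributions are Bernoulli(0.9) and Bernoulli(0.3), so the total variation distance is 0.6 and the optimal error is 0.2.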
[LINK]
http://arxiv.org/abs/2407.16642v1
[DATE]
2024-07-24 00:57:10+08:00
[CATEGORIES]
cs.LG
A Geometry-Aware Algorithm to Learn Hierarchical Embeddings in Hyperbolic Space
[AUTHORS]
Zhangyu Wang, Lantian Xu, Zhifeng Kong, Weilong Wang, Xuyu Peng, Enyang Zheng
[ABSTRACT]
Hyperbolic embeddings are a class of representation learning methods that
offer competitive performances when data can be abstracted as a tree-like
graph. However, in practice, learning hyperbolic embeddings of hierarchical
data is difficult due to the different geometry between hyperbolic space and
the Euclidean space. To address such difficulties, we first categorize three
kinds of illness that harm the performance of the embeddings. Then, we develop
a geometry-aware algorithm using a dilation operation and a transitive closure
regularization to tackle these illnesses. We empirically validate these
techniques and present a theoretical analysis of the mechanism behind the
dilation operation. Experiments on synthetic and real-world datasets reveal
superior performances of our algorithm.
[LINK]
http://arxiv.org/abs/2407.16641v1
[DATE]
2024-07-24 00:56:59+08:00
[CATEGORIES]
cs.LG
Graph Neural Networks for Learning Equivariant Representations of Neural Networks
[AUTHORS]
Miltiadis Kofinas, Boris Knyazev, Yan Zhang, Yunlu Chen, Gertjan J. Burghouts, Efstratios Gavves, Cees G. M. Snoek, David W. Zhang
[COMMENTS]
In ICLR 2024. Source code: https://github.com/mkofinas/neural-graphs
[LINK]
http://arxiv.org/abs/2403.12143v3
[DATE]
2024-07-24 00:30:10+08:00
[CATEGORIES]
cs.LG
ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy
[AUTHORS]
Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu
[ABSTRACT]
Modern computer vision offers a great variety of models to practitioners, and
selecting a model from multiple options for specific applications can be
challenging. Conventionally, competing model architectures and training
protocols are compared by their classification accuracy on ImageNet. However,
this single metric does not fully capture performance nuances critical for
specialized tasks. In this work, we conduct an in-depth comparative analysis of
model behaviors beyond ImageNet accuracy, for both ConvNet and Vision
Transformer architectures, each across supervised and CLIP training paradigms.
Although our selected models have similar ImageNet accuracies and compute
requirements, we find that they differ in many other aspects: types of
mistakes, output calibration, transferability, and feature invariance, among
others. This diversity in model characteristics, not captured by traditional
metrics, highlights the need for more nuanced analysis when choosing among
different models. Our code is available at
https://github.com/kirill-vish/Beyond-INet.
[COMMENTS]
Project page: https://kirill-vish.github.io/beyond-imagenet-accuracy/
[LINK]
http://arxiv.org/abs/2311.09215v3
[DATE]
2024-07-24 00:20:54+08:00
[CATEGORIES]
cs.LG
Local vs Global continual learning
[AUTHORS]
Giulia Lanzillotta, Sidak Pal Singh, Benjamin F. Grewe, Thomas Hofmann
[ABSTRACT]
Continual learning is the problem of integrating new information in a model
while retaining the knowledge acquired in the past. Despite the tangible
improvements achieved in recent years, the problem of continual learning is
still an open one. A better understanding of the mechanisms behind the
successes and failures of existing continual learning algorithms can unlock the
development of new successful strategies. In this work, we view continual
learning from the perspective of the multi-task loss approximation, and we
compare two alternative strategies, namely local and global approximations. We
classify existing continual learning algorithms based on the approximation
used, and we assess the practical effects of this distinction in common
continual learning settings. Additionally, we study optimal continual learning
objectives in the case of local polynomial approximations and we provide
examples of existing algorithms implementing the optimal objectives.
[COMMENTS]
(10 pages, Will appear in the proceedings of CoLLAs 2024)
[LINK]
http://arxiv.org/abs/2407.16611v1
[DATE]
2024-07-24 00:18:00+08:00
[CATEGORIES]
cs.LG
Recurrent Action Transformer with Memory
[AUTHORS]
Egor Cherepanov, Alexey Staroverov, Dmitry Yudin, Alexey K. Kovalev, Aleksandr I. Panov
[ABSTRACT]
Recently, the use of transformers in offline reinforcement learning has
become a rapidly developing area. This is due to their ability to treat the
agent’s trajectory in the environment as a sequence, thereby reducing the
policy learning problem to sequence modeling. In environments where the agent’s
decisions depend on past events (POMDPs), capturing both the event itself and
the decision point in the context of the model is essential. However, the
quadratic complexity of the attention mechanism limits the potential for
context expansion. One solution to this problem is to enhance transformers with
memory mechanisms. This paper proposes a Recurrent Action Transformer with
Memory (RATE), a novel model architecture incorporating a recurrent memory
mechanism designed to regulate information retention. To evaluate our model, we
conducted extensive experiments on memory-intensive environments
(ViZDoom-Two-Colors, T-Maze, Memory Maze, Minigrid.Memory), classic Atari games
and MuJoCo control environments. The results show that using memory can
significantly improve performance in memory-intensive environments while
maintaining or improving results in classic environments. We hope our findings
will stimulate research on memory mechanisms for transformers applicable to
offline reinforcement learning.
[COMMENTS]
18 pages, 9 figures
[LINK]
http://arxiv.org/abs/2306.09459v4
[DATE]
2024-07-24 00:17:36+08:00
[CATEGORIES]
cs.LG
Targeted Adaptive Design
[AUTHORS]
Carlo Graziani, Marieme Ngom
[ABSTRACT]
Modern advanced manufacturing and advanced materials design often require
searches of relatively high-dimensional process control parameter spaces for
settings that result in optimal structure, property, and performance
parameters. The mapping from the former to the latter must be determined from
noisy experiments or from expensive simulations. We abstract this problem to a
mathematical framework in which an unknown function from a control space to a
design space must be ascertained by means of expensive noisy measurements,
which locate optimal control settings generating desired design features within
specified tolerances, with quantified uncertainty. We describe targeted
adaptive design (TAD), a new algorithm that performs this sampling task
efficiently. TAD creates a Gaussian process surrogate model of the unknown
mapping at each iterative stage, proposing a new batch of control settings to
sample experimentally and optimizing the updated log-predictive likelihood of
the target design. TAD either stops upon locating a solution with uncertainties
that fit inside the tolerance box or uses a measure of expected future
information to determine that the search space has been exhausted with no
solution. TAD thus embodies the exploration-exploitation tension in a manner
that recalls, but is essentially different from, Bayesian optimization and
optimal experimental design.
[COMMENTS]
SIAM/ASA Journal on Uncertainty Quantification, Accepted Version
[LINK]
http://arxiv.org/abs/2205.14208v3
[DATE]
2024-07-24 00:16:32+08:00
[CATEGORIES]
cs.LG
Towards a “universal translator” for neural dynamics at single-cell, single-spike resolution
[AUTHORS]
Yizi Zhang, Yanchen Wang, Donato Jimenez-Beneto, Zixuan Wang, Mehdi Azabou, Blake Richards, Olivier Winter, International Brain Laboratory, Eva Dyer, Liam Paninski, Cole Hurwitz
[ABSTRACT]
Neuroscience research has made immense progress over the last decade, but our
understanding of the brain remains fragmented and piecemeal: the dream of
probing an arbitrary brain region and automatically reading out the information
encoded in its neural activity remains out of reach. In this work, we build
towards a first foundation model for neural spiking data that can solve a
diverse set of tasks across multiple brain areas. We introduce a novel
self-supervised modeling approach for population activity in which the model
alternates between masking out and reconstructing neural activity across
different time steps, neurons, and brain regions. To evaluate our approach, we
design unsupervised and supervised prediction tasks using the International
Brain Laboratory repeated site dataset, which is comprised of Neuropixels
recordings targeting the same brain locations across 48 animals and
experimental sessions. The prediction tasks include single-neuron and
region-level activity prediction, forward prediction, and behavior decoding. We
demonstrate that our multi-task-masking (MtM) approach significantly improves
the performance of current state-of-the-art population models and enables
multi-task learning. We also show that by training on multiple animals, we can
improve the generalization ability of the model to unseen animals, paving the
way for a foundation model of the brain at single-cell, single-spike
resolution.
[LINK]
http://arxiv.org/abs/2407.14668v2
[DATE]
2024-07-24 00:14:27+08:00
[CATEGORIES]
cs.LG
Interpretable Machine Learning for TabPFN
[AUTHORS]
David Rundel, Julius Kobialka, Constantin von Crailsheim, Matthias Feurer, Thomas Nagler, David Rügamer
[ABSTRACT]
The recently developed Prior-Data Fitted Networks (PFNs) have shown very
promising results for applications in low-data regimes. The TabPFN model, a
special case of PFNs for tabular data, is able to achieve state-of-the-art
performance on a variety of classification tasks while producing posterior
predictive distributions in mere seconds by in-context learning without the
need for learning parameters or hyperparameter tuning. This makes TabPFN a very
attractive option for a wide range of domain applications. However, a major
drawback of the method is its lack of interpretability. Therefore, we propose
several adaptations of popular interpretability methods that we specifically
design for TabPFN. By taking advantage of the unique properties of the model,
our adaptations allow for more efficient computations than existing
implementations. In particular, we show how in-context learning facilitates the
estimation of Shapley values by avoiding approximate retraining and enables the
use of Leave-One-Covariate-Out (LOCO) even when working with large-scale
Transformers. In addition, we demonstrate how data valuation methods can be
used to address scalability challenges of TabPFN. Our proposed methods are
implemented in a package tabpfn_iml and made available at
https://github.com/david-rundel/tabpfn_iml.
[COMMENTS]
This preprint has not undergone peer review or any post-submission
improvements or corrections. The Version of Record of this contribution is
published in Explainable Artificial Intelligence, and is available online at
https://doi.org/10.1007/978-3-031-63797-1_23
[LINK]
http://arxiv.org/abs/2403.10923v2
[DATE]
2024-07-24 00:10:52+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Veronica Chelu, Doina Precup
[ABSTRACT]
We apply functional acceleration to the Policy Mirror Descent (PMD) general
family of algorithms, which cover a wide range of novel and fundamental methods
in Reinforcement Learning (RL). Leveraging duality, we propose a momentum-based
PMD update. By taking the functional route, our approach is independent of the
policy parametrization and applicable to large-scale optimization, covering
previous applications of momentum at the level of policy parameters as a
special case. We theoretically analyze several properties of this approach and
complement with a numerical ablation study, which serves to illustrate the
policy optimization dynamics on the value polytope, relative to different
algorithmic design choices in this space. We further characterize numerically
several features of the problem setting relevant for functional acceleration,
and lastly, we investigate the impact of approximation on their learning
mechanics.
[LINK]
http://arxiv.org/abs/2407.16602v1
[DATE]
2024-07-24 00:04:55+08:00
[CATEGORIES]
cs.LG
Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters
[AUTHORS]
Daniil Gurgurov, Mareike Hartmann, Simon Ostermann
[ABSTRACT]
This paper explores the integration of graph knowledge from linguistic
ontologies into multilingual Large Language Models (LLMs) using adapters to
improve performance for low-resource languages (LRLs) in sentiment analysis
(SA) and named entity recognition (NER). Building upon successful
parameter-efficient fine-tuning techniques, such as K-ADAPTER and MAD-X, we
propose a similar approach for incorporating knowledge from multilingual
graphs, connecting concepts in various languages with each other through
linguistic relationships, into multilingual LLMs for LRLs. Specifically, we
focus on eight LRLs – Maltese, Bulgarian, Indonesian, Nepali, Javanese,
Uyghur, Tibetan, and Sinhala – and employ language-specific adapters
fine-tuned on data extracted from the language-specific section of ConceptNet,
aiming to enable knowledge transfer across the languages covered by the
knowledge graph. We compare various fine-tuning objectives, including standard
Masked Language Modeling (MLM), MLM with full-word masking, and MLM with
targeted masking, to analyse their effectiveness in learning and integrating
the extracted graph data. Through empirical evaluation on language-specific
tasks, we assess how structured graph knowledge affects the performance of
multilingual LLMs for LRLs in SA and NER, providing insights into the potential
benefits of adapting language models for low-resource scenarios.
[COMMENTS]
9 pages, KaLLM workshop
[LINK]
http://arxiv.org/abs/2407.01406v2
[DATE]
2024-07-23 23:51:12+08:00
[CATEGORIES]
cs.CL
TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback
[AUTHORS]
Eunseop Yoon, Hee Suk Yoon, SooHwan Eom, Gunsoo Han, Daniel Wontae Nam, Daejin Jo, Kyoung-Woon On, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo
[ABSTRACT]
Reinforcement Learning from Human Feedback (RLHF) leverages human preference
data to train language models to align more closely with human essence. These
human preference data, however, are labeled at the sequence level, creating a
mismatch between sequence-level preference labels and tokens, which are
autoregressively generated from the language model. Although several recent
approaches have tried to provide token-level (i.e., dense) rewards for each
individual token, these typically rely on predefined discrete reward values
(e.g., positive: +1, negative: -1, neutral: 0), failing to account for varying
degrees of preference inherent to each token. To address this limitation, we
introduce TLCR (Token-Level Continuous Reward) for RLHF, which incorporates a
discriminator trained to distinguish positive and negative tokens, and the
confidence of the discriminator is used to assign continuous rewards to each
token considering the context. Extensive experiments show that our proposed
TLCR leads to consistent performance improvements over previous sequence-level
or token-level discrete rewards on open-ended generation benchmarks.
[COMMENTS]
ACL2024 Findings
[LINK]
http://arxiv.org/abs/2407.16574v1
[DATE]
2024-07-23 23:27:37+08:00
[CATEGORIES]
cs.CL
Retrieve, Generate, Evaluate: A Case Study for Medical Paraphrases Generation with Small Language Models
[AUTHORS]
Ioana Buhnila, Aman Sinha, Mathieu Constant
[ABSTRACT]
The recent surge in the accessibility of large language models (LLMs) to the
general population can lead to untrackable use of such models for
medical-related recommendations. Language generation via LLMs has two
key problems: firstly, they are prone to hallucination and therefore, for any
medical purpose, they require scientific and factual grounding; secondly, LLMs
pose a tremendous challenge to computational resources due to their gigantic
model size. In this work, we introduce pRAGe, a pipeline for Retrieval
Augmented Generation and evaluation of medical paraphrase generation using
Small Language Models (SLM). We study the effectiveness of SLMs and the impact
of an external knowledge base for medical paraphrase generation in French.
[COMMENTS]
KnowledgeableLM 2024
[LINK]
http://arxiv.org/abs/2407.16565v1
[DATE]
2024-07-23 23:17:11+08:00
[CATEGORIES]
cs.CL
End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling
[AUTHORS]
Jianxin Liang, Xiaojun Meng, Yueqian Wang, Chang Liu, Qun Liu, Dongyan Zhao
[ABSTRACT]
Video Question Answering (VideoQA) has emerged as a challenging frontier in
the field of multimedia processing, requiring intricate interactions between
visual and textual modalities. Simply uniformly sampling frames or
indiscriminately aggregating frame-level visual features often falls short in
capturing the nuanced and relevant contexts of videos needed to perform VideoQA
well. To mitigate these issues, we propose VidF4, a novel VideoQA framework
equipped with a tailored frame selection strategy for effective and efficient
VideoQA. We
propose three frame-scoring mechanisms that consider both question relevance
and inter-frame similarity to evaluate the importance of each frame for a given
question on the video. Furthermore, we design a differentiable adaptive frame
sampling mechanism to facilitate end-to-end training for the frame selector and
answer generator. The experimental results across three widely adopted
benchmarks demonstrate that our model consistently outperforms existing VideoQA
methods, establishing a new SOTA across NExT-QA (+0.3%), STAR (+0.9%), and TVQA
(+1.0%). Furthermore, through both quantitative and qualitative analyses, we
validate the effectiveness of each design choice.
[LINK]
http://arxiv.org/abs/2407.15047v2
[DATE]
2024-07-23 22:56:22+08:00
[CATEGORIES]
cs.CL
Position: AI/ML Influencers Have a Place in the Academic Process
[AUTHORS]
Iain Xie Weissburg, Mehir Arora, Xinyi Wang, Liangming Pan, William Yang Wang
[COMMENTS]
15 Pages, 22 Figures, ICML 2024
[LINK]
http://arxiv.org/abs/2401.13782v3
[DATE]
2024-07-23 22:49:43+08:00
[CATEGORIES]
cs.CL
cs.LG
Quantifying the Role of Textual Predictability in Automatic Speech Recognition
[AUTHORS]
Sean Robertson, Gerald Penn, Ewan Dunbar
[ABSTRACT]
A long-standing question in automatic speech recognition research is how to
attribute errors to the ability of a model to model the acoustics, versus its
ability to leverage higher-order context (lexicon, morphology, syntax,
semantics). We validate a novel approach which models error rates as a function
of relative textual predictability, and yields a single number, $k$, which
measures the effect of textual predictability on the recognizer. We use this
method to demonstrate that a Wav2Vec 2.0-based model makes stronger use
of textual context than a hybrid ASR model, in spite of not using an explicit
language model, and also use it to shed light on recent results demonstrating
poor performance of standard ASR systems on African-American English. We
demonstrate that these mostly represent failures of acoustic–phonetic
modelling. We show how this approach can be used straightforwardly in
diagnosing and improving ASR.
[LINK]
http://arxiv.org/abs/2407.16537v1
[DATE]
2024-07-23 22:47:25+08:00
[CATEGORIES]
cs.CL
Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models
[AUTHORS]
Aristeidis Panos, Rahaf Aljundi, Daniel Olmeda Reino, Richard E Turner
[ABSTRACT]
Vision language models (VLMs) demonstrate impressive capabilities in visual
question answering and image captioning, acting as a crucial link between
visual and language models. However, existing open-source VLMs heavily rely on
pretrained and frozen vision encoders (such as CLIP). Despite CLIP’s robustness
across diverse domains, it still exhibits non-negligible image understanding
errors. These errors propagate to the VLM responses, resulting in sub-optimal
performance. In our work, we propose an efficient and robust method for
updating vision encoders within VLMs. Our approach selectively and locally
updates encoders, leading to substantial performance improvements on data where
previous mistakes occurred, while maintaining overall robustness. Furthermore,
we demonstrate the effectiveness of our method during continual few-shot
updates. Theoretical grounding, generality, and computational efficiency
characterize our approach.
[LINK]
http://arxiv.org/abs/2407.16526v1
[DATE]
2024-07-23 22:39:40+08:00
[CATEGORIES]
cs.CL
cs.LG
Large Language Models Lack Understanding of Character Composition of Words
[AUTHORS]
Andrew Shin, Kunitake Kaneko
[COMMENTS]
ICML 2024 Workshop on Large Language Models and Cognition
[LINK]
http://arxiv.org/abs/2405.11357v3
[DATE]
2024-07-23 22:39:06+08:00
[CATEGORIES]
cs.CL
Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data
[AUTHORS]
Julian Schelb, Roberto Ulloa, Andreas Spitz
[ABSTRACT]
Researchers in the political and social sciences often rely on classification
models to analyze trends in information consumption by examining browsing
histories of millions of webpages. Automated scalable methods are necessary due
to the impracticality of manual labeling. In this paper, we model the detection
of topic-related content as a binary classification task and compare the
accuracy of fine-tuned pre-trained encoder models against in-context learning
strategies. Using only a few hundred annotated data points per topic, we detect
content related to three German policies in a database of scraped webpages. We
compare multilingual and monolingual models, as well as zero- and few-shot
approaches, and investigate the impact of negative sampling strategies and the
combination of URL & content-based features. Our results show that a small
sample of annotated data is sufficient to train an effective classifier.
Fine-tuning encoder-based models yields better results than in-context
learning. Classifiers using both URL & content-based features perform best,
while using URLs alone provides adequate results when content is unavailable.
[LINK]
http://arxiv.org/abs/2407.16516v1
[DATE]
2024-07-23 22:31:59+08:00
[CATEGORIES]
cs.CL
E-TSL: A Continuous Educational Turkish Sign Language Dataset with Baseline Methods
[AUTHORS]
Şükrü Öztürk, Hacer Yalim Keles
[ABSTRACT]
This study introduces the continuous Educational Turkish Sign Language
(E-TSL) dataset, collected from online Turkish language lessons for 5th, 6th,
and 8th grades. The dataset comprises 1,410 videos totaling nearly 24 hours and
includes performances from 11 signers. Turkish, an agglutinative language,
poses unique challenges for sign language translation, particularly with a
vocabulary where 64% are singleton words and 85% are rare words, appearing fewer
than five times. We developed two baseline models to address these challenges:
the Pose to Text Transformer (P2T-T) and the Graph Neural Network based
Transformer (GNN-T) models. The GNN-T model achieved a 19.13% BLEU-1 score and
a 3.28% BLEU-4 score, presenting a significant challenge compared to existing
benchmarks. The P2T-T model, while demonstrating slightly lower performance in
BLEU scores, achieved a higher ROUGE-L score of 22.09%. Additionally, we
benchmarked our model using the well-known PHOENIX-Weather 2014T dataset to
validate our approach.
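
The vocabulary statistics quoted above (share of singleton words and share of
rare words) can be computed from a tokenized corpus roughly as follows. This is
a minimal sketch and not the authors' code: the function name `vocab_rarity`
and the toy token list are assumptions made for illustration, and the threshold
of five follows the abstract's definition of rare words.

```python
from collections import Counter

def vocab_rarity(tokens, rare_threshold=5):
    """Return (singleton_share, rare_share) measured over word types."""
    counts = Counter(tokens)
    vocab_size = len(counts)
    if vocab_size == 0:
        return 0.0, 0.0
    # A singleton is a word type that occurs exactly once in the corpus.
    singletons = sum(1 for c in counts.values() if c == 1)
    # A rare word occurs fewer than `rare_threshold` times (singletons included).
    rare = sum(1 for c in counts.values() if c < rare_threshold)
    return singletons / vocab_size, rare / vocab_size

# Hypothetical toy corpus, not taken from E-TSL.
toy_tokens = ["ev"] * 6 + ["okul"] * 2 + ["kitap", "kalem"]
singleton_share, rare_share = vocab_rarity(toy_tokens)
```

Both shares are taken over the vocabulary (distinct word types), not over
running tokens, which matches how the abstract phrases the figures.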
[COMMENTS]
7 pages, 3 figures, 4 tables
[LINK]
http://arxiv.org/abs/2405.02984v2
[DATE]
2024-07-23 21:56:20+08:00
[CATEGORIES]
cs.CL
cs.LG
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data
[AUTHORS]
Lele Cao, Valentin Buchner, Zineb Senane, Fangkai Yang
[ABSTRACT]
Multimodal Large Language Models (MLLMs) are typically assessed using
expensive annotated multimodal benchmarks, which often lag behind the rapidly
evolving demands of MLLM evaluation. This paper outlines and validates
GenCeption, a novel, annotation-free evaluation method that requires only
unimodal data to measure inter-modality semantic coherence and inversely
assesses MLLMs’ tendency to hallucinate. This approach eliminates the need for
costly data annotation, minimizes the risk of training data contamination,
results in slower benchmark saturation, and avoids the illusion of emerging
abilities. Inspired by the DrawCeption game, GenCeption begins with a
non-textual sample and proceeds through iterative description and generation
steps. The semantic drift across iterations is quantified using the GC@T
metric. Based on the GenCeption method, we establish the MMECeption benchmark
for evaluating Vision LLMs (VLLMs), and compare performance of several popular
VLLMs and human annotators. Our empirical results validate GenCeption’s
effectiveness, demonstrating strong correlations with established VLLM
benchmarks. VLLMs still significantly lag behind human performance and
struggle especially with text-intensive tasks.
[COMMENTS]
Significantly extended from v2. Source code:
https://github.com/llcresearch/GenCeption. Leaderboard:
https://huggingface.co/spaces/valbuc/GenCeption
[LINK]
http://arxiv.org/abs/2402.14973v3
[DATE]
2024-07-23 21:54:16+08:00
[CATEGORIES]
cs.CL
cs.LG
The Oscars of AI Theater: A Survey on Role-Playing with Language Models
[AUTHORS]
Nuo Chen, Yang Deng, Jia Li
[ABSTRACT]
This survey explores the burgeoning field of role-playing with language
models, focusing on their development from early persona-based models to
advanced character-driven simulations facilitated by Large Language Models
(LLMs). Initially confined to simple persona consistency due to limited model
capabilities, role-playing tasks have now expanded to embrace complex character
portrayals involving character consistency, behavioral alignment, and overall
attractiveness. We provide a comprehensive taxonomy of the critical components
in designing these systems, including data, models and alignment, agent
architecture and evaluation. This survey not only outlines the current
methodologies and challenges, such as managing dynamic personal profiles and
achieving high-level persona consistency but also suggests avenues for future
research in improving the depth and realism of role-playing applications. The
goal is to guide future research by offering a structured overview of current
methodologies and identifying potential areas for improvement. Related
resources and papers are available at
https://github.com/nuochenpku/Awesome-Role-Play-Papers.
[COMMENTS]
28 pages
[LINK]
http://arxiv.org/abs/2407.11484v4
[DATE]
2024-07-23 21:18:31+08:00
[CATEGORIES]
cs.CL
Educating LLMs like Human Students: Structure-aware Injection of Domain Knowledge
[AUTHORS]
Kai Liu, Ze Chen, Zhihang Fu, Rongxin Jiang, Fan Zhou, Yaowu Chen, Yue Wu, Jieping Ye
[ABSTRACT]
This paper presents a pioneering methodology, termed StructTuning, to
efficiently transform foundation Large Language Models (LLMs) into domain
specialists. It significantly minimizes the training corpus requirement to a
mere 0.3% while achieving an impressive 50% of traditional knowledge injection
performance. Our method is inspired by the educational processes for human
students, particularly how structured domain knowledge from textbooks is
absorbed and then applied to tackle real-world challenges through specific
exercises. Based on this, we propose a novel two-stage knowledge injection
strategy: Structure-aware Continual Pre-Training (SCPT) and Structure-aware
Supervised Fine-Tuning (SSFT). In the SCPT phase, we organize the training data
into an auto-generated taxonomy of domain knowledge, enabling LLMs to
effectively memorize textual segments linked to specific expertise within the
taxonomy’s architecture. Subsequently, in the SSFT phase, we explicitly prompt
models to reveal the underlying knowledge structure in their outputs,
leveraging this structured domain insight to address practical problems
adeptly. Our ultimate method has undergone extensive evaluations across model
architectures and scales, using closed-book question-answering tasks on
LongBench and MMedBench datasets. Remarkably, our method matches 50% of the
improvement displayed by the state-of-the-art MMedLM2 on MMedBench, but with
only 0.3% quantity of the training corpus. This breakthrough showcases the
potential to scale up our StructTuning for stronger domain-specific LLMs. Code
will be made public soon.
[LINK]
http://arxiv.org/abs/2407.16724v1
[DATE]
2024-07-23 20:38:48+08:00
[CATEGORIES]
cs.CL
Enhancing LLM’s Cognition via Structurization
[AUTHORS]
Kai Liu, Zhihang Fu, Chao Chen, Wei Zhang, Rongxin Jiang, Fan Zhou, Yaowu Chen, Yue Wu, Jieping Ye
[ABSTRACT]
When reading long-form text, human cognition is complex and structurized.
While large language models (LLMs) process input contexts through a causal and
sequential perspective, this approach can potentially limit their ability to
handle intricate and complex inputs effectively. To enhance LLM’s cognition
capability, this paper presents a novel concept of context structurization.
Specifically, we transform the plain, unordered contextual sentences into
well-ordered and hierarchically structurized elements. By doing so, LLMs can
better grasp intricate and extended contexts through precise attention and
information-seeking along the organized structures. Extensive evaluations are
conducted across various model architectures and sizes (including several 7B-
to 72B-size auto-regressive LLMs as well as BERT-like masking models) on a
diverse set of NLP tasks (e.g., context-based question-answering, exhaustive
hallucination evaluation, and passage-level dense retrieval). Empirical results
show consistent and significant performance gains afforded by a single-round
structurization. In particular, we boost a 72B-parameter open-source model to
achieve comparable performance against GPT-3.5-Turbo as the hallucination
evaluator. Besides, we show the feasibility of distilling advanced LLMs’
language processing abilities to a smaller yet effective StruXGPT-7B to execute
structurization, addressing the practicality of our approach. Code will be made
public soon.
[LINK]
http://arxiv.org/abs/2407.16434v1
[DATE]
2024-07-23 20:33:58+08:00
[CATEGORIES]
cs.CL
Context Embeddings for Efficient Answer Generation in RAG
[AUTHORS]
David Rau, Shuai Wang, Hervé Déjean, Stéphane Clinchant
[ABSTRACT]
Retrieval-Augmented Generation (RAG) allows overcoming the limited knowledge
of LLMs by extending the input with external information. As a consequence, the
contextual inputs to the model become much longer, which slows down decoding
time, directly translating to the time a user has to wait for an answer. We
address this challenge by presenting COCOM, an effective context compression
method, reducing long contexts to only a handful of Context Embeddings, speeding
up the generation time by a large margin. Our method allows for different
compression rates trading off decoding time for answer quality. Compared to
earlier methods, COCOM allows for handling multiple contexts more effectively,
significantly reducing decoding time for long inputs. Our method demonstrates a
speed-up of up to 5.69 $\times$ while achieving higher performance compared to
existing efficient context compression methods.
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2407.09252v2
[DATE]
2024-07-23 20:28:31+08:00
[CATEGORIES]
cs.CL
Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning
[AUTHORS]
Quanyu Long, Yin Wu, Wenya Wang, Sinno Jialin Pan
[ABSTRACT]
In-context Learning (ICL) has emerged as a powerful capability alongside the
development of scaled-up large language models (LLMs). By instructing LLMs
using few-shot demonstrative examples, ICL enables them to perform a wide range
of tasks without updating millions of parameters. However, the precise
contributions of demonstrations towards improving end-task performance have not
been thoroughly investigated in recent analytical studies. In this paper, we
empirically decompose the overall performance of ICL into three dimensions,
label space, format, and discrimination, and we evaluate four general-purpose
LLMs across a diverse range of tasks. Counter-intuitively, we find that the
demonstrations have a marginal impact on provoking discriminative knowledge of
language models. However, ICL exhibits significant efficacy in regulating the
label space and format, which helps LLMs respond to desired label words. We
then demonstrate that this ability functions similarly to detailed instructions
for LLMs to follow. We additionally provide an in-depth analysis of the
mechanism of retrieval helping with ICL. Our findings demonstrate that
retrieving the semantically similar examples notably boosts the model’s
discriminative capability. However, we also observe a trade-off in selecting
good in-context examples regarding label diversity.
[COMMENTS]
39 pages, 8 figures. Accepted by Conference On Language Modeling
(COLM) 2024
[LINK]
http://arxiv.org/abs/2404.07546v2
[DATE]
2024-07-23 20:28:14+08:00
[CATEGORIES]
cs.CL
Learning to Plan and Generate Text with Citations
[AUTHORS]
Constanza Fierro, Reinald Kim Amplayo, Fantine Huot, Nicola De Cao, Joshua Maynez, Shashi Narayan, Mirella Lapata
[ABSTRACT]
The increasing demand for the deployment of LLMs in information-seeking
scenarios has spurred efforts in creating verifiable systems, which generate
responses to queries along with supporting evidence. In this paper, we explore
the attribution capabilities of plan-based models which have been recently
shown to improve the faithfulness, grounding, and controllability of generated
text. We conceptualize plans as a sequence of questions which serve as
blueprints of the generated content and its organization. We propose two
attribution models that utilize different variants of blueprints, an
abstractive model where questions are generated from scratch, and an extractive
model where questions are copied from the input. Experiments on long-form
question-answering show that planning consistently improves attribution
quality. Moreover, the citations generated by blueprint models are more
accurate compared to those obtained from LLM-based pipelines lacking a planning
component.
[COMMENTS]
Accepted at ACL 2024
[LINK]
http://arxiv.org/abs/2404.03381v3
[DATE]
2024-07-23 19:54:10+08:00
[CATEGORIES]
cs.CL
Language Models Meet Anomaly Detection for Better Interpretability and Generalizability
[AUTHORS]
Jun Li, Su Hwan Kim, Philip Müller, Lina Felsner, Daniel Rueckert, Benedikt Wiestler, Julia A. Schnabel, Cosmin I. Bercea
[ABSTRACT]
This research explores the integration of language models and unsupervised
anomaly detection in medical imaging, addressing two key questions: (1) Can
language models enhance the interpretability of anomaly detection maps? and (2)
Can anomaly maps improve the generalizability of language models in open-set
anomaly detection tasks? To investigate these questions, we introduce a new
dataset for multi-image visual question-answering on brain magnetic resonance
images encompassing multiple conditions. We propose KQ-Former (Knowledge
Querying Transformer), which is designed to optimally align visual and textual
information in limited-sample contexts. Our model achieves a 60.81% accuracy on
closed questions, covering disease classification and severity across 15
different classes. For open questions, KQ-Former demonstrates a 70% improvement
over the baseline with a BLEU-4 score of 0.41, and achieves the highest
entailment ratios (up to 71.9%) and lowest contradiction ratios (down to 10.0%)
among various natural language inference models. Furthermore, integrating
anomaly maps results in an 18% accuracy increase in detecting open-set
anomalies, thereby enhancing the language model’s generalizability to
previously unseen medical conditions. The code and dataset are available at
https://github.com/compai-lab/miccai-2024-junli?tab=readme-ov-file
[COMMENTS]
13 pages, 7 figures. 5th International Workshop on Multiscale
Multimodal Medical Imaging (MMMI 2024)
[LINK]
http://arxiv.org/abs/2404.07622v2
[DATE]
2024-07-23 19:50:03+08:00
[CATEGORIES]
cs.CL
TookaBERT: A Step Forward for Persian NLU
[AUTHORS]
MohammadAli SadraeiJavaheri, Ali Moghaddaszadeh, Milad Molazadeh, Fariba Naeiji, Farnaz Aghababaloo, Hamideh Rafiee, Zahra Amirmahani, Tohid Abedini, Fatemeh Zahra Sheikhi, Amirmohammad Salehoof
[LINK]
http://arxiv.org/abs/2407.16382v1
[DATE]
2024-07-23 19:12:47+08:00
[CATEGORIES]
cs.CL
Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering
[AUTHORS]
Haibo Wang, Chenghang Lai, Yixuan Sun, Weifeng Ge
[ABSTRACT]
Video Question Answering (VideoQA) aims to answer natural language questions
based on the information observed in videos. Despite the recent success of
Large Multimodal Models (LMMs) in image-language understanding and reasoning,
they deal with VideoQA insufficiently, by simply taking uniformly sampled
frames as visual inputs, which ignores question-relevant visual clues.
Moreover, there are no human annotations for question-critical timestamps in
existing VideoQA datasets. In light of this, we propose a novel weakly
supervised framework to force the LMMs to reason out the answers with
question-critical moments as visual inputs. Specifically, we first fuse the
question and answer pairs as event descriptions to find multiple keyframes as
target moments and pseudo-labels, with the visual-language alignment capability
of the CLIP models. With these pseudo-labeled keyframes as additional weak
supervision, we devise a lightweight Gaussian-based Contrastive Grounding (GCG)
module. GCG learns multiple Gaussian functions to characterize the temporal
structure of the video, and samples question-critical frames as positive moments
to be the visual inputs of LMMs. Extensive experiments on several benchmarks
verify the effectiveness of our framework, and we achieve substantial
improvements compared to previous state-of-the-art methods.
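
The idea of describing a video's temporal structure with several Gaussian
functions and then picking frames from the high-weight regions can be
illustrated with a small sketch. It is not the GCG module: in the paper the
Gaussians are learned, whereas here the centers and widths are fixed,
hypothetical values, and the helper names `gaussian_frame_weights` and
`top_k_frames` are assumptions made for illustration.

```python
import math

def gaussian_frame_weights(num_frames, centers, widths):
    """Sum of Gaussian bumps evaluated at normalized frame positions in [0, 1]."""
    weights = []
    for i in range(num_frames):
        t = i / max(num_frames - 1, 1)  # normalized time of frame i
        w = sum(math.exp(-0.5 * ((t - mu) / sigma) ** 2)
                for mu, sigma in zip(centers, widths))
        weights.append(w)
    return weights

def top_k_frames(weights, k):
    """Indices of the k highest-weight frames, returned in temporal order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])

# Assumed parameters: two moments of interest around 20% and 80% of the video.
weights = gaussian_frame_weights(num_frames=11, centers=[0.2, 0.8], widths=[0.05, 0.05])
selected = top_k_frames(weights, k=2)
```

With these assumed parameters the two selected frames are the ones closest to
the Gaussian centers. A learned version would replace the fixed centers and
widths with trainable parameters.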
[COMMENTS]
accepted by ACM Multimedia 2024
[LINK]
http://arxiv.org/abs/2401.10711v4
[DATE]
2024-07-23 18:17:39+08:00
[CATEGORIES]
cs.CL
FACTTRACK: Time-Aware World State Tracking in Story Outlines
[AUTHORS]
Zhiheng Lyu, Kevin Yang, Lingpeng Kong, Daniel Klein
[ABSTRACT]
While accurately detecting and correcting factual contradictions in language
model outputs has become increasingly important as their capabilities improve,
doing so is highly challenging. We propose a novel method, FACTTRACK, for
tracking atomic facts and addressing factual contradictions. Crucially,
FACTTRACK also maintains time-aware validity intervals for each fact, allowing
for change over time. At a high level, FACTTRACK consists of a four-step
pipeline to update a world state data structure for each new event: (1)
decompose the event into directional atomic facts; (2) determine the validity
interval of each atomic fact using the world state; (3) detect contradictions
with existing facts in the world state; and finally (4) add new facts to the
world state and update existing atomic facts. When we apply FACTTRACK to
contradiction detection on structured story outlines, we find that FACTTRACK
using LLaMA2-7B-Chat substantially outperforms a fair baseline using
LLaMA2-7B-Chat, and achieves performance comparable to a GPT4 baseline.
Moreover, when using GPT4, FACTTRACK significantly outperforms the GPT4
baseline.
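
Steps (3) and (4) of the pipeline above, checking a new fact against a world
state with validity intervals and then adding it, can be sketched as follows.
This is an illustrative toy, not FACTTRACK itself: steps (1) and (2), event
decomposition and interval assignment, are assumed to have already produced
the `Fact` objects, and the names `Fact`, `WorldState`, and `overlaps` are
hypothetical. The sketch only appends new facts and does not update the
intervals of existing ones.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Fact:
    key: str      # what the fact is about, e.g. "Alice.location" (assumed encoding)
    value: str    # the asserted value, e.g. "Paris"
    start: float  # validity interval start (inclusive)
    end: float    # validity interval end (exclusive)

def overlaps(a, b):
    """True if the half-open validity intervals of two facts intersect."""
    return a.start < b.end and b.start < a.end

@dataclass
class WorldState:
    facts: list = field(default_factory=list)

    def contradictions(self, new):
        # Same key, different value, overlapping validity: treated as a conflict.
        return [f for f in self.facts
                if f.key == new.key and f.value != new.value and overlaps(f, new)]

    def update(self, new_facts):
        """Check each new fact against the state, then add it. Returns conflicts."""
        conflicts = []
        for new in new_facts:
            conflicts.extend((old, new) for old in self.contradictions(new))
            self.facts.append(new)
        return conflicts

state = WorldState()
first = state.update([Fact("Alice.location", "Paris", 0, 10)])
second = state.update([Fact("Alice.location", "Rome", 5, 8)])    # overlaps Paris
third = state.update([Fact("Alice.location", "Rome", 10, 20)])   # after Paris ends
```

Because each fact carries its own interval, the later "Rome" fact is not
flagged: it becomes valid only once the "Paris" fact has expired, which is the
kind of change over time the validity intervals are meant to allow.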
[COMMENTS]
22 pages
[LINK]
http://arxiv.org/abs/2407.16347v1
[DATE]
2024-07-23 17:50:14+08:00
[CATEGORIES]
cs.CL
An Empirical Study of Validating Synthetic Data for Formula Generation
[AUTHORS]
Usneek Singh, José Cambronero, Sumit Gulwani, Aditya Kanade, Anirudh Khatry, Vu Le, Mukul Singh, Gust Verbruggen
[LINK]
http://arxiv.org/abs/2407.10657v2
[DATE]
2024-07-23 17:41:50+08:00
[CATEGORIES]
cs.CL
DOPRA: Decoding Over-accumulation Penalization and Re-allocation in Specific Weighting Layer
[AUTHORS]
Jinfeng Wei, Xiaofeng Zhang
[ABSTRACT]
In this work, we introduce DOPRA, a novel approach designed to mitigate
hallucinations in multi-modal large language models (MLLMs). Unlike existing
solutions that typically involve costly supplementary training data or the
integration of external knowledge sources, DOPRA innovatively addresses
hallucinations by decoding specific weighted layer penalties and
redistribution, offering an economical and effective solution without
additional resources. DOPRA is grounded in unique insights into the intrinsic
mechanisms controlling hallucinations within MLLMs, especially the models’
tendency to over-rely on a subset of summary tokens in the self-attention
matrix, neglecting critical image-related information. This phenomenon is
particularly pronounced in certain strata. To counteract this over-reliance,
DOPRA employs a strategy of weighted overlay penalties and redistribution in
specific layers, such as the 12th layer, during the decoding process.
Furthermore, DOPRA includes a retrospective allocation process that re-examines
the sequence of generated tokens, allowing the algorithm to reallocate token
selection to better align with the actual image content, thereby reducing the
incidence of hallucinatory descriptions in auto-generated captions. Overall,
DOPRA represents a significant step forward in improving the output quality of
MLLMs by systematically reducing hallucinations through targeted adjustments
during the decoding process.
[LINK]
http://arxiv.org/abs/2407.15130v2
[DATE]
2024-07-23 17:30:57+08:00
[CATEGORIES]
cs.CL
PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
[AUTHORS]
Blazej Manczak, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan
[ABSTRACT]
Deploying language models (LMs) necessitates outputs to be both high-quality
and compliant with safety guidelines. Although Inference-Time Guardrails (ITG)
offer solutions that shift model output distributions towards compliance, we
find that current methods struggle in balancing safety with helpfulness. ITG
methods that safely address non-compliant queries exhibit lower helpfulness
while those that prioritize helpfulness compromise on safety. We refer to this
trade-off as the guardrail tax, analogous to the alignment tax. To address
this, we propose PrimeGuard, a novel ITG method that utilizes structured
control flow.
PrimeGuard routes requests to different self-instantiations of the LM with
varying instructions, leveraging its inherent instruction-following
capabilities and in-context learning. Our tuning-free approach dynamically
compiles system-designer guidelines for each query. We construct and release
safe-eval, a diverse red-team safety benchmark. Extensive evaluations
demonstrate that PrimeGuard, without fine-tuning, overcomes the guardrail tax
by (1) significantly increasing resistance to iterative jailbreak attacks and
(2) achieving state-of-the-art results in safety guardrailing while (3)
matching helpfulness scores of alignment-tuned models. Extensive evaluations
demonstrate that PrimeGuard, without fine-tuning, outperforms all competing
baselines and overcomes the guardrail tax by improving the fraction of safe
responses from 61% to 97% and increasing average helpfulness scores from 4.17
to 4.29 on the largest models, while reducing attack success rate from 100% to
8%.
PrimeGuard implementation is available at
https://github.com/dynamofl/PrimeGuard and safe-eval dataset is available at
https://huggingface.co/datasets/dynamoai/safe_eval.
[COMMENTS]
ICML 2024 NextGenAISafety workshop version with links to
implementation and dataset
[LINK]
http://arxiv.org/abs/2407.16318v1
[DATE]
2024-07-23 17:14:27+08:00
[CATEGORIES]
cs.CL
Beyond Binary Gender: Evaluating Gender-Inclusive Machine Translation with Ambiguous Attitude Words
[AUTHORS]
Yijie Chen, Yijin Liu, Fandong Meng, Jinan Xu, Yufeng Chen, Jie Zhou
[ABSTRACT]
Gender bias has been a focal point in the study of bias in machine
translation and language models. Existing machine translation gender bias
evaluations are primarily focused on male and female genders, limiting the
scope of the evaluation. To assess gender bias accurately, these studies often
rely on calculating the accuracy of gender pronouns or the masculine and
feminine attributes of grammatical gender via the stereotypes triggered by
occupations or sentiment words (i.e., clear positive or negative
attitude), which cannot extend to non-binary groups. This study presents a
benchmark AmbGIMT (Gender-Inclusive Machine Translation with Ambiguous attitude
words), which assesses gender bias beyond binary gender. Meanwhile, we propose
a novel process to evaluate gender bias based on the Emotional Attitude Score
(EAS), which is used to quantify ambiguous attitude words. In evaluating three
recent and effective open-source LLMs and one powerful multilingual
translation-specific model, our main observations are: (1) The translation
performance within non-binary gender contexts is markedly inferior in terms of
translation quality and exhibits more negative attitudes than binary-gender
contexts. (2) The analysis experiments indicate that incorporating constraint
context in prompts for gender identity terms can substantially reduce
translation bias, while the bias remains evident despite the presence of the
constraints. The code is publicly available at
https://github.com/pppa2019/ambGIMT.
[COMMENTS]
The code is publicly available at
https://github.com/pppa2019/ambGIMT
[LINK]
http://arxiv.org/abs/2407.16266v1
[DATE]
2024-07-23 16:13:51+08:00
[CATEGORIES]
cs.CL
ITERTL: An Iterative Framework for Fine-tuning LLMs for RTL Code Generation
[AUTHORS]
Peiyang Wu, Nan Guo, Xiao Xiao, Wenming Li, Xiaochun Ye, Dongrui Fan
[COMMENTS]
There are some mistakes in the Experimental Setup in Section 4.1
[LINK]
http://arxiv.org/abs/2407.12022v2
[DATE]
2024-07-23 16:08:45+08:00
[CATEGORIES]
cs.CL
LawLuo: A Chinese Law Firm Co-run by LLM Agents
[AUTHORS]
Jingyun Sun, Chengxiao Dai, Zhongze Luo, Yangbo Chang, Yang Li
[ABSTRACT]
Large Language Models (LLMs) demonstrate substantial potential in delivering
legal consultation services to users without a legal background, attributed to
their superior text comprehension and generation capabilities. Nonetheless,
existing Chinese legal LLMs limit interaction to a single model-user dialogue,
unlike the collaborative consultations typical of law firms, where multiple
staff members contribute to a single consultation. This limitation prevents an
authentic consultation experience. Additionally, extant Chinese legal LLMs
suffer from critical limitations: (1) insufficient control over the quality of
instruction fine-tuning data; (2) increased model hallucination resulting from
users’ ambiguous queries; and (3) a reduction in the model’s ability to follow
instructions over multiple dialogue turns. In response to these challenges, we
propose a novel legal dialogue framework that leverages the collaborative
capabilities of multiple LLM agents, termed LawLuo. This framework encompasses
four agents: a receptionist, a lawyer, a secretary, and a boss, each
responsible for different functionalities, collaboratively providing a
comprehensive legal consultation to users. Additionally, we constructed two
high-quality legal dialogue datasets, KINLED and MURLED, and fine-tuned
ChatGLM-3-6b using these datasets. We propose a legal query clarification
algorithm called ToLC. Experimental results demonstrate that LawLuo outperforms
baseline LLMs, including GPT-4, across three dimensions: lawyer-like language
style, the usefulness of legal advice, and the accuracy of legal knowledge. Our
code and datasets are available at https://github.com/NEFUJing/LawLuo.
[COMMENTS]
11 pages, 13 figures, 2 tables
[LINK]
http://arxiv.org/abs/2407.16252v1
[DATE]
2024-07-23 15:40:41+08:00
[CATEGORIES]
cs.CL
Exploring the Effectiveness and Consistency of Task Selection in Intermediate-Task Transfer Learning
[AUTHORS]
Pin-Jie Lin, Miaoran Zhang, Marius Mosbach, Dietrich Klakow
[COMMENTS]
Accepted to ACL SRW 2024
[LINK]
http://arxiv.org/abs/2407.16245v1
[DATE]
2024-07-23 15:31:43+08:00
[CATEGORIES]
cs.CL
A Multi-view Mask Contrastive Learning Graph Convolutional Neural Network for Age Estimation
[AUTHORS]
Yiping Zhang, Yuntao Shou, Tao Meng, Wei Ai, Keqin Li
[ABSTRACT]
The age estimation task aims to use facial features to predict the age of
people and is widely used in public security, marketing, identification, and
other fields. However, the features are mainly concentrated in facial
keypoints, and existing CNN and Transformer-based methods are inflexible
and redundant when modeling complex irregular structures. Therefore, this paper
proposes a Multi-view Mask Contrastive Learning Graph Convolutional Neural
Network (MMCL-GCN) for age estimation. Specifically, the overall structure of
the MMCL-GCN network contains a feature extraction stage and an age estimation
stage. In the feature extraction stage, we introduce a graph structure to
construct face images as input and then design a Multi-view Mask Contrastive
Learning (MMCL) mechanism to learn complex structural and semantic information
about face images. The learning mechanism employs an asymmetric siamese network
architecture, which utilizes an online encoder-decoder structure to reconstruct
the missing information from the original graph and utilizes the target encoder
to learn latent representations for contrastive learning. Furthermore, to
make the two learning mechanisms more compatible and complementary, we
adopt two augmentation strategies and optimize the joint losses. In the age
estimation stage, we design a Multi-layer Extreme Learning Machine (ML-IELM)
with identity mapping to fully use the features extracted by the online
encoder. Then, a classifier and a regressor were constructed based on ML-IELM,
which were used to identify the age grouping interval and accurately estimate
the final age. Extensive experiments show that MMCL-GCN can effectively reduce
the error of age estimation on benchmark datasets such as Adience, MORPH-II,
and LAP-2016.
[COMMENTS]
20 pages, 9 figures
[LINK]
http://arxiv.org/abs/2407.16234v1
[DATE]
2024-07-23 15:17:46+08:00
[CATEGORIES]
cs.CL
A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
[AUTHORS]
Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu Zhu, Xiang-Bo Mao, Sitaram Asur, Na Cheng
[ABSTRACT]
With advancements in self-supervised learning, the availability of trillions of
tokens in a pre-training corpus, instruction fine-tuning, and the development
of large Transformers with billions of parameters, large language models (LLMs)
are now capable of generating factual and coherent responses to human queries.
However, the mixed quality of training data can lead to the generation of
undesired responses, presenting a significant challenge. Over the past two
years, various methods have been proposed from different perspectives to
enhance LLMs, particularly in aligning them with human expectations. Despite
these efforts, there has not been a comprehensive survey paper that categorizes
and details these approaches. In this work, we aim to address this gap by
categorizing these papers into distinct topics and providing detailed
explanations of each alignment method, thereby helping readers gain a thorough
understanding of the current state of the field.
[LINK]
http://arxiv.org/abs/2407.16216v1
[DATE]
2024-07-23 14:45:52+08:00
[CATEGORIES]
cs.CL
Subgraph-Aware Training of Text-based Methods for Knowledge Graph Completion
[AUTHORS]
Youmin Ko, Hyemin Yang, Taeuk Kim, Hyunjoon Kim
[ABSTRACT]
Fine-tuning pre-trained language models (PLMs) has recently shown a potential
to improve knowledge graph completion (KGC). However, most PLM-based methods
encode only textual information, neglecting various topological structures of
knowledge graphs (KGs). In this paper, we empirically validate the significant
relations between the structural properties of KGs and the performance of the
PLM-based methods. To leverage the structural knowledge, we propose a
Subgraph-Aware Training framework for KGC (SATKGC) that combines (i)
subgraph-aware mini-batching to encourage hard negative sampling, and (ii) a
new contrastive learning method to focus more on harder entities and harder
negative triples in terms of the structural properties. To the best of our
knowledge, this is the first study to comprehensively incorporate the
structural inductive bias of the subgraphs into fine-tuning PLMs. Extensive
experiments on four KGC benchmarks demonstrate the superiority of SATKGC. Our
code is available.
[LINK]
http://arxiv.org/abs/2407.12703v3
[DATE]
2024-07-23 14:26:30+08:00
[CATEGORIES]
cs.CL
Graph-Structured Speculative Decoding
[AUTHORS]
Zhuocheng Gong, Jiahao Liu, Ziyue Wang, Pengfei Wu, Jingang Wang, Xunliang Cai, Dongyan Zhao, Rui Yan
[ABSTRACT]
Speculative decoding has emerged as a promising technique to accelerate the
inference of Large Language Models (LLMs) by employing a small language model
to draft a hypothesis sequence, which is then validated by the LLM. The
effectiveness of this approach heavily relies on the balance between
performance and efficiency of the draft model. In our research, we focus on
enhancing the proportion of draft tokens that are accepted to the final output
by generating multiple hypotheses instead of just one. This allows the LLM more
options to choose from and select the longest sequence that meets its
standards. Our analysis reveals that hypotheses produced by the draft model
share many common token sequences, suggesting a potential for optimizing
computation. Leveraging this observation, we introduce an innovative approach
utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. This
structure enables us to efficiently predict and merge recurring token
sequences, vastly reducing the computational demands of the draft model. We
term this approach Graph-structured Speculative Decoding (GSD). We apply GSD
across a range of LLMs, including a 70-billion parameter LLaMA-2 model, and
observe a remarkable speedup of 1.73$\times$ to 1.96$\times$, significantly
surpassing standard speculative decoding.
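
The merging idea described above can be illustrated with a much simpler
structure than the paper's DAG. The sketch below is an assumption-laden toy,
not the GSD algorithm: it merges only shared prefixes into a trie (a special
case of a DAG), and the function names, the toy drafts, and the stand-in
verifier are all hypothetical.

```python
def build_token_trie(hypotheses):
    """Merge draft hypotheses so that shared prefixes are stored only once."""
    root = {}
    for tokens in hypotheses:
        node = root
        for token in tokens:
            node = node.setdefault(token, {})
    return root


def count_nodes(trie):
    """Count the distinct token nodes stored in the trie."""
    return sum(1 + count_nodes(child) for child in trie.values())


def longest_accepted_path(trie, accept):
    """Return the longest path whose every prefix passes the `accept` check."""
    best = []
    stack = [(trie, [])]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for token, child in node.items():
            candidate = path + [token]
            if accept(candidate):
                stack.append((child, candidate))
    return best


# Toy drafts (hypothetical data): 9 tokens in total, with shared prefixes.
drafts = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
trie = build_token_trie(drafts)

# Stand-in verifier: accepts a prefix only if it matches a fixed target.
target = ["the", "cat", "ran", "away"]
accepted = longest_accepted_path(trie, lambda p: p == target[: len(p)])
```

In this toy case the nine drafted tokens collapse into six trie nodes, and the
stand-in verifier keeps the three-token path "the cat ran". A real system
would replace the lambda with the target LLM's acceptance rule.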
[LINK]
http://arxiv.org/abs/2407.16207v1
[DATE]
2024-07-23 14:21:24+08:00
[CATEGORIES]
cs.CL
Don’t Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection
[AUTHORS]
Min Zhang, Jianfeng He, Taoran Ji, Chang-Tien Lu
[COMMENTS]
ACL 2024 Main Conference
[LINK]
http://arxiv.org/abs/2402.11406v3
[DATE]
2024-07-23 14:20:32+08:00
[CATEGORIES]
cs.CL
Development of Compositionality and Generalization through Interactive Learning of Language and Action of Robots
[AUTHORS]
Prasanna Vijayaraghavan, Jeffrey Frederic Queisser, Sergio Verduzco Flores, Jun Tani
[ABSTRACT]
Humans excel at applying learned behavior to unlearned situations. A crucial
component of this generalization behavior is our ability to compose/decompose a
whole into reusable parts, an attribute known as compositionality. One of the
fundamental questions in robotics concerns this characteristic. “How can
linguistic compositionality be developed concomitantly with sensorimotor skills
through associative learning, particularly when individuals only learn partial
linguistic compositions and their corresponding sensorimotor patterns?” To
address this question, we propose a brain-inspired neural network model that
integrates vision, proprioception, and language into a framework of predictive
coding and active inference, based on the free-energy principle. The
effectiveness and capabilities of this model were assessed through various
simulation experiments conducted with a robot arm. Our results show that
generalization in learning to unlearned verb-noun compositions is
significantly enhanced when training variations of task composition are
increased. We attribute this to self-organized compositional structures in
linguistic latent state space being influenced significantly by sensorimotor
learning. Ablation studies show that visual attention and working memory are
essential to accurately generate visuo-motor sequences to achieve
linguistically represented goals. These insights advance our understanding of
mechanisms underlying development of compositionality through interactions of
linguistic and sensorimotor experience.
[COMMENTS]
64 pages, 6 figures, 10 supplementary figures
[LINK]
http://arxiv.org/abs/2403.19995v2
[DATE]
2024-07-23 13:21:44+08:00
[CATEGORIES]
cs.CL
Structural Optimization Ambiguity and Simplicity Bias in Unsupervised Neural Grammar Induction
[AUTHORS]
Jinwook Park, Kangil Kim
[ABSTRACT]
Neural parameterization has significantly advanced unsupervised grammar
induction. However, training these models with a traditional likelihood loss
for all possible parses exacerbates two issues: 1) $\textit{structural
optimization ambiguity}$ that arbitrarily selects one among structurally
ambiguous optimal grammars despite the specific preference of gold parses, and
2) $\textit{structural simplicity bias}$ that leads a model to underutilize
rules to compose parse trees. These challenges subject unsupervised neural
grammar induction (UNGI) to inevitable prediction errors, high variance, and
the necessity for extensive grammars to achieve accurate predictions. This
paper tackles these issues, offering a comprehensive analysis of their origins.
As a solution, we introduce $\textit{sentence-wise parse-focusing}$ to reduce
the parse pool per sentence for loss evaluation, using the structural bias from
pre-trained parsers on the same dataset. In unsupervised parsing benchmark
tests, our method significantly improves performance while effectively reducing
variance and bias toward overly simplistic parses. Our research promotes
learning more compact, accurate, and consistent explicit grammars, facilitating
better interpretability.
[COMMENTS]
Accepted in ACL2024 Findings, 16 pages, 10 figures
[LINK]
http://arxiv.org/abs/2407.16181v1
[DATE]
2024-07-23 12:57:03+08:00
[CATEGORIES]
cs.CL
Rescue: Ranking LLM Responses with Partial Ordering to Improve Response Generation
[AUTHORS]
Yikun Wang, Rui Zheng, Haoming Li, Qi Zhang, Tao Gui, Fei Liu
[ABSTRACT]
Customizing LLMs for a specific task involves separating high-quality
responses from lower-quality ones. This skill can be developed using supervised
fine-tuning with extensive human preference data. However, obtaining a large
volume of expert-annotated data is costly for most tasks. In this paper, we
explore a novel method to optimize LLMs using ranking metrics. This method
trains the model to prioritize the best responses from a pool of candidates
created for a particular task. Rather than a traditional full ordering, we
advocate for a partial ordering, as achieving consensus on the perfect order of
candidate responses can be challenging. Our partial ordering is more robust,
less sensitive to noise, and can be achieved with limited human annotations or
through heuristic methods. We test our system’s improved response generation
ability using benchmark datasets, including textual entailment and
multi-document question answering. We conduct ablation studies to understand
crucial factors, such as how to gather candidate responses for a specific task,
determine their most suitable order, and balance supervised fine-tuning with
ranking metrics. Our approach, named Rescue, offers a promising avenue for
enhancing the response generation and task accuracy of LLMs.
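
One way to picture training on a partial ordering is sketched below. The
abstract does not state the loss function, so the coarse quality tiers, the
pairwise hinge loss, and both function names are assumptions made only for
illustration. The key point shown is that candidates in the same tier are
never compared.

```python
def partial_order_pairs(tiers):
    """Return (better, worse) pairs for candidates in different tiers.

    Candidates that share a tier are left unordered on purpose.
    """
    names = sorted(tiers)
    return [(a, b) for a in names for b in names if tiers[a] > tiers[b]]


def pairwise_hinge_loss(scores, pairs, margin=1.0):
    """Average hinge penalty over the ordered pairs (assumed loss form)."""
    if not pairs:
        return 0.0
    total = sum(max(0.0, margin - (scores[a] - scores[b])) for a, b in pairs)
    return total / len(pairs)


# Hypothetical candidates: "b" and "c" are judged equally good.
tiers = {"a": 2, "b": 1, "c": 1, "d": 0}
pairs = partial_order_pairs(tiers)
```

With four candidates a full ordering would impose six pairwise constraints;
this partial ordering imposes five, because the tied pair is skipped.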
[COMMENTS]
ACL 2024 SRW
[LINK]
http://arxiv.org/abs/2311.09136v3
[DATE]
2024-07-23 12:35:45+08:00
[CATEGORIES]
cs.CL
Robust Privacy Amidst Innovation with Large Language Models Through a Critical Assessment of the Risks
[AUTHORS]
Yao-Shun Chuang, Atiquer Rahman Sarkar, Noman Mohammed, Xiaoqian Jiang
[ABSTRACT]
This study examines integrating EHRs and NLP with large language models
(LLMs) to improve healthcare data management and patient care. It focuses on
using advanced models to create secure, HIPAA-compliant synthetic patient notes
for biomedical research. The study used de-identified and re-identified MIMIC
III datasets with GPT-3.5, GPT-4, and Mistral 7B to generate synthetic notes.
Text generation employed templates and keyword extraction for contextually
relevant notes, with one-shot generation for comparison. Privacy assessment
checked PHI occurrence, while text utility was tested using an ICD-9 coding
task. Text quality was evaluated with ROUGE and cosine similarity metrics to
measure semantic similarity with source notes. Analysis of PHI occurrence and
text utility via the ICD-9 coding task showed that the keyword-based method had
low risk and good performance. One-shot generation showed the highest PHI
exposure and PHI co-occurrence, especially in geographic location and date
categories. The Normalized One-shot method achieved the highest classification
accuracy. Privacy analysis revealed a critical balance between data utility and
privacy protection, influencing future data use and sharing. Re-identified data
consistently outperformed de-identified data. This study demonstrates the
effectiveness of keyword-based methods in generating privacy-protecting
synthetic clinical notes that retain data usability, potentially transforming
clinical data-sharing practices. The superior performance of re-identified over
de-identified data suggests a shift towards methods that enhance utility and
privacy by using dummy PHIs to perplex privacy attacks.
[COMMENTS]
13 pages, 4 figures, 1 table, 1 supplementary, under review
[LINK]
http://arxiv.org/abs/2407.16166v1
[DATE]
2024-07-23 12:20:14+08:00
[CATEGORIES]
cs.CL
LLMExplainer: Large Language Model based Bayesian Inference for Graph Explanation Generation
[AUTHORS]
Jiaxing Zhang, Jiayi Liu, Dongsheng Luo, Jennifer Neville, Hua Wei
[ABSTRACT]
Recent studies seek to provide Graph Neural Network (GNN) interpretability
via multiple unsupervised learning models. Due to the scarcity of datasets,
current methods easily suffer from learning bias. To solve this problem, we
embed a Large Language Model (LLM) as knowledge into the GNN explanation
network to avoid the learning bias problem. We inject LLM as a Bayesian
Inference (BI) module to mitigate learning bias. The efficacy of the BI module
has been proven both theoretically and experimentally. We conduct experiments
on both synthetic and real-world datasets. The innovation of our work lies in
two parts: 1. We provide a novel view of the possibility of an LLM functioning
as a Bayesian inference to improve the performance of existing algorithms; 2.
We are the first to discuss the learning bias issues in the GNN explanation
problem.
[COMMENTS]
Preprint Paper with 13 pages
[LINK]
http://arxiv.org/abs/2407.15351v2
[DATE]
2024-07-23 12:01:19+08:00
[CATEGORIES]
cs.LG
cs.CL
UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models
[AUTHORS]
Liu Qi, He Yongyi, Lian Defu, Zheng Zhi, Xu Tong, Liu Che, Chen Enhong
[ABSTRACT]
Multimodal Entity Linking (MEL) is a crucial task that aims at linking
ambiguous mentions within multimodal contexts to the referent entities in a
multimodal knowledge base, such as Wikipedia. Existing methods focus heavily on
using complex mechanisms and extensive model tuning methods to model the
multimodal interaction on specific datasets. However, these methods
overcomplicate the MEL task and overlook the visual semantic information, which
makes them costly and hard to scale. Moreover, these methods cannot solve
issues such as textual ambiguity, redundancy, and noisy images, which severely
degrade their performance. Fortunately, the advent of Large Language Models
(LLMs) with robust capabilities in text understanding and reasoning,
particularly Multimodal Large Language Models (MLLMs) that can process
multimodal inputs, provides new insights into addressing this challenge.
However, how to design a universally applicable LLMs-based MEL approach remains
a pressing challenge. To this end, we propose UniMEL, a unified framework which
establishes a new paradigm to process multimodal entity linking tasks using
LLMs. In this framework, we employ LLMs to augment the representation of
mentions and entities individually by integrating textual and visual
information and refining textual information. Subsequently, we employ the
embedding-based method for retrieving and re-ranking candidate entities. Then,
with only ~0.26% of the model parameters fine-tuned, LLMs can make the final
selection from the candidate entities. Extensive experiments on three public
benchmark datasets demonstrate that our solution achieves state-of-the-art
performance, and ablation studies verify the effectiveness of all modules. Our
code is available at https://anonymous.4open.science/r/UniMEL/.
[COMMENTS]
CIKM 2024. The first two authors contributed equally to this work
[LINK]
http://arxiv.org/abs/2407.16160v1
[DATE]
2024-07-23 11:58:08+08:00
[CATEGORIES]
cs.CL
CHIME: LLM-Assisted Hierarchical Organization of Scientific Studies for Literature Review Support
[AUTHORS]
Chao-Chun Hsu, Erin Bransom, Jenna Sparks, Bailey Kuehl, Chenhao Tan, David Wadden, Lucy Lu Wang, Aakanksha Naik
[ABSTRACT]
Literature review requires researchers to synthesize a large amount of
information and is increasingly challenging as the scientific literature
expands. In this work, we investigate the potential of LLMs for producing
hierarchical organizations of scientific studies to assist researchers with
literature review. We define hierarchical organizations as tree structures
where nodes refer to topical categories and every node is linked to the studies
assigned to that category. Our naive LLM-based pipeline for hierarchy
generation from a set of studies produces promising yet imperfect hierarchies,
motivating us to collect CHIME, an expert-curated dataset for this task focused
on biomedicine. Given the challenging and time-consuming nature of building
hierarchies from scratch, we use a human-in-the-loop process in which experts
correct errors (both links between categories and study assignment) in
LLM-generated hierarchies. CHIME contains 2,174 LLM-generated hierarchies
covering 472 topics, and expert-corrected hierarchies for a subset of 100
topics. Expert corrections allow us to quantify LLM performance, and we find
that while they are quite good at generating and organizing categories, their
assignment of studies to categories could be improved. We attempt to train a
corrector model with human feedback which improves study assignment by 12.6 F1
points. We release our dataset and models to encourage research on developing
better assistive tools for literature review.
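
The hierarchy definition above maps naturally onto a small tree type. The
sketch below is only an illustration: the `Category` class, the example
categories and study IDs, and the set-based F1 over (category, study) pairs
are assumptions, and the paper's own evaluation may be computed differently.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A topical category node linked to its assigned studies."""
    name: str
    studies: set = field(default_factory=set)
    children: list = field(default_factory=list)


def assignments(node):
    """Collect every (category name, study id) pair in the subtree."""
    pairs = {(node.name, study) for study in node.studies}
    for child in node.children:
        pairs |= assignments(child)
    return pairs


def assignment_f1(predicted, gold):
    """F1 between predicted and gold study assignments (assumed metric)."""
    pred, ref = assignments(predicted), assignments(gold)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical hierarchies: the prediction misplaces study "s2".
gold = Category("root", children=[
    Category("drug", {"s1", "s2"}), Category("diet", {"s3"})])
predicted = Category("root", children=[
    Category("drug", {"s1"}), Category("diet", {"s2", "s3"})])
score = assignment_f1(predicted, gold)
```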
[COMMENTS]
2024 ACL Findings
[LINK]
http://arxiv.org/abs/2407.16148v1
[DATE]
2024-07-23 11:18:00+08:00
[CATEGORIES]
cs.CL
Will the Real Linda Please Stand up…to Large Language Models? Examining the Representativeness Heuristic in LLMs
[AUTHORS]
Pengda Wang, Zilin Xiao, Hanjie Chen, Frederick L. Oswald
[ABSTRACT]
Although large language models (LLMs) have demonstrated remarkable
proficiency in modeling text and generating human-like text, they may exhibit
biases acquired from training data in doing so. Specifically, LLMs may be
susceptible to a common cognitive trap in human decision-making called the
representativeness heuristic. This is a concept in psychology that refers to
judging the likelihood of an event based on how closely it resembles a
well-known prototype or typical example, versus considering broader facts or
statistical evidence. This research investigates the impact of the
representativeness heuristic on LLM reasoning. We created ReHeAT
(Representativeness Heuristic AI Testing), a dataset containing a series of
problems spanning six common types of representativeness heuristics.
Experiments reveal that four LLMs applied to ReHeAT all exhibited
representativeness heuristic biases. We further identify that the model’s
reasoning steps are often incorrectly based on a stereotype rather than on the
problem’s description. Interestingly, the performance improves when adding a
hint in the prompt to remind the model to use its knowledge. This suggests the
uniqueness of the representativeness heuristic compared to traditional biases.
It can occur even when LLMs possess the correct knowledge while falling into a
cognitive trap. This highlights the importance of future research focusing on
the representativeness heuristic in model reasoning and decision-making and on
developing solutions to address it.
[COMMENTS]
Published as a conference paper at COLM 2024
[LINK]
http://arxiv.org/abs/2404.01461v4
[DATE]
2024-07-23 10:41:57+08:00
[CATEGORIES]
cs.CL
Finetuning Generative Large Language Models with Discrimination Instructions for Knowledge Graph Completion
[AUTHORS]
Yang Liu, Xiaobin Tian, Zequn Sun, Wei Hu
[ABSTRACT]
Traditional knowledge graph (KG) completion models learn embeddings to
predict missing facts. Recent works attempt to complete KGs in a
text-generation manner with large language models (LLMs). However, they need to
ground the output of LLMs to KG entities, which inevitably brings errors. In
this paper, we present a finetuning framework, DIFT, aiming to unleash the KG
completion ability of LLMs and avoid grounding errors. Given an incomplete
fact, DIFT employs a lightweight model to obtain candidate entities and
finetunes an LLM with discrimination instructions to select the correct one
from the given candidates. To improve performance while reducing instruction
data, DIFT uses a truncated sampling method to select useful facts for
finetuning and injects KG embeddings into the LLM. Extensive experiments on
benchmark datasets demonstrate the effectiveness of our proposed framework.
[COMMENTS]
Accepted in the 23rd International Semantic Web Conference (ISWC
2024)
[LINK]
http://arxiv.org/abs/2407.16127v1
[DATE]
2024-07-23 10:25:01+08:00
[CATEGORIES]
cs.CL
Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities
[AUTHORS]
Tianjie Ju, Yiting Wang, Xinbei Ma, Pengzhou Cheng, Haodong Zhao, Yulong Wang, Lifeng Liu, Jian Xie, Zhuosheng Zhang, Gongshen Liu
[ABSTRACT]
The rapid adoption of large language models (LLMs) in multi-agent systems has
highlighted their impressive capabilities in various applications, such as
collaborative problem-solving and autonomous negotiation. However, the security
implications of these LLM-based multi-agent systems have not been thoroughly
investigated, particularly concerning the spread of manipulated knowledge. In
this paper, we investigate this critical issue by constructing a detailed
threat model and a comprehensive simulation environment that mirrors real-world
multi-agent deployments in a trusted platform. Subsequently, we propose a novel
two-stage attack method involving Persuasiveness Injection and Manipulated
Knowledge Injection to systematically explore the potential for manipulated
knowledge (i.e., counterfactual and toxic knowledge) spread without explicit
prompt manipulation.
Our method leverages the inherent vulnerabilities of LLMs in handling world
knowledge, which can be exploited by attackers to unconsciously spread
fabricated information. Through extensive experiments, we demonstrate that our
attack method can successfully induce LLM-based agents to spread both
counterfactual and toxic knowledge without degrading their foundational
capabilities during agent communication. Furthermore, we show that these
manipulations can persist through popular retrieval-augmented generation
frameworks, where several benign agents store and retrieve manipulated chat
histories for future interactions. This persistence indicates that even after
the interaction has ended, the benign agents may continue to be influenced by
manipulated knowledge. Our findings reveal significant security risks in
LLM-based multi-agent systems, emphasizing the imperative need for robust
defenses against manipulated knowledge spread, such as introducing “guardian”
agents and advanced fact-checking tools.
[COMMENTS]
18 pages, work in progress
[LINK]
http://arxiv.org/abs/2407.07791v2
[DATE]
2024-07-23 09:59:54+08:00
[CATEGORIES]
cs.CL
Analyzing the Polysemy Evolution using Semantic Cells
[AUTHORS]
Yukio Ohsawa, Dingming Xue, Kaira Sekiguchi
[ABSTRACT]
The senses of words evolve. The sense of the same word may change from today
to tomorrow, and multiple senses of the same word may be the result of the
evolution of each other, that is, they may be parents and children. If we view
Juba as an evolving ecosystem, the paradigm of learning the correct answer,
which does not move with the sense of a word, is no longer valid. This paper is
a case study that shows that word polysemy is an evolutionary consequence of
the modification of Semantic Cells, which has already been presented by the
author, by introducing a small amount of diversity in its initial state as an
example of analyzing the current set of short sentences. In particular, the
analysis of a sentence sequence of 1000 sentences in some order for each of the
four senses of the word Spring, collected using ChatGPT, shows that the word
acquires the most polysemy monotonically in the analysis when the senses are
arranged in the order in which they have evolved. In other words, we present a
method for analyzing the dynamism of a word’s acquiring polysemy with evolution
and, at the same time, a methodology for viewing polysemy from an evolutionary
framework rather than a learning-based one.
[COMMENTS]
11 pages, 2 figures. arXiv admin note: text overlap with
arXiv:2404.14749
[LINK]
http://arxiv.org/abs/2407.16110v1
[DATE]
2024-07-23 08:52:12+08:00
[CATEGORIES]
cs.CL
Time Sensitive Knowledge Editing through Efficient Finetuning
[AUTHORS]
Xiou Ge, Ali Mousavi, Edouard Grave, Armand Joulin, Kun Qian, Benjamin Han, Mostafa Arefiyan, Yunyao Li
[COMMENTS]
ACL 2024 main
[LINK]
http://arxiv.org/abs/2406.04496v2
[DATE]
2024-07-23 08:46:37+08:00
[CATEGORIES]
cs.CL
cs.LG
MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems
[AUTHORS]
Bin Lei, Yi Zhang, Shan Zuo, Ali Payani, Caiwen Ding
[ABSTRACT]
Recent advancements in large language models, such as GPT-4, have
demonstrated remarkable capabilities in processing standard queries. Despite
these advancements, their performance substantially declines in
advanced mathematical problems requiring complex, multi-step logical
reasoning. To enhance their inferential capabilities, current research has
delved into prompt engineering, exemplified by methodologies such
as the Tree of Thought and Graph of Thought. Nonetheless, these existing
approaches encounter two significant limitations. Firstly, their effectiveness
in tackling complex mathematical problems is somewhat constrained. Secondly,
the necessity to design distinct prompts for individual problems hampers their
generalizability. In response to these limitations, this paper introduces the
Multi-Agent System for conditional Mining (MACM) prompting
method. It not only resolves intricate mathematical problems but also
demonstrates strong generalization capabilities across various mathematical
contexts. With the assistance of MACM, the accuracy of GPT-4 Turbo on the most
challenging level five mathematical problems in the MATH dataset increases from
54.68% to 76.73%. The code is available at
https://github.com/bin123apple/MACM.
[LINK]
http://arxiv.org/abs/2404.04735v2
[DATE]
2024-07-23 06:37:40+08:00
[CATEGORIES]
cs.CL
KaPQA: Knowledge-Augmented Product Question-Answering
[AUTHORS]
Swetha Eppalapally, Daksh Dangi, Chaithra Bhat, Ankita Gupta, Ruiyi Zhang, Shubham Agarwal, Karishma Bagga, Seunghyun Yoon, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt
[COMMENTS]
Accepted at the ACL 2024 Workshop on Knowledge Augmented Methods for
NLP
[LINK]
http://arxiv.org/abs/2407.16073v1
[DATE]
2024-07-23 06:14:56+08:00
[CATEGORIES]
cs.CL
SyllabusQA: A Course Logistics Question Answering Dataset
[AUTHORS]
Nigel Fernandez, Alexander Scarlatos, Andrew Lan
[ABSTRACT]
Automated teaching assistants and chatbots have significant potential to
reduce the workload of human instructors, especially for logistics-related
question answering, which is important to students yet repetitive for
instructors. However, due to privacy concerns, there is a lack of publicly
available datasets. We introduce SyllabusQA, an open-source dataset with 63
real course syllabi covering 36 majors, containing 5,078 open-ended course
logistics-related question-answer pairs that are diverse in both question types
and answer formats. Since many logistics-related questions contain critical
information like the date of an exam, it is important to evaluate the
factuality of answers. We benchmark several strong baselines on this task, from
large language model prompting to retrieval-augmented generation. We introduce
Fact-QA, an LLM-based (GPT-4) evaluation metric to evaluate the factuality of
predicted answers. We find that despite performing close to humans on
traditional metrics of textual similarity, there remains a significant gap
between automated approaches and humans in terms of fact precision.
[COMMENTS]
ACL 2024: The 62nd Annual Meeting of the Association for
Computational Linguistics
[LINK]
http://arxiv.org/abs/2403.14666v2
[DATE]
2024-07-23 04:37:55+08:00
[CATEGORIES]
cs.CL
cs.LG
Enhancing Temporal Understanding in LLMs for Semi-structured Tables
[AUTHORS]
Irwin Deng, Kushagra Dixit, Vivek Gupta, Dan Roth
[COMMENTS]
Total Pages 18, Total Tables 6, Total figures 7
[LINK]
http://arxiv.org/abs/2407.16030v1
[DATE]
2024-07-23 04:13:10+08:00
[CATEGORIES]
cs.CL
cs.LG
Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation
[AUTHORS]
Jiaming Shen, Ran Xu, Yennie Jun, Zhen Qin, Tianqi Liu, Carl Yang, Yi Liang, Simon Baumgartner, Michael Bendersky
[ABSTRACT]
Reward models (RMs) are crucial for aligning large language models (LLMs)
with human preferences. They are trained using preference datasets where each
example consists of one input prompt, two responses, and a preference label. As
curating a high-quality human labeled preference dataset is both time-consuming
and expensive, people often rely on existing powerful LLMs for preference label
generation. This can potentially introduce noise and impede RM training. In
this work, we present RMBoost, a novel synthetic preference data generation
paradigm to boost reward model quality. Unlike traditional methods, which
generate two responses before obtaining the preference label, RMBoost first
generates one response and selects a preference label, followed by generating
the second more (or less) preferred response conditioned on the pre-selected
preference label and the first response. This approach offers two main
advantages. First, RMBoost reduces labeling noise since preference pairs are
constructed intentionally. Second, RMBoost facilitates the creation of more
diverse responses by incorporating various quality aspects (e.g., helpfulness,
relevance, completeness) into the prompts. We conduct extensive experiments
across three diverse datasets and demonstrate that RMBoost outperforms other
synthetic preference data generation techniques and significantly boosts the
performance of four distinct reward models.
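
The generation order described above (first response, then label, then a
conditioned second response) can be sketched as follows. This is a minimal
illustration under stated assumptions: `generate` is a hypothetical callable
standing in for an LLM, and the prompt wording and the function name are
invented here, not taken from the paper.

```python
import random


def make_preference_pair(prompt, generate, aspects, rng):
    """Build one preference pair in the label-first order described above."""
    # Step 1: produce a single response to the original prompt.
    first = generate(prompt)
    # Step 2: pick the preference label before the second response exists.
    second_is_better = rng.random() < 0.5
    aspect = rng.choice(aspects)
    direction = "more" if second_is_better else "less"
    # Step 3: generate the second response conditioned on label and response.
    conditioned_prompt = (
        f"{prompt}\n\nExisting response:\n{first}\n\n"
        f"Write a new response that should be {direction} preferred than "
        f"the existing one with respect to {aspect}."
    )
    second = generate(conditioned_prompt)
    chosen, rejected = (second, first) if second_is_better else (first, second)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected,
            "aspect": aspect}
```

Because the label is fixed before the second response is written, the pair
needs no separate labeling pass; how well the second response actually follows
the instruction still depends on the generator.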
[LINK]
http://arxiv.org/abs/2407.16008v1
[DATE]
2024-07-23 03:21:55+08:00
[CATEGORIES]
cs.CL
SocialQuotes: Learning Contextual Roles of Social Media Quotes on the Web
[AUTHORS]
John Palowitch, Hamidreza Alvari, Mehran Kazemi, Tanvir Amin, Filip Radlinski
[ABSTRACT]
Web authors frequently embed social media to support and enrich their
content, creating the potential to derive web-based, cross-platform social
media representations that can enable more effective social media retrieval
systems and richer scientific analyses. As a step toward such capabilities, we
introduce a novel language modeling framework that enables automatic annotation
of roles that social media entities play in their embedded web context. Using
related communication theory, we liken social media embeddings to quotes,
formalize the page context as structured natural language signals, and identify
a taxonomy of roles for quotes within the page context. We release
SocialQuotes, a new data set built from the Common Crawl of over 32 million
social quotes, 8.3k of them with crowdsourced quote annotations. Using
SocialQuotes and the accompanying annotations, we provide a role classification
case study, showing reasonable performance with modern-day LLMs, and exposing
explainable aspects of our framework via page content ablations. We also
classify a large batch of un-annotated quotes, revealing interesting
cross-domain, cross-platform role distributions on the web.
[LINK]
http://arxiv.org/abs/2407.16007v1
[DATE]
2024-07-23 03:21:01+08:00
[CATEGORIES]
cs.CL
Multimodal Input Aids a Bayesian Model of Phonetic Learning
[AUTHORS]
Sophia Zhi, Roger P. Levy, Stephan C. Meylan
[COMMENTS]
12 pages, 5 figures
[LINK]
http://arxiv.org/abs/2407.15992v1
[DATE]
2024-07-23 03:00:11+08:00
[CATEGORIES]
cs.CL
UQA: Corpus for Urdu Question Answering
[AUTHORS]
Samee Arif, Sualeha Farid, Awais Athar, Agha Ali Raza
[ABSTRACT]
This paper introduces UQA, a novel dataset for question answering and text
comprehension in Urdu, a low-resource language with over 70 million native
speakers. UQA is generated by translating the Stanford Question Answering
Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called
EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in
the translated context paragraphs. The paper describes the process of selecting
and evaluating the best translation model among two candidates: Google
Translator and Seamless M4T. The paper also benchmarks several state-of-the-art
multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and
reports promising results: XLM-RoBERTa-XL achieves an F1 score of 85.99 and an
exact match (EM) score of 74.56. UQA is a valuable resource for developing and testing multilingual
NLP systems for Urdu and for enhancing the cross-lingual transferability of
existing models. Further, the paper demonstrates the effectiveness of EATS for
creating high-quality datasets for other languages and domains. The UQA dataset
and the code are publicly available at www.github.com/sameearif/UQA.
[LINK]
http://arxiv.org/abs/2405.01458v2
[DATE]
2024-07-23 02:46:11+08:00
[CATEGORIES]
cs.CL
cs.LG
Lynx: An Open Source Hallucination Evaluation Model
[AUTHORS]
Selvan Sunitha Ravi, Bartosz Mielczarek, Anand Kannappan, Douwe Kiela, Rebecca Qian
[ABSTRACT]
Retrieval Augmented Generation (RAG) techniques aim to mitigate
hallucinations in Large Language Models (LLMs). However, LLMs can still produce
information that is unsupported or contradictory to the retrieved contexts. We
introduce LYNX, a SOTA hallucination detection LLM that is capable of advanced
reasoning on challenging real-world hallucination scenarios. To evaluate LYNX,
we present HaluBench, a comprehensive hallucination evaluation benchmark,
consisting of 15k samples sourced from various real-world domains. Our
experiment results show that LYNX outperforms GPT-4o, Claude-3-Sonnet, and
closed and open-source LLM-as-a-judge models on HaluBench. We release LYNX,
HaluBench and our evaluation code for public access.
[LINK]
http://arxiv.org/abs/2407.08488v2
[DATE]
2024-07-23 02:41:53+08:00
[CATEGORIES]
cs.CL
Multilingual Fine-Grained News Headline Hallucination Detection
[AUTHORS]
Jiaming Shen, Tianqi Liu, Jialu Liu, Zhen Qin, Jay Pavagadhi, Simon Baumgartner, Michael Bendersky
[ABSTRACT]
The popularity of automated news headline generation has surged with
advancements in pre-trained language models. However, these models often suffer
from the “hallucination” problem, where the generated headline is not fully
supported by its source article. Efforts to address this issue have
predominantly focused on English, using over-simplistic classification schemes
that overlook nuanced hallucination types. In this study, we introduce the
first multilingual, fine-grained news headline hallucination detection dataset
that contains over 11 thousand pairs in 5 languages, each annotated with
detailed hallucination types by experts. We conduct extensive experiments on
this dataset under two settings. First, we implement several supervised
fine-tuning approaches as preparatory solutions and demonstrate this dataset’s
challenges and utilities. Second, we test various large language models’
in-context learning abilities and propose two novel techniques,
language-dependent demonstration selection and coarse-to-fine prompting, to
boost the few-shot hallucination detection performance in terms of the
example-F1 metric. We release this dataset to foster further research in
multilingual, fine-grained headline hallucination detection.
[LINK]
http://arxiv.org/abs/2407.15975v1
[DATE]
2024-07-23 02:37:53+08:00
[CATEGORIES]
cs.CL
Schema-Driven Information Extraction from Heterogeneous Tables
[AUTHORS]
Fan Bai, Junmo Kang, Gabriel Stanovsky, Dayne Freitag, Mark Dredze, Alan Ritter
[ABSTRACT]
In this paper, we explore the question of whether large language models can
support cost-efficient information extraction from tables. We introduce
schema-driven information extraction, a new task that transforms tabular data
into structured records following a human-authored schema. To assess various
LLMs’ capabilities on this task, we present a benchmark comprising tables
from four diverse domains: machine learning papers, chemistry literature,
material science journals, and webpages. We use this collection of annotated
tables to evaluate the ability of open-source and API-based language models to
extract information from tables covering diverse domains and data formats. Our
experiments demonstrate that surprisingly competitive performance can be
achieved without requiring task-specific pipelines or labels, achieving F1
scores ranging from 74.2 to 96.1, while maintaining cost efficiency. Moreover,
through detailed ablation studies and analyses, we investigate the factors
contributing to model success and validate the practicality of distilling
compact models to reduce API reliance.
[LINK]
http://arxiv.org/abs/2305.14336v4
[DATE]
2024-07-23 02:22:08+08:00
[CATEGORIES]
cs.CL
Foundation Models for Autonomous Robots in Unstructured Environments
[AUTHORS]
Hossein Naderi, Alireza Shojaei, Lifu Huang
[ABSTRACT]
Automating activities through robots in unstructured environments, such as
construction sites, has been a long-standing desire. However, the high degree
of unpredictable events in these settings has resulted in far less adoption
compared to more structured settings, such as manufacturing, where robots can
be hard-coded or trained on narrowly defined datasets. Recently, pretrained
foundation models, such as Large Language Models (LLMs), have demonstrated
superior generalization capabilities by providing zero-shot solutions for
problems not present in the training data, proposing them as a potential
solution for introducing robots to unstructured environments. To this end, this
study investigates potential opportunities and challenges of pretrained
foundation models from a multi-dimensional perspective. The study
systematically reviews applications of foundation models in the two fields of
robotics and unstructured environments and then synthesizes them with deliberative acting
theory. Findings showed that linguistic capabilities of LLMs have been utilized
more than other features for improving perception in human-robot interactions.
On the other hand, findings showed that the use of LLMs demonstrated more
applications in project management and safety in construction, and natural
hazard detection in disaster management. Synthesizing these findings, we
located the current state-of-the-art in this field on a five-level scale of
automation, placing them at conditional automation. This assessment was then
used to envision future scenarios, challenges, and solutions toward autonomous
safe unstructured environments. Our study can be seen as a benchmark to track
our progress toward that future.
[COMMENTS]
arXiv admin note: text overlap with arXiv:2312.07843,
arXiv:2402.05741 by other authors
[LINK]
http://arxiv.org/abs/2407.14296v2
[DATE]
2024-07-23 01:55:26+08:00
[CATEGORIES]
cs.CL
Benchmarks as Microscopes: A Call for Model Metrology
[AUTHORS]
Michael Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra
[ABSTRACT]
Modern language models (LMs) pose a new challenge in capability assessment.
Static benchmarks inevitably saturate without providing confidence in the
deployment tolerances of LM-based systems, but developers nonetheless claim
that their models have generalized traits such as reasoning or open-domain
language understanding based on these flawed metrics. The science and practice
of LMs requires a new approach to benchmarking which measures specific
capabilities with dynamic assessments. To be confident in our metrics, we need
a new discipline of model metrology – one which focuses on how to generate
benchmarks that predict performance under deployment. Motivated by our
evaluation criteria, we outline how building a community of model metrology
practitioners – one focused on building tools and studying how to measure
system capabilities – is the best way to meet these needs and add clarity
to the AI discussion.
[COMMENTS]
Conference paper at COLM 2024
[LINK]
http://arxiv.org/abs/2407.16711v1
[DATE]
2024-07-23 01:52:12+08:00
[CATEGORIES]
cs.CL
dMel: Speech Tokenization made Simple
[AUTHORS]
He Bai, Tatiana Likhomanenko, Ruixiang Zhang, Zijin Gu, Zakaria Aldeneh, Navdeep Jaitly
[ABSTRACT]
Large language models have revolutionized natural language processing by
leveraging self-supervised pretraining on vast textual data. Inspired by this
success, researchers have investigated complicated speech tokenization methods
to discretize continuous speech signals so that language modeling techniques
can be applied to speech data. However, existing approaches either model
semantic tokens, potentially losing acoustic information, or model acoustic
tokens, risking the loss of semantic information. Having multiple token types
also complicates the architecture and requires additional pretraining. Here we
show that discretizing mel-filterbank channels into discrete intensity bins
produces a simple representation (dMel) that performs better than other
existing speech tokenization methods. Using a transformer decoder-only
architecture for speech-text modeling, we comprehensively evaluate different
speech tokenization methods on speech recognition (ASR) and speech synthesis
(TTS). Our results demonstrate the effectiveness of dMel in achieving high
performance on both tasks within a unified framework, paving the way for
efficient and effective joint modeling of speech and text.
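
The binning idea described above can be illustrated with a minimal sketch. It assumes uniform bin edges over a fixed log-mel range; the function name `discretize_mel`, the bin count, and the `lo`/`hi` range are hypothetical choices for illustration, since the abstract does not state the authors' actual settings.

```python
from typing import List


def discretize_mel(
    log_mel: List[List[float]],
    num_bins: int = 16,   # assumed bin count, not taken from the paper
    lo: float = -7.0,     # assumed lower edge of the log-mel range
    hi: float = 2.0,      # assumed upper edge of the log-mel range
) -> List[List[int]]:
    """Map each log-mel value to an integer bin index in [0, num_bins - 1].

    `log_mel` is a list of frames, each frame a list of per-channel values.
    Values outside [lo, hi] are clipped to the nearest edge before binning.
    """
    width = (hi - lo) / num_bins
    tokens = []
    for frame in log_mel:
        row = []
        for value in frame:
            clipped = min(max(value, lo), hi)
            index = int((clipped - lo) / width)
            # The upper edge itself would land in bin `num_bins`; fold it back.
            row.append(min(index, num_bins - 1))
        tokens.append(row)
    return tokens


if __name__ == "__main__":
    frames = [[0.0, 0.5, 1.5], [3.9, 4.0, 10.0]]
    print(discretize_mel(frames, num_bins=4, lo=0.0, hi=4.0))
```

Each frame becomes a vector of small integers, one per mel channel, so a single token type carries the whole spectrum slice. How those per-channel indices are then fed to the decoder is not covered by this sketch.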
[COMMENTS]
under review
[LINK]
http://arxiv.org/abs/2407.15835v1
[DATE]
2024-07-23 01:51:53+08:00
[CATEGORIES]
cs.CL
J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling
[AUTHORS]
Wataru Nakata, Kentaro Seki, Hitomi Yanaka, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari
[ABSTRACT]
Spoken dialogue plays a crucial role in human-AI interactions, necessitating
dialogue-oriented spoken language models (SLMs). To develop versatile SLMs,
large-scale and diverse speech datasets are essential. Additionally, to ensure
high-quality speech generation, the data must be spontaneous like in-the-wild data
and must be acoustically clean with noise removed. Despite the critical need,
no open-source corpus meeting all these criteria has been available. This study
addresses this gap by constructing and releasing a large-scale spoken dialogue
corpus, named Japanese Corpus for Human-AI Talks (J-CHAT), which is publicly
accessible. Furthermore, this paper presents a language-independent method for
corpus construction and describes experiments on dialogue generation using SLMs
trained on J-CHAT. Experimental results indicate that the data collected from
multiple domains by our method improve the naturalness and meaningfulness of
dialogue generation.
[COMMENTS]
8 pages, 6 figures
[LINK]
http://arxiv.org/abs/2407.15828v1
[DATE]
2024-07-23 01:46:50+08:00
[CATEGORIES]
cs.CL
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
[AUTHORS]
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
[ABSTRACT]
As large language models (LLMs) become increasingly prevalent across many
real-world applications, understanding and enhancing their robustness to
adversarial attacks is of paramount importance. Existing methods for
identifying adversarial prompts tend to focus on specific domains, lack
diversity, or require extensive human annotations. To address these
limitations, we present Rainbow Teaming, a novel black-box approach for
producing a diverse collection of adversarial prompts. Rainbow Teaming casts
adversarial prompt generation as a quality-diversity problem, and uses
open-ended search to generate prompts that are both effective and diverse.
Focusing on the safety domain, we use Rainbow Teaming to target various
state-of-the-art LLMs, including the Llama 2 and Llama 3 models. Our approach
reveals hundreds of effective adversarial prompts, with an attack success rate
exceeding 90% across all tested models. Furthermore, we demonstrate that
fine-tuning models with synthetic data generated by the Rainbow Teaming method
significantly enhances their safety without sacrificing general performance or
helpfulness. We additionally explore the versatility of Rainbow Teaming by
applying it to question answering and cybersecurity, showcasing its potential
to drive robust open-ended self-improvement in a wide range of applications.
[LINK]
http://arxiv.org/abs/2402.16822v2
[DATE]
2024-07-23 01:31:43+08:00
[CATEGORIES]
cs.CL
cs.LG
Perceptions of Linguistic Uncertainty by Language Models and Humans
[AUTHORS]
Catarina G Belem, Markelle Kelly, Mark Steyvers, Sameer Singh, Padhraic Smyth
[ABSTRACT]
Uncertainty expressions such as “probably” or “highly unlikely” are
pervasive in human language. While prior work has established that there is
population-level agreement in terms of how humans interpret these expressions,
there has been little inquiry into the abilities of language models to
interpret such expressions. In this paper, we investigate how language models
map linguistic expressions of uncertainty to numerical responses. Our approach
assesses whether language models can employ theory of mind in this setting:
understanding the uncertainty of another agent about a particular statement,
independently of the model’s own certainty about that statement. We evaluate
both humans and 10 popular language models on a task created to assess these
abilities. Unexpectedly, we find that 8 out of 10 models are able to map
uncertainty expressions to probabilistic responses in a human-like manner.
However, we observe systematically different behavior depending on whether a
statement is actually true or false. This sensitivity indicates that language
models are substantially more susceptible to bias based on their prior
knowledge (as compared to humans). These findings raise important questions and
have broad implications for human-AI alignment and AI-AI communication.
[COMMENTS]
In submission
[LINK]
http://arxiv.org/abs/2407.15814v1
[DATE]
2024-07-23 01:26:12+08:00
[CATEGORIES]
cs.CL
cs.LG
FSboard: Over 3 million characters of ASL fingerspelling collected via smartphones
[AUTHORS]
Manfred Georg, Garrett Tanzer, Saad Hassan, Maximus Shengelia, Esha Uboweja, Sam Sepah, Sean Forbes, Thad Starner
[ABSTRACT]
Progress in machine understanding of sign languages has been slow and
hampered by limited data. In this paper, we present FSboard, an American Sign
Language fingerspelling dataset situated in a mobile text entry use case,
collected from 147 paid and consenting Deaf signers using Pixel 4A selfie
cameras in a variety of environments. Fingerspelling recognition is an
incomplete solution that is only one small part of sign language translation,
but it could provide some immediate benefit to Deaf/Hard of Hearing signers as
more broadly capable technology develops. At >3 million characters in length
and >250 hours in duration, FSboard is the largest fingerspelling recognition
dataset to date by a factor of >10x. As a simple baseline, we finetune 30 Hz
MediaPipe Holistic landmark inputs into ByT5-Small and achieve 11.1% Character
Error Rate (CER) on a test set with unique phrases and signers. This quality
degrades gracefully when decreasing frame rate and excluding face/body
landmarks: plausible optimizations to help models run on device in real time.
[COMMENTS]
Access FSboard at https://www.kaggle.com/datasets/googleai/fsboard
[LINK]
http://arxiv.org/abs/2407.15806v1
[DATE]
2024-07-23 01:20:22+08:00
[CATEGORIES]
cs.CL
Extracting Structured Insights from Financial News: An Augmented LLM Driven Approach
[AUTHORS]
Rian Dolphin, Joe Dursun, Jonathan Chow, Jarrett Blankenship, Katie Adams, Quinton Pike
[ABSTRACT]
Financial news plays a crucial role in decision-making processes across the
financial sector, yet the efficient processing of this information into a
structured format remains challenging. This paper presents a novel approach to
financial news processing that leverages Large Language Models (LLMs) to
overcome limitations that previously prevented the extraction of structured
data from unstructured financial news. We introduce a system that extracts
relevant company tickers from raw news article content, performs sentiment
analysis at the company level, and generates summaries, all without relying on
pre-structured data feeds. Our methodology combines the generative capabilities
of LLMs, and recent prompting techniques, with a robust validation framework
that uses a tailored string similarity approach. Evaluation on a dataset of
5530 financial news articles demonstrates the effectiveness of our approach,
with 90% of articles not missing any tickers compared with current data
providers, and 22% of articles having additional relevant tickers. In addition
to this paper, the methodology has been implemented at scale with the resulting
processed data made available through a live API endpoint, which is updated in
real-time with the latest news. To the best of our knowledge, we are the first
data provider to offer granular, per-company sentiment analysis from news
articles, enhancing the depth of information available to market participants.
We also release the evaluation dataset of 5530 processed articles as a static
file, which we hope will facilitate further research leveraging financial news.
[COMMENTS]
7 pages, 6 figures
[LINK]
http://arxiv.org/abs/2407.15788v1
[DATE]
2024-07-23 00:47:31+08:00
[CATEGORIES]
cs.CL
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
[AUTHORS]
Haoning Wu, Dongxu Li, Bei Chen, Junnan Li
[ABSTRACT]
Large multimodal models (LMMs) are processing increasingly longer and richer
inputs. Despite this progress, few public benchmarks are available to measure such
development. To mitigate this gap, we introduce LongVideoBench, a
question-answering benchmark that features video-language interleaved inputs up
to an hour long. Our benchmark includes 3,763 varying-length web-collected
videos with their subtitles across diverse themes, designed to comprehensively
evaluate LMMs on long-term multimodal understanding. To achieve this, we
interpret the primary challenge as to accurately retrieve and reason over
detailed multimodal information from long inputs. As such, we formulate a novel
video question-answering task termed referring reasoning. Specifically, as part
of the question, it contains a referring query that references related video
contexts, called referred context. The model is then required to reason over
relevant video details from the referred context. Following the paradigm of
referring reasoning, we curate 6,678 human-annotated multiple-choice questions
in 17 fine-grained categories, establishing one of the most comprehensive
benchmarks for long-form video understanding. Evaluations suggest that the
LongVideoBench presents significant challenges even for the most advanced
proprietary models (e.g. GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their
open-source counterparts show an even larger performance gap. In addition, our
results indicate that model performance on the benchmark improves only when
models are capable of processing more frames, positioning LongVideoBench as a
valuable benchmark for evaluating future-generation long-context LMMs.
[COMMENTS]
29 pages
[LINK]
http://arxiv.org/abs/2407.15754v1
[DATE]
2024-07-23 00:00:55+08:00
[CATEGORIES]
cs.CL
cs.LG
SemiSFL: Split Federated Learning on Unlabeled and Non-IID Data
[AUTHORS]
Yang Xu, Yunming Liao, Hongli Xu, Zhipeng Sun, Liusheng Huang, Chunming Qiao
[ABSTRACT]
Federated Learning (FL) has emerged to allow multiple clients to
collaboratively train machine learning models on their private data at the
network edge. However, training and deploying large-scale models on
resource-constrained devices is challenging. Fortunately, Split Federated
Learning (SFL) offers a feasible solution by alleviating the computation and/or
communication burden on clients. However, existing SFL works often assume
sufficient labeled data on clients, which is usually impractical. Besides, data
non-IIDness poses another challenge to ensure efficient model training. To the
best of our knowledge, the above two issues have not been simultaneously addressed in
SFL. Herein, we propose a novel Semi-supervised SFL system, termed SemiSFL,
which incorporates clustering regularization to perform SFL with unlabeled and
non-IID client data. Moreover, our theoretical and experimental investigations
into model convergence reveal that the inconsistent training processes on
labeled and unlabeled data have an influence on the effectiveness of clustering
regularization. To mitigate the training inconsistency, we develop an algorithm
for dynamically adjusting the global updating frequency, so as to improve
training performance. Extensive experiments on benchmark models and datasets
show that our system provides a 3.8x speed-up in training time, reduces the
communication cost by about 70.3% while reaching the target accuracy, and
achieves up to 5.8% improvement in accuracy under non-IID scenarios compared to
the state-of-the-art baselines.
[COMMENTS]
16 pages
[LINK]
http://arxiv.org/abs/2307.15870v4
[DATE]
2024-07-23 23:30:32+08:00
[CATEGORIES]
cs.LG
Rendering Wireless Environments Useful for Gradient Estimators: A Zero-Order Stochastic Federated Learning Method
[AUTHORS]
Elissa Mhanna, Mohamad Assaad
[LINK]
http://arxiv.org/abs/2401.17460v2
[DATE]
2024-07-23 23:14:08+08:00
[CATEGORIES]
cs.LG
Era Splitting: Invariant Learning for Decision Trees
[AUTHORS]
Timothy DeLise
[ABSTRACT]
Real-life machine learning problems exhibit distributional shifts in the data
from one time to another or from one place to another. This behavior is beyond
the scope of the traditional empirical risk minimization paradigm, which
assumes i.i.d. distribution of data over time and across locations. The
emerging field of out-of-distribution (OOD) generalization addresses this
reality with new theory and algorithms which incorporate “environmental”, or
“era-wise” information into the algorithms. So far, most research has been
focused on linear models and/or neural networks. In this research we develop
two new splitting criteria for decision trees, which allow us to apply ideas
from OOD generalization research to decision tree models, namely, gradient
boosting decision trees (GBDTs). The new splitting criteria use era-wise
information associated with the data to grow tree-based models that are optimal
across all disjoint eras in the data, instead of optimal over the entire data
set pooled together, which is the default setting. In this paper, two new
splitting criteria are defined and analyzed theoretically. Effectiveness is
tested on four experiments, ranging from simple, synthetic to complex,
real-world applications. In particular we cast the OOD domain-adaptation
problem in the context of financial markets, where the new models out-perform
state-of-the-art GBDT models on the Numerai data set. The new criteria are
incorporated into the Scikit-Learn code base and made freely available online.
[COMMENTS]
29 pages, 9 figures, 3 tables, 2 algorithms
[LINK]
http://arxiv.org/abs/2309.14496v5
[DATE]
2024-07-23 23:01:06+08:00
[CATEGORIES]
cs.LG
Minimax Optimality of Score-based Diffusion Models: Beyond the Density Lower Bound Assumptions
[AUTHORS]
Kaihong Zhang, Caitlyn H. Yin, Feng Liang, Jingbo Liu
[ABSTRACT]
We study the asymptotic error of score-based diffusion model sampling in
large-sample scenarios from a non-parametric statistics perspective. We show
that a kernel-based score estimator achieves an optimal mean square error of
$\widetilde{O}\left(n^{-1} t^{-\frac{d+2}{2}}(t^{\frac{d}{2}} \vee 1)\right)$
for the score function of $p_0*\mathcal{N}(0,t\boldsymbol{I}_d)$, where $n$ and
$d$ represent the sample size and the dimension, $t$ is bounded above and below
by polynomials of $n$, and $p_0$ is an arbitrary sub-Gaussian distribution. As
a consequence, this yields an $\widetilde{O}\left(n^{-1/2}
t^{-\frac{d}{4}}\right)$ upper bound for the total variation error of the
distribution of the sample generated by the diffusion model under a mere
sub-Gaussian assumption. If in addition, $p_0$ belongs to the nonparametric
family of the $\beta$-Sobolev space with $\beta\le 2$, by adopting an early
stopping strategy, we obtain that the diffusion model is nearly (up to log
factors) minimax optimal. This removes the crucial lower bound assumption on
$p_0$ in previous proofs of the minimax optimality of the diffusion model for
nonparametric families.
[LINK]
http://arxiv.org/abs/2402.15602v2
[DATE]
2024-07-23 23:00:52+08:00
[CATEGORIES]
cs.LG
PateGail: A Privacy-Preserving Mobility Trajectory Generator with Imitation Learning
[AUTHORS]
Huandong Wang, Changzheng Gao, Yuchen Wu, Depeng Jin, Lina Yao, Yong Li
[ABSTRACT]
Generating human mobility trajectories is of great importance to solve the
lack of large-scale trajectory data in numerous applications, which is caused
by privacy concerns. However, existing mobility trajectory generation methods
still require real-world human trajectories centrally collected as the training
data, where there exists an inescapable risk of privacy leakage. To overcome
this limitation, in this paper, we propose PateGail, a privacy-preserving
imitation learning model to generate mobility trajectories, which utilizes the
powerful generative adversarial imitation learning model to simulate the
decision-making process of humans. Further, in order to protect user privacy,
we train this model collectively based on decentralized mobility data stored in
user devices, where personal discriminators are trained locally to distinguish
and reward the real and generated human trajectories. In the training process,
only the generated trajectories and their rewards obtained based on personal
discriminators are shared between the server and devices, whose privacy is
further preserved by our proposed perturbation mechanisms with theoretical
proof to satisfy differential privacy. Further, to better model the human
decision-making process, we propose a novel aggregation mechanism of the
rewards obtained from personal discriminators. We theoretically prove that
under the reward obtained based on the aggregation mechanism, our proposed
model maximizes the lower bound of the discounted total rewards of users.
Extensive experiments show that the trajectories generated by our model are
able to resemble real-world trajectories in terms of five key statistical
metrics, outperforming state-of-the-art algorithms by over 48.03%. Furthermore,
we demonstrate that the synthetic trajectories are able to efficiently support
practical applications, including mobility prediction and location
recommendation.
[LINK]
http://arxiv.org/abs/2407.16729v1
[DATE]
2024-07-23 22:59:23+08:00
[CATEGORIES]
cs.LG
Unsupervised End-to-End Training with a Self-Defined Target
[AUTHORS]
Dongshu Liu, Jérémie Laydevant, Adrien Pontlevy, Damien Querlioz, Julie Grollier
[ABSTRACT]
Designing algorithms for versatile AI hardware that can learn on the edge
using both labeled and unlabeled data is challenging. Deep end-to-end training
methods incorporating phases of self-supervised and supervised learning are
accurate and adaptable to input data but self-supervised learning requires even
more computational and memory resources than supervised learning, too high for
current embedded hardware. Conversely, unsupervised layer-by-layer training,
such as Hebbian learning, is more compatible with existing hardware but does
not integrate well with supervised learning. To address this, we propose a
method enabling networks or hardware designed for end-to-end supervised
learning to also perform high-performance unsupervised learning by adding two
simple elements to the output layer: Winner-Take-All (WTA) selectivity and
homeostasis regularization. These mechanisms introduce a “self-defined target”
for unlabeled data, allowing purely unsupervised training for both
fully-connected and convolutional layers using backpropagation or equilibrium
propagation on datasets like MNIST (up to 99.2%), Fashion-MNIST (up to 90.3%),
and SVHN (up to 81.5%). We extend this method to semi-supervised learning,
adjusting targets based on data type, achieving 96.6% accuracy with only 600
labeled MNIST samples in a multi-layer perceptron. Our results show that this
approach can effectively enable networks and hardware initially dedicated to
supervised learning to also perform unsupervised learning, adapting to varying
availability of labeled data.
[LINK]
http://arxiv.org/abs/2403.12116v2
[DATE]
2024-07-23 22:49:22+08:00
[CATEGORIES]
cs.LG
Laplacian Segmentation Networks Improve Epistemic Uncertainty Quantification
[AUTHORS]
Kilian Zepf, Selma Wanna, Marco Miani, Juston Moore, Jes Frellsen, Søren Hauberg, Frederik Warburg, Aasa Feragen
[ABSTRACT]
Image segmentation relies heavily on neural networks which are known to be
overconfident, especially when making predictions on out-of-distribution (OOD)
images. This is a common scenario in the medical domain due to variations in
equipment, acquisition sites, or image corruptions. This work addresses the
challenge of OOD detection by proposing Laplacian Segmentation Networks (LSN):
methods which jointly model epistemic (model) and aleatoric (data) uncertainty
for OOD detection. In doing so, we propose the first Laplace approximation of
the weight posterior that scales to large neural networks with skip connections
that have high-dimensional outputs. We demonstrate on three datasets that the
LSN-modeled parameter distributions, in combination with suitable uncertainty
measures, give superior OOD detection.
[COMMENTS]
Published in the Conference Proceedings of the 27th International
Conference on Medical Image Computing and Computer Assisted Intervention
(MICCAI)
[LINK]
http://arxiv.org/abs/2303.13123v2
[DATE]
2024-07-23 22:38:34+08:00
[CATEGORIES]
cs.LG
Decision-Focused Learning with Directional Gradients
[AUTHORS]
Michael Huang, Vishal Gupta
[ABSTRACT]
We propose a novel family of decision-aware surrogate losses, called
Perturbation Gradient (PG) losses, for the predict-then-optimize framework. The
key idea is to connect the expected downstream decision loss with the
directional derivative of a particular plug-in objective, and then approximate
this derivative using zeroth order gradient techniques. Unlike the original
decision loss which is typically piecewise constant and discontinuous, our new
PG losses can be optimized using off-the-shelf gradient-based methods. Most
importantly, unlike existing surrogate losses, the approximation error of our
PG losses vanishes as the number of samples grows. Hence, optimizing our
surrogate loss yields a best-in-class policy asymptotically, even in
misspecified settings. This is the first such result in misspecified settings,
and we provide numerical evidence confirming our PG losses substantively
outperform existing proposals when the underlying model is misspecified.
[LINK]
http://arxiv.org/abs/2402.03256v3
[DATE]
2024-07-23 22:34:15+08:00
[CATEGORIES]
cs.LG
Gradient-Regularized Out-of-Distribution Detection
[AUTHORS]
Sina Sharifi, Taha Entesari, Bardia Safaei, Vishal M. Patel, Mahyar Fazlyab
[ABSTRACT]
One of the challenges for neural networks in real-life applications is the
overconfident errors these models make when the data is not from the original
training distribution.
Addressing this issue is known as Out-of-Distribution (OOD) detection.
Many state-of-the-art OOD methods employ an auxiliary dataset as a surrogate
for OOD data during training to achieve improved performance.
However, these methods fail to fully exploit the local information embedded
in the auxiliary dataset.
In this work, we propose the idea of leveraging the information embedded in
the gradient of the loss function during training to enable the network to not
only learn a desired OOD score for each sample but also to exhibit similar
behavior in a local neighborhood around each sample.
We also develop a novel energy-based sampling method to allow the network to
be exposed to more informative OOD samples during the training phase. This is
especially important when the auxiliary dataset is large. We demonstrate the
effectiveness of our method through extensive experiments on several OOD
benchmarks, improving the existing state-of-the-art FPR95 by 4% on our ImageNet
experiment.
We further provide a theoretical analysis through the lens of certified
robustness and Lipschitz analysis to showcase the theoretical foundation of our
work. Our code is available at https://github.com/o4lc/Greg-OOD.
[COMMENTS]
Accepted to ECCV 2024
[LINK]
http://arxiv.org/abs/2404.12368v3
[DATE]
2024-07-23 22:13:48+08:00
[CATEGORIES]
cs.LG
Articulation Work and Tinkering for Fairness in Machine Learning
[AUTHORS]
Miriam Fahimi, Mayra Russo, Kristen M. Scott, Maria-Esther Vidal, Bettina Berendt, Katharina Kinder-Kurlanda
[ABSTRACT]
The field of fair AI aims to counter biased algorithms through computational
modelling. However, it faces increasing criticism for perpetuating the use of
overly technical and reductionist methods. As a result, novel approaches appear
in the field to address more socially-oriented and interdisciplinary (SOI)
perspectives on fair AI. In this paper, we take this dynamic as the starting
point to study the tension between computer science (CS) and SOI research. By
drawing on STS and CSCW theory, we position fair AI research as a matter of
‘organizational alignment’: what makes research ‘doable’ is the successful
alignment of three levels of work organization (the social world, the
laboratory and the experiment). Based on qualitative interviews with CS
researchers, we analyze the tasks, resources, and actors required for doable
research in the case of fair AI. We find that CS researchers engage with SOI to
some extent, but organizational conditions, articulation work, and ambiguities
of the social world constrain the doability of SOI research. Based on our
findings, we identify and discuss problems for aligning CS and SOI as fair AI
continues to evolve.
[LINK]
http://arxiv.org/abs/2407.16496v1
[DATE]
2024-07-23 22:11:12+08:00
[CATEGORIES]
cs.LG
Learning General Continuous Constraint from Demonstrations via Positive-Unlabeled Learning
[AUTHORS]
Baiyu Peng, Aude Billard
[ABSTRACT]
Planning for a wide range of real-world tasks requires knowing and specifying
all constraints. However, instances exist where these constraints are either
unknown or challenging to specify accurately. A possible solution is to infer
the unknown constraints from expert demonstration. The majority of prior works
limit themselves to learning simple linear constraints, or require strong
knowledge of the true constraint parameterization or environmental model. To
mitigate these problems, this paper presents a positive-unlabeled (PU) learning
approach to infer a continuous, arbitrary and possibly nonlinear, constraint
from demonstration. From a PU learning view, we treat all data in
demonstrations as positive (feasible) data, and learn a (sub)-optimal policy to
generate high-reward-winning but potentially infeasible trajectories, which
serve as unlabeled data containing both feasible and infeasible states. Under
an assumption on data distribution, a feasible-infeasible classifier (i.e.,
constraint model) is learned from the two datasets through a postprocessing PU
learning technique. The entire method employs an iterative framework
alternating between updating the policy, which generates and selects
higher-reward policies, and updating the constraint model. Additionally, a
memory buffer is introduced to record and reuse samples from previous
iterations to prevent forgetting. The effectiveness of the proposed method is
validated in two Mujoco environments, successfully inferring continuous
nonlinear constraints and outperforming a baseline method in terms of
constraint accuracy and policy safety.
[LINK]
http://arxiv.org/abs/2407.16485v1
[DATE]
2024-07-23 22:00:18+08:00
[CATEGORIES]
cs.LG
Topology Reorganized Graph Contrastive Learning with Mitigating Semantic Drift
[AUTHORS]
Jiaqiang Zhang, Songcan Chen
[ABSTRACT]
Graph contrastive learning (GCL) is an effective paradigm for node
representation learning in graphs. The key components hidden behind GCL are
data augmentation and positive-negative pair selection. Typical data
augmentations in GCL, such as uniform deletion of edges, are generally blind
and resort to local perturbation, which is prone to producing insufficiently
diverse views. Additionally, there is a risk of making the augmented data traverse to
other classes. Moreover, most methods always treat all other samples as
negatives. Such a negative pairing naturally results in sampling bias and
likewise may make the learned representation suffer from semantic drift.
Therefore, to increase the diversity of the contrastive view, we propose two
simple and effective global topological augmentations to complement current
GCL. One is to mine the semantic correlation between nodes in the feature
space. The other is to utilize the algebraic properties of the adjacency matrix
to characterize the topology by eigen-decomposition. With the help of both, we
can retain important edges to build a better view. To reduce the risk of
semantic drift, a prototype-based negative pair selection is further designed
which can filter false negative samples. Extensive experiments on various tasks
demonstrate the advantages of the model compared to the state-of-the-art
methods.
[LINK]
http://arxiv.org/abs/2407.16726v1
[DATE]
2024-07-23 21:55:33+08:00
[CATEGORIES]
cs.LG
RanDumb: A Simple Approach that Questions the Efficacy of Continual Representation Learning
[AUTHORS]
Ameya Prabhu, Shiven Sinha, Ponnurangam Kumaraguru, Philip H. S. Torr, Ozan Sener, Puneet K. Dokania
[ABSTRACT]
Continual learning has primarily focused on the issue of catastrophic
forgetting and the associated stability-plasticity tradeoffs. However, little
attention has been paid to the efficacy of continually learned representations,
as representations are learned alongside classifiers throughout the learning
process. Our primary contribution is empirically demonstrating that existing
online continually trained deep networks produce inferior representations
compared to simple pre-defined random transforms. Our approach embeds raw
pixels using a fixed random transform, approximating an RBF-Kernel initialized
before any data is seen. We then train a simple linear classifier on top
without storing any exemplars, processing one sample at a time in an online
continual learning setting. This method, called RanDumb, significantly
outperforms state-of-the-art continually learned representations across all
standard online continual learning benchmarks. Our study reveals the
significant limitations of representation learning, particularly in
low-exemplar and online continual learning scenarios. Extending our
investigation to popular exemplar-free scenarios with pretrained models, we
find that training only a linear classifier on top of pretrained
representations surpasses most continual fine-tuning and prompt-tuning
strategies. Overall, our investigation challenges the prevailing assumptions
about effective representation learning in online continual learning. Our code
is available at https://github.com/drimpossible/RanDumb.
[COMMENTS]
Tech Report
[LINK]
http://arxiv.org/abs/2402.08823v2
[DATE]
2024-07-23 21:52:28+08:00
[CATEGORIES]
cs.LG
First-order ANIL provably learns representations despite overparametrization
[AUTHORS]
Oğuz Kaan Yüksel, Etienne Boursier, Nicolas Flammarion
[ABSTRACT]
Due to its empirical success in few-shot classification and reinforcement
learning, meta-learning has recently received significant interest.
Meta-learning methods leverage data from previous tasks to learn a new task in
a sample-efficient manner. In particular, model-agnostic methods look for
initialization points from which gradient descent quickly adapts to any new
task. Although it has been empirically suggested that such methods perform well
by learning shared representations during pretraining, there is limited
theoretical evidence of such behavior. More importantly, it has not been shown
that these methods still learn a shared structure, despite architectural
misspecifications. In this direction, this work shows, in the limit of an
infinite number of tasks, that first-order ANIL with a linear two-layer network
architecture successfully learns linear shared representations. This result
even holds with overparametrization; having a width larger than the dimension
of the shared representations results in an asymptotically low-rank solution.
The learned solution then yields a good adaptation performance on any new task
after a single gradient step. Overall, this illustrates how well model-agnostic
methods such as first-order ANIL can learn shared representations.
[COMMENTS]
42 pages, 17 figures
[LINK]
http://arxiv.org/abs/2303.01335v3
[DATE]
2024-07-23 21:36:51+08:00
[CATEGORIES]
cs.LG
Enhancing GNNs Performance on Combinatorial Optimization by Recurrent Feature Update
[AUTHORS]
Daria Pugacheva, Andrei Ermakov, Igor Lyskov, Ilya Makarov, Yuriy Zotov
[ABSTRACT]
Combinatorial optimization (CO) problems are crucial in various scientific
and industrial applications. Recently, researchers have proposed using
unsupervised Graph Neural Networks (GNNs) to address NP-hard combinatorial
optimization problems, which can be reformulated as Quadratic Unconstrained
Binary Optimization (QUBO) problems. GNNs have demonstrated high performance
with nearly linear scalability and significantly outperformed classic
heuristic-based algorithms in terms of computational efficiency on large-scale
problems. However, when utilizing standard node features, GNNs tend to get
trapped in suboptimal local minima of the energy landscape, resulting in
low-quality solutions. We introduce a novel algorithm, denoted hereafter as
QRF-GNN, leveraging the power of GNNs to efficiently solve CO problems with
QUBO formulation. It relies on unsupervised learning by minimizing the loss
function derived from QUBO relaxation. The proposed key components of the
architecture include the recurrent use of intermediate GNN predictions,
parallel convolutional layers and combination of static node features as input.
Altogether, it helps to adapt the intermediate solution candidate to minimize
QUBO-based loss function, taking into account not only static graph features,
but also intermediate predictions treated as dynamic, i.e. iteratively changing
recurrent features. The performance of the proposed algorithm has been
evaluated on the canonical benchmark datasets for maximum cut, graph coloring
and maximum independent set problems. Results of experiments show that QRF-GNN
drastically surpasses existing learning-based approaches and is comparable to
the state-of-the-art conventional heuristics, improving their scalability on
large instances.
[LINK]
http://arxiv.org/abs/2407.16468v1
[DATE]
2024-07-23 21:34:35+08:00
[CATEGORIES]
cs.LG
Sobolev neural network with residual weighting as a surrogate in linear and non-linear mechanics
[AUTHORS]
A. O. M. Kilicsoy, J. Liedmann, M. A. Valdebenito, F. -J. Barthold, M. G. R. Faes
[ABSTRACT]
Areas of computational mechanics such as uncertainty quantification and
optimization usually involve repeated evaluation of numerical models that
represent the behavior of engineering systems. In the case of complex nonlinear
systems however, these models tend to be expensive to evaluate, making
surrogate models quite valuable. Artificial neural networks approximate systems
very well by taking advantage of the inherent information of their given training
data. In this context, this paper investigates the improvement of the training
process by including sensitivity information, i.e., partial derivatives
w.r.t. inputs, as outlined by Sobolev training. In computational mechanics,
sensitivities can be applied to neural networks by expanding the training loss
function with additional loss terms, thereby improving training convergence
resulting in lower generalisation error. This improvement is shown in two
examples of linear and non-linear material behavior. More specifically, the
Sobolev designed loss function is expanded with residual weights adjusting the
effect of each loss on the training step. Residual weighting is the given
scaling to the different training data, which in this case are response and
sensitivities. These residual weights are optimized by an adaptive scheme,
whereby varying objective functions are explored, with some showing
improvements in accuracy and precision of the general training convergence.
[COMMENTS]
Submitted to IEEE Access, 40 pages, 18 figures
[LINK]
http://arxiv.org/abs/2407.16466v1
[DATE]
2024-07-23 21:28:07+08:00
[CATEGORIES]
cs.LG
Advances in Land Surface Model-based Forecasting: A comparative study of LSTM, Gradient Boosting, and Feedforward Neural Network Models as prognostic state emulators
[AUTHORS]
Marieke Wesselkamp, Matthew Chantry, Ewan Pinnington, Margarita Choulga, Souhail Boussetta, Maria Kalweit, Joschka Boedecker, Carsten F. Dormann, Florian Pappenberger, Gianpaolo Balsamo
[ABSTRACT]
Most useful weather prediction for the public is near the surface. The
processes that are most relevant for near-surface weather prediction are also
those that are most interactive and exhibit positive feedback or have a key role
in energy partitioning. Land surface models (LSMs) consider these processes
together with surface heterogeneity and forecast water, carbon and energy
fluxes, and coupled with an atmospheric model provide boundary and initial
conditions. This numerical parametrization of atmospheric boundaries being
computationally expensive, statistical surrogate models are increasingly used
to accelerate progress in experimental research. We evaluated the efficiency
of three surrogate models in speeding up experimental research by simulating
land surface processes, which are integral to forecasting water, carbon, and
energy fluxes in coupled atmospheric models. Specifically, we compared the
performance of a Long-Short Term Memory (LSTM) encoder-decoder network, extreme
gradient boosting, and a feed-forward neural network within a physics-informed
multi-objective framework. This framework emulates key states of the ECMWF’s
Integrated Forecasting System (IFS) land surface scheme, ECLand, across
continental and global scales. Our findings indicate that while all models on
average demonstrate high accuracy over the forecast period, the LSTM network
excels in continental long-range predictions when carefully tuned, the XGB
scores consistently high across tasks and the MLP provides an excellent
implementation-time-accuracy trade-off. The runtime reduction achieved by the
emulators in comparison to the full numerical models is significant, offering
a faster, yet reliable alternative for conducting numerical experiments on land
surfaces.
[LINK]
http://arxiv.org/abs/2407.16463v1
[DATE]
2024-07-23 21:26:05+08:00
[CATEGORIES]
cs.LG
Differentially private projection-depth-based medians
[AUTHORS]
Kelly Ramsay, Dylan Spicker
[ABSTRACT]
We develop $(\epsilon,\delta)$-differentially private projection-depth-based
medians using the propose-test-release (PTR) and exponential mechanisms. Under
general conditions on the input parameters and the population measure (e.g., we
do not assume any moment bounds), we quantify the probability that the test in PTR
fails, as well as the cost of privacy via finite sample deviation bounds. Next,
we show that when some observations are contaminated, the private
projection-depth-based median does not break down, provided its input location
and scale estimators do not break down. We demonstrate our main results on the
canonical projection-depth-based median, as well as on projection-depth-based
medians derived from trimmed estimators. In the Gaussian setting, we show that
the resulting deviation bound matches the known lower bound for private
Gaussian mean estimation. In the Cauchy setting, we show that the “outlier
error amplification” effect resulting from the heavy tails outweighs the cost
of privacy. This result is then verified via numerical simulations.
Additionally, we present results on general PTR mechanisms and a uniform
concentration result on the projected spacings of order statistics, which may
be of general interest.
[COMMENTS]
45 pages, 1 figure
[LINK]
http://arxiv.org/abs/2312.07792v3
[DATE]
2024-07-23 21:25:57+08:00
[CATEGORIES]
cs.LG
Solving a Real-World Optimization Problem Using Proximal Policy Optimization with Curriculum Learning and Reward Engineering
[AUTHORS]
Abhijeet Pendyala, Asma Atamna, Tobias Glasmachers
[ABSTRACT]
We present a proximal policy optimization (PPO) agent trained through
curriculum learning (CL) principles and meticulous reward engineering to
optimize a real-world high-throughput waste sorting facility. Our work
addresses the challenge of effectively balancing the competing objectives of
operational safety, volume optimization, and minimizing resource usage. A
vanilla agent trained from scratch on these multiple criteria fails to solve
the problem due to its inherent complexities. This problem is particularly
difficult due to the environment’s extremely delayed rewards with long time
horizons and class (or action) imbalance, with important actions being
infrequent in the optimal policy. This forces the agent to anticipate long-term
action consequences and prioritize rare but rewarding behaviours, creating a
non-trivial reinforcement learning task. Our five-stage CL approach tackles
these challenges by gradually increasing the complexity of the environmental
dynamics during policy transfer while simultaneously refining the reward
mechanism. This iterative and adaptable process enables the agent to learn a
desired optimal policy. Results demonstrate that our approach significantly
improves inference-time safety, achieving near-zero safety violations in
addition to enhancing waste sorting plant efficiency.
[LINK]
http://arxiv.org/abs/2404.02577v2
[DATE]
2024-07-23 21:15:01+08:00
[CATEGORIES]
cs.LG
Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference
[AUTHORS]
Marvin Schmitt, Desi R. Ivanova, Daniel Habermann, Ullrich Köthe, Paul-Christian Bürkner, Stefan T. Radev
[ABSTRACT]
We propose a method to improve the efficiency and accuracy of amortized
Bayesian inference by leveraging universal symmetries in the joint
probabilistic model of parameters and data. In a nutshell, we invert Bayes’
theorem and estimate the marginal likelihood based on approximate
representations of the joint model. Upon perfect approximation, the marginal
likelihood is constant across all parameter values by definition. However,
errors in approximate inference lead to undesirable variance in the marginal
likelihood estimates across different parameter values. We penalize violations
of this symmetry with a \textit{self-consistency loss} which significantly
improves the quality of approximate inference in low data regimes and can be
used to augment the training of popular neural density estimators. We apply our
method to a number of synthetic problems and realistic scientific models,
discovering notable advantages in the context of both neural posterior and
likelihood approximation.
[COMMENTS]
Proceedings of the 41st International Conference on Machine Learning
(ICML), Vienna, Austria. PMLR 235, 2024
[LINK]
http://arxiv.org/abs/2310.04395v4
[DATE]
2024-07-23 20:55:13+08:00
[CATEGORIES]
cs.LG
Stochastic weight matrix dynamics during learning and Dyson Brownian motion
[AUTHORS]
Gert Aarts, Biagio Lucini, Chanju Park
[ABSTRACT]
We demonstrate that the update of weight matrices in learning algorithms can
be described in the framework of Dyson Brownian motion, thereby inheriting many
features of random matrix theory. We relate the level of stochasticity to the
ratio of the learning rate and the mini-batch size, providing more robust
evidence for a previously conjectured scaling relationship. We discuss universal
and non-universal features in the resulting Coulomb gas distribution and
identify the Wigner surmise and Wigner semicircle explicitly in a
teacher-student model and in the (near-)solvable case of the Gaussian
restricted Boltzmann machine.
[COMMENTS]
17 pages, 16 figures
[LINK]
http://arxiv.org/abs/2407.16427v1
[DATE]
2024-07-23 20:25:50+08:00
[CATEGORIES]
cs.LG
Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC
[AUTHORS]
Wu Lin, Felix Dangel, Runa Eschenhagen, Kirill Neklyudov, Agustinus Kristiadi, Richard E. Turner, Alireza Makhzani
[ABSTRACT]
Second-order methods such as KFAC can be useful for neural net training.
However, they are often memory-inefficient since their preconditioning
Kronecker factors are dense, and numerically unstable in low precision as they
require matrix inversion or decomposition. These limitations render such
methods unpopular for modern mixed-precision training. We address them by (i)
formulating an inverse-free KFAC update and (ii) imposing structures in the
Kronecker factors, resulting in structured inverse-free natural gradient
descent (SINGD). On modern neural networks, we show that SINGD is
memory-efficient and numerically robust, in contrast to KFAC, and often
outperforms AdamW even in half precision. Our work closes a gap between first-
and second-order methods in modern low-precision training.
[COMMENTS]
A long version of the ICML 2024 paper, updated the text about a
related work
[LINK]
http://arxiv.org/abs/2312.05705v4
[DATE]
2024-07-23 20:13:44+08:00
[CATEGORIES]
cs.LG
Sample-Efficient Constrained Reinforcement Learning with General Parameterization
[AUTHORS]
Washim Uddin Mondal, Vaneet Aggarwal
[ABSTRACT]
We consider a constrained Markov Decision Problem (CMDP) where the goal of an
agent is to maximize the expected discounted sum of rewards over an infinite
horizon while ensuring that the expected discounted sum of costs exceeds a
certain threshold. Building on the idea of momentum-based acceleration, we
develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm
that guarantees an $\epsilon$ global optimality gap and $\epsilon$ constraint
violation with $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for
general parameterized policies. This improves the state-of-the-art sample
complexity in general parameterized CMDPs by a factor of
$\mathcal{O}(\epsilon^{-2})$ and achieves the theoretical lower bound.
[COMMENTS]
Improved the sample complexity result in comparison to the earlier
version. The new result coincides with the theoretical lower bound
[LINK]
http://arxiv.org/abs/2405.10624v2
[DATE]
2024-07-23 20:04:52+08:00
[CATEGORIES]
cs.LG
Global Counterfactual Directions
[AUTHORS]
Bartlomiej Sobieski, Przemysław Biecek
[ABSTRACT]
Despite increasing progress in the development of methods for generating visual
counterfactual explanations, especially with the recent rise of Denoising
Diffusion Probabilistic Models, previous works consider them as an entirely
local technique. In this work, we take the first step toward globalizing them.
Specifically, we discover that the latent space of Diffusion Autoencoders
encodes the inference process of a given classifier in the form of global
directions. We propose a novel proxy-based approach that discovers two types of
these directions using only a single image in an entirely black-box
manner. Precisely, g-directions allow for flipping the decision of a given
classifier on an entire dataset of images, while h-directions further increase
the diversity of explanations. We refer to them in general as Global
Counterfactual Directions (GCDs). Moreover, we show that GCDs can be naturally
combined with Latent Integrated Gradients resulting in a new black-box
attribution method, while simultaneously enhancing the understanding of
counterfactual explanations. We validate our approach on existing benchmarks
and show that it generalizes to real-world use-cases.
[COMMENTS]
ECCV 2024
[LINK]
http://arxiv.org/abs/2404.12488v2
[DATE]
2024-07-23 19:58:53+08:00
[CATEGORIES]
cs.LG
Modality-Order Matters! A Novel Hierarchical Feature Fusion Method for CoSAm: A Code-Switched Autism Corpus
[AUTHORS]
Mohd Mujtaba Akhtar, Girish, Muskaan Singh, Orchid Chetia Phukan
[ABSTRACT]
Autism Spectrum Disorder (ASD) is a complex neuro-developmental challenge,
presenting a spectrum of difficulties in social interaction, communication, and
the expression of repetitive behaviors in different situations. This increasing
prevalence underscores the importance of ASD as a major public health concern
and the need for comprehensive research initiatives to advance our
understanding of the disorder and its early detection methods. This study
introduces a novel hierarchical feature fusion method aimed at enhancing the
early detection of ASD in children through the analysis of code-switched speech
(English and Hindi). Employing advanced audio processing techniques, the
research integrates acoustic, paralinguistic, and linguistic information using
Transformer Encoders. This innovative fusion strategy is designed to improve
classification robustness and accuracy, crucial for early and precise ASD
identification. The methodology involves collecting a code-switched speech
corpus, CoSAm, from children diagnosed with ASD and a matched control group.
The dataset comprises 61 voice recordings from 30 children diagnosed with ASD
and 31 from neurotypical children, aged between 3 and 13 years, resulting in a
total of 159.75 minutes of voice recordings. The feature analysis focuses on
MFCCs and extensive statistical attributes to capture speech pattern
variability and complexity. The best model performance is achieved using a
hierarchical fusion technique with an accuracy of 98.75% using a combination of
acoustic and linguistic features first, followed by paralinguistic features in
a hierarchical manner.
[LINK]
http://arxiv.org/abs/2407.14328v2
[DATE]
2024-07-23 19:56:22+08:00
[CATEGORIES]
cs.LG
Data-Driven Optimal Feedback Laws via Kernel Mean Embeddings
[AUTHORS]
Petar Bevanda, Nicolas Hoischen, Stefan Sosnowski, Sandra Hirche, Boris Houska
[ABSTRACT]
This paper proposes a fully data-driven approach for optimal control of
nonlinear control-affine systems represented by a stochastic diffusion. The
focus is on the scenario where both the nonlinear dynamics and stage cost
functions are unknown, while only the control penalty function and constraints are
provided. Leveraging the theory of reproducing kernel Hilbert spaces, we
introduce novel kernel mean embeddings (KMEs) to identify the Markov transition
operators associated with controlled diffusion processes. The KME learning
approach seamlessly integrates with modern convex operator-theoretic
Hamilton-Jacobi-Bellman recursions. Thus, unlike traditional dynamic
programming methods, our approach exploits the “kernel trick” to break the
curse of dimensionality. We demonstrate the effectiveness of our method through
numerical examples, highlighting its ability to solve a large class of
nonlinear optimal control problems.
[COMMENTS]
author-submitted electronic preprint version: 16 pages, 3 figures, 4
tables
[LINK]
http://arxiv.org/abs/2407.16407v1
[DATE]
2024-07-23 19:53:03+08:00
[CATEGORIES]
cs.LG
Enhancing Neural Training via a Correlated Dynamics Model
[AUTHORS]
Jonathan Brokman, Roy Betser, Rotem Turjeman, Tom Berkov, Ido Cohen, Guy Gilboa
[COMMENTS]
ICLR 2024 accepted URL: https://openreview.net/forum?id=c9xsaASm9L
[LINK]
http://arxiv.org/abs/2312.13247v2
[DATE]
2024-07-23 19:42:14+08:00
[CATEGORIES]
cs.LG
E(n) Equivariant Topological Neural Networks
[AUTHORS]
Claudio Battiloro, Ege Karaismailoğlu, Mauricio Tec, George Dasoulas, Michelle Audirac, Francesca Dominici
[ABSTRACT]
Graph neural networks excel at modeling pairwise interactions, but they
cannot flexibly accommodate higher-order interactions and features. Topological
deep learning (TDL) has emerged recently as a promising tool for addressing
this issue. TDL enables the principled modeling of arbitrary multi-way,
hierarchical higher-order interactions by operating on combinatorial
topological spaces, such as simplicial or cell complexes, instead of graphs.
However, little is known about how to leverage geometric features such as
positions and velocities for TDL. This paper introduces E(n)-Equivariant
Topological Neural Networks (ETNNs), which are E(n)-equivariant message-passing
networks operating on combinatorial complexes, formal objects unifying graphs,
hypergraphs, simplicial, path, and cell complexes. ETNNs incorporate geometric
node features while respecting rotation and translation equivariance. Moreover,
ETNNs are natively ready for settings with heterogeneous interactions. We
provide a theoretical analysis to show the improved expressiveness of ETNNs
over architectures for geometric graphs. We also show how several E(n)
equivariant variants of TDL models can be directly derived from our framework.
The broad applicability of ETNNs is demonstrated through two tasks of vastly
different nature: i) molecular property prediction on the QM9 benchmark and ii)
land-use regression for hyper-local estimation of air pollution with
multi-resolution irregular geospatial data. The experiment results indicate
that ETNNs are an effective tool for learning from diverse types of richly
structured data, highlighting the benefits of principled geometric inductive
bias.
[COMMENTS]
36 pages, 11 figures, 9 tables
[LINK]
http://arxiv.org/abs/2405.15429v3
[DATE]
2024-07-23 19:37:53+08:00
[CATEGORIES]
cs.LG
On ADMM in Heterogeneous Federated Learning: Personalization, Robustness, and Fairness
[AUTHORS]
Shengkun Zhu, Jinshan Zeng, Sheng Wang, Yuan Sun, Xiaodong Li, Yuan Yao, Zhiyong Peng
[ABSTRACT]
Statistical heterogeneity is a root cause of tension among accuracy,
fairness, and robustness of federated learning (FL), and is key in paving a
path forward. Personalized FL (PFL) is an approach that aims to reduce the
impact of statistical heterogeneity by developing personalized models for
individual users, while also inherently providing benefits in terms of fairness
and robustness. However, existing PFL frameworks focus on improving the
performance of personalized models while neglecting the global model. Moreover,
these frameworks achieve sublinear convergence rates and rely on strong
assumptions. In this paper, we propose FLAME, an optimization framework that
utilizes the alternating direction method of multipliers (ADMM) to train
personalized and global models. We propose a model selection strategy to
improve performance in situations where clients have different types of
heterogeneous data. Our theoretical analysis establishes the global convergence
and two kinds of convergence rates for FLAME under mild assumptions. We
theoretically demonstrate that FLAME is more robust and fair than the
state-of-the-art methods on a class of linear problems. Our experimental
findings show that FLAME outperforms state-of-the-art methods in convergence
and accuracy, and it achieves higher test accuracy under various attacks and
performs more uniformly across clients.
[COMMENTS]
arXiv admin note: text overlap with arXiv:2311.06756
[LINK]
http://arxiv.org/abs/2407.16397v1
[DATE]
2024-07-23 19:35:42+08:00
[CATEGORIES]
cs.LG
Interval Forecasts for Gas Prices in the Face of Structural Breaks – Statistical Models vs. Neural Networks
[AUTHORS]
Stephan Schlüter, Sven Pappert, Martin Neumann
[ABSTRACT]
Reliable gas price forecasts are essential information for gas and energy
traders, for risk managers and also economists. However, ahead of the war in
Ukraine, Europe began to suffer from substantially increased and volatile gas
prices, which culminated in the aftermath of the Nord Stream 1 explosion. This
shock changed both trend and volatility structure of the prices and has
considerable effects on forecasting models. In this study we investigate
whether modern machine learning methods such as neural networks are more
resilient against such changes than statistical models such as autoregressive
moving average (ARMA) models with conditional heteroskedasticity, or
copula-based time series models. Thereby the focus lies on interval forecasting
and applying respective evaluation measures. As data, the Front Month prices
from the Dutch Title Transfer Facility, currently the predominant European
exchange, are used. We see that, during the shock period, most models
underestimate the variance while overestimating the variance in the after-shock
period. Furthermore, we recognize that, during the shock, the simpler models,
i.e. an ARMA model with conditional heteroskedasticity and the multilayer
perceptron (a neural network), perform best with regard to prediction interval
coverage. Interestingly, the widely used long short-term memory (LSTM) neural network is
outperformed by its competitors.
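
Prediction interval coverage, the evaluation measure named above, can be illustrated with a short sketch. The function name and the toy numbers are assumptions for illustration and are not taken from the paper.

```python
def interval_coverage(observations, lower, upper):
    """Fraction of observations that fall inside their prediction interval."""
    hits = sum(
        1 for y, lo, hi in zip(observations, lower, upper) if lo <= y <= hi
    )
    return hits / len(observations)


# Toy example: one of four observed prices lies outside its interval.
coverage = interval_coverage(
    observations=[1.0, 2.0, 3.0, 10.0],
    lower=[0.0, 0.0, 0.0, 0.0],
    upper=[5.0, 5.0, 5.0, 5.0],
)
```

Empirical coverage below the nominal level (for example 0.75 against a 95% target) suggests intervals that are too narrow, which is consistent with a model underestimating the variance.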
[LINK]
http://arxiv.org/abs/2407.16723v1
[DATE]
2024-07-23 19:34:13+08:00
[CATEGORIES]
cs.LG
Anwendung von Causal-Discovery-Algorithmen zur Root-Cause-Analyse in der Fahrzeugmontage
[AUTHORS]
Lucas Possner, Lukas Bahr, Leonard Roehl, Christoph Wehner, Sophie Groeger
[ABSTRACT]
Root Cause Analysis (RCA) is a quality management method that aims to
systematically investigate and identify the cause-and-effect relationships of
problems and their underlying causes. Traditional methods are based on the
analysis of problems by subject matter experts. In modern production processes,
large amounts of data are collected. For this reason, computer-aided and
data-driven methods are increasingly used for RCA. Among these methods
are Causal Discovery Algorithms (CDA). This publication demonstrates the
application of CDA on data from the assembly of a leading automotive
manufacturer. The algorithms used learn the causal structure between the
characteristics of the manufactured vehicles, the ergonomics and the temporal
scope of the involved assembly processes, and quality-relevant product features
based on representative data. This publication compares various CDAs in terms
of their suitability in the context of quality management. For this purpose,
the causal structures learned by the algorithms as well as their runtime are
compared. This publication provides a contribution to quality management and
demonstrates how CDAs can be used for RCA in assembly processes.
[COMMENTS]
in German language
[LINK]
http://arxiv.org/abs/2407.16388v1
[DATE]
2024-07-23 19:22:33+08:00
[CATEGORIES]
cs.LG
Inferring turbulent velocity and temperature fields and their statistics from Lagrangian velocity measurements using physics-informed Kolmogorov-Arnold Networks
[AUTHORS]
Juan Diego Toscano, Theo Käufer, Zhibo Wang, Martin Maxey, Christian Cierpka, George Em Karniadakis
[ABSTRACT]
We propose the Artificial Intelligence Velocimetry-Thermometry (AIVT) method
to infer hidden temperature fields from experimental turbulent velocity data.
This physics-informed machine learning method enables us to infer continuous
temperature fields using only sparse velocity data, hence eliminating the need
for direct temperature measurements. Specifically, AIVT is based on
physics-informed Kolmogorov-Arnold Networks (not neural networks) and is
trained by optimizing a combined loss function that minimizes the residuals of
the velocity data, boundary conditions, and the governing equations. We apply
AIVT to a unique set of experimental volumetric and simultaneous temperature
and velocity data of Rayleigh-Bénard convection (RBC) that we acquired by
combining Particle Image Thermometry and Lagrangian Particle Tracking. This
allows us to compare AIVT predictions and measurements directly. We demonstrate
that we can reconstruct and infer continuous and instantaneous velocity and
temperature fields from sparse experimental data at a fidelity comparable to
direct numerical simulations (DNS) of turbulence. This, in turn, enables us to
compute important quantities for quantifying turbulence, such as fluctuations,
viscous and thermal dissipation, and QR distribution. This paradigm shift in
processing experimental data using AIVT to infer turbulent fields at DNS-level
fidelity is a promising avenue in breaking the current deadlock of quantitative
understanding of turbulence at high Reynolds numbers, where DNS is
computationally infeasible.
[COMMENTS]
turbulence, data assimilation, physics-informed machine learning,
experimental methods, Kolmogorov-Arnold networks. 50 pages, 8 figures
[LINK]
http://arxiv.org/abs/2407.15727v2
[DATE]
2024-07-23 19:19:18+08:00
[CATEGORIES]
cs.LG
Probing Perfection: The Relentless Art of Meddling for Pulmonary Airway Segmentation from HRCT via a Human-AI Collaboration Based Active Learning Method
[AUTHORS]
Shiyi Wang, Yang Nan, Sheng Zhang, Federico Felder, Xiaodan Xing, Yingying Fang, Javier Del Ser, Simon L F Walsh, Guang Yang
[ABSTRACT]
In pulmonary tracheal segmentation, as in medical segmentation generally, the
scarcity of annotated data is a prevalent issue. Additionally, Deep Learning (DL)
methods face challenges: the opacity of ‘black box’ models and the need for
performance enhancement. Our Human-Computer Interaction (HCI) based models
(RS-UNet, LC-UNet, UUNet, and WD-UNet) address these challenges by combining
diverse query strategies with various DL models. We train four HCI models and
repeat these steps: (1) Query Strategy: The HCI models select samples that
provide the most additional representative information when labeled in each
iteration and identify unlabeled samples with the greatest predictive disparity
using Wasserstein Distance, Least Confidence, Entropy Sampling, and Random
Sampling. (2) Central line correction: Selected samples are used for expert
correction of system-generated tracheal central lines in each training round.
(3) Update training dataset: Experts update the training dataset after each DL
model’s training epoch, enhancing the trustworthiness and performance of the
models. (4) Model training: The HCI model is trained using the updated dataset
and an enhanced UNet version. Experimental results confirm the effectiveness of
these HCI-based approaches, showing that WD-UNet, LC-UNet, UUNet, and RS-UNet
achieve comparable or superior performance to state-of-the-art DL models.
Notably, WD-UNet achieves this with only 15%-35% of the training data, reducing
physician annotation time by 65%-85%.
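
Two of the query strategies listed above, Least Confidence and Entropy Sampling, can be sketched as scoring functions over predicted class probabilities. This is a minimal illustration, assuming one probability vector per unlabeled sample; a segmentation model would first have to aggregate per-pixel predictions, and the function names are hypothetical.

```python
import math


def least_confidence(probs):
    """Uncertainty as one minus the probability of the most likely class."""
    return 1.0 - max(probs)


def entropy(probs):
    """Shannon entropy of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(pool_probs, budget, score=entropy):
    """Return indices of the `budget` most uncertain samples in the pool."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: score(pool_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


pool = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
to_label = select_queries(pool, budget=2)
```

The selected indices would then be passed to the expert for correction before the next training round.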
[LINK]
http://arxiv.org/abs/2407.03542v2
[DATE]
2024-07-23 19:16:22+08:00
[CATEGORIES]
cs.LG
Reinforcement Learning-based Adaptive Mitigation of Uncorrected DRAM Errors in the Field
[AUTHORS]
Isaac Boixaderas, Sergi Moré, Javier Bartolome, David Vicente, Petar Radojković, Paul M. Carpenter, Eduard Ayguadé
[ABSTRACT]
Scaling to larger systems, with current levels of reliability, requires
cost-effective methods to mitigate hardware failures. One of the main causes of
hardware failure is an uncorrected error in memory, which terminates the
current job and wastes all computation since the last checkpoint. This paper
presents the first adaptive method for triggering uncorrected error mitigation.
It uses a prediction approach that considers the likelihood of an uncorrected
error and its current potential cost. The method is based on reinforcement
learning, and the only user-defined parameters are the mitigation cost and
whether the job can be restarted from a mitigation point. We evaluate our
method using classical machine learning metrics together with a cost-benefit
analysis, which compares the cost of mitigation actions with the benefits from
mitigating some of the errors. On two years of production logs from the
MareNostrum supercomputer, our method reduces lost compute time by 54% compared
with no mitigation and is just 6% below the optimal Oracle method. All source
code is open source.
[COMMENTS]
Published in HPDC’24
[LINK]
http://arxiv.org/abs/2407.16377v1
[DATE]
2024-07-23 19:04:33+08:00
[CATEGORIES]
cs.LG
Interpolation-Split: a data-centric deep learning approach with big interpolated data to boost airway segmentation performance
[AUTHORS]
Wing Keung Cheung, Ashkan Pakzad, Nesrin Mogulkoc, Sarah Needleman, Bojidar Rangelov, Eyjolfur Gudmundsson, An Zhao, Mariam Abbas, Davina McLaverty, Dimitrios Asimakopoulos, Robert Chapman, Recep Savas, Sam M Janes, Yipeng Hu, Daniel C. Alexander, John R Hurst, Joseph Jacob
[ABSTRACT]
The morphology and distribution of airway tree abnormalities enable
diagnosis and disease characterisation across a variety of chronic respiratory
conditions. In this regard, airway segmentation plays a critical role in the
production of the outline of the entire airway tree to enable estimation of
disease extent and severity. In this study, we propose a data-centric deep
learning technique to segment the airway tree. The proposed technique utilises
interpolation and image split to improve data usefulness and quality. Then, an
ensemble learning strategy is implemented to aggregate the segmented airway
trees at different scales. In terms of segmentation performance (dice
similarity coefficient), our method outperforms the baseline model by 2.5% on
average when a combined loss is used. Further, our proposed technique has low
GPU usage and high flexibility, enabling it to be deployed on any 2D deep
learning model.
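
For reference, the Dice similarity coefficient used as the performance measure can be computed as follows. This is a minimal sketch on flattened binary masks; the function name and the convention for two empty masks are assumptions.

```python
def dice_coefficient(pred, target):
    """Dice similarity: 2 * |A and B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / size_sum


score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

In this toy case one voxel overlaps and each mask has two foreground voxels, so the score is 0.5.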
[LINK]
http://arxiv.org/abs/2308.00008v2
[DATE]
2024-07-23 19:02:52+08:00
[CATEGORIES]
cs.LG
Bayesian Autoregressive Online Change-Point Detection with Time-Varying Parameters
[AUTHORS]
Ioanna-Yvonni Tsaknaki, Fabrizio Lillo, Piero Mazzarisi
[ABSTRACT]
Change points in real-world systems mark significant regime shifts in system
dynamics, possibly triggered by exogenous or endogenous factors. These points
define regimes for the time evolution of the system and are crucial for
understanding transitions in financial, economic, social, environmental, and
technological contexts. Building upon the Bayesian approach introduced in
\cite{c:07}, we devise a new method for online change point detection in the
mean of a univariate time series, which is well suited for real-time
applications and is able to handle the general temporal patterns displayed by
data in many empirical contexts. We first describe time series as an
autoregressive process of an arbitrary order. Second, the variance and
correlation of the data are allowed to vary within each regime driven by a
scoring rule that updates the value of the parameters for a better fit of the
observations. Finally, a change point is detected in a probabilistic framework
via the posterior distribution of the current regime length. By modeling
temporal dependencies and time-varying parameters, the proposed approach
enhances both the estimate accuracy and the forecasting power. Empirical
validations using various datasets demonstrate the method’s effectiveness in
capturing memory and dynamic patterns, offering deeper insights into the
non-stationary dynamics of real-world systems.
[COMMENTS]
38 pages, 9 figures, 3 tables
[LINK]
http://arxiv.org/abs/2407.16376v1
[DATE]
2024-07-23 18:57:13+08:00
[CATEGORIES]
cs.LG
Smooth Tchebycheff Scalarization for Multi-Objective Optimization
[AUTHORS]
Xi Lin, Xiaoyuan Zhang, Zhiyuan Yang, Fei Liu, Zhenkun Wang, Qingfu Zhang
[ABSTRACT]
Multi-objective optimization problems can be found in many real-world
applications, where the objectives often conflict with each other and cannot be
optimized by a single solution. In the past few decades, numerous methods have
been proposed to find Pareto solutions that represent optimal trade-offs among
the objectives for a given problem. However, these existing methods could have
high computational complexity or may not have good theoretical properties for
solving a general differentiable multi-objective optimization problem. In this
work, by leveraging the smooth optimization technique, we propose a lightweight
and efficient smooth Tchebycheff scalarization approach for gradient-based
multi-objective optimization. It has good theoretical properties for finding
all Pareto solutions with valid trade-off preferences, while enjoying
significantly lower computational complexity compared to other methods.
Experimental results on various real-world application problems fully
demonstrate the effectiveness of our proposed method.
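
The idea of smoothing a Tchebycheff scalarization can be sketched with the usual log-sum-exp relaxation of the max operator. The code below is an illustration under that assumption; the exact formulation in the paper may differ, and the names `weights`, `ideal`, and `mu` are chosen here for readability.

```python
import math


def tchebycheff(values, weights, ideal):
    """Classical (non-smooth) Tchebycheff scalarization."""
    return max(w * (f - z) for f, w, z in zip(values, weights, ideal))


def smooth_tchebycheff(values, weights, ideal, mu=0.1):
    """Log-sum-exp smoothing of the max, with smoothing parameter mu > 0."""
    scaled = [w * (f - z) / mu for f, w, z in zip(values, weights, ideal)]
    shift = max(scaled)  # subtract the max for numerical stability
    return mu * (shift + math.log(sum(math.exp(s - shift) for s in scaled)))


objectives = [1.0, 2.0]
hard = tchebycheff(objectives, [0.5, 0.5], [0.0, 0.0])
soft = smooth_tchebycheff(objectives, [0.5, 0.5], [0.0, 0.0], mu=0.1)
```

The smoothed value is differentiable in every objective and lies between the hard maximum and the hard maximum plus `mu * log(m)` for `m` objectives, so it approaches the classical scalarization as `mu` shrinks.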
[COMMENTS]
Accepted by the 41st International Conference on Machine Learning
(ICML 2024)
[LINK]
http://arxiv.org/abs/2402.19078v3
[DATE]
2024-07-23 18:46:28+08:00
[CATEGORIES]
cs.LG
CheMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules
[AUTHORS]
Vivin Vinod, Peter Zaspel
[ABSTRACT]
Progress in both Machine Learning (ML) and Quantum Chemistry (QC) methods
has resulted in high-accuracy ML models for QC properties. Datasets such as
MD17 and WS22 have been used to benchmark these models at some level of QC
method, or fidelity, which refers to the accuracy of the chosen QC method.
Multifidelity ML (MFML) methods, where models are trained on data from more
than one fidelity, have been shown to be more effective than single-fidelity methods.
Much research is progressing in this direction for diverse applications ranging
from energy band gaps to excitation energies. One hurdle for effective research
here is the lack of a diverse multifidelity dataset for benchmarking. We
provide the quantum Chemistry MultiFidelity (CheMFi) dataset consisting of five
fidelities calculated with the TD-DFT formalism. The fidelities differ in their
basis set choice: STO-3G, 3-21G, 6-31G, def2-SVP, and def2-TZVP. CheMFi offers
to the community a variety of QC properties such as vertical excitation
properties and molecular dipole moments, further including QC computation times
allowing for a time benefit benchmark of multifidelity models for ML-QC.
[LINK]
http://arxiv.org/abs/2406.14149v2
[DATE]
2024-07-23 18:34:19+08:00
[CATEGORIES]
cs.LG
Navigating Uncertainty in Medical Image Segmentation
[AUTHORS]
Kilian Zepf, Jes Frellsen, Aasa Feragen
[ABSTRACT]
We address the selection and evaluation of uncertain segmentation methods in
medical imaging and present two case studies: prostate segmentation,
illustrating that for minimal annotator variation simple deterministic models
can suffice, and lung lesion segmentation, highlighting the limitations of the
Generalized Energy Distance (GED) in model selection. Our findings lead to
guidelines for accurately choosing and developing uncertain segmentation
models that integrate aleatoric and epistemic components. These guidelines are
designed to aid researchers and practitioners in better developing, selecting,
and evaluating uncertain segmentation methods, thereby facilitating enhanced
adoption and effective application of segmentation uncertainty in practice.
[COMMENTS]
Published in the conference proceedings of the 21st IEEE
International Symposium on Biomedical Imaging (ISBI 2024)
[LINK]
http://arxiv.org/abs/2407.16367v1
[DATE]
2024-07-23 18:21:18+08:00
[CATEGORIES]
cs.LG
Online Learning with Sublinear Best-Action Queries
[AUTHORS]
Matteo Russo, Andrea Celli, Riccardo Colini Baldeschi, Federico Fusco, Daniel Haimovich, Dima Karamshuk, Stefano Leonardi, Niek Tax
[ABSTRACT]
In online learning, a decision maker repeatedly selects one of a set of
actions, with the goal of minimizing the overall loss incurred. Following the
recent line of research on algorithms endowed with additional predictive
features, we revisit this problem by allowing the decision maker to acquire
additional information on the actions to be selected. In particular, we study
the power of \emph{best-action queries}, which reveal beforehand the identity
of the best action at a given time step. In practice, predictive features may
be expensive, so we allow the decision maker to issue at most $k$ such queries.
We establish tight bounds on the performance any algorithm can achieve when
given access to $k$ best-action queries for different types of feedback models.
In particular, we prove that in the full feedback model, $k$ queries are enough
to achieve an optimal regret of $\Theta\left(\min\left\{\sqrt T, \frac
Tk\right\}\right)$. This finding highlights the significant multiplicative
advantage in the regret rate achievable with even a modest (sublinear) number
$k \in \Omega(\sqrt{T})$ of queries. Additionally, we study the challenging
setting in which the only available feedback is obtained during the time steps
corresponding to the $k$ best-action queries. There, we provide a tight regret
rate of $\Theta\left(\min\left\{\frac{T}{\sqrt
k},\frac{T^2}{k^2}\right\}\right)$, which improves over the standard
$\Theta\left(\frac{T}{\sqrt k}\right)$ regret rate for label efficient
prediction for $k \in \Omega(T^{2/3})$.
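
The two regret rates stated above can be compared numerically to see where the queries start to help. The sketch below ignores constant factors (the bounds are asymptotic), and the function names are illustrative.

```python
import math


def full_feedback_rate(T, k):
    """Order of the regret with full feedback: min(sqrt(T), T / k)."""
    return min(math.sqrt(T), T / k)


def query_only_feedback_rate(T, k):
    """Order of the regret when feedback arrives only on the k query steps:
    min(T / sqrt(k), T**2 / k**2)."""
    return min(T / math.sqrt(k), T ** 2 / k ** 2)


# Full feedback, T = 10_000: queries help once k exceeds sqrt(T) = 100.
few_queries = full_feedback_rate(10_000, 50)
many_queries = full_feedback_rate(10_000, 1_000)

# Query-only feedback, T = 1_000: the T^2 / k^2 term wins once k exceeds T^(2/3) = 100.
below_threshold = query_only_feedback_rate(1_000, 25)
above_threshold = query_only_feedback_rate(1_000, 400)
```

The crossover in the second setting follows from $T^2/k^2 < T/\sqrt{k}$ exactly when $k > T^{2/3}$, which matches the threshold in the abstract.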
[LINK]
http://arxiv.org/abs/2407.16355v1
[DATE]
2024-07-23 17:59:43+08:00
[CATEGORIES]
cs.LG
Strike a Balance in Continual Panoptic Segmentation
[AUTHORS]
Jinpeng Chen, Runmin Cong, Yuxuan Luo, Horace Ho Shing Ip, Sam Kwong
[ABSTRACT]
This study explores the emerging area of continual panoptic segmentation,
highlighting three key balances. First, we introduce past-class backtrace
distillation to balance the stability of existing knowledge with the
adaptability to new information. This technique retraces the features
associated with past classes based on the final label assignment results,
performing knowledge distillation targeting these specific features from the
previous model while allowing other features to flexibly adapt to new
information. Additionally, we introduce a class-proportional memory strategy,
which aligns the class distribution in the replay sample set with that of the
historical training data. This strategy maintains a balanced class
representation during replay, enhancing the utility of the limited-capacity
replay sample set in recalling prior classes. Moreover, recognizing that replay
samples are annotated only for the classes of their original step, we devise
balanced anti-misguidance losses, which combat the impact of incomplete
annotations without incurring classification bias. Building upon these
innovations, we present a new method named Balanced Continual Panoptic
Segmentation (BalConpas). Our evaluation on the challenging ADE20K dataset
demonstrates its superior performance compared to existing state-of-the-art
methods. The official code is available at
https://github.com/jinpeng0528/BalConpas.
[LINK]
http://arxiv.org/abs/2407.16354v1
[DATE]
2024-07-23 17:58:20+08:00
[CATEGORIES]
cs.LG
Data-driven Multistage Distributionally Robust Linear Optimization with Nested Distance
[AUTHORS]
Rui Gao, Rohit Arora, Yizhe Huang
[ABSTRACT]
We study multistage distributionally robust linear optimization, where the
uncertainty set is defined as a ball of distributions centered at a scenario
tree using the nested distance. The resulting minimax problem is notoriously
difficult to solve due to its inherent non-convexity. In this paper, we
demonstrate that, under mild conditions, the robust risk evaluation of a given
policy can be expressed in an equivalent recursive form. Furthermore, assuming
stagewise independence, we derive equivalent dynamic programming reformulations
to find an optimal robust policy that is time-consistent and well-defined on
unseen sample paths. Our reformulations reconcile two modeling frameworks: the
multistage-static formulation (with nested distance) and the multistage-dynamic
formulation (with one-period Wasserstein distance). Moreover, we identify
tractable cases when the value functions can be computed efficiently using
convex optimization techniques.
[COMMENTS]
First appeared online at https://optimization-online.org/?p=20641 on
Oct 15, 2022
[LINK]
http://arxiv.org/abs/2407.16346v1
[DATE]
2024-07-23 17:49:22+08:00
[CATEGORIES]
cs.LG
STATE: A Robust ATE Estimator of Heavy-Tailed Metrics for Variance Reduction in Online Controlled Experiments
[AUTHORS]
Hao Zhou, Kun Sun, Shaoming Li, Yangfeng Fan, Guibin Jiang, Jiaqi Zheng, Tao Li
[ABSTRACT]
Online controlled experiments play a crucial role in enabling data-driven
decisions across a wide range of companies. Variance reduction is an effective
technique to improve the sensitivity of experiments, achieving higher
statistical power while using fewer samples and shorter experimental periods.
However, typical variance reduction methods (e.g., regression-adjusted
estimators) are built upon the intuitive assumption of Gaussian distributions
and cannot properly characterize the real business metrics with heavy-tailed
distributions. Furthermore, outliers diminish the correlation between
pre-experiment covariates and outcome metrics, greatly limiting the
effectiveness of variance reduction.
In this paper, we develop a novel framework that integrates the Student’s
t-distribution with machine learning tools to fit heavy-tailed metrics and
construct a robust average treatment effect estimator in online controlled
experiments, which we call STATE. By adopting a variational EM method to
optimize the log-likelihood function, we can infer a robust solution that greatly
eliminates the negative impact of outliers and achieves significant variance
reduction. Moreover, we extend the STATE method from count metrics to ratio
metrics by utilizing linear transformation that preserves unbiased estimation,
whose variance reduction is more complex but less investigated in existing
works. Finally, both simulations on synthetic data and long-term empirical
results on the Meituan experiment platform demonstrate the effectiveness of our
method. Compared with the state-of-the-art estimators (CUPAC/MLRATE), STATE
achieves over 50% variance reduction, indicating it can reach the same
statistical power with only half of the observations, or half the experimental
duration.
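
The link between variance reduction and sample size can be checked with the textbook sample-size formula for a two-sided, two-sample z-test, in which the required sample size is proportional to the metric variance. This is a generic illustration, not the STATE estimator; the function name and default significance and power levels are assumptions.

```python
from statistics import NormalDist


def required_samples_per_arm(variance, effect, alpha=0.05, power=0.8):
    """Samples per arm needed to detect `effect` with a two-sample z-test.

    n = 2 * (z_{1 - alpha/2} + z_{power})**2 * variance / effect**2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 * variance / effect ** 2


baseline = required_samples_per_arm(variance=1.0, effect=0.1)
reduced = required_samples_per_arm(variance=0.5, effect=0.1)
ratio = reduced / baseline
```

Because the formula is linear in the variance, halving the variance halves the required number of observations at the same power.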
[COMMENTS]
Accepted by KDD 2024
[LINK]
http://arxiv.org/abs/2407.16337v1
[DATE]
2024-07-23 17:35:59+08:00
[CATEGORIES]
cs.LG
Score matching for bridges without time-reversals
[AUTHORS]
Elizabeth L. Baker, Moritz Schauer, Stefan Sommer
[ABSTRACT]
We propose a new algorithm for learning a bridged diffusion process using
score-matching methods. Our method relies on reversing the dynamics of the
forward process and using this to learn a score function, which, via Doob’s
$h$-transform, gives us a bridged diffusion process; that is, a process
conditioned on an endpoint. In contrast to prior methods, ours learns the score
term $\nabla_x \log p(t, x; T, y)$, for given $t, y$ directly, completely
avoiding the need for first learning a time reversal. We compare the
performance of our algorithm with existing methods and see that it outperforms
using the (learned) time-reversals to learn the score term. The code can be
found at https://github.com/libbylbaker/forward_bridge.
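
For readers unfamiliar with Doob's $h$-transform, the standard way the score term enters the bridged process can be written out as below. The drift $b$, diffusion coefficient $\sigma$, and Brownian motion $W_t$ are notation assumed here for illustration; the abstract itself only names the score term.

```latex
% Unconditioned diffusion (assumed notation):
%   dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\,dW_t
% Bridge conditioned on X_T = y, using h(t, x) = p(t, x; T, y):
\[
  dX_t = \Big[\, b(t, X_t)
         + \sigma\sigma^{\top}(t, X_t)\, \nabla_x \log p(t, X_t; T, y) \Big]\,dt
         + \sigma(t, X_t)\,dW_t ,
  \qquad X_T = y .
\]
```

Once an approximation of $\nabla_x \log p(t, x; T, y)$ is available, the bridge can be simulated by adding this extra drift to the original dynamics.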
[LINK]
http://arxiv.org/abs/2407.15455v2
[DATE]
2024-07-23 17:25:16+08:00
[CATEGORIES]
cs.LG
On The Expressive Power of Knowledge Graph Embedding Methods
[AUTHORS]
Jiexing Gao, Dmitry Rodin, Vasily Motolygin, Denis Zaytsev
[ABSTRACT]
Knowledge Graph Embedding (KGE) is a popular approach, which aims to
represent entities and relations of a knowledge graph in latent spaces. Their
representations are known as embeddings. To measure the plausibility of
triplets, score functions are defined over embedding spaces. Despite wide
dissemination of KGE in various tasks, KGE methods have limitations in
reasoning abilities. In this paper we propose a mathematical framework to
compare reasoning abilities of KGE methods. We show that STransE has a higher
capability than TransComplEx, and then present a new method, STransCoRe, which
improves STransE by combining it with the TransCoRe insights, which can
reduce the STransE space complexity.
[COMMENTS]
11 pages, 1 figure
[LINK]
http://arxiv.org/abs/2407.16326v1
[DATE]
2024-07-23 17:21:38+08:00
[CATEGORIES]
cs.LG
Deep Learning for Pancreas Segmentation: a Systematic Review
[AUTHORS]
Andrea Moglia, Matteo Cavicchioli, Luca Mainardi, Pietro Cerveri
[ABSTRACT]
Pancreas segmentation has been traditionally challenging due to its small
size in computed tomography abdominal volumes, high variability of shape and
positions among patients, and blurred boundaries due to low contrast between
the pancreas and surrounding organs. Many deep learning models for pancreas
segmentation have been proposed in the past few years. We present a thorough
systematic review based on the Preferred Reporting Items for Systematic Reviews
and Meta-analyses (PRISMA) statement. The literature search was conducted on
PubMed, Web of Science, Scopus, and IEEE Xplore on original studies published
in peer-reviewed journals from 2013 to 2023. Overall, 130 studies were
retrieved. We initially provided an overview of the technical background of the
most common network architectures and publicly available datasets. Then, the
analysis of the studies combining visual presentation in tabular form and text
description was reported. The tables grouped the studies specifying the
application, dataset size, design (model architecture, learning strategy, and
loss function), results, and main contributions. We first analyzed the studies
focusing on parenchyma segmentation using coarse-to-fine approaches,
multi-organ segmentation, semi-supervised learning, and unsupervised learning,
followed by those studies on generalization to other datasets and those
concerning the design of new loss functions. Then, we analyzed the studies on
segmentation of tumors, cysts, and inflammation reporting multi-stage methods,
semi-supervised learning, generalization to other datasets, and design of new
loss functions. Finally, we provided a critical discussion on the subject based
on the published evidence underlining current issues that need to be addressed
before clinical translation.
[LINK]
http://arxiv.org/abs/2407.16313v1
[DATE]
2024-07-23 17:05:23+08:00
[CATEGORIES]
cs.LG
Constrained Stein Variational Trajectory Optimization
[AUTHORS]
Thomas Power, Dmitry Berenson
[ABSTRACT]
We present Constrained Stein Variational Trajectory Optimization (CSVTO), an
algorithm for performing trajectory optimization with constraints on a set of
trajectories in parallel. We frame constrained trajectory optimization as a
novel form of constrained functional minimization over trajectory
distributions, which avoids treating the constraints as a penalty in the
objective and allows us to generate diverse sets of constraint-satisfying
trajectories. Our method uses Stein Variational Gradient Descent (SVGD) to find
a set of particles that approximates a distribution over low-cost trajectories
while obeying constraints. CSVTO is applicable to problems with differentiable
equality and inequality constraints and includes a novel particle re-sampling
step to escape local minima. By explicitly generating diverse sets of
trajectories, CSVTO is better able to avoid poor local minima and is more
robust to initialization. We demonstrate that CSVTO outperforms baselines in
challenging highly-constrained tasks, such as a 7DoF wrench manipulation task,
where CSVTO outperforms all baselines both in success and constraint
satisfaction.
[COMMENTS]
18 pages, 10 figures, 3 tables
[LINK]
http://arxiv.org/abs/2308.12110v3
[DATE]
2024-07-23 16:52:31+08:00
[CATEGORIES]
cs.LG
A new Linear Time Bi-level $\ell_{1,\infty}$ projection ; Application to the sparsification of auto-encoders neural networks
[AUTHORS]
Michel Barlaud, Guillaume Perez, Jean-Paul Marmorat
[ABSTRACT]
The $\ell_{1,\infty}$ norm is an efficient-structured projection, but the
complexity of the best algorithm is, unfortunately, $\mathcal{O}\big(n m \log(n
m)\big)$ for a matrix $n\times m$. In this paper, we propose a new bi-level
projection method, for which we show that the time complexity for the
$\ell_{1,\infty}$ norm is only $\mathcal{O}\big(n m \big)$ for a matrix
$n\times m$. Moreover, we provide a new $\ell_{1,\infty}$ identity with
mathematical proof and experimental validation. Experiments show that our
bi-level $\ell_{1,\infty}$ projection is $2.5$ times faster than the current
fastest algorithm and provides the best sparsity while keeping the same
accuracy in classification applications.
[COMMENTS]
arXiv admin note: substantial text overlap with arXiv:2405.02086
[LINK]
http://arxiv.org/abs/2407.16293v1
[DATE]
2024-07-23 16:51:29+08:00
[CATEGORIES]
cs.LG
Automated Security Response through Online Learning with Adaptive Conjectures
[AUTHORS]
Kim Hammar, Tao Li, Rolf Stadler, Quanyan Zhu
[ABSTRACT]
We study automated security response for an IT infrastructure and formulate
the interaction between an attacker and a defender as a partially observed,
non-stationary game. We relax the standard assumption that the game model is
correctly specified and consider that each player has a probabilistic
conjecture about the model, which may be misspecified in the sense that the
true model has probability 0. This formulation allows us to capture uncertainty
about the infrastructure and the intents of the players. To learn effective
game strategies online, we design a novel method where a player iteratively
adapts its conjecture using Bayesian learning and updates its strategy through
rollout. We prove that the conjectures converge to best fits, and we provide a
bound on the performance improvement that rollout enables with a conjectured
model. To characterize the steady state of the game, we propose a variant of
the Berk-Nash equilibrium. We present our method through an advanced persistent
threat use case. Testbed evaluations show that our method produces effective
security strategies that adapt to a changing environment. We also find that our
method enables faster convergence than current reinforcement learning
techniques.
[COMMENTS]
This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible
[LINK]
http://arxiv.org/abs/2402.12499v2
[DATE]
2024-07-23 16:50:09+08:00
[CATEGORIES]
cs.LG
Federated Learning for Face Recognition via Intra-subject Self-supervised Learning
[AUTHORS]
Hansol Kim, Hoyeol Choi, Youngjun Kwak
[ABSTRACT]
Federated Learning (FL) for face recognition aggregates locally optimized
models from individual clients to construct a generalized face recognition
model. However, previous studies present two major challenges: insufficient
incorporation of self-supervised learning and the necessity for clients to
accommodate multiple subjects. To tackle these limitations, we propose FedFS
(Federated Learning for personalized Face recognition via intra-subject
Self-supervised learning framework), a novel federated learning architecture
tailored to train personalized face recognition models without imposing
subjects. Our proposed FedFS comprises two crucial components that leverage
aggregated features of the local and global models to cooperate with
representations of an off-the-shelf model. These components are (1) adaptive
soft label construction, utilizing dot product operations to reformat labels
within intra-instances, and (2) intra-subject self-supervised learning,
employing cosine similarity operations to strengthen robust intra-subject
representations. Additionally, we introduce a regularization loss to prevent
overfitting and ensure the stability of the optimized model. To assess the
effectiveness of FedFS, we conduct comprehensive experiments on the DigiFace-1M
and VGGFace datasets, demonstrating superior performance compared to previous
methods.
[COMMENTS]
Accepted at the 35th British Machine Vision Conference 2024 (BMVC
2024), Glasgow, UK. Youngjun Kwak is corresponding author
[LINK]
http://arxiv.org/abs/2407.16289v1
[DATE]
2024-07-23 16:43:42+08:00
[CATEGORIES]
cs.LG
A deeper look at depth pruning of LLMs
[AUTHORS]
Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, Pavlo Molchanov
[ABSTRACT]
Large Language Models (LLMs) are not only resource-intensive to train but
even more costly to deploy in production. Therefore, recent work has attempted
to prune blocks of LLMs based on cheap proxies for estimating block importance,
effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b
models without any significant degradation of downstream metrics. In this
paper, we explore different block importance metrics by considering adaptive
metrics such as Shapley value in addition to static ones explored in prior
work. We show that adaptive metrics exhibit a trade-off in performance between
tasks, i.e., improvement on one task may degrade performance on the other due to
differences in the computed block influences. Furthermore, we extend this
analysis from a complete block to individual self-attention and feed-forward
layers, highlighting the propensity of the self-attention layers to be more
amenable to pruning, even allowing removal of up to 33% of the self-attention
layers without incurring any performance degradation on MMLU for Mistral 7b
(significant reduction in costly maintenance of KV-cache). Finally, we look at
simple performance recovery techniques to emulate the pruned layers by training
lightweight additive bias or low-rank linear adapters. Performance recovery
using emulated updates avoids performance degradation for the initial blocks
(up to 5% absolute improvement on MMLU), which is either competitive or
superior to the learning-based technique.
[LINK]
http://arxiv.org/abs/2407.16286v1
[DATE]
2024-07-23 16:40:27+08:00
[CATEGORIES]
cs.LG
Efficient Detection of Commutative Factors in Factor Graphs
[AUTHORS]
Malte Luttermann, Johann Machemer, Marcel Gehrke
[ABSTRACT]
Lifted probabilistic inference exploits symmetries in probabilistic graphical
models to allow for tractable probabilistic inference with respect to domain
sizes. To exploit symmetries in, e.g., factor graphs, it is crucial to identify
commutative factors, i.e., factors having symmetries within themselves due to
their arguments being exchangeable. The current state of the art to check
whether a factor is commutative with respect to a subset of its arguments
iterates over all possible subsets of the factor’s arguments, i.e., $O(2^n)$
iterations for a factor with $n$ arguments in the worst case. In this paper, we
efficiently solve the problem of detecting commutative factors in a factor
graph. In particular, we introduce the detection of commutative factors (DECOR)
algorithm, which allows us to drastically reduce the computational effort for
checking whether a factor is commutative in practice. We prove that DECOR
efficiently identifies restrictions to drastically reduce the number of
required iterations and validate the efficiency of DECOR in our empirical
evaluation.
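
As a rough illustration of the exhaustive baseline the abstract describes (not the DECOR algorithm itself), the sketch below checks commutativity by trying every subset of arguments. It assumes Boolean arguments and a factor stored as a dictionary from assignment tuples to potentials; the function names and this representation are hypothetical and not taken from the paper.

```python
from itertools import combinations, permutations, product


def is_commutative(factor, n, subset):
    """Return True if permuting the values at the positions in `subset`
    never changes the potential, for every assignment of the n arguments."""
    for assignment in product([False, True], repeat=n):
        base = factor[assignment]
        values = [assignment[i] for i in subset]
        # Try every distinct reordering of the values at the chosen positions.
        for perm in set(permutations(values)):
            permuted = list(assignment)
            for pos, val in zip(subset, perm):
                permuted[pos] = val
            if factor[tuple(permuted)] != base:
                return False
    return True


def largest_commutative_subset(factor, n):
    """Naive search over all subsets with at least two arguments,
    largest first; this is the O(2^n) iteration in the worst case."""
    for size in range(n, 1, -1):
        for subset in combinations(range(n), size):
            if is_commutative(factor, n, subset):
                return subset
    return ()


# Hypothetical example: the potential depends on how many of the first two
# arguments are true, and separately on the third argument.
phi = {
    (a, b, c): 10 * (int(a) + int(b)) + 3 * int(c) + 1
    for a, b, c in product([False, True], repeat=3)
}
result = largest_commutative_subset(phi, 3)
```

Here the first two arguments are exchangeable while the third is not, so the search falls through the single size-3 subset before succeeding on a size-2 subset. The cost of that fall-through grows quickly with the number of arguments, which is the effort the abstract says DECOR reduces.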
[COMMENTS]
Accepted to the Proceedings of the 12th Conference on Probabilistic
Graphical Models (PGM 2024)
[LINK]
http://arxiv.org/abs/2407.16280v1
[DATE]
2024-07-23 16:31:24+08:00
[CATEGORIES]
cs.LG
Self-Reasoning Assistant Learning for non-Abelian Gauge Fields Design
[AUTHORS]
Jinyang Sun, Xi Chen, Xiumei Wang, Dandan Zhu, Xingping Zhou
[ABSTRACT]
Non-Abelian braiding has attracted substantial attention because of its
pivotal role in describing the exchange behaviour of anyons, in which the input
and outcome of non-Abelian braiding are connected by a unitary matrix.
Implementing braiding in a classical system can assist the experimental
investigation of non-Abelian physics. However, the design of non-Abelian gauge
fields faces numerous challenges stemming from the intricate interplay of group
structures, Lie algebra properties, representation theory, topology, and
symmetry breaking. The extreme diversity makes it a powerful tool for the study
of condensed matter physics. Whereas the widely used artificial intelligence
with data-driven approaches has greatly promoted the development of physics,
most works are limited to data-to-data design. Here we propose a
self-reasoning assistant learning framework capable of directly generating
non-Abelian gauge fields. This framework utilizes the forward diffusion process
to capture and reproduce the complex patterns and details inherent in the
target distribution through continuous transformation. Then the reverse
diffusion process is used to make the generated data closer to the distribution
of the original situation. Thus, it has strong self-reasoning capabilities,
allowing it to automatically discover the feature representation and capture more
subtle relationships from the dataset. Moreover, the self-reasoning eliminates
the need for manual feature engineering and simplifies the process of model
building. Our framework offers a disruptive paradigm shift to parse complex
physical processes, automatically uncovering patterns from massive datasets.
[LINK]
http://arxiv.org/abs/2407.16255v1
[DATE]
2024-07-23 15:49:35+08:00
[CATEGORIES]
cs.LG
Dataset Growth
[AUTHORS]
Ziheng Qin, Zhaopan Xu, Yukun Zhou, Zangwei Zheng, Zebang Cheng, Hao Tang, Lei Shang, Baigui Sun, Xiaojiang Peng, Radu Timofte, Hongxun Yao, Kai Wang, Yang You
[ABSTRACT]
Deep learning benefits from the growing abundance of available data.
Meanwhile, efficiently dealing with the growing data scale has become a
challenge. Data publicly available are from different sources with various
qualities, and it is impractical to do manual cleaning against noise and
redundancy given today’s data scale. There are existing techniques for
cleaning/selecting the collected data. However, these methods are mainly
proposed for offline settings that target one of the cleanness and redundancy
problems. In practice, data are growing exponentially with both problems. This
leads to repeated data curation with sub-optimal efficiency. To tackle this
challenge, we propose InfoGrowth, an efficient online algorithm for data
cleaning and selection, resulting in a growing dataset that keeps up to date
with awareness of cleanliness and diversity. InfoGrowth can improve data
quality/efficiency on both single-modal and multi-modal tasks, with an
efficient and scalable design. Its framework makes it practical for real-world
data engines.
[COMMENTS]
arXiv admin note: text overlap with arXiv:2305.20087 by other authors
[LINK]
http://arxiv.org/abs/2405.18347v2
[DATE]
2024-07-23 15:31:18+08:00
[CATEGORIES]
cs.LG
Identifiable latent bandits: Combining observational data and exploration for personalized healthcare
[AUTHORS]
Ahmet Zahid Balcıoğlu, Emil Carlsson, Fredrik D. Johansson
[ABSTRACT]
Bandit algorithms hold great promise for improving personalized
decision-making but are notoriously sample-hungry. In most health applications,
it is infeasible to fit a new bandit for each patient, and observable variables
are often insufficient to determine optimal treatments, ruling out applying
contextual bandits learned from multiple patients. Latent bandits offer both
rapid exploration and personalization beyond what context variables can reveal
but require that a latent variable model can be learned consistently. In this
work, we propose bandit algorithms based on nonlinear independent component
analysis that can be provably identified from observational data to a degree
sufficient to infer the optimal action in a new bandit instance consistently.
We verify this strategy in simulated data, showing substantial improvement over
learning independent multi-armed bandits for every instance.
[COMMENTS]
9 pages, 2 figures
[LINK]
http://arxiv.org/abs/2407.16239v1
[DATE]
2024-07-23 15:26:38+08:00
[CATEGORIES]
cs.LG
OriGen: Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection
[AUTHORS]
Fan Cui, Chenyang Yin, Kexing Zhou, Youwei Xiao, Guangyu Sun, Qiang Xu, Qipeng Guo, Demin Song, Dahua Lin, Xingcheng Zhang, Yun Liang
[ABSTRACT]
Recent studies have illuminated that Large Language Models (LLMs) exhibit
substantial potential in the realm of RTL (Register Transfer Level) code
generation, with notable advancements evidenced by commercial models such as
GPT-4 and Claude3-Opus. Despite their proficiency, these commercial LLMs often
raise concerns regarding privacy and security. Conversely, open-source LLMs,
which offer solutions to these concerns, have inferior performance in RTL code
generation tasks to commercial models due to the lack of high-quality
open-source RTL datasets. To address this issue, we introduce OriGen, a fully
open-source framework featuring self-reflection capabilities and a dataset
augmentation methodology for generating high-quality, large-scale RTL code. We
propose a novel code-to-code augmentation methodology that leverages knowledge
distillation to enhance the quality of the open-source RTL code datasets.
Additionally, OriGen is capable of correcting syntactic errors by leveraging a
self-reflection process based on feedback from the compiler. The
self-reflection ability of the model is facilitated by a carefully constructed
dataset, which comprises a comprehensive collection of samples. Experimental
results demonstrate that OriGen remarkably outperforms other open-source
alternatives in RTL code generation, surpassing the previous best-performing
LLM by 9.8% on the VerilogEval-Human benchmark. Furthermore, OriGen exhibits
superior capabilities in self-reflection and error rectification, surpassing
GPT-4 by 18.1% on the benchmark designed to evaluate the capability of
self-reflection.
[LINK]
http://arxiv.org/abs/2407.16237v1
[DATE]
2024-07-23 15:22:25+08:00
[CATEGORIES]
cs.LG
Algebraic Adversarial Attacks on Integrated Gradients
[AUTHORS]
Lachlan Simpson, Federico Costanza, Kyle Millar, Adriel Cheng, Cheng-Chew Lim, Hong Gunn Chew
[ABSTRACT]
Adversarial attacks on explainability models have drastic consequences when
explanations are used to understand the reasoning of neural networks in
safety-critical systems. Path methods are one such class of attribution methods
susceptible to adversarial attacks. Adversarial learning is typically phrased
as a constrained optimisation problem. In this work, we propose algebraic
adversarial examples and study the conditions under which one can generate
adversarial examples for integrated gradients. Algebraic adversarial examples
provide a mathematically tractable approach to adversarial examples.
[LINK]
http://arxiv.org/abs/2407.16233v1
[DATE]
2024-07-23 15:17:45+08:00
[CATEGORIES]
cs.LG
Chemical Reaction Extraction from Long Patent Documents
[AUTHORS]
Aishwarya Jadhav, Ritam Dutt
[ABSTRACT]
The task of searching through patent documents is crucial for chemical patent
recommendation and retrieval. This can be enhanced by creating a patent
knowledge base (ChemPatKB) to aid in prior art searches and to provide a
platform for domain experts to explore new innovations in chemical compound
synthesis and use-cases. An essential foundational component of this KB is the
extraction of important reaction snippets from long patent documents, which
facilitates multiple downstream tasks such as reaction co-reference resolution
and chemical entity role identification. In this work, we explore the problem
of extracting reaction spans from chemical patents in order to create a
reactions resource database. We formulate this task as a paragraph-level
sequence tagging problem, where the system is required to return a sequence of
paragraphs that contain a description of a reaction. We propose several
approaches and modifications of the baseline models and study how different
methods generalize across different domains of chemical patents.
[COMMENTS]
Work completed in 2022 at Carnegie Mellon University
[LINK]
http://arxiv.org/abs/2407.15124v2
[DATE]
2024-07-23 15:11:47+08:00
[CATEGORIES]
cs.LG
ODGR: Online Dynamic Goal Recognition
[AUTHORS]
Matan Shamir, Osher Elhadad, Matthew E. Taylor, Reuth Mirsky
[ABSTRACT]
Traditionally, Reinforcement Learning (RL) problems are aimed at optimization
of the behavior of an agent. This paper proposes a novel take on RL, which is
used to learn the policy of another agent, to allow real-time recognition of
that agent’s goals. Goal Recognition (GR) has traditionally been framed as a
planning problem where one must recognize an agent’s objectives based on its
observed actions. Recent approaches have shown how reinforcement learning can
be used as part of the GR pipeline, but are limited to recognizing predefined
goals and lack scalability in domains with a large goal space. This paper
formulates a novel problem, “Online Dynamic Goal Recognition” (ODGR), as a
first step to address these limitations. Contributions include introducing the
concept of dynamic goals into the standard GR problem definition, revisiting
common approaches by reformulating them using ODGR, and demonstrating the
feasibility of solving ODGR in a navigation domain using transfer learning.
These novel formulations open the door for future extensions of existing
transfer learning-based GR methods, which will be robust to changing and
expansive real-time environments.
[COMMENTS]
8 pages, 1 figure, RLC workshop, WAHT workshop
[LINK]
http://arxiv.org/abs/2407.16220v1
[DATE]
2024-07-23 14:52:52+08:00
[CATEGORIES]
cs.LG
AI Control: Improving Safety Despite Intentional Subversion
[AUTHORS]
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
[ABSTRACT]
As large language models (LLMs) become more powerful and are deployed more
autonomously, it will be increasingly important to prevent them from causing
harmful outcomes. Researchers have investigated a variety of safety techniques
for this purpose, e.g. using models to review the outputs of other models, or
red-teaming techniques to surface subtle failure modes. However, researchers
have not evaluated whether such techniques still ensure safety if the model is
itself intentionally trying to subvert them. In this paper, we develop and
evaluate pipelines of safety techniques (“protocols”) that are robust to
intentional subversion.
We investigate a scenario in which we want to solve a sequence of programming
problems, using access to a powerful but untrusted model (in our case, GPT-4),
access to a less powerful trusted model (in our case, GPT-3.5), and limited
access to high-quality trusted labor. We investigate protocols that aim to
never submit solutions containing backdoors, which we operationalize here as
logical errors that are not caught by test cases.
We investigate a range of protocols and test each against strategies that the
untrusted model could use to subvert them. One protocol is what we call trusted
editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to
rate the suspiciousness of that code. If the code is below some suspiciousness
threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove
parts that seem suspicious and then submits the edited code. Another protocol
is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks
another instance of GPT-4 whether the code is backdoored, using various
techniques to prevent the GPT-4 instances from colluding. These protocols
improve substantially on simple baselines.
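
The control flow of the trusted editing protocol described above can be sketched as follows. This is only a minimal illustration of the decision rule stated in the abstract: the callables `untrusted_write`, `trusted_rate`, and `trusted_edit`, and the numeric `threshold`, are hypothetical placeholders for the model calls and are not taken from the paper's code.

```python
from typing import Callable


def trusted_editing(
    problem: str,
    untrusted_write: Callable[[str], str],  # stands in for the untrusted model
    trusted_rate: Callable[[str], float],   # trusted model's suspiciousness score
    trusted_edit: Callable[[str], str],     # trusted model removes suspicious parts
    threshold: float,
) -> str:
    """Return the solution to submit for one programming problem."""
    code = untrusted_write(problem)
    # Below the suspiciousness threshold: submit the untrusted code unchanged.
    if trusted_rate(code) < threshold:
        return code
    # Otherwise: submit the trusted model's edited version instead.
    return trusted_edit(code)


# Toy stand-ins so the sketch runs without any model access (assumptions).
def toy_write(problem: str) -> str:
    return "def solve(x):\n    return x + 1  # suspicious" if "hard" in problem else "def solve(x):\n    return x"


def toy_rate(code: str) -> float:
    return 0.9 if "suspicious" in code else 0.1


def toy_edit(code: str) -> str:
    return code.replace("  # suspicious", "")


easy = trusted_editing("easy task", toy_write, toy_rate, toy_edit, threshold=0.5)
hard = trusted_editing("hard task", toy_write, toy_rate, toy_edit, threshold=0.5)
```

The threshold trades usefulness against safety in this sketch: a lower value routes more solutions through the trusted editor, while a higher value submits more of the untrusted model's code unchanged.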
[COMMENTS]
Edit: Fix minor typos, clarify abstract, add glossary, expand related
work. ICML version: https://openreview.net/pdf?id=KviM5k8pcP
[LINK]
http://arxiv.org/abs/2312.06942v5
[DATE]
2024-07-23 14:47:13+08:00
[CATEGORIES]
cs.LG
Counterfactual Learning on Graphs: A Survey
[AUTHORS]
Zhimeng Guo, Teng Xiao, Zongyu Wu, Charu Aggarwal, Hui Liu, Suhang Wang
[ABSTRACT]
Graph-structured data are pervasive in the real world, such as social
networks, molecular graphs and transaction networks. Graph neural networks
(GNNs) have achieved great success in representation learning on graphs,
facilitating various downstream tasks. However, GNNs have several drawbacks
such as lacking interpretability, easily inheriting the bias of data, and
being unable to model causal relations. Recently, counterfactual learning on graphs has
shown promising results in alleviating these drawbacks. Various approaches have
been proposed for counterfactual fairness, explainability, link prediction and
other applications on graphs. To facilitate the development of this promising
direction, in this survey, we categorize and comprehensively review papers on
graph counterfactual learning. We divide existing methods into four categories
based on problems studied. For each category, we provide background and
motivating examples, a general framework summarizing existing works and a
detailed review of these works. We point out promising future research
directions at the intersection of graph-structured data, counterfactual
learning, and real-world applications. To offer a comprehensive view of
resources for future studies, we compile a collection of open-source
implementations, public datasets, and commonly-used evaluation metrics. This
survey aims to serve as a “one-stop-shop” for building a unified
understanding of graph counterfactual learning categories and current
resources. We also maintain a repository for papers and resources and will keep
updating the repository:
https://github.com/TimeLovercc/Awesome-Graph-Causal-Learning.
[LINK]
http://arxiv.org/abs/2304.01391v3
[DATE]
2024-07-23 14:43:57+08:00
[CATEGORIES]
cs.LG
Strategy and Skill Learning for Physics-based Table Tennis Animation
[AUTHORS]
Jiashun Wang, Jessica Hodgins, Jungdam Won
[ABSTRACT]
Recent advancements in physics-based character animation leverage deep
learning to generate agile and natural motion, enabling characters to execute
movements such as backflips, boxing, and tennis. However, reproducing the
selection and use of diverse motor skills in dynamic environments to solve
complex tasks, as humans do, still remains a challenge. We present a strategy
and skill learning approach for physics-based table tennis animation. Our
method addresses the issue of mode collapse, where the characters do not fully
utilize the motor skills they need to perform to execute complex tasks. More
specifically, we demonstrate a hierarchical control system for diversified
skill learning and a strategy learning framework for effective decision-making.
We showcase the efficacy of our method through comparative analysis with
state-of-the-art methods, demonstrating its capabilities in executing various
skills for table tennis. Our strategy learning framework is validated through
both agent-agent interaction and human-agent interaction in Virtual Reality,
handling both competitive and cooperative tasks.
[COMMENTS]
SIGGRAPH 2024
[LINK]
http://arxiv.org/abs/2407.16210v1
[DATE]
2024-07-23 14:31:13+08:00
[CATEGORIES]
cs.LG
Scalar Function Topology Divergence: Comparing Topology of 3D Objects
[AUTHORS]
Ilya Trofimov, Daria Voronkova, Eduard Tulchinskii, Evgeny Burnaev, Serguei Barannikov
[ABSTRACT]
We propose a new topological tool for computer vision - Scalar Function
Topology Divergence (SFTD), which measures the dissimilarity of multi-scale
topology between sublevel sets of two functions having a common domain.
Functions can be defined on an undirected graph or Euclidean space of any
dimensionality. Most of the existing methods for comparing topology are based
on Wasserstein distance between persistence barcodes and they don’t take into
account the localization of topological features. The minimization of SFTD
ensures that the corresponding topological features of scalar functions are
located in the same places. The proposed tool provides useful visualizations
depicting areas where functions have topological dissimilarities. We provide
applications of the proposed method to 3D computer vision. In particular,
experiments demonstrate that SFTD as an additional loss improves the
reconstruction of cellular 3D shapes from 2D fluorescence microscopy images,
and helps to identify topological errors in 3D segmentation. Additionally, we
show that SFTD outperforms Betti matching loss in 2D segmentation problems.
[LINK]
http://arxiv.org/abs/2407.08364v2
[DATE]
2024-07-23 14:26:34+08:00
[CATEGORIES]
cs.LG
Automatic Environment Shaping is the Next Frontier in RL
[AUTHORS]
Younghyo Park, Gabriel B. Margolis, Pulkit Agrawal
[ABSTRACT]
Many roboticists dream of presenting a robot with a task in the evening and
returning the next morning to find the robot capable of solving the task. What
is preventing us from achieving this? Sim-to-real reinforcement learning (RL)
has achieved impressive performance on challenging robotics tasks, but requires
substantial human effort to set up the task in a way that is amenable to RL.
It’s our position that algorithmic improvements in policy optimization and
other ideas should be guided towards resolving the primary bottleneck of
shaping the training environment, i.e., designing observations, actions,
rewards and simulation dynamics. Most practitioners don’t tune the RL
algorithm, but other environment parameters to obtain a desirable controller.
We posit that scaling RL to diverse robotic tasks will only be achieved if the
community focuses on automating environment shaping procedures.
[COMMENTS]
ICML 2024 Position Track; Website at
https://auto-env-shaping.github.io/
[LINK]
http://arxiv.org/abs/2407.16186v1
[DATE]
2024-07-23 13:22:29+08:00
[CATEGORIES]
cs.LG
Logifold: A Geometrical Foundation of Ensemble Machine Learning
[AUTHORS]
Inkee Jung, Siu-Cheong Lau
[ABSTRACT]
We present a local-to-global and measure-theoretical approach to
understanding datasets. The core idea is to formulate a logifold structure and
to interpret network models with restricted domains as local charts of
datasets. In particular, this provides a mathematical foundation for ensemble
machine learning. Our experiments demonstrate that logifolds can be implemented
to identify fuzzy domains and improve accuracy compared to taking the average of
model outputs. Additionally, we provide a theoretical example of a logifold,
highlighting the importance of restricting to domains of classifiers in an
ensemble.
[COMMENTS]
6 pages
[LINK]
http://arxiv.org/abs/2407.16177v1
[DATE]
2024-07-23 12:47:58+08:00
[CATEGORIES]
cs.LG
Pixel Embedding: Fully Quantized Convolutional Neural Network with Differentiable Lookup Table
[AUTHORS]
Hiroyuki Tokunaga, Joel Nicholls, Daria Vazhenina, Atsunori Kanemura
[ABSTRACT]
By quantizing network weights and activations to low bitwidth, we can obtain
hardware-friendly and energy-efficient networks. However, existing quantization
techniques utilizing the straight-through estimator and piecewise constant
functions face the issue of how to represent originally high-bit input data
with low-bit values. To fully quantize deep neural networks, we propose pixel
embedding, which replaces each float-valued input pixel with a vector of
quantized values by using a lookup table. The lookup table or low-bit
representation of pixels is differentiable and trainable by backpropagation.
Such replacement of inputs with vectors is similar to word embedding in the
natural language processing field. Experiments on ImageNet and CIFAR-100 show
that pixel embedding reduces the top-5 error gap caused by quantizing the
floating points at the first layer to only 1% for the ImageNet dataset, and the
top-1 error gap caused by quantizing first and last layers to slightly over 1%
for the CIFAR-100 dataset. The usefulness of pixel embedding is further
demonstrated by inference time measurements, which demonstrate over 1.7 times
speedup compared to floating point precision first layer.
[LINK]
http://arxiv.org/abs/2407.16174v1
[DATE]
2024-07-23 12:41:36+08:00
[CATEGORIES]
cs.LG
Advanced AI Framework for Enhanced Detection and Assessment of Abdominal Trauma: Integrating 3D Segmentation with 2D CNN and RNN Models
[AUTHORS]
Liheng Jiang, Xuechun Yang, Chang Yu, Zhizhong Wu, Yuting Wang
[ABSTRACT]
Trauma is a significant cause of mortality and disability, particularly among
individuals under forty. Traditional diagnostic methods for traumatic injuries,
such as X-rays, CT scans, and MRI, are often time-consuming and dependent on
medical expertise, which can delay critical interventions. This study explores
the application of artificial intelligence (AI) and machine learning (ML) to
improve the speed and accuracy of abdominal trauma diagnosis. We developed an
advanced AI-based model combining 3D segmentation, 2D Convolutional Neural
Networks (CNN), and Recurrent Neural Networks (RNN) to enhance diagnostic
performance. Our model processes abdominal CT scans to provide real-time,
precise assessments, thereby improving clinical decision-making and patient
outcomes. Comprehensive experiments demonstrated that our approach
significantly outperforms traditional diagnostic methods, as evidenced by
rigorous evaluation metrics. This research sets a new benchmark for automated
trauma detection, leveraging the strengths of AI and ML to revolutionize trauma
care.
[COMMENTS]
6 Pages
[LINK]
http://arxiv.org/abs/2407.16165v1
[DATE]
2024-07-23 12:18:34+08:00
[CATEGORIES]
cs.LG
Gaussian Splashing: Unified Particles for Versatile Motion Synthesis and Rendering
[AUTHORS]
Yutao Feng, Xiang Feng, Yintong Shang, Ying Jiang, Chang Yu, Zeshun Zong, Tianjia Shao, Hongzhi Wu, Kun Zhou, Chenfanfu Jiang, Yin Yang
[ABSTRACT]
We demonstrate the feasibility of integrating physics-based animations of
solids and fluids with 3D Gaussian Splatting (3DGS) to create novel effects in
virtual scenes reconstructed using 3DGS. Leveraging the coherence of the
Gaussian Splatting and Position-Based Dynamics (PBD) in the underlying
representation, we manage rendering, view synthesis, and the dynamics of solids
and fluids in a cohesive manner. Similar to GaussianShader, we enhance each
Gaussian kernel with an added normal, aligning the kernel’s orientation with
the surface normal to refine the PBD simulation. This approach effectively
eliminates spiky noises that arise from rotational deformation in solids. It
also allows us to integrate physically based rendering to augment the dynamic
surface reflections on fluids. Consequently, our framework is capable of
realistically reproducing surface highlights on dynamic fluids and facilitating
interactions between scene objects and fluids from new views. For more
information, please visit our project page at
https://gaussiansplashing.github.io/.
[LINK]
http://arxiv.org/abs/2401.15318v2
[DATE]
2024-07-23 12:05:53+08:00
[CATEGORIES]
cs.LG
TransFeat-TPP: An Interpretable Deep Covariate Temporal Point Processes
[AUTHORS]
Zizhuo Meng, Boyu Li, Xuhui Fan, Zhidong Li, Yang Wang, Fang Chen, Feng Zhou
[ABSTRACT]
The classical temporal point process (TPP) constructs an intensity function
by taking the occurrence times into account. Nevertheless, occurrence time may
not be the only relevant factor; other contextual data, termed covariates, may
also impact the event evolution. Incorporating such covariates into the model
is beneficial, while distinguishing their relevance to the event dynamics is of
great practical significance. In this work, we propose a Transformer-based
covariate temporal point process (TransFeat-TPP) model to improve the
interpretability of deep covariate-TPPs while maintaining powerful
expressiveness. TransFeat-TPP can effectively model complex relationships
between events and covariates, and provide enhanced interpretability by
discerning the importance of various covariates. Experimental results on
synthetic and real datasets demonstrate improved prediction accuracy and
consistently interpretable feature importance when compared to existing deep
covariate-TPPs.
[LINK]
http://arxiv.org/abs/2407.16161v1
[DATE]
2024-07-23 12:05:29+08:00
[CATEGORIES]
cs.LG
Exploring The Neural Burden In Pruned Models: An Insight Inspired By Neuroscience
[AUTHORS]
Zeyu Wang, Weichen Dai, Xiangyu Zhou, Ji Qi, Yi Zhou
[ABSTRACT]
Vision Transformer and its variants have been adopted in many visual tasks
due to their powerful capabilities, which also bring significant challenges in
computation and storage. Consequently, researchers have introduced various
compression methods in recent years, among which the pruning techniques are
widely used to remove a significant fraction of the network. These
methods can reduce a significant percentage of the FLOPs, but often lead to a
decrease in model performance. To investigate the underlying causes, we focus
on the pruning methods specifically belonging to the pruning-during-training
category, then draw inspiration from neuroscience and propose a new concept for
artificial neural network models named Neural Burden. We investigate its impact
in the model pruning process, and subsequently explore a simple yet effective
approach to mitigate the decline in model performance, which can be applied to
any pruning-during-training technique. Extensive experiments indicate that the
neural burden phenomenon indeed exists, and show the potential of our method.
We hope that our findings can provide valuable insights for future research.
Code will be made publicly available after this paper is published.
[LINK]
http://arxiv.org/abs/2407.16716v1
[DATE]
2024-07-23 11:43:21+08:00
[CATEGORIES]
cs.LG
On the Benefits of Rank in Attention Layers
[AUTHORS]
Noah Amsel, Gilad Yehudai, Joan Bruna
[ABSTRACT]
Attention-based mechanisms are widely used in machine learning, most
prominently in transformers. However, hyperparameters such as the rank of the
attention matrices and the number of heads are scaled nearly the same way in
all realizations of this architecture, without theoretical justification. In
this work we show that there are dramatic trade-offs between the rank and
number of heads of the attention mechanism. Specifically, we present a simple
and natural target function that can be represented using a single full-rank
attention head for any context length, but that cannot be approximated by
low-rank attention unless the number of heads is exponential in the embedding
dimension, even for short context lengths. Moreover, we prove that, for short
context lengths, adding depth allows the target to be approximated by low-rank
attention. For long contexts, we conjecture that full-rank attention is
necessary. Finally, we present experiments with off-the-shelf transformers that
validate our theoretical findings.
[LINK]
http://arxiv.org/abs/2407.16153v1
[DATE]
2024-07-23 11:40:24+08:00
[CATEGORIES]
cs.LG
On-Device Soft Sensors: Real-Time Fluid Flow Estimation from Level Sensor Data
[AUTHORS]
Tianheng Ling, Chao Qian, Gregor Schiele
[ABSTRACT]
Soft sensors are crucial in bridging autonomous systems’ physical and digital
realms, enhancing sensor fusion and perception. Instead of deploying soft
sensors on the Cloud, this study shifts towards employing on-device soft
sensors, promising heightened efficiency and bolstering data security. Our
approach substantially improves energy efficiency by deploying Artificial
Intelligence (AI) directly on devices within a wireless sensor network.
Furthermore, the synergistic integration of the Microcontroller Unit and
Field-Programmable Gate Array (FPGA) leverages the rapid AI inference
capabilities of the latter. Empirical evidence from our real-world use case
demonstrates that FPGA-based soft sensors achieve inference times ranging
remarkably from 1.04 to 12.04 microseconds. These compelling results highlight
the considerable potential of our innovative approach for executing real-time
inference tasks efficiently, thereby presenting a feasible alternative that
effectively addresses the latency challenges intrinsic to Cloud-based
deployments.
[COMMENTS]
8 pages, 6 figures, 1 Table, Accepted by the 1st AUTONOMOUS
UBIQUITOUS SYSTEMS (AUTOQUITOUS) WORKSHOP of EAI MobiQuitous 2023 - 20th EAI
International Conference on Mobile and Ubiquitous Systems: Computing,
Networking and Services
[LINK]
http://arxiv.org/abs/2311.15036v3
[DATE]
2024-07-23 11:33:16+08:00
[CATEGORIES]
cs.LG
EquiPocket: an E(3)-Equivariant Geometric Graph Neural Network for Ligand Binding Site Prediction
[AUTHORS]
Yang Zhang, Zhewei Wei, Ye Yuan, Chongxuan Li, Wenbing Huang
[ABSTRACT]
Predicting the binding sites of target proteins plays a fundamental role in
drug discovery. Most existing deep-learning methods consider a protein as a 3D
image by spatially clustering its atoms into voxels and then feed the voxelized
protein into a 3D CNN for prediction. However, the CNN-based methods encounter
several critical issues: 1) defective in representing irregular protein
structures; 2) sensitive to rotations; 3) insufficient to characterize the
protein surface; 4) unaware of protein size shift. To address the above issues,
this work proposes EquiPocket, an E(3)-equivariant Graph Neural Network (GNN)
for binding site prediction, which comprises three modules: the first one to
extract local geometric information for each surface atom, the second one to
model both the chemical and spatial structure of protein and the last one to
capture the geometry of the surface via equivariant message passing over the
surface atoms. We further propose a dense attention output layer to alleviate
the effect incurred by variable protein size. Extensive experiments on several
representative benchmarks demonstrate the superiority of our framework to the
state-of-the-art methods.
[COMMENTS]
Accepted to ICML 2024 (Oral)
[LINK]
http://arxiv.org/abs/2302.12177v3
[DATE]
2024-07-23 11:32:32+08:00
[CATEGORIES]
cs.LG
Predicting Stock Prices with FinBERT-LSTM: Integrating News Sentiment Analysis
[AUTHORS]
Wenjun Gu, Yihao Zhong, Shizun Li, Changsong Wei, Liting Dong, Zhuoyue Wang, Chao Yan
[ABSTRACT]
The stock market’s ascent typically mirrors the flourishing state of the
economy, whereas its decline is often an indicator of an economic downturn.
Therefore, for a long time, significant correlation elements for predicting
trends in financial stock markets have been widely discussed, and people are
becoming increasingly interested in the task of financial text mining. The
inherent instability of stock prices makes them acutely responsive to
fluctuations within the financial markets. In this article, we use deep
learning networks to predict stock prices from historical stock prices and
from financial, business, and technical news articles that carry market
information. We illustrate the enhancement of predictive precision by
integrating weighted news categories into the forecasting model. We developed a
pre-trained NLP model known as FinBERT, designed to discern the sentiments
within financial texts. Subsequently, we advanced this model by incorporating
the sophisticated Long Short Term Memory (LSTM) architecture, thus constructing
the innovative FinBERT-LSTM model. This model utilizes news categories related
to the stock market structure hierarchy, namely market, industry, and stock
related news categories, combined with the stock market’s stock price situation
in the previous week for prediction. We selected NASDAQ-100 index stock data
and trained the model on Benzinga news articles, and utilized Mean Absolute
Error (MAE), Mean Absolute Percentage Error (MAPE), and Accuracy as the key
metrics for the assessment and comparative analysis of the model’s performance.
The results indicate that FinBERT-LSTM performs the best, followed by LSTM, and
DNN model ranks third in terms of effectiveness.
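As a reader's aid, the two error metrics named in the abstract can be sketched as follows. This is a minimal illustration using the standard textbook definitions of MAE and MAPE; the function names and the toy prices are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y - y_hat|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (y_true must be non-zero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

# Hypothetical closing prices and predictions, for illustration only.
prices = [100.0, 200.0]
predictions = [110.0, 180.0]
print(mae(prices, predictions), mape(prices, predictions))
```

MAPE normalizes each error by the true price, so it is comparable across stocks with different price levels, whereas MAE stays in price units.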
[COMMENTS]
10 pages, 6 figures, 2 tables, 2024 8th International Conference on
Cloud and Big Data Computing
[LINK]
http://arxiv.org/abs/2407.16150v1
[DATE]
2024-07-23 11:26:07+08:00
[CATEGORIES]
cs.LG
Research on Adverse Drug Reaction Prediction Model Combining Knowledge Graph Embedding and Deep Learning
[AUTHORS]
Yufeng Li, Wenchao Zhao, Bo Dang, Xu Yan, Weimin Wang, Min Gao, Mingxuan Xiao
[ABSTRACT]
In clinical treatment, identifying potential adverse reactions of drugs can
help doctors make medication decisions. Previous studies suffer from three
problems: features are high-dimensional and sparse, an independent prediction
model must be constructed for each adverse reaction, and prediction accuracy
is low. In response, this paper develops an adverse drug reaction prediction
model based on knowledge graph embedding and deep learning, which provides
unified prediction of the adverse drug reactions covered by the experiments.
Knowledge graph embedding technology can fuse the associated
information between drugs and alleviate the shortcomings of high-dimensional
sparsity in feature matrices, and the efficient training capabilities of deep
learning can improve the prediction accuracy of the model. This article builds
an adverse drug reaction knowledge graph based on drug feature data; by
analyzing the embedding effect of the knowledge graph under different embedding
strategies, the best embedding strategy is selected to obtain sample vectors;
and then a convolutional neural network model is constructed to predict adverse
reactions. The results show that under the DistMult embedding model and
400-dimensional embedding strategy, the convolutional neural network model has
the best prediction effect; the average accuracy, F_1 score, recall rate and
area under the curve of repeated experiments are better than the methods
reported in the literature. The obtained prediction model has good prediction
accuracy and stability, and can provide an effective reference for later safe
medication guidance.
[COMMENTS]
12 pages, 4 figures, 9 tables
[LINK]
http://arxiv.org/abs/2407.16715v1
[DATE]
2024-07-23 11:25:55+08:00
[CATEGORIES]
cs.LG
Improved Few-Shot Image Classification Through Multiple-Choice Questions
[AUTHORS]
Dipika Khullar, Emmett Goodman, Negin Sokhandan
[ABSTRACT]
Through a simple multiple-choice language prompt, a VQA model can operate as a
zero-shot image classifier, producing a classification label. Compared to
typical image encoders, VQA models offer an advantage: VQA-produced image
embeddings can be infused with the most relevant visual information through
tailored language prompts. Nevertheless, for most tasks, zero-shot VQA
performance is lacking, either because of unfamiliar category names, or
dissimilar pre-training data and test data distributions. We propose a simple
method to boost VQA performance for image classification using only a handful
of labeled examples and a multiple-choice question. This few-shot method is
training-free and maintains the dynamic and flexible advantages of the VQA
model. Rather than relying on the final language output, our approach uses
multiple-choice questions to extract prompt-specific latent representations,
which are enriched with relevant visual information. These representations are
combined to create a final overall image embedding, which is decoded via
reference to latent class prototypes constructed from the few labeled examples.
We demonstrate this method outperforms both pure visual encoders and zero-shot
VQA baselines to achieve impressive performance on common few-shot tasks
including MiniImageNet, Caltech-UCSD Birds, and CIFAR-100. Finally, we show our
approach does particularly well in settings with numerous diverse visual
attributes such as the fabric, article-style, texture, and view of different
articles of clothing, where other few-shot approaches struggle, as we can
tailor our image representations only on the semantic features of interest.
[LINK]
http://arxiv.org/abs/2407.16145v1
[DATE]
2024-07-23 11:09:42+08:00
[CATEGORIES]
cs.LG
Diffusion Models as Optimizers for Efficient Planning in Offline RL
[AUTHORS]
Renming Huang, Yunqiang Pei, Guoqing Wang, Yangming Zhang, Yang Yang, Peng Wang, Hengtao Shen
[ABSTRACT]
Diffusion models have shown strong competitiveness in offline reinforcement
learning tasks by formulating decision-making as sequential generation.
However, the practicality of these methods is limited due to the lengthy
inference processes they require. In this paper, we address this problem by
decomposing the sampling process of diffusion models into two decoupled
subprocesses: 1) generating a feasible trajectory, which is a time-consuming
process, and 2) optimizing the trajectory. With this decomposition approach, we
are able to partially separate efficiency and quality factors, enabling us to
gain efficiency advantages while maintaining quality. We
propose the Trajectory Diffuser, which utilizes a faster autoregressive model
to handle the generation of feasible trajectories while retaining the
trajectory optimization process of diffusion models. This allows us to achieve
more efficient planning without sacrificing capability. To evaluate the
effectiveness and efficiency of the Trajectory Diffuser, we conduct experiments
on the D4RL benchmarks. The results demonstrate that our method achieves
$3$-$10\times$ faster inference speed compared to previous sequence
modeling methods, while also outperforming them in terms of overall
performance. https://github.com/RenMing-Huang/TrajectoryDiffuser
Keywords: Reinforcement Learning, Efficient Planning, Diffusion Model
[COMMENTS]
The paper was accepted by ECCV2024
[LINK]
http://arxiv.org/abs/2407.16142v1
[DATE]
2024-07-23 11:00:01+08:00
[CATEGORIES]
cs.LG
Diffusion Transformer Captures Spatial-Temporal Dependencies: A Theory for Gaussian Process Data
[AUTHORS]
Hengyu Fu, Zehao Dou, Jiawei Guo, Mengdi Wang, Minshuo Chen
[ABSTRACT]
Diffusion Transformer, the backbone of Sora for video generation,
successfully scales the capacity of diffusion models, pioneering new avenues
for high-fidelity sequential data generation. Unlike static data such as
images, sequential data consists of consecutive data frames indexed by time,
exhibiting rich spatial and temporal dependencies. These dependencies represent
the underlying dynamic model and are critical to validate the generated data.
In this paper, we make the first theoretical step towards bridging diffusion
transformers for capturing spatial-temporal dependencies. Specifically, we
establish score approximation and distribution estimation guarantees of
diffusion transformers for learning Gaussian process data with covariance
functions of various decay patterns. We highlight how the spatial-temporal
dependencies are captured and affect learning efficiency. Our study proposes a
novel transformer approximation theory, where the transformer acts to unroll an
algorithm. We support our theoretical results by numerical experiments,
providing strong evidence that spatial-temporal dependencies are captured
within attention layers, aligning with our approximation theory.
[COMMENTS]
52 pages, 8 figures
[LINK]
http://arxiv.org/abs/2407.16134v1
[DATE]
2024-07-23 10:42:43+08:00
[CATEGORIES]
cs.LG
Crystals with Transformers on Graphs, for Prediction of Unconventional Crystal Material Properties and the Benchmark
[AUTHORS]
Hongyi Wang, Ji Sun, Jinzhe Liang, Li Zhai, Zitian Tang, Zijian Li, Wei Zhai, Xusheng Wang, Weihao Gao, Sheng Gong, Bolong Huang, Hua Zhang
[ABSTRACT]
The ionic bonding across the lattice and ordered microscopic structures endow
crystals with unique symmetry and determine their macroscopic properties.
Unconventional crystals, in particular, exhibit non-traditional lattice
structures or possess exotic physical properties, making them intriguing
subjects for investigation. Therefore, to accurately predict the physical and
chemical properties of crystals, it is crucial to consider long-range orders.
While GNNs excel at capturing the local environment of atoms in crystals, they
often face challenges in effectively capturing longer-ranged interactions due
to their limited depth. In this paper, we propose CrysToGraph
($\textbf{Crys}$tals with $\textbf{T}$ransformers $\textbf{o}$n
$\textbf{Graph}$s), a novel transformer-based geometric graph network designed
specifically for unconventional crystalline systems, and UnconvBench, a
comprehensive benchmark to evaluate models’ predictive performance on
unconventional crystal materials such as defected crystals, low-dimensional
crystals, and MOFs. CrysToGraph effectively captures short-range interactions
with transformer-based graph convolution blocks as well as long-range
interactions with graph-wise transformer blocks. CrysToGraph proves its
effectiveness in modelling unconventional crystal materials in multiple tasks,
and moreover, it outperforms most existing methods, achieving new
state-of-the-art results on the benchmarks of both unconventional crystals and
traditional crystals.
[LINK]
http://arxiv.org/abs/2407.16131v1
[DATE]
2024-07-23 10:31:06+08:00
[CATEGORIES]
cs.LG
Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation
[AUTHORS]
Tao Meng, Fuchen Zhang, Yuntao Shou, Hongen Shao, Wei Ai, Keqin Li
[ABSTRACT]
Since Multimodal Emotion Recognition in Conversation (MERC) can be applied to
public opinion monitoring, intelligent dialogue robots, and other fields, it
has received extensive research attention in recent years. Unlike traditional
unimodal emotion recognition, MERC can fuse complementary semantic information
between multiple modalities (e.g., text, audio, and vision) to improve emotion
recognition. However, previous work ignores inter-modal alignment and
intra-modal noise before multimodal fusion and instead fuses multimodal
features directly, which hinders representation learning. In this study, we
develop a novel approach called Masked Graph Learning with Recurrent Alignment
(MGLRA) to tackle this problem, which uses a
recurrent iterative module with memory to align multimodal features, and then
uses the masked GCN for multimodal feature fusion. First, we employ LSTM to
capture contextual information and use a graph attention-filtering mechanism to
eliminate noise effectively within the modality. Second, we build a recurrent
iteration module with a memory function, which can use communication between
different modalities to eliminate the gap between modalities and achieve the
preliminary alignment of features between modalities. Then, a cross-modal
multi-head attention mechanism is introduced to achieve feature alignment
between modalities and construct a masked GCN for multimodal feature fusion,
which can perform random mask reconstruction on the nodes in the graph to
obtain better node feature representation. Finally, we utilize a multilayer
perceptron (MLP) for emotion recognition. Extensive experiments on two
benchmark datasets (i.e., IEMOCAP and MELD) demonstrate that MGLRA
outperforms state-of-the-art methods.
[COMMENTS]
15 pages, 9 figures
[LINK]
http://arxiv.org/abs/2407.16714v1
[DATE]
2024-07-23 10:23:51+08:00
[CATEGORIES]
cs.LG
Towards Effective Fusion and Forecasting of Multimodal Spatio-temporal Data for Smart Mobility
[AUTHORS]
Chenxing Wang
[ABSTRACT]
With the rapid development of location based services, multimodal
spatio-temporal (ST) data including trajectories, transportation modes, traffic
flow and social check-ins are being collected for deep learning based methods.
These deep learning based methods learn ST correlations to support the
downstream tasks in the fields such as smart mobility, smart city and other
intelligent transportation systems. Despite their effectiveness, ST data fusion
and forecasting methods face practical challenges in real-world scenarios.
First, forecasting performance in areas with insufficient ST data is inferior,
making it necessary to transfer meta knowledge from heterogeneous areas to
enhance the sparse representations. Second, it is nontrivial to accurately
forecast in multi-transportation-mode scenarios due to the fine-grained ST
features of similar transportation modes, making it necessary to distinguish
and measure the ST correlations to alleviate the influence caused by entangled
ST features. Finally, some data modalities (e.g., transportation mode) are
lost due to privacy or technical issues in certain scenarios, making it
necessary to effectively fuse the multimodal sparse ST features and enrich the
ST representations. To tackle these challenges, our research aims to
develop effective fusion and forecasting methods for multimodal ST data in
smart mobility scenarios. In this paper, we introduce our recent work that
investigates these challenges across various real-world applications and
identify the open challenges in this field for future work.
[COMMENTS]
4 pages
[LINK]
http://arxiv.org/abs/2407.16123v1
[DATE]
2024-07-23 10:08:22+08:00
[CATEGORIES]
cs.LG
Uncertainty-Aware Deep Neural Representations for Visual Analysis of Vector Field Data
[AUTHORS]
Atul Kumar, Siddharth Garg, Soumya Dutta
[ABSTRACT]
The widespread use of Deep Neural Networks (DNNs) has recently resulted in
their application to challenging scientific visualization tasks. While advanced
DNNs demonstrate impressive generalization abilities, understanding factors
like prediction quality, confidence, robustness, and uncertainty is crucial.
These insights aid application scientists in making informed decisions.
However, DNNs lack inherent mechanisms to measure prediction uncertainty,
prompting the creation of distinct frameworks for constructing robust
uncertainty-aware models tailored to various visualization tasks. In this work,
we develop uncertainty-aware implicit neural representations to model
steady-state vector fields effectively. We comprehensively evaluate the
efficacy of two principled deep uncertainty estimation techniques: (1) Deep
Ensemble and (2) Monte Carlo Dropout, aimed at enabling uncertainty-informed
visual analysis of features within steady vector field data. Our detailed
exploration using several vector data sets indicates that uncertainty-aware
models generate informative visualization results of vector field features.
Furthermore, incorporating prediction uncertainty improves the resilience and
interpretability of our DNN model, rendering it applicable for the analysis of
non-trivial vector field data sets.
[COMMENTS]
Accepted for publication at IEEE Visualization 2024
[LINK]
http://arxiv.org/abs/2407.16119v1
[DATE]
2024-07-23 09:59:58+08:00
[CATEGORIES]
cs.LG
Transformer-based Graph Neural Networks for Battery Range Prediction in AIoT Battery-Swap Services
[AUTHORS]
Zhao Li, Yang Liu, Chuan Zhou, Xuanwu Liu, Xuming Pan, Buqing Cao, Xindong Wu
[ABSTRACT]
The concept of the sharing economy has gained broad recognition, and within
this context, Sharing E-Bike Batteries (SEBs) have emerged as a focal point of
societal interest. Despite the popularity, a notable discrepancy remains
between user expectations regarding the remaining battery range of SEBs and the
reality, leading to a pronounced inclination among users to find an available
SEB during emergency situations. In response to this challenge, the integration
of Artificial Intelligence of Things (AIoT) and battery-swap services has
surfaced as a viable solution. In this paper, we propose a novel structural
Transformer-based model, referred to as the SEB-Transformer, designed
specifically for predicting the battery range of SEBs. The scenario is
conceptualized as a dynamic heterogeneous graph that encapsulates the
interactions between users and bicycles, providing a comprehensive framework
for analysis. Furthermore, we incorporate the graph structure into the
SEB-Transformer to facilitate the estimation of the remaining e-bike battery
range, in conjunction with mean structural similarity, enhancing the prediction
accuracy. By employing the predictions made by our model, we are able to
dynamically adjust the optimal cycling routes for users in real-time, while
also considering the strategic locations of charging stations, thereby
optimizing the user experience. Empirically, our results on real-world datasets
demonstrate the superiority of our model against nine competitive baselines.
These innovations, powered by AIoT, not only bridge the gap between user
expectations and the physical limitations of battery range but also
significantly improve the operational efficiency and sustainability of SEB
services. Through these advancements, the shared electric bicycle ecosystem is
evolving, making strides towards a more reliable, user-friendly, and
sustainable mode of transportation.
[COMMENTS]
9 pages, 6 figures, accepted by IEEE ICWS 2024 The International
Conference on Web Services
[LINK]
http://arxiv.org/abs/2407.16115v1
[DATE]
2024-07-23 09:33:21+08:00
[CATEGORIES]
cs.LG
Reinforcement Learning Pair Trading: A Dynamic Scaling approach
[AUTHORS]
Hongshen Yang, Avinash Malik
[COMMENTS]
31 pages
[LINK]
http://arxiv.org/abs/2407.16103v1
[DATE]
2024-07-23 08:16:27+08:00
[CATEGORIES]
cs.LG
Universal Spectral Transfer with Physical Prior-Informed Deep Generative Learning
[AUTHORS]
Yanmin Zhu, Loza F. Tadesse
[ABSTRACT]
Spectroscopy is a powerful analytical technique for characterizing matter
across physical and biological realms. However, its fundamental principle
necessitates specialized instrumentation for each physical phenomenon probed,
limiting broad adoption and use in all relevant research. In this study, we
introduce SpectroGen, a novel physical prior-informed deep generative model for
generating relevant spectral signatures across modalities using experimentally
collected spectral input only from a single modality. We achieve this by
reimagining the representation of spectral data as mathematical constructs of
distributions instead of their traditional physical and molecular state
representations. Results on 319 standard mineral samples demonstrate
generation with 99% correlation and 0.01 root mean square error, with
resolution superior to that of experimentally acquired ground truth spectra. We
show transfer capability across Raman, Infrared, and X-ray Diffraction
modalities with Gaussian, Lorentzian, and Voigt distribution priors,
respectively. This approach, however, generalizes to any
spectral input that can be represented by a distribution prior, making it
universally applicable. We believe our work revolutionizes the application
sphere of spectroscopy, which has traditionally been limited by access to the
required sophisticated and often expensive equipment towards accelerating
material, pharmaceutical, and biological discoveries.
[LINK]
http://arxiv.org/abs/2407.16094v1
[DATE]
2024-07-23 07:31:10+08:00
[CATEGORIES]
cs.LG
Rapid Switching and Multi-Adapter Fusion via Sparse High Rank Adapters
[AUTHORS]
Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Rafael Esteves, Shreya Kadambi, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel
[ABSTRACT]
In this paper, we propose Sparse High Rank Adapters (SHiRA) that directly
finetune 1-2% of the base model weights while leaving the others unchanged,
resulting in a highly sparse adapter. This high sparsity incurs no inference
overhead, enables rapid switching directly in the fused mode, and significantly
reduces concept-loss during multi-adapter fusion. Our extensive experiments on
LVMs and LLMs demonstrate that finetuning merely 1-2% of the parameters in the base
model is sufficient for many adapter tasks and significantly outperforms Low
Rank Adaptation (LoRA). We also show that SHiRA is orthogonal to advanced LoRA
methods such as DoRA and can be easily combined with existing techniques.
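The sparse finetuning idea described above can be illustrated with a small NumPy sketch. It assumes a randomly chosen mask covering roughly 2% of the weights and uses random stand-in gradients; the mask choice, learning rate, and all function names are hypothetical and are not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W_base = rng.standard_normal((64, 64))      # stand-in for one base weight matrix
mask = rng.random(W_base.shape) < 0.02      # hypothetical random mask, about 2% trainable

def masked_sgd_step(W, grad, mask, lr=0.1):
    """Update only the masked entries; all other weights stay untouched."""
    return W - lr * grad * mask

W = W_base.copy()
for _ in range(5):
    grad = rng.standard_normal(W.shape)     # stand-in for a real loss gradient
    W = masked_sgd_step(W, grad, mask)

# The adapter is just the sparse set of changed values and their positions.
idx = np.nonzero(mask)
adapter_values = W[idx] - W_base[idx]

def apply_adapter(W_base, idx, values):
    """Fuse a sparse adapter into a copy of the base weights."""
    W_fused = W_base.copy()
    W_fused[idx] += values
    return W_fused

W_fused = apply_adapter(W_base, idx, adapter_values)
print(mask.mean(), np.allclose(W_fused, W))
```

Because the adapter touches only a few entries, switching adapters amounts to overwriting a small set of indices rather than re-fusing a dense low-rank product, which is the intuition behind the rapid switching claim.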
[COMMENTS]
Published at ICML 2024 Workshop on Foundation Models in the Wild.
arXiv admin note: substantial text overlap with arXiv:2406.13175
[LINK]
http://arxiv.org/abs/2407.16712v1
[DATE]
2024-07-23 06:46:36+08:00
[CATEGORIES]
cs.LG
Decentralized Personalized Federated Learning based on a Conditional Sparse-to-Sparser Scheme
[AUTHORS]
Qianyu Long, Qiyuan Wang, Christos Anagnostopoulos, Daning Bi
[ABSTRACT]
Decentralized Federated Learning (DFL) has become popular due to its
robustness and avoidance of centralized coordination. In this paradigm, clients
actively engage in training by exchanging models with their networked
neighbors. However, DFL introduces increased costs in terms of training and
communication. Existing methods focus on minimizing communication, often
overlooking training efficiency and data heterogeneity. To address this gap, we
propose a novel \textit{sparse-to-sparser} training scheme: DA-DPFL. DA-DPFL
initializes with a subset of model parameters, which is progressively reduced
during training via \textit{dynamic aggregation}, leading to substantial
energy savings while retaining adequate information during critical learning
periods.
Our experiments showcase that DA-DPFL substantially outperforms DFL baselines
in test accuracy, while achieving up to a $5$-fold reduction in energy costs. We
provide a theoretical analysis of DA-DPFL’s convergence, solidifying its
applicability in decentralized and personalized learning. The code is available
at: https://github.com/EricLoong/da-dpfl
[COMMENTS]
15 pages, 9 figures, 3 pages theory
[LINK]
http://arxiv.org/abs/2404.15943v3
[DATE]
2024-07-23 05:58:05+08:00
[CATEGORIES]
cs.LG
LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies
[AUTHORS]
Jia Shi, Gautam Gare, Jinjin Tian, Siqi Chai, Zhiqiu Lin, Arun Vasudevan, Di Feng, Francesco Ferroni, Shu Kong
[COMMENTS]
ICML 2024 Oral Presentation; Project Page:
https://elvishelvis.github.io/papers/lca/
[LINK]
http://arxiv.org/abs/2407.16067v1
[DATE]
2024-07-23 05:54:19+08:00
[CATEGORIES]
cs.LG
Artificial Intelligence-based Decision Support Systems for Precision and Digital Health
[AUTHORS]
Nina Deliu, Bibhas Chakraborty
[ABSTRACT]
Precision health, increasingly supported by digital technologies, is a domain
of research that broadens the paradigm of precision medicine, advancing
everyday healthcare. This vision goes hand in hand with the groundbreaking
advent of artificial intelligence (AI), which is reshaping the way we diagnose,
treat, and monitor both clinical subjects and the general population. AI tools
powered by machine learning have shown considerable improvements in a variety
of healthcare domains. In particular, reinforcement learning (RL) holds great
promise for sequential and dynamic problems such as dynamic treatment regimes
and just-in-time adaptive interventions in digital health. In this work, we
discuss the opportunity offered by AI, more specifically RL, to current trends
in healthcare, providing a methodological survey of RL methods in the context
of precision and digital health. Focusing on the area of adaptive
interventions, we expand the methodological survey with illustrative case
studies that used RL in real practice.
This invited article has undergone anonymous review and is intended as a book
chapter for the volume “Frontiers of Statistics and Data Science” edited by
Subhashis Ghoshal and Anindya Roy for the International Indian Statistical
Association Series on Statistics and Data Science, published by Springer. It
covers the material from a short course titled “Artificial Intelligence in
Precision and Digital Health” taught by the author Bibhas Chakraborty at the
IISA 2022 Conference, December 26-30 2022, at the Indian Institute of Science,
Bengaluru.
[COMMENTS]
arXiv admin note: substantial text overlap with arXiv:2203.02605
[LINK]
http://arxiv.org/abs/2407.16062v1
[DATE]
2024-07-23 05:39:34+08:00
[CATEGORIES]
cs.LG
Revisiting Score Function Estimators for $k$-Subset Sampling
[AUTHORS]
Klas Wijk, Ricardo Vinuesa, Hossein Azizpour
[ABSTRACT]
Are score function estimators an underestimated approach to learning with
$k$-subset sampling? Sampling $k$-subsets is a fundamental operation in many
machine learning tasks that is not amenable to differentiable parametrization,
impeding gradient-based optimization. Prior work has focused on relaxed
sampling or pathwise gradient estimators. Inspired by the success of score
function estimators in variational inference and reinforcement learning, we
revisit them within the context of $k$-subset sampling. Specifically, we
demonstrate how to efficiently compute the $k$-subset distribution’s score
function using a discrete Fourier transform, and reduce the estimator’s
variance with control variates. The resulting estimator provides both exact
samples and unbiased gradient estimates while also applying to
non-differentiable downstream models, unlike existing methods. Experiments in
feature selection show results competitive with current methods, despite weaker
assumptions.
[LINK]
http://arxiv.org/abs/2407.16058v1
[DATE]
2024-07-23 05:26:39+08:00
[CATEGORIES]
cs.LG
HIERVAR: A Hierarchical Feature Selection Method for Time Series Analysis
[AUTHORS]
Alireza Keshavarzian, Shahrokh Valaee
[ABSTRACT]
Time series classification stands as a pivotal and intricate challenge across
various domains, including finance, healthcare, and industrial systems. In
contemporary research, there has been a notable upsurge in exploring feature
extraction through random sampling. Unlike deep convolutional networks, these
methods sidestep elaborate training procedures, yet they often necessitate
generating a surplus of features to comprehensively encapsulate time series
nuances. Consequently, some features may lack relevance to labels or exhibit
multi-collinearity with others. In this paper, we propose a novel hierarchical
feature selection method aided by ANOVA variance analysis to address this
challenge. Through meticulous experimentation, we demonstrate that our method
reduces the number of features by over 94% while preserving accuracy – a
significant advancement in the field of time series analysis and feature
selection.
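For readers unfamiliar with ANOVA-based feature scoring, the sketch below computes the standard one-way ANOVA F-statistic per feature, which measures how well a feature separates the class labels. It is only the plain textbook statistic; the hierarchical procedure of the paper is not reproduced here, and the function name and toy data are assumptions.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F-statistic for each column of X, grouped by labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        ss_between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    # Ratio of between-class to within-class mean squares.
    return (ss_between / (n_classes - 1)) / (ss_within / (n_samples - n_classes))

# Toy data: feature 0 tracks the label, feature 1 is pure noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
features = np.column_stack([
    labels + 0.1 * rng.standard_normal(200),
    rng.standard_normal(200),
])
scores = anova_f_scores(features, labels)
print(scores)
```

Features with low F-scores carry little label information and are natural candidates for removal.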
[COMMENTS]
6 pages, 5 figures, IEEE Machine Learning and Signal processing 2024
[LINK]
http://arxiv.org/abs/2407.16048v1
[DATE]
2024-07-23 04:55:13+08:00
[CATEGORIES]
cs.LG
Transformer-based Capacity Prediction for Lithium-ion Batteries with Data Augmentation
[AUTHORS]
Gift Modekwe, Saif Al-Wahaibi, Qiugang Lu
[ABSTRACT]
Lithium-ion batteries are pivotal to technological advancements in
transportation, electronics, and clean energy storage. The optimal operation
and safety of these batteries require proper and reliable estimation of battery
capacities to monitor the state of health. Current methods for estimating the
capacities fail to adequately account for long-term temporal dependencies of
key variables (e.g., voltage, current, and temperature) associated with battery
aging and degradation. In this study, we explore the usage of transformer
networks to enhance the estimation of battery capacity. We develop a
transformer-based battery capacity prediction model that accounts for both
long-term and short-term patterns in battery data. Further, to tackle the data
scarcity issue, data augmentation is used to increase the data size, which
helps to improve the performance of the model. Our proposed method is validated
with benchmark datasets. Simulation results show the effectiveness of data
augmentation and the transformer network in improving the accuracy and
robustness of battery capacity prediction.
[LINK]
http://arxiv.org/abs/2407.16036v1
[DATE]
2024-07-23 04:21:40+08:00
[CATEGORIES]
cs.LG
Pavement Fatigue Crack Detection and Severity Classification Based on Convolutional Neural Network
[AUTHORS]
Zhen Wang, Dylan G. Ildefonzo, Linbing Wang
[ABSTRACT]
Due to the varying intensity of pavement cracks, the complexity of
topological structure, and the noise of texture background, image
classification for asphalt pavement cracking has proven to be a challenging
problem. Fatigue cracking, also known as alligator cracking, is one of the
common distresses of asphalt pavement. It is thus important to detect and
monitor the condition of alligator cracking on roadway pavements. Most research
in this area has typically focused on pixel-level detection of cracking using
limited datasets. A novel deep convolutional neural network that can achieve
two objectives is proposed. The first objective of the proposed neural network
is to classify the presence of fatigue cracking based on pavement surface images.
The second objective is to classify the fatigue cracking severity level based
on the Distress Identification Manual (DIM) standard. In this paper, a databank
of 4484 high-resolution pavement surface images is established in which images
are taken locally in the Town of Blacksburg, Virginia, USA. During data
preparation, over 4000 images are manually labeled into 4 categories
according to DIM standards. A four-layer convolutional neural network model is
then built to achieve the goal of classification of images by pavement crack
severity category. The trained model reached the highest accuracy among all
existing methods. After only 30 epochs of training, the model achieved a crack
existence classification accuracy of 96.23% and a severity level classification
accuracy of 96.74%. After 20 epochs of training, the model achieved a pavement
marking presence classification accuracy of 97.64%.
[COMMENTS]
10 pages, 14 figures, 3 tables
[LINK]
http://arxiv.org/abs/2407.16021v1
[DATE]
2024-07-23 03:56:03+08:00
[CATEGORIES]
cs.LG
Sharp Convergence Rates for Matching Pursuit
[AUTHORS]
Jason M. Klusowski, Jonathan W. Siegel
[ABSTRACT]
We study the fundamental limits of matching pursuit, or the pure greedy
algorithm, for approximating a target function $ f $ by a linear combination
$f_n$ of $n$ elements from a dictionary. When the target function is contained
in the variation space corresponding to the dictionary, many impressive works
over the past few decades have obtained upper and lower bounds on the error
$\|f-f_n\|$ of matching pursuit, but they do not match. The main contribution
of this paper is to close this gap and obtain a sharp characterization of the
decay rate, $n^{-\alpha}$, of matching pursuit. Specifically, we construct a
worst case dictionary which shows that the existing best upper bound cannot be
significantly improved. It turns out that, unlike other greedy algorithm
variants which converge at the optimal rate $ n^{-1/2}$, the convergence rate
$n^{-\alpha}$ is suboptimal. Here, $\alpha \approx 0.182$ is determined by the
solution to a certain non-linear equation.
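The pure greedy algorithm analyzed here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' worst-case construction: the dictionary is taken to be a finite list of unit-norm vectors, and the function name `matching_pursuit` is hypothetical.

```python
import math


def matching_pursuit(f, dictionary, n_steps):
    """Pure greedy algorithm over a finite dictionary of unit-norm atoms.

    Returns the approximation f_n and the residual norm after each step.
    Assumption: every atom in `dictionary` has Euclidean norm 1.
    """
    residual = list(f)
    approx = [0.0] * len(f)
    norms = []
    for _ in range(n_steps):
        # Select the atom most correlated with the current residual.
        best = max(
            dictionary,
            key=lambda g: abs(sum(r * x for r, x in zip(residual, g))),
        )
        coef = sum(r * x for r, x in zip(residual, best))
        # Add the projection onto that atom and subtract it from the residual.
        approx = [a + coef * x for a, x in zip(approx, best)]
        residual = [r - coef * x for r, x in zip(residual, best)]
        norms.append(math.sqrt(sum(r * r for r in residual)))
    return approx, norms


approx, norms = matching_pursuit([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]], 2)
```

With an orthonormal dictionary the residual vanishes after as many steps as there are atoms. The slow $n^{-\alpha}$ decay described in the abstract concerns carefully constructed non-orthogonal dictionaries, which this toy example does not reproduce.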
[LINK]
http://arxiv.org/abs/2307.07679v3
[DATE]
2024-07-23 03:54:38+08:00
[CATEGORIES]
cs.LG
Unsupervised anomaly localization in high-resolution breast scans using deep pluralistic image completion
[AUTHORS]
Nicholas Konz, Haoyu Dong, Maciej A. Mazurowski
[ABSTRACT]
Automated tumor detection in Digital Breast Tomosynthesis (DBT) is a
difficult task due to natural tumor rarity, breast tissue variability, and high
resolution. Given the scarcity of abnormal images and the abundance of normal
images for this problem, an anomaly detection/localization approach could be
well-suited. However, most anomaly localization research in machine learning
focuses on non-medical datasets, and we find that these methods fall short when
adapted to medical imaging datasets. The problem is alleviated when we solve
the task from the image completion perspective, in which the presence of
anomalies can be indicated by a discrepancy between the original appearance and
its auto-completion conditioned on the surroundings. However, there are often
many valid normal completions given the same surroundings, especially in the
DBT dataset, making this evaluation criterion less precise. To address such an
issue, we consider pluralistic image completion by exploring the distribution
of possible completions instead of generating fixed predictions. This is
achieved through our novel application of spatial dropout on the completion
network during inference time only, which requires no additional training cost
and is effective at generating diverse completions. We further propose minimum
completion distance (MCD), a new metric for detecting anomalies, thanks to
these stochastic completions. We provide theoretical as well as empirical
support for the superiority over existing methods of using the proposed method
for anomaly localization. On the DBT dataset, our model outperforms other
state-of-the-art methods by at least 10% AUROC for pixel-level detection.
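The abstract does not spell out how MCD is computed. Read literally, the name suggests taking the smallest distance between the original region and any of its sampled completions, so that a region counts as anomalous only if no plausible completion resembles it. The sketch below follows that reading; the Euclidean distance and the function names are assumptions, not the authors' definition.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two flattened image patches (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_completion_distance(original, completions, dist=l2_distance):
    """Anomaly score: distance from the original patch to its closest completion.

    `completions` stands in for the diverse outputs of a stochastic completion
    network; how they are generated is outside this sketch.
    """
    return min(dist(original, c) for c in completions)


score = minimum_completion_distance([1.0, 2.0], [[4.0, 6.0], [1.0, 5.0]])
```

Taking the minimum rather than the mean means that a single close completion is enough to mark a region as normal, which matches the abstract's point that many valid normal completions may exist for the same surroundings.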
[COMMENTS]
Accepted in Medical Image Analysis (2023). Our code is at
https://github.com/mazurowski-lab/picard
[LINK]
http://arxiv.org/abs/2305.03098v2
[DATE]
2024-07-23 03:41:11+08:00
[CATEGORIES]
cs.LG
AIDE: Antithetical, Intent-based, and Diverse Example-Based Explanations
[AUTHORS]
Ikhtiyor Nematov, Dimitris Sacharidis, Tomer Sagi, Katja Hose
[ABSTRACT]
For many use-cases, it is often important to explain the prediction of a
black-box model by identifying the most influential training data samples.
Existing approaches lack customization for user intent and often provide a
homogeneous set of explanation samples, failing to reveal the model’s reasoning
from different angles.
In this paper, we propose AIDE, an approach for providing antithetical (i.e.,
contrastive), intent-based, diverse explanations for opaque and complex models.
AIDE distinguishes three types of explainability intents: interpreting a
correct, investigating a wrong, and clarifying an ambiguous prediction. For
each intent, AIDE selects an appropriate set of influential training samples
that support or oppose the prediction either directly or by contrast. To
provide a succinct summary, AIDE uses diversity-aware sampling to avoid
redundancy and increase coverage of the training data.
We demonstrate the effectiveness of AIDE on image and text classification
tasks, in three ways: quantitatively, assessing correctness and continuity;
qualitatively, comparing anecdotal evidence from AIDE and other example-based
approaches; and via a user study, evaluating multiple aspects of AIDE. The
results show that AIDE addresses the limitations of existing methods and
exhibits desirable traits for an explainability method.
[LINK]
http://arxiv.org/abs/2407.16010v1
[DATE]
2024-07-23 03:33:12+08:00
[CATEGORIES]
cs.LG
Restarts subject to approximate sharpness: A parameter-free and optimal scheme for first-order methods
[AUTHORS]
Ben Adcock, Matthew J. Colbrook, Maksym Neyra-Nesterenko
[ABSTRACT]
Sharpness is an almost generic assumption in continuous optimization that
bounds the distance from minima by objective function suboptimality. It
facilitates the acceleration of first-order methods through restarts. However,
sharpness involves problem-specific constants that are typically unknown, and
restart schemes typically reduce convergence rates. Moreover, these schemes are
challenging to apply in the presence of noise or with approximate model classes
(e.g., in compressive imaging or learning problems), and they generally assume
that the first-order method used produces feasible iterates. We consider the
assumption of approximate sharpness, a generalization of sharpness that
incorporates an unknown constant perturbation to the objective function error.
This constant offers greater robustness (e.g., with respect to noise or
relaxation of model classes) for finding approximate minimizers. By employing a
new type of search over the unknown constants, we design a restart scheme that
applies to general first-order methods and does not require the first-order
method to produce feasible iterates. Our scheme maintains the same convergence
rate as when the constants are known. The convergence rates we achieve for
various first-order methods match the optimal rates or improve on previously
established rates for a wide range of problems. We showcase our restart scheme
in several examples and highlight potential future applications and
developments of our framework and theory.
[COMMENTS]
Version accepted in Foundations of Computational Mathematics
[LINK]
http://arxiv.org/abs/2301.02268v2
[DATE]
2024-07-23 03:29:18+08:00
[CATEGORIES]
cs.LG
A Semi-Supervised Approach for Power System Event Identification
[AUTHORS]
Nima Taghipourbazargani, Lalitha Sankar, Oliver Kosut
[ABSTRACT]
Event identification is increasingly recognized as crucial for enhancing the
reliability, security, and stability of the electric power system. With the
growing deployment of Phasor Measurement Units (PMUs) and advancements in data
science, there are promising opportunities to explore data-driven event
identification via machine learning classification techniques. However,
obtaining accurately-labeled eventful PMU data samples remains challenging due
to its labor-intensive nature and uncertainty about the event type (class) in
real-time. Thus, it is natural to use semi-supervised learning techniques,
which make use of both labeled and unlabeled samples. We propose a novel
semi-supervised framework to assess the effectiveness of incorporating
unlabeled eventful samples to enhance existing event identification
methodologies. We evaluate three categories of classical semi-supervised
approaches: (i) self-training, (ii) transductive support vector machines
(TSVM), and (iii) graph-based label spreading (LS) method. Our approach
characterizes events using physically interpretable features extracted from
modal analysis of synthetic eventful PMU data. In particular, we focus on the
identification of four event classes whose identification is crucial for grid
operations. We have developed and publicly shared a comprehensive Event
Identification package which consists of three aspects: data generation,
feature extraction, and event identification with limited labels using
semi-supervised methodologies. Using this package, we generate and evaluate
eventful PMU data for the South Carolina synthetic network. Our evaluation
consistently demonstrates that graph-based LS outperforms the other two
semi-supervised methods that we consider, and can noticeably improve event
identification performance relative to the setting with only a small number of
labeled samples.
[LINK]
http://arxiv.org/abs/2309.10095v2
[DATE]
2024-07-23 03:01:37+08:00
[CATEGORIES]
cs.LG
LiNR: Model Based Neural Retrieval on GPUs at LinkedIn
[AUTHORS]
Fedor Borisyuk, Qingquan Song, Mingzhou Zhou, Ganesh Parameswaran, Madhu Arun, Siva Popuri, Tugrul Bingol, Zhuotao Pei, Kuang-Hsuan Lee, Lu Zheng, Qizhan Shao, Ali Naqvi, Sen Zhou, Aman Gupta
[ABSTRACT]
This paper introduces LiNR, LinkedIn’s large-scale, GPU-based retrieval
system. LiNR supports a billion-sized index on GPU models. We discuss our
experiences and challenges in creating scalable, differentiable search indexes
using TensorFlow and PyTorch at production scale. In LiNR, both items and model
weights are integrated into the model binary. Viewing index construction as a
form of model training, we describe scaling our system for large indexes,
incorporating full scans and efficient filtering. A key focus is on enabling
attribute-based pre-filtering for exhaustive GPU searches, addressing the
common challenge of post-filtering in KNN searches that often reduces system
quality. We further provide multi-embedding retrieval algorithms and strategies
for tackling cold start issues in retrieval. Our advancements in supporting
larger indexes through quantization are also discussed. We believe LiNR
represents one of the industry’s first live-updated model-based retrieval
indexes. Applied to out-of-network post recommendations on LinkedIn Feed, LiNR
has contributed to a 3% relative increase in professional daily active users.
We envisage LiNR as a step towards integrating retrieval and ranking into a
single GPU model, simplifying complex infrastructures and enabling end-to-end
optimization of the entire differentiable infrastructure through gradient
descent.
[LINK]
http://arxiv.org/abs/2407.13218v2
[DATE]
2024-07-23 02:33:25+08:00
[CATEGORIES]
cs.LG
No Dimensional Sampling Coresets for Classification
[AUTHORS]
Meysam Alishahi, Jeff M. Phillips
[ABSTRACT]
We refine and generalize what is known about coresets for classification
problems via the sensitivity sampling framework. Such coresets seek the
smallest possible subsets of input data, so one can optimize a loss function on
the coreset and ensure approximation guarantees with respect to the original
data. Our analysis provides the first no dimensional coresets, so the size does
not depend on the dimension. Moreover, our results are general, apply for
distributional input and can use iid samples, so provide sample complexity
bounds, and work for a variety of loss functions. A key tool we develop is a
Rademacher complexity version of the main sensitivity sampling approach, which
can be of independent interest.
[COMMENTS]
42 Pages
[LINK]
http://arxiv.org/abs/2402.05280v2
[DATE]
2024-07-23 02:12:31+08:00
[CATEGORIES]
cs.LG
Multicell-Fold: geometric learning in folding multicellular life
[AUTHORS]
Haiqian Yang, Anh Q. Nguyen, Dapeng Bi, Markus J. Buehler, Ming Guo
[ABSTRACT]
During developmental processes such as embryogenesis, how a group of cells
folds into specific structures is a central question in biology that defines
how living organisms form. Establishing tissue-level morphology critically
relies on how every single cell decides to position itself relative to its
neighboring cells. Despite its importance, it remains a major challenge to
understand and predict the behavior of every cell within the living tissue over
time during such intricate processes. To tackle this question, we propose a
geometric deep learning model that can predict multicellular folding and
embryogenesis, accurately capturing the highly convoluted spatial interactions
among cells. We demonstrate that multicellular data can be represented with
both granular and foam-like physical pictures through a unified graph data
structure, considering both cellular interactions and cell junction networks.
We successfully use our model to achieve two important tasks: interpretable 4-D
morphological sequence alignment and prediction of local cell rearrangements
before they occur at single-cell resolution. Furthermore, using an activation
map and ablation studies, we demonstrate that cell geometries and cell junction
networks together regulate local cell rearrangement which is critical for
embryo morphogenesis. This approach provides a novel paradigm to study
morphogenesis, highlighting a unified data structure and harnessing the power
of geometric deep learning to accurately model the mechanisms and behaviors of
cells during development. It offers a pathway toward creating a unified dynamic
morphological atlas for a variety of developmental processes such as
embryogenesis.
[LINK]
http://arxiv.org/abs/2407.07055v2
[DATE]
2024-07-23 01:59:15+08:00
[CATEGORIES]
cs.LG
HandDGP: Camera-Space Hand Mesh Prediction with Differentiable Global Positioning
[AUTHORS]
Eugene Valassakis, Guillermo Garcia-Hernando
[ABSTRACT]
Predicting camera-space hand meshes from single RGB images is crucial for
enabling realistic hand interactions in 3D virtual and augmented worlds.
Previous work typically divided the task into two stages: given a cropped image
of the hand, predict meshes in relative coordinates, followed by lifting these
predictions into camera space in a separate and independent stage, often
resulting in the loss of valuable contextual and scale information. To prevent
the loss of these cues, we propose unifying these two stages into an end-to-end
solution that addresses the 2D-3D correspondence problem. This solution enables
back-propagation from camera space outputs to the rest of the network through a
new differentiable global positioning module. We also introduce an image
rectification step that harmonizes both the training dataset and the input
image as if they were acquired with the same camera, helping to alleviate the
inherent scale-depth ambiguity of the problem. We validate the effectiveness of
our framework in evaluations against several baselines and state-of-the-art
approaches across three public benchmarks.
[COMMENTS]
To be presented at ECCV 2024
[LINK]
http://arxiv.org/abs/2407.15844v1
[DATE]
2024-07-23 01:59:01+08:00
[CATEGORIES]
cs.LG
Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers
[AUTHORS]
Jonas Ngnawé, Sabyasachi Sahoo, Yann Pequignot, Frédéric Precioso, Christian Gagné
[ABSTRACT]
Despite extensive research on adversarial training strategies to improve
robustness, the decisions of even the most robust deep learning models can
still be quite sensitive to imperceptible perturbations, creating serious risks
when deploying them for high-stakes real-world applications. While detecting
such cases may be critical, evaluating a model’s vulnerability at a
per-instance level using adversarial attacks is computationally too intensive
and unsuitable for real-time deployment scenarios. The input space margin is
the exact score to detect non-robust samples and is intractable for deep neural
networks. This paper introduces the concept of margin consistency – a property
that links the input space margins and the logit margins in robust models –
for efficient detection of vulnerable samples. First, we establish that margin
consistency is a necessary and sufficient condition to use a model’s logit
margin as a score for identifying non-robust samples. Next, through
comprehensive empirical analysis of various robustly trained models on CIFAR10
and CIFAR100 datasets, we show that they indicate strong margin consistency
with a strong correlation between their input space margins and the logit
margins. Then, we show that we can effectively use the logit margin to
confidently detect brittle decisions with such models and accurately estimate
robust accuracy on an arbitrarily large test set by estimating the input
margins only on a small subset. Finally, we address cases where the model is
not sufficiently margin-consistent by learning a pseudo-margin from the feature
representation. Our findings highlight the potential of leveraging deep
representations to efficiently assess adversarial vulnerability in deployment
scenarios.
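The logit margin used as a vulnerability score is cheap to compute from a single forward pass. The sketch below assumes the common definition (top logit minus runner-up logit). The threshold and the function names are hypothetical and would need to be calibrated against input-space margins on a small subset, as the abstract describes.

```python
def logit_margin(logits):
    """Gap between the largest and second-largest logit of one sample."""
    top, second = sorted(logits, reverse=True)[:2]
    return top - second


def flag_brittle(batch_logits, threshold):
    """Mark samples whose logit margin falls below a calibrated threshold.

    `threshold` is a hypothetical tuning parameter, not a value from the paper.
    """
    return [logit_margin(logits) < threshold for logits in batch_logits]


flags = flag_brittle([[2.0, 5.0, 4.5], [0.0, 9.0, 1.0]], threshold=1.0)
```

The score is only meaningful for models that are margin-consistent in the sense defined in the abstract. For other models the abstract proposes learning a pseudo-margin instead.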
[COMMENTS]
11 pages, 7 figures, 2 tables, 1 algorithm. Version Update: Figure 6
[LINK]
http://arxiv.org/abs/2406.18451v2
[DATE]
2024-07-23 01:52:19+08:00
[CATEGORIES]
cs.LG
Uncertainty Quantification and Propagation in Surrogate-based Bayesian Inference
[AUTHORS]
Philipp Reiser, Javier Enrique Aguilar, Anneli Guthke, Paul-Christian Bürkner
[ABSTRACT]
Surrogate models are statistical or conceptual approximations for more
complex simulation models. In this context, it is crucial to propagate the
uncertainty induced by limited simulation budget and surrogate approximation
error to predictions, inference, and subsequent decision-relevant quantities.
However, quantifying and then propagating the uncertainty of surrogates is
usually limited to special analytic cases or is otherwise computationally very
expensive. In this paper, we propose a framework enabling a scalable, Bayesian
approach to surrogate modeling with thorough uncertainty quantification,
propagation, and validation. Specifically, we present three methods for
Bayesian inference with surrogate models given measurement data. This is a task
where the propagation of surrogate uncertainty is especially relevant, because
failing to account for it may lead to biased and/or overconfident estimates of
the parameters of interest. We showcase our approach in three detailed case
studies for linear and nonlinear real-world modeling scenarios. Uncertainty
propagation in surrogate models enables more reliable and safe approximation of
expensive simulators and will therefore be useful in various fields of
applications.
[LINK]
http://arxiv.org/abs/2312.05153v2
[DATE]
2024-07-23 01:37:44+08:00
[CATEGORIES]
cs.LG
Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget
[AUTHORS]
Vikash Sehwag, Xianghao Kong, Jingtao Li, Michael Spranger, Lingjuan Lyu
[ABSTRACT]
As scaling laws in generative AI push performance, they also simultaneously
concentrate the development of these models among actors with large
computational resources. With a focus on text-to-image (T2I) generative models,
we aim to address this bottleneck by demonstrating very low-cost training of
large-scale T2I diffusion transformer models. As the computational cost of
transformers increases with the number of patches in each image, we propose to
randomly mask up to 75% of the image patches during training. We propose a
deferred masking strategy that preprocesses all patches using a patch-mixer
before masking, thus significantly reducing the performance degradation with
masking, making it superior to model downscaling in reducing computational
cost. We also incorporate the latest improvements in transformer architecture,
such as the use of mixture-of-experts layers, to improve performance and
further identify the critical benefit of using synthetic images in micro-budget
training. Finally, using only 37M publicly available real and synthetic images,
we train a 1.16 billion parameter sparse transformer with only $1,890
economical cost and achieve a 12.7 FID in zero-shot generation on the COCO
dataset. Notably, our model achieves competitive FID and high-quality
generations while incurring 118$\times$ lower cost than stable diffusion models
and 14$\times$ lower cost than the current state-of-the-art approach that costs
$28,400. We aim to release our end-to-end training pipeline to further
democratize the training of large-scale diffusion models on micro-budgets.
[COMMENTS]
41 pages, 28 figures, 5 tables
[LINK]
http://arxiv.org/abs/2407.15811v1
[DATE]
2024-07-23 01:23:28+08:00
[CATEGORIES]
cs.LG
DropKAN: Regularizing KANs by masking post-activations
[AUTHORS]
Mohammed Ghaith Altarabichi
[ABSTRACT]
We propose DropKAN (Dropout Kolmogorov-Arnold Networks), a regularization
method that prevents co-adaptation of activation function weights in
Kolmogorov-Arnold Networks (KANs). DropKAN operates by randomly masking some of
the post-activations within the KAN computation graph while scaling up the
retained post-activations. We show that this simple procedure, which requires
minimal coding effort, has a regularizing effect and consistently leads to
better generalization of KANs.
We analyze the adaptation of standard Dropout to KANs and demonstrate
that Dropout applied to KANs’ neurons can lead to unpredictable behaviour in
the feedforward pass. We carry out an empirical study with real-world machine
learning datasets to validate our findings. Our results suggest that DropKAN is
consistently a better alternative to using standard Dropout with KANs, and
improves the generalization performance of KANs. Our implementation of DropKAN
is available at: https://github.com/Ghaith81/dropkan.
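The masking step can be illustrated with a small sketch. It is not taken from the linked repository: it assumes that each output of a KAN layer is the sum of its post-activations phi_ij(x_i), and that retained post-activations are scaled by 1/(1 - drop_rate) as in inverted dropout. The abstract states only that they are scaled up, so the exact factor is an assumption, as is the function name.

```python
import random


def dropkan_layer(post_acts, drop_rate, rng, training=True):
    """Sum post-activations per output node, masking some during training.

    post_acts[j][i] holds phi_ij(x_i), the post-activation feeding output j.
    Assumption: kept entries are rescaled by 1 / (1 - drop_rate).
    """
    if not training or drop_rate == 0.0:
        return [sum(row) for row in post_acts]
    keep = 1.0 - drop_rate
    outputs = []
    for row in post_acts:
        # Mask each post-activation independently, then rescale the survivors.
        kept = [v / keep if rng.random() < keep else 0.0 for v in row]
        outputs.append(sum(kept))
    return outputs


eval_out = dropkan_layer([[1.0, 2.0], [3.0, 4.0]], 0.5, random.Random(0), training=False)
```

Masking individual post-activations, rather than whole neurons, keeps the expected value of each sum unchanged under this scaling, which is one plausible reading of why the abstract contrasts DropKAN with standard Dropout applied to KAN neurons.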
[LINK]
http://arxiv.org/abs/2407.13044v2
[DATE]
2024-07-23 01:12:39+08:00
[CATEGORIES]
cs.LG
Development of Multistage Machine Learning Classifier using Decision Trees and Boosting Algorithms over Darknet Network Traffic
[AUTHORS]
Anjali Sureshkumar Nair, Dr. Prashant Nitnaware
[ABSTRACT]
In recent years, the clandestine nature of darknet activities has presented
an escalating challenge to cybersecurity efforts, necessitating sophisticated
methods for the detection and classification of network traffic associated with
these covert operations. The system addresses the significant challenge of
class imbalance within Darknet traffic datasets, where malicious traffic
constitutes a minority, hindering effective discrimination between normal and
malicious behavior. By leveraging boosting algorithms like AdaBoost and
Gradient Boosting coupled with decision trees, this study proposes a robust
solution for network traffic classification. As ensemble learning methods,
boosting algorithms correct errors iteratively and assign higher weights to
minority class instances, complemented by the hierarchical structure of
decision trees. Feature selection is additionally employed as a preprocessing
step, using Information Gain, Fisher’s Score, and the Chi-Square test. Rigorous
experimentation with diverse Darknet traffic
datasets validates the efficacy of the proposed multistage classifier,
evaluated through various performance metrics such as accuracy, precision,
recall, and F1-score, offering a comprehensive solution for accurate detection
and classification of Darknet activities.
[COMMENTS]
6 pages, 5 figures
[LINK]
http://arxiv.org/abs/2407.15910v1
[DATE]
2024-07-23 01:10:26+08:00
[CATEGORIES]
cs.LG
CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning
[AUTHORS]
Emanuele Frascaroli, Aniello Panariello, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara
[ABSTRACT]
With the emergence of Transformers and Vision-Language Models (VLMs) such as
CLIP, large pre-trained models have become a common strategy to enhance
performance in Continual Learning scenarios. This led to the development of
numerous prompting strategies to effectively fine-tune transformer-based models
without succumbing to catastrophic forgetting. However, these methods struggle
to specialize the model on domains that deviate significantly from the
pre-training while preserving its zero-shot capabilities. In this work, we
propose Continual Generative training for Incremental prompt-Learning, a novel
approach to mitigate forgetting while adapting a VLM, which exploits generative
replay to align prompts to tasks. We also introduce a new metric to evaluate
zero-shot capabilities within CL benchmarks. Through extensive experiments on
different domains, we demonstrate the effectiveness of our framework in
adapting to new tasks while improving zero-shot capabilities. Further analysis
reveals that our approach can bridge the gap with joint prompt tuning. The
codebase is available at https://github.com/aimagelab/mammoth.
[COMMENTS]
15 pages, 1 figure. Accepted at the 35th British Machine Vision
Conference 2024 (BMVC 2024), Glasgow, UK
[LINK]
http://arxiv.org/abs/2407.15793v1
[DATE]
2024-07-23 00:51:28+08:00
[CATEGORIES]
cs.LG
Robust Mixture Learning when Outliers Overwhelm Small Groups
[AUTHORS]
Daniil Dmitriev, Rares-Darius Buhai, Stefan Tiegel, Alexander Wolters, Gleb Novikov, Amartya Sanyal, David Steurer, Fanny Yang
[ABSTRACT]
We study the problem of estimating the means of well-separated mixtures when
an adversary may add arbitrary outliers. While strong guarantees are available
when the outlier fraction is significantly smaller than the minimum mixing
weight, much less is known when outliers may crowd out low-weight clusters - a
setting we refer to as list-decodable mixture learning (LD-ML). In this case,
adversarial outliers can simulate additional spurious mixture components.
Hence, if all means of the mixture must be recovered up to a small error in the
output list, the list size needs to be larger than the number of (true)
components. We propose an algorithm that obtains order-optimal error guarantees
for each mixture mean with a minimal list-size overhead, significantly
improving upon list-decodable mean estimation, the only existing method that is
applicable for LD-ML. Although improvements are observed even when the mixture
is non-separated, our algorithm achieves particularly strong guarantees when
the mixture is separated: it can leverage the mixture structure to partially
cluster the samples before carefully iterating a base learner for
list-decodable mean estimation at different scales.
[LINK]
http://arxiv.org/abs/2407.15792v1
[DATE]
2024-07-23 00:51:05+08:00
[CATEGORIES]
cs.LG
Uncertainty-aware transfer across tasks using hybrid model-based successor feature reinforcement learning
[AUTHORS]
Parvin Malekzadeh, Ming Hou, Konstantinos N. Plataniotis
[ABSTRACT]
Sample efficiency is central to developing practical reinforcement learning
(RL) for complex and large-scale decision-making problems. The ability to
transfer and generalize knowledge gained from previous experiences to
downstream tasks can significantly improve sample efficiency. Recent research
indicates that successor feature (SF) RL algorithms enable knowledge
generalization between tasks with different rewards but identical transition
dynamics. It has recently been hypothesized that combining model-based (MB)
methods with SF algorithms can alleviate the limitation of fixed transition
dynamics. Furthermore, uncertainty-aware exploration is widely recognized as
another appealing approach for improving sample efficiency. Combining the two
ideas of hybrid model-based successor features (MB-SF) and uncertainty leads to
an approach for sample-efficient, uncertainty-aware knowledge transfer across
tasks with different transition dynamics and/or reward functions. In this
paper, the uncertainty of the value of each action is
approximated by a Kalman filter (KF)-based multiple-model adaptive estimation.
This KF-based framework treats the parameters of a model as random variables.
To the best of our knowledge, this is the first attempt at formulating a hybrid
MB-SF algorithm capable of generalizing knowledge across large or continuous
state space tasks with various transition dynamics while requiring less
computation at decision time than MB methods. The number of samples required to
learn the tasks was compared to recent SF and MB baselines. The results show
that our algorithm generalizes its knowledge across different transition
dynamics, learns downstream tasks with significantly fewer samples than
starting from scratch, and outperforms existing approaches.
[COMMENTS]
40 pages
[LINK]
http://arxiv.org/abs/2310.10818v3
[DATE]
2024-07-23 00:47:09+08:00
[CATEGORIES]
cs.LG
Under-confidence Backdoors Are Resilient and Stealthy Backdoors
[AUTHORS]
Minlong Peng, Zidi Xiong, Quang H. Nguyen, Mingming Sun, Khoa D. Doan, Ping Li
[ABSTRACT]
By injecting a small number of poisoned samples into the training set,
backdoor attacks aim to make the victim model produce designed outputs on any
input injected with pre-designed backdoors. In order to achieve a high attack
success rate using as few poisoned training samples as possible, most existing
attack methods change the labels of the poisoned samples to the target class.
This practice often results in severe over-fitting of the victim model to the
backdoors, making the attack quite effective in output control but easier to
identify by human inspection or automatic defense algorithms.
In this work, we propose a label-smoothing strategy to overcome the
over-fitting problem of these attack methods, obtaining a Label-Smoothed
Backdoor Attack (LSBA). In the LSBA, the label of the poisoned sample $\bm{x}$
is changed to the target class with a probability of $p_n(\bm{x})$ instead of
100%, and the value of $p_n(\bm{x})$ is specifically designed to make the
prediction probability of the target class only slightly greater than those of
the other classes. Empirical studies on several existing backdoor attacks show
that our strategy can considerably improve the stealthiness of these attacks
and, at the same time, achieve a high attack success rate. In addition, our
strategy makes it possible to manually control the prediction probability of
the designed output by manipulating the number of applied and activated LSBAs.
Source code will be published at
https://github.com/v-mipeng/LabelSmoothedAttack.git.
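The relabeling rule can be sketched as follows. This is an illustration of the stochastic label flip only. The abstract does not give the formula for $p_n(\bm{x})$, so the per-sample probabilities are passed in as a plain list, and the function name is hypothetical.

```python
import random


def smooth_poison_labels(labels, probs, target, rng):
    """Flip each poisoned sample's label to `target` with its own probability.

    `probs[i]` plays the role of p_n(x_i); how it is derived is not covered here.
    A sample that is not flipped keeps its original label.
    """
    return [target if rng.random() < p else y for y, p in zip(labels, probs)]


new_labels = smooth_poison_labels([0, 1, 2], [1.0, 0.0, 1.0], target=9, rng=random.Random(0))
```

Setting every probability to 1 recovers the conventional attack that always relabels poisoned samples to the target class.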
[COMMENTS]
Backdoor Attack
[LINK]
http://arxiv.org/abs/2202.11203v2
[DATE]
2024-07-23 00:45:24+08:00
[CATEGORIES]
cs.LG
Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning
[AUTHORS]
Hongseok Namkoong, Samuel Daulton, Eytan Bakshy
[ABSTRACT]
Thompson sampling (TS) has emerged as a robust technique for contextual
bandit problems. However, TS requires posterior inference and optimization for
action generation, prohibiting its use in many online platforms where latency
and ease of deployment are of concern. We operationalize TS by proposing a
novel imitation-learning-based algorithm that distills a TS policy into an
explicit policy representation, allowing fast decision-making and easy
deployment in mobile and server-based environments. Using batched data
collected under the imitation policy, our algorithm iteratively performs
offline updates to the TS policy, and learns a new explicit policy
representation to imitate it. Empirically, our imitation policy achieves
performance comparable to batch TS while allowing more than an order of
magnitude reduction in decision-time latency. Buoyed by low latency and
simplicity of implementation, our algorithm has been successfully deployed in
multiple video upload systems for Meta. Using a randomized controlled trial, we
show our algorithm resulted in significant improvements in video quality and
watch time.
[LINK]
http://arxiv.org/abs/2011.14266v3
[DATE]
2024-07-23 00:30:05+08:00
[CATEGORIES]
cs.LG
In Search of Quantum Advantage: Estimating the Number of Shots in Quantum Kernel Methods
[AUTHORS]
Artur Miroszewski, Marco Fellous Asiani, Jakub Mielczarek, Bertrand Le Saux, Jakub Nalepa
[ABSTRACT]
Quantum Machine Learning (QML) has gathered significant attention through
approaches like Quantum Kernel Machines. While these methods hold considerable
promise, their quantum nature presents inherent challenges. One major challenge
is the limited resolution of estimated kernel values caused by the finite
number of circuit runs performed on a quantum device. In this study, we propose
a comprehensive system of rules and heuristics for estimating the required
number of circuit runs in quantum kernel methods. We introduce two critical
effects that necessitate an increased measurement precision through additional
circuit runs: the spread effect and the concentration effect. The effects are
analyzed in the context of fidelity and projected quantum kernels. To address
these phenomena, we develop an approach for estimating the desired precision of
kernel values, which, in turn, is translated into the number of circuit runs.
Our methodology is validated through extensive numerical simulations, focusing
on the problem of exponential value concentration. We stress that quantum
kernel methods should not only be considered from the machine learning
performance perspective, but also in terms of resource consumption.
The results provide insights into the possible benefits of quantum kernel
methods, offering guidance for their application in quantum machine learning
tasks.
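To make the link between precision and circuit runs concrete, here is a minimal sketch using the generic Hoeffding bound. It is not the rule set proposed in the paper: it assumes the kernel value is estimated as the empirical mean of independent measurement outcomes bounded in [0, 1], and the function name `shots_hoeffding` and the parameters `epsilon` (target precision) and `delta` (failure probability) are introduced here for illustration only.

```python
import math


def shots_hoeffding(epsilon, delta):
    """Number of shots N so that |estimate - true value| <= epsilon holds
    with probability at least 1 - delta, by Hoeffding's inequality:
        N >= ln(2 / delta) / (2 * epsilon**2)
    Assumes i.i.d. outcomes bounded in [0, 1].
    """
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # If kernel values concentrate on a scale that shrinks like 2**-n
    # (a hypothetical scaling), the required precision shrinks with it
    # and the shot count grows roughly like 4**n.
    for n in (2, 4, 6, 8):
        print(n, shots_hoeffding(epsilon=0.5 ** n, delta=0.05))
```

Because the bound scales as 1/epsilon^2, halving the precision target quadruples the number of circuit runs, which is why value concentration translates directly into resource cost.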
[COMMENTS]
18 + 13 pages, 8 figures. This manuscript is a first release that
will be improved in future versions. We wanted to provide this preview now as
we recently became aware of extensive modifications in arXiv:2208.11060
[LINK]
http://arxiv.org/abs/2407.15776v1
[DATE]
2024-07-23 00:29:35+08:00
[CATEGORIES]
cs.LG
STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay
[AUTHORS]
Yongcan Yu, Lijun Sheng, Ran He, Jian Liang
[ABSTRACT]
Test-time adaptation (TTA) aims to address the distribution shift between the
training and test data with only unlabeled data at test time. Existing TTA
methods often focus on improving recognition performance specifically for test
data associated with classes in the training set. However, during the
open-world inference process, there are inevitably test data instances from
unknown classes, commonly referred to as outliers. This paper focuses on
the problem of performing both sample recognition and outlier rejection during
inference when outliers exist. To address this problem, we propose a new
approach called STAble Memory rePlay (STAMP), which performs optimization over
a stable memory bank instead of the risky mini-batch. In particular, the memory
bank is dynamically updated by selecting low-entropy and label-consistent
samples in a class-balanced manner. In addition, we develop a self-weighted
entropy minimization strategy that assigns higher weight to low-entropy
samples. Extensive results demonstrate that STAMP outperforms existing TTA
methods in terms of both recognition and outlier detection performance. The
code is released at https://github.com/yuyongcan/STAMP.
[COMMENTS]
Accepted by ECCV 2024
[LINK]
http://arxiv.org/abs/2407.15773v1
[DATE]
2024-07-23 00:25:41+08:00
[CATEGORIES]
cs.LG
Beyond Memorization: The Challenge of Random Memory Access in Language Models
[AUTHORS]
Tongyao Zhu, Qian Liu, Liang Pang, Zhengbao Jiang, Min-Yen Kan, Min Lin
[COMMENTS]
9 pages, 4 figures; accepted by ACL 2024 (oral)
[LINK]
http://arxiv.org/abs/2403.07805v3
[DATE]
2024-07-22 23:29:00+08:00
[CATEGORIES]
cs.CL
DStruct2Design: Data and Benchmarks for Data Structure Driven Generative Floor Plan Design
[AUTHORS]
Zhi Hao Luo, Luis Lara, Ge Ya Luo, Florian Golemo, Christopher Beckham, Christopher Pal
[ABSTRACT]
Text conditioned generative models for images have yielded impressive
results. Text conditioned floorplan generation, as a special type of raster
image generation task, has also received particular attention. However, there are
many use cases in floorplan generation where numerical properties of the
generated result are more important than the aesthetics. For instance, one
might want to specify sizes for certain rooms in a floorplan and compare the
generated floorplan with given specifications. Current approaches, datasets and
commonly used evaluations do not support these kinds of constraints. As such,
an attractive strategy is to generate an intermediate data structure that
contains numerical properties of a floorplan which can be used to generate the
final floorplan image. To explore this setting we (1) construct a new dataset
for this data-structure to data-structure formulation of floorplan generation
using two popular image based floorplan datasets RPLAN and ProcTHOR-10k, and
provide the tools to convert further procedurally generated ProcTHOR floorplan
data into our format. (2) We explore the task of floorplan generation given a
partial or complete set of constraints and we design a series of metrics and
benchmarks to enable evaluating how well samples generated from models respect
the constraints. (3) We create multiple baselines by finetuning a large
language model (LLM), Llama3, and demonstrate the feasibility of using
floorplan data structure conditioned LLMs for the problem of floorplan
generation respecting numerical constraints. We hope that our new datasets and
benchmarks will encourage further research on different ways to improve the
performance of LLMs and other generative modelling techniques for generating
designs where quantitative constraints are only partially specified, but must
be respected.
[LINK]
http://arxiv.org/abs/2407.15723v1
[DATE]
2024-07-22 23:27:55+08:00
[CATEGORIES]
cs.CL
Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability
[AUTHORS]
Zhuoyan Xu, Zhenmei Shi, Yingyu Liang
[ABSTRACT]
Large language models (LLMs) have emerged as powerful tools for many AI
problems and exhibit remarkable in-context learning (ICL) capabilities.
Compositional ability, solving unseen complex tasks that combine two or more
simple tasks, is an essential reasoning ability for Artificial General
Intelligence. Despite LLMs’ tremendous success, how they approach composite
tasks, especially those not encountered during the pretraining phase, remains
an open question and is not well understood. In this study, we delve into the ICL
capabilities of LLMs on composite tasks, with only simple tasks as in-context
examples. We develop a test suite of composite tasks that include linguistic
and logical challenges and perform empirical studies across different LLM
families. We observe that models exhibit divergent behaviors: (1) For simpler
composite tasks that apply distinct mapping mechanisms to different input
segments, the models demonstrate decent compositional ability, while scaling up
the model enhances this ability; (2) for more complex composite tasks that
involve multiple reasoning steps, where each step represents one task, models
typically underperform, and scaling up generally provides no improvement. We
offer theoretical analysis in a simplified setting, explaining that models
exhibit compositional capability when the task handles different input parts
separately. We believe our work sheds new light on the capabilities of LLMs in
solving composite tasks regarding the nature of the tasks and model scale. Our
dataset and code are available at
https://github.com/OliverXUZY/LLM_Compose.
[LINK]
http://arxiv.org/abs/2407.15720v1
[DATE]
2024-07-22 23:22:34+08:00
[CATEGORIES]
cs.CL
cs.LG
Supporting the Digital Autonomy of Elders Through LLM Assistance
[AUTHORS]
Jesse Roberts, Lindsey Roberts, Alice Reed
[ABSTRACT]
The internet offers tremendous access to services, social connections, and
needed products. However, to those without sufficient experience, engaging with
businesses and friends across the internet can be daunting due to the
ever-present danger of scammers and thieves, to say nothing of the myriad of
potential computer viruses. Like a forest rich with both edible and poisonous
plants, those familiar with the norms inhabit it safely with ease while
newcomers need a guide. However, reliance on a human digital guide can be
taxing and often impractical. We propose and pilot a simple but unexplored
idea: could an LLM provide the necessary support to help the elderly who are
separated by the digital divide safely achieve digital autonomy?
[LINK]
http://arxiv.org/abs/2407.15695v1
[DATE]
2024-07-22 23:01:45+08:00
[CATEGORIES]
cs.CL
Counter Turing Test ($CT^2$): Investigating AI-Generated Text Detection for Hindi – Ranking LLMs based on Hindi AI Detectability Index ($ADI_{hi}$)
[AUTHORS]
Ishan Kavathekar, Anku Rani, Ashmit Chamoli, Ponnurangam Kumaraguru, Amit Sheth, Amitava Das
[LINK]
http://arxiv.org/abs/2407.15694v1
[DATE]
2024-07-22 23:00:23+08:00
[CATEGORIES]
cs.CL
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
[AUTHORS]
Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter
[ABSTRACT]
Code provides a general syntactic structure to build complex programs and
perform precise computations when paired with a code interpreter - we
hypothesize that language models (LMs) can leverage code-writing to improve
Chain of Thought reasoning not only for logic and arithmetic tasks, but also
for semantic ones (and in particular, those that are a mix of both). For
example, consider prompting an LM to write code that counts the number of times
it detects sarcasm in an essay: the LM may struggle to write an implementation
for “detect_sarcasm(string)” that can be executed by the interpreter (handling
the edge cases would be insurmountable). However, LMs may still produce a valid
solution if they not only write code, but also selectively “emulate” the
interpreter by generating the expected output of “detect_sarcasm(string)”. In
this work, we propose Chain of Code (CoC), a simple yet surprisingly effective
extension that improves LM code-driven reasoning. The key idea is to encourage
LMs to format semantic sub-tasks in a program as flexible pseudocode whose
undefined behaviors the interpreter can explicitly catch and hand off to an LM
to simulate (as an “LMulator”). Experiments demonstrate that Chain of Code
outperforms Chain of Thought and other baselines across a variety of
benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over
Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions
that LMs can answer by “thinking in code”.
[COMMENTS]
ICML 2024 Oral; Project webpage: https://chain-of-code.github.io
[LINK]
http://arxiv.org/abs/2312.04474v3
[DATE]
2024-07-22 22:27:56+08:00
[CATEGORIES]
cs.CL
cs.LG
RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation
[AUTHORS]
Soroosh Tayebi Arasteh, Mahshad Lotfinia, Keno Bressem, Robert Siepmann, Dyke Ferber, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
[ABSTRACT]
Large language models (LLMs) have advanced the field of artificial
intelligence (AI) in medicine. However, LLMs often generate outdated or
inaccurate information based on static training datasets. Retrieval augmented
generation (RAG) mitigates this by integrating outside data sources. While
previous RAG systems used pre-assembled, fixed databases with limited
flexibility, we have developed Radiology RAG (RadioRAG) as an end-to-end
framework that retrieves data from authoritative radiologic online sources in
real-time. RadioRAG is evaluated using a dedicated radiologic
question-and-answer dataset (RadioQA). We evaluate the diagnostic accuracy of
various LLMs when answering radiology-specific questions with and without
access to additional online information via RAG. Using 80 questions from RSNA
Case Collection across radiologic subspecialties and 24 additional
expert-curated questions, for which the correct gold-standard answers were
available, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B
and 70B]) were prompted with and without RadioRAG. RadioRAG retrieved
context-specific information from www.radiopaedia.org in real-time and
incorporated it into its reply. RadioRAG consistently improved diagnostic
accuracy across all LLMs, with relative improvements ranging from 2% to 54%. It
matched or exceeded question answering without RAG across radiologic
subspecialties, particularly in breast imaging and emergency radiology.
However, the degree of improvement varied among models; GPT-3.5-turbo and
Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2
showed no improvement, highlighting variability in its effectiveness. LLMs
benefit when provided access to domain-specific data beyond their training
data. For radiology, RadioRAG establishes a robust framework that substantially
improves diagnostic accuracy and factuality in radiological question answering.
[LINK]
http://arxiv.org/abs/2407.15621v1
[DATE]
2024-07-22 21:29:56+08:00
[CATEGORIES]
cs.CL
cs.LG
StylusAI: Stylistic Adaptation for Robust German Handwritten Text Generation
[AUTHORS]
Nauman Riaz, Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
[ABSTRACT]
In this study, we introduce StylusAI, a novel architecture leveraging
diffusion models in the domain of handwriting style generation. StylusAI is
specifically designed to adapt and integrate the stylistic nuances of one
language’s handwriting into another, particularly focusing on blending English
handwriting styles into the context of the German writing system. This approach
enables the generation of German text in English handwriting styles and English
text in German handwriting styles, enriching machine-generated handwriting
diversity while ensuring that the generated text remains legible across both
languages. To support the development and evaluation of StylusAI, we present
the ‘Deutscher Handschriften-Datensatz’ (DHSD), a comprehensive dataset
encompassing 37 distinct handwriting styles within the German language. This
dataset provides a fundamental resource for training and benchmarking in the
realm of handwritten text generation. Our results demonstrate that StylusAI not
only introduces a new method for style adaptation in handwritten text
generation but also surpasses existing models in generating handwriting samples
that improve both text quality and stylistic fidelity, evidenced by its
performance on the IAM database and our newly proposed DHSD. Thus, StylusAI
represents a significant advancement in the field of handwriting style
generation, offering promising avenues for future research and applications in
cross-linguistic style adaptation for languages with similar scripts.
[COMMENTS]
Accepted in ICDAR 2024
[LINK]
http://arxiv.org/abs/2407.15608v1
[DATE]
2024-07-22 21:08:30+08:00
[CATEGORIES]
cs.CL
Mitigating Entity-Level Hallucination in Large Language Models
[AUTHORS]
Weihang Su, Yichen Tang, Qingyao Ai, Changyue Wang, Zhijing Wu, Yiqun Liu
[ABSTRACT]
The emergence of Large Language Models (LLMs) has revolutionized how users
access information, shifting from traditional search engines to direct
question-and-answer interactions with LLMs. However, the widespread adoption of
LLMs has revealed a significant challenge known as hallucination, wherein LLMs
generate coherent yet factually inaccurate responses. This hallucination
phenomenon has led to users’ distrust in information retrieval systems based on
LLMs. To tackle this challenge, this paper proposes Dynamic Retrieval
Augmentation based on hallucination Detection (DRAD) as a novel method to
detect and mitigate hallucinations in LLMs. DRAD improves upon traditional
retrieval augmentation by dynamically adapting the retrieval process based on
real-time hallucination detection. It features two main components: Real-time
Hallucination Detection (RHD) for identifying potential hallucinations without
external models, and Self-correction based on External Knowledge (SEK) for
correcting these errors using external knowledge. Experiment results show that
DRAD demonstrates superior performance in both detecting and mitigating
hallucinations in LLMs. All of our code and data are open-sourced at
https://github.com/oneal2000/EntityHallucination.
[LINK]
http://arxiv.org/abs/2407.09417v2
[DATE]
2024-07-22 20:28:05+08:00
[CATEGORIES]
cs.CL
Adversarial Style Augmentation via Large Language Model for Robust Fake News Detection
[AUTHORS]
Sungwon Park, Sungwon Han, Meeyoung Cha
[ABSTRACT]
The spread of fake news negatively impacts individuals and is regarded as a
significant social challenge that needs to be addressed. A number of
algorithmic and insightful features have been identified for detecting fake
news. However, with recent LLMs and their advanced generation capabilities,
many of the detectable features can be altered (e.g., through style-conversion
attacks), making fake news more challenging to distinguish from real news. This study
proposes adversarial style augmentation, AdStyle, to train a fake news detector
that remains robust against various style-conversion attacks. Our model’s key
mechanism is the careful use of LLMs to automatically generate a diverse yet
coherent range of style-conversion attack prompts. This improves the generation
of prompts that are particularly difficult for the detector to handle.
Experiments show that our augmentation strategy improves robustness and
detection performance when tested on fake news benchmark datasets.
[COMMENTS]
8 pages
[LINK]
http://arxiv.org/abs/2406.11260v2
[DATE]
2024-07-22 19:56:44+08:00
[CATEGORIES]
cs.CL
An Empirical Study of Retrieval Augmented Generation with Chain-of-Thought
[AUTHORS]
Yuetong Zhao, Hongyu Cao, Xianyu Zhao, Zhijian Ou
[ABSTRACT]
Since the launch of ChatGPT at the end of 2022, generative dialogue models
represented by ChatGPT have quickly become essential tools in daily life. As
user expectations increase, enhancing the capability of generative dialogue
models to solve complex problems has become a focal point of current research.
This paper delves into the effectiveness of the RAFT (Retrieval Augmented
Fine-Tuning) method in improving the performance of generative dialogue models.
RAFT combines chain-of-thought with model supervised fine-tuning (SFT) and
retrieval augmented generation (RAG), which significantly enhances the model’s
information extraction and logical reasoning abilities. We evaluated the RAFT
method across multiple datasets and analysed its performance in various
reasoning tasks, including long-form QA and short-form QA tasks, tasks in both
Chinese and English, and supportive and comparison reasoning tasks. Notably, it
addresses the gaps in previous research regarding long-form QA tasks and
Chinese datasets. Moreover, we also evaluate the benefit of the
chain-of-thought (CoT) in the RAFT method. This work offers valuable insights
for studies focused on enhancing the performance of generative dialogue models.
[COMMENTS]
5 pages, 4 figures
[LINK]
http://arxiv.org/abs/2407.15569v1
[DATE]
2024-07-22 19:55:14+08:00
[CATEGORIES]
cs.CL
Unipa-GPT: Large Language Models for university-oriented QA in Italian
[AUTHORS]
Irene Siragusa, Roberto Pirrone
[ABSTRACT]
This paper illustrates the architecture and training of Unipa-GPT, a chatbot
relying on a Large Language Model, developed for assisting students in choosing
a bachelor/master degree course at the University of Palermo. Unipa-GPT relies
on gpt-3.5-turbo and was presented in the context of the European Researchers’
Night (SHARPER night). In our experiments we adopted both the Retrieval
Augmented Generation (RAG) approach and fine-tuning to develop the system. The
whole architecture of Unipa-GPT is presented, both the RAG and the fine-tuned
systems are compared, and a brief discussion on their performance is reported.
Further comparison with other Large Language Models and the experimental
results during the SHARPER night are illustrated.
[LINK]
http://arxiv.org/abs/2407.14246v2
[DATE]
2024-07-22 19:22:30+08:00
[CATEGORIES]
cs.CL
Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
[AUTHORS]
Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
[ABSTRACT]
Large language models (LLMs) can often be made to behave in undesirable ways
that they are explicitly fine-tuned not to. For example, the LLM red-teaming
literature has produced a wide variety of ‘jailbreaking’ techniques to elicit
harmful text from models that were fine-tuned to be harmless. Recent work on
red-teaming, model editing, and interpretability suggests that this challenge
stems from how (adversarial) fine-tuning largely serves to suppress rather than
remove undesirable capabilities from LLMs. Prior work has introduced latent
adversarial training (LAT) as a way to improve robustness to broad classes of
failures. These prior works have considered untargeted latent space attacks
where the adversary perturbs latent activations to maximize loss on examples of
desirable behavior. Untargeted LAT can provide a generic type of robustness but
does not leverage information about specific failure modes. Here, we experiment
with targeted LAT where the adversary seeks to minimize loss on a specific
competing task. We find that it can augment a wide variety of state-of-the-art
methods. First, we use targeted LAT to improve robustness to jailbreaks,
outperforming a strong R2D2 baseline with orders of magnitude less compute.
Second, we use it to more effectively remove backdoors with no knowledge of the
trigger. Finally, we use it to more effectively unlearn knowledge for specific
undesirable tasks in a way that is also more robust to re-learning. Overall,
our results suggest that targeted LAT can be an effective tool for defending
against harmful behaviors from LLMs.
[LINK]
http://arxiv.org/abs/2407.15549v1
[DATE]
2024-07-22 19:19:14+08:00
[CATEGORIES]
cs.LG
cs.CL
MAPLE: Multilingual Evaluation of Parameter Efficient Finetuning of Large Language Models
[AUTHORS]
Divyanshu Aggarwal, Ashutosh Sathe, Ishaan Watts, Sunayana Sitaram
[ABSTRACT]
Parameter Efficient Finetuning (PEFT) has emerged as a viable solution for
improving the performance of Large Language Models (LLMs) without requiring
massive resources and compute. Prior work on multilingual evaluation has shown
that there is a large gap between the performance of LLMs on English and other
languages. Further, there is also a large gap between the performance of
smaller open-source models and larger LLMs. Finetuning can be an effective way
to bridge this gap and make language models more equitable. In this work, we
finetune the Llama-2-7B and Mistral-7B models on two synthetic multilingual
instruction tuning datasets to determine its effect on model performance on six
downstream tasks covering forty languages in all. Additionally, we experiment
with various parameters, such as rank for low-rank adaptation and values of
quantisation to determine their effects on downstream performance and find that
higher rank and higher quantisation values benefit low-resource languages. We
find that PEFT of smaller open-source models sometimes bridges the gap between
the performance of these models and the larger ones; however, English
performance can take a hit. We also find that finetuning sometimes improves
performance on low-resource languages, while degrading performance on
high-resource languages.
[COMMENTS]
46 pages, 23 figures, 45 tables. Accepted in ACL 2024 findings
[LINK]
http://arxiv.org/abs/2401.07598v3
[DATE]
2024-07-22 19:13:54+08:00
[CATEGORIES]
cs.CL
Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models
[AUTHORS]
Adway Girish, Alliot Nagle, Marco Bondaschi, Michael Gastpar, Ashok Vardhan Makkuva, Hyeji Kim
[ABSTRACT]
We formalize the problem of prompt compression for large language models
(LLMs) and present a framework to unify token-level prompt compression methods
which create hard prompts for black-box models. We derive the distortion-rate
function for this setup as a linear program, and provide an efficient algorithm
to compute this fundamental limit via the dual of the linear program. Using the
distortion-rate function as the baseline, we study the performance of existing
compression schemes on a synthetic dataset consisting of prompts generated from
a Markov chain, natural language queries, and their respective answers. Our
empirical analysis demonstrates the criticality of query-aware prompt
compression, where the compressor has knowledge of the downstream task/query
for the black-box LLM. We show that there is a large gap between the
performance of current prompt compression methods and the optimal strategy, and
propose a query-aware, variable-rate adaptation of a prior work to close the
gap. We extend our experiments to a small natural language dataset to further
confirm our findings on our synthetic dataset.
[COMMENTS]
40 pages, 15 figures. Under review
[LINK]
http://arxiv.org/abs/2407.15504v1
[DATE]
2024-07-22 17:40:13+08:00
[CATEGORIES]
cs.LG
cs.CL
Meta-Task Prompting Elicits Embeddings from Large Language Models
[AUTHORS]
Yibin Lei, Di Wu, Tianyi Zhou, Tao Shen, Yu Cao, Chongyang Tao, Andrew Yates
[ABSTRACT]
We introduce a new unsupervised text embedding method, Meta-Task Prompting
with Explicit One-Word Limitation (MetaEOL), for generating high-quality
sentence embeddings from Large Language Models (LLMs) without the need for
model fine-tuning. Leveraging meta-task prompting, MetaEOL guides LLMs to
produce embeddings through a series of carefully designed prompts that address
multiple representational aspects. Our comprehensive experiments demonstrate
that embeddings averaged from various meta-tasks are versatile embeddings that
yield competitive performance on Semantic Textual Similarity (STS) benchmarks
and excel in downstream tasks, surpassing contrastive-trained models. Our
findings suggest a new scaling law, offering a versatile and resource-efficient
approach for embedding generation across diverse scenarios.
[COMMENTS]
ACL 2024
[LINK]
http://arxiv.org/abs/2402.18458v2
[DATE]
2024-07-22 17:35:08+08:00
[CATEGORIES]
cs.CL
Refining Corpora from a Model Calibration Perspective for Chinese Spelling Correction
[AUTHORS]
Dingyao Yu, Yang An, Wei Ye, Xiongfeng Xiao, Shaoguang Mao, Tao Ge, Shikun Zhang
[ABSTRACT]
Chinese Spelling Correction (CSC) commonly lacks large-scale high-quality
corpora, due to the labor-intensive labeling of spelling errors in real-life
human writing or typing scenarios. Two data augmentation methods are widely
adopted: (1) Random Replacement with the guidance of confusion sets
and (2) OCR/ASR-based Generation that simulates character misuse.
However, both methods inevitably introduce noisy data (e.g., false spelling
errors), potentially leading to over-correction. By carefully analyzing the two
types of corpora, we find that though the latter achieves more robust
generalization performance, the former yields better-calibrated CSC models. We
then provide a theoretical analysis of this empirical observation, based on
which a corpus refining strategy is proposed. Specifically, OCR/ASR-based data
samples are fed into a well-calibrated CSC model trained on random
replacement-based corpora and then filtered based on prediction confidence. By
learning a simple BERT-based model on the refined OCR/ASR-based corpus, we
establish state-of-the-art performance on three widely-used benchmarks,
while significantly alleviating over-correction (e.g., lowering false positive
predictions).
[LINK]
http://arxiv.org/abs/2407.15498v1
[DATE]
2024-07-22 17:26:35+08:00
[CATEGORIES]
cs.CL
EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation
[AUTHORS]
Yulin Xu, Zhen Yang, Fandong Meng, Jie Zhou
[ABSTRACT]
Complete Multi-lingual Neural Machine Translation (C-MNMT) achieves superior
performance against the conventional MNMT by constructing a multi-way aligned
corpus, i.e., aligning bilingual training examples from different language
pairs when either their source or target sides are identical. However, since
exactly identical sentences from different language pairs are scarce, the power
of the multi-way aligned corpus is limited by its scale. To handle this
problem, this paper proposes “Extract and Generate” (EAG), a two-step approach
to construct a large-scale and high-quality multi-way aligned corpus from
bilingual data. Specifically, we first extract candidate aligned examples by
pairing the bilingual examples from different language pairs with highly
similar source or target sentences; and then generate the final aligned
examples from the candidates with a well-trained generation model. With this
two-step pipeline, EAG can construct a large-scale and multi-way aligned corpus
whose diversity is almost identical to the original bilingual corpus.
Experiments on two publicly available datasets, i.e., WMT-5 and OPUS-100, show
that the proposed method achieves significant improvements over strong
baselines, with +1.1 and +1.4 BLEU points improvements on the two datasets
respectively.
[COMMENTS]
Accepted as a long paper at ACL 2022
[LINK]
http://arxiv.org/abs/2203.02180v2
[DATE]
2024-07-22 17:22:23+08:00
[CATEGORIES]
cs.CL
Two Stacks Are Better Than One: A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives
[AUTHORS]
Zihao Li, Shaoxiong Ji, Timothee Mickus, Vincent Segonne, Jörg Tiedemann
[ABSTRACT]
Pretrained language models (PLMs) display impressive performances and have
captured the attention of the NLP community. Establishing the best practices in
pretraining has therefore become a major point of focus for much of NLP
research – especially since the insights developed for monolingual English
models need not carry over to more complex multilingual settings. One significant caveat of
the current state of the art is that different works are rarely comparable:
they often discuss different parameter counts, training data, and evaluation
methodology.
This paper proposes a comparison of multilingual pretraining objectives in a
controlled methodological environment. We ensure that training data and model
architectures are comparable, and discuss the downstream performances across 6
languages that we observe in probing and fine-tuning scenarios. We make two key
observations: (1) the architecture dictates which pretraining objective is
optimal; (2) multilingual translation is a very effective pre-training
objective under the right conditions. We make our code, data, and model weights
available at https://github.com/Helsinki-NLP/lm-vs-mt.
[LINK]
http://arxiv.org/abs/2407.15489v1
[DATE]
2024-07-22 17:16:30+08:00
[CATEGORIES]
cs.CL
From Black Boxes to Conversations: Incorporating XAI in a Conversational Agent
[AUTHORS]
Van Bach Nguyen, Jörg Schlötterer, Christin Seifert
[ABSTRACT]
The goal of Explainable AI (XAI) is to design methods to provide insights
into the reasoning process of black-box models, such as deep neural networks,
in order to explain them to humans. Social science research states that such
explanations should be conversational, similar to human-to-human explanations.
In this work, we show how to incorporate XAI in a conversational agent, using a
standard design for the agent comprising natural language understanding and
generation components. We build upon an XAI question bank, which we extend by
quality-controlled paraphrases, to understand the user’s information needs. We
further systematically survey the literature for suitable explanation methods
that provide the information to answer those questions, and present a
comprehensive list of suggestions. Our work is the first step towards truly
natural conversations about machine learning models with an explanation agent.
The comprehensive list of XAI questions and the corresponding explanation
methods may support other researchers in providing the necessary information to
address users’ demands. To facilitate future work, we release our source code
and data.
[COMMENTS]
Accepted at The World Conference on eXplainable Artificial
Intelligence 2023 (XAI-2023)
[LINK]
http://arxiv.org/abs/2209.02552v3
[DATE]
2024-07-22 17:10:34+08:00
[CATEGORIES]
cs.CL
TokenSHAP: Interpreting Large Language Models with Monte Carlo Shapley Value Estimation
[AUTHORS]
Roni Goldshmidt, Miriam Horovicz
[ABSTRACT]
As large language models (LLMs) become increasingly prevalent in critical
applications, the need for interpretable AI has grown. We introduce TokenSHAP,
a novel method for interpreting LLMs by attributing importance to individual
tokens or substrings within input prompts. This approach adapts Shapley values
from cooperative game theory to natural language processing, offering a
rigorous framework for understanding how different parts of an input contribute
to a model’s response. TokenSHAP leverages Monte Carlo sampling for
computational efficiency, providing interpretable, quantitative measures of
token importance. We demonstrate its efficacy across diverse prompts and LLM
architectures, showing consistent improvements over existing baselines in
alignment with human judgments, faithfulness to model behavior, and
consistency.
Our method’s ability to capture nuanced interactions between tokens provides
valuable insights into LLM behavior, enhancing model transparency, improving
prompt engineering, and aiding in the development of more reliable AI systems.
TokenSHAP represents a significant step towards the necessary interpretability
for responsible AI deployment, contributing to the broader goal of creating
more transparent, accountable, and trustworthy AI systems.
[LINK]
http://arxiv.org/abs/2407.10114v2
[DATE]
2024-07-22 16:59:07+08:00
[CATEGORIES]
cs.CL
Text-to-Battery Recipe: A language modeling-based protocol for automatic battery recipe extraction and retrieval
[AUTHORS]
Daeun Lee, Jaewoong Choi, Hiroshi Mizuseki, Byungju Lee
[ABSTRACT]
Recent studies have increasingly applied natural language processing (NLP) to
automatically extract experimental research data from the extensive battery
materials literature. Despite the complex process involved in battery
manufacturing – from material synthesis to cell assembly – there has been no
comprehensive study systematically organizing this information. In response, we
propose a language modeling-based protocol, Text-to-Battery Recipe (T2BR), for
the automatic extraction of end-to-end battery recipes, validated using a case
study on batteries containing LiFePO4 cathode material. We report machine
learning-based paper filtering models, screening 2,174 relevant papers from the
keyword-based search results, and unsupervised topic models to identify 2,876
paragraphs related to cathode synthesis and 2,958 paragraphs related to cell
assembly. Then, focusing on the two topics, two deep learning-based named
entity recognition models are developed to extract a total of 30 entities –
including precursors, active materials, and synthesis methods – achieving F1
scores of 88.18% and 94.61%. The accurate extraction of entities enables the
systematic generation of 165 end-to-end recipes of LiFePO4 batteries. Our
protocol and results offer valuable insights into specific trends, such as
associations between precursor materials and synthesis methods, or combinations
between different precursor materials. We anticipate that our findings will
serve as a foundational knowledge base for facilitating battery-recipe
information retrieval. The proposed protocol will significantly accelerate the
review of battery material literature and catalyze innovations in battery
design and development.
[LINK]
http://arxiv.org/abs/2407.15459v1
[DATE]
2024-07-22 16:15:02+08:00
[CATEGORIES]
cs.CL
Developing a Reliable, General-Purpose Hallucination Detection and Mitigation Service: Insights and Lessons Learned
[AUTHORS]
Song Wang, Xun Wang, Jie Mei, Yujia Xie, Sean Muarray, Zhang Li, Lingfeng Wu, Si-Qing Chen, Wayne Xiong
[ABSTRACT]
Hallucination, a phenomenon where large language models (LLMs) produce output
that is factually incorrect or unrelated to the input, is a major challenge for
LLM applications that require accuracy and dependability. In this paper, we
introduce a reliable and high-speed production system aimed at detecting and
rectifying the hallucination issue within LLMs. Our system encompasses named
entity recognition (NER), natural language inference (NLI), span-based
detection (SBD), and an intricate decision tree-based process to reliably
detect a wide range of hallucinations in LLM responses. Furthermore, our team
has crafted a rewriting mechanism that maintains an optimal mix of precision,
response time, and cost-effectiveness. We detail the core elements of our
framework and underscore the paramount challenges tied to response time,
availability, and performance metrics, which are crucial for real-world
deployment of these technologies. Our extensive evaluation, utilizing offline
data and live production traffic, confirms the efficacy of our proposed
framework and service.
[LINK]
http://arxiv.org/abs/2407.15441v1
[DATE]
2024-07-22 15:48:30+08:00
[CATEGORIES]
cs.CL
UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs
[AUTHORS]
Chaoqun He, Renjie Luo, Shengding Hu, Yuanqian Zhao, Jie Zhou, Hanghao Wu, Jiajie Zhang, Xu Han, Zhiyuan Liu, Maosong Sun
[COMMENTS]
Accepted by ACL 2024 System Demonstration Track, update
[LINK]
http://arxiv.org/abs/2404.07584v3
[DATE]
2024-07-22 15:07:06+08:00
[CATEGORIES]
cs.CL
Empirical Capacity Model for Self-Attention Neural Networks
[AUTHORS]
Aki Härmä, Marcin Pietrasik, Anna Wilbik
[ABSTRACT]
Large pretrained self-attention neural networks, or transformers, have been
very successful in various tasks recently. The performance of a model on a
given task depends on its ability to memorize and generalize the training data.
Large transformer models, which may have billions of parameters, in theory have
a huge capacity to memorize content. However, the current algorithms for the
optimization fall short of the theoretical capacity, and the capacity is also
highly dependent on the content. In this paper, we focus on the memory capacity
of these models obtained using common training algorithms and synthetic
training data. Based on the results, we derive an empirical capacity model
(ECM) for a generic transformer. The ECM can be used to design task-specific
transformer models with an optimal number of parameters in cases where the
target memorization capability of the task can be defined.
[COMMENTS]
Submitted to BNAIC’24, 14 pages + refs
[LINK]
http://arxiv.org/abs/2407.15425v1
[DATE]
2024-07-22 15:02:15+08:00
[CATEGORIES]
cs.LG
cs.CL
LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
[AUTHORS]
Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, Satoshi Nakamura
[ABSTRACT]
We introduce LLaST, a framework for building high-performance Large Language
model based Speech-to-text Translation systems. We address the limitations of
end-to-end speech translation (E2E ST) models by exploring model architecture
design and optimization techniques tailored for LLMs. Our approach includes
LLM-based speech translation architecture design, ASR-augmented training,
multilingual data augmentation, and dual-LoRA optimization. Our approach
demonstrates superior performance on the CoVoST-2 benchmark and showcases
exceptional scaling capabilities powered by LLMs. We believe this effective
method will serve as a strong baseline for speech translation and provide
insights for future improvements of the LLM-based speech translation framework.
We release the data, code, and models at https://github.com/openaudiolab/LLaST.
[LINK]
http://arxiv.org/abs/2407.15415v1
[DATE]
2024-07-22 14:42:00+08:00
[CATEGORIES]
cs.CL
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
[AUTHORS]
Mengru Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, Peng Wang, Xiang Chen, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
[ABSTRACT]
Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial
for advancing towards trustworthy AGI. This paper reviews knowledge mechanism
analysis from a novel taxonomy including knowledge utilization and evolution.
Knowledge utilization delves into the mechanism of memorization, comprehension
and application, and creation. Knowledge evolution focuses on the dynamic
progression of knowledge within individual and group LLMs. Moreover, we discuss
what knowledge LLMs have learned, the reasons for the fragility of parametric
knowledge, and the potential dark knowledge (hypothesis) that will be
challenging to address. We hope this work can help understand knowledge in LLMs
and provide insights for future research.
[COMMENTS]
Ongoing work (v1); 34 pages, 5 figures
[LINK]
http://arxiv.org/abs/2407.15017v1
[DATE]
2024-07-22 14:15:59+08:00
[CATEGORIES]
cs.CL
cs.LG
Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
[AUTHORS]
Xiao Liu, Liangzhi Li, Tong Xiang, Fuying Ye, Lu Wei, Wangyue Li, Noa Garcia
[ABSTRACT]
With the development of large language models (LLMs) like ChatGPT, both their
vast applications and potential vulnerabilities have come to the forefront.
While developers have integrated multiple safety mechanisms to mitigate their
misuse, a risk remains, particularly when models encounter adversarial inputs.
This study unveils an attack mechanism that capitalizes on human conversation
strategies to extract harmful information from LLMs. We delineate three pivotal
strategies: (i) decomposing malicious questions into seemingly innocent
sub-questions; (ii) rewriting overtly malicious questions into more covert,
benign-sounding ones; (iii) enhancing the harmfulness of responses by prompting
models for illustrative examples. Unlike conventional methods that target
explicit malicious responses, our approach delves deeper into the nature of the
information provided in responses. Through our experiments conducted on
GPT-3.5-turbo, GPT-4, and Llama2, our method has demonstrated a marked efficacy
compared to conventional attack methods. In summary, this work introduces a
novel attack method that outperforms previous approaches, raising an important
question: How to discern whether the ultimate intent in a dialogue is
malicious?
[LINK]
http://arxiv.org/abs/2407.15399v1
[DATE]
2024-07-22 14:04:29+08:00
[CATEGORIES]
cs.CL
ALLaM: Large Language Models for Arabic and English
[AUTHORS]
M Saiful Bari, Yazeed Alnumay, Norah A. Alzahrani, Nouf M. Alotaibi, Hisham A. Alyahya, Sultan AlRashed, Faisal A. Mirza, Shaykhah Z. Alsubaie, Hassan A. Alahmed, Ghadah Alabduljabbar, Raghad Alkhathran, Yousef Almushayqih, Raneem Alnajim, Salman Alsubaihi, Maryam Al Mansour, Majed Alrubaian, Ali Alammari, Zaki Alawami, Abdulmohsen Al-Thubaity, Ahmed Abdelali, Jeril Kuriakose, Abdalghani Abujabal, Nora Al-Twairesh, Areeb Alowisheq, Haidar Khan
[ABSTRACT]
We present ALLaM: Arabic Large Language Model, a series of large language
models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is
carefully trained considering the values of language alignment and knowledge
transfer at scale. Our autoregressive decoder-only architecture models
demonstrate how second-language acquisition via vocabulary expansion and
pretraining on a mixture of Arabic and English text can steer a model towards a
new language (Arabic) without any catastrophic forgetting in the original
language (English). Furthermore, we highlight the effectiveness of using
parallel/translated data to aid the process of knowledge alignment between
languages. Finally, we show that extensive alignment with human preferences can
significantly enhance the performance of a language model compared to models of
a larger scale with lower quality alignment. ALLaM achieves state-of-the-art
performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and
Arabic Exams. Our aligned models improve both in Arabic and English from their
base aligned models.
[LINK]
http://arxiv.org/abs/2407.15390v1
[DATE]
2024-07-22 13:35:17+08:00
[CATEGORIES]
cs.CL
ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts
[AUTHORS]
Simon Gonzalez
[ABSTRACT]
Social Media platforms have offered invaluable opportunities for linguistic
research. The availability of up-to-date data, coming from any part of the
world, and coming from natural contexts, has allowed researchers to study
language in real time. One of the fields that has made great use of social
media platforms is Corpus Linguistics. There is currently a wide range of
projects which have been able to successfully create corpora from social media.
In this paper, we present the development and deployment of a linguistic corpus
from Twitter posts in English, coming from 26 news agencies and 27 individuals.
The main goal was to create a fully annotated English corpus for linguistic
analysis. We include information on morphology and syntax, as well as NLP
features such as tokenization, lemmas, and n-grams. The information is
presented through a range of powerful visualisations for users to explore
linguistic patterns in the corpus. With this tool, we aim to contribute to the
area of language technologies applied to linguistic research.
[COMMENTS]
Conference on Language Technologies & Digital Humanities Ljubljana,
2022
[LINK]
http://arxiv.org/abs/2407.15374v1
[DATE]
2024-07-22 12:48:04+08:00
[CATEGORIES]
cs.CL
FineSurE: Fine-grained Summarization Evaluation using LLMs
[AUTHORS]
Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour
[COMMENTS]
Accepted at ACL 2024 (main, long)
[LINK]
http://arxiv.org/abs/2407.00908v3
[DATE]
2024-07-22 12:45:11+08:00
[CATEGORIES]
cs.CL
Dissecting Multiplication in Transformers: Insights into LLMs
[AUTHORS]
Luyu Qiu, Jianing Li, Chi Su, Chen Jason Zhang, Lei Chen
[ABSTRACT]
Transformer-based large language models have achieved remarkable performance
across various natural language processing tasks. However, they often struggle
with seemingly easy tasks like arithmetic despite their vast capabilities. This
stark disparity raises concerns about their safe and ethical use and hinders
their widespread adoption. In this paper, we focus on a typical arithmetic task,
integer multiplication, to explore and explain the imperfection of transformers
in this domain. We provide a comprehensive analysis of a vanilla transformer
trained to perform n-digit integer multiplication. Our observations indicate
that the model decomposes the multiplication task into multiple parallel
subtasks, sequentially optimizing each subtask for each digit to complete the
final multiplication. Based on observation and analysis, we infer that the
reason for transformers’ deficiencies in multiplication tasks lies in their
difficulty in calculating successive carryovers and caching intermediate
results, and we confirm this inference through experiments. Guided by these
findings, we propose improvements to enhance transformers’ performance on
multiplication tasks. These enhancements, validated through rigorous testing
and mathematical modeling, not only enhance the transformer’s interpretability
but also improve its performance; e.g., we achieve over 99.9% accuracy on
5-digit integer multiplication with a tiny transformer, outperforming LLMs such
as GPT-4. Our
method contributes to the broader fields of model understanding and
interpretability, paving the way for analyzing more complex tasks and
Transformer models. This work underscores the importance of explainable AI,
helping to build trust in large language models and promoting their adoption in
critical applications.
[COMMENTS]
8 pages, 5 figures
[LINK]
http://arxiv.org/abs/2407.15360v1
[DATE]
2024-07-22 12:07:26+08:00
[CATEGORIES]
cs.CL
UF-HOBI at “Discharge Me!”: A Hybrid Solution for Discharge Summary Generation Through Prompt-based Tuning of GatorTronGPT Models
[AUTHORS]
Mengxian Lyu, Cheng Peng, Daniel Paredes, Ziyi Chen, Aokun Chen, Jiang Bian, Yonghui Wu
[ABSTRACT]
Automatic generation of discharge summaries presents significant challenges
due to the length of clinical documentation, the dispersed nature of patient
information, and the diverse terminology used in healthcare. This paper
presents a hybrid solution for generating discharge summary sections as part of
our participation in the “Discharge Me!” Challenge at the BioNLP 2024 Shared
Task. We developed a two-stage generation method using both extractive and
abstractive techniques, in which we first apply named entity recognition (NER)
to extract key clinical concepts, which are then used as input for a
prompt-tuning-based GatorTronGPT model to generate coherent text for two
important sections including “Brief Hospital Course” and “Discharge
Instructions”. Our system was ranked 5th in this challenge, achieving an
overall score of 0.284. The results demonstrate the effectiveness of our hybrid
solution in improving the quality of automated discharge section generation.
[COMMENTS]
BIONLP 2024 and Shared Tasks @ ACL 2024
[LINK]
http://arxiv.org/abs/2407.15359v1
[DATE]
2024-07-22 12:02:45+08:00
[CATEGORIES]
cs.CL
Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA
[AUTHORS]
Yuan Pu, Zhuolun He, Tairu Qiu, Haoyuan Wu, Bei Yu
[ABSTRACT]
Retrieval augmented generation (RAG) enhances the accuracy and reliability of
generative AI models by sourcing factual information from external databases,
which is extensively employed in document-grounded question-answering (QA)
tasks. Off-the-shelf RAG flows are well pretrained on general-purpose
documents, yet they encounter significant challenges when being applied to
knowledge-intensive vertical domains, such as electronic design automation
(EDA). This paper addresses this issue by proposing a customized RAG framework
along with three domain-specific techniques for EDA tool documentation QA,
including a contrastive learning scheme for text embedding model fine-tuning, a
reranker distilled from a proprietary LLM, and a generative LLM fine-tuned with
a high-quality domain corpus. Furthermore, we have developed and released a
documentation QA evaluation benchmark, ORD-QA, for OpenROAD, an advanced
RTL-to-GDSII design platform. Experimental results demonstrate that our
proposed RAG flow and techniques have achieved superior performance on ORD-QA
as well as on a commercial tool, compared with state-of-the-art methods. The ORD-QA
benchmark and the training dataset for our customized RAG flow are open-source
at https://github.com/lesliepy99/RAG-EDA.
[LINK]
http://arxiv.org/abs/2407.15353v1
[DATE]
2024-07-22 11:44:27+08:00
[CATEGORIES]
cs.CL
Knowledge Acquisition Disentanglement for Knowledge-based Visual Question Answering with Large Language Models
[AUTHORS]
Wenbin An, Feng Tian, Jiahao Nie, Wenkai Shi, Haonan Lin, Yan Chen, QianYing Wang, Yaqiang Wu, Guang Dai, Ping Chen
[ABSTRACT]
Knowledge-based Visual Question Answering (KVQA) requires both image and
world knowledge to answer questions. Current methods first retrieve knowledge
from the image and external knowledge base with the original complex question,
then generate answers with Large Language Models (LLMs). However, since the
original question contains complex elements that require knowledge from
different sources, acquiring different kinds of knowledge in a coupled manner
may confuse models and hinder them from retrieving precise knowledge.
Furthermore, the “forward-only” answering process fails to explicitly capture
the knowledge needs of LLMs, which can further hurt answering quality. To cope
with the above limitations, we propose DKA: Disentangled Knowledge Acquisition
from LLM feedback, a training-free framework that disentangles knowledge
acquisition to avoid confusion and uses LLM’s feedback to specify the required
knowledge. Specifically, DKA requires LLMs to specify what knowledge they need
to answer the question and decompose the original complex question into two
simple sub-questions: Image-based sub-question and Knowledge-based
sub-question. Then we use the two sub-questions to retrieve knowledge from the
image and knowledge base, respectively. In this way, two knowledge acquisition
models can focus on the content that corresponds to them and avoid disturbance
of irrelevant elements in the original complex question, which can help to
provide more precise knowledge and better align with the knowledge needs of LLMs to
yield correct answers. Experiments on benchmark datasets show that DKA
significantly outperforms SOTA models. To facilitate future research, our data
and code are available at https://github.com/Lackel/DKA.
[COMMENTS]
Pre-print
[LINK]
http://arxiv.org/abs/2407.15346v1
[DATE]
2024-07-22 11:05:32+08:00
[CATEGORIES]
cs.CL
Improving Minimum Bayes Risk Decoding with Multi-Prompt
[AUTHORS]
David Heineman, Yao Dou, Wei Xu
[ABSTRACT]
While instruction fine-tuned LLMs are effective text generators, sensitivity
to prompt construction makes performance unstable and sub-optimal in practice.
Relying on a single “best” prompt cannot capture all differing approaches to a
generation problem. Using this observation, we propose multi-prompt decoding,
where many candidate generations are decoded from a prompt bank at
inference-time. To ensemble candidates, we use Minimum Bayes Risk (MBR)
decoding, which selects a final output using a trained value metric. We show
multi-prompt improves MBR across a comprehensive set of conditional generation
tasks, and show this is a result of estimating a more diverse and higher
quality candidate space than that of a single prompt. Further experiments
confirm multi-prompt improves generation across tasks, models and metrics.
[LINK]
http://arxiv.org/abs/2407.15343v1
[DATE]
2024-07-22 10:57:10+08:00
[CATEGORIES]
cs.CL
Deep Learning for Economists
[AUTHORS]
Melissa Dell
[ABSTRACT]
Deep learning provides powerful methods to impute structured information from
large-scale, unstructured text and image datasets. For example, economists
might wish to detect the presence of economic activity in satellite images, or
to measure the topics or entities mentioned in social media, the congressional
record, or firm filings. This review introduces deep neural networks, covering
methods such as classifiers, regression models, generative AI, and embedding
models. Applications include classification, document digitization, record
linkage, and methods for data exploration in massive scale text and image
corpora. When suitable methods are used, deep learning models can be cheap to
tune and can scale affordably to problems involving millions or billions of
data points. The review is accompanied by a companion website, EconDL, with
user-friendly demo notebooks, software resources, and a knowledge base that
provides technical details and additional applications.
[LINK]
http://arxiv.org/abs/2407.15339v1
[DATE]
2024-07-22 10:53:18+08:00
[CATEGORIES]
cs.CL
Sketch-Guided Constrained Decoding for Boosting Blackbox Large Language Models without Logit Access
[AUTHORS]
Saibo Geng, Berkay Döner, Chris Wendler, Martin Josifoski, Robert West
[ABSTRACT]
Constrained decoding, a technique for enforcing constraints on language model
outputs, offers a way to control text generation without retraining or
architectural modifications. Its application is, however, typically restricted
to models that give users access to next-token distributions (usually via
softmax logits), which poses a limitation with blackbox large language models
(LLMs). This paper introduces sketch-guided constrained decoding (SGCD), a
novel approach to constrained decoding for blackbox LLMs, which operates
without access to the logits of the blackbox LLM. SGCD utilizes a locally
hosted auxiliary model to refine the output of an unconstrained blackbox LLM,
effectively treating this initial output as a “sketch” for further elaboration.
This approach is complementary to traditional logit-based techniques and
enables the application of constrained decoding in settings where full model
transparency is unavailable. We demonstrate the efficacy of SGCD through
experiments in closed information extraction and constituency parsing, showing
how it enhances the utility and flexibility of blackbox LLMs for complex NLP
tasks.
[COMMENTS]
Accepted to ACL 2024 Oral
[LINK]
http://arxiv.org/abs/2401.09967v4
[DATE]
2024-07-22 09:05:29+08:00
[CATEGORIES]
cs.CL
∀uto∃val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks
[AUTHORS]
Rushang Karia, Daniel Bramblett, Daksh Dobhal, Pulkit Verma, Siddharth Srivastava
[ABSTRACT]
This paper presents ∀uto∃val, a new approach for scaling LLM
assessment in translating formal syntax – such as first-order logic, regular
expressions, etc. – to natural language (interpretation) or vice versa
(compilation), thereby facilitating their use in applications such as
generating/explaining logic and control flow for programs, etc. Existing
approaches for LLM assessment in these areas require labor-intensive
ground-truth creation, the availability of which undermines the separation of
training and test sets. Furthermore, such datasets typically include relatively
few hand-coded test cases over which LLM accuracy is determined, thus making
them inadequate for determining the safety or correctness of their generated
outputs. We introduce a new approach that utilizes context-free grammars (CFGs)
to generate out-of-distribution datasets on the fly and perform closed-loop
testing of LLM capabilities using formal verifiers to guarantee the correctness
of LLM outputs without any human intervention. We release our dataset and
benchmark as open-source code at
https://github.com/AAIR-lab/auto-llm-assessment. We also conduct an
assessment of several SOTA closed and open-source LLMs to showcase the
feasibility and scalability of this paradigm. Our experiments reveal that SOTA
LLMs are unable to solve the formal translation task adequately.
[LINK]
http://arxiv.org/abs/2403.18327v2
[DATE]
2024-07-22 08:41:38+08:00
[CATEGORIES]
cs.CL
MLRegTest: A Benchmark for the Machine Learning of Regular Languages
[AUTHORS]
Sam van der Poel, Dakotah Lambert, Kalina Kostyszyn, Tiantian Gao, Rahul Verma, Derek Andersen, Joanne Chau, Emily Peterson, Cody St. Clair, Paul Fodor, Chihiro Shibata, Jeffrey Heinz
[ABSTRACT]
Synthetic datasets constructed from formal languages allow fine-grained
examination of the learning and generalization capabilities of machine learning
systems for sequence classification. This article presents a new benchmark for
machine learning systems on sequence classification called MLRegTest, which
contains training, development, and test sets from 1,800 regular languages.
Different kinds of formal languages represent different kinds of long-distance
dependencies, and correctly identifying long-distance dependencies in sequences
is a known challenge for ML systems to generalize successfully. MLRegTest
organizes its languages according to their logical complexity (monadic second
order, first order, propositional, or monomial expressions) and the kind of
logical literals (string, tier-string, subsequence, or combinations thereof).
The logical complexity and choice of literal provide a systematic way to
understand different kinds of long-distance dependencies in regular languages,
and therefore to understand the capacities of different ML systems to learn
such long-distance dependencies. Finally, the performance of different neural
networks (simple RNN, LSTM, GRU, transformer) on MLRegTest is examined. The
main conclusion is that performance depends significantly on the kind of test
set, the class of language, and the neural network architecture.
[COMMENTS]
43 pages, MLRegTest benchmark available at
https://doi.org/10.5061/dryad.dncjsxm4h, associated code at
https://github.com/heinz-jeffrey/subregular-learning
[LINK]
http://arxiv.org/abs/2304.07687v3
[DATE]
2024-07-22 08:40:17+08:00
[CATEGORIES]
cs.LG
cs.CL
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
[AUTHORS]
Kwanyong Park, Kuniaki Saito, Donghyun Kim
[ABSTRACT]
Vision-language (VL) models often exhibit a limited understanding of complex
expressions of visual objects (e.g., attributes, shapes, and their relations),
given complex and diverse language queries. Traditional approaches attempt to
improve VL models using hard negative synthetic text, but their effectiveness
is limited. In this paper, we harness the exceptional compositional
understanding capabilities of generative foundational models. We introduce a
novel method for structured synthetic data generation aimed at enhancing the
compositional understanding of VL models in language-based object detection.
Our framework generates densely paired positive and negative triplets (image,
text descriptions, and bounding boxes) in both image and text domains. By
leveraging these synthetic triplets, we transform ‘weaker’ VL models into
‘stronger’ models in terms of compositional understanding, a process we call
“Weak-to-Strong Compositional Learning” (WSCL). To achieve this, we propose a
new compositional contrastive learning formulation that discovers semantics and
structures in complex descriptions from synthetic triplets. As a result, VL
models trained with our synthetic data generation exhibit a significant
performance boost of up to +5 AP on the Omnilabel benchmark and +6.9 AP on the
D3 benchmark over existing baselines.
[COMMENTS]
ECCV 2024
[LINK]
http://arxiv.org/abs/2407.15296v1
[DATE]
2024-07-22 07:43:24+08:00
[CATEGORIES]
cs.CL
cs.LG
Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine
[AUTHORS]
Fenglin Liu, Bang Yang, Chenyu You, Xian Wu, Shen Ge, Zhangdaihong Liu, Xu Sun, Yang Yang, David A. Clifton
[ABSTRACT]
Language models (LMs), including large language models (such as ChatGPT),
have the potential to assist clinicians in generating various clinical notes.
However, LMs are prone to produce “hallucinations”, i.e., generated content
that is not aligned with facts and knowledge. In this paper, we propose the
Re³Writer method with retrieval-augmented generation and knowledge-grounded
reasoning to enable LMs to generate faithful clinical texts. We demonstrate the
effectiveness of our method in generating patient discharge instructions. It
requires the LMs not only to understand the patients’ long clinical documents,
i.e., the health records during hospitalization, but also to generate critical
instructional information provided both to carers and to the patient at the
time of discharge. The proposed Re³Writer imitates the working patterns of
physicians to first retrieve related working experience from
historical instructions written by physicians, then reason related
medical knowledge. Finally, it refines the retrieved working
experience and reasoned medical knowledge to extract useful information, which
is used to generate the discharge instructions for previously-unseen patients.
Our experiments show that, using our method, the performance of five
representative LMs can be substantially boosted across all metrics. Meanwhile,
we show results from human evaluations to measure the effectiveness in terms of
fluency, faithfulness, and comprehensiveness.
[LINK]
http://arxiv.org/abs/2210.12777v4
[DATE]
2024-07-22 06:57:58+08:00
[CATEGORIES]
cs.CL
cs.LG
Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis
[AUTHORS]
Guangliang Liu, Haitao Mao, Jiliang Tang, Kristen Marie Johnson
[ABSTRACT]
Large Language Models (LLMs) are capable of producing content that
perpetuates stereotypes, discrimination, and toxicity. The recently proposed
moral self-correction is a computationally efficient method for reducing
harmful content in the responses of LLMs. However, the process of how injecting
self-correction instructions can modify the behavior of LLMs remains
under-explored. In this paper, we explore the effectiveness of moral
self-correction by answering three research questions: (1) In what scenarios
does moral self-correction work? (2) What are the internal mechanisms of LLMs,
e.g., hidden states, that are influenced by moral self-correction instructions?
(3) Is intrinsic moral self-correction actually superficial? We argue that
self-correction can help LLMs find a shortcut to more morally correct output,
rather than truly reducing the immorality stored in hidden states. Through
empirical investigation with tasks of language generation and multi-choice
question answering, we conclude: (i) LLMs exhibit good performance across both
tasks, and self-correction instructions are particularly beneficial when the
correct answer is already top-ranked; (ii) The morality levels in intermediate
hidden states are strong indicators as to whether one instruction would be more
effective than another; (iii) Based on our analysis of intermediate hidden
states and task case studies of self-correction behaviors, we are the first to
propose the hypothesis that intrinsic moral self-correction is in fact
superficial.
[LINK]
http://arxiv.org/abs/2407.15286v1
[DATE]
2024-07-22 06:50:11+08:00
[CATEGORIES]
cs.CL
DPO Meets PPO: Reinforced Token Optimization for RLHF
[AUTHORS]
Han Zhong, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, Liwei Wang
[ABSTRACT]
In the classical Reinforcement Learning from Human Feedback (RLHF) framework,
Proximal Policy Optimization (PPO) is employed to learn from sparse,
sentence-level rewards – a challenging scenario in traditional deep
reinforcement learning. Despite the great successes of PPO in the alignment of
state-of-the-art closed-source large language models (LLMs), its open-source
implementation is still largely sub-optimal, as widely reported by numerous
research studies. To address these issues, we introduce a framework that models
RLHF problems as a Markov decision process (MDP), enabling the capture of
fine-grained token-wise information. Furthermore, we provide theoretical
insights that demonstrate the superiority of our MDP framework over the
previous sentence-level bandit formulation. Under this framework, we introduce
an algorithm, dubbed as Reinforced Token Optimization (\texttt{RTO}), which
learns the token-wise reward function from preference data and performs policy
optimization based on this learned token-wise reward signal. Theoretically,
\texttt{RTO} is proven to have the capability of finding the near-optimal
policy sample-efficiently. For its practical implementation, \texttt{RTO}
innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO,
originally derived from sparse sentence rewards, surprisingly provides us with
a token-wise characterization of response quality, which is seamlessly
incorporated into our subsequent PPO training stage. Extensive real-world
alignment experiments verify the effectiveness of the proposed approach.
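
As a rough illustration of the token-wise signal described above, the sketch below scores each token by the scaled log-ratio beta * (log pi(a|s) - log pi_ref(a|s)) between a DPO-trained policy and its reference model. This is the usual implicit DPO reward applied per token. The function name, the value of beta, and the toy log-probabilities are assumptions made for illustration and are not taken from the paper's code.

```python
from typing import List


def token_rewards(policy_logps: List[float],
                  ref_logps: List[float],
                  beta: float = 0.1) -> List[float]:
    """Per-token reward: beta * (log pi(a|s) - log pi_ref(a|s)).

    Both inputs hold the log-probability that each model assigns to the
    tokens of one response, in order. `beta` is a hypothetical scale.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("log-probability lists must have the same length")
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]


# Toy values for a three-token response (made up for this sketch).
policy = [-0.5, -1.2, -0.3]
reference = [-0.7, -1.0, -0.9]
rewards = token_rewards(policy, reference, beta=0.1)
```

Tokens that the tuned policy prefers more strongly than the reference receive a positive reward. How such rewards enter the PPO stage is specific to the paper and is not shown here.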
[LINK]
http://arxiv.org/abs/2404.18922v2
[DATE]
2024-07-22 05:48:54+08:00
[CATEGORIES]
cs.LG
cs.CL
Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation
[AUTHORS]
Liwen Sun, James Zhao, Megan Han, Chenyan Xiong
[ABSTRACT]
Multimodal foundation models hold significant potential for automating
radiology report generation, thereby assisting clinicians in diagnosing cardiac
diseases. However, generated reports often suffer from serious factual
inaccuracy. In this paper, we introduce a fact-aware multimodal
retrieval-augmented pipeline for generating accurate radiology reports
(FactMM-RAG). We first leverage RadGraph to mine factual report pairs, then
integrate factual knowledge to train a universal multimodal retriever. Given a
radiology image, our retriever can identify high-quality reference reports to
augment multimodal foundation models, thus enhancing the factual completeness
and correctness of report generation. Experiments on two benchmark datasets
show that our multimodal retriever outperforms state-of-the-art retrievers on
both language generation and radiology-specific metrics, up to 6.5% and 2%
score in F1CheXbert and F1RadGraph. Further analysis indicates that employing
our factually-informed training strategy imposes an effective supervision
signal, without relying on explicit diagnostic label guidance, and successfully
propagates fact-aware capabilities from the multimodal retriever to the
multimodal foundation model in radiology report generation.
[LINK]
http://arxiv.org/abs/2407.15268v1
[DATE]
2024-07-22 05:04:28+08:00
[CATEGORIES]
cs.CL
Cross-Modal Projection in Multimodal LLMs Doesn’t Really Project Visual Attributes to Textual Space
[AUTHORS]
Gaurav Verma, Minje Choi, Kartik Sharma, Jamelle Watson-Daniels, Sejoon Oh, Srijan Kumar
[ABSTRACT]
Multimodal large language models (MLLMs) like LLaVA and GPT-4(V) enable
general-purpose conversations about images with the language modality. As
off-the-shelf MLLMs may have limited capabilities on images from domains like
dermatology and agriculture, they must be fine-tuned to unlock domain-specific
applications. The prevalent architecture of current open-source MLLMs comprises
two major modules: an image-language (cross-modal) projection network and a
large language model. It is desirable to understand the roles of these two
modules in modeling domain-specific visual attributes to inform the design of
future models and streamline the interpretability efforts on the current
models. To this end, via experiments on 4 datasets and under 2 fine-tuning
settings, we find that as the MLLM is fine-tuned, it indeed gains
domain-specific visual capabilities, but the updates do not lead to the
projection extracting relevant domain-specific visual attributes. Our results
indicate that the domain-specific visual attributes are modeled by the LLM,
even when only the projection is fine-tuned. Through this study, we offer a
potential reinterpretation of the role of cross-modal projections in MLLM
architectures. Project webpage:
https://claws-lab.github.io/projection-in-MLLMs/
[COMMENTS]
Accepted at ACL 2024 (Main, Short)
[LINK]
http://arxiv.org/abs/2402.16832v2
[DATE]
2024-07-22 02:11:34+08:00
[CATEGORIES]
cs.CL
Two eyes, Two views, and finally, One summary! Towards Multi-modal Multi-tasking Knowledge-Infused Medical Dialogue Summarization
[AUTHORS]
Anisha Saha, Abhisek Tiwari, Sai Ruthvik, Sriparna Saha
[ABSTRACT]
We often summarize a multi-party conversation in two stages: chunking with
homogeneous units and summarizing the chunks. Thus, we hypothesize that there
exists a correlation between homogeneous speaker chunking and overall
summarization tasks. In this work, we investigate the effectiveness of a
multi-faceted approach that simultaneously produces summaries of medical
concerns, doctor impressions, and an overall view. We introduce a multi-modal,
multi-tasking, knowledge-infused medical dialogue summary generation
(MMK-Summation) model, which is incorporated with adapter-based fine-tuning
through a gated mechanism for multi-modal information integration. The model,
MMK-Summation, takes dialogues as input, extracts pertinent external knowledge
based on the context, integrates the knowledge and visual cues from the
dialogues into the textual content, and ultimately generates concise summaries
encompassing medical concerns, doctor impressions, and a comprehensive
overview. The introduced model surpasses multiple baselines and traditional
summarization models across all evaluation metrics (including human
evaluation), which firmly demonstrates the efficacy of the knowledge-guided
multi-tasking, multimodal medical conversation summarization. The code is
available at https://github.com/NLP-RL/MMK-Summation.
[LINK]
http://arxiv.org/abs/2407.15237v1
[DATE]
2024-07-22 02:00:10+08:00
[CATEGORIES]
cs.CL
TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data
[AUTHORS]
Jipeng Zhang, Yaxuan Qin, Renjie Pi, Weizhong Zhang, Rui Pan, Tong Zhang
[ABSTRACT]
Instruction tuning has achieved unprecedented success in NLP, turning large
language models into versatile chatbots. However, the increasing variety and
volume of instruction datasets demand significant computational resources. To
address this, it is essential to extract a small and highly informative subset
(i.e., Coreset) that achieves comparable performance to the full dataset.
Achieving this goal poses non-trivial challenges: 1) data selection requires
accurate data representations that reflect the training samples’ quality, 2)
considering the diverse nature of instruction datasets, and 3) ensuring the
efficiency of the coreset selection algorithm for large models. To address
these challenges, we propose Task-Agnostic Gradient Clustered COreset Selection
(TAGCOS). Specifically, we leverage sample gradients as the data
representations, perform clustering to group similar data, and apply an
efficient greedy algorithm for coreset selection. Experimental results show
that our algorithm, selecting only 5% of the data, surpasses other unsupervised
methods and achieves performance close to that of the full dataset.
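
The three steps named above (gradient features, clustering, greedy selection) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the gradient vectors are toy 2-D values, the clustering is a plain k-means with farthest-point initialization, and the greedy rule (pick samples whose running mean best matches the cluster's mean gradient) is a stand-in chosen for brevity, not necessarily the rule used in the paper. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(points: List[Vector], k: int, iters: int = 20) -> List[int]:
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = mean(members)
    return assign


def greedy_select(points: List[Vector], indices: List[int], budget: int) -> List[int]:
    """Greedily pick indices whose running mean tracks the cluster mean."""
    target = mean([points[i] for i in indices])
    chosen: List[int] = []
    remaining = list(indices)
    running = [0.0] * len(target)
    while remaining and len(chosen) < budget:
        n = len(chosen) + 1
        best = min(remaining, key=lambda i: sq_dist(
            [(s + x) / n for s, x in zip(running, points[i])], target))
        chosen.append(best)
        remaining.remove(best)
        running = [s + x for s, x in zip(running, points[best])]
    return chosen


def select_coreset(grads: List[Vector], k: int, fraction: float) -> List[int]:
    """Cluster gradient features, then select a fraction from each cluster."""
    assign = kmeans(grads, k)
    coreset: List[int] = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if idx:
            budget = max(1, math.ceil(fraction * len(idx)))
            coreset.extend(greedy_select(grads, idx, budget))
    return sorted(coreset)


# Toy "gradient" features forming two well-separated groups.
grads = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [0.0, 0.0],
         [5.0, 5.1], [5.1, 5.0], [5.05, 5.05], [5.0, 5.0]]
coreset = select_coreset(grads, k=2, fraction=0.5)
```

Selecting within each cluster rather than over the whole pool keeps every group of similar samples represented in the subset, which is the intuition behind clustering before selection.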
[COMMENTS]
Preprint. Our code and models are available at:
https://github.com/2003pro/TAGCOS
[LINK]
http://arxiv.org/abs/2407.15235v1
[DATE]
2024-07-22 01:59:20+08:00
[CATEGORIES]
cs.CL
cs.LG
A Community-Centric Perspective for Characterizing and Detecting Anti-Asian Violence-Provoking Speech
[AUTHORS]
Gaurav Verma, Rynaa Grover, Jiawei Zhou, Binny Mathew, Jordan Kraemer, Munmun De Choudhury, Srijan Kumar
[COMMENTS]
Accepted to ACL 2024 Main
[LINK]
http://arxiv.org/abs/2407.15227v1
[DATE]
2024-07-22 01:27:17+08:00
[CATEGORIES]
cs.CL
When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
[AUTHORS]
Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez
[ABSTRACT]
The integration of new modalities into frontier AI systems offers exciting
capabilities, but also increases the possibility such systems can be
adversarially manipulated in undesirable ways. In this work, we focus on a
popular class of vision-language models (VLMs) that generate text outputs
conditioned on visual and textual inputs. We conducted a large-scale empirical
study to assess the transferability of gradient-based universal image
“jailbreaks” using a diverse set of over 40 open-parameter VLMs, including 18
new VLMs that we publicly release. Overall, we find that transferable
gradient-based image jailbreaks are extremely difficult to obtain. When an
image jailbreak is optimized against a single VLM or against an ensemble of
VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits
little-to-no transfer to any other VLMs; transfer is not affected by whether
the attacked and target VLMs possess matching vision backbones or language
models, whether the language model underwent instruction-following and/or
safety-alignment training, or many other factors. Only two settings display
partially successful transfer: between identically-pretrained and
identically-initialized VLMs with slightly different VLM training data, and
between different training checkpoints of a single VLM. Leveraging these
results, we then demonstrate that transfer can be significantly improved
against a specific target VLM by attacking larger ensembles of “highly-similar”
VLMs. These results stand in stark contrast to existing evidence of universal
and transferable text jailbreaks against language models and transferable
adversarial attacks against image classifiers, suggesting that VLMs may be more
robust to gradient-based transfer attacks.
[LINK]
http://arxiv.org/abs/2407.15211v1
[DATE]
2024-07-22 00:27:24+08:00
[CATEGORIES]
cs.CL
cs.LG
Diffusion for Out-of-Distribution Detection on Road Scenes and Beyond
[AUTHORS]
Silvio Galesso, Philipp Schröppel, Hssan Driss, Thomas Brox
[ABSTRACT]
In recent years, research on out-of-distribution (OoD) detection for semantic
segmentation has mainly focused on road scenes – a domain with a constrained
amount of semantic diversity. In this work, we challenge this constraint and
extend the domain of this task to general natural images. To this end, we
introduce: 1. the ADE-OoD benchmark, which is based on the ADE20k dataset and
includes images from diverse domains with a high semantic diversity, and 2. a
novel approach that uses Diffusion score matching for OoD detection (DOoD) and
is robust to the increased semantic diversity. ADE-OoD features indoor and
outdoor images, defines 150 semantic categories as in-distribution, and
contains a variety of OoD objects. For DOoD, we train a diffusion model with an
MLP architecture on semantic in-distribution embeddings and build on the score
matching interpretation to compute pixel-wise OoD scores at inference time. On
common road scene OoD benchmarks, DOoD performs on par with or better than the state
of the art, without using outliers for training or making assumptions about the
data domain. On ADE-OoD, DOoD outperforms previous approaches, but leaves much
room for future improvements.
[COMMENTS]
ECCV 2024 - Benchmark page: https://ade-ood.github.io/
[LINK]
http://arxiv.org/abs/2407.15739v1
[DATE]
2024-07-22 23:41:37+08:00
[CATEGORIES]
cs.LG
Parallel Split Learning with Global Sampling
[AUTHORS]
Mohammad Kohankhaki, Ahmad Ayad, Mahdi Barhoush, Anke Schmeink
[ABSTRACT]
The expansion of IoT devices and the demands of Deep Learning have
highlighted significant challenges in Distributed Deep Learning (DDL) systems.
Parallel Split Learning (PSL) has emerged as a promising derivative of Split
Learning that is well suited for distributed learning on resource-constrained
devices. However, PSL faces several obstacles, such as large effective batch
sizes, non-IID data distributions, and the straggler effect. We view these
issues as a sampling dilemma and propose to address them by orchestrating the
mini-batch sampling process on the server side. We introduce the Uniform Global
Sampling (UGS) method to decouple the effective batch size from the number of
clients and reduce mini-batch deviation in non-IID settings. To address the
straggler effect, we introduce the Latent Dirichlet Sampling (LDS) method,
which generalizes UGS to balance the trade-off between batch deviation and
training time. Our simulations reveal that our proposed methods enhance model
accuracy by up to 34.1% in non-IID settings and reduce the training time in the
presence of stragglers by up to 62%. In particular, LDS effectively mitigates
the straggler effect without compromising model accuracy or adding significant
computational overhead compared to UGS. Our results demonstrate the potential
of our methods as a promising solution for DDL in real applications.
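
One way to read the server-side sampling idea is that the server draws a fixed-size global mini-batch uniformly from the pooled client data and then tells each client how many samples to contribute. The sketch below follows that reading. It is an interpretation of the abstract, not the paper's implementation, and the function name, client identifiers, and dataset sizes are hypothetical.

```python
import random
from collections import Counter
from typing import Dict


def uniform_global_batch(client_sizes: Dict[str, int],
                         batch_size: int,
                         rng: random.Random) -> Dict[str, int]:
    """Return how many samples each client contributes to one global batch.

    The batch is drawn uniformly without replacement from the union of all
    client datasets, so the total batch size does not grow with the number
    of clients.
    """
    pool = [cid for cid, n in client_sizes.items() for _ in range(n)]
    if batch_size > len(pool):
        raise ValueError("batch_size exceeds the total number of samples")
    counts = Counter(rng.sample(pool, batch_size))
    return {cid: counts.get(cid, 0) for cid in client_sizes}


# Hypothetical clients with unequal dataset sizes.
client_sizes = {"a": 50, "b": 200, "c": 750}
shares = uniform_global_batch(client_sizes, batch_size=100, rng=random.Random(0))
```

Because every sample in the pool is equally likely to be drawn, clients with more data contribute proportionally more to each batch on average, and the batch composition reflects the global data distribution rather than any single client's.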
[LINK]
http://arxiv.org/abs/2407.15738v1
[DATE]
2024-07-22 23:41:23+08:00
[CATEGORIES]
cs.LG
Simulation-Based Inference with Quantile Regression
[AUTHORS]
He Jia
[ABSTRACT]
We present Neural Quantile Estimation (NQE), a novel Simulation-Based
Inference (SBI) method based on conditional quantile regression. NQE
autoregressively learns individual one dimensional quantiles for each posterior
dimension, conditioned on the data and previous posterior dimensions. Posterior
samples are obtained by interpolating the predicted quantiles using monotonic
cubic Hermite spline, with specific treatment for the tail behavior and
multi-modal distributions. We introduce an alternative definition for the
Bayesian credible region using the local Cumulative Density Function (CDF),
offering substantially faster evaluation than the traditional Highest Posterior
Density Region (HPDR). In case of limited simulation budget and/or known model
misspecification, a post-processing calibration step can be integrated into NQE
to ensure the unbiasedness of the posterior estimation with negligible
additional computational cost. We demonstrate that NQE achieves
state-of-the-art performance on a variety of benchmark problems.
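
The sampling step described above, interpolating predicted quantiles with a monotone cubic Hermite spline and drawing samples from the result, can be sketched for a single dimension as follows. This is a simplified sketch under stated assumptions: the quantile levels and values are made-up numbers rather than network outputs, slopes are set with a harmonic-mean rule that keeps the interpolant monotone, and the tail and multi-modal treatment mentioned in the abstract is omitted, so samples are restricted to the range between the lowest and highest predicted quantile. All function names are hypothetical.

```python
import bisect
import random
from typing import List


def monotone_slopes(x: List[float], y: List[float]) -> List[float]:
    """Knot slopes that keep the cubic Hermite interpolant monotone."""
    d = [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(len(x) - 1)]
    m = [0.0] * len(x)
    m[0], m[-1] = d[0], d[-1]
    for i in range(1, len(x) - 1):
        if d[i - 1] * d[i] > 0:
            # Harmonic mean of neighbouring secants (at most twice the smaller).
            m[i] = 2 * d[i - 1] * d[i] / (d[i - 1] + d[i])
    return m


def quantile_fn(x: List[float], y: List[float], m: List[float], u: float) -> float:
    """Evaluate the cubic Hermite interpolant of the quantile function at u."""
    k = min(max(bisect.bisect_right(x, u) - 1, 0), len(x) - 2)
    h = x[k + 1] - x[k]
    t = (u - x[k]) / h
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * y[k] + h10 * h * m[k] + h01 * y[k + 1] + h11 * h * m[k + 1]


def sample(levels: List[float], values: List[float], n: int,
           rng: random.Random) -> List[float]:
    """Inverse-transform sampling restricted to [levels[0], levels[-1]]."""
    m = monotone_slopes(levels, values)
    return [quantile_fn(levels, values, m, rng.uniform(levels[0], levels[-1]))
            for _ in range(n)]


# Made-up quantiles, roughly those of a standard normal distribution.
levels = [0.1, 0.25, 0.5, 0.75, 0.9]
values = [-1.28, -0.67, 0.0, 0.67, 1.28]
draws = sample(levels, values, 1000, random.Random(0))
```

Monotonicity matters here because a quantile function must be non-decreasing. An unconstrained cubic spline can overshoot between knots, which would correspond to a negative density.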
[COMMENTS]
9+13 pages, 8+8 figures, ICML 2024
[LINK]
http://arxiv.org/abs/2401.02413v2
[DATE]
2024-07-22 23:37:39+08:00
[CATEGORIES]
cs.LG
Estimating Probability Densities with Transformer and Denoising Diffusion
[AUTHORS]
Henry W. Leung, Jo Bovy, Joshua S. Speagle
[ABSTRACT]
Transformers are often the go-to architecture to build foundation models that
ingest a large amount of training data. But these models do not estimate the
probability density distribution when trained on regression problems, yet
obtaining full probabilistic outputs is crucial to many fields of science,
where the probability distribution of the answer can be non-Gaussian and
multimodal. In this work, we demonstrate that training a probabilistic model
using a denoising diffusion head on top of the Transformer provides reasonable
probability density estimation even for high-dimensional inputs. The combined
Transformer+Denoising Diffusion model allows conditioning the output
probability density on arbitrary combinations of inputs and it is thus a highly
flexible density function emulator of all possible input/output combinations.
We illustrate our Transformer+Denoising Diffusion model by training it on a
large dataset of astronomical observations and measured labels of stars within
our Galaxy and we apply it to a variety of inference tasks to show that the
model can infer labels accurately with reasonable distributions.
[COMMENTS]
Accepted at the ICML 2024 Workshop on Foundation Models in the Wild
[LINK]
http://arxiv.org/abs/2407.15703v1
[DATE]
2024-07-22 23:10:41+08:00
[CATEGORIES]
cs.LG
Multimodal Explainability via Latent Shift applied to COVID-19 stratification
[AUTHORS]
Valerio Guarrasi, Lorenzo Tronchin, Domenico Albano, Eliodoro Faiella, Deborah Fazzini, Domiziana Santucci, Paolo Soda
[ABSTRACT]
We are witnessing a widespread adoption of artificial intelligence in
healthcare. However, most of the advancements in deep learning in this area
consider only unimodal data, neglecting other modalities. Their multimodal
interpretation is necessary for supporting diagnosis, prognosis and treatment
decisions. In this work we present a deep architecture, which jointly learns
modality reconstructions and sample classifications using tabular and imaging
data. The explanation of the decision taken is computed by applying a latent
shift that simulates a counterfactual prediction, revealing the features of
each modality that contribute the most to the decision and a quantitative score
indicating the modality importance. We validate our approach in the context of
the COVID-19 pandemic using the AIforCOVID dataset, which contains multimodal data
for the early identification of patients at risk of severe outcome. The results
show that the proposed method provides meaningful explanations without
degrading the classification performance.
[LINK]
http://arxiv.org/abs/2212.14084v2
[DATE]
2024-07-22 23:02:58+08:00
[CATEGORIES]
cs.LG
Fisher-Rao Gradient Flow: Geodesic Convexity and Functional Inequalities
[AUTHORS]
José A. Carrillo, Yifan Chen, Daniel Zhengyu Huang, Jiaoyang Huang, Dongyi Wei
[ABSTRACT]
The dynamics of probability density functions has been extensively studied in
science and engineering to understand physical phenomena and facilitate
algorithmic design. Of particular interest are dynamics that can be formulated
as gradient flows of energy functionals under the Wasserstein metric. The
development of functional inequalities, such as the log-Sobolev inequality,
plays a pivotal role in analyzing the convergence of these dynamics. The goal
of this paper is to parallel the success of techniques using functional
inequalities, for dynamics that are gradient flows under the Fisher-Rao metric,
with various $f$-divergences as energy functionals. Such dynamics take the form
of a nonlocal differential equation, for which existing analysis critically
relies on using the explicit solution formula in special cases. We provide a
comprehensive study on functional inequalities and the relevant geodesic
convexity for Fisher-Rao gradient flows under minimal assumptions. A notable
feature of the obtained functional inequalities is that they do not depend on
the log-concavity or log-Sobolev constants of the target distribution.
Consequently, the convergence rate of the dynamics (assuming well-posed) is
uniform across general target distributions, making them potentially desirable
dynamics for posterior sampling applications in Bayesian inference.
[COMMENTS]
38 pages
[LINK]
http://arxiv.org/abs/2407.15693v1
[DATE]
2024-07-22 23:00:14+08:00
[CATEGORIES]
cs.LG
Predictive Coding Networks and Inference Learning: Tutorial and Survey
[AUTHORS]
Björn van Zwol, Ro Jefferson, Egon L. van den Broek
[ABSTRACT]
Recent years have witnessed a growing call for renewed emphasis on
neuroscience-inspired approaches in artificial intelligence research, under the
banner of NeuroAI. A prime example of this is predictive coding networks
(PCNs), based on the neuroscientific framework of predictive coding. This
framework views the brain as a hierarchical Bayesian inference model that
minimizes prediction errors through feedback connections. Unlike traditional
neural networks trained with backpropagation (BP), PCNs utilize inference
learning (IL), a more biologically plausible algorithm that explains patterns
of neural activity that BP cannot. Historically, IL has been more
computationally intensive, but recent advancements have demonstrated that it
can achieve higher efficiency than BP with sufficient parallelization.
Furthermore, PCNs can be mathematically considered a superset of traditional
feedforward neural networks (FNNs), significantly extending the range of
trainable architectures. As inherently probabilistic (graphical) latent
variable models, PCNs provide a versatile framework for both supervised
learning and unsupervised (generative) modeling that goes beyond traditional
artificial neural networks. This work provides a comprehensive review and
detailed formal specification of PCNs, particularly situating them within the
context of modern ML methods. Additionally, we introduce a Python library
(PRECO) for practical implementation. This positions PC as a promising
framework for future ML innovations.
[COMMENTS]
46 pages, 13 figures, 8 tables
[LINK]
http://arxiv.org/abs/2407.04117v2
[DATE]
2024-07-22 22:56:46+08:00
[CATEGORIES]
cs.LG
SoftCVI: contrastive variational inference with self-generated soft labels
[AUTHORS]
Daniel Ward, Mark Beaumont, Matteo Fasiolo
[ABSTRACT]
Estimating a distribution given access to its unnormalized density is pivotal
in Bayesian inference, where the posterior is generally known only up to an
unknown normalizing constant. Variational inference and Markov chain Monte
Carlo methods are the predominant tools for this task; however, both methods
are often challenging to apply reliably, particularly when the posterior has
complex geometry. Here, we introduce Soft Contrastive Variational Inference
(SoftCVI), which allows a family of variational objectives to be derived
through a contrastive estimation framework. These objectives have zero variance
gradient when the variational approximation is exact, without the need for
specialized gradient estimators. The approach involves parameterizing a
classifier in terms of the variational distribution, which allows the inference
task to be reframed as a contrastive estimation problem, aiming to identify a
single true posterior sample among a set of samples. Despite this framing, we
do not require positive or negative samples, but rather learn by sampling the
variational distribution and computing ground truth soft classification labels
from the unnormalized posterior itself. We empirically investigate the
performance on a variety of Bayesian inference tasks, using both
simple (e.g. normal) and expressive (normalizing flow) variational
distributions. We find that SoftCVI objectives often outperform other commonly
used variational objectives.
[LINK]
http://arxiv.org/abs/2407.15687v1
[DATE]
2024-07-22 22:54:12+08:00
[CATEGORIES]
cs.LG
A Simple and Optimal Policy Design with Safety against Heavy-Tailed Risk for Stochastic Bandits
[AUTHORS]
David Simchi-Levi, Zeyu Zheng, Feng Zhu
[ABSTRACT]
We study the stochastic multi-armed bandit problem and design new policies
that enjoy both worst-case optimality for expected regret and light-tailed risk
for regret distribution. Specifically, our policy design (i) enjoys the
worst-case optimality for the expected regret at order $O(\sqrt{KT\ln T})$ and
(ii) has the worst-case tail probability of incurring a regret larger than any
$x>0$ being upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$, a rate that we prove
to be best achievable with respect to $T$ for all worst-case optimal policies.
Our proposed policy achieves a delicate balance between doing more exploration
at the beginning of the time horizon and doing more exploitation when
approaching the end, compared to standard confidence-bound-based policies. We
also enhance the policy design to accommodate the “any-time” setting where $T$
is unknown a priori, and prove equivalently desired policy performances as
compared to the “fixed-time” setting with known $T$. Numerical experiments are
conducted to illustrate the theoretical findings. We find that from a
managerial perspective, our new policy design yields better tail distributions
and is preferable to celebrated policies especially when (i) there is a risk
of under-estimating the volatility profile, or (ii) there is a challenge of
tuning policy hyper-parameters. We conclude by extending our proposed policy
design to the stochastic linear bandit setting that leads to both worst-case
optimality in terms of expected regret and light-tailed risk on the regret
distribution.
[COMMENTS]
Preliminary version appeared in NeurIPS 2022
[LINK]
http://arxiv.org/abs/2206.02969v6
[DATE]
2024-07-22 22:45:09+08:00
[CATEGORIES]
cs.LG
An Ad-hoc graph node vector embedding algorithm for general knowledge graphs using Kinetica-Graph
[AUTHORS]
B. Kaan Karamete, Eli Glaser
[ABSTRACT]
This paper discusses how to generate general graph node embeddings from
knowledge graph representations. The embedded space is composed of a number of
sub-features to mimic both local affinity and remote structural relevance.
These sub-feature dimensions are defined by several indicators that we
speculate to catch nodal similarities, such as hop-based topological patterns,
the number of overlapping labels, the transitional probabilities (Markov-chain
probabilities), and the cluster indices computed by our recursive spectral
bisection (RSB) algorithm. These measures are flattened over the one
dimensional vector space into their respective sub-component ranges such that
the entire set of vector similarity functions could be used for finding similar
nodes. The error is defined by the sum of pairwise square differences across a
randomly selected sample of graph nodes between the assumed embeddings and the
ground truth estimates as our novel loss function. The ground truth is
estimated to be a combination of pairwise Jaccard similarity and the number of
overlapping labels. Finally, we demonstrate a multi-variate stochastic gradient
descent (SGD) algorithm to compute the weighting factors among sub-vector spaces
to minimize the average error using a random sampling logic.
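
The error described above, a sum of squared differences between embedding similarity and a ground-truth similarity over sampled node pairs, can be sketched as follows. This is a heavily simplified sketch: the ground truth here uses only the Jaccard similarity of neighbor sets and leaves out the label-overlap term, cosine similarity is an assumed choice of vector similarity, and the toy graph, embeddings, and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two neighbor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def pairwise_loss(embeddings: Dict[str, List[float]],
                  neighbors: Dict[str, Set[str]],
                  pairs: List[Tuple[str, str]]) -> float:
    """Sum of squared gaps between embedding and ground-truth similarity."""
    return sum(
        (cosine(embeddings[a], embeddings[b]) - jaccard(neighbors[a], neighbors[b])) ** 2
        for a, b in pairs
    )


# Toy graph: nodes "a" and "b" share all neighbors, "c" shares none.
neighbors = {"a": {"x", "y"}, "b": {"x", "y"}, "c": {"z"}}
pairs = [("a", "b"), ("a", "c")]
good = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
bad = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [1.0, 0.0]}
loss_good = pairwise_loss(good, neighbors, pairs)
loss_bad = pairwise_loss(bad, neighbors, pairs)
```

In the paper's setting the pairs would be drawn at random and the weighting factors of the sub-vector spaces adjusted by SGD to reduce this error. That optimization loop is not shown here.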
[COMMENTS]
11 pages, 16 figures, 16 references
[LINK]
http://arxiv.org/abs/2407.15906v1
[DATE]
2024-07-22 22:43:10+08:00
[CATEGORIES]
cs.LG
vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving
[AUTHORS]
Jiale Xu, Rui Zhang, Cong Guo, Weiming Hu, Zihan Liu, Feiyang Wu, Yu Feng, Shixuan Sun, Changxu Shao, Yuhong Guo, Junping Zhao, Ke Zhang, Minyi Guo, Jingwen Leng
[ABSTRACT]
Large Language Models (LLMs) are widely used across various domains,
processing millions of daily requests. This surge in demand poses significant
challenges in optimizing throughput and latency while keeping costs manageable.
The Key-Value (KV) cache, a standard method for retaining previous
computations, makes LLM inference highly bounded by memory. While batching
strategies can enhance performance, they frequently lead to significant memory
fragmentation. Even though cutting-edge systems like vLLM mitigate KV cache
fragmentation using paged Attention mechanisms, they still suffer from
inefficient memory and computational operations due to the tightly coupled page
management and computation kernels.
This study introduces the vTensor, an innovative tensor structure for LLM
inference based on GPU virtual memory management (VMM). vTensor addresses
existing limitations by decoupling computation from memory defragmentation and
offering dynamic extensibility. Our framework employs a CPU-GPU heterogeneous
approach, ensuring efficient, fragmentation-free memory management while
accommodating various computation kernels across different LLM architectures.
Experimental results indicate that vTensor achieves an average speedup of 1.86x
across different models, with up to 2.42x in multi-turn chat scenarios.
Additionally, vTensor provides average speedups of 2.12x and 3.15x in kernel
evaluation, reaching up to 3.92x and 3.27x compared to SGLang Triton
prefix-prefilling kernels and vLLM paged Attention kernel, respectively.
Furthermore, it frees approximately 71.25% (57GB) of memory on the NVIDIA A100
GPU compared to vLLM, enabling more memory-intensive workloads.
[COMMENTS]
16 pages, 12 figures
[LINK]
http://arxiv.org/abs/2407.15309v1
[DATE]
2024-07-22 22:37:58+08:00
[CATEGORIES]
cs.LG
Automated Deterministic Auction Design with Objective Decomposition
[AUTHORS]
Zhijian Duan, Haoran Sun, Yichong Xia, Siqiang Wang, Zhilin Zhang, Chuan Yu, Jian Xu, Bo Zheng, Xiaotie Deng
[ABSTRACT]
Identifying high-revenue mechanisms that are both dominant strategy incentive
compatible (DSIC) and individually rational (IR) is a fundamental challenge in
auction design. While theoretical approaches have encountered bottlenecks in
multi-item auctions, there has been much empirical progress in automatically
designing such mechanisms using machine learning. However, existing research
primarily focuses on randomized auctions, with less attention given to the more
practical deterministic auctions. Therefore, this paper investigates the
automated design of deterministic auctions and introduces OD-VVCA, an objective
decomposition approach for automatically designing Virtual Valuations Combinatorial
Auctions (VVCAs). Firstly, we restrict our mechanism to deterministic VVCAs,
which are inherently DSIC and IR. Afterward, we utilize a parallelizable
dynamic programming algorithm to compute the allocation and revenue outcomes of
a VVCA efficiently. We then decompose the revenue objective function into
continuous and piecewise constant discontinuous components, optimizing each
using distinct methods. Extensive experiments show that OD-VVCA achieves high
revenue in multi-item auctions, especially in large-scale settings where it
outperforms both randomized and deterministic baselines, indicating its
efficacy and scalability.
[LINK]
http://arxiv.org/abs/2402.11904v2
[DATE]
2024-07-22 22:32:46+08:00
[CATEGORIES]
cs.LG
Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models
[AUTHORS]
Aayush Saxena, Arit Kumar Bishwas, Ayush Ashok Mishra, Ryan Armstrong
[ABSTRACT]
Deep learning models have achieved tremendous success in most of the
industries in recent years. The evolution of these models has also led to an
increase in the model size and energy requirement, making it difficult to
deploy in production on low compute devices. An increase in the number of
connected devices around the world warrants compressed models that can be
easily deployed at the local devices with low compute capacity and power
accessibility. Researchers have proposed a wide range of solutions to reduce the
size and complexity of such models; prominent among them are weight
quantization, parameter pruning, network pruning, low-rank representation,
weight sharing, neural architecture search, and knowledge distillation. In this
research work, we investigate the performance impact of quantization and
pruning on various trained deep learning models. We implemented both
compression techniques on popular deep learning models used for image
classification, object detection, language modeling, and generative
problem statements. We also explored the performance of various large language
models (LLMs) after quantization and low-rank adaptation. We used the standard
evaluation metrics (model’s size, accuracy, and inference time) for all the
related problem statements and concluded this paper by discussing the
challenges and future work.
[LINK]
http://arxiv.org/abs/2407.15904v1
[DATE]
2024-07-22 22:20:53+08:00
[CATEGORIES]
cs.LG
How to Shrink Confidence Sets for Many Equivalent Discrete Distributions?
[AUTHORS]
Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
[ABSTRACT]
We consider the situation when a learner faces a set of unknown discrete
distributions $(p_k)_{k\in \mathcal K}$ defined over a common alphabet
$\mathcal X$, and can build for each distribution $p_k$ an individual
high-probability confidence set thanks to $n_k$ observations sampled from
$p_k$. The set $(p_k)_{k\in \mathcal K}$ is structured: each distribution $p_k$
is obtained from the same common, but unknown, distribution $q$ via applying an
unknown permutation to $\mathcal X$. We call this
\emph{permutation-equivalence}. The goal is to build refined confidence sets
\emph{exploiting} this structural property. Like other popular notions of
structure (Lipschitz smoothness, Linearity, etc.) permutation-equivalence
naturally appears in machine learning problems, and to benefit from its
potential gain calls for a specific approach. We present a strategy to
effectively exploit permutation-equivalence, and provide a finite-time
high-probability bound on the size of the refined confidence sets output by the
strategy. Since a refinement is not possible for too few observations in
general, under mild technical assumptions, our finite-time analysis establishes
when the numbers of observations $(n_k)_{k\in \mathcal K}$ are large enough so
that the output confidence sets improve over initial individual sets. We
carefully characterize this event and the corresponding improvement. Further,
our result implies that the size of confidence sets shrinks at asymptotic rates
of $O(1/\sqrt{\sum_{k\in \mathcal K} n_k})$ and $O(1/\max_{k\in \mathcal K} n_{k})$,
respectively for elements inside and outside the support of $q$, when the size of
each individual confidence set shrinks at respective rates of $O(1/\sqrt{n_k})$
and $O(1/n_k)$. We illustrate the practical benefit of exploiting permutation
equivalence on a reinforcement learning task.
[LINK]
http://arxiv.org/abs/2407.15662v1
[DATE]
2024-07-22 22:19:19+08:00
[CATEGORIES]
cs.LG
MuTT: A Multimodal Trajectory Transformer for Robot Skills
[AUTHORS]
Claudius Kienle, Benjamin Alt, Onur Celik, Philipp Becker, Darko Katic, Rainer Jäkel, Gerhard Neumann
[ABSTRACT]
High-level robot skills represent an increasingly popular paradigm in robot
programming. However, configuring the skills’ parameters for a specific task
remains a manual and time-consuming endeavor. Existing approaches for learning
or optimizing these parameters often require numerous real-world executions or
do not work in dynamic environments. To address these challenges, we propose
MuTT, a novel encoder-decoder transformer architecture designed to predict
environment-aware executions of robot skills by integrating vision, trajectory,
and robot skill parameters. Notably, we pioneer the fusion of vision and
trajectory, introducing a novel trajectory projection. Furthermore, we
illustrate MuTT’s efficacy as a predictor when combined with a model-based
robot skill optimizer. This approach facilitates the optimization of robot
skill parameters for the current environment, without the need for real-world
executions during optimization. Designed for compatibility with any
representation of robot skills, MuTT demonstrates its versatility across three
comprehensive experiments, showcasing superior performance across two different
skill representations.
[LINK]
http://arxiv.org/abs/2407.15660v1
[DATE]
2024-07-22 22:18:52+08:00
[CATEGORIES]
cs.LG
Quantum Normalizing Flows for Anomaly Detection
[AUTHORS]
Bodo Rosenhahn, Christoph Hirche
[ABSTRACT]
A Normalizing Flow computes a bijective mapping from an arbitrary
distribution to a predefined (e.g. normal) distribution. Such a flow can be
used to address different tasks, e.g. anomaly detection, once such a mapping
has been learned. In this work we introduce Normalizing Flows for Quantum
architectures, describe how to model and optimize such a flow and evaluate our
method on example datasets. Our proposed models show competitive performance
for anomaly detection compared to classical methods, especially those for
which quantum-inspired algorithms are already available. In the experiments we
compare our performance to isolation forests (IF), the local outlier factor
(LOF), and single-class SVMs.
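The idea of scoring anomalies with a learned bijective mapping can be sketched classically. The example below is only an illustration under stated assumptions: it uses a one-dimensional affine flow fitted by mean and standard deviation, not the quantum architecture of the paper, and the names `fit_affine_flow`, `log_likelihood`, and `anomaly_score` are hypothetical.

```python
import math

def fit_affine_flow(samples):
    """Fit z = (x - mu) / sigma so the training data maps to roughly N(0, 1)."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, math.sqrt(var)

def log_likelihood(x, mu, sigma):
    """Change of variables: log p(x) = log N(z; 0, 1) + log |dz/dx|."""
    z = (x - mu) / sigma
    log_base = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    log_det = -math.log(sigma)  # dz/dx = 1 / sigma for the affine map
    return log_base + log_det

def anomaly_score(x, mu, sigma):
    """Higher score means the point is less likely under the fitted flow."""
    return -log_likelihood(x, mu, sigma)

# Toy data (made up for illustration): values clustered around 10.
train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
mu, sigma = fit_affine_flow(train)
```

A point far from the training data, such as 15.0, receives a higher score than a typical point such as 10.0.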
[COMMENTS]
v3: 15 pages, 8 figures
[LINK]
http://arxiv.org/abs/2402.02866v3
[DATE]
2024-07-22 22:18:42+08:00
[CATEGORIES]
cs.LG
Link Polarity Prediction from Sparse and Noisy Labels via Multiscale Social Balance
[AUTHORS]
Marco Minici, Federico Cinus, Francesco Bonchi, Giuseppe Manco
[ABSTRACT]
Signed Graph Neural Networks (SGNNs) have recently gained attention as an
effective tool for several learning tasks on signed networks, i.e., graphs
where edges have an associated polarity. One of these tasks is to predict the
polarity of the links for which this information is missing, starting from the
network structure and the other available polarities. However, when the
available polarities are few and potentially noisy, such a task becomes
challenging.
In this work, we devise a semi-supervised learning framework that builds
around the novel concept of \emph{multiscale social balance} to improve the
prediction of link polarities in settings characterized by limited data
quantity and quality. Our model-agnostic approach can seamlessly integrate with
any SGNN architecture, dynamically reweighting the importance of each data
sample while making strategic use of the structural information from unlabeled
edges combined with social balance theory.
Empirical validation demonstrates that our approach outperforms established
baseline models, effectively addressing the limitations imposed by noisy and
sparse data. This result underlines the benefits of incorporating multiscale
social balance into SGNNs, opening new avenues for robust and accurate
predictions in signed network analysis.
[LINK]
http://arxiv.org/abs/2407.15643v1
[DATE]
2024-07-22 22:02:28+08:00
[CATEGORIES]
cs.LG
Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models
[AUTHORS]
Shahan Nercessian, Johannes Imort, Ninon Devis, Frederik Blang
[ABSTRACT]
In this paper, we propose and investigate the use of neural audio codec
language models for the automatic generation of sample-based musical
instruments based on text or reference audio prompts. Our approach extends a
generative audio framework to condition on pitch across an 88-key spectrum,
velocity, and a combined text/audio embedding. We identify maintaining timbral
consistency within the generated instruments as a major challenge. To tackle
this issue, we introduce three distinct conditioning schemes. We analyze our
methods through objective metrics and human listening tests, demonstrating that
our approach can produce compelling musical instruments. Specifically, we
introduce a new objective metric to evaluate the timbral consistency of the
generated instruments and adapt the average Contrastive Language-Audio
Pretraining (CLAP) score for the text-to-instrument case, noting that its naive
application is unsuitable for assessing this task. Our findings reveal a
complex interplay between timbral consistency, the quality of generated
samples, and their correspondence to the input prompt.
[COMMENTS]
8 pages, 2 figures. Accepted to the 25th Conference of the
International Society for Music Information Retrieval (ISMIR)
[LINK]
http://arxiv.org/abs/2407.15641v1
[DATE]
2024-07-22 21:59:58+08:00
[CATEGORIES]
cs.LG
360VFI: A Dataset and Benchmark for Omnidirectional Video Frame Interpolation
[AUTHORS]
Wenxuan Lu, Mengshun Hu, Yansheng Qiu, Liang Liao, Zheng Wang
[ABSTRACT]
With the development of VR-related techniques, viewers can enjoy a realistic
and immersive experience through a head-mounted display, while omnidirectional
video with a low frame rate can lead to user dizziness. However, the prevailing
plane frame interpolation methodologies are unsuitable for Omnidirectional
Video Interpolation, chiefly due to the lack of models tailored to such videos
with strong distortion, compounded by the scarcity of valuable datasets for
Omnidirectional Video Frame Interpolation. In this paper, we introduce the
benchmark dataset, 360VFI, for Omnidirectional Video Frame Interpolation. We
present a practical implementation that introduces a distortion prior from
omnidirectional video into the network to modulate distortions. We especially
propose a pyramid distortion-sensitive feature extractor that uses the unique
characteristics of equirectangular projection (ERP) format as prior
information. Moreover, we devise a decoder that uses an affine transformation
to facilitate the synthesis of intermediate frames further. 360VFI is the first
dataset and benchmark that explores the challenge of Omnidirectional Video
Frame Interpolation. In our benchmark analysis, we present scenes under four
different distortion conditions in the proposed 360VFI dataset to
evaluate the challenge that distortion poses during interpolation. Besides,
experimental results demonstrate that Omnidirectional Video Interpolation can
be effectively improved by modeling for omnidirectional distortion.
[LINK]
http://arxiv.org/abs/2407.14066v2
[DATE]
2024-07-22 21:50:55+08:00
[CATEGORIES]
cs.LG
Ridge Estimation with Nonlinear Transformations
[AUTHORS]
Zheng Zhai, Hengchao Chen, Zhigang Yao
[ABSTRACT]
Ridge estimation is an important manifold learning technique. The goal of
this paper is to examine the effects of nonlinear transformations on the ridge
sets. The main result proves the inclusion relationship between ridges:
$\mathcal{R}(f\circ p)\subseteq \mathcal{R}(p)$, provided that the transformation $f$ is
strictly increasing and concave on the range of the function $p$. Additionally,
given an underlying true manifold $\mathcal{M}$, we show that the Hausdorff distance
between $\mathcal{R}(f\circ p)$ and its projection onto $\mathcal{M}$ is smaller than the
Hausdorff distance between $\mathcal{R}(p)$ and the corresponding projection. This
motivates us to apply an increasing and concave transformation before the ridge
estimation. Specifically, we show that the power transformations
$f^{q}(y)=y^q/q,-\infty<q\leq 1$ are increasing and concave on $\mathbb{R}_+$, and
thus we can use such power transformations when $p$ is strictly positive.
Numerical experiments demonstrate the advantages of the proposed methods.
[COMMENTS]
There are some flaws in the proofs for Lemma 1 and Theorem 1. We want
to withdraw this version to prevent any potential misunderstanding for
readers
[LINK]
http://arxiv.org/abs/2306.05722v3
[DATE]
2024-07-22 21:48:36+08:00
[CATEGORIES]
cs.LG
Learning Non-Vacuous Generalization Bounds from Optimization
[AUTHORS]
Chengli Tan, Jiangshe Zhang, Junmin Liu
[ABSTRACT]
One of the fundamental challenges in the deep learning community is to
theoretically understand how well a deep neural network generalizes to unseen
data. However, current approaches often yield generalization bounds that are
either too loose to be informative of the true generalization error or
valid only for compressed networks. In this study, we present a simple yet
non-vacuous generalization bound from the optimization perspective. We achieve
this goal by leveraging the fact that the hypothesis set accessed by stochastic
gradient algorithms is essentially fractal-like, which lets us derive a tighter
bound over the algorithm-dependent Rademacher complexity. The main argument rests on
modeling the discrete-time recursion process via a continuous-time stochastic
differential equation driven by fractional Brownian motion. Numerical studies
demonstrate that our approach is able to yield plausible generalization
guarantees for modern neural networks such as ResNet and Vision Transformer,
even when they are trained on a large-scale dataset (e.g. ImageNet-1K).
[COMMENTS]
35 pages
[LINK]
http://arxiv.org/abs/2206.04359v2
[DATE]
2024-07-22 21:47:46+08:00
[CATEGORIES]
cs.LG
Sample-Efficient Linear Representation Learning from Non-IID Non-Isotropic Data
[AUTHORS]
Thomas T. C. K. Zhang, Leonardo F. Toso, James Anderson, Nikolai Matni
[ABSTRACT]
A powerful concept behind much of the recent progress in machine learning is
the extraction of common features across data from heterogeneous sources or
tasks. Intuitively, using all of one’s data to learn a common representation
function benefits both computational effort and statistical generalization by
leaving a smaller number of parameters to fine-tune on a given task. Toward
theoretically grounding these merits, we propose a general setting of
recovering linear operators $M$ from noisy vector measurements $y = Mx + w$,
where the covariates $x$ may be both non-i.i.d. and non-isotropic. We
demonstrate that existing isotropy-agnostic representation learning approaches
incur biases on the representation update, which causes the scaling of the
noise terms to lose favorable dependence on the number of source tasks. This in
turn can cause the sample complexity of representation learning to be
bottlenecked by the single-task data size. We introduce an adaptation,
$\texttt{De-bias \& Feature-Whiten}$ ($\texttt{DFW}$), of the popular
alternating minimization-descent scheme proposed independently in Collins et
al. (2021) and Nayer and Vaswani (2022), and establish linear convergence to
the optimal representation with noise level scaling down with the
$\textit{total}$ source data size. This leads to generalization bounds on the
same order as an oracle empirical risk minimizer. We verify the vital
importance of $\texttt{DFW}$ on various numerical simulations. In particular,
we show that vanilla alternating-minimization descent fails catastrophically
even for i.i.d. but mildly non-isotropic data. Our analysis unifies and
generalizes prior work, and provides a flexible framework for a wider range of
applications, such as in controls and dynamical systems.
[COMMENTS]
Appeared at ICLR 2024 (spotlight presentation)
[LINK]
http://arxiv.org/abs/2308.04428v2
[DATE]
2024-07-22 21:36:50+08:00
[CATEGORIES]
cs.LG
Machine-learning-based particle identification with missing data
[AUTHORS]
Miłosz Kasak, Kamil Deja, Maja Karwowska, Monika Jakubowska, Łukasz Graczykowski, Małgorzata Janik
[ABSTRACT]
In this work, we introduce a novel method for Particle Identification (PID)
within the scope of the ALICE experiment at the Large Hadron Collider at CERN.
Identifying products of ultrarelativistic collisions delivered by the LHC is
one of the crucial objectives of ALICE. Typically employed PID methods rely on
hand-crafted selections, which compare experimental data to theoretical
simulations. To improve the performance of the baseline methods, novel
approaches use machine learning models that learn the proper assignment in a
classification task. However, because of the various detection techniques used
by different subdetectors, as well as the limited detector efficiency and
acceptance, produced particles do not always yield signals in all of the ALICE
components. This results in data with missing values. Machine learning
techniques cannot be trained with such examples, so a significant part of the
data is skipped during training. In this work, we propose the first method for
PID that can be trained with all of the available data examples, including
incomplete ones. Our approach improves the PID purity and efficiency of the
selected sample for all investigated particle species.
[LINK]
http://arxiv.org/abs/2401.01905v2
[DATE]
2024-07-22 21:33:25+08:00
[CATEGORIES]
cs.LG
Distance-based mutual congestion feature selection with genetic algorithm for high-dimensional medical datasets
[AUTHORS]
Hossein Nematzadeh, Joseph Mani, Zahra Nematzadeh, Ebrahim Akbari, Radziah Mohamad
[ABSTRACT]
Feature selection poses a challenge in small-sample high-dimensional
datasets, where the number of features exceeds the number of observations, as
seen in microarray, gene expression, and medical datasets. No feature
selection method is universally optimal across data distributions,
so the literature continues to address
this issue. One recent approach in feature selection is termed frequency-based
feature selection. However, existing methods in this domain tend to overlook
feature values, focusing solely on the distribution in the response variable.
In response, this paper introduces the Distance-based Mutual Congestion (DMC)
as a filter method that considers both the feature values and the distribution
of observations in the response variable. DMC sorts the features of datasets,
and the top 5% are retained and clustered by KMeans to mitigate
multicollinearity. This is achieved by randomly selecting one feature from each
cluster. The selected features form the feature space, and the search space for
the Genetic Algorithm with Adaptive Rates (GAwAR) will be approximated using
this feature space. GAwAR approximates the combination of the top 10 features
that maximizes prediction accuracy within a wrapper scheme. To prevent
premature convergence, GAwAR adaptively updates the crossover and mutation
rates. The hybrid DMC-GAwAR is applicable to binary classification datasets,
and experimental results demonstrate its superiority over some recent works.
The implementation and corresponding data are available at
https://github.com/hnematzadeh/DMC-GAwAR
[LINK]
http://arxiv.org/abs/2407.15611v1
[DATE]
2024-07-22 21:08:50+08:00
[CATEGORIES]
cs.LG
Graph Condensation: A Survey
[AUTHORS]
Xinyi Gao, Junliang Yu, Tong Chen, Guanhua Ye, Wentao Zhang, Hongzhi Yin
[ABSTRACT]
The rapid growth of graph data poses significant challenges in storage,
transmission, and particularly the training of graph neural networks (GNNs). To
address these challenges, graph condensation (GC) has emerged as an innovative
solution. GC focuses on synthesizing a compact yet highly representative graph,
enabling GNNs trained on it to achieve performance comparable to those trained
on the original large graph. The notable efficacy of GC and its broad prospects
have garnered significant attention and spurred extensive research. This survey
paper provides an up-to-date and systematic overview of GC, organizing existing
research into five categories aligned with critical GC evaluation criteria:
effectiveness, generalization, efficiency, fairness, and robustness. To
facilitate an in-depth and comprehensive understanding of GC, this paper
examines various methods under each category and thoroughly discusses two
essential components within GC: optimization strategies and condensed graph
generation. We also empirically compare and analyze representative GC methods
with diverse optimization strategies based on the five proposed GC evaluation
criteria. Finally, we explore the applications of GC in various fields, outline
the related open-source libraries, and highlight the present challenges and
novel insights, with the aim of promoting advancements in future research. The
related resources can be found at
https://github.com/XYGaoG/Graph-Condensation-Papers.
[LINK]
http://arxiv.org/abs/2401.11720v2
[DATE]
2024-07-22 20:39:21+08:00
[CATEGORIES]
cs.LG
Discrete Flow Matching
[AUTHORS]
Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, Yaron Lipman
[ABSTRACT]
Despite Flow Matching and diffusion models having emerged as powerful
generative paradigms for continuous variables such as images and videos, their
application to high-dimensional discrete data, such as language, is still
limited. In this work, we present Discrete Flow Matching, a novel discrete flow
paradigm designed specifically for generating discrete data. Discrete Flow
Matching offers several key contributions: (i) it works with a general family
of probability paths interpolating between source and target distributions;
(ii) it allows for a generic formula for sampling from these probability paths
using learned posteriors such as the probability denoiser ($x$-prediction) and
noise-prediction ($\epsilon$-prediction); (iii) practically, focusing on
specific probability paths defined with different schedulers considerably
improves generative perplexity compared to previous discrete diffusion and flow
models; and (iv) by scaling Discrete Flow Matching models up to 1.7B
parameters, we reach 6.7% Pass@1 and 13.4% Pass@10 on HumanEval and 6.7% Pass@1
and 20.6% Pass@10 on 1-shot MBPP coding benchmarks. Our approach is capable of
generating high-quality discrete data in a non-autoregressive fashion,
significantly closing the gap between autoregressive models and discrete flow
models.
[LINK]
http://arxiv.org/abs/2407.15595v1
[DATE]
2024-07-22 20:33:27+08:00
[CATEGORIES]
cs.LG
Data driven weather forecasts trained and initialised directly from observations
[AUTHORS]
Anthony McNally, Christian Lessig, Peter Lean, Eulalie Boucher, Mihai Alexe, Ewan Pinnington, Matthew Chantry, Simon Lang, Chris Burrows, Marcin Chrust, Florian Pinault, Ethel Villeneuve, Niels Bormann, Sean Healy
[ABSTRACT]
Skilful Machine Learned weather forecasts have challenged our approach to
numerical weather prediction, demonstrating competitive performance compared to
traditional physics-based approaches. Data-driven systems have been trained to
forecast future weather by learning from long historical records of past
weather such as the ECMWF ERA5. These datasets have been made freely available
to the wider research community, including the commercial sector, which has
been a major factor in the rapid rise of ML forecast systems and the levels of
accuracy they have achieved. However, historical reanalyses used for training
and real-time analyses used for initial conditions are produced by data
assimilation, an optimal blending of observations with a physics-based forecast
model. As such, many ML forecast systems have an implicit and unquantified
dependence on the physics-based models they seek to challenge. Here we propose
a new approach, training a neural network to predict future weather purely from
historical observations with no dependence on reanalyses. We use raw
observations to initialise a model of the atmosphere (in observation space)
learned directly from the observations themselves. Forecasts of crucial weather
parameters (such as surface temperature and wind) are obtained by predicting
weather parameter observations (e.g. SYNOP surface data) at future times and
arbitrary locations. We present preliminary results on forecasting observations
12 hours into the future. These already demonstrate successful learning of time
evolutions of the physical processes captured in real observations. We argue
that this new approach, by staying purely in observation space, avoids many of
the challenges of traditional data assimilation, can exploit a wider range of
observations and is readily expanded to simultaneous forecasting of the full
Earth system (atmosphere, land, ocean and composition).
[LINK]
http://arxiv.org/abs/2407.15586v1
[DATE]
2024-07-22 20:23:26+08:00
[CATEGORIES]
cs.LG
KANQAS: Kolmogorov-Arnold Network for Quantum Architecture Search
[AUTHORS]
Akash Kundu, Aritra Sarkar, Abhishek Sadhu
[ABSTRACT]
Quantum architecture search (QAS) is a promising direction for optimization
and automated design of quantum circuits towards quantum advantage. Recent
techniques in QAS focus on machine learning-based approaches from reinforcement
learning, like deep Q-network. While multi-layer perceptron-based deep
Q-networks have been applied for QAS, their interpretability remains
challenging due to the high number of parameters. In this work, we evaluate the
practicality of Kolmogorov-Arnold Networks (KANs) in QAS problems, analyzing
their efficiency in the task of quantum state preparation and quantum
chemistry. In quantum state preparation, our results show that in a noiseless
scenario, the probability of success and the number of optimal quantum circuit
configurations to generate the multi-qubit maximally entangled states are
$2\times$ to $5\times$ higher than with multi-layer perceptrons (MLPs). Moreover, in
noisy scenarios, KAN can achieve better fidelity in approximating maximally
entangled states than MLPs, where the performance of the MLP significantly
depends on the choice of activation function. In tackling quantum chemistry
problems, we enhance the recently proposed QAS algorithm by integrating
Curriculum Reinforcement Learning (CRL) with a KAN structure instead of the
traditional MLP. This modification allows us to design a parameterized quantum
circuit that contains fewer 2-qubit gates and has a shallower depth, thereby
improving the efficiency of finding the ground state of a chemical Hamiltonian.
Further investigation reveals that KAN requires a significantly smaller number
of learnable parameters compared to MLPs; however, the average time of
executing each episode for KAN is higher.
[COMMENTS]
11 pages and 5 figures, 7 tables. New experiments added and typo
removed
[LINK]
http://arxiv.org/abs/2406.17630v2
[DATE]
2024-07-22 20:00:43+08:00
[CATEGORIES]
cs.LG
A New Theoretical Perspective on Data Heterogeneity in Federated Optimization
[AUTHORS]
Jiayi Wang, Shiqiang Wang, Rong-Rong Chen, Mingyue Ji
[COMMENTS]
ICML 2024
[LINK]
http://arxiv.org/abs/2407.15567v1
[DATE]
2024-07-22 19:52:58+08:00
[CATEGORIES]
cs.LG
The Rlign Algorithm for Enhanced Electrocardiogram Analysis through R-Peak Alignment for Explainable Classification and Clustering
[AUTHORS]
Lucas Plagwitz, Lucas Bickmann, Michael Fujarski, Alexander Brenner, Warnes Gobalakrishnan, Lars Eckardt, Antonius Büscher, Julian Varghese
[ABSTRACT]
Electrocardiogram (ECG) recordings have long been vital in diagnosing
different cardiac conditions. Recently, research in the field of automatic ECG
processing using machine learning methods has gained importance, mainly by
utilizing deep learning methods on raw ECG signals. A major advantage of models
like convolutional neural networks (CNNs) is their ability to effectively
process biomedical imaging or signal data. However, this strength is tempered
by challenges related to their lack of explainability, the need for a large
amount of training data, and the complexities involved in adapting them for
unsupervised clustering tasks. To address these challenges, we aim to reintroduce
shallow learning techniques, including support vector machines and principal
components analysis, into ECG signal processing by leveraging their
semi-structured, cyclic form. To this end, we developed and evaluated a
transformation that effectively restructures ECG signals into a fully
structured format, facilitating their subsequent analysis using shallow
learning algorithms. In this study, we present this adaptive transformative
approach that aligns R-peaks across all signals in a dataset and resamples the
segments between R-peaks, both with and without heart rate dependencies. We
illustrate the substantial benefit of this transformation for traditional
analysis techniques in the areas of classification, clustering, and
explainability, outperforming commercial software for median beat
transformation and CNN approaches. Our approach demonstrates a significant
advantage for shallow machine learning methods over CNNs, especially when
dealing with limited training data. Additionally, we release a fully tested and
publicly accessible code framework, providing a robust alignment pipeline to
support future research, available at https://github.com/imi-ms/rlign.
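The resampling step described above can be sketched as follows. This is a minimal illustration, assuming the R-peak sample indices are already known and using plain linear interpolation; it does not reproduce the API of the rlign package, and `resample_linear` and `align_beats` are hypothetical names.

```python
def resample_linear(segment, target_len):
    """Linearly interpolate `segment` onto `target_len` evenly spaced points."""
    if target_len < 2 or len(segment) < 2:
        raise ValueError("need at least two samples in and out")
    scale = (len(segment) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = min(int(pos), len(segment) - 2)  # left neighbour index
        frac = pos - lo
        out.append(segment[lo] * (1.0 - frac) + segment[lo + 1] * frac)
    return out

def align_beats(signal, r_peaks, target_len):
    """Cut the signal at consecutive R-peaks and resample each beat
    to the same length, so every beat starts and ends on an R-peak."""
    beats = []
    for start, end in zip(r_peaks[:-1], r_peaks[1:]):
        beats.append(resample_linear(signal[start:end + 1], target_len))
    return beats

# Toy signal (made up): R-peaks at indices 1, 4, 8 with unequal spacing.
signal = [0, 1, 0, 0, 1, 0, 0, 0, 1]
beats = align_beats(signal, [1, 4, 8], target_len=5)
```

After this step every beat has the same number of samples, which gives the fixed-width feature vectors that shallow methods such as support vector machines or principal components analysis expect.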
[LINK]
http://arxiv.org/abs/2407.15555v1
[DATE]
2024-07-22 19:34:47+08:00
[CATEGORIES]
cs.LG
Consistent Diffusion Meets Tweedie: Training Exact Ambient Diffusion Models with Noisy Data
[AUTHORS]
Giannis Daras, Alexandros G. Dimakis, Constantinos Daskalakis
[ABSTRACT]
Ambient diffusion is a recently proposed framework for training diffusion
models using corrupted data. Both Ambient Diffusion and alternative SURE-based
approaches for learning diffusion models from corrupted data resort to
approximations which deteriorate performance. We present the first framework
for training diffusion models that provably sample from the uncorrupted
distribution given only noisy training data, solving an open problem in this
space. Our key technical contribution is a method that uses a double
application of Tweedie’s formula and a consistency loss function that allows us
to extend sampling at noise levels below the observed data noise. We also
provide further evidence that diffusion models memorize from their training
sets by identifying extremely corrupted images that are almost perfectly
reconstructed, raising copyright and privacy concerns. Our method for training
using corrupted samples can be used to mitigate this problem. We demonstrate
this by fine-tuning Stable Diffusion XL to generate samples from a distribution
using only noisy samples. Our framework reduces the amount of memorization of
the fine-tuning dataset, while maintaining competitive performance.
[COMMENTS]
Accepted to ICML 2024
[LINK]
http://arxiv.org/abs/2404.10177v2
[DATE]
2024-07-22 19:31:08+08:00
[CATEGORIES]
cs.LG
One Size Fits All for Semantic Shifts: Adaptive Prompt Tuning for Continual Learning
[AUTHORS]
Doyoung Kim, Susik Yoon, Dongmin Park, Youngjun Lee, Hwanjun Song, Jihwan Bang, Jae-Gil Lee
[COMMENTS]
ICML 2024
[LINK]
http://arxiv.org/abs/2311.12048v2
[DATE]
2024-07-22 19:11:28+08:00
[CATEGORIES]
cs.LG
Inverted Activations
[AUTHORS]
Georgii Novikov, Ivan Oseledets
[ABSTRACT]
The scaling of neural networks with increasing data and model sizes
necessitates more efficient deep learning algorithms. This paper addresses the
memory footprint challenge in neural network training by proposing a
modification to the handling of activation tensors in pointwise nonlinearity
layers. Traditionally, these layers save the entire input tensor for the
backward pass, leading to substantial memory use. Our method involves saving
the output tensor instead, reducing the memory required when the subsequent
layer also saves its input tensor. This approach is particularly beneficial for
transformer-based architectures like GPT, BERT, Mistral, and Llama. Applying
our method involves taking the inverse function of the nonlinearity. To the best
of our knowledge, this cannot be done analytically, so we instead build
accurate approximations using simpler functions. Experimental results confirm
that our method significantly reduces memory usage without affecting training
accuracy. The implementation is available at
https://github.com/PgLoLo/optiacts.
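To make the "save the output instead of the input" idea concrete, here is a small sketch that uses tanh as a stand-in nonlinearity. This choice is an assumption for illustration: the derivative of tanh has a closed form in terms of its output, whereas the abstract says the nonlinearities it targets need an approximate inverse. The function names are hypothetical and unrelated to the linked repository.

```python
import math

def tanh_forward(x):
    """Return the output; this is the only value kept for the backward pass."""
    return math.tanh(x)

def tanh_backward_from_output(y, grad_out):
    """d tanh(x)/dx = 1 - tanh(x)^2 = 1 - y^2, so the input x is not needed."""
    return grad_out * (1.0 - y * y)

# Forward pass stores y only; backward pass recovers the gradient from y.
x = 0.7
y = tanh_forward(x)
grad_x = tanh_backward_from_output(y, grad_out=1.0)
```

Because the next layer usually stores the same tensor `y` as its own input, keeping `y` rather than `x` lets the two layers share one saved tensor.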
[LINK]
http://arxiv.org/abs/2407.15545v1
[DATE]
2024-07-22 19:11:17+08:00
[CATEGORIES]
cs.LG
Harnessing Quantum Support Vector Machines for Cross-Domain Classification of Quantum States
[AUTHORS]
Diksha Sharma, Vivek Balasaheb Sabale, Parvinder Singh, Atul Kumar
[ABSTRACT]
In the present study, we use cross-domain classification using quantum
machine learning for quantum advantages to readdress the entanglement versus
separability paradigm. The inherent structure of quantum states and its
relation to a particular class of quantum states are used to intuitively
classify testing states from domains different from training states, called
cross-domain classification. Using our quantum machine learning
algorithm, we demonstrate efficient classifications of two-qubit mixed states
into entangled and separable classes. For analyzing the quantumness of
correlations, our model adequately classifies Bell diagonal states as zero and
non-zero discord states. In addition, we also extend our analysis to evaluate
the robustness of our model using random local unitary transformations. Our
results demonstrate the potential of the quantum support vector machine for
classifying quantum states across the multi-dimensional Hilbert space in
comparison to classical support vector machines and neural networks.
[LINK]
http://arxiv.org/abs/2407.00774v2
[DATE]
2024-07-22 19:06:22+08:00
[CATEGORIES]
cs.LG
Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints
[AUTHORS]
Shiqing Gao, Jiaxin Ding, Luoyi Fu, Xinbing Wang, Chenghu Zhou
[ABSTRACT]
In Constrained Reinforcement Learning (CRL), agents explore the environment
to learn the optimal policy while satisfying constraints. The penalty function
method has recently been studied as an effective approach for handling
constraints, which imposes constraint penalties on the objective to transform
the constrained problem into an unconstrained one. However, it is challenging
to choose appropriate penalties that balance policy performance and constraint
satisfaction efficiently. In this paper, we propose a theoretically guaranteed
penalty function method, Exterior Penalty Policy Optimization (EPO), with
adaptive penalties generated by a Penalty Metric Network (PMN). PMN responds
appropriately to varying degrees of constraint violations, enabling efficient
constraint satisfaction and safe exploration. We theoretically prove that EPO
consistently improves constraint satisfaction with a convergence guarantee. We
propose a new surrogate function and provide worst-case constraint violation
and approximation error. In practice, we propose an effective smooth penalty
function, which can be easily implemented with a first-order optimizer.
Extensive experiments are conducted, showing that EPO outperforms the baselines
in terms of policy performance and constraint satisfaction with a stable
training process, particularly on complex tasks.
[COMMENTS]
To be published in the 33rd International Joint Conference on
Artificial Intelligence (IJCAI 2024)
[LINK]
http://arxiv.org/abs/2407.15537v1
[DATE]
2024-07-22 18:57:32+08:00
[CATEGORIES]
cs.LG
Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks
[AUTHORS]
Eugenio Lomurno, Matteo Matteucci
[ABSTRACT]
Generative artificial intelligence has transformed the generation of
synthetic data, providing innovative solutions to challenges like data scarcity
and privacy, which are particularly critical in fields such as medicine.
However, the effective use of this synthetic data to train high-performance
models remains a significant challenge. This paper addresses this issue by
introducing Knowledge Recycling (KR), a pipeline designed to optimise the
generation and use of synthetic data for training downstream classifiers. At
the heart of this pipeline is Generative Knowledge Distillation (GKD), the
proposed technique that significantly improves the quality and usefulness of
the information provided to classifiers through a synthetic dataset
regeneration and soft labelling mechanism. The KR pipeline has been tested on a
variety of datasets, with a focus on six highly heterogeneous medical image
datasets, ranging from retinal images to organ scans. The results show a
significant reduction in the performance gap between models trained on real and
synthetic data, with models based on synthetic data outperforming those trained
on real data in some cases. Furthermore, the resulting models show almost
complete immunity to Membership Inference Attacks, manifesting privacy
properties missing in models trained with conventional techniques.
[LINK]
http://arxiv.org/abs/2407.15526v1
[DATE]
2024-07-22 18:31:07+08:00
[CATEGORIES]
cs.LG
Semantic Communication for Cooperative Multi-Task Processing over Wireless Networks
[AUTHORS]
Ahmad Halimi Razlighi, Carsten Bockelmann, Armin Dekorsy
[ABSTRACT]
In this paper, we investigated semantic communication for multi-task
processing using an information-theoretic approach. We introduced the concept
of a “semantic source”, allowing multiple semantic interpretations from a
single observation. We formulated an end-to-end optimization problem taking
into account the communication channel, maximizing mutual information (infomax)
to design the semantic encoding and decoding process exploiting the statistical
relations between semantic variables. To solve the problem we perform
data-driven deep learning employing variational approximation techniques. Our
semantic encoder is divided into a common unit and multiple specific units to
facilitate cooperative multi-task processing. Simulation results demonstrate
the effectiveness of our proposed semantic source and system design when
statistical relationships exist, comparing cooperative task processing with
independent task processing. However, our findings highlight that cooperative
multi-tasking is not always beneficial, emphasizing the importance of
statistical relationships between tasks and indicating the need for further
investigation into the semantic processing of multiple tasks.
[COMMENTS]
This work has been submitted to the IEEE Wireless Communications
Letters for possible publication
[LINK]
http://arxiv.org/abs/2404.08483v4
[DATE]
2024-07-22 18:30:21+08:00
[CATEGORIES]
cs.LG
Multiple importance sampling for stochastic gradient estimation
[AUTHORS]
Corentin Salaün, Xingchang Huang, Iliyan Georgiev, Niloy J. Mitra, Gurprit Singh
[ABSTRACT]
We introduce a theoretical and practical framework for efficient importance
sampling of mini-batch samples for gradient estimation from single and multiple
probability distributions. To handle noisy gradients, our framework dynamically
evolves the importance distribution during training by utilizing a
self-adaptive metric. Our framework combines multiple, diverse sampling
distributions, each tailored to specific parameter gradients. This approach
facilitates the importance sampling of vector-valued gradient estimation.
Rather than naively combining multiple distributions, our framework involves
optimally weighting data contribution across multiple distributions. This
adapted combination of multiple importance yields superior gradient estimates,
leading to faster training convergence. We demonstrate the effectiveness of our
approach through empirical evaluations across a range of optimization tasks
like classification and regression on both image and point cloud datasets.
[COMMENTS]
13 pages, 11 figures
[LINK]
http://arxiv.org/abs/2407.15525v1
[DATE]
2024-07-22 18:28:56+08:00
[CATEGORIES]
cs.LG
MSegRNN: Enhanced SegRNN Model with Mamba for Long-Term Time Series Forecasting
[AUTHORS]
GaoXiang Zhao, XiaoQiang Wang
[ABSTRACT]
The field of long-term time series forecasting demands handling extensive
look-back windows and long-range prediction steps, posing significant
challenges for RNN-based methodologies. Among these, SegRNN, a robust
RNN-driven model, has gained considerable attention in LTSF analysis for
achieving state-of-the-art results while maintaining a remarkably streamlined
architecture. Concurrently, the Mamba structure has demonstrated its advantages
in small to medium-sized models due to its capability for information
selection. This study introduces a variant of SegRNN that preprocesses
information using a fine-tuned single-layer Mamba structure. Additionally, it
incorporates implicit segmentation and residual structures into the model’s
encoding section to further reduce the inherent data iterative cycles of RNN
architectures and implicitly integrate inter-channel correlations. This
variant, named MSegRNN, utilizes the Mamba structure to select useful
information, resulting in a transformed sequence. The linear-strategy-adapted
derivative retains the superior memory efficiency of the original SegRNN while
demonstrating enhanced performance. Empirical evaluations on real-world LTSF
datasets demonstrate the superior performance of our model, thereby
contributing to the advancement of LTSF methodologies.
[LINK]
http://arxiv.org/abs/2407.10768v2
[DATE]
2024-07-22 18:26:41+08:00
[CATEGORIES]
cs.LG
DP-DyLoRA: Fine-Tuning Transformer-Based Models On-Device under Differentially Private Federated Learning using Dynamic Low-Rank Adaptation
[AUTHORS]
Jie Xu, Karthikeyan Saravanan, Rogier van Dalen, Haaris Mehmood, David Tuckey, Mete Ozay
[ABSTRACT]
Federated learning (FL) allows clients to collaboratively train a global
model without sharing their local data with a server. However, clients’
contributions to the server can still leak sensitive information. Differential
privacy (DP) addresses such leakage by providing formal privacy guarantees,
with mechanisms that add randomness to the clients’ contributions. The
randomness makes it infeasible to train large transformer-based models, common
in modern federated learning systems. In this work, we empirically evaluate the
practicality of fine-tuning large scale on-device transformer-based models with
differential privacy in a federated learning system. We conduct comprehensive
experiments on various system properties for tasks spanning a multitude of
domains: speech recognition, computer vision (CV) and natural language
understanding (NLU). Our results show that full fine-tuning under
differentially private federated learning (DP-FL) generally leads to huge
performance degradation which can be alleviated by reducing the dimensionality
of contributions through parameter-efficient fine-tuning (PEFT). Our benchmarks
of existing DP-PEFT methods show that DP-Low-Rank Adaptation (DP-LoRA)
consistently outperforms other methods. An even more promising approach,
DyLoRA, which makes the low rank variable, when naively combined with FL would
straightforwardly break differential privacy. We therefore propose an
adaptation method that can be combined with differential privacy and call it
DP-DyLoRA. Finally, we are able to reduce the accuracy degradation and word
error rate (WER) increase due to DP to less than 2% and 7% respectively with 1
million clients and a stringent privacy budget of $\epsilon=2$.
[COMMENTS]
16 pages, 10 figures, 5 tables
[LINK]
http://arxiv.org/abs/2405.06368v3
[DATE]
2024-07-22 18:21:49+08:00
[CATEGORIES]
cs.LG
Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
[AUTHORS]
Shiming Chen, Wenjin Hou, Salman Khan, Fahad Shahbaz Khan
[ABSTRACT]
Zero-shot learning (ZSL) recognizes the unseen classes by conducting
visual-semantic interactions to transfer semantic knowledge from seen classes
to unseen ones, supported by semantic information (e.g., attributes). However,
existing ZSL methods simply extract visual features using a pre-trained network
backbone (i.e., CNN or ViT) and thus fail to learn matched visual-semantic
correspondences for representing semantic-related visual features, as they lack
the guidance of semantic information, resulting in undesirable visual-semantic
interactions. To tackle this issue, we propose a progressive semantic-guided
vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT mainly
considers two properties in the whole network: i) discover the semantic-related
visual representations explicitly, and ii) discard the semantic-unrelated
visual information. Specifically, we first introduce semantic-embedded token
learning to improve the visual-semantic correspondences via semantic
enhancement and discover the semantic-related visual tokens explicitly with
semantic-guided token attention. Then, we fuse low semantic-visual
correspondence visual tokens to discard the semantic-unrelated visual
information for visual enhancement. These two operations are integrated into
various encoders to progressively learn semantic-related visual representations
for accurate visual-semantic interactions in ZSL. The extensive experiments
show that our ZSLViT achieves significant performance gains on three popular
benchmark datasets, i.e., CUB, SUN, and AWA2. Codes are available at:
https://github.com/shiming-chen/ZSLViT .
[COMMENTS]
Accepted to CVPR’24
[LINK]
http://arxiv.org/abs/2404.07713v2
[DATE]
2024-07-22 18:09:39+08:00
[CATEGORIES]
cs.LG
RepCodec: A Speech Representation Codec for Speech Tokenization
[AUTHORS]
Zhichao Huang, Chutong Meng, Tom Ko
[ABSTRACT]
With recent rapid growth of large language models (LLMs), discrete speech
tokenization has played an important role for injecting speech into LLMs.
However, this discretization gives rise to a loss of information, consequently
impairing overall performance. To improve the performance of these discrete
speech tokens, we present RepCodec, a novel speech representation codec for
semantic speech tokenization. In contrast to audio codecs which reconstruct the
raw audio, RepCodec learns a vector quantization codebook through
reconstructing speech representations from speech encoders like HuBERT or
data2vec. Together, the speech encoder, the codec encoder and the vector
quantization codebook form a pipeline for converting speech waveforms into
semantic tokens. The extensive experiments illustrate that RepCodec, by virtue
of its enhanced information retention capacity, significantly outperforms the
widely used k-means clustering approach in both speech understanding and
generation. Furthermore, this superiority extends across various speech
encoders and languages, affirming the robustness of RepCodec. We believe our
method can facilitate large language modeling research on speech processing.
[COMMENTS]
ACL 2024 (Main)
[LINK]
http://arxiv.org/abs/2309.00169v3
[DATE]
2024-07-22 17:53:44+08:00
[CATEGORIES]
cs.LG
PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
[AUTHORS]
Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
[ABSTRACT]
Computer vision methods that explicitly detect object parts and reason on
them are a step towards inherently interpretable models. Existing approaches
that perform part discovery driven by a fine-grained classification task make
very restrictive assumptions on the geometric properties of the discovered
parts; they should be small and compact. Although this prior is useful in some
cases, in this paper we show that pre-trained transformer-based vision models,
such as self-supervised DINOv2 ViT, enable the relaxation of these constraints.
In particular, we find that a total variation (TV) prior, which allows for
multiple connected components of any size, substantially outperforms previous
work. We test our approach on three fine-grained classification benchmarks:
CUB, PartImageNet and Oxford Flowers, and compare our results to previously
published methods as well as a re-implementation of the state-of-the-art method
PDiscoNet with a transformer-based backbone. We consistently obtain substantial
improvements across the board, both on part discovery metrics and the
downstream classification task, showing that the strong inductive biases in
self-supervised ViT models require rethinking the geometric priors that can be
used for unsupervised part discovery.
[COMMENTS]
Accepted as a main conference paper at the European Conference of
Computer Vision (ECCV) 2024
[LINK]
http://arxiv.org/abs/2407.04538v3
[DATE]
2024-07-22 17:41:39+08:00
[CATEGORIES]
cs.LG
Towards diffusion models for large-scale sea-ice modelling
[AUTHORS]
Tobias Sebastian Finn, Charlotte Durand, Alban Farchi, Marc Bocquet, Julien Brajard
[ABSTRACT]
We make the first steps towards diffusion models for unconditional generation
of multivariate and Arctic-wide sea-ice states. While aiming to reduce
computational costs by diffusing in latent space, latent diffusion models also
offer the possibility to integrate physical knowledge into the generation
process. We tailor latent diffusion models to sea-ice physics with a censored
Gaussian distribution in data space to generate data that follows the physical
bounds of the modelled variables. Our latent diffusion models reach scores
similar to those of the diffusion model trained in data space, but they smooth
the generated fields, an effect caused by the latent mapping. While enforcing physical
bounds cannot reduce the smoothing, it improves the representation of the
marginal ice zone. Therefore, for large-scale Earth system modelling, latent
diffusion models can have many advantages compared to diffusion in data space
if the significant barrier of smoothing can be resolved.
[COMMENTS]
21 pages, 5 Figures, Camera-ready version for the ICML 2024 Machine
Learning for Earth System Modeling workshop
[LINK]
http://arxiv.org/abs/2406.18417v2
[DATE]
2024-07-22 17:35:36+08:00
[CATEGORIES]
cs.LG
MidiCaps: A large-scale MIDI dataset with text captions
[AUTHORS]
Jan Melechovsky, Abhinaba Roy, Dorien Herremans
[ABSTRACT]
Generative models guided by text prompts are becoming increasingly
popular. However, no text-to-MIDI models currently exist due to the lack of a
captioned MIDI dataset. This work aims to enable research that combines LLMs
with symbolic music by presenting MidiCaps, the first openly available large-scale MIDI
dataset with text captions. MIDI (Musical Instrument Digital Interface) files
are widely used for encoding musical information and can capture the nuances of
musical composition. They are widely used by music producers, composers,
musicologists, and performers alike. Inspired by recent advancements in
captioning techniques, we present a curated dataset of over 168k MIDI files
with textual descriptions. Each MIDI caption describes the musical content,
including tempo, chord progression, time signature, instruments, genre, and
mood, thus facilitating multi-modal exploration and analysis. The dataset
encompasses various genres, styles, and complexities, offering a rich data
source for training and evaluating models for tasks such as music information
retrieval, music understanding, and cross-modal translation. We provide
detailed statistics about the dataset and have assessed the quality of the
captions in an extensive listening study. We anticipate that this resource will
stimulate further research at the intersection of music and natural language
processing, fostering advancements in both fields.
[COMMENTS]
Accepted in ISMIR2024
[LINK]
http://arxiv.org/abs/2406.02255v2
[DATE]
2024-07-22 17:34:46+08:00
[CATEGORIES]
cs.LG
CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph
[AUTHORS]
Haitao Lin, Guojiang Zhao, Odin Zhang, Yufei Huang, Lirong Wu, Zicheng Liu, Siyuan Li, Cheng Tan, Zhifeng Gao, Stan Z. Li
[ABSTRACT]
Structure-based drug design (SBDD) aims to generate potential drugs that can
bind to a target protein and is greatly expedited by the aid of AI techniques
in generative models. However, a lack of systematic understanding persists due
to the diverse settings, complex implementation, difficult reproducibility, and
task singularity. Firstly, the absence of standardization can lead to unfair
comparisons and inconclusive insights. To address this dilemma, we propose
CBGBench, a comprehensive benchmark for SBDD, that unifies the task as a
generative heterogeneous graph completion, analogous to fill-in-the-blank of
the 3D complex binding graph. By categorizing existing methods based on their
attributes, CBGBench facilitates a modular and extensible framework that
implements various cutting-edge methods. Secondly, a single task on de
novo molecule generation can hardly reflect their capabilities. To broaden the
scope, we have adapted these models to a range of tasks essential in drug
design, which are considered sub-tasks within the graph fill-in-the-blank
tasks. These tasks include the generative designation of de novo
molecules, linkers, fragments, scaffolds, and sidechains, all conditioned on
the structures of protein pockets. Our evaluations are conducted with fairness,
encompassing comprehensive perspectives on interaction, chemical properties,
geometry authenticity, and substructure validity. We further provide the
pre-trained versions of the state-of-the-art models and deep insights with
analysis from empirical studies. The codebase for CBGBench is publicly
accessible at https://github.com/Edapinenut/CBGBench.
[COMMENTS]
9 pages main context
[LINK]
http://arxiv.org/abs/2406.10840v2
[DATE]
2024-07-22 17:22:37+08:00
[CATEGORIES]
cs.LG
Generalizing Denoising to Non-Equilibrium Structures Improves Equivariant Force Fields
[AUTHORS]
Yi-Lun Liao, Tess Smidt, Muhammed Shuaibi, Abhishek Das
[ABSTRACT]
Understanding the interactions of atoms such as forces in 3D atomistic
systems is fundamental to many applications like molecular dynamics and
catalyst design. However, simulating these interactions requires
compute-intensive ab initio calculations and thus results in limited data for
training neural networks. In this paper, we propose to use denoising
non-equilibrium structures (DeNS) as an auxiliary task to better leverage
training data and improve performance. For training with DeNS, we first corrupt
a 3D structure by adding noise to its 3D coordinates and then predict the
noise. Different from previous works on denoising, which are limited to
equilibrium structures, the proposed method generalizes denoising to a much
larger set of non-equilibrium structures. The main difference is that a
non-equilibrium structure does not correspond to local energy minima and has
non-zero forces, and therefore it can have many possible atomic positions
compared to an equilibrium structure. This makes denoising non-equilibrium
structures an ill-posed problem since the target of denoising is not uniquely
defined. Our key insight is to additionally encode the forces of the original
non-equilibrium structure to specify which non-equilibrium structure we are
denoising. Concretely, given a corrupted non-equilibrium structure and the
forces of the original one, we predict the non-equilibrium structure satisfying
the input forces instead of any arbitrary structures. Since DeNS requires
encoding forces, DeNS favors equivariant networks, which can easily incorporate
forces and other higher-order tensors in node embeddings. We study the
effectiveness of training equivariant networks with DeNS on OC20, OC22 and MD17
datasets and demonstrate that DeNS can achieve new state-of-the-art results on
OC20 and OC22 and significantly improve training efficiency on MD17.
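The corruption step described above can be sketched as a small data-preparation routine: add Gaussian noise to the 3D coordinates, keep the forces of the original structure as an extra input, and use the noise as the prediction target. This is a minimal sketch under stated assumptions; the function name, the `sigma` value, and the dictionary layout are hypothetical and not taken from the paper.

```python
import random

def make_dens_example(positions, forces, sigma=0.1, seed=0):
    """Build one denoising example from a non-equilibrium structure.

    positions, forces: lists of [x, y, z] per atom (original structure).
    Returns the corrupted positions plus original forces as inputs,
    and the sampled noise as the regression target.
    """
    rng = random.Random(seed)
    # Sample independent Gaussian noise for every coordinate (assumed noise model).
    noise = [[rng.gauss(0.0, sigma) for _ in atom] for atom in positions]
    noisy = [[p + n for p, n in zip(atom, eps)]
             for atom, eps in zip(positions, noise)]
    # Forces of the ORIGINAL structure identify which structure to recover.
    return {"inputs": (noisy, forces), "target": noise}
```

Passing the original forces alongside the corrupted coordinates is what makes the target well defined here: without them, many non-equilibrium structures would be equally valid answers.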
[LINK]
http://arxiv.org/abs/2403.09549v2
[DATE]
2024-07-22 17:22:09+08:00
[CATEGORIES]
cs.LG
What is Dataset Distillation Learning?
[AUTHORS]
William Yang, Ye Zhu, Zhiwei Deng, Olga Russakovsky
[ABSTRACT]
Dataset distillation has emerged as a strategy to overcome the hurdles
associated with large datasets by learning a compact set of synthetic data that
retains essential information from the original dataset. While distilled data
can be used to train high performing models, little is understood about how the
information is stored. In this study, we posit and answer three questions about
the behavior, representativeness, and point-wise information content of
distilled data. We reveal distilled data cannot serve as a substitute for real
data during training outside the standard evaluation setting for dataset
distillation. Additionally, the distillation process retains high task
performance by compressing information related to the early training dynamics
of real models. Finally, we provide a framework for interpreting distilled
data and reveal that individual distilled data points contain meaningful
semantic information. This investigation sheds light on the intricate nature of
distilled data, providing a better understanding of how they can be effectively
utilized.
[COMMENTS]
ICML 2024
[LINK]
http://arxiv.org/abs/2406.04284v2
[DATE]
2024-07-22 17:11:04+08:00
[CATEGORIES]
cs.LG
Decentralized Entropic Optimal Transport for Distributed Distribution Comparison
[AUTHORS]
Xiangfeng Wang, Hongteng Xu, Moyi Yang
[ABSTRACT]
Distributed distribution comparison aims to measure the distance between the
distributions whose data are scattered across different agents in a distributed
system and cannot even be shared directly among the agents. In this study, we
propose a novel decentralized entropic optimal transport (DEOT) method, which
provides a communication-efficient and privacy-preserving solution to this
problem with theoretical guarantees. In particular, we design a mini-batch
randomized block-coordinate descent (MRBCD) scheme to optimize the DEOT
distance in its dual form. The dual variables are scattered across different
agents and updated locally and iteratively with limited communications among
partial agents. The kernel matrix involved in the gradients of the dual
variables is estimated by a decentralized kernel approximation method, in which
each agent only needs to approximate and store a sub-kernel matrix by one-shot
communication and without sharing raw data. Besides computing entropic
Wasserstein distance, we show that the proposed MRBCD scheme and kernel
approximation method also apply to entropic Gromov-Wasserstein distance. We
analyze our method’s communication complexity and, under mild assumptions,
provide a theoretical bound for the approximation error caused by the
convergence error, the estimated kernel, and the mismatch between the storage
and communication protocols. In addition, we discuss the trade-off between the
precision of the EOT distance and the strength of privacy protection when
implementing our method. Experiments on synthetic data and real-world
distributed domain adaptation tasks demonstrate the effectiveness of our
method.
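For orientation, the entropic optimal transport problem itself can be solved in the centralized setting with the standard Sinkhorn iteration, which alternately rescales the kernel matrix to match each marginal. The sketch below is that textbook baseline only; it does not implement the decentralized MRBCD scheme or the kernel approximation of the paper, and the function name and defaults are assumptions.

```python
import math

def sinkhorn(a, b, cost, eps=0.5, n_iters=500):
    """Entropic OT plan between histograms a and b for a given cost matrix."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j.
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

The scaling vectors `u` and `v` are exponentials of the dual variables, so each update is a coordinate step on the dual problem, the form in which the abstract says the DEOT distance is optimized.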
[LINK]
http://arxiv.org/abs/2301.12065v2
[DATE]
2024-07-22 17:06:22+08:00
[CATEGORIES]
cs.LG
Meta-Learning and representation learner: A short theoretical note
[AUTHORS]
Mouad El Bouchattaoui
[ABSTRACT]
Meta-learning, or “learning to learn,” is a subfield of machine learning
where the goal is to develop models and algorithms that can learn from various
tasks and improve their learning process over time. Unlike traditional machine
learning methods focusing on learning a specific task, meta-learning aims to
leverage experience from previous tasks to enhance future learning. This
approach is particularly beneficial in scenarios where the available data for a
new task is limited, but there exists abundant data from related tasks. By
extracting and utilizing the underlying structure and patterns across these
tasks, meta-learning algorithms can achieve faster convergence and better
performance with less data. The following notes are mainly inspired by
\cite{vanschoren2018meta}, \cite{baxter2019learning}, and
\cite{maurer2005algorithmic}.
[LINK]
http://arxiv.org/abs/2407.04189v2
[DATE]
2024-07-22 16:45:22+08:00
[CATEGORIES]
cs.LG
Regression under demographic parity constraints via unlabeled post-processing
[AUTHORS]
Evgenii Chzhen, Mohamed Hebiri, Gayane Taturyan
[ABSTRACT]
We address the problem of performing regression while ensuring demographic
parity, even without access to sensitive attributes during inference. We
present a general-purpose post-processing algorithm that, using accurate
estimates of the regression function and a sensitive attribute predictor,
generates predictions that meet the demographic parity constraint. Our method
involves discretization and stochastic minimization of a smooth convex
function. It is suitable for online post-processing and multi-class
classification tasks only involving unlabeled data for the post-processing.
Unlike prior methods, our approach is fully theory-driven. We require precise
control over the gradient norm of the convex function, and thus, we rely on
more advanced techniques than standard stochastic gradient descent. Our
algorithm is backed by finite-sample analysis and post-processing bounds, with
experimental results validating our theoretical findings.
[LINK]
http://arxiv.org/abs/2407.15453v1
[DATE]
2024-07-22 16:11:58+08:00
[CATEGORIES]
cs.LG
GraphScale: A Framework to Enable Machine Learning over Billion-node Graphs
[AUTHORS]
Vipul Gupta, Xin Chen, Ruoyun Huang, Fanlong Meng, Jianjun Chen, Yujun Yan
[ABSTRACT]
Graph Neural Networks (GNNs) have emerged as powerful tools for supervised
machine learning over graph-structured data, while sampling-based node
representation learning is widely utilized in unsupervised learning. However,
scalability remains a major challenge in both supervised and unsupervised
learning for large graphs (e.g., those with over 1 billion nodes). The
scalability bottleneck largely stems from the mini-batch sampling phase in GNNs
and the random walk sampling phase in unsupervised methods. These processes
often require storing features or embeddings in memory. In the context of
distributed training, they require frequent, inefficient random access to data
stored across different workers. Such repeated inter-worker communication for
each mini-batch leads to high communication overhead and computational
inefficiency.
We propose GraphScale, a unified framework for both supervised and
unsupervised learning that stores and processes large graph data in a distributed manner. The
key insight in our design is the separation of workers who store data and those
who perform the training. This separation allows us to decouple computing and
storage in graph training, thus effectively building a pipeline where data
fetching and data computation can overlap asynchronously. Our experiments show
that GraphScale outperforms state-of-the-art methods for distributed training
of both GNNs and node embeddings. We evaluate GraphScale both on public and
proprietary graph datasets and observe a reduction of at least 40% in
end-to-end training times compared to popular distributed frameworks, without
any loss in performance. While most existing methods don’t support billion-node
graphs for training node embeddings, GraphScale is currently deployed in
production at TikTok enabling efficient learning over such large graphs.
[COMMENTS]
Published in the Proceedings of the 33rd ACM International Conference
on Information and Knowledge Management (CIKM 2024), 8 Pages, 12 Figures
[LINK]
http://arxiv.org/abs/2407.15452v1
[DATE]
2024-07-22 16:09:36+08:00
[CATEGORIES]
cs.LG
A Benchmark Study of Deep-RL Methods for Maximum Coverage Problems over Graphs
[AUTHORS]
Zhicheng Liang, Yu Yang, Xiangyu Ke, Xiaokui Xiao, Yunjun Gao
[ABSTRACT]
Recent years have witnessed a growing trend toward employing deep
reinforcement learning (Deep-RL) to derive heuristics for combinatorial
optimization (CO) problems on graphs. Maximum Coverage Problem (MCP) and its
probabilistic variant on social networks, Influence Maximization (IM), have
been particularly prominent in this line of research. In this paper, we present
a comprehensive benchmark study that thoroughly investigates the effectiveness
and efficiency of five recent Deep-RL methods for MCP and IM. These methods
were published in top data science venues, namely S2V-DQN, Geometric-QN, GCOMB,
RL4IM, and LeNSE. Our findings reveal that, across various scenarios, the Lazy
Greedy algorithm consistently outperforms all Deep-RL methods for MCP. In the
case of IM, theoretically sound algorithms like IMM and OPIM demonstrate
superior performance compared to Deep-RL methods in most scenarios. Notably, we
observe an abnormal phenomenon in the IM problem, where Deep-RL methods slightly
outperform IMM and OPIM when the influence spread barely increases as
the budget increases. Furthermore, our experimental results highlight common
issues when applying Deep-RL methods to MCP and IM in practical settings.
Finally, we discuss potential avenues for improving Deep-RL methods. Our
benchmark study sheds light on potential challenges in current deep
reinforcement learning research for solving combinatorial optimization
problems.
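As a rough illustration of the Lazy Greedy baseline mentioned above, the following is a minimal sketch of the standard lazy-evaluation greedy heuristic for maximum coverage. It is not the authors' benchmark code; the function name `lazy_greedy_max_coverage` and the input layout (a list of Python sets plus an integer budget) are assumptions made for this example.

```python
import heapq


def lazy_greedy_max_coverage(sets, budget):
    """Pick up to `budget` sets covering as many elements as possible.

    Hypothetical helper for illustration: `sets` is a list of Python sets.
    """
    covered = set()
    chosen = []
    # Max-heap via negated gains. Each entry is (-gain, index, stamp), where
    # stamp is the number of sets already chosen when the gain was computed.
    heap = [(-len(s), i, 0) for i, s in enumerate(sets)]
    heapq.heapify(heap)
    while heap and len(chosen) < budget:
        neg_gain, i, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # The gain is current. Stale gains are upper bounds (coverage is
            # submodular), so this entry is the true best choice.
            if neg_gain == 0:
                break  # nothing left to gain
            chosen.append(i)
            covered |= sets[i]
        else:
            # Stale entry: recompute its marginal gain and push it back.
            gain = len(sets[i] - covered)
            heapq.heappush(heap, (-gain, i, len(chosen)))
    return chosen, covered


chosen, covered = lazy_greedy_max_coverage(
    [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1}], budget=2
)
print(chosen, sorted(covered))
```

The lazy step only recomputes the marginal gain of the entry at the top of the heap, which is why this variant typically evaluates far fewer candidates per round than plain greedy while returning the same selection.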
[COMMENTS]
This paper has been accepted by VLDB 2024
[LINK]
http://arxiv.org/abs/2406.14697v2
[DATE]
2024-07-22 16:03:26+08:00
[CATEGORIES]
cs.LG
Ensemble Kalman Filtering Meets Gaussian Process SSM for Non-Mean-Field and Online Inference
[AUTHORS]
Zhidi Lin, Yiyong Sun, Feng Yin, Alexandre Hoang Thiéry
[ABSTRACT]
Gaussian process state-space models (GPSSMs) represent a versatile class
of data-driven nonlinear dynamical system models. However, the presence of
numerous latent variables in GPSSM incurs unresolved issues for existing
variational inference approaches, particularly under the more realistic
non-mean-field (NMF) assumption, including extensive training effort,
compromised inference accuracy, and infeasibility for online applications,
among others. In this paper, we tackle these challenges by incorporating the
ensemble Kalman filter (EnKF), a well-established model-based filtering
technique, into the NMF variational inference framework to approximate the
posterior distribution of the latent states. This novel marriage between EnKF
and GPSSM not only eliminates the need for extensive parameterization in
learning variational distributions, but also enables an interpretable,
closed-form approximation of the evidence lower bound (ELBO). Moreover, owing
to the streamlined parameterization via the EnKF, the new GPSSM model can be
easily accommodated in online learning applications. We demonstrate that the
resulting EnKF-aided online algorithm embodies a principled objective function
by ensuring data-fitting accuracy while incorporating model regularizations to
mitigate overfitting. We also provide detailed analysis and fresh insights for
the proposed algorithms. Comprehensive evaluation across diverse real and
synthetic datasets corroborates the superior learning and inference performance
of our EnKF-aided variational inference algorithms compared to existing
methods.
[COMMENTS]
Gaussian process, state-space model, ensemble Kalman filter, online
learning, variational inference
[LINK]
http://arxiv.org/abs/2312.05910v5
[DATE]
2024-07-22 15:45:40+08:00
[CATEGORIES]
cs.LG
Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays
[AUTHORS]
Ziqun Chen, Kechao Cai, Zhuoyue Chen, Jinbei Zhang, John C. S. Lui
[ABSTRACT]
We study the stochastic combinatorial semi-bandit problem with unrestricted
feedback delays under merit-based fairness constraints. This is motivated by
applications such as crowdsourcing and online advertising, where feedback
is not immediately available and fairness among different choices (or
arms) is crucial. We consider two types of unrestricted feedback delays:
reward-independent delays where the feedback delays are independent of the
rewards, and reward-dependent delays where the feedback delays are correlated
with the rewards. Furthermore, we introduce merit-based fairness constraints to
ensure a fair selection of the arms. We define the reward regret and the
fairness regret and present new bandit algorithms to select arms under
unrestricted feedback delays based on their merits. We prove that our
algorithms all achieve sublinear expected reward regret and expected fairness
regret, with a dependence on the quantiles of the delay distribution. We also
conduct extensive experiments using synthetic and real-world data and show that
our algorithms can fairly select arms with different feedback delays.
[COMMENTS]
28 pages, 9 figures, accepted for 27th European Conference on
Artificial Intelligence
[LINK]
http://arxiv.org/abs/2407.15439v1
[DATE]
2024-07-22 15:36:27+08:00
[CATEGORIES]
cs.LG
Pre-Training and Prompting for Few-Shot Node Classification on Text-Attributed Graphs
[AUTHORS]
Huanjing Zhao, Beining Yang, Yukuo Cen, Junyu Ren, Chenhui Zhang, Yuxiao Dong, Evgeny Kharlamov, Shu Zhao, Jie Tang
[ABSTRACT]
The text-attributed graph (TAG) is an important kind of real-world
graph-structured data in which each node is associated with raw texts. For TAGs,
traditional few-shot node classification methods directly conduct training on
the pre-processed node features and do not consider the raw texts. The
performance is highly dependent on the choice of the feature pre-processing
method. In this paper, we propose P2TAG, a framework designed for few-shot node
classification on TAGs with graph pre-training and prompting. P2TAG first
pre-trains the language model (LM) and graph neural network (GNN) on TAGs with
self-supervised loss. To fully utilize the ability of language models, we adapt
the masked language modeling objective for our framework. The pre-trained model
is then used for the few-shot node classification with a mixed prompt method,
which simultaneously considers both text and graph information. We conduct
experiments on six real-world TAGs, including paper citation networks and
product co-purchasing networks. Experimental results demonstrate that our
proposed framework outperforms existing graph few-shot learning methods on
these datasets, with improvements of +18.98% to +35.98%.
[COMMENTS]
Accepted to KDD’24
[LINK]
http://arxiv.org/abs/2407.15431v1
[DATE]
2024-07-22 15:24:21+08:00
[CATEGORIES]
cs.LG
Speed-accuracy trade-off for the diffusion models: Wisdom from nonequilibrium thermodynamics and optimal transport
[AUTHORS]
Kotaro Ikeda, Tomoya Uda, Daisuke Okanohara, Sosuke Ito
[ABSTRACT]
We discuss a connection between a generative model, called the diffusion
model, and nonequilibrium thermodynamics for the Fokker-Planck equation, called
stochastic thermodynamics. Based on the techniques of stochastic
thermodynamics, we derive the speed-accuracy trade-off for the diffusion
models, which is a trade-off relationship between the speed and accuracy of
data generation in diffusion models. Our result implies that the entropy
production rate in the forward process affects the errors in data generation.
From a stochastic thermodynamic perspective, our results provide quantitative
insight into how best to generate data in diffusion models. The optimal
learning protocol is introduced by the conservative force in stochastic
thermodynamics and the geodesic of space by the 2-Wasserstein distance in
optimal transport theory. We numerically illustrate the validity of the
speed-accuracy trade-off for the diffusion models with different noise
schedules such as the cosine schedule, the conditional optimal transport, and
the optimal transport.
[COMMENTS]
26 pages, 5 figures
[LINK]
http://arxiv.org/abs/2407.04495v3
[DATE]
2024-07-22 15:19:24+08:00
[CATEGORIES]
cs.LG
LLM4ED: Large Language Models for Automatic Equation Discovery
[AUTHORS]
Mengge Du, Yuntian Chen, Zhongzheng Wang, Longfeng Nie, Dongxiao Zhang
[ABSTRACT]
Equation discovery is aimed at directly extracting physical laws from data
and has emerged as a pivotal research domain. Previous methods based on
symbolic mathematics have achieved substantial advancements, but often require
the design and implementation of complex algorithms. In this paper, we introduce
a new framework that utilizes natural language-based prompts to guide large
language models (LLMs) in automatically mining governing equations from data.
Specifically, we first utilize the generation capability of LLMs to generate
diverse equations in string form, and then evaluate the generated equations
based on observations. In the optimization phase, we propose two alternately
iterated strategies to optimize generated equations collaboratively. The first
strategy is to take LLMs as a black-box optimizer and achieve equation
self-improvement based on historical samples and their performance. The second
strategy is to instruct LLMs to perform evolutionary operators for global
search. Experiments are extensively conducted on both partial differential
equations and ordinary differential equations. Results demonstrate that our
framework can discover effective equations to reveal the underlying physical
laws under various nonlinear dynamic systems. Further comparisons are made with
state-of-the-art models, demonstrating good stability and usability. Our
framework substantially lowers the barriers to learning and applying equation
discovery techniques, demonstrating the application potential of LLMs in the
field of knowledge discovery.
[LINK]
http://arxiv.org/abs/2405.07761v2
[DATE]
2024-07-22 15:13:18+08:00
[CATEGORIES]
cs.LG
Optimal Defender Strategies for CAGE-2 using Causal Modeling and Tree Search
[AUTHORS]
Kim Hammar, Neil Dhir, Rolf Stadler
[ABSTRACT]
The CAGE-2 challenge is considered a standard benchmark to compare methods
for autonomous cyber defense. Current state-of-the-art methods evaluated
against this benchmark are based on model-free (offline) reinforcement
learning, which does not provide provably optimal defender strategies. We
address this limitation and present a formal (causal) model of CAGE-2 together
with a method that produces a provably optimal defender strategy, which we call
Causal Partially Observable Monte-Carlo Planning (C-POMCP). It has two key
properties. First, it incorporates the causal structure of the target system,
i.e., the causal relationships among the system variables. This structure
allows for a significant reduction of the search space of defender strategies.
Second, it is an online method that updates the defender strategy at each time
step via tree search. Evaluations against the CAGE-2 benchmark show that
C-POMCP achieves state-of-the-art performance with respect to effectiveness and
is two orders of magnitude more efficient in computing time than the closest
competitor method.
[COMMENTS]
This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible
[LINK]
http://arxiv.org/abs/2407.11070v2
[DATE]
2024-07-22 15:08:31+08:00
[CATEGORIES]
cs.LG
Planning behavior in a recurrent neural network that plays Sokoban
[AUTHORS]
Adrià Garriga-Alonso, Mohammad Taufeeque, Adam Gleave
[ABSTRACT]
To predict how advanced neural networks generalize to novel situations, it is
essential to understand how they reason. Guez et al. (2019, “An investigation
of model-free planning”) trained a recurrent neural network (RNN) to play
Sokoban with model-free reinforcement learning. They found that adding extra
computation steps to the start of episodes at test time improves the RNN’s
success rate. We further investigate this phenomenon, finding that it rapidly
emerges early on in training and then slowly fades, but only for comparatively
easier levels. The RNN also often takes redundant actions at episode starts,
and these are reduced by adding extra computation steps. Our results suggest
that the RNN learns to take time to think by ‘pacing’, despite the per-step
penalties, indicating that training incentivizes planning capabilities. The
small size (1.29M parameters) and interesting behavior of this model make it an
excellent model organism for mechanistic interpretability.
[COMMENTS]
Mechanistic Interpretability workshop, ICML 2024
[LINK]
http://arxiv.org/abs/2407.15421v1
[DATE]
2024-07-22 14:57:34+08:00
[CATEGORIES]
cs.LG
MergeSFL: Split Federated Learning with Feature Merging and Batch Size Regulation
[AUTHORS]
Yunming Liao, Yang Xu, Hongli Xu, Lun Wang, Zhiwei Yao, Chunming Qiao
[ABSTRACT]
Recently, federated learning (FL) has emerged as a popular technique for edge
AI to mine valuable knowledge in edge computing (EC) systems. To mitigate the
computing/communication burden on resource-constrained workers and protect
model privacy, split federated learning (SFL) has been proposed, integrating
both data and model parallelism. Beyond resource limitations, SFL still faces
two other critical challenges in EC, i.e., statistical heterogeneity and system
heterogeneity. To address these challenges, we propose a novel SFL framework,
termed MergeSFL, by incorporating feature merging and batch size regulation in
SFL. Concretely, feature merging aims to merge the features from workers into a
mixed feature sequence, which is approximately equivalent to the features
derived from IID data and is employed to promote model accuracy. Batch
size regulation, in turn, aims to assign diverse and suitable batch sizes to
heterogeneous workers to improve training efficiency. Moreover, MergeSFL
jointly optimizes these two strategies based on their coupled
relationship to further enhance the performance of SFL. Extensive experiments
are conducted on a physical platform with 80 NVIDIA Jetson edge devices, and
the experimental results show that MergeSFL can improve the final model
accuracy by 5.82% to 26.22%, with a speedup of about 1.74x to 4.14x, compared
to the baselines.
[LINK]
http://arxiv.org/abs/2311.13348v2
[DATE]
2024-07-22 14:43:13+08:00
[CATEGORIES]
cs.LG
Weights Shuffling for Improving DPSGD in Transformer-based Models
[AUTHORS]
Jungang Yang, Zhe Ji, Liyao Xiang
[ABSTRACT]
Differential Privacy (DP) mechanisms, especially in high-dimensional
settings, often face the challenge of maintaining privacy without compromising
the data utility. This work introduces an innovative shuffling mechanism in
Differentially-Private Stochastic Gradient Descent (DPSGD) to enhance the
utility of large models at the same privacy guarantee as the unshuffled case.
Specifically, we reveal that random shuffling brings additional randomness to
the trajectory of gradient descent while not impacting the model accuracy by
the permutation invariance property – the model can be equivalently computed
in both forward and backward propagations under permutation. We show that
permutation indeed improves the privacy guarantee of DPSGD in theory, but
tracking the exact privacy loss on the shuffled model is particularly challenging.
Hence we exploit an approximation of the sum of lognormal distributions to derive
the condition for the shuffled DPSGD to meet the DP guarantee. Auditing results
show that our condition offers a DP guarantee quite close to the audited
privacy level, demonstrating that our approach is an effective estimate in practice.
Experimental results verify our theoretical derivation and illustrate
that our mechanism improves the accuracy of DPSGD over the state-of-the-art
baselines on a variety of models and tasks.
[LINK]
http://arxiv.org/abs/2407.15414v1
[DATE]
2024-07-22 14:41:59+08:00
[CATEGORIES]
cs.LG
Retrieval Augmented Deep Anomaly Detection for Tabular Data
[AUTHORS]
Hugo Thimonier, Fabrice Popineau, Arpad Rimmel, Bich-Liên Doan
[ABSTRACT]
Deep learning for tabular data has garnered increasing attention in recent
years, yet employing deep models for structured data remains challenging. While
these models excel with unstructured data, their efficacy with structured data
has been limited. Recent research has introduced retrieval-augmented models to
address this gap, demonstrating promising results in supervised tasks such as
classification and regression. In this work, we investigate using
retrieval-augmented models for anomaly detection on tabular data. We propose a
reconstruction-based approach in which a transformer model learns to
reconstruct masked features of normal samples. We test the
effectiveness of KNN-based and attention-based modules to select relevant
samples to help in the reconstruction process of the target sample. Our
experiments on a benchmark of 31 tabular datasets reveal that augmenting this
reconstruction-based anomaly detection (AD) method with sample-sample
dependencies via retrieval modules significantly boosts performance. The
present work supports the idea that retrieval modules are useful to augment any
deep AD method to enhance anomaly detection on tabular data.
[COMMENTS]
Accepted at CIKM 2024
[LINK]
http://arxiv.org/abs/2401.17052v2
[DATE]
2024-07-22 14:23:02+08:00
[CATEGORIES]
cs.LG
Poisoning with A Pill: Circumventing Detection in Federated Learning
[AUTHORS]
Hanxi Guo, Hao Wang, Tao Song, Tianhang Zheng, Yang Hua, Haibing Guan, Xiangyu Zhang
[ABSTRACT]
Without direct access to the client’s data, federated learning (FL) is
well-known for its unique strength in data privacy protection among existing
distributed machine learning techniques. However, its distributive and
iterative nature makes FL inherently vulnerable to various poisoning attacks.
To counteract these threats, extensive defenses have been proposed to filter
out malicious clients, using various detection metrics. Based on our analysis
of existing attacks and defenses, we find that there is a lack of attention to
model redundancy. In neural networks, various model parameters contribute
differently to the model’s performance. However, existing attacks in FL
manipulate all the model update parameters with the same strategy, making them
easily detectable by common defenses. Meanwhile, the defenses also tend to
analyze the overall statistical features of the entire model updates, leaving
room for sophisticated attacks. Based on these observations, this paper
proposes a generic and attack-agnostic augmentation approach designed to
enhance the effectiveness and stealthiness of existing FL poisoning attacks
against detection in FL, pointing out the inherent flaws of existing defenses
and exposing the necessity of fine-grained FL security. Specifically, we employ
a three-stage methodology that strategically constructs, generates, and injects
poison (generated by existing attacks) into a pill (a tiny subnet with a novel
structure) during the FL training; the stages are named pill construction, pill poisoning,
and pill injection, respectively. Extensive experimental results show that FL
poisoning attacks enhanced by our method can bypass all the popular defenses,
and can achieve up to a 7x error rate increase, and on average more than a
2x error rate increase, on both IID and non-IID data, in both cross-silo and
cross-device FL systems.
[LINK]
http://arxiv.org/abs/2407.15389v1
[DATE]
2024-07-22 13:34:47+08:00
[CATEGORIES]
cs.LG
Improving the Accuracy-Robustness Trade-Off of Classifiers via Adaptive Smoothing
[AUTHORS]
Yatong Bai, Brendon G. Anderson, Aerin Kim, Somayeh Sojoudi
[ABSTRACT]
While prior research has proposed a plethora of methods that build neural
classifiers robust against adversarial attacks, practitioners are still
reluctant to adopt them due to their unacceptably severe clean accuracy
penalties. This paper significantly alleviates this accuracy-robustness
trade-off by mixing the output probabilities of a standard classifier and a
robust classifier, where the standard network is optimized for clean accuracy
and is not robust in general. We show that the robust base classifier’s
confidence difference for correct and incorrect examples is the key to this
improvement. In addition to providing intuitions and empirical evidence, we
theoretically certify the robustness of the mixed classifier under realistic
assumptions. Furthermore, we adapt an adversarial input detector into a mixing
network that adaptively adjusts the mixture of the two base models, further
reducing the accuracy penalty of achieving robustness. The proposed flexible
method, termed “adaptive smoothing”, can work in conjunction with existing or
even future methods that improve clean accuracy, robustness, or adversary
detection. Our empirical evaluation considers strong attack methods, including
AutoAttack and adaptive attack. On the CIFAR-100 dataset, our method achieves
an 85.21% clean accuracy while maintaining a 38.72% $\ell_\infty$-AutoAttacked
($\epsilon = 8/255$) accuracy, becoming the second most robust method on the
RobustBench CIFAR-100 benchmark as of submission, while improving the clean
accuracy by ten percentage points compared with all listed models. The code
that implements our method is available at
https://github.com/Bai-YT/AdaptiveSmoothing.
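To make the idea of mixing output probabilities concrete, here is a minimal sketch with a fixed mixing weight. It does not reproduce the released implementation or the adaptive mixing network described above; the helper names `softmax` and `mix_probabilities`, the fixed weight `alpha`, and the example logits are all assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_probabilities(std_logits, robust_logits, alpha):
    """Convex combination (1 - alpha) * p_std + alpha * p_robust.

    `alpha` is a hypothetical fixed weight in [0, 1]; the adaptive variant
    described in the abstract would instead predict it per input.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p_std = softmax(std_logits)
    p_rob = softmax(robust_logits)
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(p_std, p_rob)]


# Made-up logits for a 3-class problem, for illustration only.
mixed = mix_probabilities([4.0, 1.0, 0.5], [1.0, 1.2, 0.8], alpha=0.5)
print(mixed)
```

Because the result is a convex combination of two probability vectors, it is itself a valid probability vector; setting `alpha` to 0 or 1 recovers the standard or the robust classifier, respectively.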
[LINK]
http://arxiv.org/abs/2301.12554v5
[DATE]
2024-07-22 11:41:03+08:00
[CATEGORIES]
cs.LG
Reinforcement Learning With Sparse-Executing Actions via Sparsity Regularization
[AUTHORS]
Jing-Cheng Pang, Tian Xu, Shengyi Jiang, Yu-Ren Liu, Yang Yu
[ABSTRACT]
Reinforcement learning (RL) has demonstrated impressive performance in
decision-making tasks like embodied control, autonomous driving and financial
trading. In many decision-making tasks, the agents often encounter the problem
of executing actions under limited budgets. However, classic RL methods
typically overlook the challenges posed by such sparse-executing actions. They
operate under the assumption that all actions can be taken an unlimited
number of times, both in the formulation of the problem and in the development
of effective algorithms. To tackle the issue of limited action execution in RL,
this paper first formalizes the problem as a Sparse Action Markov Decision
Process (SA-MDP), in which specific actions in the action space can only be
executed a limited number of times. Then, we propose a policy optimization algorithm,
Action Sparsity REgularization (ASRE), which adaptively handles each action
with a distinct preference. ASRE operates through two steps: First, ASRE
evaluates action sparsity by constrained action sampling. Following this, ASRE
incorporates the sparsity evaluation into policy learning by way of an action
distribution regularization. We provide theoretical identification that
validates the convergence of ASRE to a regularized optimal value function.
In experiments on tasks with known sparse-executing actions, where classical RL
algorithms struggle to train policies efficiently, ASRE effectively constrains
the action sampling and outperforms baselines. Moreover, we show that ASRE
can generally improve the performance in Atari games, demonstrating its broad
applicability.
[LINK]
http://arxiv.org/abs/2105.08666v4
[DATE]
2024-07-22 11:34:57+08:00
[CATEGORIES]
cs.LG
Advancing TTP Analysis: Harnessing the Power of Large Language Models with Retrieval Augmented Generation
[AUTHORS]
Reza Fayyazi, Rozhina Taghdimi, Shanchieh Jay Yang
[ABSTRACT]
Tactics, Techniques, and Procedures (TTPs) outline the methods attackers use
to exploit vulnerabilities. The interpretation of TTPs in the MITRE ATT&CK
framework can be challenging for cybersecurity practitioners due to presumed
expertise and complex dependencies. Meanwhile, advancements with Large Language
Models (LLMs) have led to a recent surge in studies exploring their uses in
cybersecurity operations. It is, however, unclear how LLMs can be used in an
efficient and proper way to provide accurate responses for critical domains
such as cybersecurity. This leads us to investigate how to better use two types
of LLMs: small-scale encoder-only (e.g., RoBERTa) and larger decoder-only
(e.g., GPT-3.5) LLMs to comprehend and summarize TTPs with the intended
purposes (i.e., tactics) of a cyberattack procedure. This work studies and
compares the uses of supervised fine-tuning (SFT) of encoder-only LLMs vs.
Retrieval Augmented Generation (RAG) for decoder-only LLMs (without
fine-tuning). Both SFT and RAG techniques presumably enhance the LLMs with
relevant contexts for each cyberattack procedure. Our studies show decoder-only
LLMs with RAG achieve better performance than encoder-only models with SFT,
particularly when directly relevant context is extracted by RAG. The
decoder-only results could suffer from low ‘Precision’ while achieving high
‘Recall’. Our findings further highlight a counter-intuitive observation that
more generic prompts tend to yield better predictions of cyberattack tactics
than those that are more specifically tailored.
[LINK]
http://arxiv.org/abs/2401.00280v3
[DATE]
2024-07-22 10:51:05+08:00
[CATEGORIES]
cs.LG
Cascaded two-stage feature clustering and selection via separability and consistency in fuzzy decision systems
[AUTHORS]
Yuepeng Chen, Weiping Ding, Hengrong Ju, Jiashuang Huang, Tao Yin
[ABSTRACT]
Feature selection is a vital technique in machine learning, as it can reduce
computational complexity, improve model performance, and mitigate the risk of
overfitting. However, the increasing complexity and dimensionality of datasets
pose significant challenges in the selection of features. Focusing on these
challenges, this paper proposes a cascaded two-stage feature clustering and
selection algorithm for fuzzy decision systems. In the first stage, we reduce
the search space by clustering relevant features and addressing inter-feature
redundancy. In the second stage, a clustering-based sequential forward
selection method that explores the global and local structure of data is
presented. We propose a novel metric for assessing the significance of
features, which considers both global separability and local consistency.
Global separability measures the degree of intra-class cohesion and inter-class
separation based on fuzzy membership, providing a comprehensive understanding
of data separability. Meanwhile, local consistency leverages the fuzzy
neighborhood rough set model to capture uncertainty and fuzziness in the data.
The effectiveness of our proposed algorithm is evaluated through experiments
conducted on 18 public datasets and a real-world schizophrenia dataset. The
experiment results demonstrate our algorithm’s superiority over benchmarking
algorithms in both classification accuracy and the number of selected features.
[COMMENTS]
This paper has been accepted by IEEE Transactions on Fuzzy Systems
for publication. Permission from IEEE must be obtained for all other uses, in
any current or future media. The final version is available at
[10.1109/TFUZZ.2024.3420963]
[LINK]
http://arxiv.org/abs/2407.15893v1
[DATE]
2024-07-22 10:44:32+08:00
[CATEGORIES]
cs.LG
Investigating the Indirect Object Identification circuit in Mamba
[AUTHORS]
Danielle Ensign, Adrià Garriga-Alonso
[ABSTRACT]
How well will current interpretability techniques generalize to future
models? A relevant case study is Mamba, a recent recurrent architecture with
scaling comparable to Transformers. We adapt pre-Mamba techniques to Mamba and
partially reverse-engineer the circuit responsible for the Indirect Object
Identification (IOI) task. Our techniques provide evidence that 1) Layer 39 is
a key bottleneck, 2) Convolutions in layer 39 shift names one position forward,
and 3) The name entities are stored linearly in Layer 39’s SSM. Finally, we
adapt an automatic circuit discovery tool, positional Edge Attribution
Patching, to identify a Mamba IOI circuit. Our contributions provide initial
evidence that circuit-based mechanistic interpretability tools work well for
the Mamba architecture.
[LINK]
http://arxiv.org/abs/2407.14008v2
[DATE]
2024-07-22 10:13:58+08:00
[CATEGORIES]
cs.LG
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI
[AUTHORS]
Yang Liu, Weixing Chen, Yongjie Bai, Guanbin Li, Wen Gao, Liang Lin
[ABSTRACT]
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving
Artificial General Intelligence (AGI) and serves as a foundation for various
applications that bridge cyberspace and the physical world. Recently, the
emergence of Multi-modal Large Models (MLMs) and World Models (WMs) has
attracted significant attention due to their remarkable perception,
interaction, and reasoning capabilities, making them a promising architecture
for the brain of embodied agents. However, there is no comprehensive survey for
Embodied AI in the era of MLMs. In this survey, we give a comprehensive
exploration of the latest advancements in Embodied AI. Our analysis firstly
navigates through the forefront of representative works of embodied robots and
simulators, to fully understand the research focuses and their limitations.
Then, we analyze four main research targets: 1) embodied perception, 2)
embodied interaction, 3) embodied agent, and 4) sim-to-real adaptation,
covering the state-of-the-art methods, essential paradigms, and comprehensive
datasets. Additionally, we explore the complexities of MLMs in virtual and real
embodied agents, highlighting their significance in facilitating interactions
in dynamic digital and physical environments. Finally, we summarize the
challenges and limitations of embodied AI and discuss their potential future
directions. We hope this survey will serve as a foundational reference for the
research community and inspire continued innovation. The associated project can
be found at https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List.
[COMMENTS]
The first comprehensive review of Embodied AI in the era of MLMs, 35
pages. We also provide the paper list for Embodied AI:
https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List
[LINK]
http://arxiv.org/abs/2407.06886v5
[DATE]
2024-07-22 09:59:21+08:00
[CATEGORIES]
cs.LG
MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training
[AUTHORS]
Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, Anima Anandkumar
[ABSTRACT]
We introduce Mini-Sequence Transformer (MsT), a simple and effective
methodology for highly efficient and accurate LLM training with extremely long
sequences. MsT partitions input sequences and iteratively processes
mini-sequences to reduce intermediate memory usage. Integrated with activation
recomputation, it enables significant memory savings in both forward and
backward passes. In experiments with the Llama3-8B model, with MsT, we measure
no degradation in throughput or convergence even with 12x longer sequences than
standard implementations due to our careful memory optimizations. MsT is fully
general, implementation-agnostic, and requires minimal code changes to
integrate with existing LLM training frameworks.
[LINK]
http://arxiv.org/abs/2407.15892v1
[DATE]
2024-07-22 09:52:30+08:00
[CATEGORIES]
cs.LG
Multi-Objective Latent Space Optimization of Generative Molecular Design Models
[AUTHORS]
A N M Nafiz Abeer, Nathan Urban, M Ryan Weil, Francis J. Alexander, Byung-Jun Yoon
[ABSTRACT]
Molecular design based on generative models, such as variational autoencoders
(VAEs), has become increasingly popular in recent years due to its efficiency
for exploring high-dimensional molecular space to identify molecules with
desired properties. While the efficacy of the initial model strongly depends on
the training data, the sampling efficiency of the model for suggesting novel
molecules with enhanced properties can be further improved via latent space
optimization. In this paper, we propose a multi-objective latent space
optimization (LSO) method that can significantly enhance the performance of
generative molecular design (GMD). The proposed method adopts an iterative
weighted retraining approach, where the respective weights of the molecules in
the training data are determined by their Pareto efficiency. We demonstrate
that our multi-objective GMD LSO method can significantly improve the
performance of GMD for jointly optimizing multiple molecular properties.
[COMMENTS]
23 pages, 9 figures
[LINK]
http://arxiv.org/abs/2203.00526v3
[DATE]
2024-07-22 09:26:32+08:00
[CATEGORIES]
cs.LG
U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks
[AUTHORS]
Zhe Fei, Yi Li
[ABSTRACT]
Epigenetic aging clocks play a pivotal role in estimating an individual’s
biological age through the examination of DNA methylation patterns at numerous
CpG (Cytosine-phosphate-Guanine) sites within their genome. However, making
valid inferences on predicted epigenetic ages, or more broadly, on predictions
derived from high-dimensional inputs, presents challenges. We introduce a novel
U-learning approach via combinatory multi-subsampling for making ensemble
predictions and constructing confidence intervals for predictions of continuous
outcomes when traditional asymptotic methods are not applicable. More
specifically, our approach conceptualizes the ensemble estimators within the
framework of generalized U-statistics and invokes the Hájek projection for
deriving the variances of predictions and constructing confidence intervals
with valid conditional coverage probabilities. We apply our approach to two
commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and
illustrate the validity of inferences with extensive numerical studies. We have
applied these methods to predict the DNA methylation age (DNAmAge) of patients
with various health conditions, aiming to accurately characterize the aging
process and potentially guide anti-aging interventions.
[LINK]
http://arxiv.org/abs/2407.15301v1
[DATE]
2024-07-22 08:03:51+08:00
[CATEGORIES]
cs.LG
Enhancing Hardware Fault Tolerance in Machines with Reinforcement Learning Policy Gradient Algorithms
[AUTHORS]
Sheila Schoepp, Mehran Taghian, Shotaro Miwa, Yoshihiro Mitsuka, Shadan Golestan, Osmar Zaïane
[ABSTRACT]
Industry is rapidly moving towards fully autonomous and interconnected
systems that can detect and adapt to changing conditions, including machine
hardware faults. Traditional methods for adding hardware fault tolerance to
machines involve duplicating components and algorithmically reconfiguring a
machine’s processes when a fault occurs. However, the growing interest in
reinforcement learning-based robotic control offers a new perspective on
achieving hardware fault tolerance. Yet limited research has explored the
potential of these approaches for hardware fault tolerance in machines. This
paper investigates the potential of two state-of-the-art reinforcement learning
algorithms, Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), to
enhance hardware fault tolerance in machines. We assess the performance of
these algorithms in two OpenAI Gym simulated environments, Ant-v2 and
FetchReach-v1. Robot models in these environments are subjected to six
simulated hardware faults. Additionally, we conduct an ablation study to
determine the optimal method for transferring an agent’s knowledge, acquired
through learning in a normal (pre-fault) environment, to a (post-)fault
environment in a continual learning setting. Our results demonstrate that
reinforcement learning-based approaches can enhance hardware fault tolerance in
simulated machines, with adaptation occurring within minutes. Specifically, PPO
exhibits the fastest adaptation when retaining the knowledge within its models,
while SAC performs best when discarding all acquired knowledge. Overall, this
study highlights the potential of reinforcement learning-based approaches, such
as PPO and SAC, for hardware fault tolerance in machines. These findings pave
the way for the development of robust and adaptive machines capable of
effectively operating in real-world scenarios.
[LINK]
http://arxiv.org/abs/2407.15283v1
[DATE]
2024-07-22 06:24:16+08:00
[CATEGORIES]
cs.LG
Conformal Predictions under Markovian Data
[AUTHORS]
Frédéric Zheng, Alexandre Proutiere
[ABSTRACT]
We study the split Conformal Prediction method when applied to Markovian
data. We quantify the gap in terms of coverage induced by the correlations in
the data (compared to exchangeable data). This gap strongly depends on the
mixing properties of the underlying Markov chain, and we prove that it
typically scales as $\sqrt{t_\mathrm{mix}\ln(n)/n}$ (where $t_\mathrm{mix}$ is
the mixing time of the chain). We also derive upper bounds on the impact of the
correlations on the size of the prediction set. Finally we present $K$-split
CP, a method that consists in thinning the calibration dataset and that adapts
to the mixing properties of the chain. Its coverage gap is reduced to
$t_\mathrm{mix}/(n\ln(n))$ without really affecting the size of the prediction
set. We finally test our algorithms on synthetic and real-world datasets.
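As a rough illustration of the ingredients named above, the sketch below implements standard split conformal prediction for a regression point predictor and a simple thinning step that keeps every K-th calibration score. The thinning rule, the function names, and the absolute-residual scores are assumptions made for illustration; the exact construction of K-split CP in the paper may differ.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1)(1 - alpha))-th smallest calibration score.

    If that index exceeds n, the prediction set is unbounded (infinite radius).
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def thin(scores, K):
    """Keep every K-th score so retained points are K steps apart in the chain."""
    return scores[::K]


def prediction_interval(point_pred, cal_scores, alpha, K=1):
    """Symmetric interval around a point prediction using thinned calibration scores."""
    q = conformal_quantile(thin(cal_scores, K), alpha)
    return point_pred - q, point_pred + q


# Calibration scores |y_i - f(x_i)| collected along one trajectory (toy values).
cal_scores = list(range(1, 21))
full = prediction_interval(10, cal_scores, alpha=0.1)        # K = 1: plain split CP
thinned = prediction_interval(10, cal_scores, alpha=0.1, K=2)
```

Thinning trades calibration sample size for weaker dependence between the retained scores, which is why the choice of K would have to track the mixing time of the chain.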
[LINK]
http://arxiv.org/abs/2407.15277v1
[DATE]
2024-07-22 06:01:09+08:00
[CATEGORIES]
cs.LG
Unifying Invariant and Variant Features for Graph Out-of-Distribution via Probability of Necessity and Sufficiency
[AUTHORS]
Xuexin Chen, Ruichu Cai, Kaitao Zheng, Zhifan Jiang, Zhengting Huang, Zhifeng Hao, Zijian Li
[ABSTRACT]
Graph Out-of-Distribution (OOD), requiring that models trained on biased data
generalize to the unseen test data, has considerable real-world applications.
One of the most mainstream methods is to extract the invariant subgraph by
aligning the original and augmented data with the help of environment
augmentation. However, these solutions might lead to the loss or redundancy of
semantic subgraphs and result in suboptimal generalization. To address this
challenge, we propose exploiting Probability of Necessity and Sufficiency (PNS)
to extract sufficient and necessary invariant substructures. Beyond that, we
further leverage the domain variant subgraphs related to the labels to boost
the generalization performance in an ensemble manner. Specifically, we first
consider the data generation process for graph data. Under mild conditions, we
show that the sufficient and necessary invariant subgraph can be extracted by
minimizing an upper bound, built on the theoretical advance of the probability
of necessity and sufficiency. To further bridge the theory and algorithm, we
devise the model called Sufficiency and Necessity Inspired Graph Learning
(SNIGL), which ensembles an invariant subgraph classifier on top of latent
sufficient and necessary invariant subgraphs, and a domain variant subgraph
classifier specific to the test domain for generalization enhancement.
Experimental results demonstrate that our SNIGL model outperforms the
state-of-the-art techniques on six public benchmarks, highlighting its
effectiveness in real-world scenarios.
[LINK]
http://arxiv.org/abs/2407.15273v1
[DATE]
2024-07-22 05:35:01+08:00
[CATEGORIES]
cs.LG
LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme
[AUTHORS]
Jeongmin Brian Park, Kun Wu, Vikram Sharma Mailthody, Zaid Quresh, Scott Mahlke, Wen-mei Hwu
[ABSTRACT]
Graph Neural Networks (GNNs) are widely used today in recommendation systems,
fraud detection, and node/link classification tasks. Real world GNNs continue
to scale in size and require a large memory footprint for storing graphs and
embeddings that often exceed the memory capacities of the target GPUs used for
training. To address limited memory capacities, traditional GNN training
approaches use graph partitioning and sharding techniques to scale up across
multiple GPUs within a node and/or scale out across multiple nodes. However,
this approach suffers from the high computational costs of graph partitioning
algorithms and inefficient communication across GPUs.
To address these overheads, we propose Large-scale Storage-based Multi-GPU
GNN framework (LSM-GNN), a storage-based approach to train GNN models that
utilizes a novel communication layer enabling GPU software caches to function
as a system-wide shared cache with low overheads. LSM-GNN incorporates a hybrid
eviction policy that intelligently manages cache space by using both static and
dynamic node information to significantly enhance cache performance.
Furthermore, we introduce the Preemptive Victim-buffer Prefetcher (PVP), a
mechanism for prefetching node feature data from a Victim Buffer located in CPU
pinned-memory to further reduce the pressure on the storage devices.
Experimental results show that despite the lower compute capabilities and
memory capacities, LSM-GNN in a single node with two GPUs offers superior
performance over two-node-four-GPU Dist-DGL baseline and provides up to 3.75x
speed up on end-to-end epoch time while running large-scale GNN training.
[LINK]
http://arxiv.org/abs/2407.15264v1
[DATE]
2024-07-22 04:41:39+08:00
[CATEGORIES]
cs.LG
Fast Risk Assessment in Power Grids through Novel Gaussian Process and Active Learning
[AUTHORS]
Parikshit Pareek, Deepjyoti Deka, Sidhant Misra
[ABSTRACT]
This paper presents a graph-structured Gaussian process (GP) model for
data-driven risk assessment of critical voltage constraints. The proposed GP is
based on a novel kernel, named the vertex-degree kernel (VDK), that decomposes
the voltage-load relationship based on the network graph. To estimate the GP
efficiently, we propose a novel active learning scheme that leverages the
additive structure of VDK. Further, we prove a probabilistic bound on the error
in risk estimation using VDK-GP model that demonstrates that it is
statistically comparable to using standard AC power flow (AC-PF), but does not
require computing a large number of AC-PF solutions. Simulations demonstrate
that the proposed VDK-GP achieves more than a twofold reduction in sample
complexity compared to a generic GP on medium-scale 500-Bus and large-scale
1354-Bus power systems. Moreover, active learning achieves an impressive
reduction of over 15 times in comparison to the time complexity of Monte-Carlo
simulations (MCS), and has a risk estimation error of order 1E-4 for both
500-Bus and 1354-Bus systems, demonstrating its superior efficiency in risk
estimation.
[COMMENTS]
9 pages
[LINK]
http://arxiv.org/abs/2308.07867v2
[DATE]
2024-07-22 04:32:38+08:00
[CATEGORIES]
cs.LG
Convergence Analysis of Probability Flow ODE for Score-based Generative Models
[AUTHORS]
Daniel Zhengyu Huang, Jiaoyang Huang, Zhengjiang Lin
[ABSTRACT]
Score-based generative models have emerged as a powerful approach for
sampling high-dimensional probability distributions. Despite their
effectiveness, their theoretical underpinnings remain relatively
underdeveloped. In this work, we study the convergence properties of
deterministic samplers based on probability flow ODEs from both theoretical and
numerical perspectives. Assuming access to $L^2$-accurate estimates of the
score function, we prove the total variation between the target and the
generated data distributions can be bounded above by
$\mathcal{O}(d^{3/4}\delta^{1/2})$ in the continuous time level, where $d$
denotes the data dimension and $\delta$ represents the $L^2$-score matching
error. For practical implementations using a $p$-th order Runge-Kutta
integrator with step size $h$, we establish error bounds of
$\mathcal{O}(d^{3/4}\delta^{1/2} + d\cdot(dh)^p)$ at the discrete level.
Finally, we present numerical studies on problems up to 128 dimensions to
verify our theory.
[COMMENTS]
37 pages, 7 figures
[LINK]
http://arxiv.org/abs/2404.09730v2
[DATE]
2024-07-22 04:23:07+08:00
[CATEGORIES]
cs.LG
Explainable bank failure prediction models: Counterfactual explanations to reduce the failure risk
[AUTHORS]
Seyma Gunonu, Gizem Altun, Mustafa Cavus
[ABSTRACT]
The accuracy and understandability of bank failure prediction models are
crucial. While interpretable models like logistic regression are favored for
their explainability, complex models such as random forest, support vector
machines, and deep learning offer higher predictive performance but lower
explainability. These models, known as black boxes, make it difficult to derive
actionable insights. To address this challenge, using counterfactual
explanations is suggested. These explanations demonstrate how changes in input
variables can alter the model output and suggest ways to mitigate bank failure
risk. The key challenge lies in selecting the most effective method for
generating useful counterfactuals, which should demonstrate validity,
proximity, sparsity, and plausibility. The paper evaluates several
counterfactual generation methods: WhatIf, Multi Objective, and Nearest
Instance Counterfactual Explanation, and also explores resampling methods like
undersampling, oversampling, SMOTE, and the cost sensitive approach to address
data imbalance in bank failure prediction in the US. The results indicate that
the Nearest Instance Counterfactual Explanation method yields higher quality
counterfactual explanations, mainly using the cost sensitive approach. Overall,
the Multi Objective Counterfactual and Nearest Instance Counterfactual
Explanation methods outperform others regarding validity, proximity, and
sparsity metrics, with the cost sensitive approach providing the most desirable
counterfactual explanations. These findings highlight the variability in the
performance of counterfactual generation methods across different balancing
strategies and machine learning models, offering valuable strategies to enhance
the utility of black box bank failure prediction models.
[COMMENTS]
20 pages, 1 figure
[LINK]
http://arxiv.org/abs/2407.11089v2
[DATE]
2024-07-22 03:47:47+08:00
[CATEGORIES]
cs.LG
A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language
[AUTHORS]
Yuchen Zhang, Ratish Kumar Chandrakant Jha, Soumya Bharadwaj, Vatsal Sanjaykumar Thakkar, Adrienne Hoarfrost, Jin Sun
[ABSTRACT]
Predicting gene function from its DNA sequence is a fundamental challenge in
biology. Many deep learning models have been proposed to embed DNA sequences
and predict their enzymatic function, leveraging information in public
databases linking DNA sequences to an enzymatic function label. However, much
of the scientific community’s knowledge of biological function is not
represented in these categorical labels, and is instead captured in
unstructured text descriptions of mechanisms, reactions, and enzyme behavior.
These descriptions are often captured alongside DNA sequences in biological
databases, albeit in an unstructured manner. Deep learning models predicting
enzymatic function are likely to benefit from incorporating this multi-modal
data encoding scientific knowledge of biological function. There is, however,
no dataset designed for machine learning algorithms to leverage this
multi-modal information. Here we propose a novel dataset and benchmark suite
that enables the exploration and development of large multi-modal neural
network models on gene DNA sequences and natural language descriptions of gene
function. We present baseline performance on benchmarks for both unsupervised
and supervised tasks that demonstrate the difficulty of this modeling
objective, while demonstrating the potential benefit of incorporating
multi-modal data types in function prediction compared to DNA sequences alone.
Our dataset is at: https://hoarfrost-lab.github.io/BioTalk/.
[LINK]
http://arxiv.org/abs/2407.15888v1
[DATE]
2024-07-22 03:27:43+08:00
[CATEGORIES]
cs.LG
Weyl Calculus and Exactly Solvable Schrödinger Bridges with Quadratic State Cost
[AUTHORS]
Alexis M. H. Teter, Wenqing Wang, Abhishek Halder
[ABSTRACT]
Schrödinger bridge, a stochastic dynamical generalization of optimal mass
transport, exhibits a learning-control duality. Viewed as a stochastic control
problem, the Schrödinger bridge finds an optimal control policy that steers
a given joint state statistics to another while minimizing the total control
effort subject to controlled diffusion and deadline constraints. Viewed as a
stochastic learning problem, the Schrödinger bridge finds the most-likely
distribution-valued trajectory connecting endpoint distributional observations,
i.e., solves the two point boundary-constrained maximum likelihood problem over
the manifold of probability distributions. Recent works have shown that solving
the Schrödinger bridge problem with state cost requires finding the Markov
kernel associated with a reaction-diffusion PDE where the state cost appears as
a state-dependent reaction rate. We explain how ideas from Weyl calculus in
quantum mechanics, specifically the Weyl operator and the Weyl symbol, can help
determine such Markov kernels. We illustrate these ideas by explicitly finding
the Markov kernel for the case of quadratic state cost via Weyl calculus,
recovering our earlier results but avoiding tedious computation with Hermite
polynomials.
[LINK]
http://arxiv.org/abs/2407.15245v1
[DATE]
2024-07-22 03:05:30+08:00
[CATEGORIES]
cs.LG
Compact Proofs of Model Performance via Mechanistic Interpretability
[AUTHORS]
Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan
[ABSTRACT]
We propose using mechanistic interpretability – techniques for reverse
engineering model weights into human-interpretable algorithms – to derive and
compactly prove formal guarantees on model performance. We prototype this
approach by formally proving lower bounds on the accuracy of 151 small
transformers trained on a Max-of-$K$ task. We create 102 different
computer-assisted proof strategies and assess their length and tightness of
bound on each of our models. Using quantitative metrics, we find that shorter
proofs seem to require and provide more mechanistic understanding. Moreover, we
find that more faithful mechanistic understanding leads to tighter performance
bounds. We confirm these connections by qualitatively examining a subset of our
proofs. Finally, we identify compounding structureless noise as a key challenge
for using mechanistic interpretability to generate compact proofs on model
performance.
[COMMENTS]
accepted to ICML 2024 Workshop on Mechanistic Interpretability
(Spotlight)
[LINK]
http://arxiv.org/abs/2406.11779v9
[DATE]
2024-07-22 02:30:20+08:00
[CATEGORIES]
cs.LG
Online Optimization and Ambiguity-based Learning of Distributionally Uncertain Dynamic Systems
[AUTHORS]
Dan Li, Dariush Fooladivanda, Sonia Martinez
[ABSTRACT]
This paper proposes a novel approach to construct data-driven online
solutions to optimization problems (P) subject to a class of distributionally
uncertain dynamical systems. The introduced framework allows for the
simultaneous learning of distributional system uncertainty via a parameterized,
control-dependent ambiguity set using a finite historical data set, and its use
to make online decisions with probabilistic regret function bounds. Leveraging
the merits of Machine Learning, the main technical approach relies on the
theory of Distributional Robust Optimization (DRO), to hedge against
uncertainty and provide less conservative results than standard Robust
Optimization approaches. Starting from recent results that describe ambiguity
sets via parameterized, and control-dependent empirical distributions as well
as ambiguity radii, we first present a tractable reformulation of the
corresponding optimization problem while maintaining the probabilistic
guarantees. We then specialize these problems to the cases of 1) optimal
one-stage control of distributionally uncertain nonlinear systems, and 2)
resource allocation under distributional uncertainty. A novelty of this work is
that it extends DRO to online optimization problems subject to a
distributionally uncertain dynamical system constraint, handled via a
control-dependent ambiguity set that leads to online-tractable optimization
with probabilistic guarantees on regret bounds. Further, we introduce an online
version of Nesterov’s accelerated-gradient algorithm, and analyze its
performance to solve this class of problems via dissipativity theory.
[LINK]
http://arxiv.org/abs/2102.09111v2
[DATE]
2024-07-22 02:11:23+08:00
[CATEGORIES]
cs.LG
Temporal Abstraction in Reinforcement Learning with Offline Data
[AUTHORS]
Ranga Shaarad Ayyagari, Anurita Ghosh, Ambedkar Dukkipati
[ABSTRACT]
Standard reinforcement learning algorithms with a single policy perform
poorly on tasks in complex environments involving sparse rewards, diverse
behaviors, or long-term planning. This led to the study of algorithms that
incorporate temporal abstraction by training a hierarchy of policies that plan
over different time scales. The options framework has been introduced to
implement such temporal abstraction by learning low-level options that act as
extended actions controlled by a high-level policy. The main challenge in
applying these algorithms to real-world problems is that they suffer from high
sample complexity to train multiple levels of the hierarchy, which is
impossible in online settings. Motivated by this, in this paper, we propose an
offline hierarchical RL method that can learn options from existing offline
datasets collected by other unknown agents. This is a very challenging problem
due to the distribution mismatch between the learned options and the policies
responsible for the offline dataset, and to our knowledge, this is the first
work in this direction. In this work, we propose a framework by which an online
hierarchical reinforcement learning algorithm can be trained on an offline
dataset of transitions collected by an unknown behavior policy. We validate our
method on Gym MuJoCo locomotion environments and robotic gripper block-stacking
tasks in the standard as well as transfer and goal-conditioned settings.
[LINK]
http://arxiv.org/abs/2407.15241v1
[DATE]
2024-07-22 02:10:31+08:00
[CATEGORIES]
cs.LG
Variational Potential Flow: A Novel Probabilistic Framework for Energy-Based Generative Modelling
[AUTHORS]
Junn Yong Loo, Michelle Adeline, Arghya Pal, Vishnu Monn Baskaran, Chee-Ming Ting, Raphael C. -W. Phan
[ABSTRACT]
Energy based models (EBMs) are appealing for their generality and simplicity
in data likelihood modeling, but have conventionally been difficult to train
due to the unstable and time-consuming implicit MCMC sampling during
contrastive divergence training. In this paper, we present a novel energy-based
generative framework, Variational Potential Flow (VAPO), that entirely
dispenses with implicit MCMC sampling and does not rely on complementary latent
models or cooperative training. The VAPO framework aims to learn a potential
energy function whose gradient (flow) guides the prior samples, so that their
density evolution closely follows an approximate data likelihood homotopy. An
energy loss function is then formulated to minimize the Kullback-Leibler
divergence between density evolution of the flow-driven prior and the data
likelihood homotopy. Images can be generated after training the potential
energy, by initializing the samples from Gaussian prior and solving the ODE
governing the potential flow on a fixed time interval using generic ODE
solvers. Experiment results show that the proposed VAPO framework is capable of
generating realistic images on various image datasets. In particular, our
proposed framework achieves competitive FID scores for unconditional image
generation on the CIFAR-10 and CelebA datasets.
[LINK]
http://arxiv.org/abs/2407.15238v1
[DATE]
2024-07-22 02:08:12+08:00
[CATEGORIES]
cs.LG
Lossless Image Compression Using Multi-level Dictionaries: Binary Images
[AUTHORS]
Samar Agnihotri, Renu Rameshan, Ritwik Ghosal
[ABSTRACT]
Lossless image compression is required in various applications to reduce
storage or transmission costs of images, while requiring the reconstructed
images to have zero information loss compared to the original. Existing
lossless image compression methods either have simple design but poor
compression performance, or complex design, better performance, but with no
performance guarantees. In our endeavor to develop a lossless image compression
method with low complexity and guaranteed performance, we argue that
compressibility of a color image is essentially derived from the patterns in
its spatial structure, intensity variations, and color variations. Thus, we
divide the overall design of a lossless image compression scheme into three
parts that exploit corresponding redundancies. We further argue that the
binarized version of an image captures its fundamental spatial structure. In
this first part of our work, we propose a scheme for lossless compression of
binary images.
The proposed scheme first learns dictionaries of $16\times16$, $8\times8$,
$4\times4$, and $2\times 2$ square pixel patterns from various datasets of
binary images. It then uses these dictionaries to encode binary images. These
dictionaries have various interesting properties that are further exploited to
construct an efficient and scalable scheme. Our preliminary results show that
the proposed scheme consistently outperforms existing conventional and learning
based lossless compression approaches, and provides, on average, as much as
$1.5\times$ better performance than a common general purpose lossless
compression scheme (WebP), more than $3\times$ better performance than a state
of the art learning based scheme, and better performance than a specialized
scheme for binary image compression (JBIG2).
[COMMENTS]
Slightly reorganized content, some new results, and updated existing
results with 13 pages, 11 figures, and 5 tables
[LINK]
http://arxiv.org/abs/2406.03087v2
[DATE]
2024-07-22 02:00:52+08:00
[CATEGORIES]
cs.LG
Deep State Space Recurrent Neural Networks for Time Series Forecasting
[AUTHORS]
Hugo Inzirillo
[ABSTRACT]
We explore various neural network architectures for modeling the dynamics of
the cryptocurrency market. Traditional linear models often fall short in
accurately capturing the unique and complex dynamics of this market. In
contrast, Deep Neural Networks (DNNs) have demonstrated considerable
proficiency in time series forecasting. This paper introduces novel neural
network frameworks that blend the principles of econometric state space models
with the dynamic capabilities of Recurrent Neural Networks (RNNs). We propose
state space models using Long Short Term Memory (LSTM), Gated Recurrent Units
(GRU) and Temporal Kolmogorov-Arnold Networks (TKANs). According to the
results, TKANs, inspired by Kolmogorov-Arnold Networks (KANs) and LSTM,
demonstrate promising outcomes.
[LINK]
http://arxiv.org/abs/2407.15236v1
[DATE]
2024-07-22 01:59:27+08:00
[CATEGORIES]
cs.LG
Privacy-Preserving Multi-Center Differential Protein Abundance Analysis with FedProt
[AUTHORS]
Yuliya Burankova, Miriam Abele, Mohammad Bakhtiari, Christine von Törne, Teresa Barth, Lisa Schweizer, Pieter Giesbertz, Johannes R. Schmidt, Stefan Kalkhof, Janina Müller-Deile, Peter A van Veelen, Yassene Mohammed, Elke Hammer, Lis Arend, Klaudia Adamowicz, Tanja Laske, Anne Hartebrodt, Tobias Frisch, Chen Meng, Julian Matschinske, Julian Späth, Richard Röttger, Veit Schwämmle, Stefanie M. Hauck, Stefan Lichtenthaler, Axel Imhof, Matthias Mann, Christina Ludwig, Bernhard Kuster, Jan Baumbach, Olga Zolotareva
[ABSTRACT]
Quantitative mass spectrometry has revolutionized proteomics by enabling
simultaneous quantification of thousands of proteins. Pooling patient-derived
data from multiple institutions enhances statistical power but raises
significant privacy concerns. Here we introduce FedProt, the first
privacy-preserving tool for collaborative differential protein abundance
analysis of distributed data, which utilizes federated learning and additive
secret sharing. In the absence of a multicenter patient-derived dataset for
evaluation, we created two, one at five centers from LFQ E. coli experiments and
one at three centers from TMT human serum. Evaluations using these datasets
confirm that FedProt achieves accuracy equivalent to DEqMS applied to pooled
data, with completely negligible absolute differences no greater than
$4 \times 10^{-12}$. In contrast, -log10(p-values) computed by the most accurate
meta-analysis methods diverged from the centralized analysis results by up to
25-27. FedProt is available as a web tool with detailed documentation as a
FeatureCloud App.
[COMMENTS]
52 pages, 16 figures, 12 tables. Last two authors listed are joint
last authors
[LINK]
http://arxiv.org/abs/2407.15220v1
[DATE]
2024-07-22 01:09:20+08:00
[CATEGORIES]
cs.LG
Efficient Visual Transformer by Learnable Token Merging
[AUTHORS]
Yancheng Wang, Yingzhen Yang
[ABSTRACT]
Self-attention and transformers have been widely used in deep learning.
Recent efforts have been devoted to incorporating transformer blocks into
different neural architectures, including those with convolutions, leading to
various visual transformers for computer vision tasks. In this paper, we
propose a novel and compact transformer block, Transformer with Learnable Token
Merging (LTM), or LTM-Transformer. LTM-Transformer performs token merging in a
learnable scheme. LTM-Transformer is compatible with many popular and compact
transformer networks, and it reduces the FLOPs and the inference time of the
visual transformers while maintaining or even improving the prediction
accuracy. In the experiments, we replace all the transformer blocks in popular
visual transformers, including MobileViT, EfficientViT, ViT-S/16, and Swin-T,
with LTM-Transformer blocks, leading to LTM-Transformer networks with different
backbones. The LTM-Transformer is motivated by reduction of Information
Bottleneck, and a novel and separable variational upper bound for the IB loss
is derived. The architecture of the mask module in our LTM blocks, which generates
the token merging mask, is designed to reduce the derived upper bound for the IB
loss. Extensive results on computer vision tasks evidence that LTM-Transformer
renders compact and efficient visual transformers with comparable or much
better prediction accuracy than the original visual transformers. The code of
the LTM-Transformer is available at
https://github.com/Statistical-Deep-Learning/LTM.
[LINK]
http://arxiv.org/abs/2407.15219v1
[DATE]
2024-07-22 01:09:19+08:00
[CATEGORIES]
cs.LG
Separable DeepONet: Breaking the Curse of Dimensionality in Physics-Informed Machine Learning
[AUTHORS]
Luis Mandl, Somdatta Goswami, Lena Lambers, Tim Ricken
[ABSTRACT]
The deep operator network (DeepONet) is a popular neural operator
architecture that has shown promise in solving partial differential equations
(PDEs) by using deep neural networks to map between infinite-dimensional
function spaces. In the absence of labeled datasets, we utilize the PDE
residual loss to learn the physical system, an approach known as
physics-informed DeepONet. This method faces significant computational
challenges, primarily due to the curse of dimensionality, as the computational
cost increases exponentially with finer discretization. In this paper, we
introduce the Separable DeepONet framework to address these challenges and
improve scalability for high-dimensional PDEs. Our approach involves a
factorization technique where sub-networks handle individual one-dimensional
coordinates, thereby reducing the number of forward passes and the size of the
Jacobian matrix. By using forward-mode automatic differentiation, we further
optimize the computational cost related to the Jacobian matrix. As a result,
our modifications lead to a linear scaling of computational cost with
discretization density, making Separable DeepONet suitable for high-dimensional
PDEs. We validate the effectiveness of the separable architecture through three
benchmark PDE models: the viscous Burgers equation, Biot’s consolidation
theory, and a parametrized heat equation. In all cases, our proposed framework
achieves comparable or improved accuracy while significantly reducing
computational time compared to conventional DeepONet. These results demonstrate
the potential of Separable DeepONet in efficiently solving complex,
high-dimensional PDEs, advancing the field of physics-informed machine
learning.
[COMMENTS]
23 Pages, 9 Figures and 1 Table
[LINK]
http://arxiv.org/abs/2407.15887v1
[DATE]
2024-07-22 00:33:56+08:00
[CATEGORIES]
cs.LG
Adaptive Foundation Models for Online Decisions: HyperAgent with Fast Incremental Uncertainty Estimation
[AUTHORS]
Yingru Li, Jiawei Xu, Zhi-Quan Luo
[ABSTRACT]
Foundation models often struggle with uncertainty when faced with new
situations in online decision-making, necessitating scalable and efficient
exploration to resolve this uncertainty. We introduce GPT-HyperAgent, an
augmentation of GPT with HyperAgent for uncertainty-aware, scalable exploration
in contextual bandits, a fundamental online decision problem involving natural
language input. We prove that HyperAgent achieves fast incremental uncertainty
estimation with $\tilde{O}(\log T)$ per-step computational complexity over $T$
periods under the linear realizable assumption. Our analysis demonstrates that
HyperAgent’s regret order matches that of exact Thompson sampling in linear
contextual bandits, closing a significant theoretical gap in scalable
exploration. Empirical results in real-world contextual bandit tasks, such as
automated content moderation with human feedback, validate the practical
effectiveness of GPT-HyperAgent for safety-critical decisions. Our code is
open-sourced at https://github.com/szrlee/GPT-HyperAgent/.
[COMMENTS]
43 pages. Presentation at ICML 2024 Workshops: (1) Aligning
Reinforcement Learning Experimentalists and Theorists; (2) Automated
Reinforcement Learning: Exploring Meta-Learning, AutoML, and LLMs
[LINK]
http://arxiv.org/abs/2407.13195v2
[DATE]
2024-07-22 00:31:14+08:00
[CATEGORIES]
cs.LG
LSTM Autoencoder-based Deep Neural Networks for Barley Genotype-to-Phenotype Prediction
[AUTHORS]
Guanjin Wang, Junyu Xuan, Penghao Wang, Chengdao Li, Jie Lu
[ABSTRACT]
Artificial Intelligence (AI) has emerged as a key driver of precision
agriculture, facilitating enhanced crop productivity, optimized resource use,
farm sustainability, and informed decision-making. Also, the expansion of
genome sequencing technology has greatly increased crop genomic resources,
deepening our understanding of genetic variation and enhancing desirable crop
traits to optimize performance in various environments. There is increasing
interest in using machine learning (ML) and deep learning (DL) algorithms for
genotype-to-phenotype prediction due to their excellence in capturing complex
interactions within large, high-dimensional datasets. In this work, we propose
a new LSTM autoencoder-based model for barley genotype-to-phenotype prediction,
specifically for flowering time and grain yield estimation, which could
potentially help optimize yields and management practices. Our model
outperformed the other baseline methods, demonstrating its potential in
handling complex high-dimensional agricultural datasets and enhancing crop
phenotype prediction performance.
[LINK]
http://arxiv.org/abs/2407.16709v1
[DATE]
2024-07-22 00:07:43+08:00
[CATEGORIES]
cs.LG
Superpixel Semantics Representation and Pre-training for Vision-Language Task
[AUTHORS]
Siyu Zhang, Yeming Chen, Yaoru Sun, Fang Wang, Jun Yang, Lizhi Bai, Shangce Gao
[ABSTRACT]
The key to integrating visual language tasks is to establish a good alignment
strategy. Recently, visual semantic representation has achieved fine-grained
visual understanding by dividing grids or image patches. However, ignoring the
coarse-grained semantic interactions in image space hinders the extraction of
complex contextual semantic relations at the scene boundaries. This paper
proposes superpixels as comprehensive and robust
visual primitives, which mine coarse-grained semantic interactions by
clustering perceptually similar pixels, speeding up the subsequent processing
of primitives. To capture superpixel-level semantic features, we propose a
Multiscale Difference Graph Convolutional Network (MDGCN). It allows parsing
the entire image as a fine-to-coarse visual hierarchy. To reason about actual
semantic relations, we reduce potential noise interference by aggregating
difference information between adjacent graph nodes. Finally, we propose a
multi-level fusion rule in a bottom-up manner to avoid understanding deviation
by mining complementary spatial information at different levels. Experiments
show that the proposed method can effectively promote the learning of multiple
downstream tasks. Encouragingly, our method outperforms previous methods on all
metrics. Our code will be released upon publication.
[LINK]
http://arxiv.org/abs/2310.13447v3
[DATE]
2024-07-21 23:38:23+08:00
[CATEGORIES]
cs.CL
A Survey on Employing Large Language Models for Text-to-SQL Tasks
[AUTHORS]
Liang Shi, Zhengju Tang, Zhi Yang
[ABSTRACT]
The increasing volume of data stored in relational databases has led to the
need for efficient querying and utilization of this data in various sectors.
However, writing SQL queries requires specialized knowledge, which poses a
challenge for non-professional users trying to access and query databases.
Text-to-SQL parsing solves this issue by converting natural language queries
into SQL queries, thus making database access more accessible for non-expert
users. To take advantage of the recent developments in Large Language Models
(LLMs), a range of new methods have emerged, with a primary focus on prompt
engineering and fine-tuning. This survey provides a comprehensive overview of
LLMs in text-to-SQL tasks, discussing benchmark datasets, prompt engineering,
fine-tuning methods, and future research directions. We hope this review will
enable readers to gain a broader understanding of the recent advances in this
field and offer some insights into its future trajectory.
[LINK]
http://arxiv.org/abs/2407.15186v1
[DATE]
2024-07-21 22:48:23+08:00
[CATEGORIES]
cs.CL
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
[AUTHORS]
Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou
[ABSTRACT]
Large Language Models have excelled in various fields but encounter
efficiency limitations due to the substantial Key-Value (KV) cache required for
long-sequence inference. Recent efforts try to evict non-critical cache
elements during runtime, thereby reducing cache size within given memory
budgets while preserving generation quality. Our reexamination of foundational
principles reveals that prevailing methods aim to minimize an upper bound of
eviction loss, quantified as the L1 distance between the pre- and post-eviction
outputs of multi-head self-attention mechanisms. Moreover, our analysis
indicates that the common practices of uniformly assigning budgets across
different attention heads during cache eviction hinder their budget
utilization, negatively impacting generation quality. In light of these
findings, we propose a simple yet effective adaptive budget allocation
algorithm. This algorithm not only optimizes the loss upper bound in theory but
also reduces the eviction loss in practice by aligning with the intrinsic
patterns of self-attention mechanisms. Integrating this algorithm into two
advanced methods, we develop Ada-SnapKV and Ada-Pyramid. Extensive evaluations
on 16 datasets and the Needle-in-a-Haystack test confirm that they both
significantly boost performance across various tasks.
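As a rough illustration of the quantities named in this abstract (not the
authors' implementation), the sketch below computes the L1 distance between a
head's attention output before and after eviction, and contrasts a uniform
per-head budget with a pooled budget that keeps the globally largest attention
weights across heads. The function names, the top-weight eviction rule, and the
toy numbers are all assumptions made for this example.

```python
def attention_output(weights, values):
    """Weighted sum of value vectors for one head (weights are post-softmax)."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def evict(weights, keep):
    """Zero out evicted positions, then renormalize the surviving weights."""
    kept = [w if i in keep else 0.0 for i, w in enumerate(weights)]
    total = sum(kept)
    if total == 0.0:  # head received no budget at all
        return kept
    return [w / total for w in kept]


def l1_eviction_loss(weights, values, keep):
    """L1 distance between pre- and post-eviction outputs of one head."""
    before = attention_output(weights, values)
    after = attention_output(evict(weights, keep), values)
    return sum(abs(b - a) for b, a in zip(before, after))


def uniform_allocation(head_weights, budget_per_head):
    """Every head keeps its own top-`budget_per_head` positions."""
    keeps = []
    for w in head_weights:
        ranked = sorted(range(len(w)), key=lambda i: w[i], reverse=True)
        keeps.append(set(ranked[:budget_per_head]))
    return keeps


def adaptive_allocation(head_weights, budget_per_head):
    """Pool the total budget and keep the globally largest weights.

    This pooling rule is an assumption of the sketch, chosen to illustrate
    non-uniform budgets across heads.
    """
    total_budget = budget_per_head * len(head_weights)
    pooled = [(w, h, i) for h, ws in enumerate(head_weights) for i, w in enumerate(ws)]
    pooled.sort(reverse=True)
    keeps = [set() for _ in head_weights]
    for _, h, i in pooled[:total_budget]:
        keeps[h].add(i)
    return keeps


# Toy data (hypothetical): head 0 is sharply peaked, head 1 is flat.
head_weights = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
head_values = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
    [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0], [0.0, -4.0]],
]


def total_loss(keeps):
    """Sum the per-head L1 eviction losses for a given allocation."""
    return sum(
        l1_eviction_loss(w, v, k)
        for w, v, k in zip(head_weights, head_values, keeps)
    )


uniform_total = total_loss(uniform_allocation(head_weights, 2))
adaptive_total = total_loss(adaptive_allocation(head_weights, 2))
print(uniform_total, adaptive_total)
```

In this toy setup the pooled budget moves one cache slot from the peaked head
to the flat head and yields a lower total loss; that outcome is a property of
the chosen numbers, not a general guarantee.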
[LINK]
http://arxiv.org/abs/2407.11550v2
[DATE]
2024-07-21 22:08:42+08:00
[CATEGORIES]
cs.CL
When Can Transformers Count to n?
[AUTHORS]
Gilad Yehudai, Haim Kaplan, Asma Ghandeharioun, Mor Geva, Amir Globerson
[ABSTRACT]
Large language models based on the transformer architectures can solve highly
complex tasks. But are there simple tasks that such models cannot solve? Here
we focus on very simple counting tasks that involve counting how many times a
token in the vocabulary has appeared in a string. We show that if the
dimension of the transformer state is linear in the context length, this task
can be solved. However, the solution we propose does not scale beyond this
limit, and we provide theoretical arguments for why it is likely impossible for
a size-limited transformer to implement this task. Our empirical results
demonstrate the same phase transition in performance, as anticipated by the
theoretical argument. Our results demonstrate the importance of understanding
how transformers can solve simple tasks.
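To make the task in this abstract concrete, here is a minimal reference
definition of the counting problem itself: given a sequence of tokens and a
query token, return how many times the query appears. This is only a
ground-truth labeler for the task, not a transformer construction, and the
function name is hypothetical.

```python
from collections import Counter


def count_token(tokens, query):
    """Return how many times `query` occurs in the token sequence."""
    # Counter returns 0 for tokens that never appear.
    return Counter(tokens)[query]


print(count_token(["a", "b", "a", "c", "a"], "a"))
```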
[LINK]
http://arxiv.org/abs/2407.15160v1
[DATE]
2024-07-21 21:31:02+08:00
[CATEGORIES]
cs.CL
cs.LG
Fine-grained Gender Control in Machine Translation with Large Language Models
[AUTHORS]
Minwoo Lee, Hyukhun Koh, Minsung Kim, Kyomin Jung
[ABSTRACT]
In machine translation, the problem of ambiguously gendered input has been
pointed out, where the gender of an entity is not available in the source
sentence. To address this ambiguity issue, the task of controlled translation
that takes the gender of the ambiguous entity as additional input has been
proposed. However, most existing works have only considered a simplified setup
of one target gender for input. In this paper, we tackle controlled translation
in a more realistic setting of inputs with multiple entities and propose
the Gender-of-Entity (GoE) prompting method for LLMs. Our proposed method instructs
the model with fine-grained entity-level gender information to translate with
correct gender inflections. By utilizing four evaluation benchmarks, we
investigate the controlled translation capability of LLMs in multiple
dimensions and find that LLMs reach state-of-the-art performance in controlled
translation. Furthermore, we discover the emergence of a gender interference
phenomenon when controlling the gender of multiple entities. Finally, we
address the limitations of existing gender accuracy evaluation metrics and
propose leveraging LLMs as an evaluator for gender inflection in machine
translation.
[COMMENTS]
NAACL 2024 Main track long paper
[LINK]
http://arxiv.org/abs/2407.15154v1
[DATE]
2024-07-21 21:15:00+08:00
[CATEGORIES]
cs.CL
VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency
[AUTHORS]
Vernon Toh Yan Han, Ratish Puduppully, Nancy F. Chen
[COMMENTS]
AI4MATH Workshop @ ICML 2024
[LINK]
http://arxiv.org/abs/2311.07172v2
[DATE]
2024-07-21 20:41:18+08:00
[CATEGORIES]
cs.CL
A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts
[AUTHORS]
Gokcen Gokceoglu, Devrim Cavusoglu, Emre Akbas, Özen Nergis Dolcerocca
[ABSTRACT]
This paper introduces a multi-level, multi-label text classification dataset
comprising over 3000 documents. The dataset features literary and critical
texts from 19th-century Ottoman Turkish and Russian. It is the first study to
apply large language models (LLMs) to this dataset, sourced from prominent
literary periodicals of the era. The texts have been meticulously organized and
labeled. This was done according to a taxonomic framework that takes into
account both their structural and semantic attributes. Articles are categorized
and tagged with bibliometric metadata by human experts. We present baseline
classification results using a classical bag-of-words (BoW) naive Bayes model
and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that
in certain cases, BoW outperforms the LLMs,
emphasizing the need for additional research, especially in low-resource
language settings. This dataset is expected to be a valuable resource for
researchers in natural language processing and machine learning, especially for
historical and low-resource languages. The dataset is publicly available.
[LINK]
http://arxiv.org/abs/2407.15136v1
[DATE]
2024-07-21 20:14:45+08:00
[CATEGORIES]
cs.CL
Towards Better Question Generation in QA-based Event Extraction
[AUTHORS]
Zijin Hong, Jian Liu
[COMMENTS]
Accepted to ACL2024 Findings
[LINK]
http://arxiv.org/abs/2405.10517v3
[DATE]
2024-07-21 20:01:08+08:00
[CATEGORIES]
cs.CL
Automatically Identifying Local and Global Circuits with Linear Computation Graphs
[AUTHORS]
Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, Xipeng Qiu
[ABSTRACT]
Circuit analysis of a given model behavior is a central task in
mechanistic interpretability. We introduce our circuit discovery pipeline with
Sparse Autoencoders (SAEs) and a variant called Transcoders. With these two
modules inserted into the model, the model’s computation graph with respect to
OV and MLP circuits becomes strictly linear. Our methods do not require linear
approximation to compute the causal effect of each node. This fine-grained
graph identifies both end-to-end and local circuits accounting for either
logits or intermediate features. We can scalably apply this pipeline with a
technique called Hierarchical Attribution. We analyze three kinds of circuits
in GPT-2 Small: bracket, induction, and Indirect Object Identification
circuits. Our results reveal new findings underlying existing discoveries.
[LINK]
http://arxiv.org/abs/2405.13868v2
[DATE]
2024-07-21 19:42:32+08:00
[CATEGORIES]
cs.LG
cs.CL
Language Models as Science Tutors
[AUTHORS]
Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Jameson Aragon, Arturo Rodríguez Fanlo, Simon Frieder, Simon Machado, Akshara Prabhakar, Ellie Thieu, Jiachen T. Wang, Zirui Wang, Xindi Wu, Mengzhou Xia, Wenhan Xia, Jiatong Yu, Jun-Jie Zhu, Zhiyong Jason Ren, Sanjeev Arora, Danqi Chen
[ABSTRACT]
NLP has recently made exciting progress toward training language models (LMs)
with strong scientific problem-solving skills. However, model development has
not focused on real-life use-cases of LMs for science, including applications
in education that require processing long scientific documents. To address
this, we introduce TutorEval and TutorChat. TutorEval is a diverse
question-answering benchmark consisting of questions about long chapters from
STEM textbooks, written by experts. TutorEval helps measure real-life usability
of LMs as scientific assistants, and it is the first benchmark combining long
contexts, free-form generation, and multi-disciplinary scientific knowledge.
Moreover, we show that fine-tuning base models with existing dialogue datasets
leads to poor performance on TutorEval. Therefore, we create TutorChat, a
dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to
fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized
in math have a 32K-token context window, and they excel at TutorEval while
performing strongly on GSM8K and MATH. Our datasets build on open-source
materials, and we release our models, data, and evaluations.
[COMMENTS]
8 pages without bibliography and appendix, 26 pages total
[LINK]
http://arxiv.org/abs/2402.11111v2
[DATE]
2024-07-21 19:11:49+08:00
[CATEGORIES]
cs.CL
Active Prompting with Chain-of-Thought for Large Language Models
[AUTHORS]
Shizhe Diao, Pengcheng Wang, Yong Lin, Rui Pan, Xiang Liu, Tong Zhang
[COMMENTS]
Published in ACL 2024
[LINK]
http://arxiv.org/abs/2302.12246v5
[DATE]
2024-07-21 16:01:00+08:00
[CATEGORIES]
cs.CL
Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval
[AUTHORS]
Ohad Rubin, Jonathan Berant
[ABSTRACT]
Retrieval-augmented language models (LMs) have received much attention
recently. However, typically the retriever is not trained jointly as a native
component of the LM, but added post-hoc to an already-pretrained LM, which
limits the ability of the LM and the retriever to adapt to one another. In this
work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture
and training procedure for jointly training a retrieval-augmented LM from
scratch and apply it to the task of modeling long texts. Given a recently
generated text chunk in a long document, the LM computes query representations,
which are then used to retrieve earlier chunks in the document, located
potentially tens of thousands of tokens before. Information from retrieved
chunks is fused into the LM representations to predict the next target chunk.
We train the retriever component with a semantic objective, where the goal is
to retrieve chunks that increase the probability of the next chunk, according
to a reference LM. We evaluate RPT on four long-range language modeling tasks,
spanning books, code, and mathematical writing, and demonstrate that RPT
improves retrieval quality and subsequently perplexity across the board
compared to strong baselines.
[COMMENTS]
Accepted to TACL 2024
[LINK]
http://arxiv.org/abs/2306.13421v2
[DATE]
2024-07-21 15:35:23+08:00
[CATEGORIES]
cs.CL
Natural Language Task-Oriented Dialog System 2.0
[AUTHORS]
Adib Mosharrof, A. B. Siddique
[ABSTRACT]
Task-oriented dialog (TOD) systems play a crucial role in facilitating
efficient interactions between users and machines by focusing on achieving
specific goals through natural language communication. These systems
traditionally rely on manually annotated metadata, such as dialog states and
policy annotations, which is labor-intensive, expensive, inconsistent, and
prone to errors, thereby limiting the potential to leverage the vast amounts of
available conversational data. A critical aspect of TOD systems involves
accessing and integrating information from external sources to effectively
engage users. The process of determining when and how to query external
resources represents a fundamental challenge in system design; however, existing
approaches expect this information to be provided in the context. In this paper,
we introduce Natural Language Task Oriented Dialog System (NL-ToD), a novel
model that removes the dependency on manually annotated turn-wise data by
utilizing dialog history and domain schemas to create a Zero Shot Generalizable
TOD system. We also incorporate query generation as a core task of the system,
where the output of the system could be a response to the user or an API query
to communicate with an external resource. To achieve a more granular analysis
of the system output, we classify the output into multiple categories: slot
filling, retrieval, and query generation. Our analysis reveals that slot
filling is the most challenging TOD task for all models. Experimental results
on three popular TOD datasets (SGD, KETOD and BiToD) show the effectiveness of
our approach, as NL-ToD outperforms state-of-the-art approaches, particularly
with 31.4% and 82.1% improvements in the BLEU-4 score on the SGD and KETOD
datasets.
[LINK]
http://arxiv.org/abs/2407.15055v1
[DATE]
2024-07-21 12:52:38+08:00
[CATEGORIES]
cs.CL
Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
[AUTHORS]
Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal
[ABSTRACT]
Spatial control is a core capability in controllable image generation.
Advancements in layout-guided image generation have shown promising results on
in-distribution (ID) datasets with similar spatial configurations. However, it
is unclear how these models perform when facing out-of-distribution (OOD)
samples with arbitrary, unseen layouts. In this paper, we propose LayoutBench,
a diagnostic benchmark for layout-guided image generation that examines four
categories of spatial control skills: number, position, size, and shape. We
benchmark two recent representative layout-guided image generation methods and
observe that good ID layout control may not generalize well to arbitrary
layouts in the wild (e.g., objects at the boundary). Next, we propose
IterInpaint, a new baseline that generates foreground and background regions
step-by-step via inpainting, demonstrating stronger generalizability than
existing models on OOD layouts in LayoutBench. We perform quantitative and
qualitative evaluation and fine-grained analysis on the four LayoutBench skills
to pinpoint the weaknesses of existing models. We show comprehensive ablation
studies on IterInpaint, including training task ratio, crop&paste vs. repaint,
and generation order. Lastly, we evaluate the zero-shot performance of
different pretrained layout-guided image generation models on LayoutBench-COCO,
our new benchmark for OOD layouts with real objects, where our IterInpaint
consistently outperforms SOTA baselines in all four splits. Project website:
https://layoutbench.github.io
[COMMENTS]
CVPR 2024 Workshop; Project website: https://layoutbench.github.io
[LINK]
http://arxiv.org/abs/2304.06671v3
[DATE]
2024-07-21 12:14:21+08:00
[CATEGORIES]
cs.CL
cs.LG
Automatic Interactive Evaluation for Large Language Models with State Aware Patient Simulator
[AUTHORS]
Yusheng Liao, Yutong Meng, Yuhao Wang, Hongcheng Liu, Yanfeng Wang, Yu Wang
[ABSTRACT]
Large Language Models (LLMs) have demonstrated remarkable proficiency in
human interactions, yet their application within the medical field remains
insufficiently explored. Previous works mainly evaluate medical knowledge
through examinations, which is far from realistic scenarios and falls short in
assessing the abilities of LLMs on clinical tasks. In the
quest to enhance the application of Large Language Models (LLMs) in healthcare,
this paper introduces the Automated Interactive Evaluation (AIE) framework and
the State-Aware Patient Simulator (SAPS), targeting the gap between traditional
LLM evaluations and the nuanced demands of clinical practice. Unlike prior
methods that rely on static medical knowledge assessments, AIE and SAPS provide
a dynamic, realistic platform for assessing LLMs through multi-turn
doctor-patient simulations. This approach offers a closer approximation to real
clinical scenarios and allows for a detailed analysis of LLM behaviors in
response to complex patient interactions. Our extensive experimental validation
demonstrates the effectiveness of the AIE framework, with outcomes that align
well with human evaluations, underscoring its potential to revolutionize
medical LLM testing for improved healthcare delivery.
[COMMENTS]
23 pages, 5 figures
[LINK]
http://arxiv.org/abs/2403.08495v4
[DATE]
2024-07-21 11:00:04+08:00
[CATEGORIES]
cs.CL
Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks
[AUTHORS]
Zheng Wang, Boxiao Jin, Zhongzhi Yu, Minjia Zhang
[ABSTRACT]
How to efficiently serve Large Language Models (LLMs) has become a pressing
issue because of their huge computational cost in their autoregressive
generation process. To mitigate computational costs, LLMs often employ the KV
Cache technique to improve the generation speed. While improving the
computational efficiency, the storage requirements of the KV cache are
substantial, particularly in long-context scenarios, leading to significant
memory consumption. Existing KV cache eviction methods often degrade the
performance of LLMs in long-context scenarios due to the information loss
introduced by eviction. In this paper, we propose a novel KV cache merging
approach, called KVMerger, to achieve adaptive KV cache compression for
long-context tasks without significant performance degradation under
constrained memory budgets. Our approach is inspired by the intriguing
observation that key states exhibit high similarity at the token level within a
single sequence. To facilitate merging, we develop an effective yet
straightforward merging set identification algorithm to identify suitable KV
states for merging. Our merging set identification algorithm leads to a
second observation: KV cache sparsity, from a similarity perspective, is
independent of the dataset and remains persistent at the model level.
Subsequently, we propose a Gaussian kernel weighted merging algorithm to
selectively merge all states within each merging set. We conduct extensive
experiments to demonstrate the effectiveness of KVMerger for long-context tasks
under constrained memory budgets, applying it to models including
Llama2-7B-chat and Llama2-13B-chat. Using the LongBench and ZeroScroll
benchmarks, we compare our method with other KV cache compression techniques,
including H2O and CaM, showing that our method achieves superior performance
across tasks with both 50% and 35% KV cache budgets.
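The idea of Gaussian kernel weighted merging mentioned in this abstract can be
sketched as follows. This is a minimal illustration under stated assumptions,
not the KVMerger implementation: the choice of a pivot state, the squared
Euclidean distance, the bandwidth `sigma`, and the function name are all
hypothetical. Each state in a merging set is weighted by a Gaussian kernel of
its distance to the pivot, and the merged state is the normalized weighted sum.

```python
import math


def gaussian_kernel_merge(states, pivot_index, sigma=1.0):
    """Merge a set of state vectors into one, weighting states near the pivot more."""
    pivot = states[pivot_index]
    weights = []
    for state in states:
        sq_dist = sum((a - b) ** 2 for a, b in zip(state, pivot))
        # Gaussian kernel: weight 1 at the pivot, decaying with distance.
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)  # at least 1.0, since the pivot has weight 1
    dim = len(pivot)
    return [
        sum(w * state[d] for w, state in zip(weights, states)) / total
        for d in range(dim)
    ]


merged = gaussian_kernel_merge([[0.0, 0.0], [2.0, 0.0]], pivot_index=0)
print(merged)
```

With this weighting, states far from the pivot contribute little, so the merged
vector stays close to the pivot instead of becoming a plain average.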
[LINK]
http://arxiv.org/abs/2407.08454v2
[DATE]
2024-07-21 10:37:11+08:00
[CATEGORIES]
cs.CL
Medical Spoken Named Entity Recognition
[AUTHORS]
Khai Le-Duc, David Thulke, Hung-Phong Tran, Long Vo-Dang, Khai-Nguyen Nguyen, Truong-Son Hy, Ralf Schlüter
[ABSTRACT]
Spoken Named Entity Recognition (NER) aims to extract named entities from
speech and categorizing them into types like person, location, organization,
etc. In this work, we present VietMed-NER - the first spoken NER dataset in the
medical domain. To the best of our knowledge, our real-world dataset is the largest
spoken NER dataset in the world in terms of the number of entity types,
featuring 18 distinct types. Secondly, we present baseline results using
various state-of-the-art pre-trained models: encoder-only and
sequence-to-sequence. We found that the pre-trained multilingual model XLM-R
outperformed all monolingual models on both reference text and ASR output. Also
in general, encoders perform better than sequence-to-sequence models for the
NER task. By simply translating, the transcript is applicable not just to
Vietnamese but to other languages as well. All code, data and models are made
publicly available here: https://github.com/leduckhai/MultiMed
[COMMENTS]
Preprint, 41 pages
[LINK]
http://arxiv.org/abs/2406.13337v2
[DATE]
2024-07-21 08:54:08+08:00
[CATEGORIES]
cs.CL
cs.LG
BrainStorm @ iREL at #SMM4H 2024: Leveraging Translation and Topical Embeddings for Annotation Detection in Tweets
[AUTHORS]
Manav Chaudhary, Harshit Gupta, Vasudeva Varma
[COMMENTS]
Accepted at SMM4H, colocated at ACL 2024
[LINK]
http://arxiv.org/abs/2405.11192v2
[DATE]
2024-07-21 08:30:07+08:00
[CATEGORIES]
cs.CL
Enhancing Incremental Summarization with Structured Representations
[AUTHORS]
EunJeong Hwang, Yichao Zhou, James Bradley Wendt, Beliz Gunel, Nguyen Vo, Jing Xie, Sandeep Tata
[ABSTRACT]
Large language models (LLMs) often struggle with processing extensive input
contexts, which can lead to redundant, inaccurate, or incoherent summaries.
Recent methods have used unstructured memory to incrementally process these
contexts, but they still suffer from information overload due to the volume of
unstructured data handled. In our study, we introduce structured knowledge
representations ($GU_{json}$), which significantly improve summarization
performance by 40% and 14% across two public datasets. Most notably, we propose
the Chain-of-Key strategy ($CoK_{json}$) that dynamically updates or augments
these representations with new information, rather than recreating the
structured memory for each new source. This method further enhances performance
by 7% and 4% on the datasets.
[LINK]
http://arxiv.org/abs/2407.15021v1
[DATE]
2024-07-21 08:23:33+08:00
[CATEGORIES]
cs.CL
Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions
[AUTHORS]
Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Hannaneh Hajishirzi, Ashish Sabharwal
[ABSTRACT]
Multiple-choice question answering (MCQA) is a key competence of performant
transformer language models that is tested by mainstream benchmarks. However,
recent evidence shows that models can have quite a range of performance,
particularly when the task format is diversified slightly (such as by shuffling
answer choice order). In this work we ask: how do successful models perform
formatted MCQA? We employ vocabulary projection and activation patching methods
to localize key hidden states that encode relevant information for predicting
the correct answer. We find that prediction of a specific answer symbol is
causally attributed to a single middle layer, and specifically its multi-head
self-attention mechanism. We show that subsequent layers increase the
probability of the predicted answer symbol in vocabulary space, and that this
probability increase is associated with a sparse set of attention heads with
unique roles. We additionally uncover differences in how different models
adjust to alternative symbols. Finally, we demonstrate that a synthetic task
can disentangle sources of model error to pinpoint when a model has learned
formatted MCQA, and show that an inability to separate answer symbol tokens in
vocabulary space is a property of models unable to perform formatted MCQA
tasks.
[COMMENTS]
Preprint. Code will be available at
https://github.com/allenai/understanding_mcqa
[LINK]
http://arxiv.org/abs/2407.15018v1
[DATE]
2024-07-21 08:10:23+08:00
[CATEGORIES]
cs.CL
Improving Citation Text Generation: Overcoming Limitations in Length Control
[AUTHORS]
Biswadip Mandal, Xiangci Li, Jessica Ouyang
[ABSTRACT]
A key challenge in citation text generation is that the length of generated
text often differs from the length of the target, lowering the quality of the
generation. While prior works have investigated length-controlled generation,
their effectiveness depends on knowing the appropriate generation length. In
this work, we present an in-depth study of the limitations of predicting
scientific citation text length and explore the use of heuristic estimates of
desired length.
[LINK]
http://arxiv.org/abs/2407.14997v1
[DATE]
2024-07-21 06:10:37+08:00
[CATEGORIES]
cs.CL
Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data
[AUTHORS]
Antonis Antoniades, Xinyi Wang, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang
[ABSTRACT]
Despite the proven utility of large language models (LLMs) in real-world
applications, there remains a lack of understanding regarding how they leverage
their large-scale pretraining text corpora to achieve such capabilities. In
this work, we investigate the interplay between generalization and memorization
in pretrained LLMs at scale, through a comprehensive $n$-gram analysis of their
training data. Our experiments focus on three general task types: translation,
question-answering, and multiple-choice reasoning. With various sizes of
open-source LLMs and their pretraining corpora, we observe that as the model
size increases, the task-relevant $n$-gram pair data becomes increasingly
important, leading to improved task performance, decreased memorization,
stronger generalization, and emergent abilities. Our results support the
hypothesis that LLMs’ capabilities emerge from a delicate balance of
memorization and generalization with sufficient task-related pretraining data,
and point the way to larger-scale analyses that could further improve our
understanding of these models.
[COMMENTS]
ICML FM-Wild workshop version
[LINK]
http://arxiv.org/abs/2407.14985v1
[DATE]
2024-07-21 05:24:40+08:00
[CATEGORIES]
cs.CL
cs.LG
Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals
[AUTHORS]
Yue Wu, Yewen Fan, Paul Pu Liang, Amos Azaria, Yuanzhi Li, Tom M. Mitchell
[ABSTRACT]
High sample complexity has long been a challenge for RL. On the other hand,
humans learn to perform tasks not only from interaction or demonstrations, but
also by reading unstructured text documents, e.g., instruction manuals.
Instruction manuals and wiki pages are among the most abundant data that could
inform agents of valuable features and policies or task-specific environmental
dynamics and reward structures. Therefore, we hypothesize that the ability to
utilize human-written instruction manuals to assist learning policies for
specific tasks should lead to a more efficient and better-performing agent. We
propose the Read and Reward framework. Read and Reward speeds up RL algorithms
on Atari games by reading manuals released by the Atari game developers. Our
framework consists of a QA Extraction module that extracts and summarizes
relevant information from the manual and a Reasoning module that evaluates
object-agent interactions based on information from the manual. An auxiliary
reward is then provided to a standard A2C RL agent when interaction is
detected. Experimentally, various RL algorithms obtain significant improvement
in performance and training speed when assisted by our design.
[LINK]
http://arxiv.org/abs/2302.04449v4
[DATE]
2024-07-21 04:16:44+08:00
[CATEGORIES]
cs.LG
cs.CL
Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models
[AUTHORS]
Md Zarif Hossain, Ahmed Imteaj
[ABSTRACT]
Vision-language models (VLMs) have made significant strides in recent
times, especially in multimodal tasks, yet they remain susceptible to adversarial
attacks on their vision components. To address this, we propose Sim-CLIP, an
unsupervised adversarial fine-tuning method that enhances the robustness of the
widely-used CLIP vision encoder against such attacks while maintaining semantic
richness and specificity. By employing a Siamese architecture with cosine
similarity loss, Sim-CLIP learns semantically meaningful and attack-resilient
visual representations without requiring large batch sizes or momentum
encoders. Our results demonstrate that VLMs enhanced with Sim-CLIP’s fine-tuned
CLIP encoder exhibit significantly enhanced robustness against adversarial
attacks, while preserving semantic meaning of the perturbed images. Notably,
Sim-CLIP does not require additional training or fine-tuning of the VLM itself;
replacing the original vision encoder with our fine-tuned Sim-CLIP suffices to
provide robustness. This work underscores the significance of reinforcing
foundational models like CLIP to safeguard the reliability of downstream VLM
applications, paving the way for more secure and effective multimodal systems.
[LINK]
http://arxiv.org/abs/2407.14971v1
[DATE]
2024-07-21 03:53:52+08:00
[CATEGORIES]
cs.CL
cs.LG
Efficient Pre-training for Localized Instruction Generation of Videos
[AUTHORS]
Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
[ABSTRACT]
Procedural videos, exemplified by recipe demonstrations, are instrumental in
conveying step-by-step instructions. However, understanding such videos is
challenging as it involves the precise localization of steps and the generation
of textual instructions. Manually annotating steps and writing instructions is
costly, which limits the size of current datasets and hinders effective
learning. Leveraging large but noisy video-transcript datasets for pre-training
can boost performance but demands significant computational resources.
Furthermore, transcripts contain irrelevant content and differ in style from
human-written instructions. To mitigate these issues, we propose a novel
technique, Sieve-&-Swap, to automatically generate high-quality training data
for the recipe domain: (i) Sieve: filters irrelevant transcripts and (ii) Swap:
acquires high-quality text by replacing transcripts with human-written
instruction from a text-only recipe dataset. The resulting dataset is three
orders of magnitude smaller than current web-scale datasets but enables
efficient training of large-scale models. Alongside Sieve-&-Swap, we propose
Procedure Transformer (ProcX), a model for end-to-end step localization and
instruction generation for procedural videos. When pre-trained on our curated
dataset, this model achieves state-of-the-art performance on YouCook2 and Tasty
while using a fraction of the training data. We have released code and dataset.
[COMMENTS]
ECCV 2024
[LINK]
http://arxiv.org/abs/2311.15964v4
[DATE]
2024-07-21 01:55:37+08:00
[CATEGORIES]
cs.CL
cs.LG
HyperbolicLR: Epoch insensitive learning rate scheduler
[AUTHORS]
Tae-Geun Kim
[ABSTRACT]
This study proposes two novel learning rate schedulers: the Hyperbolic
Learning Rate Scheduler (HyperbolicLR) and the Exponential Hyperbolic Learning
Rate Scheduler (ExpHyperbolicLR). These schedulers attempt to address the
inconsistent learning curves often observed in conventional schedulers when
adjusting the number of epochs. By leveraging the asymptotic behavior of
hyperbolic curves, the proposed schedulers maintain more consistent learning
curves across varying epoch settings. The HyperbolicLR algorithm directly
applies this property to the epoch-learning rate space, while the
ExpHyperbolicLR maps this concept onto the exponential space of epochs and
learning rates. To evaluate the performance of these schedulers, first we found
the optimal hyperparameters for each scheduler on a small number of epochs,
fixed these values, and compared their performance as the number of epochs
increased. Our experimental results on various deep learning tasks and
architectures demonstrate that both HyperbolicLR and ExpHyperbolicLR maintain
more consistent performance improvements compared to conventional schedulers as
the number of epochs increases. These findings suggest that our
hyperbolic-based learning rate schedulers offer a more robust and efficient
approach to training deep neural networks, especially in scenarios where
computational resources or time constraints limit extensive hyperparameter
searches.
[COMMENTS]
26 pages, 7 figures
[LINK]
http://arxiv.org/abs/2407.15200v1
[DATE]
2024-07-21 23:43:52+08:00
[CATEGORIES]
cs.LG
CoLoRA: Continuous low-rank adaptation for reduced implicit neural modeling of parameterized partial differential equations
[AUTHORS]
Jules Berman, Benjamin Peherstorfer
[ABSTRACT]
This work introduces reduced models based on Continuous Low Rank Adaptation
(CoLoRA) that pre-train neural networks for a given partial differential
equation and then continuously adapt low-rank weights in time to rapidly
predict the evolution of solution fields at new physics parameters and new
initial conditions. The adaptation can be either purely data-driven or via an
equation-driven variational approach that provides Galerkin-optimal
approximations. Because CoLoRA approximates solution fields locally in time,
the rank of the weights can be kept small, which means that only a few training
trajectories are required offline so that CoLoRA is well suited for data-scarce
regimes. Predictions with CoLoRA are orders of magnitude faster than with
classical methods and their accuracy and parameter efficiency are higher
compared to other neural network approaches.
[LINK]
http://arxiv.org/abs/2402.14646v2
[DATE]
2024-07-21 23:11:04+08:00
[CATEGORIES]
cs.LG
Generalizing Trilateration: Approximate Maximum Likelihood Estimator for Initial Orbit Determination in Low-Earth Orbit
[AUTHORS]
Ricardo Ferreira, Filipa Valdeira, Marta Guimarães, Cláudia Soares
[ABSTRACT]
With the increase in the number of active satellites and space debris in
orbit, the problem of initial orbit determination (IOD) becomes increasingly
important, demanding a high accuracy. Over the years, different approaches have
been presented such as filtering methods (for example, Extended Kalman Filter),
differential algebra or solving Lambert’s problem. In this work, we consider a
setting of three monostatic radars, where all available measurements are taken
approximately at the same instant. This follows a similar setting as
trilateration, a state-of-the-art approach, where each radar is able to obtain
a single measurement of range and range-rate. Differently, and due to advances
in Multiple-Input Multiple-Output (MIMO) radars, we assume that each location
is able to obtain a larger set of range, angle and Doppler shift measurements.
Thus, our method can be understood as an extension of trilateration leveraging
more recent technology and incorporating additional data. We formulate the
problem as a Maximum Likelihood Estimator (MLE), which for some number of
observations is asymptotically unbiased and asymptotically efficient. Through
numerical experiments, we demonstrate that our method attains the same accuracy
as the trilateration method for the same number of measurements and offers an
alternative and generalization, returning a more accurate estimation of the
satellite’s state vector, as the number of available measurements increases.
[LINK]
http://arxiv.org/abs/2407.15180v1
[DATE]
2024-07-21 22:37:24+08:00
[CATEGORIES]
cs.LG
${\it Asparagus}$: A Toolkit for Autonomous, User-Guided Construction of Machine-Learned Potential Energy Surfaces
[AUTHORS]
Kai Töpfer, Luis Itza Vazquez-Salazar, Markus Meuwly
[ABSTRACT]
With the establishment of machine learning (ML) techniques in the scientific
community, the construction of ML potential energy surfaces (ML-PES) has become
a standard process in physics and chemistry. So far, improvements in the
construction of ML-PES models have been conducted independently, creating an
initial hurdle for new users to overcome and complicating the reproducibility
of results. Aiming to reduce the bar for the extensive use of ML-PES, we
introduce ${\it Asparagus}$, a software package encompassing the different
parts into one coherent implementation that allows an autonomous, user-guided
construction of ML-PES models. ${\it Asparagus}$ combines capabilities of
initial data sampling with interfaces to ${\it ab initio}$ calculation
programs, ML model training, as well as model evaluation
and its application within other codes such as ASE or CHARMM. The
functionalities of the code are illustrated in different examples, including
the dynamics of small molecules, the representation of reactive potentials in
organometallic compounds, and atom diffusion on periodic surface structures.
The modular framework of ${\it Asparagus}$ is designed to allow simple
implementations of further ML-related methods and models to provide constant
user-friendly access to state-of-the-art ML techniques.
[LINK]
http://arxiv.org/abs/2407.15175v1
[DATE]
2024-07-21 22:22:47+08:00
[CATEGORIES]
cs.LG
TADA: Temporal Adversarial Data Augmentation for Time Series Data
[AUTHORS]
Byeong Tak Lee, Joon-myoung Kwon, Yong-Yeon Jo
[ABSTRACT]
Domain generalization involves training machine learning models to perform
robustly on unseen samples from out-of-distribution datasets. Adversarial Data
Augmentation (ADA) is a commonly used approach that enhances model adaptability
by incorporating synthetic samples, designed to simulate potential unseen
samples. While ADA effectively addresses amplitude-related distribution shifts,
it falls short in managing temporal shifts, which are essential for time series
data. To address this limitation, we propose the Temporal Adversarial Data
Augmentation for Time Series Data (TADA), which incorporates a time warping
technique specifically targeting temporal shifts. Recognizing the challenge of
non-differentiability in traditional time warping, we make it differentiable by
leveraging phase shifts in the frequency domain. Our evaluations across diverse
domains demonstrate that TADA significantly outperforms existing ADA variants,
enhancing model performance across time series datasets with varied
distributions.
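
As a rough illustration of the phase-shift idea mentioned above (not the authors' implementation), the sketch below delays a signal by a possibly fractional number of samples by multiplying its spectrum by a linear phase term. The function names `dft`, `idft`, and `phase_shift` are hypothetical, and the example applies a single uniform circular shift rather than a full time warp. Because the output depends smoothly on `tau`, a shift of this kind can in principle be optimized by gradient-based methods.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for short signals)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def phase_shift(signal, tau):
    """Circularly delay `signal` by `tau` samples; `tau` may be fractional."""
    n = len(signal)
    shifted = []
    for k, coeff in enumerate(dft(signal)):
        # Use signed frequency indices so a real input stays (nearly) real.
        freq = k if k <= n // 2 else k - n
        shifted.append(coeff * cmath.exp(-2j * math.pi * freq * tau / n))
    # Discard the residual imaginary part left by floating-point error.
    return [value.real for value in idft(shifted)]
```

For an integer `tau` the result matches an ordinary circular shift of the samples; fractional values interpolate between such shifts.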
[LINK]
http://arxiv.org/abs/2407.15174v1
[DATE]
2024-07-21 22:21:00+08:00
[CATEGORIES]
cs.LG
Adversarial Circuit Evaluation
[AUTHORS]
Niels uit de Bos, Adrià Garriga-Alonso
[ABSTRACT]
Circuits are supposed to accurately describe how a neural network performs a
specific task, but do they really? We evaluate three circuits found in the
literature (IOI, greater-than, and docstring) in an adversarial manner,
considering inputs where the circuit’s behavior maximally diverges from the
full model. Concretely, we measure the KL divergence between the full model’s
output and the circuit’s output, calculated through resample ablation, and we
analyze the worst-performing inputs. Our results show that the circuits for the
IOI and docstring tasks fail to behave similarly to the full model even on
completely benign inputs from the original task, indicating that more robust
circuits are needed for safety-critical applications.
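
The evaluation described above can be sketched as follows, assuming each model output is available as a categorical probability distribution over the same vocabulary. The helper names `kl_divergence` and `worst_inputs` are hypothetical, the direction KL(full || circuit) is an assumption (the abstract does not state it), and the resample-ablation step that produces the circuit outputs is not shown.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as probability lists."""
    # eps guards against log(0) when q assigns zero mass to an outcome.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def worst_inputs(full_outputs, circuit_outputs, top_k=1):
    """Return (input index, KL) pairs where the circuit diverges most."""
    scores = [
        (i, kl_divergence(p, q))
        for i, (p, q) in enumerate(zip(full_outputs, circuit_outputs))
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]
```

Ranking by the per-input divergence, rather than averaging it, is what surfaces the worst-performing inputs.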
[COMMENTS]
19 pages, 10 figures
[LINK]
http://arxiv.org/abs/2407.15166v1
[DATE]
2024-07-21 21:43:44+08:00
[CATEGORIES]
cs.LG
FFHFlow: A Flow-based Variational Approach for Multi-fingered Grasp Synthesis in Real Time
[AUTHORS]
Qian Feng, Jianxiang Feng, Zhaopeng Chen, Rudolph Triebel, Alois Knoll
[ABSTRACT]
Synthesizing diverse and accurate grasps with multi-fingered hands is an
important yet challenging task in robotics. Previous efforts focusing on
generative modeling have fallen short of precisely capturing the multi-modal,
high-dimensional grasp distribution. To address this, we propose exploiting a
special kind of Deep Generative Model (DGM) based on Normalizing Flows (NFs),
an expressive model for learning complex probability distributions.
Specifically, we first observed an encouraging improvement in diversity by
directly applying a single conditional NF (cNF), dubbed FFHFlow-cnf, to learn
a grasp distribution conditioned on the incomplete point cloud. However, we
also recognized limited performance gains due to restricted expressivity in the
latent space. This motivated us to develop a novel flow-based Deep Latent
Variable Model (DLVM), namely FFHFlow-lvm, which facilitates more reasonable
latent features, leading to both diverse and accurate grasp synthesis for
unseen objects. Unlike Variational Autoencoders (VAEs), the proposed DLVM
counteracts typical pitfalls such as mode collapse and mis-specified priors by
leveraging two cNFs for the prior and likelihood distributions, which are
usually restricted to being isotropic Gaussian. Comprehensive experiments in
simulation and real-robot scenarios demonstrate that our method generates more
accurate and diverse grasps than the VAE baselines. Additionally, a run-time
comparison is conducted to reveal its high potential for real-time
applications.
[COMMENTS]
The first two authors contributed equally; their ordering was decided by a
coin toss
[LINK]
http://arxiv.org/abs/2407.15161v1
[DATE]
2024-07-21 21:33:08+08:00
[CATEGORIES]
cs.LG
Studying How to Efficiently and Effectively Guide Models with Explanations
[AUTHORS]
Sukrut Rao, Moritz Böhle, Amin Parchami-Araghi, Bernt Schiele
[ABSTRACT]
Despite being highly performant, deep neural networks might base their
decisions on features that spuriously correlate with the provided labels, thus
hurting generalization. To mitigate this, ‘model guidance’ has recently gained
popularity, i.e. the idea of regularizing the models’ explanations to ensure
that they are “right for the right reasons”. While various techniques to
achieve such model guidance have been proposed, experimental validation of
these approaches has thus far been limited to relatively simple and / or
synthetic datasets. To better understand the effectiveness of the various
design choices that have been explored in the context of model guidance, in
this work we conduct an in-depth evaluation across various loss functions,
attribution methods, models, and ‘guidance depths’ on the PASCAL VOC 2007 and
MS COCO 2014 datasets. As annotation costs for model guidance can limit its
applicability, we also place a particular focus on efficiency. Specifically, we
guide the models via bounding box annotations, which are much cheaper to obtain
than the commonly used segmentation masks, and evaluate the robustness of model
guidance under limited (e.g. with only 1% of annotated images) or overly coarse
annotations. Further, we propose using the EPG score as an additional
evaluation metric and loss function (‘Energy loss’). We show that optimizing
for the Energy loss leads to models that exhibit a distinct focus on
object-specific features, despite only using bounding box annotations that also
include background regions. Lastly, we show that such model guidance can
improve generalization under distribution shifts. Code available at:
https://github.com/sukrutrao/Model-Guidance.
[COMMENTS]
41 pages, 38 figures, 4 tables, IEEE/CVF International Conference on
Computer Vision (ICCV) 2023
[LINK]
http://arxiv.org/abs/2303.11932v2
[DATE]
2024-07-21 20:55:08+08:00
[CATEGORIES]
cs.LG
Proximal Policy Distillation
[AUTHORS]
Giacomo Spigler
[ABSTRACT]
We introduce Proximal Policy Distillation (PPD), a novel policy distillation
method that integrates student-driven distillation and Proximal Policy
Optimization (PPO) to increase sample efficiency and to leverage the additional
rewards that the student policy collects during distillation. To assess the
efficacy of our method, we compare PPD with two common alternatives,
student-distill and teacher-distill, over a wide range of reinforcement
learning environments that include discrete actions and continuous control
(ATARI, Mujoco, and Procgen). For each environment and method, we perform
distillation to a set of target student neural networks that are smaller,
identical (self-distillation), or larger than the teacher network. Our findings
indicate that PPD improves sample efficiency and produces better student
policies compared to typical policy distillation approaches. Moreover, PPD
demonstrates greater robustness than alternative methods when distilling
policies from imperfect demonstrations. The code for the paper is released as
part of a new Python library built on top of stable-baselines3 to facilitate
policy distillation: ‘sb3-distill’.
[LINK]
http://arxiv.org/abs/2407.15134v1
[DATE]
2024-07-21 20:08:54+08:00
[CATEGORIES]
cs.LG
xLSTMTime: Long-term Time Series Forecasting With xLSTM
[AUTHORS]
Musleh Alharthi, Ausif Mahmood
[ABSTRACT]
In recent years, transformer-based models have gained prominence in
multivariate long-term time series forecasting (LTSF), demonstrating
significant advancements despite facing challenges such as high computational
demands, difficulty in capturing temporal dynamics, and managing long-term
dependencies. The emergence of LTSF-Linear, with its straightforward linear
architecture, has notably outperformed transformer-based counterparts,
prompting a reevaluation of the transformer’s utility in time series
forecasting. In response, this paper presents an adaptation of a recent
architecture termed extended LSTM (xLSTM) for LTSF. xLSTM incorporates
exponential gating and a revised memory structure with higher capacity that has
good potential for LTSF. Our adapted architecture for LTSF, termed xLSTMTime,
surpasses current approaches. We compare xLSTMTime’s performance against
various state-of-the-art models across multiple real-world datasets,
demonstrating superior forecasting capabilities. Our findings suggest that
refined recurrent architectures can offer competitive alternatives to
transformer-based models in LTSF tasks, potentially redefining the landscape
of time series forecasting.
[LINK]
http://arxiv.org/abs/2407.10240v2
[DATE]
2024-07-21 20:08:13+08:00
[CATEGORIES]
cs.LG
Deep multimodal saliency parcellation of cerebellar pathways: linking microstructure and individual function through explainable multitask learning
[AUTHORS]
Ari Tchetchenian, Leo Zekelman, Yuqian Chen, Jarrett Rushmore, Fan Zhang, Edward H. Yeterian, Nikos Makris, Yogesh Rathi, Erik Meijering, Yang Song, Lauren J. O’Donnell
[ABSTRACT]
Parcellation of human cerebellar pathways is essential for advancing our
understanding of the human brain. Existing diffusion MRI tractography
parcellation methods have been successful in defining major cerebellar fibre
tracts, while relying solely on fibre tract structure. However, each fibre
tract may relay information related to multiple cognitive and motor functions
of the cerebellum. Hence, it may be beneficial for parcellation to consider the
potential importance of the fibre tracts for individual motor and cognitive
functional performance measures. In this work, we propose a multimodal
data-driven method for cerebellar pathway parcellation, which incorporates both
measures of microstructure and connectivity, and measures of individual
functional performance. Our method involves first training a multitask deep
network to predict various cognitive and motor measures from a set of fibre
tract structural features. The importance of each structural feature for
predicting each functional measure is then computed, resulting in a set of
structure-function saliency values that are clustered to parcellate cerebellar
pathways. We refer to our method as Deep Multimodal Saliency Parcellation
(DeepMSP), as it computes the saliency of structural measures for predicting
cognitive and motor functional performance, with these saliencies being applied
to the task of parcellation. Applying DeepMSP we found that it was feasible to
identify multiple cerebellar pathway parcels with unique structure-function
saliency patterns that were stable across training folds.
[LINK]
http://arxiv.org/abs/2407.15132v1
[DATE]
2024-07-21 20:05:45+08:00
[CATEGORIES]
cs.LG
Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation
[AUTHORS]
Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim
[ABSTRACT]
The attention mechanism in text generation is memory-bounded due to its
sequential characteristics. Therefore, off-chip memory accesses should be
minimized for faster execution. Although previous methods addressed this by
pruning unimportant tokens, they fall short in selectively removing tokens with
near-zero attention probabilities in each instance. Our method estimates the
probability before the softmax function, effectively removing low probability
tokens and achieving a 12.1x pruning ratio without fine-tuning. Additionally,
we present a hardware design supporting seamless on-demand off-chip access. Our
approach reduces memory accesses by 2.6x, leading to an average 2.3x speedup
and a 2.4x improvement in energy efficiency.
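
One simple way to see why tokens can be discarded before the softmax is evaluated: a token's probability satisfies p_i <= exp(s_i - s_max), so any token whose score lies more than log(1/threshold) below the maximum is guaranteed to fall under the threshold. The sketch below illustrates only this bound; the names `prune_tokens` and `softmax` and the default threshold are assumptions, and the paper's actual estimator and hardware design are not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax, used here only to check the bound."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prune_tokens(scores, threshold=1e-3):
    """Return indices of tokens that may still reach `threshold` probability.

    Relies on p_i <= exp(s_i - max(scores)): a token whose score gap to the
    maximum is below log(threshold) can never reach the threshold.
    """
    top = max(scores)
    log_threshold = math.log(threshold)
    return [i for i, s in enumerate(scores) if s - top >= log_threshold]
```

The test is conservative: every pruned token is truly below the threshold, while some kept tokens may still turn out to be below it once the full softmax is computed.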
[COMMENTS]
To appear in the proceedings of 61st Design Automation Conference
(DAC)
[LINK]
http://arxiv.org/abs/2407.15131v1
[DATE]
2024-07-21 19:56:54+08:00
[CATEGORIES]
cs.LG
D-Flow: Differentiating through Flows for Controlled Generation
[AUTHORS]
Heli Ben-Hamu, Omri Puny, Itai Gat, Brian Karrer, Uriel Singer, Yaron Lipman
[ABSTRACT]
Taming the generation outcome of state of the art Diffusion and Flow-Matching
(FM) models without having to re-train a task-specific model unlocks a powerful
tool for solving inverse problems, conditional generation, and controlled
generation in general. In this work we introduce D-Flow, a simple framework for
controlling the generation process by differentiating through the flow,
optimizing for the source (noise) point. We motivate this framework by our key
observation stating that for Diffusion/FM models trained with Gaussian
probability paths, differentiating through the generation process projects
the gradient onto the data manifold, implicitly injecting the prior into the
optimization process. We validate our framework on linear and non-linear
controlled generation problems including: image and audio inverse problems and
conditional molecule generation reaching state of the art performance across
all.
[COMMENTS]
ICML 2024
[LINK]
http://arxiv.org/abs/2402.14017v2
[DATE]
2024-07-21 19:19:38+08:00
[CATEGORIES]
cs.LG
BAFFLE: A Baseline of Backpropagation-Free Federated Learning
[AUTHORS]
Haozhe Feng, Tianyu Pang, Chao Du, Wei Chen, Shuicheng Yan, Min Lin
[ABSTRACT]
Federated learning (FL) is a general principle for decentralized clients to
train a server model collectively without sharing local data. FL is a promising
framework with practical applications, but its standard training paradigm
requires the clients to backpropagate through the model to compute gradients.
Since these clients are typically edge devices and not fully trusted, executing
backpropagation on them incurs computational and storage overhead as well as
white-box vulnerability. In light of this, we develop backpropagation-free
federated learning, dubbed BAFFLE, in which backpropagation is replaced by
multiple forward processes to estimate gradients. BAFFLE is 1) memory-efficient
and easily fits uploading bandwidth; 2) compatible with inference-only hardware
optimization and model quantization or pruning; and 3) well-suited to trusted
execution environments, because the clients in BAFFLE only execute forward
propagation and return a set of scalars to the server. Empirically we use
BAFFLE to train deep models from scratch or to finetune pretrained models,
achieving acceptable results. Code is available at
https://github.com/FengHZ/BAFFLE.
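
To make the idea of estimating gradients from forward passes alone more concrete, here is a minimal sketch of a generic zeroth-order estimator based on random Gaussian perturbations. It is an assumption for illustration, not necessarily the estimator used in BAFFLE; the function name `forward_only_gradient` and its parameters are hypothetical. Each perturbation contributes a single scalar loss difference, which is the kind of quantity a client could report without running backpropagation.

```python
import random


def forward_only_gradient(loss_fn, weights, num_samples=20000, sigma=1e-3, seed=0):
    """Estimate the gradient of `loss_fn` at `weights` using forward passes only."""
    rng = random.Random(seed)
    base = loss_fn(weights)
    grad = [0.0] * len(weights)
    for _ in range(num_samples):
        # Draw a random direction and evaluate the loss at the perturbed point.
        delta = [rng.gauss(0.0, 1.0) for _ in weights]
        perturbed = [w + sigma * d for w, d in zip(weights, delta)]
        # One scalar per perturbation: the finite-difference loss change.
        scalar = (loss_fn(perturbed) - base) / sigma
        for i, d in enumerate(delta):
            grad[i] += scalar * d / num_samples
    return grad
```

On a quadratic loss such as `sum(w * w for w in weights)` the estimate approaches `2 * w` per coordinate as `num_samples` grows; the variance of such estimators increases with the number of parameters, which is why many forward passes are needed.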
[COMMENTS]
ECCV 2024
[LINK]
http://arxiv.org/abs/2301.12195v3
[DATE]
2024-07-21 19:01:00+08:00
[CATEGORIES]
cs.LG
Practical multi-fidelity machine learning: fusion of deterministic and Bayesian models
[AUTHORS]
Jiaxiang Yi, Ji Cheng, Miguel A. Bessa
[ABSTRACT]
Multi-fidelity machine learning methods address the accuracy-efficiency
trade-off by integrating scarce, resource-intensive high-fidelity data with
abundant but less accurate low-fidelity data. We propose a practical
multi-fidelity strategy for problems spanning low- and high-dimensional
domains, integrating a non-probabilistic regression model for the low-fidelity
with a Bayesian model for the high-fidelity. The models are trained in a
staggered scheme, where the low-fidelity model is transfer-learned to the
high-fidelity data and a Bayesian model is trained for the residual. This
three-model strategy – deterministic low-fidelity, transfer learning, and
Bayesian residual – leads to a prediction that includes uncertainty
quantification both for noisy and noiseless multi-fidelity data. The strategy
is general and unifies the topic, highlighting the expressivity trade-off
between the transfer-learning and Bayesian models (a complex transfer-learning
model leads to a simpler Bayesian model, and vice versa). We propose modeling
choices for two scenarios, and argue in favor of using a linear
transfer-learning model that fuses 1) kernel ridge regression for low-fidelity
with Gaussian processes for high-fidelity; or 2) deep neural network for
low-fidelity with a Bayesian neural network for high-fidelity. We demonstrate
the effectiveness and efficiency of the proposed strategies and contrast them
with the state-of-the-art based on various numerical examples. The simplicity
of these formulations makes them practical for a broad scope of future
engineering applications.
[COMMENTS]
33 Pages, 21 Figures
[LINK]
http://arxiv.org/abs/2407.15110v1
[DATE]
2024-07-21 18:40:50+08:00
[CATEGORIES]
cs.LG
Distributed Gradient Descent for Functional Learning
[AUTHORS]
Zhan Yu, Jun Fan, Zhongjie Shi, Ding-Xuan Zhou
[ABSTRACT]
In recent years, different types of distributed and parallel learning schemes
have received increasing attention for their strong advantages in handling
large-scale data information. In the information era, to face the big data
challenges that have recently emerged in functional data analysis, we propose
a novel distributed gradient descent functional learning (DGDFL) algorithm to
tackle functional data across numerous local machines (processors) in the
framework of reproducing kernel Hilbert space. Based on integral operator
approaches, we provide the first theoretical understanding of the DGDFL
algorithm in many different aspects of the literature. On the way of
understanding DGDFL, firstly, a data-based gradient descent functional learning
(GDFL) algorithm associated with a single-machine model is proposed and
comprehensively studied. Under mild conditions, confidence-based optimal
learning rates of DGDFL are obtained without the saturation boundary on the
regularity index suffered in previous works in functional regression. We
further provide a semi-supervised DGDFL approach to weaken the restriction on
the maximal number of local machines to ensure optimal rates. To the best of
our knowledge, DGDFL provides the first divide-and-conquer iterative training
approach to functional learning based on data samples of intrinsically
infinite-dimensional random functions (functional covariates) and enriches the
methodologies for functional data analysis.
[COMMENTS]
48 pages
[LINK]
http://arxiv.org/abs/2305.07408v3
[DATE]
2024-07-21 18:18:39+08:00
[CATEGORIES]
cs.LG
GLOP: Learning Global Partition and Local Construction for Solving Large-scale Routing Problems in Real-time
[AUTHORS]
Haoran Ye, Jiarui Wang, Helan Liang, Zhiguang Cao, Yong Li, Fanzhang Li
[ABSTRACT]
The recent end-to-end neural solvers have shown promise for small-scale
routing problems but suffered from limited real-time scaling-up performance.
This paper proposes GLOP (Global and Local Optimization Policies), a unified
hierarchical framework that efficiently scales toward large-scale routing
problems. GLOP partitions large routing problems into Travelling Salesman
Problems (TSPs) and TSPs into Shortest Hamiltonian Path Problems. For the first
time, we hybridize non-autoregressive neural heuristics for coarse-grained
problem partitions and autoregressive neural heuristics for fine-grained route
constructions, leveraging the scalability of the former and the meticulousness
of the latter. Experimental results show that GLOP achieves competitive and
state-of-the-art real-time performance on large-scale routing problems,
including TSP, ATSP, CVRP, and PCTSP.
[COMMENTS]
Accepted at AAAI 2024
[LINK]
http://arxiv.org/abs/2312.08224v2
[DATE]
2024-07-21 18:18:30+08:00
[CATEGORIES]
cs.LG
Boltzmann machine learning and regularization methods for inferring evolutionary fields and couplings from a multiple sequence alignment
[AUTHORS]
Sanzo Miyazawa
[ABSTRACT]
The inverse Potts problem, inferring a Boltzmann distribution for homologous
protein sequences from their single-site and pairwise amino acid frequencies,
has recently attracted a great deal of attention in studies of protein structure
and evolution. We study regularization and learning methods and how to tune
regularization parameters to correctly infer interactions in Boltzmann machine
learning. Using $L_2$ regularization for fields, group $L_1$ for couplings is
shown to be very effective for sparse couplings in comparison with $L_2$ and
$L_1$. Two regularization parameters are tuned to yield equal values for both
the sample and ensemble averages of evolutionary energy. Both averages smoothly
change and converge, but their learning profiles are very different between
learning methods. The Adam method is modified to make stepsize proportional to
the gradient for sparse couplings. It is shown by first inferring interactions
from protein sequences and then from Monte Carlo samples that the fields and
couplings can be well recovered, but that recovering the pairwise correlations
in the resolution of a total energy is harder for the natural proteins than for
the protein-like sequences. The selective temperature for folding/structural
constraints in protein evolution is also estimated.
[COMMENTS]
In arXiv:1909.05006v3 the values of selective temperature for protein
PF00153, $T_s$ in Table 5 and in the section 2.8, and folding free energy for
PF00595, and in the v4 the method for soft-thresholding were corrected; shown
in red. The v2 was published in the IEEE/ACM Transactions on Computational
Biology and Bioinformatics. The program is available from
https://gitlab.com/sanzo.miyazawa/BM/
[LINK]
http://arxiv.org/abs/1909.05006v5
[DATE]
2024-07-21 17:39:03+08:00
[CATEGORIES]
cs.LG
Improving Prediction of Need for Mechanical Ventilation using Cross-Attention
[AUTHORS]
Anwesh Mohanty, Supreeth P. Shashikumar, Jonathan Y. Lam, Shamim Nemati
[ABSTRACT]
In the intensive care unit, the capability to predict the need for mechanical
ventilation (MV) facilitates more timely interventions to improve patient
outcomes. Recent works have demonstrated good performance in this task
utilizing machine learning models. This paper explores the novel application of
a deep learning model with multi-head attention (FFNN-MHA) to make more
accurate MV predictions and reduce false positives by learning personalized
contextual information of individual patients. Utilizing the publicly available
MIMIC-IV dataset, FFNN-MHA demonstrates an improvement of 0.0379 in AUC and a
17.8% decrease in false positives compared to baseline models such as
feed-forward neural networks. Our results highlight the potential of the
FFNN-MHA model as an effective tool for accurate prediction of the need for
mechanical ventilation in critical care settings.
[LINK]
http://arxiv.org/abs/2407.15885v1
[DATE]
2024-07-21 17:37:30+08:00
[CATEGORIES]
cs.LG
SeqMIA: Sequential-Metric Based Membership Inference Attack
[AUTHORS]
Hao Li, Zheng Li, Siyuan Wu, Chengrui Hu, Yutong Ye, Min Zhang, Dengguo Feng, Yang Zhang
[ABSTRACT]
Most existing membership inference attacks (MIAs) utilize metrics (e.g.,
loss) calculated on the model’s final state, while recent advanced attacks
leverage metrics computed at various stages, including both intermediate and
final stages, throughout the model training. Nevertheless, these attacks often
process multiple intermediate states of the metric independently, ignoring
their time-dependent patterns. Consequently, they struggle to effectively
distinguish between members and non-members who exhibit similar metric values,
particularly resulting in a high false-positive rate.
In this study, we delve deeper into the new membership signals in the
black-box scenario. We identify a new, more integrated membership signal: the
Pattern of Metric Sequence, derived from the various stages of model training.
We contend that current signals provide only partial perspectives of this new
signal: the new one encompasses both the model’s multiple intermediate and
final states, with a greater emphasis on temporal patterns among them. Building
upon this signal, we introduce a novel attack method called Sequential-metric
based Membership Inference Attack (SeqMIA). Specifically, we utilize knowledge
distillation to obtain a set of distilled models representing various stages of
the target model’s training. We then assess multiple metrics on these distilled
models in chronological order, creating distilled metric sequences. Finally, we
integrate the distilled multi-metric sequences as a sequential multiformat and
employ an attention-based RNN attack model for inference. Empirical results
show that SeqMIA outperforms all baselines, notably achieving an order of
magnitude improvement in terms of TPR @ 0.1% FPR. Furthermore, we delve into
the reasons why this signal contributes to SeqMIA’s high attack performance,
and assess various defense mechanisms against SeqMIA.
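The metric quoted above, TPR @ 0.1% FPR, can be sketched as follows. This is a minimal illustration, not code from SeqMIA: the function name `tpr_at_fpr` and the score arrays are hypothetical, and it assumes that higher attack scores mean "more likely a member".

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.001):
    """Return (tpr, threshold) such that at most `target_fpr` of the
    non-members are flagged as members (hypothetical helper)."""
    member_scores = np.asarray(member_scores, dtype=float)
    nonmember_desc = np.sort(np.asarray(nonmember_scores, dtype=float))[::-1]
    # Number of non-members we are allowed to misclassify as members.
    allowed_fp = int(target_fpr * len(nonmember_desc) + 1e-9)
    if allowed_fp >= len(nonmember_desc):
        threshold = -np.inf
    else:
        # Scores strictly above this value are predicted "member", so at
        # most `allowed_fp` non-members can exceed it.
        threshold = nonmember_desc[allowed_fp]
    tpr = float(np.mean(member_scores > threshold))
    return tpr, threshold

if __name__ == "__main__":
    nonmembers = np.arange(1000) / 1000.0
    members = np.array([0.9985, 0.9995, 0.5, 0.2])
    print(tpr_at_fpr(members, nonmembers, target_fpr=0.001))
```

Reporting the true-positive rate at a fixed, very low false-positive rate emphasizes confident identifications, which is where the abstract says earlier attacks struggle.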
[COMMENTS]
Accepted by ACM CCS 2024
[LINK]
http://arxiv.org/abs/2407.15098v1
[DATE]
2024-07-21 17:11:08+08:00
[CATEGORIES]
cs.LG
Learning Physics for Unveiling Hidden Earthquake Ground Motions via Conditional Generative Modeling
[AUTHORS]
Pu Ren, Rie Nakata, Maxime Lacour, Ilan Naiman, Nori Nakata, Jialin Song, Zhengfa Bi, Osman Asif Malik, Dmitriy Morozov, Omri Azencot, N. Benjamin Erichson, Michael W. Mahoney
[ABSTRACT]
Predicting high-fidelity ground motions for future earthquakes is crucial for
seismic hazard assessment and infrastructure resilience. Conventional empirical
simulations suffer from sparse sensor distribution and geographically localized
earthquake locations, while physics-based methods are computationally intensive
and require accurate representations of Earth structures and earthquake
sources. We propose a novel artificial intelligence (AI) simulator, Conditional
Generative Modeling for Ground Motion (CGM-GM), to synthesize high-frequency
and spatially continuous earthquake ground motion waveforms. CGM-GM leverages
earthquake magnitudes and geographic coordinates of earthquakes and sensors as
inputs, learning complex wave physics and Earth heterogeneities, without
explicit physics constraints. This is achieved through a probabilistic
autoencoder that captures latent distributions in the time-frequency domain and
variational sequential models for prior and posterior distributions. We
evaluate the performance of CGM-GM using small-magnitude earthquake records
from the San Francisco Bay Area, a region with high seismic risks. CGM-GM
demonstrates a strong potential for outperforming a state-of-the-art
non-ergodic empirical ground motion model and shows great promise in seismology
and beyond.
[LINK]
http://arxiv.org/abs/2407.15089v1
[DATE]
2024-07-21 16:23:37+08:00
[CATEGORIES]
cs.LG
Learning to Compile Programs to Neural Networks
[AUTHORS]
Logan Weber, Jesse Michel, Alex Renda, Michael Carbin
[ABSTRACT]
A $\textit{neural surrogate of a program}$ is a neural network that mimics
the behavior of a program. Researchers have used these neural surrogates to
automatically tune program inputs, adapt programs to new settings, and
accelerate computations. Researchers traditionally develop neural surrogates by
training on input-output examples from a single program. Alternatively,
language models trained on a large dataset including many programs can consume
program text, to act as a neural surrogate. Using a language model to both
generate a surrogate and act as a surrogate, however, leads to a trade-off
between resource consumption and accuracy. We present $\textit{neural surrogate
compilation}$, a technique for producing neural surrogates directly from
program text without coupling neural surrogate generation and execution. We
implement neural surrogate compilers using hypernetworks trained on a dataset
of C programs and find that they produce neural surrogates that are
$1.9$-$9.5\times$ as data-efficient, produce visual results that are
$1.0$-$1.3\times$ more similar to ground truth, and train in $4.3$-$7.3\times$
fewer epochs than neural surrogates trained from scratch.
[LINK]
http://arxiv.org/abs/2407.15078v1
[DATE]
2024-07-21 15:04:52+08:00
[CATEGORIES]
cs.LG
Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy
[AUTHORS]
Cameron Allen, Aaron Kirtland, Ruo Yu Tao, Sam Lobel, Daniel Scott, Nicholas Petrocelli, Omer Gottesman, Ronald Parr, Michael L. Littman, George Konidaris
[ABSTRACT]
Reinforcement learning algorithms typically rely on the assumption that the
environment dynamics and value function can be expressed in terms of a
Markovian state representation. However, when state information is only
partially observable, how can an agent learn such a state representation, and
how can it detect when it has found one? We introduce a metric that can
accomplish both objectives, without requiring access to, or knowledge of, an
underlying, unobservable state space. Our metric, the $\lambda$-discrepancy, is
the difference between two distinct temporal difference (TD) value estimates,
each computed using TD($\lambda$) with a different value of $\lambda$. Since
TD($\lambda$=0) makes an implicit Markov assumption and TD($\lambda$=1) does
not, a discrepancy between these estimates is a potential indicator of a
non-Markovian state representation. Indeed, we prove that the
$\lambda$-discrepancy is exactly zero for all Markov decision processes and
almost always non-zero for a broad class of partially observable environments.
We also demonstrate empirically that, once detected, minimizing the
$\lambda$-discrepancy can help with learning a memory function to mitigate the
corresponding partial observability. We then train a reinforcement learning
agent that simultaneously constructs two recurrent value networks with
different $\lambda$ parameters and minimizes the difference between them as an
auxiliary loss. The approach scales to challenging partially observable
domains, where the resulting agent frequently performs significantly better
(and never performs worse) than a baseline recurrent agent with only a single
value network.
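The idea behind the $\lambda$-discrepancy can be sketched on a small tabular example. This is an illustrative sketch and not the authors' implementation: the function name `lambda_discrepancy`, the three-state chain, and the choice to weight hidden states by their undiscounted expected visit counts are all assumptions made for the example. TD($\lambda$=1) is taken to be the Monte Carlo value averaged over the hidden states behind each observation, and TD($\lambda$=0) is the fixed point of bootstrapping through the observation-level model.

```python
import numpy as np

def lambda_discrepancy(P, r, Phi, start, gamma):
    """Return (v_td0, v_mc) over observations for a Markov reward process.

    P     : (n, n) state transition matrix; rows may sum to < 1 (termination)
    r     : (n,)   expected immediate reward in each state
    Phi   : (n, m) one-hot map from hidden states to observations
    start : (n,)   initial state distribution
    Assumes every episode terminates and every observation is reachable.
    """
    n, m = Phi.shape
    # Expected number of visits to each hidden state (assumed weighting).
    d = np.linalg.solve(np.eye(n) - P.T, start)
    # W[o, s]: probability of hidden state s given observation o.
    W = (Phi * d[:, None]).T
    W = W / W.sum(axis=1, keepdims=True)
    # TD(lambda=1): true state values, averaged within each observation.
    v_state = np.linalg.solve(np.eye(n) - gamma * P, r)
    v_mc = W @ v_state
    # TD(lambda=0): bootstrap as if observations were Markov states.
    P_obs = W @ P @ Phi
    r_obs = W @ r
    v_td0 = np.linalg.solve(np.eye(m) - gamma * P_obs, r_obs)
    return v_td0, v_mc

if __name__ == "__main__":
    # Chain s0 -> s1 -> s2 -> terminal, reward 1 on leaving s2.
    P = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
    r = np.array([0.0, 0.0, 1.0])
    start = np.array([1.0, 0.0, 0.0])
    aliased = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # s0, s2 look alike
    for name, Phi in [("markov", np.eye(3)), ("aliased", aliased)]:
        v_td0, v_mc = lambda_discrepancy(P, r, Phi, start, gamma=0.9)
        print(name, np.abs(v_td0 - v_mc).max())
```

With the identity observation map the two estimates coincide, while aliasing s0 and s2 produces a gap, which is the signal the abstract proposes to detect and then minimize.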
[COMMENTS]
GitHub URL: https://github.com/brownirl/lambda_discrepancy; Videos:
https://lambda-discrepancy.github.io/
[LINK]
http://arxiv.org/abs/2407.07333v2
[DATE]
2024-07-21 14:43:18+08:00
[CATEGORIES]
cs.LG
Trading Devil Final: Backdoor attack via Stock market and Bayesian Optimization
[AUTHORS]
Orson Mengara
[COMMENTS]
jumps-Diffusion and stock market: Better quantify uncertainty in
financial simulations
[LINK]
http://arxiv.org/abs/2407.14573v1
[DATE]
2024-07-21 14:27:45+08:00
[CATEGORIES]
cs.LG
Selective Amnesia: On Efficient, High-Fidelity and Blind Suppression of Backdoor Effects in Trojaned Machine Learning Models
[AUTHORS]
Rui Zhu, Di Tang, Siyuan Tang, XiaoFeng Wang, Haixu Tang
[ABSTRACT]
In this paper, we present a simple yet surprisingly effective technique to
induce “selective amnesia” on a backdoored model. Our approach, called SEAM,
is inspired by the problem of catastrophic forgetting (CF), a long-standing
issue in continual learning. Our idea is to retrain a given DNN model
on randomly labeled clean data to induce CF in the model, leading to
sudden forgetting of both the primary and backdoor tasks; we then recover the primary
task by retraining the randomized model on correctly labeled clean data. We
analyzed SEAM by modeling the unlearning process as continual learning and
further approximating a DNN using Neural Tangent Kernel for measuring CF. Our
analysis shows that our random-labeling approach actually maximizes the CF on
an unknown backdoor in the absence of triggered inputs, and also preserves some
feature extraction in the network to enable a fast revival of the primary task.
We further evaluated SEAM on both image processing and Natural Language
Processing tasks, under both data contamination and training manipulation
attacks, over thousands of models either trained on popular image datasets or
provided by the TrojAI competition. Our experiments show that SEAM vastly
outperforms the state-of-the-art unlearning techniques, achieving a high
Fidelity (measuring the gap between the accuracy of the primary task and that
of the backdoor) within a few minutes (about 30 times faster than training a
model from scratch using the MNIST dataset), with only a small amount of clean
data (0.1% of training data for TrojAI models).
[LINK]
http://arxiv.org/abs/2212.04687v2
[DATE]
2024-07-21 12:38:34+08:00
[CATEGORIES]
cs.LG
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts
[AUTHORS]
Yi Liu, Chengjun Cai, Xiaoli Zhang, Xingliang Yuan, Cong Wang
[ABSTRACT]
Large Vision Language Models (VLMs) extend and enhance the perceptual
abilities of Large Language Models (LLMs). Despite offering new possibilities
for LLM applications, these advancements raise significant security and ethical
concerns, particularly regarding the generation of harmful content. While LLMs
have undergone extensive security evaluations with the aid of red teaming
frameworks, VLMs currently lack a well-developed one. To fill this gap, we
introduce Arondight, a standardized red team framework tailored specifically
for VLMs. Arondight is dedicated to resolving issues related to the absence of
visual modality and inadequate diversity encountered when transitioning
existing red teaming methodologies from LLMs to VLMs. Our framework features an
automated multi-modal jailbreak attack, wherein visual jailbreak prompts are
produced by a red team VLM, and textual prompts are generated by a red team LLM
guided by a reinforcement learning agent. To enhance the comprehensiveness of
VLM security evaluation, we integrate entropy bonuses and novelty reward
metrics. These elements incentivize the RL agent to guide the red team LLM in
creating a wider array of diverse and previously unseen test cases. Our
evaluation of ten cutting-edge VLMs exposes significant security
vulnerabilities, particularly in generating toxic images and aligning
multi-modal prompts. In particular, Arondight achieves an average attack
success rate of 84.5% on GPT-4 across all fourteen prohibited scenarios defined by
OpenAI in terms of generating toxic text. For a clearer comparison, we also
categorize existing VLMs based on their safety levels and provide corresponding
reinforcement recommendations. Our multimodal prompt dataset and red team code
will be released after ethics committee approval. CONTENT WARNING: THIS PAPER
CONTAINS HARMFUL MODEL RESPONSES.
[COMMENTS]
To be published in ACM MM 2024
[LINK]
http://arxiv.org/abs/2407.15050v1
[DATE]
2024-07-21 12:37:11+08:00
[CATEGORIES]
cs.LG
CausalMed: Causality-Based Personalized Medication Recommendation Centered on Patient health state
[AUTHORS]
Xiang Li, Shunpan Liang, Yu Lei, Chen Li, Yulei Hou, Tengfei Ma
[ABSTRACT]
Medication recommendation systems are developed to recommend suitable
medications tailored to specific patients. Previous research has primarily focused
on learning medication representations, which has yielded notable advances.
However, these methods are limited in their ability to capture personalized patient
representations for two main reasons: (i) they are unable to capture
the differences in the impact of diseases/procedures on patients across various
patient health states; (ii) they fail to model the direct causal relationships
between medications and specific health states of patients, resulting in an
inability to determine which specific disease each medication is treating. To
address these limitations, we propose CausalMed, a patient health state-centric
model capable of enhancing the personalization of patient representations.
Specifically, CausalMed first captures the causal relationship between
diseases/procedures and medications through causal discovery and evaluates
their causal effects. Building upon this, CausalMed focuses on analyzing the
health state of patients, capturing the dynamic differences of
diseases/procedures in different health states of patients, and transforming
diseases/procedures into medications on direct causal relationships.
Ultimately, CausalMed integrates information from longitudinal visits to
recommend medication combinations. Extensive experiments on real-world datasets
show that our method learns more personalized patient representations and
outperforms state-of-the-art models in accuracy and safety.
[COMMENTS]
CIKM 2024 Full Research Paper
[LINK]
http://arxiv.org/abs/2404.12228v3
[DATE]
2024-07-21 11:55:46+08:00
[CATEGORIES]
cs.LG
Efficient Sampling for Data-Driven Frequency Stability Constraint via Forward-Mode Automatic Differentiation
[AUTHORS]
Wangkun Xu, Qian Chen, Pudong Ge, Zhongda Chu, Fei Teng
[ABSTRACT]
Encoding frequency stability constraints in the operation problem is
challenging due to its complex dynamics. Recently, data-driven approaches have
been proposed to learn the stability criteria offline with the trained model
embedded as a constraint of online optimization. However, random sampling of
stationary operation points is less efficient in generating balanced stable and
unstable samples. Meanwhile, the performance of such a model is strongly
dependent on the quality of the training dataset. Observing this research gap,
we propose a gradient-based data generation method via forward-mode automatic
differentiation. In this method, the original dynamic system is augmented with
new states that represent the dynamics of the sensitivities of the original states,
which can be solved by invoking any ODE solver a single time. To resolve
conflicts between the gradients of different frequency stability
criteria, gradient surgery is proposed, projecting one gradient onto the normal
plane of the other. Finally, we demonstrate the superior performance of the
proposed sampling algorithm compared with unrolling differentiation and
finite differences. All code is available at
https://github.com/xuwkk/frequency_sample_ad.
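The gradient-surgery step described above can be sketched as a projection between two gradient vectors. This is a minimal sketch and not taken from the linked repository: the function names are hypothetical, and projecting only when the inner product is negative is an assumption borrowed from the common PCGrad convention.

```python
import numpy as np

def project_conflicting(g, g_other):
    """Project g onto the normal plane of g_other when the two conflict.

    Two gradients are treated as conflicting when their inner product is
    negative (an assumed convention); otherwise g is returned unchanged.
    """
    g = np.asarray(g, dtype=float)
    g_other = np.asarray(g_other, dtype=float)
    dot = float(g @ g_other)
    sq_norm = float(g_other @ g_other)
    if dot >= 0.0 or sq_norm == 0.0:
        return g.copy()
    # Remove the component of g that points against g_other.
    return g - (dot / sq_norm) * g_other

def gradient_surgery(g1, g2):
    """Apply the projection symmetrically, using the original gradients."""
    return project_conflicting(g1, g2), project_conflicting(g2, g1)

if __name__ == "__main__":
    p1, p2 = gradient_surgery([1.0, 0.0], [-1.0, 1.0])
    print(p1, p2)
```

After the projection, each adjusted gradient is orthogonal to the other original gradient, so a step along it no longer works directly against the other stability criterion.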
[LINK]
http://arxiv.org/abs/2407.15045v1
[DATE]
2024-07-21 11:50:11+08:00
[CATEGORIES]
cs.LG
Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
[AUTHORS]
Haocheng Xi, Yuxiang Chen, Kang Zhao, Kai Jun Teh, Jianfei Chen, Jun Zhu
[ABSTRACT]
Pretraining transformers is generally time-consuming. Fully quantized
training (FQT) is a promising approach to speed up pretraining. However, most
FQT methods adopt a quantize-compute-dequantize procedure, which often leads to
suboptimal speedup and significant performance degradation when used in
transformers due to the high memory access overheads and low-precision
computations. In this work, we propose Jetfire, an efficient and accurate INT8
training method specific to transformers. Our method features an INT8 data flow
to optimize memory access and a per-block quantization method to maintain the
accuracy of pretrained transformers. Extensive experiments demonstrate that our
INT8 FQT method achieves comparable accuracy to the FP16 training baseline and
outperforms existing INT8 training methods for transformers. Moreover, for a
standard transformer block, our method offers an end-to-end training speedup of
1.42x and a 1.49x memory reduction compared to the FP16 baseline. Our code is
open sourced at https://github.com/thu-ml/Jetfire-INT8Training.
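Per-block quantization can be sketched in a few lines of NumPy. This is an illustration of the general technique only, not Jetfire's kernels: the function names, the symmetric absmax scaling, and the block size of 32 are assumptions made for the example.

```python
import numpy as np

def quantize_per_block(x, block=32):
    """Quantize a 2-D float array to INT8 with one scale per block x block tile.

    Returns (q, scale), where q has shape (rows/block, block, cols/block, block)
    and scale has shape (rows/block, 1, cols/block, 1).
    """
    rows, cols = x.shape
    assert rows % block == 0 and cols % block == 0, "shape must be divisible by block"
    xb = x.reshape(rows // block, block, cols // block, block)
    # Symmetric scaling: the largest magnitude in each tile maps to 127.
    absmax = np.abs(xb).max(axis=(1, 3), keepdims=True)
    scale = np.where(absmax > 0, absmax / 127.0, 1.0)
    q = np.clip(np.rint(xb / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_per_block(q, scale):
    """Invert quantize_per_block, returning a 2-D float32 array."""
    xb = q.astype(np.float32) * scale
    row_blocks, block, col_blocks, _ = xb.shape
    return xb.reshape(row_blocks * block, col_blocks * block).astype(np.float32)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 64)).astype(np.float32)
    x[0, 0] = 1000.0  # a single outlier
    for block in (32, 64):  # 64 means one scale for the whole matrix
        q, s = quantize_per_block(x, block)
        err = np.abs(dequantize_per_block(q, s) - x).mean()
        print(block, err)
```

Because each tile has its own scale, an outlier only degrades the precision of the tile that contains it, whereas a single per-tensor scale would coarsen every entry.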
[COMMENTS]
15 pages, 8 figures, 11 tables
[LINK]
http://arxiv.org/abs/2403.12422v2
[DATE]
2024-07-21 10:23:00+08:00
[CATEGORIES]
cs.LG
Invertible Coarse Graining with Physics-Informed Generative Artificial Intelligence
[AUTHORS]
Jun Zhang, Xiaohan Lin, Weinan E, Yi Qin Gao
[COMMENTS]
16 pages, 5 figures
[LINK]
http://arxiv.org/abs/2305.01243v2
[DATE]
2024-07-21 09:20:55+08:00
[CATEGORIES]
cs.LG
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
[AUTHORS]
Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, Chen Chen
[ABSTRACT]
To enhance the controllability of text-to-image diffusion models, existing
efforts like ControlNet incorporated image-based conditional controls. In this
paper, we reveal that existing methods still face significant challenges in
generating images that align with the image conditional controls. To this end,
we propose ControlNet++, a novel approach that improves controllable generation
by explicitly optimizing pixel-level cycle consistency between generated images
and conditional controls. Specifically, for an input conditional control, we
use a pre-trained discriminative reward model to extract the corresponding
condition of the generated images, and then optimize the consistency loss
between the input conditional control and extracted condition. A
straightforward implementation would generate images from random noise
and then calculate the consistency loss, but such an approach requires
storing gradients for multiple sampling timesteps, leading to considerable time
and memory costs. To address this, we introduce an efficient reward strategy
that deliberately disturbs the input images by adding noise, and then uses the
single-step denoised images for reward fine-tuning. This avoids the extensive
costs associated with image sampling, allowing for more efficient reward
fine-tuning. Extensive experiments show that ControlNet++ significantly
improves controllability under various conditional controls. For example, it
achieves improvements over ControlNet by 11.1% mIoU, 13.4% SSIM, and 7.6% RMSE,
respectively, for segmentation mask, line-art edge, and depth conditions. All
the code, models, demo, and organized data have been open sourced in our GitHub
repository.
[COMMENTS]
Camera Ready Version. Project Page:
https://liming-ai.github.io/ControlNet_Plus_Plus; Code & Data:
https://github.com/liming-ai/ControlNet_Plus_Plus
[LINK]
http://arxiv.org/abs/2404.07987v2
[DATE]
2024-07-21 08:38:35+08:00
[CATEGORIES]
cs.LG
Is Behavior Cloning All You Need? Understanding Horizon in Imitation Learning
[AUTHORS]
Dylan J. Foster, Adam Block, Dipendra Misra
[ABSTRACT]
Imitation learning (IL) aims to mimic the behavior of an expert in a
sequential decision making task by learning from demonstrations, and has been
widely applied to robotics, autonomous driving, and autoregressive text
generation. The simplest approach to IL, behavior cloning (BC), is thought to
incur sample complexity with unfavorable quadratic dependence on the problem
horizon, motivating a variety of different online algorithms that attain
improved linear horizon dependence under stronger assumptions on the data and
the learner’s access to the expert.
We revisit the apparent gap between offline and online IL from a
learning-theoretic perspective, with a focus on general policy classes up to
and including deep neural networks. Through a new analysis of behavior cloning
with the logarithmic loss, we show that it is possible to achieve
horizon-independent sample complexity in offline IL whenever (i) the range of
the cumulative payoffs is controlled, and (ii) an appropriate notion of
supervised learning complexity for the policy class is controlled. Specializing
our results to deterministic, stationary policies, we show that the gap between
offline and online IL is not fundamental: (i) it is possible to achieve linear
dependence on horizon in offline IL under dense rewards (matching what was
previously only known to be achievable in online IL); and (ii) without further
assumptions on the policy class, online IL cannot improve over offline IL with
the logarithmic loss, even in benign MDPs. We complement our theoretical
results with experiments on standard RL tasks and autoregressive language
generation to validate the practical relevance of our findings.
[LINK]
http://arxiv.org/abs/2407.15007v1
[DATE]
2024-07-21 07:31:56+08:00
[CATEGORIES]
cs.LG
All Against Some: Efficient Integration of Large Language Models for Message Passing in Graph Neural Networks
[AUTHORS]
Ajay Jaiswal, Nurendra Choudhary, Ravinarayana Adkathimar, Muthu P. Alagappan, Gaurush Hiranandani, Ying Ding, Zhangyang Wang, Edward W Huang, Karthik Subbian
[ABSTRACT]
Graph Neural Networks (GNNs) have attracted immense attention in the past
decade due to their numerous real-world applications built around
graph-structured data. On the other hand, Large Language Models (LLMs) with
extensive pretrained knowledge and powerful semantic comprehension abilities
have recently shown a remarkable ability to benefit applications using vision
and text data. In this paper, we investigate how LLMs can be leveraged in a
computationally efficient fashion to benefit rich graph-structured data, a
modality relatively unexplored in LLM literature. Prior works in this area
exploit LLMs to augment every node's features in an ad-hoc fashion (not scalable
for large graphs), use natural language to describe the complex structural
information of graphs, or perform computationally expensive finetuning of LLMs
in conjunction with GNNs. We propose E-LLaGNN (Efficient LLMs augmented GNNs),
a framework with an on-demand LLM service that enriches the message passing
procedure of graph learning by enhancing a limited fraction of nodes from the
graph. More specifically, E-LLaGNN relies on sampling high-quality
neighborhoods using LLMs, followed by on-demand neighborhood feature
enhancement using diverse prompts from our prompt catalog, and finally
information aggregation using message passing from conventional GNN
architectures. We explore several heuristics-based active node selection
strategies to limit the computational and memory footprint of LLMs when
handling millions of nodes. Through extensive experiments & ablation on popular
graph benchmarks of varying scales (Cora, PubMed, ArXiv, & Products), we
illustrate the effectiveness of our E-LLaGNN framework and reveal many
interesting capabilities such as improved gradient flow in deep GNNs, LLM-free
inference ability, etc.
[LINK]
http://arxiv.org/abs/2407.14996v1
[DATE]
2024-07-21 06:09:42+08:00
[CATEGORIES]
cs.LG
Gauges and Accelerated Optimization over Smooth and/or Strongly Convex Sets
[AUTHORS]
Ning Liu, Benjamin Grimmer
[ABSTRACT]
We consider feasibility and constrained optimization problems defined over
smooth and/or strongly convex sets. These notions mirror their popular function
counterparts but are much less explored in the first-order optimization
literature. We propose new scalable, projection-free, accelerated first-order
methods in these settings. Our methods avoid linear optimization or projection
oracles, only using cheap one-dimensional linesearches and normal vector
computations. Despite this, we derive optimal accelerated convergence
guarantees of $O(1/T)$ for strongly convex problems, $O(1/T^2)$ for smooth
problems, and accelerated linear convergence given both. Our algorithms and
analysis are based on novel characterizations of the Minkowski gauge of smooth
and/or strongly convex sets, which may be of independent interest: although the
gauge is neither smooth nor strongly convex, we show the gauge squared inherits
any structure present in the set.
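The Minkowski gauge mentioned above, $\gamma_S(x) = \inf\{\lambda > 0 : x \in \lambda S\}$, can be evaluated with a one-dimensional search given only a membership test. This is a minimal sketch rather than the paper's algorithm: the function name `gauge` and the `contains` oracle are hypothetical, and it assumes a closed convex set with the origin in its interior.

```python
import numpy as np

def gauge(x, contains, tol=1e-10):
    """Minkowski gauge of the set described by contains(y) -> bool.

    Assumes the set is closed and convex with the origin in its interior,
    so that x / lam lies in the set for all sufficiently large lam.
    """
    x = np.asarray(x, dtype=float)
    if not np.any(x):
        return 0.0
    # Grow the upper bound until x / hi is inside the set.
    hi = 1.0
    while not contains(x / hi):
        hi *= 2.0
    lo = 0.0
    # Bisect on the scaling factor: membership is monotone in lam.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if contains(x / mid):
            hi = mid
        else:
            lo = mid
    return hi

if __name__ == "__main__":
    A = np.diag([1.0, 4.0])
    in_ellipsoid = lambda y: float(y @ A @ y) <= 1.0
    x = np.array([3.0, 1.0])
    print(gauge(x, in_ellipsoid), np.sqrt(x @ A @ x))
```

For the ellipsoid $\{x : x^\top A x \le 1\}$ the gauge is $\sqrt{x^\top A x}$, so the gauge squared is the quadratic form $x^\top A x$, which is both smooth and strongly convex when $A$ is positive definite.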
[COMMENTS]
28 pages (45 pages with references and appendix)
[LINK]
http://arxiv.org/abs/2303.05037v3
[DATE]
2024-07-21 06:09:10+08:00
[CATEGORIES]
cs.LG
The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision
[AUTHORS]
Liv Gorton
[ABSTRACT]
Recent work on sparse autoencoders (SAEs) has shown promise in extracting
interpretable features from neural networks and addressing challenges with
polysemantic neurons caused by superposition. In this paper, we apply SAEs to
the early vision layers of InceptionV1, a well-studied convolutional neural
network, with a focus on curve detectors. Our results demonstrate that SAEs can
uncover new interpretable features not apparent from examining individual
neurons, including additional curve detectors that fill in previous gaps. We
also find that SAEs can decompose some polysemantic neurons into more
monosemantic constituent features. These findings suggest SAEs are a valuable
tool for understanding InceptionV1, and convolutional neural networks more
generally.
[COMMENTS]
Corrected typos
[LINK]
http://arxiv.org/abs/2406.03662v2
[DATE]
2024-07-21 05:32:28+08:00
[CATEGORIES]
cs.LG
Enhancing Microgrid Performance Prediction with Attention-based Deep Learning Models
[AUTHORS]
Vinod Kumar Maddineni, Naga Babu Koganti, Praveen Damacharla
[ABSTRACT]
In this research, an effort is made to address microgrid systems’ operational
challenges, characterized by power oscillations that eventually contribute to
grid instability. An integrated strategy is proposed, leveraging the strengths
of convolutional and Gated Recurrent Unit (GRU) layers. This approach is aimed
at effectively extracting temporal data from energy datasets to improve the
precision of microgrid behavior forecasts. Additionally, an attention layer is
employed to underscore significant features within the time-series data,
optimizing the forecasting process. The framework is anchored by a Multi-Layer
Perceptron (MLP) model, which is tasked with comprehensive load forecasting and
the identification of abnormal grid behaviors. Our methodology underwent
rigorous evaluation using the Micro-grid Tariff Assessment Tool dataset, with
Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient
of determination (r2-score) serving as the primary metrics. The approach
demonstrated exemplary performance, evidenced by a MAE of 0.39, RMSE of 0.28,
and an r2-score of 98.89% in load forecasting, along with near-perfect zero
state prediction accuracy (approximately 99.9%). Significantly outperforming
conventional machine learning models such as support vector regression and
random forest regression, our model’s streamlined architecture is particularly
suitable for real-time applications, thereby facilitating more effective and
reliable microgrid management.
[COMMENTS]
2024 11th International Conference on Information Technology,
Computer, and Electrical Engineering (ICITACEE)
[LINK]
http://arxiv.org/abs/2407.14984v1
[DATE]
2024-07-21 05:24:11+08:00
[CATEGORIES]
cs.LG
Technical report: Improving the properties of molecules generated by LIMO
[AUTHORS]
Vineet Thumuluri, Peter Eckmann, Michael K. Gilson, Rose Yu
[ABSTRACT]
This technical report investigates variants of the Latent Inceptionism on
Molecules (LIMO) framework to improve the properties of generated molecules. We
conduct ablative studies of molecular representation, decoder model, and
surrogate model training scheme. The experiments suggest that an autoregressive
Transformer decoder with GroupSELFIES achieves the best average properties for
the random generation task.
[COMMENTS]
9 pages, 2 figures
[LINK]
http://arxiv.org/abs/2407.14968v1
[DATE]
2024-07-21 03:30:38+08:00
[CATEGORIES]
cs.LG
Physics-informed active learning with simultaneous weak-form latent space dynamics identification
[AUTHORS]
Xiaolong He, April Tran, David M. Bortz, Youngsoo Choi
[ABSTRACT]
The parametric greedy latent space dynamics identification (gLaSDI) framework
has demonstrated promising potential for accurate and efficient modeling of
high-dimensional nonlinear physical systems. However, it remains challenging to
handle noisy data. To enhance robustness against noise, we incorporate the
weak-form estimation of nonlinear dynamics (WENDy) into gLaSDI. In the proposed
weak-form gLaSDI (WgLaSDI) framework, an autoencoder and WENDy are trained
simultaneously to discover intrinsic nonlinear latent-space dynamics of
high-dimensional data. Compared to the standard sparse identification of
nonlinear dynamics (SINDy) employed in gLaSDI, WENDy enables variance reduction
and robust latent space discovery, therefore leading to more accurate and
efficient reduced-order modeling. Furthermore, the greedy physics-informed
active learning in WgLaSDI enables adaptive sampling of optimal training data
on the fly for enhanced modeling accuracy. The effectiveness of the proposed
framework is demonstrated by modeling various nonlinear dynamical problems,
including viscous and inviscid Burgers’ equations, time-dependent radial
advection, and the Vlasov equation for plasma physics. With data that contains
5-10% Gaussian white noise, WgLaSDI outperforms gLaSDI by orders of magnitude,
achieving 1-7% relative errors. Compared with the high-fidelity models, WgLaSDI
achieves 121 to 1,779x speed-up.
[LINK]
http://arxiv.org/abs/2407.00337v2
[DATE]
2024-07-21 03:21:42+08:00
[CATEGORIES]
cs.LG
Constrained Decoding for Secure Code Generation
[AUTHORS]
Yanjun Fu, Ethan Baker, Yu Ding, Yizheng Chen
[ABSTRACT]
Code Large Language Models (Code LLMs) have been increasingly used by
developers to boost productivity, but they often generate vulnerable code.
Thus, there is an urgent need to ensure that code generated by Code LLMs is
correct and secure. Previous research has primarily focused on generating
secure code, overlooking the fact that secure code also needs to be correct.
This oversight can lead to a false sense of security. Currently, the community
lacks a method to measure actual progress in this area, and we need solutions
that address both security and correctness of code generation.
This paper introduces a new benchmark, CodeGuard+, along with two new
metrics, to measure Code LLMs’ ability to generate both secure and correct
code. Using our new evaluation methods, we show that the state-of-the-art
defense technique, prefix tuning, may not be as strong as previously believed,
since it generates secure code but sacrifices functional correctness. We also
demonstrate that different decoding methods significantly affect the security
of Code LLMs.
Furthermore, we explore a new defense direction: constrained decoding for
secure code generation. We propose new constrained decoding techniques to
generate secure code. Our results reveal that constrained decoding is more
effective than prefix tuning at improving the security of Code LLMs, without
requiring a specialized training dataset. Moreover, our evaluations over eight
state-of-the-art Code LLMs show that constrained decoding performs strongly
at improving the security of Code LLMs, and our technique outperforms
GPT-4.
[COMMENTS]
17 pages, 9 figures, our website is available at
https://codeguardplus.github.io
[LINK]
http://arxiv.org/abs/2405.00218v3
[DATE]
2024-07-21 03:14:03+08:00
[CATEGORIES]
cs.LG
Stochastic optimization with arbitrary recurrent data sampling
[AUTHORS]
William G. Powell, Hanbaek Lyu
[ABSTRACT]
To obtain an optimal first-order convergence guarantee for stochastic
optimization, it is necessary to use a recurrent data sampling algorithm that
samples every data point with sufficient frequency. Most commonly used data
sampling algorithms (e.g., i.i.d., MCMC, random reshuffling) are indeed
recurrent under mild assumptions. In this work, we show that for a particular
class of stochastic optimization algorithms, we do not need any other property
(e.g., independence, exponential mixing, and reshuffling) than recurrence in
data sampling algorithms to guarantee the optimal rate of first-order
convergence. Namely, using regularized versions of Minimization by Incremental
Surrogate Optimization (MISO), we show that for non-convex and possibly
non-smooth objective functions, the expected optimality gap converges at an
optimal rate $O(n^{-1/2})$ under general recurrent sampling schemes.
Furthermore, the implied constant depends explicitly on the ‘speed of
recurrence’, measured by the expected amount of time to visit a given data
point either averaged (‘target time’) or supremized (‘hitting time’) over the
current location. We demonstrate theoretically and empirically that convergence
can be accelerated by selecting sampling algorithms that cover the data set
most effectively. We discuss applications of our general framework to
decentralized optimization and distributed non-negative matrix factorization.
[COMMENTS]
39 pages, 4 figures, 1 table
[LINK]
http://arxiv.org/abs/2401.07694v2
[DATE]
2024-07-21 02:56:48+08:00
[CATEGORIES]
cs.LG
Addressing Data Heterogeneity in Federated Learning of Cox Proportional Hazards Models
[AUTHORS]
Navid Seidi, Satyaki Roy, Sajal K. Das, Ardhendu Tripathy
[ABSTRACT]
The diversity in disease profiles and therapeutic approaches between
hospitals and health professionals underscores the need for patient-centric
personalized strategies in healthcare. Alongside this, similarities in disease
progression across patients can be utilized to improve prediction models in
survival analysis. The need for patient privacy and the utility of prediction
models can be simultaneously addressed in the framework of Federated Learning
(FL). This paper outlines an approach in the domain of federated survival
analysis, specifically the Cox Proportional Hazards (CoxPH) model, with a
specific focus on mitigating data heterogeneity and elevating model
performance. We present an FL approach that employs feature-based clustering to
enhance model accuracy across synthetic datasets and real-world applications,
including the Surveillance, Epidemiology, and End Results (SEER) database.
Furthermore, we consider an event-based reporting strategy that provides a
dynamic approach to model adaptation by responding to local data changes. Our
experiments show the efficacy of our approach and discuss future directions for
a practical application of FL in healthcare.
[LINK]
http://arxiv.org/abs/2407.14960v1
[DATE]
2024-07-21 02:34:20+08:00
[CATEGORIES]
cs.LG
Strongly Isomorphic Neural Optimal Transport Across Incomparable Spaces
[AUTHORS]
Athina Sotiropoulou, David Alvarez-Melis
[ABSTRACT]
Optimal Transport (OT) has recently emerged as a powerful framework for
learning minimal-displacement maps between distributions. The predominant
approach involves a neural parametrization of the Monge formulation of OT,
typically assuming the same space for both distributions. However, the setting
across “incomparable spaces” (e.g., of different dimensionality),
corresponding to the Gromov-Wasserstein distance, remains underexplored, with
existing methods often imposing restrictive assumptions on the cost function.
In this paper, we present a novel neural formulation of the Gromov-Monge (GM)
problem rooted in one of its fundamental properties: invariance to strong
isomorphisms. We operationalize this property by decomposing the learnable OT
map into two components: (i) an approximate strong isomorphism between the
source distribution and an intermediate reference distribution, and (ii) a
GM-optimal map between this reference and the target distribution. Our
formulation leverages and extends the Monge gap regularizer of Uscidda & Cuturi
(2023) to eliminate the need for complex architectural requirements of other
neural OT methods, yielding a simple but practical method that enjoys favorable
theoretical guarantees. Our preliminary empirical results show that our
framework provides a promising approach to learn OT maps across diverse spaces.
[COMMENTS]
ICML 2024 Workshop on Geometry-grounded Representation Learning and
Generative Modeling
[LINK]
http://arxiv.org/abs/2407.14957v1
[DATE]
2024-07-21 02:27:11+08:00
[CATEGORIES]
cs.LG
Data Sharing for Mean Estimation Among Heterogeneous Strategic Agents
[AUTHORS]
Alex Clinton, Yiding Chen, Xiaojin Zhu, Kirthevasan Kandasamy
[ABSTRACT]
We study a collaborative learning problem where $m$ agents estimate a vector
$\mu\in\mathbb{R}^d$ by collecting samples from normal distributions, with each
agent $i$ incurring a cost $c_{i,k} \in (0, \infty]$ to sample from the
$k^{\text{th}}$ distribution $\mathcal{N}(\mu_k, \sigma^2)$. Instead of working
on their own, agents can collect data that is cheap to them, and share it with
others in exchange for data that is expensive or even inaccessible to them,
thereby simultaneously reducing data collection costs and estimation error.
However, when agents have different collection costs, we need to first decide
how to fairly divide the work of data collection so as to benefit all agents.
Moreover, in naive sharing protocols, strategic agents may under-collect and/or
fabricate data, leading to socially undesirable outcomes. Our mechanism
addresses these challenges by combining ideas from cooperative and
non-cooperative game theory. We use ideas from axiomatic bargaining to divide
the cost of data collection. Given such a solution, we develop a Nash
incentive-compatible (NIC) mechanism to enforce truthful reporting. We achieve
a $\mathcal{O}(\sqrt{m})$ approximation to the minimum social penalty (sum of
agent estimation errors and data collection costs) in the worst case, and a
$\mathcal{O}(1)$ approximation under favorable conditions. We complement this
with a hardness result, showing that $\Omega(\sqrt{m})$ is unavoidable in any
NIC mechanism.
[LINK]
http://arxiv.org/abs/2407.15881v1
[DATE]
2024-07-21 01:45:40+08:00
[CATEGORIES]
cs.LG
Dynamic Pricing and Learning with Long-term Reference Effects
[AUTHORS]
Shipra Agrawal, Wei Tang
[ABSTRACT]
We consider a dynamic pricing problem where customer response to the current
price is impacted by the customer price expectation, aka reference price. We
study a simple and novel reference price mechanism where the reference price is
the average of the past prices offered by the seller. As opposed to the more
commonly studied exponential smoothing mechanism, in our reference price
mechanism the prices offered by the seller have a longer-term effect on future
customer expectations.
We show that under this mechanism, a markdown policy is near-optimal
irrespective of the parameters of the model. This matches the common intuition
that a seller may be better off by starting with a higher price and then
decreasing it, as the customers feel like they are getting bargains on items
that are ordinarily more expensive. For linear demand models, we also provide a
detailed characterization of the near-optimal markdown policy along with an
efficient way of computing it.
We then consider a more challenging dynamic pricing and learning problem,
where the demand model parameters are a priori unknown, and the seller needs to
learn them online from the customers’ responses to the offered prices while
simultaneously optimizing revenue. The objective is to minimize regret, i.e.,
the $T$-round revenue loss compared to a clairvoyant optimal policy. This task
essentially amounts to learning a non-stationary optimal policy in a
time-variant Markov Decision Process (MDP). For linear demand models, we
provide an efficient learning algorithm with an optimal $\tilde{O}(\sqrt{T})$
regret upper bound.
[COMMENTS]
50 pages, two figures. One-page abstract appeared in EC’24
[LINK]
http://arxiv.org/abs/2402.12562v2
[DATE]
2024-07-21 00:14:42+08:00
[CATEGORIES]
cs.LG
BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs
[AUTHORS]
Zhiting Fan, Ruizhe Chen, Ruiling Xu, Zuozhu Liu
[ABSTRACT]
Evaluating the bias in Large Language Models (LLMs) becomes increasingly
crucial with their rapid development. However, existing evaluation methods rely
on fixed-form outputs and cannot adapt to the flexible open-text generation
scenarios of LLMs (e.g., sentence completion and question answering). To
address this, we introduce BiasAlert, a plug-and-play tool designed to detect
social bias in open-text generations of LLMs. BiasAlert integrates external
human knowledge with inherent reasoning capabilities to detect bias reliably.
Extensive experiments demonstrate that BiasAlert significantly outperforms
existing state-of-the-art methods like GPT4-as-A-Judge in detecting bias.
Furthermore, through application studies, we demonstrate the utility of
BiasAlert in reliable LLM bias evaluation and bias mitigation across various
scenarios. Model and code will be publicly released.
[LINK]
http://arxiv.org/abs/2407.10241v2
[DATE]
2024-07-20 23:59:46+08:00
[CATEGORIES]
cs.CL
Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment
[AUTHORS]
Yongxin Huang, Kexin Wang, Goran Glavaš, Iryna Gurevych
[ABSTRACT]
Multilingual sentence encoders are commonly obtained by training multilingual
language models to map sentences from different languages into a shared
semantic space. As such, they are subject to the curse of multilinguality, a loss
of monolingual representational accuracy due to parameter sharing. Another
limitation of multilingual sentence encoders is the trade-off between
monolingual and cross-lingual performance. Training for cross-lingual alignment
of sentence embeddings distorts the optimal monolingual structure of semantic
spaces of individual languages, harming the utility of sentence embeddings in
monolingual tasks. In this work, we address both issues by modular training of
sentence encoders, i.e., by separating monolingual specialization from
cross-lingual alignment. We first efficiently train language-specific sentence
encoders to avoid negative interference between languages (i.e., the curse). We
then align all non-English monolingual encoders to the English encoder by
training a cross-lingual alignment adapter on top of each, preventing
interference with monolingual specialization from the first step. In both
steps, we resort to contrastive learning on machine-translated paraphrase data.
Monolingual and cross-lingual evaluations on semantic text
similarity/relatedness and multiple-choice QA render our modular solution more
effective than multilingual sentence encoders, especially benefiting
low-resource languages.
[LINK]
http://arxiv.org/abs/2407.14878v1
[DATE]
2024-07-20 21:56:39+08:00
[CATEGORIES]
cs.CL
Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models
[AUTHORS]
Ze Yu Zhang, Arun Verma, Finale Doshi-Velez, Bryan Kian Hsiang Low
[ABSTRACT]
Large language models (LLMs) are widely used in decision-making, but their
reliability, especially in critical tasks like healthcare, is not
well-established. Therefore, understanding how LLMs reason and make decisions
is crucial for their safe deployment. This paper investigates how the
uncertainty of responses generated by LLMs relates to the information provided
in the input prompt. Leveraging the insight that LLMs learn to infer latent
concepts during pretraining, we propose a prompt-response concept model that
explains how LLMs generate responses and helps understand the relationship
between prompts and response uncertainty. We show that the uncertainty
decreases as the prompt’s informativeness increases, similar to epistemic
uncertainty. Our detailed experimental results on real datasets validate our
proposed model.
[COMMENTS]
27 pages, 11 figures
[LINK]
http://arxiv.org/abs/2407.14845v1
[DATE]
2024-07-20 19:19:58+08:00
[CATEGORIES]
cs.LG
cs.CL
Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages
[AUTHORS]
Zhuoyuan Mao, Yen Yu
[ABSTRACT]
This article introduces contrastive alignment instructions (AlignInstruct) to
address two challenges in machine translation (MT) on large language models
(LLMs). One is the expansion of supported languages to previously unseen ones.
The second relates to the lack of data in low-resource languages. Model
fine-tuning through MT instructions (MTInstruct) is a straightforward approach
to the first challenge. However, MTInstruct is limited by weak cross-lingual
signals inherent in the second challenge. AlignInstruct emphasizes
cross-lingual supervision via a cross-lingual discriminator built using
statistical word alignments. Our results based on fine-tuning the BLOOMZ models
(1b1, 3b, and 7b1) in up to 24 unseen languages showed that: (1) LLMs can
effectively translate unseen languages using MTInstruct; (2) AlignInstruct led
to consistent improvements in translation quality across 48 translation
directions involving English; (3) Discriminator-based instructions outperformed
their generative counterparts as cross-lingual instructions; (4) AlignInstruct
improved performance in 30 zero-shot directions.
[COMMENTS]
Accepted to LoResMT 2024
[LINK]
http://arxiv.org/abs/2401.05811v2
[DATE]
2024-07-20 19:13:38+08:00
[CATEGORIES]
cs.CL
Text Style Transfer: An Introductory Overview
[AUTHORS]
Sourabrata Mukherjee, Ondrej Dušek
[ABSTRACT]
Text Style Transfer (TST) is a pivotal task in natural language generation to
manipulate text style attributes while preserving style-independent content.
The attributes targeted in TST can vary widely, including politeness,
authorship, mitigation of offensive language, modification of feelings, and
adjustment of text formality. TST has become a widely researched topic with
substantial advancements in recent years. This paper provides an introductory
overview of TST, addressing its challenges, existing approaches, datasets,
evaluation measures, subtasks, and applications. This fundamental overview
improves understanding of the background and fundamentals of text style
transfer.
[COMMENTS]
Accepted at 4EU+ International Workshop on Recent Advancements in
Artificial Intelligence
[LINK]
http://arxiv.org/abs/2407.14822v1
[DATE]
2024-07-20 17:54:55+08:00
[CATEGORIES]
cs.CL
Korean Aspect-Based Sentiment Analysis via Implicit-Feature Alignment with Corpus Filtering
[AUTHORS]
Kibeom Nam
[COMMENTS]
13 pages, EMNLP Industry Track (submitted), DMLR@ICML 2024
[LINK]
http://arxiv.org/abs/2407.00342v3
[DATE]
2024-07-20 17:32:01+08:00
[CATEGORIES]
cs.CL
Can MLLMs Perform Text-to-Image In-Context Learning?
[AUTHORS]
Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee
[ABSTRACT]
The evolution from Large Language Models (LLMs) to Multimodal Large Language
Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to
its multimodal counterpart. Existing such studies have primarily concentrated
on image-to-text ICL. However, the Text-to-Image ICL (T2I-ICL), with its unique
characteristics and potential applications, remains underexplored. To address
this gap, we formally define the task of T2I-ICL and present CoBSAT, the first
T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to
benchmark six state-of-the-art MLLMs, we uncover considerable difficulties
MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the
inherent complexity of multimodality and image generation, and show that
strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate
these difficulties, leading to notable improvements in performance. Our code
and dataset are available at https://github.com/UW-Madison-Lee-Lab/CoBSAT.
[COMMENTS]
Accepted at COLM 2024
[LINK]
http://arxiv.org/abs/2402.01293v3
[DATE]
2024-07-20 15:52:29+08:00
[CATEGORIES]
cs.LG
cs.CL
Automatic Real-word Error Correction in Persian Text
[AUTHORS]
Seyed Mohammad Sadegh Dashti, Amid Khatibi Bardsiri, Mehdi Jafari Shahbazzadeh
[ABSTRACT]
Automatic spelling correction stands as a pivotal challenge within the ambit
of natural language processing (NLP), demanding nuanced solutions. Traditional
spelling correction techniques are typically only capable of detecting and
correcting non-word errors, such as typos and misspellings. However,
context-sensitive errors, also known as real-word errors, are more challenging
to detect because they are valid words that are used incorrectly in a given
context. The Persian language, characterized by its rich morphology and complex
syntax, presents formidable challenges to automatic spelling correction
systems. Furthermore, the limited availability of Persian language resources
makes it difficult to train effective spelling correction models. This paper
introduces a cutting-edge approach for precise and efficient real-word error
correction in Persian text. Our methodology adopts a structured, multi-tiered
approach, employing semantic analysis, feature selection, and advanced
classifiers to enhance error detection and correction efficacy. The innovative
architecture discovers and stores semantic similarities between words and
phrases in Persian text. The classifiers accurately identify real-word errors,
while the semantic ranking algorithm determines the most probable corrections
for real-word errors, taking into account specific spelling correction and
context properties such as context, semantic similarity, and edit-distance
measures. Evaluations have demonstrated that our proposed method surpasses
previous Persian real-word error correction models. Our method achieves an
impressive F-measure of 96.6% in the detection phase and an accuracy of 99.1%
in the correction phase. These results clearly indicate that our approach is a
highly promising solution for automatic real-word error correction in Persian
text.
[COMMENTS]
Neural Comput & Applic (2024)
[LINK]
http://arxiv.org/abs/2407.14795v1
[DATE]
2024-07-20 15:50:52+08:00
[CATEGORIES]
cs.CL
On the Design and Analysis of LLM-Based Algorithms
[AUTHORS]
Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou
[ABSTRACT]
We initiate a formal investigation into the design and analysis of LLM-based
algorithms, i.e., algorithms that contain one or multiple calls of large
language models (LLMs) as sub-routines and critically rely on the capabilities
of LLMs. While LLM-based algorithms, ranging from basic LLM calls with prompt
engineering to complicated LLM-powered agent systems and compound AI systems,
have achieved remarkable empirical success, the design and optimization of them
have mostly relied on heuristics and trial and error, which is largely due to
a lack of formal and analytical study for these algorithms. To fill this gap,
we start by identifying the computational-graph representation of LLM-based
algorithms, the design principle of task decomposition, and some key
abstractions, which then facilitate our formal analysis for the accuracy and
efficiency of LLM-based algorithms, despite the black-box nature of LLMs. We
further consider parallel decomposition for a case study, providing extensive
analytical and empirical study for four concrete examples of this pattern. Our
proposed framework holds promise for advancing LLM-based algorithms, by
revealing the reasons behind curious empirical phenomena, guiding the choices
of hyperparameters, predicting the empirical performance of algorithms, and
inspiring new algorithm design. To promote further study of LLM-based
algorithms, we release our source code at
https://github.com/modelscope/agentscope/tree/main/examples/paper_llm_based_algorithm.
[LINK]
http://arxiv.org/abs/2407.14788v1
[DATE]
2024-07-20 15:39:07+08:00
[CATEGORIES]
cs.LG
cs.CL
Learning Rate Curriculum
[AUTHORS]
Florinel-Alin Croitoru, Nicolae-Catalin Ristea, Radu Tudor Ionescu, Nicu Sebe
[ABSTRACT]
Most curriculum learning methods require an approach to sort the data samples
by difficulty, which is often cumbersome to perform. In this work, we propose a
novel curriculum learning approach termed Learning Rate Curriculum (LeRaC),
which leverages the use of a different learning rate for each layer of a neural
network to create a data-agnostic curriculum during the initial training
epochs. More specifically, LeRaC assigns higher learning rates to neural layers
closer to the input, gradually decreasing the learning rates as the layers are
placed farther away from the input. The learning rates increase at various
paces during the first training iterations, until they all reach the same
value. From this point on, the neural model is trained as usual. This creates a
model-level curriculum learning strategy that does not require sorting the
examples by difficulty and is compatible with any neural network, generating
higher performance levels regardless of the architecture. We conduct
comprehensive experiments on 12 data sets from the computer vision (CIFAR-10,
CIFAR-100, Tiny ImageNet, ImageNet-200, Food-101, UTKFace, PASCAL VOC),
language (BoolQ, QNLI, RTE) and audio (ESC-50, CREMA-D) domains, considering
various convolutional (ResNet-18, Wide-ResNet-50, DenseNet-121, YOLOv5),
recurrent (LSTM) and transformer (CvT, BERT, SepTr) architectures. We compare
our approach with the conventional training regime, as well as with Curriculum
by Smoothing (CBS), a state-of-the-art data-agnostic curriculum learning
approach. Unlike CBS, our performance improvements over the standard training
regime are consistent across all data sets and models. Furthermore, we
significantly surpass CBS in terms of training time (there is no additional
cost over the standard training regime for LeRaC). Our code is freely available
at: https://github.com/CroitoruAlin/LeRaC.
[COMMENTS]
Accepted at the International Journal of Computer Vision
[LINK]
http://arxiv.org/abs/2205.09180v4
[DATE]
2024-07-20 15:29:22+08:00
[CATEGORIES]
cs.LG
cs.CL
I Need Help! Evaluating LLM’s Ability to Ask for Users’ Support: A Case Study on Text-to-SQL Generation
[AUTHORS]
Cheng-Kuang Wu, Zhi Rui Tam, Chao-Chung Wu, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen
[ABSTRACT]
In this study, we explore the proactive ability of LLMs to seek user support,
using text-to-SQL generation as a case study. We propose metrics to evaluate
the trade-off between performance improvements and user burden, and investigate
whether LLMs can determine when to request help and examine their performance
with varying levels of information availability. Our experiments reveal that
without external feedback, many LLMs struggle to recognize their need for
additional support. Our findings highlight the importance of external signals
and provide insights for future research on improving support-seeking
strategies.
[COMMENTS]
9 pages, 9 figures
[LINK]
http://arxiv.org/abs/2407.14767v1
[DATE]
2024-07-20 14:12:29+08:00
[CATEGORIES]
cs.CL
From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems
[AUTHORS]
Jianliang He, Siyu Chen, Fengzhuo Zhang, Zhuoran Yang
[ABSTRACT]
In this work, from a theoretical lens, we aim to understand why large
language model (LLM) empowered agents are able to solve decision-making
problems in the physical world. To this end, consider a hierarchical
reinforcement learning (RL) model where the LLM Planner and the Actor perform
high-level task planning and low-level execution, respectively. Under this
model, the LLM Planner navigates a partially observable Markov decision process
(POMDP) by iteratively generating language-based subgoals via prompting. Under
proper assumptions on the pretraining data, we prove that the pretrained LLM
Planner effectively performs Bayesian aggregated imitation learning (BAIL)
through in-context learning. Additionally, we highlight the necessity for
exploration beyond the subgoals derived from BAIL by proving that naively
executing the subgoals returned by LLM leads to a linear regret. As a remedy,
we introduce an $\epsilon$-greedy exploration strategy to BAIL, which is proven
to incur sublinear regret when the pretraining error is small. Finally, we
extend our theoretical framework to include scenarios where the LLM Planner
serves as a world model for inferring the transition model of the environment
and to multi-agent settings, enabling coordination among multiple Actors.
[COMMENTS]
47 pages, accepted by ICML 2024
[LINK]
http://arxiv.org/abs/2405.19883v2
[DATE]
2024-07-20 14:00:22+08:00
[CATEGORIES]
cs.LG
cs.CL
Gaussian Stochastic Weight Averaging for Bayesian Low-Rank Adaptation of Large Language Models
[AUTHORS]
Emre Onal, Klemens Flöge, Emma Caldwell, Arsen Sheverdin, Vincent Fortuin
[ABSTRACT]
Fine-tuned Large Language Models (LLMs) often suffer from overconfidence and
poor calibration, particularly when fine-tuned on small datasets. To address
these challenges, we propose a simple combination of Low-Rank Adaptation (LoRA)
with Gaussian Stochastic Weight Averaging (SWAG), facilitating approximate
Bayesian inference in LLMs. Through extensive testing across several Natural
Language Processing (NLP) benchmarks, we demonstrate that our straightforward
and computationally efficient approach improves model generalization and
calibration competitively with comparable, more sophisticated methods for
Bayesian inference in LLMs. We further show that our method exhibits greater
robustness against distribution shift, as reflected in its improved performance
on out-of-distribution tasks.
[COMMENTS]
14 pages, 1 figure, 2 tables
[LINK]
http://arxiv.org/abs/2405.03425v2
[DATE]
2024-07-20 12:36:27+08:00
[CATEGORIES]
cs.CL
Long-Term Ad Memorability: Understanding & Generating Memorable Ads
[AUTHORS]
Harini S I, Somesh Singh, Yaman K Singla, Aanisha Bhattacharyya, Veeky Baths, Changyou Chen, Rajiv Ratn Shah, Balaji Krishnamurthy
[ABSTRACT]
Marketers spend billions of dollars on advertisements, but to what end? At
purchase time, if customers cannot recognize the brand for which they saw an
ad, the money spent on the ad is essentially wasted. Despite its importance in
marketing, until now, there has been no large-scale study on the memorability
of ads. All previous memorability studies have been conducted on short-term
recall on specific content types like action videos. On the other hand, the
advertising industry only cares about long-term memorability, and ads are
almost always highly multimodal. Therefore, we release the first memorability
dataset, LAMBDA, consisting of 1749 participants and 2205 ads covering 276
brands. Running statistical tests over different participant subpopulations and
ad types, we find many interesting insights into what makes an ad memorable,
e.g., fast-moving ads are more memorable than those with slower scenes; people
who use ad-blockers remember a lower number of ads than those who don’t. Next,
we present a model, Henry, to predict the memorability of a content. Henry
achieves state-of-the-art performance across all prominent literature
memorability datasets. It shows strong generalization performance with better
results in 0-shot on unseen datasets. Finally, with the intent of memorable ad
generation, we present a scalable method to build a high-quality memorable ad
generation model by leveraging automatically annotated data. Our approach, SEED
(Self rEwarding mEmorability Modeling), starts with a language model trained on
LAMBDA as seed data and progressively trains an LLM to generate more memorable
ads. We show that the generated advertisements have 44% higher memorability
scores than the original ads. We release this large-scale ad dataset,
UltraLAMBDA, consisting of 5 million ads. Our code and datasets are available
at https://behavior-in-the-wild.github.io/memorability.
[LINK]
http://arxiv.org/abs/2309.00378v4
[DATE]
2024-07-20 12:23:44+08:00
[CATEGORIES]
cs.CL
OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser
[AUTHORS]
Jingze Shi, Ting Xie, Bingheng Wu, Chunjun Zheng, Kai Wang
[ABSTRACT]
Recent research has shown that combining Mamba with Transformer architecture,
which has selective state space and quadratic self-attention mechanism,
outperforms using Mamba or Transformer architecture alone in language modeling
tasks. The quadratic self-attention mechanism effectively alleviates the
shortcomings of selective state space in handling long-term dependencies of any
element in the sequence. We propose a position information injection method
that connects the selective state space model with the quadratic attention, and
integrates these two architectures with hybrid experts with cross-sharing
domains, so that we can enjoy the advantages of both. We design a new
architecture with a more biomimetic idea: Observer-Thinker-Conceiver-Expresser
(OTCE), which can compete with well-known medium-scale open-source language
models on a small scale in language modeling tasks.
[LINK]
http://arxiv.org/abs/2406.16495v3
[DATE]
2024-07-20 11:35:45+08:00
[CATEGORIES]
cs.CL
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
[AUTHORS]
Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo
[ABSTRACT]
Recent progress in large-scale pre-training has led to the development of
advanced vision-language models (VLMs) with remarkable proficiency in
comprehending and generating multimodal content. Despite the impressive ability
of VLMs to perform complex reasoning, current models often struggle to
effectively and precisely capture the compositional information on both the
image and text sides. To address this, we propose FineMatch, a new aspect-based
fine-grained text and image matching benchmark, focusing on text and image
mismatch detection and correction. This benchmark introduces a novel task for
boosting and evaluating the VLMs’ compositionality for aspect-based
fine-grained text and image matching. In this task, models are required to
identify mismatched aspect phrases within a caption, determine the aspect’s
class, and propose corrections for an image-text pair that may contain between
0 and 3 mismatches. To evaluate the models’ performance on this new task, we
propose a new evaluation metric named ITM-IoU for which our experiments show a
high correlation to human evaluation. In addition, we also provide a
comprehensive experimental analysis of existing mainstream VLMs, including
fully supervised learning and in-context learning settings. We have found that
models trained on FineMatch demonstrate enhanced proficiency in detecting
fine-grained text and image mismatches. Moreover, models (e.g., GPT-4V, Gemini
Pro Vision) with strong abilities to perform multimodal in-context learning are
not as skilled at fine-grained compositional image and text matching analysis.
With FineMatch, we are able to build a system for text-to-image generation
hallucination detection and correction.
[COMMENTS]
ECCV 2024
[LINK]
http://arxiv.org/abs/2404.14715v2
[DATE]
2024-07-20 11:32:40+08:00
[CATEGORIES]
cs.CL
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
[AUTHORS]
Thong Nguyen, Yi Bin, Xiaobao Wu, Xinshuai Dong, Zhiyuan Hu, Khoi Le, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
[ABSTRACT]
Data quality stands at the forefront of deciding the effectiveness of
video-language representation learning. However, video-text pairs in previous
data typically do not align perfectly with each other, which might lead to
video-language representations that do not accurately reflect cross-modal
semantics. Moreover, previous data also possess an uneven distribution of
concepts, thereby hampering the downstream performance across unpopular
subjects. To address these problems, we propose a contrastive objective with a
subtractive angular margin to regularize cross-modal representations in their
effort to reach perfect similarity. Furthermore, to adapt to the non-uniform
concept distribution, we propose a multi-layer perceptron (MLP)-parameterized
weighting function that maps loss values to sample weights which enable dynamic
adjustment of the model’s focus throughout the training. With the training
guided by a small amount of unbiased meta-data and augmented by video-text data
generated by a large vision-language model, we improve video-language
representations and achieve superior performance on commonly used video
question answering and text-video retrieval datasets.
[COMMENTS]
Accepted to ECCV 2024
[LINK]
http://arxiv.org/abs/2407.03788v2
[DATE]
2024-07-20 11:15:26+08:00
[CATEGORIES]
cs.CL
Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL
[AUTHORS]
Yunseon Choi, Sangmin Bae, Seonghyun Ban, Minchan Jeong, Chuheng Zhang, Lei Song, Li Zhao, Jiang Bian, Kee-Eung Kim
[ABSTRACT]
With the advent of foundation models, prompt tuning has positioned itself as
an important technique for directing model behaviors and eliciting desired
responses. Prompt tuning involves selecting appropriate keywords to include in
the input, thereby adapting to the downstream task without adjusting or
fine-tuning the model parameters. There is a wide range of work in prompt
tuning, from approaches that directly harness the backpropagated gradient
signals from the model, to those employing black-box optimization such as
reinforcement learning (RL) methods. Our primary focus is on RLPrompt, which
aims to find optimal prompt tokens leveraging soft Q-learning. While the
results show promise, we have observed that the prompts frequently appear
unnatural, which impedes their interpretability. We address this limitation by
using sparse Tsallis entropy regularization, a principled approach to filtering
out unlikely tokens from consideration. We extensively evaluate our approach
across various tasks, including few-shot text classification, unsupervised text
style transfer, and textual inversion from images. The results indicate a
notable improvement over baselines, highlighting the efficacy of our approach
in addressing the challenges of prompt tuning. Moreover, we show that the
prompts discovered using our method are more natural and interpretable compared
to those from other baselines.
[LINK]
http://arxiv.org/abs/2407.14733v1
[DATE]
2024-07-20 11:10:19+08:00
[CATEGORIES]
cs.LG
cs.CL
MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts
[AUTHORS]
Dengchun Li, Yingzi Ma, Naizheng Wang, Zhengmao Ye, Zhiyuan Cheng, Yinghao Tang, Yan Zhang, Lei Duan, Jie Zuo, Cal Yang, Mingjie Tang
[ABSTRACT]
Fine-tuning Large Language Models (LLMs) is a common practice to adapt
pre-trained models for specific applications. While methods like LoRA have
effectively addressed GPU memory constraints during fine-tuning, their
performance often falls short, especially in multi-task scenarios. In contrast,
Mixture-of-Experts (MoE) models, such as Mixtral 8x7B, demonstrate remarkable
performance in multi-task learning scenarios while maintaining a reduced
parameter count. However, the resource requirements of these MoEs remain
challenging, particularly for consumer-grade GPUs with less than 24GB memory.
To tackle these challenges, we propose MixLoRA, an approach to construct a
resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple
LoRA-based experts within the feed-forward network block of a frozen
pre-trained dense model and employs a commonly used top-k router. Unlike other
LoRA-based MoE methods, MixLoRA enhances model performance by utilizing
independent attention-layer LoRA adapters. Additionally, an auxiliary load
balance loss is employed to address the imbalance problem of the router. Our
evaluations show that MixLoRA improves accuracy by about 9% compared to
state-of-the-art PEFT methods in multi-task learning scenarios. We also propose
a new high-throughput framework to alleviate the computation and memory
bottlenecks during the training and inference of MoE models. This framework
reduces GPU memory consumption by 40% and token computation latency by 30%
during both training and inference.
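
The top-k routing and load-balance idea described in this abstract can be sketched as follows. This is a minimal illustration, not the MixLoRA implementation: the function names are hypothetical, and the auxiliary loss shown is the Switch-Transformer-style form (number of experts times the sum over experts of routed fraction times mean router probability), which is an assumption because the abstract does not give the exact formula.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Pick the k most probable experts for one token.

    Returns a list of (expert_index, weight) pairs whose weights are
    renormalized to sum to 1. Hypothetical helper, not MixLoRA's API.
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def load_balance_loss(batch_logits, k):
    """Assumed Switch-style auxiliary loss over a batch of tokens.

    frac[i]      : share of routing slots assigned to expert i
    mean_prob[i] : average router probability of expert i
    The loss equals 1.0 when routing is perfectly uniform and grows
    as tokens concentrate on a few experts.
    """
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    frac = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in batch_logits:
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / n_tokens
        for i, _ in top_k_route(logits, k):
            frac[i] += 1.0 / (n_tokens * k)
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))


routes = top_k_route([2.0, 1.0, 0.0, -1.0], k=2)
```

In a LoRA-based MoE layer, each selected expert's low-rank update would be scaled by its routing weight and added to the frozen feed-forward output; that part is omitted here to keep the sketch short.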
[COMMENTS]
18 pages, 5 figures
[LINK]
http://arxiv.org/abs/2404.15159v3
[DATE]
2024-07-20 10:26:49+08:00
[CATEGORIES]
cs.CL
Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild
[AUTHORS]
Niloofar Mireshghallah, Maria Antoniak, Yash More, Yejin Choi, Golnoosh Farnadi
[ABSTRACT]
Measuring personal disclosures made in human-chatbot interactions can provide
a better understanding of users’ AI literacy and facilitate privacy research
for large language models (LLMs). We run an extensive, fine-grained analysis on
the personal disclosures made by real users to commercial GPT models,
investigating the leakage of personally identifiable and sensitive information.
To understand the contexts in which users disclose to chatbots, we develop a
taxonomy of tasks and sensitive topics, based on qualitative and quantitative
analysis of naturally occurring conversations. We discuss these potential
privacy harms and observe that: (1) personally identifiable information (PII)
appears in unexpected contexts such as in translation or code editing (48% and
16% of the time, respectively) and (2) PII detection alone is insufficient to
capture the sensitive topics that are common in human-chatbot interactions,
such as detailed sexual preferences or specific drug use habits. We believe
that these high disclosure rates are of significant importance for researchers
and data curators, and we call for the design of appropriate nudging mechanisms
to help users moderate their interactions.
[LINK]
http://arxiv.org/abs/2407.11438v2
[DATE]
2024-07-20 08:47:32+08:00
[CATEGORIES]
cs.CL
Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption
[AUTHORS]
Chenlu Ye, Jiafan He, Quanquan Gu, Tong Zhang
[ABSTRACT]
This study tackles the challenges of adversarial corruption in model-based
reinforcement learning (RL), where the transition dynamics can be corrupted by
an adversary. Existing studies on corruption-robust RL mostly focus on the
setting of model-free RL, where robust least-square regression is often
employed for value function estimation. However, these techniques cannot be
directly applied to model-based RL. In this paper, we focus on model-based RL
and take the maximum likelihood estimation (MLE) approach to learn the
transition model. Our work encompasses both online and offline settings. In the online
setting, we introduce an algorithm called corruption-robust optimistic MLE
(CR-OMLE), which leverages total-variation (TV)-based information ratios as
uncertainty weights for MLE. We prove that CR-OMLE achieves a regret of
$\tilde{\mathcal{O}}(\sqrt{T} + C)$, where $C$ denotes the cumulative
corruption level after $T$ episodes. We also prove a lower bound to show that
the additive dependence on $C$ is optimal. We extend our weighting technique to
the offline setting, and propose an algorithm named corruption-robust
pessimistic MLE (CR-PMLE). Under a uniform coverage condition, CR-PMLE exhibits
suboptimality worsened by $\mathcal{O}(C/n)$, nearly matching the lower bound.
To the best of our knowledge, this is the first work on corruption-robust
model-based RL algorithms with provable guarantees.
[LINK]
http://arxiv.org/abs/2402.08991v3
[DATE]
2024-07-20 23:23:25+08:00
[CATEGORIES]
cs.LG
Hyperspectral Unmixing Under Endmember Variability: A Variational Inference Framework
[AUTHORS]
Yuening Li, Xiao Fu, Junbin Liu, Wing-Kin Ma
[ABSTRACT]
This work proposes a variational inference (VI) framework for hyperspectral
unmixing in the presence of endmember variability (HU-EV). An EV-accounted
noisy linear mixture model (LMM) is considered, and the presence of outliers is
also incorporated into the model. Following the marginalized maximum likelihood
(MML) principle, a VI algorithmic structure is designed for probabilistic
inference for HU-EV. Specifically, a patch-wise static endmember assumption is
employed to exploit spatial smoothness and to try to overcome the ill-posed
nature of the HU-EV problem. The design facilitates lightweight, continuous
optimization-based updates under a variety of endmember priors. Some of the
priors, such as the Beta prior, were previously used under computationally
heavy, sampling-based probabilistic HU-EV methods. The effectiveness of the
proposed framework is demonstrated through synthetic, semi-real, and real-data
experiments.
[LINK]
http://arxiv.org/abs/2407.14899v1
[DATE]
2024-07-20 23:16:14+08:00
[CATEGORIES]
cs.LG
Latent Pollution Model: The Hidden Carbon Footprint in 3D Image Synthesis
[AUTHORS]
Marvin Seyfarth, Salman Ul Hassan Dar, Sandy Engelhardt
[ABSTRACT]
Contemporary developments in generative AI are rapidly transforming the field
of medical AI. These developments have been predominantly driven by the
availability of large datasets and high computing power, which have facilitated
a significant increase in model capacity. Despite their considerable potential,
these models demand substantially high power, leading to high carbon dioxide
(CO2) emissions. Despite the harm such models are causing to the environment,
there has been little focus on the carbon footprints of such models. This study
analyzes carbon emissions from 2D and 3D latent diffusion models (LDMs) during
training and data generation phases, revealing a surprising finding: the
synthesis of large images contributes most significantly to these emissions. We
assess different scenarios including model sizes, image dimensions, distributed
training, and data generation steps. Our findings reveal substantial carbon
emissions from these models, with training 2D and 3D models comparable to
driving a car for 10 km and 90 km, respectively. The process of data generation
is even more significant, with CO2 emissions equivalent to driving 160 km for
2D models and driving for up to 3345 km for 3D synthesis. Additionally, we
found that the location of the experiment can increase carbon emissions by up
to 94 times, and even the time of year can influence emissions by up to 50%.
These figures are alarming, considering they represent only a single training
and data generation phase for each model. Our results emphasize the urgent need
for developing environmentally sustainable strategies in generative AI.
[LINK]
http://arxiv.org/abs/2407.14892v1
[DATE]
2024-07-20 22:44:44+08:00
[CATEGORIES]
cs.LG
VITS : Variational Inference Thompson Sampling for contextual bandits
[AUTHORS]
Pierre Clavier, Tom Huix, Alain Durmus
[ABSTRACT]
In this paper, we introduce and analyze a variant of the Thompson sampling
(TS) algorithm for contextual bandits. At each round, traditional TS requires
samples from the current posterior distribution, which is usually intractable.
To circumvent this issue, approximate inference techniques can be used and
provide samples with distribution close to the posteriors. However, current
approximate techniques either yield poor estimation (Laplace approximation)
or are computationally expensive (MCMC methods, Ensemble sampling…). In
this paper, we propose a new algorithm, Variational Inference Thompson Sampling
(VITS), based on Gaussian Variational Inference. This scheme provides powerful
posterior approximations which are easy to sample from, and is computationally
efficient, making it an ideal choice for TS. In addition, we show that VITS
achieves a sub-linear regret bound of the same order in the dimension and
number of rounds as traditional TS for linear contextual bandits. Finally, we
demonstrate experimentally the effectiveness of VITS on both synthetic and
real-world datasets.
[LINK]
http://arxiv.org/abs/2307.10167v4
[DATE]
2024-07-20 22:38:26+08:00
[CATEGORIES]
cs.LG
Reduced Effectiveness of Kolmogorov-Arnold Networks on Functions with Noise
[AUTHORS]
Haoran Shen, Chen Zeng, Jiahui Wang, Qiao Wang
[ABSTRACT]
It has been observed that even a small amount of noise introduced into the
dataset can significantly degrade the performance of KAN. In this brief note,
we aim to quantitatively evaluate the performance when noise is added to the
dataset. We propose an oversampling technique combined with denoising to
alleviate the impact of noise. Specifically, we employ kernel filtering based
on diffusion maps to pre-filter the noisy data before training the KAN.
Our experiments show that when i.i.d. noise with any fixed SNR is added and we
increase the amount of training data by a factor of $r$, the test loss (RMSE)
of KANs exhibits a trend like $\text{test-loss} \sim
\mathcal{O}(r^{-\frac{1}{2}})$ as $r\to +\infty$. We conclude that applying
both oversampling and filtering strategies can reduce the detrimental effects
of noise. Nevertheless, determining the optimal variance for the kernel
filtering process is challenging, and enhancing the volume of training data
substantially increases the associated costs, because the training dataset
needs to be expanded multiple times in comparison to the initial clean data. As
a result, the noise present in the data ultimately diminishes the effectiveness
of Kolmogorov-Arnold networks.
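
As a rough illustration of what the $r^{-1/2}$ trend implies for data cost, assume the trend holds exactly with a hypothetical constant $c$ (the abstract states only the asymptotic order, so both the constant and the exactness are assumptions):

```latex
\text{test-loss}(r) \approx c\, r^{-1/2}
\quad\Longrightarrow\quad
\frac{\text{test-loss}(r)}{\text{test-loss}(1)} = r^{-1/2},
\qquad
r = 4 \;\Rightarrow\; \tfrac{1}{2},
\quad
r = 100 \;\Rightarrow\; \tfrac{1}{10}.
```

Under this assumption, halving the RMSE takes four times as much training data, and a tenfold reduction takes a hundred times as much.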
[LINK]
http://arxiv.org/abs/2407.14882v1
[DATE]
2024-07-20 22:17:10+08:00
[CATEGORIES]
cs.LG
Thompson Sampling Itself is Differentially Private
[AUTHORS]
Tingting Ou, Marco Avella Medina, Rachel Cummings
[ABSTRACT]
In this work we first show that the classical Thompson sampling algorithm for
multi-armed bandits is differentially private as-is, without any modification. We
provide per-round privacy guarantees as a function of problem parameters and
show composition over $T$ rounds; since the algorithm is unchanged, existing
$O(\sqrt{NT\log N})$ regret bounds still hold and there is no loss in
performance due to privacy. We then show that simple modifications – such as
pre-pulling all arms a fixed number of times or increasing the sampling variance
– can provide tighter privacy guarantees. We again provide privacy guarantees
that now depend on the new parameters introduced in the modification, which
allows the analyst to tune the privacy guarantee as desired. We also provide a
novel regret analysis for this new algorithm, and show how the new parameters
also impact expected regret. Finally, we empirically validate and illustrate
our theoretical findings in two parameter regimes and demonstrate that tuning
the new parameters substantially improves the privacy-regret tradeoff.
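
The two modifications mentioned in this abstract, pre-pulling every arm and inflating the sampling variance, can be sketched in a Gaussian Thompson sampling loop. This is an illustrative sketch only: the function name, the parameters `pre_pulls` and `variance_scale`, and the choice of sampling each arm's index from a normal distribution with variance `variance_scale / pulls` are assumptions, not the paper's exact algorithm or its privacy analysis.

```python
import random


def gaussian_thompson(reward_fns, horizon, pre_pulls=1, variance_scale=1.0, seed=0):
    """Gaussian Thompson sampling with two tunable knobs (hypothetical names).

    reward_fns     : list of callables taking a random.Random and returning a reward
    pre_pulls      : times each arm is pulled before sampling starts (>= 1)
    variance_scale : multiplier on the sampling variance of each arm's index
    Returns (history of pulled arms, per-arm pull counts).
    """
    n_arms = len(reward_fns)
    if pre_pulls < 1 or horizon < n_arms * pre_pulls:
        raise ValueError("need pre_pulls >= 1 and horizon >= n_arms * pre_pulls")
    rng = random.Random(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    history = []

    def pull(arm):
        reward = reward_fns[arm](rng)
        counts[arm] += 1
        sums[arm] += reward
        history.append(arm)

    # Modification 1: pull every arm a fixed number of times up front.
    for arm in range(n_arms):
        for _ in range(pre_pulls):
            pull(arm)

    # Modification 2: sample each arm's index with an inflated variance.
    while len(history) < horizon:
        samples = [
            rng.gauss(sums[a] / counts[a], (variance_scale / counts[a]) ** 0.5)
            for a in range(n_arms)
        ]
        pull(max(range(n_arms), key=lambda a: samples[a]))
    return history, counts
```

Setting `pre_pulls=1` and `variance_scale=1.0` gives a plain Gaussian Thompson sampling loop in this sketch; larger values add exploration, which is the kind of privacy-regret tradeoff the abstract describes.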
[COMMENTS]
Published at AISTATS 2023
[LINK]
http://arxiv.org/abs/2407.14879v1
[DATE]
2024-07-20 22:01:03+08:00
[CATEGORIES]
cs.LG
Adversarial Sparse Teacher: Defense Against Distillation-Based Model Stealing Attacks Using Adversarial Examples
[AUTHORS]
Eda Yilmaz, Hacer Yalim Keles
[ABSTRACT]
We introduce Adversarial Sparse Teacher (AST), a robust defense method
against distillation-based model stealing attacks. Our approach trains a
teacher model using adversarial examples to produce sparse logit responses and
increase the entropy of the output distribution. Typically, a model generates a
peak in its output corresponding to its prediction. By leveraging adversarial
examples, AST modifies the teacher model’s original response, embedding a few
altered logits into the output while keeping the primary response slightly
higher. Concurrently, all remaining logits are elevated to further increase the
output distribution’s entropy. All these complex manipulations are performed
using an optimization function with our proposed Exponential Predictive
Divergence (EPD) loss function. EPD allows us to maintain higher entropy levels
compared to traditional KL divergence, effectively confusing attackers.
Experiments on CIFAR-10 and CIFAR-100 datasets demonstrate that AST outperforms
state-of-the-art methods, providing effective defense against model stealing
while preserving high accuracy. The source code will be made publicly
available soon.
[COMMENTS]
14 pages, 3 figures, 11 tables
[LINK]
http://arxiv.org/abs/2403.05181v2
[DATE]
2024-07-20 21:46:07+08:00
[CATEGORIES]
cs.LG
Bayesian identification of nonseparable Hamiltonians with multiplicative noise using deep learning and reduced-order modeling
[AUTHORS]
Nicholas Galioto, Harsh Sharma, Boris Kramer, Alex Arkady Gorodetsky
[ABSTRACT]
This paper presents a structure-preserving Bayesian approach for learning
nonseparable Hamiltonian systems using stochastic dynamic models allowing for
statistically-dependent, vector-valued additive and multiplicative measurement
noise. The approach is comprised of three main facets. First, we derive a
Gaussian filter for a statistically-dependent, vector-valued, additive and
multiplicative noise model that is needed to evaluate the likelihood within the
Bayesian posterior. Second, we develop a novel algorithm for cost-effective
application of Bayesian system identification to high-dimensional systems.
Third, we demonstrate how structure-preserving methods can be incorporated into
the proposed framework, using nonseparable Hamiltonians as an illustrative
system class. We assess the method’s performance based on the forecasting
accuracy of a model estimated from single-trajectory data. We compare the
Bayesian method to a state-of-the-art machine learning method on a canonical
nonseparable Hamiltonian model and a chaotic double pendulum model with small,
noisy training datasets. The results show that using the Bayesian posterior as
a training objective can yield upwards of 724 times improvement in Hamiltonian
mean squared error using training data with up to 10% multiplicative noise
compared to a standard training objective. Lastly, we demonstrate the utility
of the novel algorithm for parameter estimation of a 64-dimensional model of
the spatially-discretized nonlinear Schrödinger equation with data corrupted
by up to 20% multiplicative noise.
[LINK]
http://arxiv.org/abs/2401.12476v3
[DATE]
2024-07-20 21:04:54+08:00
[CATEGORIES]
cs.LG
Improving Bias Correction Standards by Quantifying its Effects on Treatment Outcomes
[AUTHORS]
Alexandre Abraham, Andrés Hoyos Idrobo
[ABSTRACT]
With the growing access to administrative health databases, retrospective
studies have become crucial evidence for medical treatments. Yet,
non-randomized studies frequently face selection biases, requiring mitigation
strategies. Propensity score matching (PSM) addresses these biases by selecting
comparable populations, allowing for analysis without further methodological
constraints. However, PSM has several drawbacks. Different matching methods can
produce significantly different Average Treatment Effects (ATE) for the same
task, even when meeting all validation criteria. To prevent cherry-picking the
best method, public authorities must involve field experts and engage in
extensive discussions with researchers.
To address this issue, we introduce a novel metric, A2A, to reduce the number
of valid matches. A2A constructs artificial matching tasks that mirror the
original ones but with known outcomes, assessing each matching method’s
performance comprehensively from propensity estimation to ATE estimation. When
combined with Standardized Mean Difference, A2A enhances the precision of model
selection, resulting in a reduction of up to 50% in ATE estimation errors
across synthetic tasks and up to 90% in predicted ATE variability across both
synthetic and real-world datasets. To our knowledge, A2A is the first metric
capable of evaluating outcome correction accuracy using covariates not involved
in selection.
Computing A2A requires solving hundreds of PSMs; we therefore automate all
manual steps of the PSM pipeline. We integrate PSM methods from Python and R,
our automated pipeline, a new metric, and reproducible experiments into
popmatch, our new Python package, to enhance reproducibility and accessibility
to bias correction methods.
[COMMENTS]
ECML PKDD 2024, 18 pages, 2 figures, 5 tables
[LINK]
http://arxiv.org/abs/2407.14861v1
[DATE]
2024-07-20 20:42:24+08:00
[CATEGORIES]
cs.LG
Enhancing High-Energy Particle Physics Collision Analysis through Graph Data Attribution Techniques
[AUTHORS]
A. Verdone, A. Devoto, C. Sebastiani, J. Carmignani, M. D’Onofrio, S. Giagu, S. Scardapane, M. Panella
[ABSTRACT]
The experiments at the Large Hadron Collider at CERN generate vast amounts of
complex data from high-energy particle collisions. This data presents
significant challenges due to its volume and complex reconstruction,
necessitating the use of advanced analysis techniques. Recent
advancements in deep learning, particularly Graph Neural Networks, have shown
promising results in addressing the challenges but remain computationally
expensive. The study presented in this paper uses a simulated particle
collision dataset to integrate influence analysis inside the graph
classification pipeline aiming at improving the accuracy and efficiency of
collision event prediction tasks. By using a Graph Neural Network for initial
training, we applied a gradient-based data influence method to identify
influential training samples and then we refined the dataset by removing
non-contributory elements: the model trained on this new reduced dataset can
achieve good performance at a reduced computational cost. The method is
completely agnostic to the specific influence method: different influence
modalities can be easily integrated into our methodology. Moreover, by
analyzing the discarded elements we can provide further insights about the
event classification task. The novelty of integrating data attribution
techniques together with Graph Neural Networks in high-energy physics tasks can
offer a robust solution for managing large-scale data problems, capturing
critical patterns, and maximizing accuracy across several high-data demand
domains.
[COMMENTS]
10 pages, 6 figures, 2 tables
[LINK]
http://arxiv.org/abs/2407.14859v1
[DATE]
2024-07-20 20:40:03+08:00
[CATEGORIES]
cs.LG
Diff4VS: HIV-inhibiting Molecules Generation with Classifier Guidance Diffusion for Virtual Screening
[AUTHORS]
Jiaqing Lyu, Changjie Chen, Bing Liang, Yijia Zhang
[ABSTRACT]
The AIDS epidemic has killed 40 million people and caused serious global
problems. The identification of new HIV-inhibiting molecules is of great
importance for combating the AIDS epidemic. Here, the Classifier Guidance
Diffusion model and ligand-based virtual screening strategy are combined to
discover potential HIV-inhibiting molecules for the first time. We call it
Diff4VS. An extra classifier is trained using the HIV molecule dataset, and the
gradient of the classifier is used to guide the Diffusion to generate
HIV-inhibiting molecules. Experiments show that Diff4VS can generate more
candidate HIV-inhibiting molecules than other methods. Inspired by ligand-based
virtual screening, a new metric DrugIndex is proposed. The DrugIndex is the
ratio of the proportion of candidate drug molecules among the generated molecules
to the proportion of candidate drug molecules in the training set. DrugIndex
provides a new evaluation method for evolving molecular generative models from
a pharmaceutical perspective. Besides, we report a new phenomenon observed when
using molecule generation models for virtual screening. Compared to real
molecules, the generated molecules have a lower proportion that is highly
similar to known drug molecules. We call it Degradation in molecule generation.
Based on the data analysis, the Degradation may result from the difficulty of
generating molecules with a specific structure in the generative model. Our
research contributes to the application of generative models in drug design
from method, metric, and phenomenon analysis.
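
The DrugIndex ratio defined in this abstract can be written out directly. In the sketch below, the function name and the boolean-flag input format are assumptions made for illustration; how a molecule is judged to be a "candidate drug molecule" is left to whatever screening step produces the flags.

```python
def drug_index(generated_flags, training_flags):
    """Ratio of candidate-drug proportions: generated set over training set.

    Each argument is a sequence of booleans, True when the molecule was
    flagged as a candidate drug molecule by some screening step
    (hypothetical input format).
    """
    if not generated_flags or not training_flags:
        raise ValueError("both molecule sets must be non-empty")
    generated_prop = sum(generated_flags) / len(generated_flags)
    training_prop = sum(training_flags) / len(training_flags)
    if training_prop == 0:
        raise ValueError("training set contains no candidate drug molecules")
    return generated_prop / training_prop


# 2 of 4 generated molecules are candidates versus 1 of 8 in the training set.
example = drug_index(
    [True, False, False, True],
    [True, False, False, False, False, False, False, False],
)
```

A value above 1 means the generator produces candidates at a higher rate than they occur in its training data; a value below 1 means a lower rate.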
[LINK]
http://arxiv.org/abs/2407.15880v1
[DATE]
2024-07-20 20:34:02+08:00
[CATEGORIES]
cs.LG
Integrated BIM and Machine Learning System for Circularity Prediction of Construction Demolition Waste
[AUTHORS]
Abdullahi Saka, Ridwan Taiwo, Nurudeen Saka, Benjamin Oluleye, Jamiu Dauda, Lukman Akanbi
[ABSTRACT]
Effective management of construction and demolition waste (C&DW) is crucial
for sustainable development, as the industry accounts for 40% of the waste
generated globally. The effectiveness of the C&DW management relies on the
proper quantification of C&DW to be generated. Despite demolition activities
having larger contributions to C&DW generation, extant studies have focused on
construction waste. The few extant studies on demolition are often from the
regional level perspective and provide no circularity insights. Thus, this
study advances demolition quantification via Variable Modelling (VM) with
Machine Learning (ML). The demolition dataset of 2280 projects was leveraged
for the ML modelling, with the XGBoost model emerging as the best (based on the
Copeland algorithm), achieving an $R^2$ of 0.9977 and a Mean Absolute Error of 5.0910
on the testing dataset. Through the integration of the ML model with Building
Information Modelling (BIM), the study developed a system for predicting
quantities of recyclable and landfill materials from building demolitions. This
provides detailed insights into the circularity of demolition waste and
facilitates better planning and management. The SHapley Additive exPlanations
(SHAP) method highlighted the implications of the features for demolition waste
circularity. The study contributes to empirical studies on pre-demolition
auditing at the project level and provides practical tools for implementation.
Its findings would benefit stakeholders in driving a circular economy in the
industry.
[COMMENTS]
30 pages, 19 figures
[LINK]
http://arxiv.org/abs/2407.14847v1
[DATE]
2024-07-20 19:32:46+08:00
[CATEGORIES]
cs.LG
SE(3)-bi-equivariant Transformers for Point Cloud Assembly
[AUTHORS]
Ziming Wang, Rebecka Jörnsten
[ABSTRACT]
Given a pair of point clouds, the goal of assembly is to recover a rigid
transformation that aligns one point cloud to the other. This task is
challenging because the point clouds may be non-overlapped, and they may have
arbitrary initial positions. To address these difficulties, we propose a
method, called SE(3)-bi-equivariant transformer (BITR), based on the
SE(3)-bi-equivariance prior of the task: it guarantees that when the inputs are
rigidly perturbed, the output will transform accordingly. Due to its
equivariance property, BITR can not only handle non-overlapped PCs, but also
guarantee robustness against initial positions. Specifically, BITR first
extracts features of the inputs using a novel $SE(3) \times SE(3)$-transformer,
and then projects the learned feature to group SE(3) as the output. Moreover,
we theoretically show that swap and scale equivariances can be incorporated
into BITR, thus it further guarantees stable performance under scaling and
swapping the inputs. We experimentally show the effectiveness of BITR in
practical tasks.
[COMMENTS]
A weaker assumption is used in Prop C.3
[LINK]
http://arxiv.org/abs/2407.09167v2
[DATE]
2024-07-20 19:04:40+08:00
[CATEGORIES]
cs.LG
Toward Efficient Convolutional Neural Networks With Structured Ternary Patterns
[AUTHORS]
Christos Kyrkou
[ABSTRACT]
High-efficiency deep learning (DL) models are necessary not only to
facilitate their use in devices with limited resources but also to reduce the
resources required for training. Convolutional neural networks (ConvNets)
typically exert severe demands on local device resources and this
conventionally limits their adoption within mobile and embedded platforms. This
brief presents work toward utilizing static convolutional filters generated
from the space of local binary patterns (LBPs) and Haar features to design
efficient ConvNet architectures. These are referred to as Structured Ternary
Patterns (STePs) and can be generated during network initialization in a
systematic way instead of having learnable weight parameters thus reducing the
total weight updates. The ternary values require significantly less storage and
with the appropriate low-level implementation, can also lead to inference
improvements. The proposed approach is validated using four image
classification datasets, demonstrating that common network backbones can be
made more efficient and provide competitive results. It is also demonstrated
that it is possible to generate completely custom STeP-based networks that
provide good trade-offs for on-device applications such as unmanned aerial
vehicle (UAV)-based aerial vehicle detection. The experimental results show
that the proposed method maintains high detection accuracy while reducing the
trainable parameters by 40-80%. This work motivates further research toward
good priors for non-learnable weights that can make DL architectures more
efficient without having to alter the network during or after training.
[COMMENTS]
Published in: IEEE Transactions on Neural Networks and Learning Systems
Code: https://github.com/ckyrkou/STeP_Models
ImageNet-16 Dataset: https://zenodo.org/records/8027520
[LINK]
http://arxiv.org/abs/2407.14831v1
[DATE]
2024-07-20 18:18:42+08:00
[CATEGORIES]
cs.LG
CrossDehaze: Scaling Up Image Dehazing with Cross-Data Vision Alignment and Augmentation
[AUTHORS]
Yukai Shi, Zhipeng Weng, Yupei Lin, Cidan Shi, Xiaojun Yang, Liang Lin
[COMMENTS]
A cross-dataset vision alignment and augmentation technology is
proposed to boost generalizable feature learning in the de-hazing task
[LINK]
http://arxiv.org/abs/2407.14823v1
[DATE]
2024-07-20 18:00:20+08:00
[CATEGORIES]
cs.LG
FMamba: Mamba based on Fast-attention for Multivariate Time-series Forecasting
[AUTHORS]
Shusen Ma, Yu Kang, Peng Bai, Yun-Bo Zhao
[ABSTRACT]
In multivariate time-series forecasting (MTSF), extracting the temporal
correlations of the input sequences is crucial. While popular Transformer-based
predictive models can perform well, their quadratic computational complexity
results in inefficiency and high overhead. The recently emerged Mamba, a
selective state space model, has shown promising results in many fields due to
its strong temporal feature extraction capabilities and linear computational
complexity. However, due to the unilateral nature of Mamba, channel-independent
predictive models based on Mamba cannot attend to the relationships among all
variables in the manner of Transformer-based models. To address this issue, we
combine fast-attention with Mamba to introduce a novel framework named FMamba
for MTSF. Technically, we first extract the temporal features of the input
variables through an embedding layer, then compute the dependencies among input
variables via the fast-attention module. Subsequently, we use Mamba to
selectively deal with the input features and further extract the temporal
dependencies of the variables through the multi-layer perceptron block
(MLP-block). Finally, FMamba obtains the predictive results through the
projector, a linear layer. Experimental results on eight public datasets
demonstrate that FMamba can achieve state-of-the-art performance while
maintaining low computational overhead.
[LINK]
http://arxiv.org/abs/2407.14814v1
[DATE]
2024-07-20 17:14:05+08:00
[CATEGORIES]
cs.LG
Understanding Matrix Function Normalizations in Covariance Pooling through the Lens of Riemannian Geometry
[AUTHORS]
Ziheng Chen, Yue Song, Xiao-Jun Wu, Gaowen Liu, Nicu Sebe
[ABSTRACT]
Global Covariance Pooling (GCP) has been demonstrated to improve the
performance of Deep Neural Networks (DNNs) by exploiting second-order
statistics of high-level representations. GCP typically performs classification
of the covariance matrices by applying matrix function normalization, such as
matrix logarithm or power, followed by a Euclidean classifier. However,
covariance matrices inherently lie in a Riemannian manifold, known as the
Symmetric Positive Definite (SPD) manifold. The current literature does not
provide a satisfactory explanation of why Euclidean classifiers can be applied
directly to Riemannian features after the normalization of the matrix power. To
mitigate this gap, this paper provides a comprehensive and unified
understanding of the matrix logarithm and power from a Riemannian geometry
perspective. The underlying mechanism of matrix functions in GCP is interpreted
from two perspectives: one based on tangent classifiers (Euclidean classifiers
on the tangent space) and the other based on Riemannian classifiers. Via
theoretical analysis and empirical validation through extensive experiments on
fine-grained and large-scale visual classification datasets, we conclude that
the working mechanism of the matrix functions should be attributed to the
Riemannian classifiers they implicitly respect.
[COMMENTS]
24 pages, 3 figures
[LINK]
http://arxiv.org/abs/2407.10484v2
[DATE]
2024-07-20 16:11:10+08:00
[CATEGORIES]
cs.LG
Perturb-and-Project: Differentially Private Similarities and Marginals
[AUTHORS]
Vincent Cohen-Addad, Tommaso d’Orsi, Alessandro Epasto, Vahab Mirrokni, Peilin Zhong
[ABSTRACT]
We revisit the input perturbations framework for differential privacy where
noise is added to the input $A\in \mathcal{S}$ and the result is then projected
back to the space of admissible datasets $\mathcal{S}$. Through this framework,
we first design novel efficient algorithms to privately release pair-wise
cosine similarities. Second, we derive a novel algorithm to compute $k$-way
marginal queries over $n$ features. Prior work could achieve comparable
guarantees only for $k$ even. Furthermore, we extend our results to $t$-sparse
datasets, where our efficient algorithms yield novel, stronger guarantees
whenever $t\le n^{5/6}/\log n$. Finally, we provide a theoretical perspective
on why \textit{fast} input perturbation algorithms work well in practice. The
key technical ingredients behind our results are tight sum-of-squares
certificates upper bounding the Gaussian complexity of sets of solutions.
[COMMENTS]
21 pages, ICML 2024
[LINK]
http://arxiv.org/abs/2406.04868v2
[DATE]
2024-07-20 16:03:37+08:00
[CATEGORIES]
cs.LG
Latent Conditional Diffusion-based Data Augmentation for Continuous-Time Dynamic Graph Model
[AUTHORS]
Yuxing Tian, Yiyan Qi, Aiwen Jiang, Qi Huang, Jian Guo
[ABSTRACT]
Continuous-Time Dynamic Graph (CTDG) precisely models evolving real-world
relationships, drawing heightened interest in dynamic graph learning across
academia and industry. However, existing CTDG models encounter challenges
stemming from noise and limited historical data. Graph Data Augmentation (GDA)
emerges as a critical solution, yet current approaches primarily focus on
static graphs and struggle to effectively address the dynamics inherent in
CTDGs. Moreover, these methods often demand substantial domain expertise for
parameter tuning and lack theoretical guarantees for augmentation efficacy. To
address these issues, we propose Conda, a novel latent diffusion-based GDA
method tailored for CTDGs. Conda features a sandwich-like architecture,
incorporating a Variational Auto-Encoder (VAE) and a conditional diffusion
model, aimed at generating enhanced historical neighbor embeddings for target
nodes. Unlike conventional diffusion models trained on entire graphs via
pre-training, Conda requires historical neighbor sequence embeddings of target
nodes for training, thus facilitating more targeted augmentation. We integrate
Conda into the CTDG model and adopt an alternating training strategy to
optimize performance. Extensive experimentation across six widely used
real-world datasets showcases the consistent performance improvement of our
approach, particularly in scenarios with limited historical data.
[COMMENTS]
Accepted by KDD 2024
[LINK]
http://arxiv.org/abs/2407.08500v2
[DATE]
2024-07-20 14:44:09+08:00
[CATEGORIES]
cs.LG
Teach Harder, Learn Poorer: Rethinking Hard Sample Distillation for GNN-to-MLP Knowledge Distillation
[AUTHORS]
Lirong Wu, Yunfan Liu, Haitao Lin, Yufei Huang, Stan Z. Li
[ABSTRACT]
To bridge the gaps between powerful Graph Neural Networks (GNNs) and
lightweight Multi-Layer Perceptrons (MLPs), GNN-to-MLP Knowledge Distillation
(KD) proposes to distill knowledge from a well-trained teacher GNN into a
student MLP. In this paper, we revisit the knowledge samples (nodes) in teacher
GNNs from the perspective of hardness, and identify that hard sample
distillation may be a major performance bottleneck of existing graph KD
algorithms. The GNN-to-MLP KD involves two different types of hardness, one
student-free knowledge hardness describing the inherent complexity of GNN
knowledge, and the other student-dependent distillation hardness describing the
difficulty of teacher-to-student distillation. However, most of the existing
work focuses on only one of these aspects or regards them as one thing. This
paper proposes a simple yet effective Hardness-aware GNN-to-MLP Distillation
(HGMD) framework, which decouples the two hardnesses and estimates them using a
non-parametric approach. Finally, two hardness-aware distillation schemes
(i.e., HGMD-weight and HGMD-mixup) are further proposed to distill
hardness-aware knowledge from teacher GNNs into the corresponding nodes of
student MLPs. As non-parametric distillation, HGMD does not involve any
additional learnable parameters beyond the student MLPs, but it still
outperforms most of the state-of-the-art competitors. HGMD-mixup improves over
the vanilla MLPs by 12.95% and outperforms its teacher GNNs by 2.48% averaged
over seven real-world datasets.
[LINK]
http://arxiv.org/abs/2407.14768v1
[DATE]
2024-07-20 14:13:00+08:00
[CATEGORIES]
cs.LG
Smooth Nash Equilibria: Algorithms and Complexity
[AUTHORS]
Constantinos Daskalakis, Noah Golowich, Nika Haghtalab, Abhishek Shetty
[ABSTRACT]
A fundamental shortcoming of the concept of Nash equilibrium is its
computational intractability: approximating Nash equilibria in normal-form
games is PPAD-hard. In this paper, inspired by the ideas of smoothed analysis,
we introduce a relaxed variant of Nash equilibrium called $\sigma$-smooth Nash
equilibrium, for a smoothness parameter $\sigma$. In a $\sigma$-smooth Nash
equilibrium, players only need to achieve utility at least as high as their
best deviation to a $\sigma$-smooth strategy, which is a distribution that does
not put too much mass (as parametrized by $\sigma$) on any fixed action. We
distinguish two variants of $\sigma$-smooth Nash equilibria: strong
$\sigma$-smooth Nash equilibria, in which players are required to play
$\sigma$-smooth strategies under equilibrium play, and weak $\sigma$-smooth
Nash equilibria, where there is no such requirement.
We show that both weak and strong $\sigma$-smooth Nash equilibria have
superior computational properties to Nash equilibria: when $\sigma$ as well as
an approximation parameter $\epsilon$ and the number of players are all
constants, there is a constant-time randomized algorithm to find a weak
$\epsilon$-approximate $\sigma$-smooth Nash equilibrium in normal-form games.
In the same parameter regime, there is a polynomial-time deterministic
algorithm to find a strong $\epsilon$-approximate $\sigma$-smooth Nash
equilibrium in a normal-form game. These results stand in contrast to the
optimal algorithm for computing $\epsilon$-approximate Nash equilibria, which
cannot run faster than quasipolynomial time. We complement our upper bounds
by showing that when either $\sigma$ or $\epsilon$ is an inverse polynomial,
finding a weak $\epsilon$-approximate $\sigma$-smooth Nash equilibrium becomes
computationally intractable.
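The notion of a $\sigma$-smooth strategy can be illustrated with a small sketch. It assumes that a distribution over $n$ actions counts as $\sigma$-smooth when no action receives probability above $1/(\sigma n)$; this cap is one common formalization of "not too much mass on any fixed action" and is an assumption here, as are the function names. Under that assumption, the best $\sigma$-smooth deviation places the maximum allowed mass on the highest-utility actions.

```python
import numpy as np

def is_sigma_smooth(p, sigma, tol=1e-12):
    """Check the assumed cap: every probability is at most 1 / (sigma * n)."""
    p = np.asarray(p, dtype=float)
    return bool(np.all(p <= 1.0 / (sigma * p.size) + tol))

def best_smooth_deviation_value(utilities, sigma):
    """Expected utility of the best sigma-smooth deviation (greedy fill)."""
    u = np.sort(np.asarray(utilities, dtype=float))[::-1]  # best actions first
    cap = 1.0 / (sigma * u.size)
    remaining, value = 1.0, 0.0
    for ui in u:
        mass = min(cap, remaining)
        value += mass * ui
        remaining -= mass
        if remaining <= 0:
            break
    return value
```

With $\sigma = 1$ the only admissible deviation is uniform play, and with $\sigma = 1/n$ the cap is 1, so the deviation reduces to an ordinary pure best response.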
[LINK]
http://arxiv.org/abs/2309.12226v2
[DATE]
2024-07-20 13:39:44+08:00
[CATEGORIES]
cs.LG
Differentiating Through Integer Linear Programs with Quadratic Regularization and Davis-Yin Splitting
[AUTHORS]
Daniel McKenzie, Samy Wu Fung, Howard Heaton
[ABSTRACT]
In many applications, a combinatorial problem must be repeatedly solved with
similar, but distinct parameters. Yet, the parameters $w$ are not directly
observed; only contextual data $d$ that correlates with $w$ is available. It is
tempting to use a neural network to predict $w$ given $d$. However, training
such a model requires reconciling the discrete nature of combinatorial
optimization with the gradient-based frameworks used to train neural networks.
We study the case where the problem in question is an Integer Linear Program
(ILP). We propose applying a three-operator splitting technique, also known as
Davis-Yin splitting (DYS), to the quadratically regularized continuous
relaxation of the ILP. We prove that the resulting scheme is compatible with
the recently introduced Jacobian-free backpropagation (JFB). Our experiments on
two representative ILPs, the shortest path problem and the knapsack problem,
demonstrate that this combination (DYS on the forward pass, JFB on the backward
pass) yields a scheme that scales more effectively to high-dimensional problems
than existing schemes. All code associated with this paper is available at
github.com/mines-opt-ml/fpo-dys.
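As a rough illustration of the forward pass only, the sketch below runs a Davis-Yin three-operator splitting iteration on a quadratically regularized linear program of the form $\min_x w^\top x + \tfrac{\gamma}{2}\|x\|^2$ subject to $Ax = b$, $x \ge 0$. The problem form, the assignment of the two constraints to the two projection steps, and all names and default values are assumptions made for this example; it is not the authors' implementation and omits the JFB backward pass.

```python
import numpy as np

def dys_regularized_lp(w, A, b, reg=0.1, step=1.0, iters=2000):
    """Approximately solve min w@x + (reg/2)*||x||^2  s.t.  A@x = b, x >= 0.

    Davis-Yin splitting with two projections (x >= 0 and A@x = b) and a
    gradient step on the smooth objective. Requires step < 2 / reg.
    """
    AAt_inv = np.linalg.inv(A @ A.T)  # assumes A has full row rank

    def project_affine(y):
        return y - A.T @ (AAt_inv @ (A @ y - b))

    z = np.zeros_like(w, dtype=float)
    for _ in range(iters):
        x_pos = np.maximum(z, 0.0)                 # projection onto x >= 0
        grad = w + reg * x_pos                     # gradient of the smooth term
        x_aff = project_affine(2.0 * x_pos - z - step * grad)
        z = z + x_aff - x_pos                      # fixed-point update
    return np.maximum(z, 0.0)
```

On a toy problem over the probability simplex with costs `w = [3, 1, 2]`, the iterate approaches the vertex that selects the cheapest coordinate.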
[LINK]
http://arxiv.org/abs/2301.13395v4
[DATE]
2024-07-20 11:49:12+08:00
[CATEGORIES]
cs.LG
Hyperdimensional Computing for Node Classification and Link Prediction
[AUTHORS]
Abhishek Dalvi, Vasant Honavar
[ABSTRACT]
We introduce a novel method for transductive learning on graphs using
hyperdimensional representations. The proposed approach encodes data samples
using random projections into a very high-dimensional space (hyperdimensional
or HD space for short). It obviates the need for expensive iterative training
of the sort required by deep learning methods. Specifically, we propose a
Hyperdimensional Graph Learning (HDGL) algorithm. HDGL leverages the
\emph{injectivity} property of node representations of a family of Graph Neural
Networks (GNNs) to map node features to the HD space and then uses HD operators
such as bundling and binding to aggregate information from the local
neighborhood of each node. The resulting latent node representations support
both node classification and link prediction tasks, unlike typical deep
learning methods, which often require separate models for these tasks.
We report results of experiments using widely used benchmark datasets which
demonstrate that, on the node classification task, HDGL is competitive with the
SOTA GNN methods with respect to accuracy, at substantially reduced
computational cost. Furthermore, HDGL is well-suited for class incremental
learning where the model has to learn to effectively discriminate between a
growing number of classes. Our experiments also show that the HD representation
constructed by HDGL supports link prediction at accuracies comparable to that
of DeepWalk and related methods, although it falls short of SOTA Graph Neural
Network (GNN) methods that rely on computationally expensive iterative
training. We conclude that HDGL offers a computationally efficient alternative
to graph neural networks for node classification, especially in settings that
call for class-incremental learning or in applications that demand high
accuracy models at significantly lower computational cost and learning time
than possible with the SOTA GNNs.
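The hyperdimensional operators mentioned above can be sketched generically as follows. This is a minimal illustration of random-projection encoding, binding, and bundling on bipolar hypervectors, not the HDGL algorithm itself; the sign-based encoding, the majority-vote bundling rule, and all function names are assumptions chosen for the example.

```python
import numpy as np

def encode(features, projection):
    """Map real-valued features to bipolar (+1/-1) hypervectors via a random projection."""
    return np.where(features @ projection >= 0, 1, -1)

def bind(a, b):
    """Binding as elementwise multiplication; self-inverse for bipolar vectors."""
    return a * b

def bundle(vectors):
    """Bundling as an elementwise majority vote (ties resolved to +1)."""
    return np.where(np.sum(vectors, axis=0) >= 0, 1, -1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A bundled vector stays similar to each of its constituents while remaining nearly orthogonal to unrelated hypervectors, which is what allows neighborhood information to be aggregated without iterative training.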
[LINK]
http://arxiv.org/abs/2402.17073v2
[DATE]
2024-07-20 11:46:13+08:00
[CATEGORIES]
cs.LG
PANDA: Expanded Width-Aware Message Passing Beyond Rewiring
[AUTHORS]
Jeongwhan Choi, Sumin Park, Hyowon Wi, Sung-Bae Cho, Noseong Park
[ABSTRACT]
Recent research in the field of graph neural networks (GNNs) has identified a
critical issue known as “over-squashing,” resulting from the bottleneck
phenomenon in graph structures, which impedes the propagation of long-range
information. Prior works have proposed a variety of graph rewiring concepts
that aim at optimizing the spatial or spectral properties of graphs to promote
the signal propagation. However, such approaches inevitably deteriorate the
original graph topology, which may lead to a distortion of information flow. To
address this, we introduce an expanded width-aware (PANDA) message passing, a
new message passing paradigm where nodes with high centrality, a potential
source of over-squashing, are selectively expanded in width to encapsulate the
growing influx of signals from distant nodes. Experimental results show that
our method outperforms existing rewiring methods, suggesting that selectively
expanding the hidden state of nodes can be a compelling alternative to graph
rewiring for addressing over-squashing.
[COMMENTS]
Accepted at ICML 2024
[LINK]
http://arxiv.org/abs/2406.03671v2
[DATE]
2024-07-20 11:44:32+08:00
[CATEGORIES]
cs.LG
Early Detection of Coffee Leaf Rust Through Convolutional Neural Networks Trained on Low-Resolution Images
[AUTHORS]
Angelly Cabrera, Kleanthis Avramidis, Shrikanth Narayanan
[ABSTRACT]
Coffee leaf rust, a foliar disease caused by the fungus Hemileia vastatrix,
poses a major threat to coffee production, especially in Central America.
Climate change further aggravates this issue, as it shortens the latency period
between initial infection and the emergence of visible symptoms in diseases
like leaf rust. Shortened latency periods can lead to more severe plant
epidemics and faster spread of diseases. There is, hence, an urgent need for
effective disease management strategies. To address these challenges, we
explore the potential of deep learning models for enhancing early disease
detection. However, deep learning models require extensive processing power and
large amounts of data for model training, resources that are typically scarce.
To overcome these barriers, we propose a preprocessing technique that involves
convolving training images with a high-pass filter to enhance lesion-leaf
contrast, significantly improving model efficacy in resource-limited
environments. This method and our model demonstrated strong performance,
achieving over 90% across all evaluation metrics, including precision, recall,
F1-score, and the Dice coefficient. Our experiments show that this approach
outperforms other methods, including two different image preprocessing
techniques and using unaltered, full-color images.
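The preprocessing idea of convolving an image with a high-pass filter can be sketched as below. The 3x3 kernel, the edge padding, and the function name are assumptions for illustration; the abstract does not specify which filter the authors use.

```python
import numpy as np

# Assumed 3x3 high-pass kernel (entries sum to zero); the paper's filter may differ.
HIGH_PASS = np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]], dtype=float)

def high_pass(image, kernel=HIGH_PASS):
    """Convolve a 2-D grayscale image with `kernel`, using edge padding."""
    kh, kw = kernel.shape
    padded = np.pad(image.astype(float),
                    ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    flipped = kernel[::-1, ::-1]  # flip the kernel for true convolution
    out = np.zeros(image.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out
```

Because the kernel sums to zero, uniform regions map to zero while sharp local changes, such as a lesion boundary against healthy leaf tissue, are amplified.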
[LINK]
http://arxiv.org/abs/2407.14737v1
[DATE]
2024-07-20 11:24:25+08:00
[CATEGORIES]
cs.LG
Bag of Tricks to Boost Adversarial Transferability
[AUTHORS]
Zeliang Zhang, Wei Yao, Xiaosen Wang
[ABSTRACT]
Deep neural networks are widely known to be vulnerable to adversarial
examples. However, vanilla adversarial examples generated under the white-box
setting often exhibit low transferability across different models. Since
adversarial transferability poses more severe threats to practical
applications, various approaches have been proposed for better transferability,
including gradient-based, input transformation-based, and model-related
attacks, etc. In this work, we find that several tiny changes in the existing
adversarial attacks can significantly affect the attack performance, e.g., the
number of iterations and step size. Based on careful studies of existing
adversarial attacks, we propose a bag of tricks to enhance adversarial
transferability, including momentum initialization, scheduled step size, dual
example, spectral-based input transformation, and several ensemble strategies.
Extensive experiments on the ImageNet dataset validate the high effectiveness
of our proposed tricks and show that combining them can further boost
adversarial transferability. Our work provides practical insights and
techniques to enhance adversarial transferability, and offers guidance to
improve attack performance in real-world applications through simple
adjustments.
[LINK]
http://arxiv.org/abs/2401.08734v2
[DATE]
2024-07-20 11:11:22+08:00
[CATEGORIES]
cs.LG
HRNet: Differentially Private Hierarchical and Multi-Resolution Network for Human Mobility Data Synthesization
[AUTHORS]
Shun Takagi, Li Xiong, Fumiyuki Kato, Yang Cao, Masatoshi Yoshikawa
[ABSTRACT]
Human mobility data offers valuable insights for many applications such as
urban planning and pandemic response, but its use also raises privacy concerns.
In this paper, we introduce the Hierarchical and Multi-Resolution Network
(HRNet), a novel deep generative model specifically designed to synthesize
realistic human mobility data while guaranteeing differential privacy. We first
identify the key difficulties inherent in learning human mobility data under
differential privacy. In response to these challenges, HRNet integrates three
components: a hierarchical location encoding mechanism, multi-task learning
across multiple resolutions, and private pre-training. These elements
collectively enhance the model’s ability under the constraints of differential
privacy. Through extensive comparative experiments utilizing a real-world
dataset, HRNet demonstrates a marked improvement over existing methods in
balancing the utility-privacy trade-off.
[LINK]
http://arxiv.org/abs/2405.08043v2
[DATE]
2024-07-20 11:06:26+08:00
[CATEGORIES]
cs.LG
Meta-GPS++: Enhancing Graph Meta-Learning with Contrastive Learning and Self-Training
[AUTHORS]
Yonghao Liu, Mengyu Li, Ximing Li, Lan Huang, Fausto Giunchiglia, Yanchun Liang, Xiaoyue Feng, Renchu Guan
[ABSTRACT]
Node classification is an essential problem in graph learning. However, many
models typically obtain unsatisfactory performance when applied to few-shot
scenarios. Some studies have attempted to combine meta-learning with graph
neural networks to solve few-shot node classification on graphs. Despite their
promising performance, some limitations remain. First, they employ the node
encoding mechanism of homophilic graphs to learn node embeddings, even in
heterophilic graphs. Second, existing models based on meta-learning ignore the
interference of randomness in the learning process. Third, they are trained
using only limited labeled nodes within the specific task, without explicitly
utilizing numerous unlabeled nodes. Finally, they treat almost all sampled
tasks equally without customizing them for their uniqueness. To address these
issues, we propose a novel framework for few-shot node classification called
Meta-GPS++. Specifically, we first adopt an efficient method to learn
discriminative node representations on homophilic and heterophilic graphs.
Then, we leverage a prototype-based approach to initialize parameters and
contrastive learning for regularizing the distribution of node embeddings.
Moreover, we apply self-training to extract valuable information from unlabeled
nodes. Additionally, we adopt S$^2$ (scaling & shifting) transformation to
learn transferable knowledge from diverse tasks. The results on real-world
datasets show the superiority of Meta-GPS++. Our code is available here.
[COMMENTS]
ACM Transactions on Knowledge Discovery from Data (TKDD)
[LINK]
http://arxiv.org/abs/2407.14732v1
[DATE]
2024-07-20 11:05:12+08:00
[CATEGORIES]
cs.LG
FedDM: Enhancing Communication Efficiency and Handling Data Heterogeneity in Federated Diffusion Models
[AUTHORS]
Jayneel Vora, Nader Bouacida, Aditya Krishnan, Prasant Mohapatra
[ABSTRACT]
We introduce FedDM, a novel training framework designed for the federated
training of diffusion models. Our theoretical analysis establishes the
convergence of diffusion models when trained in a federated setting, presenting
the specific conditions under which this convergence is guaranteed. We propose
a suite of training algorithms that leverage the U-Net architecture as the
backbone for our diffusion models. These include a basic Federated Averaging
variant, FedDM-vanilla, FedDM-prox to handle data heterogeneity among clients,
and FedDM-quant, which incorporates a quantization module to reduce the model
update size, thereby enhancing communication efficiency across the federated
network.
We evaluate our algorithms on FashionMNIST (28x28 resolution), CIFAR-10
(32x32 resolution), and CelebA (64x64 resolution) for DDPMs, as well as LSUN
Church Outdoors (256x256 resolution) for LDMs, focusing exclusively on the
imaging modality. Our evaluation results demonstrate that FedDM algorithms
maintain high generation quality across image resolutions. At the same time,
the use of quantized updates and proximal terms in the local training objective
significantly enhances communication efficiency (up to 4x) and model
convergence, particularly in non-IID data settings, at the cost of increased
FID scores (up to 1.75x).
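The two ingredients named above, weighted averaging of client updates and quantization of those updates, can be sketched as follows. This is a generic illustration under assumed choices (uniform min-max quantization, weighting by client dataset size, hypothetical function names); the quantization module and aggregation rule used in FedDM may differ, and the proximal term of FedDM-prox is not shown.

```python
import numpy as np

def quantize(update, num_bits=8):
    """Uniformly quantize an update (num_bits <= 16); returns codes, scale, offset."""
    lo, hi = float(update.min()), float(update.max())
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    dtype = np.uint8 if num_bits <= 8 else np.uint16
    codes = np.round((update - lo) / scale).astype(dtype)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes.astype(np.float64) * scale + lo

def federated_average(global_weights, client_updates, client_sizes):
    """Add the dataset-size-weighted mean of client updates to the global weights."""
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    aggregate = sum(w * u for w, u in zip(weights, client_updates))
    return global_weights + aggregate
```

In this sketch each client would send `codes`, `scale`, and `lo` instead of full-precision values, and the server would call `dequantize` before `federated_average`. An 8-bit code in place of a 32-bit float makes the payload roughly four times smaller, at the price of a rounding error of at most half a quantization step per entry.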
[COMMENTS]
13 pages, 3 figures, 2 algorithms, 3 tables
[LINK]
http://arxiv.org/abs/2407.14730v1
[DATE]
2024-07-20 10:54:41+08:00
[CATEGORIES]
cs.LG
Interacting Diffusion Processes for Event Sequence Forecasting
[AUTHORS]
Mai Zeng, Florence Regol, Mark Coates
[ABSTRACT]
Neural Temporal Point Processes (TPPs) have emerged as the primary framework
for predicting sequences of events that occur at irregular time intervals, but
their sequential nature can hamper performance for long-horizon forecasts. To
address this, we introduce a novel approach that incorporates a diffusion
generative model. The model facilitates sequence-to-sequence prediction,
allowing multi-step predictions based on historical event sequences. In
contrast to previous approaches, our model directly learns the joint
probability distribution of types and inter-arrival times for multiple events.
This allows us to fully leverage the high dimensional modeling capability of
modern generative models. Our model is composed of two diffusion processes, one
for the time intervals and one for the event types. These processes interact
through their respective denoising functions, which can take as input
intermediate representations from both processes, allowing the model to learn
complex interactions. We demonstrate that our proposal outperforms
state-of-the-art baselines for long-horizon forecasting of TPP.
[COMMENTS]
camera ready version for ICML
[LINK]
http://arxiv.org/abs/2310.17800v2
[DATE]
2024-07-20 10:52:55+08:00
[CATEGORIES]
cs.LG
Downstream-Pretext Domain Knowledge Traceback for Active Learning
[AUTHORS]
Beichen Zhang, Liang Li, Zheng-Jun Zha, Jiebo Luo, Qingming Huang
[ABSTRACT]
Active learning (AL) is designed to construct a high-quality labeled dataset
by iteratively selecting the most informative samples. Such sampling heavily
relies on data representation, and pre-training has recently become popular for
robust feature learning. However, as pre-training utilizes low-level pretext
tasks that lack annotation, directly using pre-trained representation in AL is
inadequate for determining the sampling score. To address this problem, we
propose a downstream-pretext domain knowledge traceback (DOKT) method that
traces the data interactions of downstream knowledge and pre-training guidance
for selecting diverse and instructive samples near the decision boundary. DOKT
consists of a traceback diversity indicator and a domain-based uncertainty
estimator. The diversity indicator constructs two feature spaces based on the
pre-training pretext model and the downstream knowledge from annotation, by
which it locates the neighbors of unlabeled data from the downstream space in
the pretext space to explore the interaction of samples. With this mechanism,
DOKT unifies the data relations of low-level and high-level representations to
estimate traceback diversity. Next, in the uncertainty estimator, domain mixing
is designed to apply perceptual perturbations to unlabeled samples with similar
visual patches in the pretext space. Then the divergence of perturbed samples
is measured to estimate the domain uncertainty. As a result, DOKT selects the
most diverse and important samples based on these two modules. The experiments
conducted on ten datasets show that our model outperforms other
state-of-the-art methods and generalizes well to various application scenarios
such as semantic segmentation and image captioning.
[LINK]
http://arxiv.org/abs/2407.14720v1
[DATE]
2024-07-20 09:34:13+08:00
[CATEGORIES]
cs.LG
Differential Privacy of Cross-Attention with Provable Guarantee
[AUTHORS]
Jiuxiang Gu, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou
[ABSTRACT]
Cross-attention has become a fundamental module nowadays in many important
artificial intelligence applications, e.g., retrieval-augmented generation
(RAG), system prompts, guided stable diffusion, and so on. Ensuring
cross-attention privacy is crucial and urgently needed because its key and
value matrices may contain sensitive information about companies and their
users, many of which profit solely from their system prompts or RAG data. In
this work, we design a novel differential privacy (DP) data structure to
address the privacy security of cross-attention with a theoretical guarantee.
In detail, let $n$ be the input token length of system prompt/RAG data, $d$ be
the feature dimension, $0 < \alpha \le 1$ be the relative error parameter, $R$
be the maximum value of the query and key matrices, $R_w$ be the maximum value
of the value matrix, and $r,s,\epsilon_s$ be parameters of polynomial kernel
methods. Then, our data structure requires $\widetilde{O}(ndr^2)$ memory
consumption with $\widetilde{O}(nr^2)$ initialization time complexity and
$\widetilde{O}(\alpha^{-1} r^2)$ query time complexity for a single token
query. In addition, our data structure can guarantee that the user query is
$(\epsilon, \delta)$-DP with $\widetilde{O}(n^{-1} \epsilon^{-1} \alpha^{-1/2}
R^{2s} R_w r^2)$ additive error and $n^{-1} (\alpha + \epsilon_s)$ relative
error between our output and the true answer. Furthermore, our result is robust
to adaptive queries in which users can intentionally attack the cross-attention
system. To our knowledge, this is the first work to provide DP for
cross-attention. We believe it can inspire more privacy algorithm design in
large generative models (LGMs).
[LINK]
http://arxiv.org/abs/2407.14717v1
[DATE]
2024-07-20 09:02:27+08:00
[CATEGORIES]
cs.LG
Unveiling the Decision-Making Process in Reinforcement Learning with Genetic Programming
[AUTHORS]
Manuel Eberhardinger, Florian Rupp, Johannes Maucher, Setareh Maghsudi
[ABSTRACT]
Despite tremendous progress, machine learning and deep learning still suffer
from incomprehensible predictions. Incomprehensibility, however, is not an
option for the use of (deep) reinforcement learning in the real world, as
unpredictable actions can seriously harm the involved individuals. In this
work, we propose a genetic programming framework to generate explanations for
the decision-making process of already trained agents by imitating them with
programs. Programs are interpretable and can be executed to generate
explanations of why the agent chooses a particular action. Furthermore, we
conduct an ablation study that investigates how extending the domain-specific
language by using library learning alters the performance of the method. We
compare our results with the previous state of the art for this problem and
show that we are comparable in performance but require far fewer hardware
resources and much less computation time.
[COMMENTS]
Accepted at: The Fifteenth International Conference on Swarm
Intelligence (ICSI’2024)
[LINK]
http://arxiv.org/abs/2407.14714v1
[DATE]
2024-07-20 08:45:03+08:00
[CATEGORIES]
cs.LG
Efficient Active Learning Halfspaces with Tsybakov Noise: A Non-convex Optimization Approach
[AUTHORS]
Yinan Li, Chicheng Zhang
[ABSTRACT]
We study the problem of computationally and label efficient PAC active
learning $d$-dimensional halfspaces with Tsybakov
Noise~\citep{tsybakov2004optimal} under structured unlabeled data
distributions. Inspired by~\cite{diakonikolas2020learning}, we prove that any
approximate first-order stationary point of a smooth nonconvex loss function
yields a halfspace with a low excess error guarantee. In light of the above
structural result, we design a nonconvex optimization-based algorithm with a
label complexity of $\tilde{O}(d
(\frac{1}{\epsilon})^{\frac{8-6\alpha}{3\alpha-1}})$, under the assumption that
the Tsybakov noise parameter $\alpha \in (\frac13, 1]$, which narrows down the
gap between the label complexities of the previously known efficient passive or
active algorithms~\citep{diakonikolas2020polynomial,zhang2021improved} and the
information-theoretic lower bound in this setting.
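To see how the stated label complexity depends on the noise parameter, it helps to evaluate the exponent of $1/\epsilon$ at a few values of $\alpha$. The short calculation below uses only the formula given above; the symbol $f$ is introduced here for convenience.

```latex
\[
  f(\alpha) = \frac{8 - 6\alpha}{3\alpha - 1}, \qquad
  f(1) = \frac{2}{2} = 1, \qquad
  f\!\left(\tfrac{1}{2}\right) = \frac{5}{1/2} = 10, \qquad
  f'(\alpha) = -\frac{18}{(3\alpha - 1)^{2}} < 0 .
\]
```

The bound therefore reads $\tilde{O}(d/\epsilon)$ at $\alpha = 1$ and $\tilde{O}(d\,\epsilon^{-10})$ at $\alpha = \tfrac12$, and the exponent grows without bound as $\alpha$ approaches $\tfrac13$ from above.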
[COMMENTS]
29 pages
[LINK]
http://arxiv.org/abs/2310.15411v2
[DATE]
2024-07-20 08:01:06+08:00
[CATEGORIES]
cs.LG