arxiv compressed, 2024-07-08

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-08 generated by the compressor, my personal LLM-based project.


LaRa: Efficient Large-Baseline Radiance Fields

http://arxiv.org/abs/2407.04699v1

Compressor summary: The paper presents a novel method using transformers with local and global attention for efficient and accurate 3D radiance field reconstruction and novel view synthesis.


VCoME: Verbal Video Composition with Multimodal Editing Effects

http://arxiv.org/abs/2407.04697v1

Compressor summary: The paper introduces VCoME, a framework for generating coherent and visually appealing verbal videos with multimodal editing effects, which is 85 times more efficient than professional editors.


Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

http://arxiv.org/abs/2407.04694v1

Compressor summary: The Situational Awareness Dataset (SAD) is a benchmark that tests the self-knowledge and situational awareness of large language models (LLMs), capacities that are important for their ability to plan and act autonomously but that also raise novel risks.


ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models

http://arxiv.org/abs/2407.04693v1

Compressor summary: The paper proposes an iterative self-training framework that scales up hallucination annotation and improves the accuracy of the annotator, outperforming GPT-4 and achieving state-of-the-art results in hallucination detection.


Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks

http://arxiv.org/abs/2407.04690v1

Compressor summary: Interpretability research based on counterfactual theories of causation struggles to capture multiple causes and transitive dependencies in neural networks, which undermines the accuracy of causal graph extraction and interpretation.


Enhancing Vehicle Re-identification and Matching for Weaving Analysis

http://arxiv.org/abs/2407.04688v1

Compressor summary: The paper presents a new method to collect video data on lane-specific weaving patterns, which can help improve traffic management systems.


Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge

http://arxiv.org/abs/2407.04681v1

Compressor summary: The paper proposes a new visual prompt approach to integrate fine-grained external knowledge into multimodal large language models, improving their ability to understand detailed or localized visual elements.


XQSV: A Structurally Variable Network to Imitate Human Play in Xiangqi

http://arxiv.org/abs/2407.04678v1

Compressor summary: The paper introduces XQSV, a deep learning architecture for Xiangqi that can change its structure and mimic human players with high accuracy and indistinguishability.


Is plantar thermography a valid digital biomarker for characterising diabetic foot ulceration risk?

http://arxiv.org/abs/2407.04676v1

Compressor summary: The study found strong associations between plantar thermography clusters and diabetic foot ulceration risk factors, but the clusters themselves were not predictive of those risk factors.


Unsupervised 4D Cardiac Motion Tracking with Spatiotemporal Optical Flow Networks

http://arxiv.org/abs/2407.04663v1

Compressor summary: The paper introduces an unsupervised deep learning method for tracking myocardial motion from low-resolution echocardiography images, which improves accuracy and speed over existing methods.


SAM Fewshot Finetuning for Anatomical Segmentation in Medical Images

http://arxiv.org/abs/2407.04651v1

Compressor summary: The authors propose a few-shot fine-tuning strategy for adapting Segment Anything (SAM) to medical image segmentation tasks, which reduces user interaction and improves efficiency by using embeddings from annotated slices as prompts.


Semi-Supervised Segmentation via Embedding Matching

http://arxiv.org/abs/2407.04638v1

Compressor summary: The authors propose a semi-supervised segmentation method that uses unlabeled images and a few labeled ones to train a model, improving hip bone segmentation in CT scans with less data.


Entity Decomposition with Filtering: A Zero-Shot Clinical Named Entity Recognition Framework

http://arxiv.org/abs/2407.04629v1

Compressor summary: This paper proposes EDF, a novel framework that improves the zero-shot clinical named entity recognition performance of open NER LLMs by decomposing the task into sub-entity retrievals and filtering out incorrect entities.


On scalable oversight with weak LLMs judging strong LLMs

http://arxiv.org/abs/2407.04622v1

Compressor summary: The paper explores how different oversight protocols involving AI and human judges perform across various tasks with information asymmetry and finds that debate outperforms consultancy and is comparable to direct question-answering depending on the task type.


OneRestore: A Universal Restoration Framework for Composite Degradation

http://arxiv.org/abs/2407.04621v1

Compressor summary: OneRestore is a transformer-based framework that adapts to different image degradation scenarios by merging scene descriptors with image features using cross-attention, achieving superior restoration results.


Learning to (Learn at Test Time): RNNs with Expressive Hidden States

http://arxiv.org/abs/2407.04620v1

Compressor summary: The authors propose Test-Time Training (TTT) layers, whose hidden state is itself a machine learning model updated by a self-supervised step on each input, achieving linear complexity with expressive power comparable to Transformers and modern RNNs in sequence modeling.
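
The core idea of a hidden state that is itself a small model can be sketched in a few lines. This is a toy illustration only, not the paper's architecture: the linear reconstruction objective, the zero initialization, and the step size are all assumptions made for the sketch.

```python
import numpy as np

def ttt_layer(xs, dim, lr=0.1):
    """Toy Test-Time-Training-style layer: the hidden state is the
    weight matrix W of a linear model, updated at every step by a
    gradient step on a self-supervised reconstruction loss."""
    W = np.zeros((dim, dim))          # hidden state = model parameters
    outputs = []
    for x in xs:
        # self-supervised loss: ||W x - x||^2 (reconstruction)
        grad = 2 * np.outer(W @ x - x, x)
        W = W - lr * grad             # "learn" at test time
        outputs.append(W @ x)         # output uses the updated state
    return np.stack(outputs), W
```

Because the update is a fixed-cost gradient step per token, the cost of the whole sequence grows linearly with its length, unlike self-attention.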


CountGD: Multi-Modal Open-World Counting

http://arxiv.org/abs/2407.04619v1

Compressor summary: The paper proposes CountGD, a model that can count objects in images using text or visual exemplars or both, and shows its superior performance on multiple counting benchmarks.


Randomized Physics-Informed Neural Networks for Bayesian Data Assimilation

http://arxiv.org/abs/2407.04617v1

Compressor summary: The randomized physics-informed neural network (rPINN) method is a faster and more effective alternative to Hamiltonian Monte Carlo (HMC) for uncertainty quantification in inverse PDE problems with noisy data.


Isomorphic Pruning for Vision Models

http://arxiv.org/abs/2407.04616v1

Compressor summary: Isomorphic Pruning is a simple approach that effectively prunes heterogeneous sub-structures in advanced vision models, improving accuracy and reducing computation.


ARM: Efficient Guided Decoding with Autoregressive Reward Models

http://arxiv.org/abs/2407.04615v1

Compressor summary: The paper proposes autoregressive reward models as an efficient way to guide language model decoding with task-specific rewards for real-world applications such as detoxification and sentiment control.
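
Generic reward-guided decoding, of which this is an instance, can be sketched as rescoring the base model's next-token distribution with per-token reward scores. This is a schematic of guided decoding in general, not of ARM itself; the additive combination and the `beta` weight are assumptions.

```python
import numpy as np

def guided_step(lm_logprobs, reward_scores, beta=1.0):
    """Toy reward-guided decoding step: combine the base LM's token
    log-probabilities with per-token scores from a reward model,
    then renormalize into a sampling distribution."""
    combined = lm_logprobs + beta * reward_scores
    combined -= combined.max()        # numerical stability
    probs = np.exp(combined)
    return probs / probs.sum()
```

With `reward_scores` all zero this reduces to the base model's distribution; a large reward on one token shifts the distribution toward it.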


PartCraft: Crafting Creative Objects by Parts

http://arxiv.org/abs/2407.04604v1

Compressor summary: PartCraft is a generative visual AI system that allows users to select and combine parts of objects for fine-grained, faithful, and plausible results.


AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

http://arxiv.org/abs/2407.04603v1

Compressor summary: AWT is a novel adaptation framework that enhances vision-language models for various visual tasks by augmenting inputs, dynamically weighting them, and mining semantic correlations in the vision-language space.


Understanding the Gains from Repeated Self-Distillation

http://arxiv.org/abs/2407.04600v1

Compressor summary: Self-distillation improves performance, especially with multiple steps, and can significantly reduce excess risk in linear regression.
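
In the linear-regression setting the paper studies, repeated self-distillation has a particularly clean form: each round refits the ridge estimator on the previous model's predictions instead of the original labels. The sketch below is illustrative only; the ridge penalty and number of rounds are arbitrary choices.

```python
import numpy as np

def self_distill(X, y, lam=1.0, rounds=3):
    """Toy repeated self-distillation for ridge regression: round k
    fits on the predictions of round k-1 rather than on y."""
    d = X.shape[1]
    A = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ X.T  # ridge fit map
    targets = y
    for _ in range(rounds):
        w = A @ targets       # fit on current targets
        targets = X @ w       # next round distills these predictions
    return w
```

Each round applies the same shrinkage operator again, so repeated distillation progressively damps directions with small singular values.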


Feature Attenuation of Defective Representation Can Resolve Incomplete Masking on Anomaly Detection

http://arxiv.org/abs/2407.04597v1

Compressor summary: FADeR improves unsupervised anomaly detection using minimal changes and fewer layers in a neural network, reducing false alarms and increasing efficiency for edge computing.


Testing learning hypotheses using neural networks by manipulating learning data

http://arxiv.org/abs/2407.04593v1

Compressor summary: The study explores how English speakers learn exceptions to the passive voice rule using neural network language models, and finds that verb frequency in the passive affects its passivizability more than semantics.


Smell and Emotion: Recognising emotions in smell-related artworks

http://arxiv.org/abs/2407.04592v1

Compressor summary: This paper shows how to recognize emotions from smell-related artworks using style transfer and hyperparameter tuning, and suggests ways to improve it further.


Proximal Point Method for Online Saddle Point Problem

http://arxiv.org/abs/2407.04591v1

Compressor summary: The paper proposes and analyzes three variants of the online proximal point method for solving time-varying convex-concave games, achieving near-optimal duality gap and dynamic Nash equilibrium regret bounds in benign environments.
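
For the bilinear special case min_x max_y x^T A y, the implicit proximal point update z_{t+1} = z_t - eta * F(z_{t+1}) can be solved exactly as a linear system, which makes its stability easy to see. This is a generic sketch of the proximal point method on a static bilinear game, not the paper's online variants; the step size and initialization are assumptions.

```python
import numpy as np

def proximal_point_bilinear(A, steps=100, eta=0.5):
    """Proximal point iterates for min_x max_y x^T A y: the implicit
    update (I + eta*J) z_{t+1} = z_t is solved exactly each step,
    where J is the (skew-symmetric) game operator."""
    n, m = A.shape
    J = np.block([[np.zeros((n, n)), A],
                  [-A.T, np.zeros((m, m))]])
    M = np.eye(n + m) + eta * J
    z = np.ones(n + m)
    for _ in range(steps):
        z = np.linalg.solve(M, z)
    return z[:n], z[n:]
```

Because J is skew-symmetric, every eigenvalue of (I + eta*J)^{-1} has modulus below one, so the iterates contract to the equilibrium, whereas plain gradient descent-ascent on the same game diverges.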


SH17: A Dataset for Human Safety and Personal Protective Equipment Detection in Manufacturing Industry

http://arxiv.org/abs/2407.04590v1

Compressor summary: The researchers developed a system using object detection and neural networks to detect and verify the proper use of personal protective equipment in various industries, creating a dataset and achieving promising results.


Multimodal Classification via Modal-Aware Interactive Enhancement

http://arxiv.org/abs/2407.04587v1

Compressor summary: The paper introduces modal-aware interactive enhancement (MIE), a novel multimodal learning method that uses sharpness-aware minimization and gradient modification to balance learning speeds, improve generalization, and reduce modality forgetting.


Leveraging Large Language Models for Integrated Satellite-Aerial-Terrestrial Networks: Recent Advances and Future Directions

http://arxiv.org/abs/2407.04581v1

Compressor summary: The paper discusses how integrating Large Language Models into Integrated Satellite, Aerial, and Terrestrial Networks can improve connectivity and performance by enhancing data processing, network management, and advanced algorithms.


GOALPlace: Begin with the End in Mind

http://arxiv.org/abs/2407.04579v1

Compressor summary: GOALPlace is a new learning-based method that improves placement congestion by controlling cell density, achieving superior or comparable results to commercial tools.


Real Time Emotion Analysis Using Deep Learning for Education, Entertainment, and Beyond

http://arxiv.org/abs/2407.04560v1

Compressor summary: The authors are developing a system that uses deep learning to detect facial expressions and display matching emojis in real-time for various applications such as education and entertainment.


Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition

http://arxiv.org/abs/2407.04559v1

Compressor summary: The paper proposes a method to measure visual storytelling quality based on human likeness and evaluates several models, finding LLaVA to be the best performer, though with room for improvement.


Spontaneous Reward Hacking in Iterative Self-Refinement

http://arxiv.org/abs/2407.04549v1

Compressor summary: The paper investigates how iterative self-refinement in language models can lead to reward hacking, where the generated output optimizes the evaluator's ratings instead of actual user preference.


Gaussian Eigen Models for Human Heads

http://arxiv.org/abs/2407.04545v1

Compressor summary: The paper proposes a novel method for generating facial expressions using low-dimensional linear spaces based on dynamic 3D Gaussians, enabling real-time rendering and control.


Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations

http://arxiv.org/abs/2407.04543v1

Compressor summary: The paper proposes intermediate pre-training for Transformers to improve their inductive biases for syntactic transformations in seq2seq tasks, leading to better few-shot learning and structural generalization.


PoPreRo: A New Dataset for Popularity Prediction of Romanian Reddit Posts

http://arxiv.org/abs/2407.04541v1

Compressor summary: The authors present PoPreRo, a dataset for predicting the popularity of Romanian Reddit posts, and show that it is a challenging task even for state-of-the-art models.


PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers

http://arxiv.org/abs/2407.04538v1

Compressor summary: The paper shows that pre-trained transformer-based vision models enable more flexible object part detection and improve fine-grained classification tasks, challenging restrictive assumptions on part properties.


Introducing 'Inside' Out of Distribution

http://arxiv.org/abs/2407.04534v1

Compressor summary: This study proposes a new way to categorize out-of-distribution samples (inside and outside) and analyzes their impact on machine learning models' performance.


Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect

http://arxiv.org/abs/2407.04533v1

Compressor summary: SSL models pretrained on speech improve SLU and ASR for low-resource Tunisian Arabic dialect, especially when fine-tuned with limited data.


GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning

http://arxiv.org/abs/2407.04528v1

Compressor summary: The paper compares different fine-tuning techniques for large language models, showing that RETRO models excel in zero-shot settings while GPT models have higher potential with parameter efficiency, and 8B models strike an optimal balance between cost and performance.


Graph Reinforcement Learning in Power Grids: A Survey

http://arxiv.org/abs/2407.04522v1

Compressor summary: This review discusses the potential of graph neural networks (GNNs) and reinforcement learning (RL) for improving power grid control, addressing different use cases and challenges in transmission and distribution grids.


Success or Failure? Analyzing Segmentation Refinement with Few-Shot Segmentation

http://arxiv.org/abs/2407.04519v1

Compressor summary: The paper proposes JFS, a method to assess the success of segmentation refinement using few-shot segmentation, which evaluates the quality of refined masks compared to coarse masks.


G-Adaptive mesh refinement -- leveraging graph neural networks and differentiable finite element solvers

http://arxiv.org/abs/2407.04516v1

Compressor summary: The paper proposes a novel graph neural network approach that optimizes mesh adaptivity in finite element methods by minimizing the solution error, improving accuracy and efficiency compared to classical and previous machine learning methods.


LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order

http://arxiv.org/abs/2407.04513v1

Compressor summary: The paper proposes training methods for vision transformers that allow them to adapt to different layer execution orders, random merging, and layer pruning at test time, but with some reduction in accuracy.
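
The training-time mechanism is simple to sketch: each forward pass executes the layers in a freshly shuffled order, so every layer must learn to cope with an arbitrary position. This is an illustrative sketch, not the paper's code; the toy layers below are placeholders.

```python
import random

def layer_shuffle_forward(x, layers, rng=random):
    """Toy LayerShuffle-style forward pass: the layer execution order
    is randomized on every call, so training exposes each layer to
    arbitrary depths."""
    order = list(range(len(layers)))
    rng.shuffle(order)
    for i in order:
        x = layers[i](x)
    return x
```

At test time the same property allows layers to be pruned or merged across models, since no layer depends on occupying a fixed position.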


Hyperspectral Dataset and Deep Learning methods for Waste from Electric and Electronic Equipment Identification (WEEE)

http://arxiv.org/abs/2407.04505v1

Compressor summary: The paper evaluates deep learning architectures for hyperspectral image segmentation and shows that combining spectral and spatial information improves results, while also releasing a new dataset for the task.


Segment Any 4D Gaussians

http://arxiv.org/abs/2407.04504v1

Compressor summary: The paper introduces SA4D, a framework for segmenting objects in 4D digital scenes using 4D Gaussians and temporal identity features, with applications like removal, recoloring, composition, and rendering of masks.


PROUD: PaRetO-gUided Diffusion Model for Multi-objective Generation

http://arxiv.org/abs/2407.04493v1

Compressor summary: PROUD is a deep generative model that optimizes multiple properties simultaneously, preserving sample quality and achieving Pareto optimality in image and protein generation tasks.


Better by Default: Strong Pre-Tuned MLPs and Boosted Trees on Tabular Data

http://arxiv.org/abs/2407.04491v1

Compressor summary: The authors introduce RealMLP, an improved MLP, and better default parameters for GBDTs and RealMLP, which offer a good time-accuracy tradeoff and can achieve excellent results on tabular data without hyperparameter tuning.


Micro-gesture Online Recognition using Learnable Query Points

http://arxiv.org/abs/2407.04490v1

Compressor summary: The paper presents HFUT-VUT, an approach for online recognition of micro-gestures in videos using learnable query points, which ranked 2nd in the challenge.


Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model

http://arxiv.org/abs/2407.04489v1

Compressor summary: The paper proposes a new framework for few-shot learning that uses dual contexts and unbalanced optimal transport theory to improve feature representation, alignment, and image augmentation, achieving better results than existing methods.


Leveraging Graph Structures to Detect Hallucinations in Large Language Models

http://arxiv.org/abs/2407.04485v1

Compressor summary: The authors propose a method using graph attention networks and contrastive learning to detect hallucinations in large language models and improve their trustworthiness.


Optimizing the image correction pipeline for pedestrian detection in the thermal-infrared domain

http://arxiv.org/abs/2407.04484v1

Compressor summary: This paper investigates how different infrared processing methods affect pedestrian detection in low-visibility urban scenarios and recommends using a specific shutterless pipeline with tonemapping for autonomous driving.


Using Petri Nets as an Integrated Constraint Mechanism for Reinforcement Learning Tasks

http://arxiv.org/abs/2407.04481v1

Compressor summary: The paper proposes using Petri nets to integrate RL models in real-world domains, improving trustworthiness by enabling verification of model properties and enforcing constraints.


LoCo: Low-Bit Communication Adaptor for Large-scale Model Training

http://arxiv.org/abs/2407.04480v1

Compressor summary: LoCo is a method that compensates gradients before compression, ensuring efficient synchronization and maintaining training quality for large-scale models.
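
The general idea of compensating gradients before compression, known as error feedback, can be sketched as carrying forward the residual that compression dropped. This is a generic sketch of that idea, not LoCo itself: top-k sparsification stands in for low-bit quantization, and `bits_keep` is an assumed knob.

```python
import numpy as np

def make_error_feedback(shape, bits_keep=0.1):
    """Toy error-feedback compressor: the residual left over from the
    previous round is added back before compressing again, so no
    gradient information is permanently lost."""
    residual = np.zeros(shape)

    def compress(grad):
        nonlocal residual
        corrected = grad + residual            # compensate past error
        k = max(1, int(bits_keep * corrected.size))
        idx = np.argsort(np.abs(corrected))[-k:]
        sent = np.zeros_like(corrected)
        sent[idx] = corrected[idx]             # what is actually sent
        residual = corrected - sent            # remember what was dropped
        return sent

    return compress
```

Over many steps the accumulated residual ensures every coordinate is eventually transmitted, which is what keeps training quality close to full-precision synchronization.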


Rethinking Data Input for Point Cloud Upsampling

http://arxiv.org/abs/2407.04476v1

Compressor summary: The paper rethinks data input for point cloud upsampling by comparing patch-based and full-model inputs, and finds that patch-based input performs better.


Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

http://arxiv.org/abs/2407.04467v1

Compressor summary: This study examines how large language models perform in strategic games and finds they have systematic biases that affect their decision-making, especially when game settings or prompts change.


Using LLMs to label medical papers according to the CIViC evidence model

http://arxiv.org/abs/2407.04466v1

Compressor summary: The authors present CIViC Evidence, a sequence classification problem in medical NLP, and compare fine-tuned language models with GPT-4's few-shot performance on it.


VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing

http://arxiv.org/abs/2407.04461v1

Compressor summary: The paper proposes a novel texture synthesis method for 3D shapes using a 2D-3D collaborative denoising framework, which addresses the modal gap and improves the consistency of textures by aligning variances and refining inpainting.


Generalists vs. Specialists: Evaluating Large Language Models for Urdu

http://arxiv.org/abs/2407.04459v1

Compressor summary: The paper compares general and special-purpose language models on Urdu tasks, finding that special-purpose models perform better and GPT-4-Turbo is more aligned with human evaluation than Llama-3-8b-Instruct.


Robust Multimodal Learning via Representation Decoupling

http://arxiv.org/abs/2407.04458v1

Compressor summary: DMRNet is a novel method for robust multimodal learning that samples embeddings from probabilistic distributions instead of fixed points, allowing it to capture modality-specific information and perform better than existing methods.


Hindsight Preference Learning for Offline Preference-based Reinforcement Learning

http://arxiv.org/abs/2407.04451v1

Compressor summary: HPL uses hindsight information to model human preferences for offline RL, resulting in more robust and advantageous rewards.


Multi-modal Masked Siamese Network Improves Chest X-Ray Representation Learning

http://arxiv.org/abs/2407.04449v1

Compressor summary: The paper proposes using Electronic Health Record (EHR) data to improve self-supervised learning for chest X-ray images by incorporating it into a Masked Siamese Network.


TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR

http://arxiv.org/abs/2407.04444v1

Compressor summary: TokenVerse is a single Transducer-based model for conversational intelligence that handles multiple tasks by integrating task-specific tokens during training and improving ASR and task performance over the cascaded pipeline approach.


Wavelet-based Temporal Attention Improves Traffic Forecasting

http://arxiv.org/abs/2407.04440v1

Compressor summary: The paper presents a wavelet-based neural network model that efficiently captures spatio-temporal correlations in traffic flow data and outperforms existing methods on three real-world datasets.


From 'Showgirls' to 'Performers': Fine-tuning with Gender-inclusive Language for Bias Reduction in LLMs

http://arxiv.org/abs/2407.04434v1

Compressor summary: The paper proposes a catalogue of gender-exclusive terms and their neutral alternatives, which is used to fine-tune LLMs and reduce gender stereotyping.
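
The data-preparation step in the spirit of this approach is a catalogue-driven rewrite of the fine-tuning corpus. The sketch below is a minimal illustration, not the paper's pipeline; the example catalogue entries are assumptions (though "showgirls"/"performers" comes from the title).

```python
import re

def neutralize(text, catalogue):
    """Toy preprocessing step: replace gender-exclusive terms with
    neutral alternatives from a catalogue, using word boundaries so
    substrings inside other words are left alone."""
    for term, neutral in catalogue.items():
        text = re.sub(rf"\b{re.escape(term)}\b", neutral, text)
    return text
```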


The Complexity of Symmetry Breaking Beyond Lex-Leader

http://arxiv.org/abs/2407.04419v1

Compressor summary: The paper studies the complexity of finding symmetry breaking predicates (SBPs) for solving constraint programming problems like SAT, and shows that certifying graph non-isomorphism is a natural barrier for efficient SBP computation.


Trustworthy Classification through Rank-Based Conformal Prediction Sets

http://arxiv.org/abs/2407.04407v1

Compressor summary: The paper proposes a new conformal prediction method for classification tasks that uses rank-based scores to capture uncertainty and provides better coverage than existing methods.
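
The rank-based recipe can be sketched with the standard split-conformal construction: score each calibration example by the rank of its true class under the model's predicted probabilities, then include in each prediction set every class whose rank is within the calibration quantile. This is a generic sketch of rank-based conformal prediction, not necessarily the paper's exact score.

```python
import numpy as np

def rank_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Toy rank-based split-conformal classifier: nonconformity is the
    rank of the true class (rank 1 = most probable); prediction sets
    contain all classes with rank within the calibration quantile."""
    cal_ranks = np.array([
        1 + np.sum(p > p[y]) for p, y in zip(cal_probs, cal_labels)
    ])
    n = len(cal_ranks)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(cal_ranks, level, method="higher")
    sets = []
    for p in test_probs:
        ranks = 1 + np.argsort(np.argsort(-p))   # rank of each class
        sets.append(np.where(ranks <= q)[0])
    return sets
```

Because ranks are integers bounded by the number of classes, the resulting sets have a size that directly reflects how many top classes the calibration data deemed necessary for coverage.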


On Quantum Channel Learning

http://arxiv.org/abs/2407.04406v1

Compressor summary: The text proposes an optimization problem and algorithm for finding the best mapping between two Hilbert spaces using density matrix measurements and quantum channels, which generalizes unitary learning and allows studying probabilistic mixtures and superpositions.