arxiv compressed, 2024-07-11

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-07-11, generated by the compressor, my personal LLM-based project.


LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

http://arxiv.org/abs/2407.07895v1

Compressor summary: The paper introduces LLaVA-NeXT-Interleave, a Large Multimodal Model that handles multiple scenarios (multi-image, multi-frame, multi-view, and multi-patch) using the interleaved data format and the M4-Instruct dataset.


Training on the Test Task Confounds Evaluation and Emergence

http://arxiv.org/abs/2407.07890v1

Compressor summary: The text discusses how training large language models on task-relevant data during pretraining can influence their performance and emergent behavior evaluations, and proposes a method to adjust for this issue by fine-tuning models on the same data before evaluation.


Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

http://arxiv.org/abs/2407.07880v1

Compressor summary: The study proposes Dr. DPO, a method that enhances the robustness of Direct Preference Optimization (DPO) by integrating pairwise and pointwise noise resilience using Distributionally Robust Optimization (DRO).
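
To make the idea concrete, here is a minimal PyTorch sketch of a DRO-style aggregation on top of the standard DPO loss; the exact aggregation form and the beta' temperature are my assumptions based on the summary, not the paper's verified formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO: per-pair negative log-sigmoid of the scaled reward margin.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin)  # shape: (batch,)

def dr_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                beta=0.1, beta_prime=1.0):
    # Assumed DRO aggregation: -beta' * log(mean(exp(-loss_i / beta'))).
    # This soft-min weighting lets pairs with very high loss (likely flipped
    # or noisy preferences) contribute less than under a plain batch average.
    per_pair = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    return -beta_prime * torch.log(torch.exp(-per_pair / beta_prime).mean())
```

As beta_prime grows large, the aggregation recovers the ordinary mean of the per-pair losses, so vanilla DPO is a limiting case of this sketch.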


Toto: Time Series Optimized Transformer for Observability

http://arxiv.org/abs/2407.07874v1

Compressor summary: Toto is a new foundation model for time series forecasting that excels at both observability and general-purpose forecasting tasks.


Dynamical Measure Transport and Neural PDE Solvers for Sampling

http://arxiv.org/abs/2407.07873v1

Compressor summary: The paper presents a new framework for sampling from probability densities using physics-informed neural networks that solve partial differential equations, improving efficiency and coverage compared to existing methods.


Controlling Space and Time with Diffusion Models

http://arxiv.org/abs/2407.07860v1

Compressor summary: 4DiM is a novel view synthesis model that uses diffusion on 3D, 4D, and video data, enabling better fidelity and pose control, as well as handling temporal dynamics in scenes.


FACTS About Building Retrieval Augmented Generation-based Chatbots

http://arxiv.org/abs/2407.07858v1

Compressor summary: Key points:

- Enterprise chatbots use generative AI to boost employee productivity
- RAG, LLMs, and orchestration frameworks are essential for building these chatbots
- Creating effective chatbots is challenging and requires careful engineering of RAG pipelines
- The authors present a framework (FACTS) and provide empirical results on tradeoffs between large and small LLMs

Summary: The paper introduces FACTS, a framework for building secure enterprise chatbots using generative AI, and discusses the challenges and tradeoffs involved.


Progressive Growing of Patch Size: Resource-Efficient Curriculum Learning for Dense Prediction Tasks

http://arxiv.org/abs/2407.07853v1

Compressor summary: Progressive Growing of Patch Size is a resource-efficient method for dense prediction tasks that gradually increases difficulty by growing the patch size during model training, improving convergence and reducing costs.
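
A minimal sketch of what such a curriculum could look like in Python; the linear growth rule, sizes, and grid step here are illustrative assumptions, since the summary does not specify the paper's actual schedule.

```python
def patch_size(epoch, total_epochs, min_size=64, max_size=256, step=32):
    # Curriculum: start training on small (easy, cheap) patches and grow
    # linearly toward the full patch size over the course of training.
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    size = min_size + frac * (max_size - min_size)
    # Snap to a grid so the size stays compatible with the network's strides.
    return int(round(size / step) * step)
```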


OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training

http://arxiv.org/abs/2407.07852v1

Compressor summary: OpenDiLoCo is an open-source tool that enables efficient and scalable training of large language models across multiple locations with low communication and high compute utilization.
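
For flavor, a heavily simplified sketch of a DiLoCo-style communication round: workers take many local steps and only exchange parameter deltas. `run_inner_steps` is a hypothetical helper, and the real algorithm also applies Nesterov momentum in the outer optimizer, which is omitted here.

```python
import copy
import torch

def diloco_round(global_model, workers, inner_steps, outer_lr=0.7):
    # Each worker trains a local copy for many inner steps; only the
    # parameter deltas ("pseudo-gradients") cross the slow network.
    deltas = []
    for worker in workers:
        local = copy.deepcopy(global_model)
        worker.run_inner_steps(local, inner_steps)  # hypothetical helper
        deltas.append([g.data - l.data for g, l in
                       zip(global_model.parameters(), local.parameters())])
    # Outer update: step against the averaged pseudo-gradient.
    for i, g in enumerate(global_model.parameters()):
        g.data -= outer_lr * torch.stack([d[i] for d in deltas]).mean(dim=0)
```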


Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers

http://arxiv.org/abs/2407.07848v1

Compressor summary: The study reveals how sparsity patterns in ReLU Transformers vary across layers and over the course of training, affecting feature learning and causing some dimensions to turn off entirely.


OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

http://arxiv.org/abs/2407.07844v1

Compressor summary: The paper proposes OV-DINO, a method for open-vocabulary detection that pre-trains on diverse datasets and uses language-aware selective fusion to improve performance.


Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification

http://arxiv.org/abs/2407.07842v1

Compressor summary: The paper proposes a ViT-based ReID framework that fuses models trained on different aspect ratios, improving performance on vehicle re-identification tasks with non-square inputs.


Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective

http://arxiv.org/abs/2407.07841v1

Compressor summary: The study benchmarks ten slide-level aggregation techniques for medical imaging and finds that domain-specific foundation models outperform generic ones, but no single model excels in all tasks.


Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison

http://arxiv.org/abs/2407.07840v1

Compressor summary: The proposed DeCC method measures the reliability of a VLM's answers by comparing the answer produced by its direct internal reasoning with indirect answers composed from decomposed sub-questions.


RoBus: A Multimodal Dataset for Controllable Road Networks and Building Layouts Generation

http://arxiv.org/abs/2407.07835v1

Compressor summary: The paper introduces RoBus, a large multimodal dataset for controllable generation of road networks and building layouts in 3D cities, incorporating urban characteristics and providing evaluation metrics.


Disentangled Representation Learning through Geometry Preservation with the Gromov-Monge Gap

http://arxiv.org/abs/2407.07829v1

Compressor summary: The paper proposes the Gromov-Monge Gap (GMG), a regularizer for unsupervised disentangled representation learning that leverages geometrical constraints to preserve the structure of distributions supported on different spaces, making the model decoder-free and more scalable.


Estimating the stability number of a random graph using convolutional neural networks

http://arxiv.org/abs/2407.07827v1

Compressor summary: The paper investigates using convolutional neural networks to predict the stability number of random graphs from their graph images.


When to Accept Automated Predictions and When to Defer to Human Judgment?

http://arxiv.org/abs/2407.07821v1

Compressor summary: The paper proposes a new method to measure the reliability of neural network predictions under data distribution shifts by clustering outputs and using distances between class centroids and incorrect predictions as a metric for confidence.
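
One plausible reading of that metric, sketched in NumPy; the clustering and thresholding details are assumptions on my part, not the paper's exact procedure.

```python
import numpy as np

def centroid_reliability(train_outputs, train_labels, test_output):
    # Compute one centroid per class from training outputs, then use the
    # distance to the nearest centroid as a proxy for confidence: outputs
    # far from every centroid are candidates for deferral to a human.
    centroids = {c: train_outputs[train_labels == c].mean(axis=0)
                 for c in np.unique(train_labels)}
    dists = {c: float(np.linalg.norm(test_output - mu))
             for c, mu in centroids.items()}
    pred = min(dists, key=dists.get)
    return pred, dists[pred]  # accept if below a tuned threshold, else defer
```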


The Misclassification Likelihood Matrix: Some Classes Are More Likely To Be Misclassified Than Others

http://arxiv.org/abs/2407.07818v1

Compressor summary: The Misclassification Likelihood Matrix (MLM) is a new tool that helps assess how reliable neural networks are when predictions change due to distribution shifts and suggests ways to improve them.
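
Here is one way such a matrix might be constructed, as a hedged sketch; the paper's actual estimator may differ.

```python
import numpy as np

def misclassification_likelihood_matrix(probs, labels, n_classes, eps=1e-12):
    # Average the softmax mass that samples of true class i place on every
    # class j, then zero the diagonal so only misclassification mass remains.
    mlm = np.zeros((n_classes, n_classes))
    for c in range(n_classes):
        mask = labels == c
        if mask.any():
            mlm[c] = probs[mask].mean(axis=0)
    np.fill_diagonal(mlm, 0.0)
    return mlm / (mlm.sum(axis=1, keepdims=True) + eps)  # rows sum to ~1
```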


A Survey on Deep Stereo Matching in the Twenties

http://arxiv.org/abs/2407.07816v1

Compressor summary: The paper reviews the recent developments and challenges in deep stereo matching, a field that has seen significant advancements in the last five years thanks to new architectures and paradigms.


Transformer Alignment in Large Language Models

http://arxiv.org/abs/2407.07810v1

Compressor summary: The study analyzes how large language models work by tracing token trajectories through transformer blocks and finds that increased alignment between singular vectors of Residual Jacobians positively correlates with model performance.


SUMix: Mixup with Semantic and Uncertain Information

http://arxiv.org/abs/2407.07805v1

Compressor summary: SUMix is a novel data augmentation approach for deep learning that learns the mixing ratio and uncertainty of mixed samples to improve generalization ability and performance.
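
For context, vanilla mixup looks like the sketch below; per the summary, SUMix departs from it by learning the mixing ratio and an uncertainty term for each mixed sample rather than drawing lambda from a fixed Beta distribution.

```python
import torch

def mixup(x, y_onehot, lam):
    # Vanilla mixup: convex-combine a batch with a shuffled copy of itself,
    # mixing both inputs and (one-hot) labels with the same ratio lam.
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]
    return x_mix, y_mix
```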


ROSA: Random Subspace Adaptation for Efficient Fine-Tuning

http://arxiv.org/abs/2407.07802v1

Compressor summary: ROSA improves parameter-efficient fine-tuning for large models in natural language processing tasks by adapting subspaces of arbitrary dimension with zero latency overhead.


Attribute or Abstain: Large Language Models as Long Document Assistants

http://arxiv.org/abs/2407.07799v1

Compressor summary: The text discusses LAB, a benchmark for evaluating attribution in long document tasks, and finds that citation (response generation and evidence extraction in one step) mostly performs best.


Evaluating Large Language Models with Grid-Based Game Competitions: An Extensible LLM Benchmark and Leaderboard

http://arxiv.org/abs/2407.07796v1

Compressor summary: The text introduces a new benchmark for testing large language models on grid-based games, showing varying performance across different games and prompt types, and providing open-access data for analysis.


Reinforcement Learning of Adaptive Acquisition Policies for Inverse Problems

http://arxiv.org/abs/2407.07794v1

Compressor summary: The paper proposes a reinforcement learning method to sequentially collect measurements for solving under-determined inverse problems, aiming to recover the signal with fewer measurements and improved performance.


Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities

http://arxiv.org/abs/2407.07791v1

Compressor summary: The paper investigates security risks of large language models in multi-agent systems due to the spread of manipulated knowledge and proposes a two-stage attack method to exploit these vulnerabilities.


Raising the Ceiling: Conflict-Free Local Feature Matching with Dynamic View Switching

http://arxiv.org/abs/2407.07789v1

Compressor summary: RCM is a novel feature matching method that addresses the scarcity of matchable points, matching conflicts, and reliance on keypoint repeatability by dynamically switching views, using a conflict-free coarse matching module, and integrating a semi-sparse paradigm with a coarse-to-fine architecture.


Cross Domain Object Detection via Multi-Granularity Confidence Alignment based Mean Teacher

http://arxiv.org/abs/2407.07780v1

Compressor summary: The paper proposes MGCAMT, a Mean Teacher framework that aligns confidence at multiple granularities to improve cross-domain object detection with pseudo labels.


WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment

http://arxiv.org/abs/2407.07778v1

Compressor summary: Key points:

- The paper explores how many and what kind of primitive actions (APIs) are needed for a versatile embodied agent, using wikiHow tutorials as a source of instructions.
- It proposes a framework that uses few-shot prompting to generate Pythonic programs as agent policies and bootstraps a universe of APIs by reusing and creating them.
- Applied to a small fraction of wikiHow tutorials, the framework induces an action space of 300+ APIs, most of which are not supported by existing embodied simulators.

Summary: The paper investigates how to define a large action space for embodied agents using wikiHow tutorials and a few-shot prompting framework that generates Pythonic programs as policies and bootstraps APIs.


Multi-task Prompt Words Learning for Social Media Content Generation

http://arxiv.org/abs/2407.07771v1

Compressor summary: The proposed framework combines multiple tasks to generate comprehensive prompt words that guide ChatGPT to create high-quality tweets, and uses ChatGPT to evaluate the generated content.


Ramsey Theorems for Trees and a General 'Private Learning Implies Online Learning' Theorem

http://arxiv.org/abs/2407.07765v1

Compressor summary: This paper shows that differential privacy and online learning are related in general classification tasks, using Ramsey-type theorems for trees.


PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

http://arxiv.org/abs/2407.07764v1

Compressor summary: The paper proposes PosFormer, a position forest transformer for handwritten mathematical expression recognition that uses a forest structure to model the position and hierarchy of symbols and an implicit attention correction module to improve performance on complex datasets.


S&D Messenger: Exchanging Semantic and Domain Knowledge for Generic Semi-Supervised Medical Image Segmentation

http://arxiv.org/abs/2407.07763v1

Compressor summary: The paper proposes a framework that enables efficient medical image segmentation by delivering semantic and domain knowledge between labeled and unlabeled data, improving performance on various challenging scenarios.


Learning Spatial-Semantic Features for Robust Video Object Segmentation

http://arxiv.org/abs/2407.07760v1

Compressor summary: This paper presents a robust video object segmentation framework with spatial-semantic features and discriminative object queries that achieves state-of-the-art performance on multiple datasets.


LSM: A Comprehensive Metric for Assessing the Safety of Lane Detection Systems in Autonomous Driving

http://arxiv.org/abs/2407.07740v1

Compressor summary: The Lane Safety Metric (LSM) is a new method to assess the safety of lane detection systems for autonomous vehicles by considering factors like object detection, road type, and vehicle speed.


Fine-Tuning Large Language Models with User-Level Differential Privacy

http://arxiv.org/abs/2407.07737v1

Compressor summary: The paper compares two methods for training large language models with user-level differential privacy, showing that one method performs better in different scenarios depending on the number of examples per user and the desired privacy level.


Protecting NeRFs' Copyright via Plug-And-Play Watermarking Base Model

http://arxiv.org/abs/2407.07735v1

Compressor summary: The paper introduces NeRFProtector, a tool that allows NeRF creators to embed binary messages in their 3D scene representations while maintaining performance quality.


PaliGemma: A versatile 3B VLM for transfer

http://arxiv.org/abs/2407.07726v1

Compressor summary: PaliGemma is an open 3B vision-language model that transfers well to a wide variety of tasks thanks to its versatile vision and language components.


Deep-Graph-Sprints: Accelerated Representation Learning in Continuous-Time Dynamic Graphs

http://arxiv.org/abs/2407.07712v1

Compressor summary: Deep-Graph-Sprints (DGS) is a fast and efficient deep learning architecture for representing interconnected, evolving systems on continuous-time dynamic graphs (CTDGs).


Feasibility Study on Active Learning of Smart Surrogates for Scientific Simulations

http://arxiv.org/abs/2407.07674v1

Compressor summary: The paper explores using active learning to train deep neural networks as surrogate models for scientific simulations, reducing the need for extensive and expensive simulation data.


Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization

http://arxiv.org/abs/2407.07673v1

Compressor summary: The paper proposes a new framework for Semi-Supervised Temporal Action Localization (SS-TAL) that improves pseudo-label selection by jointly learning classification confidence and localization reliability, eliminating ambiguous positives, and enhancing action discrimination.


Why should we ever automate moral decision making?

http://arxiv.org/abs/2407.07671v1

Compressor summary: The text discusses the challenges and risks of AI making moral decisions, as ethics lacks a precise mathematical framework and human moral decision-making is imperfect.


How to Leverage Predictive Uncertainty Estimates for Reducing Catastrophic Forgetting in Online Continual Learning

http://arxiv.org/abs/2407.07668v1

Compressor summary: The text discusses how to reduce catastrophic forgetting in machine learning models using memory and predictive uncertainty measures.
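
A minimal sketch of one such recipe (entropy-based memory population); the paper studies which uncertainty measures and policies actually help, so treat the specific choice here as an assumption.

```python
import torch

def predictive_entropy(logits):
    # Shannon entropy of the softmax predictive distribution, a common
    # uncertainty estimate.
    p = torch.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1)

def pick_for_replay(logits, k):
    # Hypothetical policy: keep the k most uncertain samples in the replay
    # memory, on the intuition that they sit near decision boundaries.
    return torch.topk(predictive_entropy(logits), k).indices
```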


VEnhancer: Generative Space-Time Enhancement for Video Generation

http://arxiv.org/abs/2407.07667v1

Compressor summary: VEnhancer is a framework that improves low-quality generated videos by adding details and removing artifacts using a conditioned video diffusion model and a video ControlNet.


A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability

http://arxiv.org/abs/2407.07666v1

Compressor summary: S.C.O.R.E. is a 5-aspect framework to evaluate large language models in healthcare based on safety, consensus, objectivity, reproducibility, and explainability.


A Coding-Theoretic Analysis of Hyperspherical Prototypical Learning Geometry

http://arxiv.org/abs/2407.07664v1

Compressor summary: HPL is a representation-learning approach that places class prototypes on the unit hypersphere, biasing representations toward a scale-invariant and known geometry; the authors improve both the optimisation procedure and the prototype placement.


Mitigating Backdoor Attacks using Activation-Guided Model Editing

http://arxiv.org/abs/2407.07662v1

Compressor summary: The paper proposes a new method to remove hidden triggers from machine learning models, which can cause them to behave unexpectedly, by using unseen data samples to adjust the model's weights.


Boosting Medical Image Synthesis via Registration-guided Consistency and Disentanglement Learning

http://arxiv.org/abs/2407.07660v1

Compressor summary: The paper proposes a registration-guided consistency approach with disentanglement learning for medical image synthesis, which improves alignment and preserves anatomical structures.


The Selective G-Bispectrum and its Inversion: Applications to G-Invariant Networks

http://arxiv.org/abs/2407.07655v1

Compressor summary: The authors propose a selective $G$-Bispectrum that reduces computational cost and improves accuracy and robustness in deep neural networks for achieving group-invariance.


Explaining Graph Neural Networks for Node Similarity on Graphs

http://arxiv.org/abs/2407.07639v1

Compressor summary: The paper evaluates two methods for providing explanations for similarity search in graph data using GNNs and shows that gradient-based explanations have desirable properties such as actionability, consistency, and sparsity.


Tuning Vision-Language Models with Candidate Labels by Prompt Alignment

http://arxiv.org/abs/2407.07638v1

Compressor summary: Key points:

- VLMs learn image-text representations and use prompt learning to adapt to downstream tasks.
- Prompt learning needs true labels, but often only candidate labels are available due to privacy or sensitivity concerns.
- The paper proposes a framework that disambiguates candidate labels and leverages the VLM's prior knowledge.
- The framework improves the robustness of prompt learning with candidate labels.

Summary: The paper introduces a framework for prompt learning with candidate labels for vision-language models, which aligns the model output with a mixed class posterior and uses various training objectives to improve performance.


Pessimism Meets Risk: Risk-Sensitive Offline Reinforcement Learning

http://arxiv.org/abs/2407.07631v1

Compressor summary: The paper develops sample-efficient risk-sensitive offline reinforcement learning algorithms for linear Markov Decision Processes using entropic risk measure.


A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training

http://arxiv.org/abs/2407.07630v1

Compressor summary: The article reviews challenges of using web-mined data for pre-training large language models and suggests ways to improve their accuracy, reliability, and ethical responsibility.


Synthetic to Authentic: Transferring Realism to 3D Face Renderings for Boosting Face Recognition

http://arxiv.org/abs/2407.07627v1

Compressor summary: The paper explores using image-to-image translation to make 3D-rendered facial images more realistic and improve face recognition systems' performance on real-world data.


Psycho-linguistic Experiment on Universal Semantic Components of Verbal Humor: System Description and Annotation

http://arxiv.org/abs/2407.07617v1

Compressor summary: The article discusses an annotation system for humor in texts based on readers' self-paced reading behavior and a related psycho-linguistic experiment.


Satellite Image Time Series Semantic Change Detection: Novel Architecture and Analysis of Domain Shift

http://arxiv.org/abs/2407.07616v1

Compressor summary: Key points:

- The paper tackles change detection and semantic segmentation with satellite image time series (SITS-SCD).
- It proposes a new architecture that improves over the state of the art and leverages long-term temporal information.
- It investigates the impact of spatial and temporal shifts on SITS datasets using DynamicEarthNet and MUDS.
- It finds that spatial domain shift is the most complex setting and that temporal shift affects change detection more than semantic segmentation.

Summary: The paper presents a new method for detecting changes and identifying objects in satellite images over time, and studies how different types of shifts in the data affect its performance.


MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

http://arxiv.org/abs/2407.07614v1

Compressor summary: MARS is a new framework for generating images from text that combines pre-trained language models with visual understanding, enabling bilingual and efficient image synthesis.


Probabilistic learning rate scheduler with provable convergence

http://arxiv.org/abs/2407.07613v1

Compressor summary: The paper introduces a probabilistic learning rate scheduler (PLRS) that does not follow the typical monotonically decreasing rule and proves its convergence, while also showing competitive performance in experiments.
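
An illustrative (not faithful) Python sketch of the concept: the learning rate is sampled around a base value instead of decaying monotonically, with the randomness annealed over training. The paper's actual distribution and convergence argument are not reproduced here.

```python
import random

def plrs_lr(base_lr, step, total_steps, spread=0.5):
    # Draw this step's learning rate from a uniform band around base_lr;
    # the band shrinks linearly so late training behaves almost
    # deterministically.
    anneal = 1.0 - step / total_steps
    jitter = random.uniform(-spread, spread) * anneal
    return max(base_lr * (1.0 + jitter), 1e-8)
```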


Teaching Transformers Causal Reasoning through Axiomatic Training

http://arxiv.org/abs/2407.07612v1

Compressor summary: The study shows that large transformer models can learn causal reasoning from passive data and generalize well to new scenarios by training on multiple demonstrations of causal axioms.


Physics-Informed Geometric Operators to Support Surrogate, Dimension Reduction and Generative Models for Engineering Design

http://arxiv.org/abs/2407.07611v1

Compressor summary: The text proposes physics-informed geometric operators (GOs) for improving performance prediction, dimension reduction, and generative models using high-level intrinsic geometric information and physics in the feature vector.


The Computational Learning of Construction Grammars: State of the Art and Prospective Roadmap

http://arxiv.org/abs/2407.07606v1

Compressor summary: The paper reviews computational models of construction grammar learning, synthesizes existing methodologies and results, and identifies challenges and opportunities for future research.


Early Explorations of Lightweight Models for Wound Segmentation on Mobile Devices

http://arxiv.org/abs/2407.07605v1

Compressor summary: The researchers developed a smartphone app that uses computer vision to automatically recognize and distinguish wounds on elderly patients' skin.


H-FCBFormer Hierarchical Fully Convolutional Branch Transformer for Occlusal Contact Segmentation with Articulating Paper

http://arxiv.org/abs/2407.07604v1

Compressor summary: The H-FCBFormer model uses a Vision Transformer and Fully Convolutional Network ensemble to accurately detect occlusal contacts in dentistry, improving on other machine learning methods and outperforming human dentists.


iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency

http://arxiv.org/abs/2407.07603v1

Compressor summary: iiANET is a hybrid model that combines global self-attention, local convolutions, and channel attention to capture long-range dependencies in complex images, outperforming some state-of-the-art models.


Learning treatment effects while treating those in need

http://arxiv.org/abs/2407.07596v1

Compressor summary: The authors propose a framework to design randomized allocation rules for social programs that balance allocating resources to high-need individuals with evaluating the program's effectiveness, and demonstrate its benefits using data from human services in Allegheny County, Pennsylvania.


Let Occ Flow: Self-Supervised 3D Occupancy Flow Prediction

http://arxiv.org/abs/2407.07587v1

Compressor summary: The paper presents Let Occ Flow, a self-supervised method for predicting 3D occupancy and flow using only camera inputs, which outperforms existing methods on nuScenes and KITTI datasets.


Simplifying Source-Free Domain Adaptation for Object Detection: Effective Self-Training Strategies and Performance Insights

http://arxiv.org/abs/2407.07586v1

Compressor summary: The paper explores simple methods for source-free object detection adaptation and shows that adapting batch statistics and using a modified Mean Teacher with strong-weak augmentation can outperform previous approaches.


TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data

http://arxiv.org/abs/2407.07582v1

Compressor summary: TIP is a novel framework for learning multimodal representations robust to incomplete tabular data, using self-supervised learning strategies and a versatile encoder.


InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior

http://arxiv.org/abs/2407.07580v1

Compressor summary: InstructLayout is a new framework for creating 2D and 3D layouts from natural language instructions with better controllability and fidelity, using a semantic graph prior and a layout decoder, and it outperforms existing methods in various tasks.


IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

http://arxiv.org/abs/2407.07577v1

Compressor summary: The paper introduces a new model (IDA-VLM) and benchmark (MM-ID) to improve the ability of large vision-language models to recognize and associate instance identities across different scenes, which is crucial for understanding complex visual content like movies.


Resource Allocation for Twin Maintenance and Computing Task Processing in Digital Twin Vehicular Edge Computing Network

http://arxiv.org/abs/2407.07575v1

Compressor summary: The paper proposes a method to optimize resource allocation for vehicular edge computing networks using multi-agent deep reinforcement learning, considering delays caused by digital twin maintenance and computational processing.


HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing

http://arxiv.org/abs/2407.07566v1

Compressor summary: HebDB is a large dataset for Hebrew speech processing with raw recordings and pre-processed versions, along with two baseline ASR systems that outperform current multi-lingual alternatives.


On Leakage of Code Generation Evaluation Datasets

http://arxiv.org/abs/2407.07565v1

Compressor summary: The paper investigates how code generation test sets leak into the training data of large language models, identifying three sources of contamination with the help of a new dataset of Python prompts and solutions.


Trainable Highly-expressive Activation Functions

http://arxiv.org/abs/2407.07564v1

Compressor summary: DiTAC is a new trainable activation function that enhances the expressiveness and performance of deep neural nets in various tasks.
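
To illustrate the general idea of a trainable activation (not DiTAC itself, which is reportedly built on diffeomorphic transformations and is far more expressive), here is a toy PyTorch module that learns a convex blend of fixed nonlinearities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlendActivation(nn.Module):
    # Toy trainable activation: a learned softmax-weighted blend of fixed
    # basis nonlinearities. Gradients flow into the blend weights w.
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(3))

    def forward(self, x):
        basis = torch.stack([F.relu(x), torch.tanh(x), F.silu(x)])
        weights = torch.softmax(self.w, dim=0)
        return (weights.view(-1, *([1] * x.dim())) * basis).sum(dim=0)
```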