arxiv compressed, 2024-03-15

This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-03-15, generated by the compressor, my personal LLM-based project.


GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding

http://arxiv.org/abs/2403.09639v1

Compressor summary: GroupContrast is a novel self-supervised method for 3D representation learning that combines segment grouping and semantic-aware contrastive learning to address the "semantic conflict" problem.


SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior

http://arxiv.org/abs/2403.09638v1

Compressor summary: SCP-Diff is a new method for semantic image synthesis that uses joint spatial-categorical noise priors to overcome issues with current approaches, achieving high-quality results.


Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

http://arxiv.org/abs/2403.09636v1

Compressor summary: DMC is a method for compressing the key-value cache in LLMs, improving efficiency and enabling longer contexts and larger batches without sacrificing performance.


Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models

http://arxiv.org/abs/2403.09635v1

Compressor summary: The authors develop a theory to understand and improve deep transformer models, enabling them to achieve better performance on various tasks.


OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning

http://arxiv.org/abs/2403.09634v1

Compressor summary: OneTracker is a general framework for visual object tracking that unifies various tracking tasks by pre-training an RGB tracker, called Foundation Tracker, and then fine-tuning it with modality-specific information.


Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image

http://arxiv.org/abs/2403.09632v1

Compressor summary: Holo-Relighting is a novel volumetric relighting method that synthesizes new viewpoints and lighting from a single portrait image using a pretrained 3D GAN and can generate complex non-Lambertian lighting effects without physical lighting priors.


3D-VLA: A 3D Vision-Language-Action Generative World Model

http://arxiv.org/abs/2403.09631v1

Compressor summary: The paper proposes a new model called 3D-VLA that integrates 3D perception, reasoning, and action, enabling better planning and multimodal generation for embodied AI tasks.


Generalized Predictive Model for Autonomous Driving

http://arxiv.org/abs/2403.09630v1

Compressor summary: The paper presents GenAD, a large-scale video prediction model for autonomous driving that uses web data and text descriptions, outperforming existing models and having potential for real-world applications.


Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

http://arxiv.org/abs/2403.09629v1

Compressor summary: Quiet-STaR is a method for improving language models by teaching them to generate rationales before predicting each token, which helps them answer difficult questions without task-specific fine-tuning.


Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

http://arxiv.org/abs/2403.09626v1

Compressor summary: The paper explores the potential of Mamba, a state space model, for various video understanding tasks, showing its efficiency and effectiveness compared to Transformers.


Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation

http://arxiv.org/abs/2403.09625v1

Compressor summary: Make-Your-3D is a new method that creates personalized, realistic 3D content from a single image of a person and their description, using a co-evolution framework to align multiple models.


Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

http://arxiv.org/abs/2403.09622v1

Compressor summary: The authors propose Glyph-ByT5, a text encoder that improves visual text rendering by fine-tuning ByT5 with a paired glyph-text dataset, and integrate it with SDXL to create Glyph-SDXL, which achieves high accuracy in design image and real image text rendering.


Minimax Optimal and Computationally Efficient Algorithms for Distributionally Robust Offline Reinforcement Learning

http://arxiv.org/abs/2403.09621v1

Compressor summary: The paper proposes optimal algorithms for function approximation in distributionally robust offline reinforcement learning with linearly parameterized models, and analyzes their instance-dependent suboptimality in comparison to standard offline RL.


PosSAM: Panoptic Open-vocabulary Segment Anything

http://arxiv.org/abs/2403.09620v1

Compressor summary: The paper presents PosSAM, an end-to-end open-vocabulary panoptic segmentation model that combines SAM's spatial features with CLIP's semantic features and introduces LDP and MASE techniques to improve performance.


Explore In-Context Segmentation via Latent Diffusion Models

http://arxiv.org/abs/2403.09616v1

Compressor summary: This paper explores using latent diffusion models (LDM) for in-context segmentation, proposes new meta-architectures and strategies, and builds a benchmark with image and video datasets.


Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training

http://arxiv.org/abs/2403.09613v1

Compressor summary: The text studies how large language models fine-tuned sequentially can recover from forgetting and exhibit anticipatory behavior in a cyclical document sequence.


MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

http://arxiv.org/abs/2403.09611v1

Compressor summary: This paper investigates the key components of multimodal language models and shows how to build high-performing ones using different data and model architectures.


Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey

http://arxiv.org/abs/2403.09606v1

Compressor summary: Causal inference can enhance NLP models' accuracy, fairness, robustness, and explainability by capturing causal relationships among variables, and LLMs can contribute to causal inference and improve various NLP domains with their advanced reasoning capabilities.


Counterfactual contrastive learning: robust representations via causal image synthesis

http://arxiv.org/abs/2403.09605v1

Compressor summary: CF-SimCLR is a new contrastive pretraining method that uses counterfactual image generation to improve robustness and generalization across different medical imaging domains.


Renovating Names in Open-Vocabulary Segmentation Benchmarks

http://arxiv.org/abs/2403.09593v1

Compressor summary: The paper proposes a framework called RENOVATE that generates more precise names for visual segments in open-vocabulary segmentation datasets, which improves performance of segmentation models.


Iterative Forgetting: Online Data Stream Regression Using Database-Inspired Adaptive Granulation

http://arxiv.org/abs/2403.09588v1

Compressor summary: The text introduces a database-inspired data stream regression model that uses R*-trees to create and forget granules for low-latency and accurate real-time predictions in time-sensitive systems.


Algorithmic syntactic causal identification

http://arxiv.org/abs/2403.09580v1

Compressor summary: The authors propose a new approach to causal identification using symmetric monoidal categories that allows for applications in settings where classical probability theory fails, such as relational databases and machine learning algorithms.


The NeRFect Match: Exploring NeRF Features for Visual Localization

http://arxiv.org/abs/2403.09577v1

Compressor summary: The paper presents NeRFMatch, a 2D-3D matching method that leverages the internal knowledge of NeRF learned through view synthesis for visual localization tasks, achieving state-of-the-art results on Cambridge Landmarks.


Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

http://arxiv.org/abs/2403.09572v1

Compressor summary: ECSO is a novel training-free approach that enhances the safety of multimodal large language models by transforming unsafe images into texts, improving their performance on safety benchmarks.


Multi-Fidelity Bayesian Optimization With Across-Task Transferable Max-Value Entropy Search

http://arxiv.org/abs/2403.09570v1

Compressor summary: The paper introduces a new acquisition function for multi-fidelity black-box optimization that balances information accrual for the current task and future tasks, using shared latent variables and particle-based variational Bayesian updates.


Self-Consistency Training for Hamiltonian Prediction

http://arxiv.org/abs/2403.09560v1

Compressor summary: The authors propose an exact training method for Hamiltonian prediction that doesn't need labeled data, allowing it to leverage unlabeled data for better generalization and efficiency in molecular science problems.


Less is More: Data Value Estimation for Visual Instruction Tuning

http://arxiv.org/abs/2403.09559v1

Compressor summary: The paper proposes TIVE, a data selection approach that eliminates redundancy within visual instruction datasets for multimodal large language models, achieving comparable or better performance with significantly less data.


Cloud gap-filling with deep learning for improved grassland monitoring

http://arxiv.org/abs/2403.09554v1

Compressor summary: Key points:
- The text proposes a deep learning method to generate continuous NDVI time series from optical and SAR data
- The method improves event detection tasks by providing a continuous time series and filtering out noise from cloudy observations
- The method is tested on grassland mowing events in Lithuania and outperforms alternative interpolation techniques
Summary: The authors present a deep learning approach that combines optical and SAR data to generate continuous NDVI time series, which enhances the detection of grassland mowing events in cloudy regions.


WeakSurg: Weakly supervised surgical instrument segmentation using temporal equivariance and semantic continuity

http://arxiv.org/abs/2403.09551v1

Compressor summary: WeakSurg is a weakly supervised method for surgical instrument segmentation using only instrument presence labels and temporal information, which outperforms existing methods on both semantic and instance segmentation metrics.


Generalizing Denoising to Non-Equilibrium Structures Improves Equivariant Force Fields

http://arxiv.org/abs/2403.09549v1

Compressor summary: The paper proposes denoising non-equilibrium structures as an auxiliary task to improve neural network training for simulating 3D atomistic systems, using forces as additional input to handle ill-posed problems.


Breast Cancer Classification Using Gradient Boosting Algorithms Focusing on Reducing the False Negative and SHAP for Explainability

http://arxiv.org/abs/2403.09548v1

Compressor summary: This study uses four boosting algorithms to predict and diagnose breast cancer, focusing on the recall metric, and improves their performance using Optuna and SHAP methods.


Explorations in Texture Learning

http://arxiv.org/abs/2403.09543v1

Compressor summary: The text explores how object classification models learn and rely on textures, and reveals interesting associations between texture and object classes that can help with interpretability and bias detection.


Logits of API-Protected LLMs Leak Proprietary Information

http://arxiv.org/abs/2403.09539v1

Compressor summary: The text shows how to learn non-public information about large language models from a few API queries using a model image or signature that exploits their softmax bottleneck.


VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

http://arxiv.org/abs/2403.09530v1

Compressor summary: VisionGPT-3D is a versatile framework that combines state-of-the-art vision models and algorithms to automate 3D vision tasks from text prompts.


MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation

http://arxiv.org/abs/2403.09522v1

Compressor summary: MT-Patcher is a framework that transfers selective and diverse knowledge from large language models to medium-sized machine translation models, improving their performance on both specific and general tasks.


Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information

http://arxiv.org/abs/2403.09516v1

Compressor summary: DAFair is a novel method to reduce social biases in language models without using explicit demographic labels, by leveraging prototypical demographic texts and regularization during fine-tuning.


Trust AI Regulation? Discerning users are vital to build trust and effective AI regulation

http://arxiv.org/abs/2403.09510v1

Compressor summary: The authors propose using evolutionary game theory to model how regulations can incentivize trustworthy AI development and user trust, and suggest two mechanisms for effective regulation.


SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition

http://arxiv.org/abs/2403.09508v1

Compressor summary: SkateFormer is a novel transformer-based method for skeleton-based action recognition that partitions joints and frames based on different types of skeletal-temporal relations and performs attention within each partition to improve efficiency and accuracy.


Don't Judge by the Look: A Motion Coherent Augmentation for Video Recognition

http://arxiv.org/abs/2403.09506v1

Compressor summary: The study proposes Motion Coherent Augmentation (MCA), a data augmentation method for video recognition that introduces appearance variation and encourages models to focus on motion patterns instead of static appearances.


EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning

http://arxiv.org/abs/2403.09502v1

Compressor summary: EquiAV is a novel framework that leverages equivariance for audio-visual contrastive learning, enabling robust supervision with minimal computational overhead and improving performance on various benchmarks.


Faceptor: A Generalist Model for Face Perception

http://arxiv.org/abs/2403.09500v1

Compressor summary: The paper presents Faceptor, a unified face perception model that can perform various tasks efficiently and adaptively by sharing structural design and using layer-attention.


A Reinforcement Learning Approach to Dairy Farm Battery Management using Q Learning

http://arxiv.org/abs/2403.09499v1

Compressor summary: The text discusses using AI to optimize battery management for renewable energy integration in dairy farming, reducing costs and demand, and proposes a Q-learning algorithm as a case study in Ireland.


Anomaly Detection by Adapting a pre-trained Vision Language Model

http://arxiv.org/abs/2403.09493v1

Compressor summary: CLIP-ADA is a framework that uses a learnable prompt and self-supervised learning to improve anomaly detection in industrial images, achieving state-of-the-art results.


On using Machine Learning Algorithms for Motorcycle Collision Detection

http://arxiv.org/abs/2403.09491v1

Compressor summary: The paper investigates using machine learning algorithms to reliably detect motorcycle collisions for activating passive safety systems, such as airbags and seat belts, to reduce severe injury and fatality risks in accidents.


Hyper-CL: Conditioning Sentence Representations with Hypernetworks

http://arxiv.org/abs/2403.09490v1

Compressor summary: Hyper-CL uses hypernetworks to adapt sentence embeddings based on different conditions, improving performance in semantic text similarity and knowledge graph completion tasks.


Rectifying Demonstration Shortcut in In-Context Learning

http://arxiv.org/abs/2403.09488v1

Compressor summary: In-Context Calibration helps large language models learn new input-label relationships from demonstrations instead of relying on pre-trained semantic priors.


SpikeReveal: Unlocking Temporal Sequences from Real Blurry Inputs with Spike Streams

http://arxiv.org/abs/2403.09486v1

Compressor summary: Key points:
- The paper proposes a self-supervised framework for spike-guided motion deblurring
- The framework exploits the theoretical relationships among spike streams, blurry images, and sharp sequences
- The framework uses knowledge distillation and re-blurring loss to generate high-quality deblurred images
- The framework outperforms existing methods in generalization ability and quality
Summary: The paper presents a self-supervised method for deblurring blurry images captured by spike cameras, which exploits the spike-related information and uses knowledge distillation and re-blurring loss to achieve high-quality results.


Clinical Reasoning over Tabular Data and Text with Bayesian Networks

http://arxiv.org/abs/2403.09481v1

Compressor summary: The paper explores how to combine Bayesian networks and neural networks for clinical reasoning using text data, using pneumonia diagnosis as an example.


Laying the Foundation First? Investigating the Generalization from Atomic Skills to Complex Reasoning Tasks

http://arxiv.org/abs/2403.09479v1

Compressor summary: The paper proposes a probing framework and hierarchical curriculum learning to improve language models' generalization of atomic skills to complex reasoning tasks, which can be effective in different domains.


Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision

http://arxiv.org/abs/2403.09472v1

Compressor summary: The paper proposes a method to improve AI alignment by using reward models trained on easier tasks to evaluate and improve policy models on harder tasks, enabling AI systems to surpass human capabilities.


MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models

http://arxiv.org/abs/2403.09471v1

Compressor summary: MambaTalk uses a two-stage modeling strategy with discrete motion priors to create high-quality, diverse, and rhythmic gesture sequences for human-computer interaction applications, outperforming current methods.


Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing

http://arxiv.org/abs/2403.09468v1

Compressor summary: The authors propose a novel diffusion inversion technique for text-guided image editing that allows flexible control over the editing extent and achieves superior results compared to existing methods.


Machine learning for structural design models of continuous beam systems via influence zones

http://arxiv.org/abs/2403.09454v1

Compressor summary: The authors develop a machine learning model that predicts cross-section requirements for continuous beams based on a new influence zone concept, achieving high accuracy and generalization.


M&M: Multimodal-Multitask Model Integrating Audiovisual Cues in Cognitive Load Assessment

http://arxiv.org/abs/2403.09451v1

Compressor summary: The paper presents an M&M model that combines audiovisual cues using a dual-pathway architecture and cross-modality multihead attention for cognitive load assessment, with three branches tailored to specific labels.


Shake to Leak: Fine-tuning Diffusion Models Can Amplify the Generative Privacy Risk

http://arxiv.org/abs/2403.09450v1

Compressor summary: Shake-to-Leak (S2L) is a new privacy risk in diffusion models that can be exploited by fine-tuning them with manipulated data, potentially leaking sensitive information.


3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation

http://arxiv.org/abs/2403.09439v1

Compressor summary: The paper proposes a method to improve 3D scene generation by refining local views using global information and generating new contents with higher quality and better 3D consistency.


Improving Real-Time Omnidirectional 3D Multi-Person Human Pose Estimation with People Matching and Unsupervised 2D-3D Lifting

http://arxiv.org/abs/2403.09437v1

Compressor summary: Key points:
- The paper presents a real-time 3D multi-person pose estimation system using a panoramic camera and radar sensors
- It introduces several contributions, such as calibrations and matching methods for image and radar space
- It uses a lightweight 2D-3D pose lifting algorithm to overcome depth and scale ambiguity problems
- It achieves a high frame rate on a laptop with a commercial-grade GPU
Summary: The paper introduces a real-time system that estimates 3D poses of multiple people using a panoramic camera and radar sensors, with several contributions to handle occlusion and ambiguity issues.


Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization

http://arxiv.org/abs/2403.09433v1

Compressor summary: The paper proposes MIC, a framework for open-vocabulary object detection that improves generalization to novel classes using meta prompts and instance contrastive learning, without relying on extra data or complex training processes.


Reconstruction and Simulation of Elastic Objects with Spring-Mass 3D Gaussians

http://arxiv.org/abs/2403.09434v1

Compressor summary: Spring-Gaus is a framework that combines 3D Gaussians with physics-based simulation to reconstruct and simulate elastic objects from multi-view videos, achieving accurate results and better performance.


Efficient Transferability Assessment for Selection of Pre-trained Detectors

http://arxiv.org/abs/2403.09432v1

Compressor summary: The paper proposes an efficient way to evaluate how well pre-trained object detectors can be transferred to different domains using a benchmark, a unified framework, and a new metric.


Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity

http://arxiv.org/abs/2403.09428v1

Compressor summary: This paper proposes a novel retrieval-augmented in-context learning approach to improve multimodal machine learning with missing modalities and limited data in healthcare applications, outperforming existing methods.


Mitigating attribute amplification in counterfactual image generation

http://arxiv.org/abs/2403.09422v1

Compressor summary: The paper proposes soft counterfactual fine-tuning to reduce attribute amplification, a bias in causal generative modelling for medical imaging that arises from training with hard labels.


RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes

http://arxiv.org/abs/2403.09419v1

Compressor summary: RoDUS is a pipeline that uses NeRF models to separate static and dynamic elements in urban scenes, improving accuracy and reducing artifacts with 4D semantic information and robust kernel-based initialization.


User Identification via Free Roaming Eye Tracking Data

http://arxiv.org/abs/2403.09415v1

Compressor summary: Key points:
- New dataset of free roaming (FR) and targeted roaming (TR) eye movements
- User identification using an RBFN classifier
- Highest accuracies are 87.3% for FR and 89.4% for TR
- Compared to 95.3% in a laboratory setting
- First study in a non-laboratory setting
- Minimum duration of each recording is 263s for FR and 154s for TR
- Accuracies decrease if cutting from the beginning or end of trajectories
- Impact of higher-order velocity derivatives investigated
Summary: The authors present a new dataset of eye movements in free roaming and targeted roaming scenarios, and use an RBFN classifier to identify users with accuracies ranging from 87.3% to 89.4%. They compare their results with a laboratory setting and explore the effects of recording duration and higher-order velocity derivatives.


Relaxing Accurate Initialization Constraint for 3D Gaussian Splatting

http://arxiv.org/abs/2403.09413v1

Compressor summary: The paper introduces RAIN-GS, a method that improves 3D Gaussian splatting by relaxing the accurate initialization constraint for training with random point clouds.


OpenGraph: Open-Vocabulary Hierarchical 3D Graph Representation in Large-Scale Outdoor Environments

http://arxiv.org/abs/2403.09412v1

Compressor summary: OpenGraph is an open-vocabulary hierarchical graph representation for large-scale outdoor environments, enabling seamless interaction between robots and humans using visual and textual reasoning.


XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via Concept-guided Context Optimization

http://arxiv.org/abs/2403.09410v1

Compressor summary: The text proposes a new explainable prompt learning framework for vision-language models that leverages medical knowledge, aligns image semantics, and provides visual and textual explanations, improving diagnostic performance, flexibility, and interpretability.


Heuristic Reasoning in AI: Instrumental Use and Mimetic Absorption

http://arxiv.org/abs/2403.09404v1

Compressor summary: The text proposes a new approach to AI reasoning that explores the trade-offs between accuracy and effort, and how AIs balance logical processing and heuristics, similar to human cognitive processes.


Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning

http://arxiv.org/abs/2403.09401v1

Compressor summary: Key points:
- The paper proposes a model for unsupervised highlight detection in videos using visual and audio features
- The model learns from image-audio pairs via self-reconstruction and uses contrastive learning to learn significant activations and representations
- The model outperforms other methods in detecting highlights
Summary: The paper presents a novel unsupervised method for finding highlights in videos using both visual and audio information, learned from image-audio pairs with contrastive learning.


ConDiSR: Contrastive Disentanglement and Style Regularization for Single Domain Generalization

http://arxiv.org/abs/2403.09400v1

Compressor summary: The paper proposes a novel medical image classification method that uses channel-wise contrastive disentanglement and style regularization to handle distribution shifts, and shows improved accuracy and stability compared to existing methods.


GiT: Towards Generalist Vision Transformer through Universal Language Interface

http://arxiv.org/abs/2403.09394v1

Compressor summary: The paper introduces GiT, a simple framework that uses a vanilla ViT to unify various vision tasks with a universal language interface, achieving strong performance and zero-shot results.


Impact of Synthetic Images on Morphing Attack Detection Using a Siamese Network

http://arxiv.org/abs/2403.09380v1

Compressor summary: The paper assessed how synthetic images affect Morphing Attack Detection using Siamese networks, finding that MAD performs better when trained on real and synthetic images mixed together.


Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks

http://arxiv.org/abs/2403.09377v1

Compressor summary: The paper proposes routing functions to improve vision-language alignment in low-rank bottlenecks for fine-tuning pre-trained models with multiple modalities, achieving significant performance gains on various tasks.


DF4LCZ: A SAM-Empowered Data Fusion Framework for Scene-Level Local Climate Zone Classification

http://arxiv.org/abs/2403.09367v1

Compressor summary: The paper proposes a new method to classify local climate zones using data fusion from Google and Sentinel-2 imagery, enhanced by graph convolutional networks.


Sentinel-Guided Zero-Shot Learning: A Collaborative Paradigm without Real Data Exposure

http://arxiv.org/abs/2403.09363v1

Compressor summary: SG-ZSL is a privacy-preserving AI collaboration method that uses teacher models and a generator to guide student models without sharing data or models, achieving good performance in zero-shot learning tasks.