This page contains one-sentence summaries of cs.AI/ML/CV/CL papers announced on 2024-03-15, generated by the compressor, my personal LLM-based summarization project.
http://arxiv.org/abs/2403.09639v1
Compressor summary: GroupContrast is a novel self-supervised method for 3D representation learning that combines segment grouping and semantic-aware contrastive learning to address the "semantic conflict" problem.
http://arxiv.org/abs/2403.09638v1
Compressor summary: SCP-Diff is a new method for semantic image synthesis that uses specific noise priors to overcome issues with current approaches, achieving high quality results.
http://arxiv.org/abs/2403.09636v1
Compressor summary: DMC is a method for compressing the key-value cache in LLMs, improving efficiency and enabling longer contexts and larger batches without sacrificing performance.
http://arxiv.org/abs/2403.09635v1
Compressor summary: The authors develop a theory to understand and improve deep transformer models, enabling them to achieve better performance on various tasks.
http://arxiv.org/abs/2403.09634v1
Compressor summary: OneTracker is a general framework for visual object tracking that unifies various tracking tasks by first pre-training an RGB tracker, the Foundation Tracker, and then fine-tuning it with modality-specific information.
http://arxiv.org/abs/2403.09632v1
Compressor summary: Holo-Relighting is a novel volumetric relighting method that synthesizes new viewpoints and lighting from a single portrait image using a pretrained 3D GAN and can generate complex non-Lambertian lighting effects without physical lighting priors.
http://arxiv.org/abs/2403.09631v1
Compressor summary: The paper proposes a new model called 3D-VLA that integrates 3D perception, reasoning, and action, enabling better planning and multimodal generation for embodied AI tasks.
http://arxiv.org/abs/2403.09630v1
Compressor summary: The paper presents GenAD, a large-scale video prediction model for autonomous driving that uses web data and text descriptions, outperforming existing models and having potential for real-world applications.
http://arxiv.org/abs/2403.09629v1
Compressor summary: Quiet-STaR is a method for improving language models by teaching them to generate rationales for their predictions, which helps them answer difficult questions without fine-tuning.
http://arxiv.org/abs/2403.09626v1
Compressor summary: The paper explores the potential of Mamba, a state space model, for various video understanding tasks, showing its efficiency and effectiveness compared to Transformers.
http://arxiv.org/abs/2403.09625v1
Compressor summary: Make-Your-3D is a new method that creates personalized, realistic 3D content from a single image of a person and their description, using a co-evolution framework to align multiple models.
http://arxiv.org/abs/2403.09622v1
Compressor summary: The authors propose Glyph-ByT5, a text encoder that improves visual text rendering by fine-tuning ByT5 with a paired glyph-text dataset, and integrate it with SDXL to create Glyph-SDXL, which achieves high text-rendering accuracy on both design images and real images.
http://arxiv.org/abs/2403.09621v1
Compressor summary: The paper proposes optimal algorithms for function approximation in distributionally robust offline reinforcement learning with linearly parameterized models, and analyzes their instance-dependent suboptimality in comparison to standard offline RL.
http://arxiv.org/abs/2403.09620v1
Compressor summary: The paper presents PosSAM, an end-to-end open-vocabulary panoptic segmentation model that combines SAM's spatial features with CLIP's semantic features and introduces LDP and MASE techniques to improve performance.
http://arxiv.org/abs/2403.09616v1
Compressor summary: This paper explores using latent diffusion models (LDM) for in-context segmentation, proposes new meta-architectures and strategies, and builds a benchmark with image and video datasets.
http://arxiv.org/abs/2403.09613v1
Compressor summary: The text studies how large language models fine-tuned sequentially on a cyclical sequence of documents can recover from forgetting and exhibit anticipatory behavior.
http://arxiv.org/abs/2403.09611v1
Compressor summary: This paper investigates the key components of multimodal language models and shows how to build high-performing ones using different data and model architectures.
http://arxiv.org/abs/2403.09606v1
Compressor summary: Causal inference can enhance NLP models' accuracy, fairness, robustness, and explainability by capturing causal relationships among variables, and LLMs can contribute to causal inference and improve various NLP domains with their advanced reasoning capabilities.
http://arxiv.org/abs/2403.09605v1
Compressor summary: CF-SimCLR is a new contrastive pretraining method that uses counterfactual image generation to improve robustness and generalization across different medical imaging domains.
http://arxiv.org/abs/2403.09593v1
Compressor summary: The paper proposes a framework called RENOVATE that generates more precise names for visual segments in open-vocabulary segmentation datasets, which improves performance of segmentation models.
http://arxiv.org/abs/2403.09588v1
Compressor summary: The text introduces a database-inspired data stream regression model that uses R*-trees to create and forget granules for low-latency and accurate real-time predictions in time-sensitive systems.
http://arxiv.org/abs/2403.09580v1
Compressor summary: The authors propose a new approach to causal identification using symmetric monoidal categories that allows for applications in settings where classical probability theory fails, such as relational databases and machine learning algorithms.
http://arxiv.org/abs/2403.09577v1
Compressor summary: The paper presents NeRFMatch, a 2D-3D matching method that leverages the internal knowledge of NeRF learned through view synthesis for visual localization tasks, achieving state-of-the-art results on Cambridge Landmarks.
http://arxiv.org/abs/2403.09572v1
Compressor summary: ECSO is a novel training-free approach that enhances the safety of multimodal large language models by transforming unsafe images into texts, improving their performance on safety benchmarks.
http://arxiv.org/abs/2403.09570v1
Compressor summary: The paper introduces a new acquisition function for multi-fidelity black-box optimization that balances information accrual for the current task and future tasks, using shared latent variables and particle-based variational Bayesian updates.
http://arxiv.org/abs/2403.09560v1
Compressor summary: The authors propose an exact training method for Hamiltonian prediction that doesn't need labeled data, allowing it to leverage unlabeled data for better generalization and efficiency in molecular science problems.
http://arxiv.org/abs/2403.09559v1
Compressor summary: The paper proposes TIVE, a data selection approach that eliminates redundancy within visual instruction datasets for multimodal large language models, achieving comparable or better performance with significantly less data.
http://arxiv.org/abs/2403.09554v1
Compressor summary: The authors present a deep learning approach that fuses optical and SAR data to generate continuous NDVI time series, filtering out noise from cloudy observations to improve event detection, and show that it outperforms alternative interpolation techniques on grassland mowing events in Lithuania.
http://arxiv.org/abs/2403.09551v1
Compressor summary: WeakSurg is a weakly supervised method for surgical instrument segmentation using only instrument presence labels and temporal information, which outperforms existing methods on both semantic and instance segmentation metrics.
http://arxiv.org/abs/2403.09549v1
Compressor summary: The paper proposes denoising non-equilibrium structures as an auxiliary task to improve neural network training for simulating 3D atomistic systems, using forces as additional input to handle ill-posed problems.
http://arxiv.org/abs/2403.09548v1
Compressor summary: This study uses four boosting algorithms to predict and diagnose breast cancer with a focus on the recall metric, tuning their performance with Optuna and interpreting them with SHAP.
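For readers unfamiliar with that tuning-plus-explanation workflow, here is a minimal, hypothetical sketch (not the paper's code) that tunes a gradient-boosting classifier for recall with Optuna and then inspects it with SHAP on the scikit-learn breast cancer dataset; the search space and trial count are illustrative.

```python
import optuna
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    # Search a small hyperparameter space, scoring by recall as the paper emphasizes.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 6),
    }
    model = GradientBoostingClassifier(**params, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="recall").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)

# Refit the best model and explain its predictions with SHAP values.
best = GradientBoostingClassifier(**study.best_params, random_state=0).fit(X, y)
explainer = shap.TreeExplainer(best)
shap_values = explainer.shap_values(X)
```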
http://arxiv.org/abs/2403.09543v1
Compressor summary: The text explores how object classification models learn and rely on textures, and reveals interesting associations between texture and object classes that can help with interpretability and bias detection.
http://arxiv.org/abs/2403.09539v1
Compressor summary: The text shows how to recover non-public information about large language models from a few API queries by exploiting their softmax bottleneck to construct a model image, or signature.
http://arxiv.org/abs/2403.09530v1
Compressor summary: VisionGPT-3D is a versatile framework that combines state-of-the-art vision models and algorithms to automate 3D vision tasks from text prompts.
http://arxiv.org/abs/2403.09522v1
Compressor summary: MT-Patcher is a framework that transfers selective and diverse knowledge from large language models to medium-sized machine translation models, improving their performance on both specific and general tasks.
http://arxiv.org/abs/2403.09516v1
Compressor summary: DAFair is a novel method to reduce social biases in language models without using explicit demographic labels, by leveraging prototypical demographic texts and regularization during fine-tuning.
http://arxiv.org/abs/2403.09510v1
Compressor summary: The authors propose using evolutionary game theory to model how regulations can incentivize trustworthy AI development and user trust, and suggest two mechanisms for effective regulation.
http://arxiv.org/abs/2403.09508v1
Compressor summary: SkateFormer is a novel transformer-based method for skeleton-based action recognition that partitions joints and frames based on different types of skeletal-temporal relations and performs attention within each partition to improve efficiency and accuracy.
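To make the partition-attention idea concrete, here is a small, hypothetical PyTorch sketch (not SkateFormer's actual partitioning scheme): skeleton tokens are split into frame chunks and joint groups, and attention runs only inside each partition, which keeps the cost far below full attention over all frames and joints.

```python
import torch
import torch.nn as nn

# Illustrative shapes: B batches, T frames, J joints, C channels.
B, T, J, C = 2, 64, 25, 64
x = torch.randn(B, T, J, C)

attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)

# Hypothetical partitions: 16-frame chunks and two joint groups (e.g. upper/lower body).
frame_chunk = 16
joint_groups = [range(0, 13), range(13, 25)]

out = x.clone()
for t0 in range(0, T, frame_chunk):
    for joints in joint_groups:
        idx = list(joints)
        # Flatten one (frame-chunk, joint-group) partition into a token sequence.
        tokens = x[:, t0:t0 + frame_chunk, idx, :].reshape(B, -1, C)
        y, _ = attn(tokens, tokens, tokens)  # attention only within this partition
        out[:, t0:t0 + frame_chunk, idx, :] = y.reshape(B, frame_chunk, len(idx), C)
```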
http://arxiv.org/abs/2403.09506v1
Compressor summary: The study proposes Motion Coherent Augmentation (MCA), a data augmentation method for video recognition that introduces appearance variation and encourages models to focus on motion patterns instead of static appearances.
http://arxiv.org/abs/2403.09502v1
Compressor summary: EquiAV is a novel framework that leverages equivariance for audio-visual contrastive learning, enabling robust supervision with minimal computational overhead and improving performance on various benchmarks.
http://arxiv.org/abs/2403.09500v1
Compressor summary: The paper presents Naive Faceptor, a unified face perception model that can perform various tasks efficiently and adaptively by sharing structural design and using layer-attention.
http://arxiv.org/abs/2403.09499v1
Compressor summary: The text discusses using AI to optimize battery management for renewable energy integration in dairy farming, reducing costs and demand, and proposes a Q-learning algorithm as a case study in Ireland.
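As a rough illustration of the kind of controller such a case study might use, here is a tabular Q-learning sketch with entirely hypothetical state, action, and reward definitions (not the paper's formulation): the agent learns when to charge or discharge a farm battery from price-like reward feedback.

```python
import numpy as np

# Hypothetical discretization: 24 hourly states; actions 0 = idle, 1 = charge, 2 = discharge.
n_states, n_actions = 24, 3
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

def step(state, action):
    # Placeholder environment: reward is a made-up cost saving per action; a real study
    # would compute it from electricity prices, demand, and battery state of charge.
    reward = rng.normal(loc=[0.0, -0.5, 1.0][action], scale=0.1)
    return (state + 1) % n_states, reward

state = 0
for episode in range(5000):
    # Epsilon-greedy action selection.
    action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
    next_state, reward = step(state, action)
    # Standard Q-learning update.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state
```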
http://arxiv.org/abs/2403.09493v1
Compressor summary: CLIP-ADA is a framework that uses a learnable prompt and self-supervised learning to improve anomaly detection in industrial images, achieving state-of-the-art results.
http://arxiv.org/abs/2403.09491v1
Compressor summary: The paper investigates using machine learning algorithms to reliably detect motorcycle collisions for activating passive safety systems, such as airbags and seat belts, to reduce severe injury and fatality risks in accidents.
http://arxiv.org/abs/2403.09490v1
Compressor summary: Hyper-CL uses hypernetworks to adapt sentence embeddings based on different conditions, improving performance in semantic text similarity and knowledge graph completion tasks.
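A minimal PyTorch sketch of the general hypernetwork idea (an illustrative toy with made-up dimensions, not Hyper-CL's actual architecture): a small network maps a condition embedding to the weights of a projection that is then applied to the sentence embedding, yielding condition-specific representations.

```python
import torch
import torch.nn as nn

class ConditionalProjector(nn.Module):
    """Generate a condition-specific linear projection for sentence embeddings."""

    def __init__(self, sent_dim=768, cond_dim=128, out_dim=256):
        super().__init__()
        self.out_dim = out_dim
        # Hypernetwork: condition embedding -> flattened projection matrix.
        self.hyper = nn.Linear(cond_dim, sent_dim * out_dim)

    def forward(self, sent_emb, cond_emb):
        # sent_emb: (batch, sent_dim), cond_emb: (batch, cond_dim)
        W = self.hyper(cond_emb).view(-1, sent_emb.size(-1), self.out_dim)
        # Project each sentence embedding with its own condition-specific matrix.
        return torch.bmm(sent_emb.unsqueeze(1), W).squeeze(1)  # (batch, out_dim)

proj = ConditionalProjector()
z = proj(torch.randn(4, 768), torch.randn(4, 128))  # condition-aware embeddings
```

Contrastive training would then pull together embeddings of texts that are similar under a given condition and push apart those that are not.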
http://arxiv.org/abs/2403.09488v1
Compressor summary: In-Context Calibration helps large language models learn new input-label relationships from demonstrations instead of relying on pre-trained semantic priors.
http://arxiv.org/abs/2403.09486v1
Compressor summary: The paper presents a self-supervised framework for spike-guided motion deblurring that exploits the theoretical relationships among spike streams, blurry images, and sharp sequences, using knowledge distillation and a re-blurring loss to produce high-quality deblurred images and outperform existing methods in generalization ability.
http://arxiv.org/abs/2403.09481v1
Compressor summary: The paper explores how to combine Bayesian networks and neural networks for clinical reasoning using text data, using pneumonia diagnosis as an example.
http://arxiv.org/abs/2403.09479v1
Compressor summary: The paper proposes a probing framework and hierarchical curriculum learning to improve language models' generalization of atomic skills to complex reasoning tasks, which can be effective in different domains.
http://arxiv.org/abs/2403.09472v1
Compressor summary: The paper proposes a method to improve AI alignment by using reward models trained on easier tasks to evaluate and improve policy models on harder tasks, enabling AI systems to surpass human capabilities.
http://arxiv.org/abs/2403.09471v1
Compressor summary: MambaTalk uses a two-stage modeling strategy with discrete motion priors to create high-quality, diverse, and rhythmic gesture sequences for human-computer interaction applications, outperforming current methods.
http://arxiv.org/abs/2403.09468v1
Compressor summary: The authors propose a novel diffusion inversion technique for text-guided image editing that allows flexible control over the editing extent and achieves superior results compared to existing methods.
http://arxiv.org/abs/2403.09454v1
Compressor summary: The authors develop a machine learning model that predicts cross-section requirements for continuous beams based on a new influence zone concept, achieving high accuracy and generalization.
http://arxiv.org/abs/2403.09451v1
Compressor summary: The paper presents an M&M model that combines audiovisual cues using a dual-pathway architecture and cross-modality multihead attention for cognitive load assessment, with three branches tailored to specific labels.
http://arxiv.org/abs/2403.09450v1
Compressor summary: Shake-to-Leak (S2L) is a new privacy risk in diffusion models that can be exploited by fine-tuning them with manipulated data, potentially leaking sensitive information.
http://arxiv.org/abs/2403.09439v1
Compressor summary: The paper proposes a method to improve 3D scene generation by refining local views using global information and generating new contents with higher quality and better 3D consistency.
http://arxiv.org/abs/2403.09437v1
Compressor summary: The paper presents a real-time 3D multi-person pose estimation system that combines a panoramic camera with radar sensors, contributing calibration and matching methods between image and radar space and a lightweight 2D-to-3D pose-lifting algorithm to overcome depth and scale ambiguity, while running at a high frame rate on a laptop with a commercial-grade GPU.
http://arxiv.org/abs/2403.09433v1
Compressor summary: The paper proposes MIC, a framework for open-vocabulary object detection that improves generalization to novel classes using meta prompts and instance contrastive learning, without relying on extra data or complex training processes.
http://arxiv.org/abs/2403.09434v1
Compressor summary: Spring-Gaus is a framework that combines 3D Gaussians with physics-based simulation to reconstruct and simulate elastic objects from multi-view videos, achieving accurate results and better performance.
http://arxiv.org/abs/2403.09432v1
Compressor summary: The paper proposes an efficient way to evaluate how well pre-trained object detectors can be transferred to different domains using a benchmark, a unified framework, and a new metric.
http://arxiv.org/abs/2403.09428v1
Compressor summary: This paper proposes a novel retrieval-augmented in-context learning approach to improve multimodal machine learning with missing modalities and limited data in healthcare applications, outperforming existing methods.
http://arxiv.org/abs/2403.09422v1
Compressor summary: The paper proposes soft counterfactual fine-tuning to reduce attribute amplification, a bias issue in causal generative modelling for medical imaging using hard labels.
http://arxiv.org/abs/2403.09419v1
Compressor summary: RoDUS is a pipeline that uses NeRF models to separate static and dynamic elements in urban scenes, improving accuracy and reducing artifacts with 4D semantic information and robust kernel-based initialization.
http://arxiv.org/abs/2403.09415v1
Compressor summary: The authors present a new dataset of free-roaming (FR) and targeted-roaming (TR) eye movements, the first collected outside a laboratory setting, and identify users with an RBFN classifier at accuracies of 87.3% for FR and 89.4% for TR, compared to 95.3% in a laboratory setting; each recording lasts at least 263 s (FR) or 154 s (TR), accuracy decreases when trajectories are cut from the beginning or end, and the impact of higher-order velocity derivatives is also investigated.
http://arxiv.org/abs/2403.09413v1
Compressor summary: The paper introduces RAIN-GS, a method that improves 3D Gaussian splatting by relaxing the accurate initialization constraint for training with random point clouds.
http://arxiv.org/abs/2403.09412v1
Compressor summary: OpenGraph is a representation of open-vocabulary hierarchical graph structure for large-scale outdoor environments, enabling seamless interaction between robots and humans using visual and textual reasoning.
http://arxiv.org/abs/2403.09410v1
Compressor summary: The text proposes a new explainable prompt learning framework for vision-language models that leverages medical knowledge, aligns image semantics, and provides visual and textual explanations, improving diagnostic performance, flexibility, and interpretability.
http://arxiv.org/abs/2403.09404v1
Compressor summary: The text proposes a new approach to AI reasoning that explores the trade-offs between accuracy and effort, and how AIs balance logical processing and heuristics, similar to human cognitive processes.
http://arxiv.org/abs/2403.09401v1
Compressor summary: The paper presents an unsupervised model for detecting highlights in videos using both visual and audio features, learning from image-audio pairs via self-reconstruction and contrastive learning to capture significant activations and representations, and outperforming other methods.
http://arxiv.org/abs/2403.09400v1
Compressor summary: The paper proposes a novel medical image classification method that uses channel-wise contrastive disentanglement and style regularization to handle distribution shifts, and shows improved accuracy and stability compared to existing methods.
http://arxiv.org/abs/2403.09394v1
Compressor summary: The paper introduces GiT, a simple framework that uses a vanilla ViT to unify various vision tasks with a universal language interface, achieving strong performance and zero-shot results.
http://arxiv.org/abs/2403.09380v1
Compressor summary: The paper assesses how synthetic images affect Morphing Attack Detection (MAD) using Siamese networks, finding that MAD performs better when trained on a mix of real and synthetic images.
http://arxiv.org/abs/2403.09377v1
Compressor summary: The paper proposes routing functions to improve vision-language alignment in low-rank bottlenecks for fine-tuning pre-trained models with multiple modalities, achieving significant performance gains on various tasks.
http://arxiv.org/abs/2403.09367v1
Compressor summary: The paper proposes a new method to classify local climate zones using data fusion from Google and Sentinel-2 imagery, enhanced by graph convolutional networks.
http://arxiv.org/abs/2403.09363v1
Compressor summary: SG-ZSL is a privacy-preserving AI collaboration method that uses teacher models and a generator to guide student models without sharing data or models, achieving good performance in zero-shot learning tasks.